How to Design On-Call

A guide to good on-call that does not kill your team

You can read quotes like this across the cloud industry:

“Some teams have had dramatically different pager volumes for years at my company. There are teams which people like to work on thanks to their great mission, but those same teams have a noisy oncall. There is no way I would move to those teams, where moving would come with so much unpaid overtime at unsociable hours.” From a publicly traded tech company valued at $13B.

Making on-call work well across various teams isn't easy. You need strong support from senior executives, and the team manager needs to know where to push for hiring and how to prioritize the team's work so that toil stays manageable.

We don't cover the details of incident response for engineers on-call. Please read the excellent documentation from PagerDuty for a more detailed guide on working during an incident 1.

How to Introduce On-Call into Your Company?

Introducing on-call is a major culture change. If you have never done it before, it significantly impacts your personal life and brings a level of stress you didn’t have before. The company should discuss it with people upfront and include it in contracts where required. It isn’t ideal if on-call comes as a surprise or is sprung on someone after their probation period.

If you are a small startup, it’s good to start during working hours only. Focus on good observability and alerting with a proper paging workflow, and give ownership to the developers working on the service. This approach is called “You build it, you run it.” 2

I recommend taking your time with changes. You need to ensure that everyone who joins on-call fully understands what that extra work entails. Give them proper training using gamedays 3. I recommend having a training environment where everyone goes through a few simulated incidents before going live. A well-mastered process is more important than having the right solution — for solutions, you can always summon engineers from across the company. Communicating with all stakeholders and controlling what happens is critical to good incident management.

How Many People Do You Need For On-call?

“Using the 25% on-call rule, we can derive the minimum number of SREs required to sustain a 24/7 on-call rotation. Assuming that there are always two people on-call (primary and secondary, with different duties), the minimum number of engineers needed for on-call duty from a single-site team is eight: assuming week-long shifts, each engineer is on-call (primary or secondary) for one week every month. For dual-site teams, a reasonable minimum size of each team is six, both to honor the 25% rule and to ensure a substantial and critical mass of engineers for the team.” - SRE book 4

As the quote above explains, Google recommends a minimum of eight engineers for a single site, or six engineers per site in the case of two sites. My calculations show that you can get away with fewer engineers per site when the rotation is split across two or three locations, but you always have to account for vacations, sick days, business travel, and so on.
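To make the 25% rule concrete, here is a minimal Python sketch (my own illustration, not from the SRE book) that computes what share of weeks each engineer holds an on-call role, assuming week-long shifts and one primary plus one secondary staffed at all times:

    # Share of time each engineer spends on-call (primary or secondary),
    # assuming week-long shifts and two roles staffed at all times.
    def oncall_share(team_size: int, roles: int = 2) -> float:
        """Fraction of weeks each engineer holds an on-call role."""
        return roles / team_size

    for team_size in range(4, 13):
        share = oncall_share(team_size)
        note = "  <- within the 25% rule" if share <= 0.25 else ""
        print(f"{team_size} engineers: on-call {share:.0%} of the time{note}")

With eight engineers the share is exactly 25%, which matches the book's single-site minimum; anything smaller pushes people over it before you even account for vacations and sick days.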

I remember running a full 24/7/365 on-call rotation with six people: it was exhausting and significantly impacted people's personal lives. As an SRE manager, I always try to achieve a good balance between work and time off for everyone in an on-call rotation. The burden of providing non-stop support can be a significant reason why people leave engineering teams.


  1. PagerDuty Incident Response process, a cut-down version of the internal documentation PagerDuty uses for major incidents and to prepare new employees for on-call responsibilities - https://response.pagerduty.com/ ↩︎

  2. “You build it, you run it” is a software development principle that emphasizes the responsibility of the development team in designing, building, and maintaining the systems they create. ↩︎

  3. Gamedays - a GameDay is a period of time (usually 1-4 hours) set aside for a team to run one or more experiments on a system, process, or service, observe the impact, then discuss the outcomes. ↩︎

  4. Betsy Beyer, Chris Jones, Jennifer Petoff, and Niall Richard Murphy. 2016. Site Reliability Engineering: How Google Runs Production Systems (1st. ed.). O’Reilly Media, Inc. - https://landing.google.com/sre/sre-book/chapters/being-on-call/ ↩︎

Roster

In on-call planning, a roster is the group of people who are part of a single on-call rotation. How many roster layers you have for escalations depends on your company's standards, industry requirements, and the composition of your team. The most common setup is a primary and a secondary on-call, where the secondary steps up if the primary is overwhelmed. Typically, engineers in both rosters have similar skills.
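As an illustration of the structure (the names here are my own, not from any particular pager tool), a roster can be modeled as an ordered list of layers, each with a pool of engineers rotating through it:

    # Illustrative model: a roster is an ordered list of layers, and
    # escalation walks them in order (primary first, then secondary).
    from dataclasses import dataclass

    @dataclass
    class Layer:
        name: str             # e.g. "primary" or "secondary"
        engineers: list[str]  # pool that rotates through this layer

    @dataclass
    class Roster:
        team: str
        layers: list[Layer]   # layers[0] is paged first

    roster = Roster(
        team="payments",
        layers=[
            Layer("primary", ["alice", "bob", "carol", "dan"]),
            Layer("secondary", ["alice", "bob", "carol", "dan"]),
        ],
    )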

High Alert Ratio

One of the most challenging problems of on-call design is controlling the number of times engineers are paged during the night. This is especially critical for single-site teams. A high number of pages can cause alarm fatigue. Having people answer pages at 2 AM only to realize they are responding to something that isn’t urgent or actionable can quickly damage team morale. You need to fine-tune alerting to ensure that anything that isn’t urgent or critical waits until the on-call engineer wakes up. Sometimes this may require increasing alarm thresholds. In other cases, it may require code changes to improve error handling. While this may seem like additional work, it helps ensure that on-call work is sustainable and protects your team from burnout.
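One way to implement this kind of urgency-based routing, sketched below in Python under the assumption that you can tag each alert with an urgency label, is to page only for genuinely urgent alerts and hold everything else for working hours:

    # Route alerts by urgency so non-critical issues wait for morning.
    # The labels and the 09:00-18:00 window are illustrative assumptions.
    from datetime import datetime

    def route_alert(urgency: str, now: datetime) -> str:
        working_hours = 9 <= now.hour < 18
        if urgency == "critical":
            return "page"    # always wake someone up
        if urgency == "high" and working_hours:
            return "page"    # urgent enough during the day
        return "ticket"      # lands in the morning queue

    print(route_alert("high", datetime(2024, 5, 3, 2, 15)))      # ticket
    print(route_alert("critical", datetime(2024, 5, 3, 2, 15)))  # page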

Primary vs secondary

When you have a primary and a secondary on-call, several design decisions shape how effective the arrangement is. The first question to answer for your team: should the same people rotate through both rosters? The most common approach is to rotate the same pool of engineers through both primary and secondary slots, offset by one week. This keeps the secondary familiar with the current state of the system and means everyone shares the on-call burden equally.
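A minimal sketch of that offset rotation (the engineer names are placeholders):

    # Same pool rotates through primary and secondary, offset by one week,
    # so nobody holds both roles in the same week.
    engineers = ["alice", "bob", "carol", "dan", "erin", "frank"]

    def week_assignment(week: int) -> tuple[str, str]:
        primary = engineers[week % len(engineers)]
        secondary = engineers[(week + 1) % len(engineers)]
        return primary, secondary

    for week in range(4):
        primary, secondary = week_assignment(week)
        print(f"week {week}: primary={primary}, secondary={secondary}")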

Daily vs Weekly

My example uses weekly shifts, though that doesn't mean I prefer them over daily shifts; neither is inherently better or worse. For a single-site team, I think daily shifts are better: you can spread weekend coverage across six people and handle the whole week that way. For 2-3 sites, weekly shifts are more common. A strategy called "follow the sun" (FTS 1) omits the night shift from on-call coverage by handing over between sites, which works with 2 or 3 sites. It produces many handovers, and you need to prepare those carefully.
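To illustrate follow-the-sun (the site names and handover times below are made up), each site covers roughly its own daytime, so no site takes a night shift:

    # Follow-the-sun with three sites, each covering ~8 hours of the day.
    sites = {
        "Berlin":    set(range(6, 14)),                      # 06:00-14:00 UTC
        "New York":  set(range(14, 22)),                     # 14:00-22:00 UTC
        "Singapore": set(range(22, 24)) | set(range(0, 6)),  # 22:00-06:00 UTC
    }

    def site_for_hour(hour_utc: int) -> str:
        for site, window in sites.items():
            if hour_utc in window:
                return site
        raise ValueError("coverage gap")

    print(site_for_hour(3))  # Singapore, where it is late morning locally

Every boundary between windows is a handover, which is why handover quality matters so much with this strategy.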

Working vs non-working hours

In my example, I calculate how many non-working hours people spend on-call. At least in the EU, you are required to pay extra for this 1, and it’s the right thing to do. That said, many people value not working on weekends more than the extra pay. Sleep deprivation is a serious problem, and as a manager, you need to think carefully about this. 2 I have a colleague whose company gives one day of vacation for each week of on-call instead of monetary compensation — and he prefers it that way.
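A back-of-the-envelope calculation (the rate and the 40-hour work week are assumptions for illustration) shows how quickly non-working on-call hours add up:

    # Non-working hours in one week of 24/7 on-call, plus a simple
    # standby-pay estimate. Adjust the rate to your local rules.
    HOURS_IN_WEEK = 7 * 24   # 168
    WORKING_HOURS = 5 * 8    # 40

    non_working = HOURS_IN_WEEK - WORKING_HOURS  # 128 hours
    standby_rate = 2.0       # currency units per standby hour, illustrative

    print(f"non-working on-call hours per week: {non_working}")
    print(f"standby pay for the week: {non_working * standby_rate:.2f}")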

Escalations & Metrics

The primary and secondary layers are for resolving and managing incidents. You need an escalation process beyond them for a few reasons:

- Interchangeability for any reason (the primary has problems accessing the internet, is traveling, overslept, etc.)
- Nobody responds on primary/secondary, and someone needs to address that (manager on-call)
- Problems that can't be resolved by the primary/secondary (security, legal, or an executive decision is needed)

Testing all escalations is always good, especially with executives who aren't paged often. It also helps if all stakeholders have enough information about current incidents so they don't have to step into the process and ask the incident commander for updates.
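A sketch of how such an escalation policy can be expressed (the layer names and timeouts are illustrative, not from any specific pager tool):

    # Walk the escalation layers in order until someone acknowledges.
    ESCALATION_POLICY = [
        ("primary", 5),           # wait up to 5 minutes for an ack
        ("secondary", 5),
        ("manager-on-call", 10),
    ]

    def escalate(acknowledged_by: set[str]) -> str:
        for layer, timeout_min in ESCALATION_POLICY:
            print(f"paging {layer}, waiting up to {timeout_min} min")
            if layer in acknowledged_by:
                return f"handled by {layer}"
        return "unhandled: trigger the major-incident process"

    print(escalate(acknowledged_by={"secondary"}))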

Weekly On-Call

For an overview of on-call, I recommend reading Chapter 14 on On-call 1. Now to my example:

- We have a team with a primary and a secondary roster.
- We want to find what is best for us using 1-3 sites.
- We are looking at how many people we need and how we calculate costs.
- We don't focus on daily shifts, which would make the example more complicated.
- We don't use combinations such as a daily primary with a weekly secondary, which would also complicate the example.
- We skip many possible combinations; don't take this as the only good way to do it. It's more about how to think through your own use case.

Some numbers first: calculate and consider what matters most to you, and whether you have enough people to reach at least this minimum under your conditions. My example uses two rosters, and we can choose between 1-3 sites.
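A rough comparison of the three options (the roster sizes, the even split of coverage across sites, and the use of follow-the-sun for 2-3 sites are my simplifying assumptions):

    # Primary + secondary, weekly shifts, 1-3 sites.
    HOURS_IN_WEEK = 168

    for sites, engineers_per_site in [(1, 8), (2, 6), (3, 4)]:
        covered = HOURS_IN_WEEK // sites   # hours each site covers weekly
        # two roles (primary + secondary) are staffed for every covered hour
        per_person_yearly = 2 * covered * 52 / engineers_per_site
        print(f"{sites} site(s), {engineers_per_site} engineers/site: "
              f"{covered} covered h/week/site, "
              f"~{per_person_yearly:.0f} on-call h/person/year")

The point is not the exact numbers but the shape of the trade-off: more sites mean fewer night hours per person, at the cost of more handovers and coordination.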

On-Call Market

An on-call market is a structured, team-managed process for swapping shifts. Without one, swaps happen ad hoc — through direct messages, verbal agreements, or changes made directly in the pager tool by someone other than the engineer on duty. These informal swaps are fragile: they get forgotten, they create coverage gaps, and they make it hard to audit who was actually responsible for an incident. A well-run on-call market solves this by making swaps visible, fair, and easy to complete safely.
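One way to make swaps explicit (the record layout below is my own; real pager tools expose their own swap or override mechanisms) is to treat every swap as a validated, logged change to the schedule:

    # A minimal swap that refuses to touch a shift that doesn't exist,
    # so a typo can't silently create a coverage gap.
    from dataclasses import dataclass

    @dataclass
    class Shift:
        engineer: str
        week: int
        layer: str  # "primary" or "secondary"

    def swap(schedule: list[Shift], week: int, layer: str,
             new_engineer: str) -> None:
        for shift in schedule:
            if shift.week == week and shift.layer == layer:
                print(f"week {week} {layer}: {shift.engineer} -> {new_engineer}")
                shift.engineer = new_engineer
                return
        raise ValueError("no such shift; swap rejected")

    schedule = [Shift("alice", 1, "primary"), Shift("bob", 1, "secondary")]
    swap(schedule, week=1, layer="primary", new_engineer="carol")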

Reporting

Over the years, I have found the following reports most valuable for on-call owners to prepare:

- Number of incidents per week per team (useful for a regular weekly review).
- Number of people who had incidents in the last month, quarter, or six months.
- Number of people who had incidents at night in the last month, quarter, or six months.
- Total hours people spent on-call in the last six months.

Postmortem tooling can provide another set of reports, but your pager software should give you the most insight into how people are doing. These reports help you spot outliers: you can identify when something isn't working or is trending in the wrong direction. Early on, you may have too many incidents; years later, you may face the opposite problem of too few, meaning engineers are losing practice. If some people only face incidents for part of the year, you should invest more in game days and training.
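Most of these reports are simple aggregations over an incident export (the record layout below is made up; pull the equivalent fields from your pager tool):

    # Derive the reports above from a raw incident log.
    from collections import Counter
    from datetime import datetime

    incidents = [
        {"team": "payments", "responder": "alice", "at": datetime(2024, 5, 3, 2, 15)},
        {"team": "payments", "responder": "bob",   "at": datetime(2024, 5, 6, 14, 0)},
        {"team": "search",   "responder": "carol", "at": datetime(2024, 5, 7, 23, 40)},
    ]

    per_team = Counter(i["team"] for i in incidents)
    responders = {i["responder"] for i in incidents}
    night = {i["responder"] for i in incidents
             if i["at"].hour < 6 or i["at"].hour >= 22}

    print("incidents per team:", dict(per_team))
    print("people who had incidents:", sorted(responders))
    print("people paged at night:", sorted(night))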

On-Call Documentation

Writing documentation effectively can be achieved by following a framework such as Diátaxis, a way of thinking about and doing documentation. We will apply a similar approach and split documentation into the following parts:

- Tutorials
- How-to guides
- Runbooks / playbooks (Google SRE naming), including how to create an effective runbook
- Glossaries
- Explanation

Tutorials

What You Need for On-Call: a tutorial that helps new on-call members onboard into the process and the tools, and gain all necessary access for the on-call shift.