How to design OnCall | Prskavec.Net

Guide to good On-call that does not kill your team

You can read quotes like this one across the whole cloud industry:

“Some teams have had dramatically different pager volumes for years at my company. There are teams which people like to work on thanks to their great mission, but those same teams have a noisy oncall. There is no way I would move to those teams, where moving would come with so much unpaid overtime at unsociable hours.” From a publicly traded tech company valued at $13B.

Building good on-call across various teams isn’t an easy job. You need strong support from senior executives. The team manager needs to know where to push for hiring and how to set the team’s priorities so that toil stays manageable.

We don’t cover the details of how on-call engineers respond to an incident. Please read the excellent documentation from PagerDuty for a more detailed guide on working during an incident¹.

How to introduce On-Call into your company?

Introducing on-call is a major culture change for people. If you have never done it before, it significantly impacts your personal life and brings stress that you did not have before. The company should ask people first and, if it needs on-call, have this option in their contracts. It is not pleasant if on-call comes as a surprise or is imposed only after your probation period.

If you are a small startup, it is good to start early, with on-call during working hours. I would focus on good observability and alerting with a proper paging workflow, and give ownership to the developers working on the service. This approach is called “You build it, you run it”. ²

I recommend taking your time with the changes. You have to be sure that everyone who joins on-call fully understands what this extra work involves. Give them proper training using gamedays ³. I recommend having a training environment where everyone goes through a few incidents to be sure of what to do. For good on-call, mastering the process is more important than knowing the solution: for solutions, you can always summon engineers from across the company if you need them. Communicating with all stakeholders and staying in control of what happens is critical to good incident management.

How Many People Do You Need For On-call?

“Using the 25% on-call rule, we can derive the minimum number of SREs required to sustain a 24/7 on-call rotation. Assuming that there are always two people on-call (primary and secondary, with different duties), the minimum number of engineers needed for on-call duty from a single-site team is eight: assuming week-long shifts, each engineer is on-call (primary or secondary) for one week every month. For dual-site teams, a reasonable minimum size of each team is six, both to honor the 25% rule and to ensure a substantial and critical mass of engineers for the team.” - SRE book ⁴

As explained in the quote above, Google recommends a minimum of eight engineers for one site and six engineers per site in the case of two sites. My calculations show that you could have fewer engineers on-call if they are divided between two or three locations, but you always have to keep in mind vacations, sick days, business travel, etc.
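The single-site arithmetic behind the quoted 25% rule can be sketched in a few lines. This is only an illustration of the calculation, not a staffing tool: the function names are made up for this example, and it deliberately ignores vacations, sick days and travel, which push the real number higher.

```python
def min_team_size(roles_on_call: int, max_on_call_percent: int = 25) -> int:
    """Smallest team in which nobody exceeds the on-call cap.

    With `roles_on_call` people on duty at all times, each of n engineers
    is on call roles_on_call / n of the time, so n >= roles * 100 / cap.
    """
    # Integer ceiling division avoids floating-point rounding surprises.
    return -(-roles_on_call * 100 // max_on_call_percent)


def on_call_share(team_size: int, roles_on_call: int) -> float:
    """Fraction of time each engineer spends on call (primary or secondary)."""
    return roles_on_call / team_size


def rotation_length_weeks(team_size: int, roles_on_call: int) -> float:
    """With week-long shifts, an engineer is on call one week in this many."""
    return team_size / roles_on_call


if __name__ == "__main__":
    # Primary + secondary, single site, 25% cap (the case from the quote).
    print(min_team_size(roles_on_call=2))
    print(rotation_length_weeks(team_size=8, roles_on_call=2))
    # A smaller team with the same two roles exceeds the cap.
    print(round(on_call_share(team_size=6, roles_on_call=2), 2))
```

With two roles and a 25% cap this gives the eight engineers from the quote, each on call one week in four. The same two roles spread over six people would put everyone on call a third of the time.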

I remember running a full on-call rotation with six people covering 24/7/365. It was exhausting and significantly impacted people’s personal lives. As an SRE manager, I always try to achieve a good balance between work and time off for people who are part of an on-call rotation. The burden of providing non-stop support can be a significant reason why people leave engineering teams.

¹ PagerDuty Incident Response process. It is a cut-down version of our internal documentation used at PagerDuty for any major incidents and to prepare new employees for on-call responsibilities - https://response.pagerduty.com/
² “You build it, you run it” is a software development principle that emphasizes the responsibility of the development team in designing, building, and maintaining the systems they create.
³ Gamedays - So, what exactly is a GameDay? A GameDay is a period of time (usually 1-4 hours) set aside for a team to run one or more experiments on a system, process or service, observe the impact, then discuss the outcomes.
⁴ Betsy Beyer, Chris Jones, Jennifer Petoff, and Niall Richard Murphy. 2016. Site Reliability Engineering: How Google Runs Production Systems (1st ed.). O’Reilly Media, Inc. - https://landing.google.com/sre/sre-book/chapters/being-on-call/

Incident process tooling

We can split the tools into several categories: pagers, incident response, incident workflows, utilities and postmortems. More and more tools are trying to become universal and cover several of these categories. The question is always what suits you best and how you can combine the tools. You can find overviews of the tools, but we will focus more on the workflow and what you need. If you look at the incident process as a workflow, you see the essential parts that you need to cover with your tooling.

Daily Oncall Routine

To support the daily on-call routine, I recommend a few things. Make everything related to your on-call visible in one place. I recommend a dedicated Slack channel for on-call. If you collect all notifications and alerts in that channel, it is very easy to see that something is happening. For a major incident, it is best to have a separate dedicated channel for communication, so that people are not overwhelmed by automatic messages. Some tools support this out of the box.
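The channel split described above can be written down as a small routing rule. This is a minimal sketch under stated assumptions: the channel names, the `Message` type and the `route` function are hypothetical, and actually posting to Slack is left out on purpose.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical channel name; use whatever your workspace already has.
ONCALL_CHANNEL = "#oncall"


@dataclass(frozen=True)
class Message:
    text: str
    automated: bool  # True for alerts and tool notifications
    major_incident_id: Optional[str] = None  # set while a major incident is open


def incident_channel(incident_id: str) -> str:
    """Name of the dedicated channel for one major incident (assumed scheme)."""
    return f"#incident-{incident_id}"


def route(message: Message) -> str:
    """Pick the channel for a message.

    Automated messages always land in the shared on-call channel, so the
    incident channel stays readable for the people working on the incident.
    """
    if message.automated or message.major_incident_id is None:
        return ONCALL_CHANNEL
    return incident_channel(message.major_incident_id)


if __name__ == "__main__":
    print(route(Message("CPU high on api", automated=True)))
    print(route(Message("Rolling back the deploy", automated=False,
                        major_incident_id="42")))
```

The design choice is that the incident ID never pulls automated traffic into the incident channel. Alerts fired during the incident still show up in the on-call channel, where everyone can see that something is happening.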

Postmortems

Postmortems are necessary for learning and should be held whenever an incident occurs 1. You can decide whether every incident needs at least a short write-up or whether to skip low-severity ones. You must always make sure that nobody is pointing fingers and that the whole process is blameless.

Best Practice: Avoid Blame and Keep It Constructive 2

Blameless postmortems can be challenging to write, because the postmortem format clearly identifies the actions that led to the incident. Removing blame from a postmortem gives people the confidence to escalate issues without fear. It is also important not to stigmatize frequent production of postmortems by a person or team. An atmosphere of blame risks creating a culture in which incidents and issues are swept under the rug, leading to greater risk for the organization.

Summary

Designing on-call is a tough process because conditions constantly change. The number of people in a team changes quite often: you add people as you hire and remove people who leave the team, and that isn’t easy. We have not discussed the handover time between sites and what is recommended. If you chose Wednesday for the shift change, as in the SRE book, I recommend moving the weekly shift change to Sunday or Monday. We use Sunday for reporting purposes, and we have an incident review on Monday, but I don’t see this as a problem. Some recommend changing shifts on Wednesdays so that the handover does not overlap with development planning or other regular company meetings.
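To compare handover days, it can help to generate the shift changes for a few weeks and check them against your planning and review meetings. The sketch below assumes week-long shifts and a simple round-robin order. The function names and the engineer names are placeholders, not part of any tool.

```python
from datetime import date, timedelta
from typing import List, Tuple

# date.weekday() numbering: Monday is 0, Sunday is 6.
MONDAY, WEDNESDAY, SUNDAY = 0, 2, 6


def next_handover(start: date, handover_weekday: int) -> date:
    """First date on or after `start` that falls on the handover weekday."""
    days_ahead = (handover_weekday - start.weekday()) % 7
    return start + timedelta(days=days_ahead)


def weekly_rotation(engineers: List[str], start: date,
                    handover_weekday: int, weeks: int) -> List[Tuple[date, str]]:
    """Round-robin schedule: (handover date, engineer taking over) per week."""
    first = next_handover(start, handover_weekday)
    return [
        (first + timedelta(weeks=i), engineers[i % len(engineers)])
        for i in range(weeks)
    ]


if __name__ == "__main__":
    team = ["ana", "ben", "cy"]  # placeholder names
    for day, engineer in weekly_rotation(team, date(2024, 1, 3), MONDAY, 4):
        print(day.isoformat(), engineer)
```

Swapping `MONDAY` for `SUNDAY` or `WEDNESDAY` shows how each choice lines up with the rest of the team calendar.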

Tools table overview

| Category | Other functions | Name | URL | Notes | Company info | Base price |
|---|---|---|---|---|---|---|
| Pagers | Incident Workflows | PagerDuty | https://www.pagerduty.com/ | Long on the market, very good reliability, and expanding its tools into other categories. | PagerDuty Inc, NYSE: PD | $25/month/seat |
| Pagers | | OpsGenie | https://www.atlassian.com/software/opsgenie | | Atlassian Corporation, TEAM (NASDAQ) | $23/month/seat |
| Pagers | | Splunk On-Call | https://www.splunk.com/en_us/products/on-call.html | Former VictorOps | | N/A |
| Pagers | | Grafana OnCall | https://grafana.com/products/oncall/ | Incident Response & Management (IRM) | Grafana Labs | $20/month/seat |
| Pagers | | PagerTree | https://pagertree.com/ | | | $10/month/seat |
| Pagers | | Iris | https://iris.claims/ | OSS | LinkedIn | |
| Pagers | Incident Workflows | Rootly | https://rootly.com | Expanded function to pagers (2023) | $15 total, last round 2023 | N/A |
| Pagers | Incident Workflows | incident.io | https://incident.io/ | Expanded function to pagers (2023) | $34, last round 2022 | $25/month/seat |
| Incident Response | Incident Workflows | FireHydrant | https://firehydrant.com/ | | | $6000/year |
| Incident Response | | Jeli Slack bot | https://www.jeli.io/slack-app | | Acquired by PagerDuty 2023 | |
| Utility | | Backstage | https://backstage.io/ | Developer portal; you can integrate all your tools into unified views | | |
| Utility | | Statuspage | https://www.atlassian.com/software/statuspage | | Atlassian Corporation, TEAM (NASDAQ) | |
| Utility | | Status.io | https://status.io/ | | | |
| Postmortems | | Jeli | https://www.jeli.io/ | | Acquired by PagerDuty 2023 | |
| Pagers | | zenduty | https://www.zenduty.com/ | | | |
| Pagers | | Datadog | https://www.datadoghq.com/blog/incident-response-with-datadog/ | | | |
| Pagers | Utility | Betterstack | https://betterstack.com/ | Message: replacing PagerDuty, Statuspage and Pingdom for half the price | $28.6, last round 2024 | $230/month for 6 team members and 200 monitors |