How to Design On-Call

A guide to good on-call that does not kill your team

You can read quotes like this across the cloud industry:

“Some teams have had dramatically different pager volumes for years at my company. There are teams which people like to work on thanks to their great mission, but those same teams have a noisy on-call. There is no way I would move to those teams, where moving would come with so much unpaid overtime at unsociable hours.” (From a publicly traded tech company valued at $13B.)

Making on-call work well across teams isn’t easy. You need strong support from senior executives, and the team manager needs to know where to push for hiring and how to set priorities so the team avoids unmanageable toil.

We don’t cover the details of incident response for engineers on-call. To familiarize yourself with working during an incident, read PagerDuty’s excellent documentation 1.

How to Introduce On-Call into Your Company?

Introducing on-call is a major culture change. For someone who has never done it before, it significantly impacts their personal life and brings a level of stress they didn’t have before. The company should discuss it with people upfront and include it in contracts where required. On-call should not come as a surprise or be sprung on someone after their probation period.

If you are a small startup, start with coverage during working hours only. Focus on good observability and alerting with a proper paging workflow, and give ownership to the developers working on the service. This approach is called “You build it, you run it.” 2

I recommend taking your time with changes. Make sure everyone who joins on-call fully understands what the extra work entails, and give them proper training using gamedays 3. I recommend a training environment where everyone goes through a few simulated incidents before going live. A well-mastered process is more important than having the right solution; for solutions, you can always summon engineers from across the company. Communicating with all stakeholders and staying in control of what happens is critical to good incident management.

How Many People Do You Need For On-call?

“Using the 25% on-call rule, we can derive the minimum number of SREs required to sustain a 24/7 on-call rotation. Assuming that there are always two people on-call (primary and secondary, with different duties), the minimum number of engineers needed for on-call duty from a single-site team is eight: assuming week-long shifts, each engineer is on-call (primary or secondary) for one week every month. For dual-site teams, a reasonable minimum size of each team is six, both to honor the 25% rule and to ensure a substantial and critical mass of engineers for the team.” - SRE book 4

As the quote above explains, Google recommends a minimum of eight engineers for a single site, or six engineers per site in the case of two sites. My calculations show that you could run on-call with fewer engineers divided between two or three locations, but you always have to account for vacations, sick days, business travel, and so on.
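To make the math concrete, here is a minimal back-of-the-envelope sketch in Python. The 15% absence rate and the assumption of two people paged at once (primary and secondary) are illustrative choices of mine, not figures from the SRE book:

```python
# Rough staffing check for a 24/7 on-call rotation: how often is each
# engineer on-call, once absences are taken into account?
WEEKS_PER_YEAR = 52

def oncall_weeks_per_engineer(team_size: float, concurrent: int = 2) -> float:
    """Weeks per year each engineer is on-call (primary or secondary)."""
    return WEEKS_PER_YEAR * concurrent / team_size

def effective_team(team_size: int, absence_rate: float = 0.15) -> float:
    """Team size after discounting vacations, sick days, and travel."""
    return team_size * (1 - absence_rate)

for size in (6, 8, 12):
    available = effective_team(size)
    load = oncall_weeks_per_engineer(available)
    print(f"{size} engineers -> ~{available:.1f} available on average, "
          f"~{load:.1f} on-call weeks/year each "
          f"({load / WEEKS_PER_YEAR:.0%} of the time)")
```

With eight engineers and two on-call at once, the nominal load is exactly 25%; discounting absences pushes it closer to 30%, which shows how little slack the recommended minimums leave.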

I remember running a full 24/7/365 on-call rotation with six people: it was exhausting and significantly impacted people’s personal lives. As an SRE manager, I always try to achieve a good balance between work and time off for everyone in an on-call rotation. The burden of providing non-stop support can be a major reason why people leave engineering teams.


  1. PagerDuty Incident Response process: a cut-down version of the internal documentation PagerDuty uses for major incidents and for preparing new employees for on-call responsibilities - https://response.pagerduty.com/ ↩︎

  2. “You build it, you run it” is a software development principle that emphasizes the responsibility of the development team in designing, building, and maintaining the systems they create. ↩︎

  3. A gameday is a period of time (usually 1–4 hours) set aside for a team to run one or more experiments on a system, process, or service, observe the impact, and then discuss the outcomes. ↩︎

  4. Betsy Beyer, Chris Jones, Jennifer Petoff, and Niall Richard Murphy. 2016. Site Reliability Engineering: How Google Runs Production Systems (1st. ed.). O’Reilly Media, Inc. - https://landing.google.com/sre/sre-book/chapters/being-on-call/ ↩︎

Incident process tooling

We can split tools into several categories: pagers, incident response, incident workflows, utility, and postmortems. More and more tools try to be universal and cover several of these areas; what matters is finding what suits you best and how to combine tools effectively. Tool overviews are easy to find elsewhere, so here we focus on the workflow and on what you actually need. If you look at the incident process as a workflow, you can see the essential parts your tooling has to cover.
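As one rough illustration (the stage names and their mapping to the categories above are my own sketch, not a standard), the workflow might look like this:

```python
# Hypothetical sketch: the incident process as a sequence of stages, each
# mapped to the tool category that has to cover it. Illustrative only.
INCIDENT_WORKFLOW = [
    ("detect",      "monitoring and alerting (e.g. Prometheus Alertmanager)"),
    ("page",        "pagers: schedules, escalation policies, overrides"),
    ("respond",     "incident response: roles, severity, a dedicated channel"),
    ("communicate", "utility: status pages, stakeholder updates"),
    ("resolve",     "incident workflows: tasks, timeline, handover notes"),
    ("learn",       "postmortems: blameless review and action items"),
]

for stage, tooling in INCIDENT_WORKFLOW:
    print(f"{stage:>12}: {tooling}")
```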

Daily On-Call Routine

For daily on-call support, I recommend a few things. Make everything related to your on-call visible in one place. I recommend a dedicated Slack channel for on-call. If you funnel all notifications and alerts into that channel, it becomes easy to see when something is happening. It’s best to also have a dedicated channel for major incidents so that communication isn’t overwhelmed by automated messages. Some tools support this out of the box.
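As a minimal sketch of that funneling idea, the snippet below posts uniformly formatted alert messages into the dedicated channel through a Slack incoming webhook; the webhook URL, function name, and alert fields are placeholders of my own:

```python
# Funnel alerts from any source into one dedicated on-call Slack channel.
# Slack incoming webhooks accept a JSON payload with a "text" field.
import requests

# Placeholder: create a real webhook per channel in Slack's app settings.
ONCALL_WEBHOOK = "https://hooks.slack.com/services/T000/B000/XXXX"

def notify_oncall_channel(alert_name: str, severity: str, details: str) -> None:
    """Post a compact, uniformly formatted alert to the on-call channel."""
    text = f":rotating_light: [{severity.upper()}] {alert_name}\n{details}"
    response = requests.post(ONCALL_WEBHOOK, json={"text": text}, timeout=5)
    response.raise_for_status()  # fail loudly so a broken funnel gets noticed

notify_oncall_channel("api-latency-p99", "warning",
                      "p99 latency above 800ms for 10 minutes on api-gateway")
```

Most of the pagers in the table below ship this kind of Slack integration out of the box, so custom glue like this is only needed for alert sources they don’t cover.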

Tools table overview

Last updated: March 2026. Pricing and status verified against vendor websites. Prices are approximate and may change; always check the vendor’s current pricing page before making decisions.

| Category | Other Functions | Name | URL | Notes | Company Info | Base Price |
|---|---|---|---|---|---|---|
| Pagers | Incident Workflows | PagerDuty | https://www.pagerduty.com/ | Long on market, strong reliability; now includes AIOps and AI agents. | PagerDuty Inc, NYSE: PD | ~$21/user/month (On-Call plan) |
| Pagers | | OpsGenie | https://www.atlassian.com/software/opsgenie | ⚠️ EOL: Atlassian announced shutdown April 2027; migrate to Jira Service Management. | Atlassian Corporation, NASDAQ: TEAM | ~$23/user/month (former price); do not start new deployments |
| Pagers | | Splunk On-Call | https://www.splunk.com/en_us/products/on-call.html | Former VictorOps; Splunk acquired by Cisco in 2024. Pricing via sales only. | Cisco (acquired Splunk 2024) | Contact sales |
| Pagers | | Grafana OnCall | https://grafana.com/products/oncall/ | Open-source self-hosted option available; bundled with Grafana Cloud Pro/Enterprise. | Grafana Labs | Free (OSS) / bundled with Grafana Cloud |
| Pagers | | PagerTree | https://pagertree.com/ | | | ~$10/user/month |
| Pagers | | Iris | https://github.com/linkedin/iris | LinkedIn open-source on-call and escalation tool; self-hosted only. | OSS (LinkedIn) | Free (self-hosted) |
| Pagers | Incident Workflows | Rootly | https://rootly.com | Slack-native incident response; expanded into on-call management (2023). Raised $24M. | ~$32M raised total (2023) | Per-active-user; contact sales |
| Pagers | Incident Workflows | incident.io | https://incident.io/ | Slack-native incident management; added on-call scheduling (2023). | ~$34M raised (2022) | ~$16/user/month (Team plan) |
| Pagers | Incident Workflows | Squadcast | https://www.squadcast.com/ | On-call + incident response platform; strong SRE workflow focus. | | Free tier; ~$9/user/month (Pro) |
| Pagers | | Zenduty | https://www.zenduty.com/ | On-call with incident workflows; generous free tier. | | Free tier; ~$10/user/month (Pro) |
| Pagers | | Datadog | https://www.datadoghq.com/blog/incident-response-with-datadog/ | Incident management bundled with Datadog observability platform. | Datadog Inc, NASDAQ: DDOG | Bundled with Datadog platform |
| Pagers | Utility | Better Stack | https://betterstack.com/ | All-in-one: on-call, uptime monitoring, status pages, and log management. | ~$28.6M raised (2024) | ~$29/user/month (Responder license) |
| Incident Response | Incident Workflows | FireHydrant | https://firehydrant.com/ | Incident orchestration and retrospectives; flat annual pricing. | | ~$9,600/year (flat) |
| Incident Response | Postmortems | Jeli | https://www.pagerduty.com/ | Acquired by PagerDuty (2023); now integrated into the PagerDuty platform. | Part of PagerDuty | Included in PagerDuty plans |
| Pagers | Incident Workflows | Jira Service Management | https://www.atlassian.com/software/jira/service-management | Official Atlassian replacement for OpsGenie; includes alerting and on-call in Premium tier. | Atlassian Corporation, NASDAQ: TEAM | Free; Premium ~$44/user/month |
| Pagers | | ilert | https://www.ilert.com/ | AI-first on-call and alerting platform; popular OpsGenie alternative; hosted in Germany (GDPR). | ilert GmbH | Free (≤5 users); Pro from ~$9/user/month |
| Pagers | | xMatters | https://www.xmatters.com/ | Enterprise-grade IT alerting; acquired by Everbridge (2021). #1 on G2 IT Alerting list. | Everbridge (acquired xMatters 2021) | Contact sales |
| Pagers | | Datadog On-Call | https://www.datadoghq.com/product/on-call/ | Dedicated on-call product launched by Datadog (2024); integrates with Datadog observability. | Datadog Inc, NASDAQ: DDOG | Bundled with Datadog platform |
| Pagers | | AlertOps | https://alertops.com/ | PagerDuty alternative with generous free tier; strong escalation and routing features. | | Free (≤5 users); Pro ~$15/user/month |
| Utility | | Prometheus Alertmanager | https://prometheus.io/docs/alerting/latest/alertmanager/ | OSS alerting component for Prometheus; widely used in Kubernetes/cloud-native stacks. | OSS (CNCF) | Free (self-hosted) |
| Utility | | Backstage | https://backstage.io/ | Open-source developer portal; integrates on-call and incident tools into unified views. | OSS (Spotify) | Free (self-hosted) |
| Utility | | Statuspage | https://www.atlassian.com/software/statuspage | Status page tool; separate from OpsGenie (not affected by OpsGenie EOL). | Atlassian Corporation, NASDAQ: TEAM | From ~$79/month |
| Utility | | Status.io | https://status.io/ | Hosted status page service. | | From ~$79/month |

Postmortems

Postmortems are necessary for learning and should be held whenever an incident occurs 1. You can decide whether every incident deserves at least a short write-up or whether low-severity ones can be skipped. You must always make sure that nobody points fingers and that the whole process is blameless. Best practice: avoid blame and keep it constructive 2. “Blameless postmortems can be challenging to write, because the postmortem format clearly identifies the actions that led to the incident. Removing blame from a postmortem gives people the confidence to escalate issues without fear. It is also important not to stigmatize frequent production of postmortems by a person or team. An atmosphere of blame risks creating a culture in which incidents and issues are swept under the rug, leading to greater risk for the organization.”
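As a starting point, here is a hypothetical minimal template; the section names are one common convention rather than a standard, and they deliberately steer the write-up toward systems and timelines instead of people:

```python
# A minimal blameless-postmortem skeleton a team can copy into its docs tool.
POSTMORTEM_TEMPLATE = """\
Postmortem: <incident title>
Date / duration:
Severity / impact: (users affected, revenue, SLO budget burned)
Summary: (two or three blame-free sentences: what happened, how it was mitigated)
Timeline: (detection -> escalation -> mitigation -> resolution, with timestamps)
Contributing factors: (systems and processes, never individuals)
What went well / where we got lucky:
Action items: (each with an owner, a priority, and a tracking ticket)
"""

print(POSTMORTEM_TEMPLATE)
```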

Summary

Designing on-call is a tough process with constantly changing conditions. Team size shifts frequently: you add people during hiring and lose them when they leave, and planning around that is never easy. We don’t cover handover times between sites in detail here. The SRE book recommends Wednesday for weekly shift changes, but I suggest Sunday or Monday instead. We use Sunday for reporting purposes and hold an incident review on Monday, which works well for us. Some teams prefer Wednesday to avoid overlap with development planning or other recurring company meetings.
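For illustration, here is a minimal sketch of generating week-long shifts with a Sunday handover; the roster and start date are placeholders:

```python
# Generate a weekly primary/secondary rotation with handover on Sunday.
from datetime import date, timedelta

ENGINEERS = ["alice", "bob", "carol", "dan", "erin", "frank"]  # placeholder roster

def print_rotation(start: date, weeks: int) -> None:
    # Align the start date to the next Sunday (weekday() == 6).
    start += timedelta(days=(6 - start.weekday()) % 7)
    for week in range(weeks):
        primary = ENGINEERS[week % len(ENGINEERS)]
        secondary = ENGINEERS[(week + 1) % len(ENGINEERS)]
        print(f"{start + timedelta(weeks=week)}  "
              f"primary={primary:<6} secondary={secondary}")

print_rotation(date(2026, 3, 1), weeks=4)
```

In this sketch, each week’s secondary becomes the next week’s primary, which is a common pattern: the handover conversation happens naturally between two people who both already have context.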