On-Call Documentation

A framework such as Diátaxis makes it easier to write effective documentation. Diátaxis is a way of thinking about and doing documentation. We will apply a similar approach and split on-call documentation into four parts.

Tutorials

What You Need for On-Call - A tutorial that helps new on-call members learn the process and tools and gain all the access they need for an on-call shift.

PagerDuty Incident Response

How-to Guides

Standard Operating Procedures (SOPs) for deploying, rotating secrets, adding new regions, scaling up and down, adding capacity, etc.
Runbooks for Incidents - every actionable alert needs a runbook, and we should test each runbook regularly to ensure it still works and is straightforward to understand.

You can start with a checklist like the one in the SRE Workbook¹:

Administering production jobs
Understanding debugging info
“Draining” traffic away from a cluster
Rolling back a bad software push
Blocking or rate-limiting unwanted traffic
Bringing up additional serving capacity
Using the monitoring systems (for alerting and dashboards)
Describing the architecture, various components, and dependencies of the services

Before going on-call, the team reviews precise guidelines about the responsibilities of on-call engineers.

For example:

At the start of each shift, the on-call engineer reads the handoff from the previous shift.
The on-call engineer minimizes user impact first, then makes sure the issues are fully addressed.
At the end of the shift, the on-call engineer sends a handoff email to the next engineer on-call.

Runbooks / Playbooks (Google SRE naming)

As The Site Reliability Workbook¹ says, playbooks “reduce stress, the mean time to repair (MTTR), and the risk of human error.”

All alerts should be immediately actionable: each page should call for an action that a human must take right away and that the system cannot take itself. The signal-to-noise ratio should be high, with few false positives; a low signal-to-noise ratio raises the risk that on-call engineers develop alert fatigue.
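One way to put a number on signal-to-noise is the fraction of pages that actually required human action. The sketch below is illustrative only: the `Page` record, its `actionable` field, and the sample data are assumptions made for this example, not something defined by PagerDuty or the SRE Workbook.

```python
from dataclasses import dataclass


@dataclass
class Page:
    """One page received by the on-call engineer (hypothetical record)."""
    alert_name: str
    actionable: bool  # True if a human had to take action


def actionable_ratio(pages: list[Page]) -> float:
    """Return the fraction of pages that needed human action (0.0 if there were none)."""
    if not pages:
        return 0.0
    return sum(p.actionable for p in pages) / len(pages)


def noisiest_alerts(pages: list[Page]) -> dict[str, int]:
    """Count non-actionable pages per alert; these are candidates for tuning or removal."""
    counts: dict[str, int] = {}
    for p in pages:
        if not p.actionable:
            counts[p.alert_name] = counts.get(p.alert_name, 0) + 1
    return counts


# Made-up sample data for one shift.
shift_pages = [
    Page("HighErrorRate", True),
    Page("DiskAlmostFull", False),
    Page("DiskAlmostFull", False),
    Page("HighLatency", True),
]

print(actionable_ratio(shift_pages))  # 0.5
print(noisiest_alerts(shift_pages))   # {'DiskAlmostFull': 2}
```

Tracking this ratio per shift makes a drift toward noisy alerts visible before it turns into alert fatigue.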

Just like new code, new alerts should be thoroughly and thoughtfully reviewed. Each alert should have a corresponding playbook (runbook) entry.
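Part of this review can be automated. The sketch below assumes alert definitions are available as plain dictionaries with a hypothetical `runbook_url` field; the field name, the example URL, and the sample alerts are all assumptions, so adapt them to wherever your alert definitions live.

```python
def alerts_missing_runbook(alerts: list[dict]) -> list[str]:
    """Return the names of alerts whose runbook link is missing or empty."""
    missing = []
    for alert in alerts:
        url = (alert.get("runbook_url") or "").strip()
        if not url:
            missing.append(alert["name"])
    return missing


# Made-up alert definitions; "runbook_url" is a hypothetical field name.
alerts = [
    {"name": "HighErrorRate", "runbook_url": "https://wiki.example.com/runbooks/high-error-rate"},
    {"name": "DiskAlmostFull", "runbook_url": ""},
    {"name": "HighLatency"},
]

print(alerts_missing_runbook(alerts))  # ['DiskAlmostFull', 'HighLatency']
```

Running a check like this when an alert change is reviewed keeps new alerts from shipping without a runbook entry.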

Example of runbook template

How to create an effective runbook

Any good runbook has five attributes, the five As.² It must be:

Actionable. It’s nice to know the big picture and architecture of a system, but when you are looking for a runbook, you’re looking to take action based on a particular situation.
Accessible. If you can’t find the runbook, it doesn’t matter how well it is written.
Accurate. If it doesn’t contain truthful information, it’s worse than nothing at all.
Authoritative. It is confusing to have more than one runbook for any given process.
Adaptable. Systems evolve, and if you can’t change your runbook, the drift will make it unusable.
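Accuracy and adaptability are easier to maintain when stale runbooks are surfaced automatically. A minimal sketch, assuming each runbook records a hypothetical last-reviewed date and that 90 days is an acceptable review interval; both the data shape and the interval are assumptions for illustration, not recommendations from the cited source.

```python
from datetime import date, timedelta


def stale_runbooks(
    last_reviewed: dict[str, date], today: date, max_age_days: int = 90
) -> list[str]:
    """Return runbook names last reviewed more than max_age_days before today."""
    cutoff = today - timedelta(days=max_age_days)
    return sorted(name for name, reviewed in last_reviewed.items() if reviewed < cutoff)


# Made-up review dates keyed by runbook name.
last_reviewed = {
    "high-error-rate": date(2024, 4, 1),
    "disk-almost-full": date(2023, 11, 15),
}

print(stale_runbooks(last_reviewed, today=date(2024, 4, 21)))  # ['disk-almost-full']
```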

Glossaries

Glossaries³ can be helpful for a few reasons:

Glossaries help you avoid repetition. When you can refer to a definition with a linked explanation, you save time and words.
Glossaries ensure consistency in descriptions. If something is explained in multiple ways, it can become confusing.
Glossaries enhance the usability of a runbook for engineers at all experience levels. By including a glossary in your runbook, you provide essential explanations for newer engineers and streamline information for more experienced ones.

Explanation

RFC (Request for Comments)
RFD (Request for Discussion)
ADR (Architecture Decision Record)
Papers, wikis, or anything else that helps explain why things are the way they are and how they work in detail.

Last updated on Apr 21, 2024