For an overview of on-call, I recommend reading chapter 14, Oncall 1.
But now to my example:
- We have a team with primary and secondary rosters
- We can find what works best for us using 1-3 sites
- We are looking at how many people we need and how we calculate costs
- We don’t focus on daily shifts, because that would make the example more complicated
- We don’t use combinations such as a daily primary with a weekly secondary, because that would complicate the example too
- We’re skipping many possible combinations. Don’t think this is the only good way to do it; this is more about how to think about your own use case
Some numbers first. Calculate and think about what is important to you, and whether you have enough people to reach at least the minimum number under your conditions. My example works with two rosters, and we can choose between 1-3 sites.
With one site, we have 8 people who rotate every 4 weeks. If the same people can cover both primary and secondary, stretch this into a 2-month cycle and simply swap primary and secondary in the second month. I don’t recommend doing two shifts in a row, as many teams do. It’s not healthy. Still, all these tables show a minimum number of people, which isn’t optimal. You need extra capacity for vacations and sick leave, so that nobody sits right at that 25% on-call load.
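The 4-week rotation and the 25% figure follow from simple arithmetic. Here is a minimal sketch of it, assuming a single site, week-long shifts, and one person per roster per week. The function name `rotation_stats` is my own illustration, not something from the SRE book or PagerDuty.

```python
from fractions import Fraction


def rotation_stats(people: int, rosters: int = 2) -> tuple[Fraction, Fraction]:
    """Return (weeks in one full rotation, share of time each person is on call).

    Assumptions (not from any tool): one site, week-long shifts,
    and exactly one person per roster in any given week.
    """
    if people < rosters:
        raise ValueError("need at least one person per roster")
    weeks_per_rotation = Fraction(people, rosters)  # weeks until it is your turn again
    on_call_share = Fraction(rosters, people)       # fraction of weeks spent on call
    return weeks_per_rotation, on_call_share


# Bare minimum from the example: 8 people, primary + secondary roster.
weeks, share = rotation_stats(people=8, rosters=2)
print(f"{weeks} weeks per rotation, {float(share):.0%} of time on call")
# prints "4 weeks per rotation, 25% of time on call"

# Two extra people as headroom for vacations and sick leave.
weeks, share = rotation_stats(people=10, rosters=2)
print(f"{weeks} weeks per rotation, {float(share):.0%} of time on call")
# prints "5 weeks per rotation, 20% of time on call"
```

Playing with the `people` argument shows how quickly a little extra capacity moves you away from the 25% edge.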
With two sites, we have 8 people rotating every 4 weeks. The recommendation from the SRE book is 6 people per site, and I agree. This calculation is the bare minimum, and I recommend at least 6 people per site.
With three sites, we have 9 people rotating every 3 weeks. I would still go with 6 people per site.
You can find many other scenarios in the PagerDuty documentation. I recommend looking through it for inspiration.
The last important thing for multi-site setups is time zones. You often can’t choose where your teams are located. If you can, try to make the locations work for on-call: the sites need enough distance between them to split the shifts, but still enough overlap for communication. It’s hard to find an optimal setup.
A time zone difference of about 8 hours works well from my perspective. I made an example with a few time zones showing what three sites and two sites look like.
I worked with US (PST) and EU teams. The common working time is very limited, only 2-3 hours for communication, but for two-site shifts it works well. If you need a third site, looking east to Singapore or Thailand is a good option.
Many US teams use India, but that doesn’t work with Europe for on-call. The time difference between Paris and Mumbai is 3.5 hours in summer (4.5 hours in winter), which is too small to help with night shifts.
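To compare candidate sites quickly, you can compute the clock difference between them. This is a rough sketch with hard-coded UTC offsets that I picked for illustration. Real offsets shift with daylight saving time, so treat the values in `UTC_OFFSETS` as assumptions and adjust them for your own locations.

```python
# UTC offsets in hours. Fixed, assumed values for illustration only;
# real offsets change with daylight saving time.
UTC_OFFSETS = {
    "US West (PST)": -8.0,
    "Paris (CET, winter)": 1.0,
    "Paris (CEST, summer)": 2.0,
    "Mumbai (IST)": 5.5,
    "Singapore (SGT)": 8.0,
}


def hours_apart(a: str, b: str) -> float:
    """Shortest clock difference between two sites, going either way around the globe."""
    diff = abs(UTC_OFFSETS[a] - UTC_OFFSETS[b]) % 24
    return min(diff, 24 - diff)


pairs = [
    ("US West (PST)", "Paris (CET, winter)"),
    ("Paris (CET, winter)", "Singapore (SGT)"),
    ("Singapore (SGT)", "US West (PST)"),
    ("Paris (CEST, summer)", "Mumbai (IST)"),
]
for a, b in pairs:
    print(f"{a} <-> {b}: {hours_apart(a, b)} h")
```

With these offsets, the US West, Paris, and Singapore trio comes out at 9, 7, and 8 hours apart, while Paris and Mumbai are only 3.5 hours apart in summer.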
T. A. Limoncelli, S. R. Chalup, and C. J. Hogan, The Practice of Cloud System Administration: Designing and Operating Large Distributed Systems, Volume 2. Addison-Wesley, 2014. https://learning.oreilly.com/library/view/practice-of-cloud/9780133478549/title.html