These practices explain how to scale You Build It You Run It across a large organisation with many product teams and many digital services. Without these practices, you'll suffer from the Linear run cost pitfall.
At Equal Experts, we don't believe in prescriptive scaling frameworks. We believe in applying the same holistic principles and practices for deployment throughput, service reliability, and learning culture to 1, 10, or 50 product teams, in a cost-effective way.
- 3. Introduce additional selection practices for finer-grained operating model choices.
As the number of your product teams and digital services scales up, there can be a temptation to centralise some incident response in a new operations team. It's important to resist this idea, because it's just another form of Ops Run It that will damage delivery throughput, service reliability, and learning culture.
Ensure that availability targets are selected on financial exposure, and match an on-call level to each availability target. This achieves a balance between financial exposure, run costs, and remuneration for on-call product team developers that doesn't weaken operability incentives.
You Build It You Run It protects business outcomes. It doesn't mean every digital service needs 24x7 on-call support. Not every digital service needs to be 99.99% available, and always on. Some digital services have a low level of financial exposure and don't need out of hours support, some have a medium level of exposure warranting some out of hours support, and some have a high level of exposure that justifies dedicated out of hours support.
When an availability target is selected for a digital service based on its financial exposure, it is assigned a level of on-call support in addition to its tolerable downtime per week.
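As a rough illustration of how an availability target translates into tolerable downtime per week, the sketch below multiplies the 168 hours in a week by the unavailable fraction. The function name is hypothetical, and the results are unrounded, so they can differ slightly from rounded figures quoted for each on-call level.

```python
HOURS_PER_WEEK = 7 * 24  # 168 hours in a week


def tolerable_downtime_hours_per_week(availability_percent: float) -> float:
    """Return the hours per week a service may be unavailable at a given target."""
    if not 0.0 <= availability_percent <= 100.0:
        raise ValueError("availability_percent must be between 0 and 100")
    return HOURS_PER_WEEK * (100.0 - availability_percent) / 100.0


if __name__ == "__main__":
    # The three availability targets discussed on this page.
    for target in (95.0, 99.0, 99.9):
        hours = tolerable_downtime_hours_per_week(target)
        print(f"{target}% -> {hours:.2f} hours ({hours * 60:.0f} minutes) per week")
```

For example, a 99.0% target works out at about 1.7 hours of tolerable unavailability per week, and a 99.9% target at about 10 minutes.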
In our Selection practices, there's a furniture retailer example with a third-party COTS ecommerce platform, a custom bedroom frontend, and a custom appointments frontend. The financial exposure bands linked to different availability targets can be updated to include levels of on-call.
In working hours, a product team always has a team schedule, and is accountable for the reliability of its digital services. Out of hours there could be no schedule, a domain schedule shared between teams, or a team schedule again.
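The matching of an availability target to an out of hours schedule could be sketched as a simple lookup. This is a minimal illustration rather than a prescribed implementation: the enum and function names are hypothetical, and treating the 95.0%, 99.0%, and 99.9% targets as band boundaries for in-between values is an assumption.

```python
from enum import Enum


class OutOfHoursSchedule(Enum):
    NONE = "no out of hours schedule"
    DOMAIN = "domain schedule shared between sibling teams"
    TEAM = "team schedule"


def out_of_hours_schedule(availability_target_percent: float) -> OutOfHoursSchedule:
    """Match an availability target to an out of hours on-call level.

    Assumption: targets between the named levels fall into the lower band.
    """
    if availability_target_percent >= 99.9:
        return OutOfHoursSchedule.TEAM
    if availability_target_percent >= 99.0:
        return OutOfHoursSchedule.DOMAIN
    return OutOfHoursSchedule.NONE


# Furniture retailer services whose targets are stated on this page.
FURNITURE_RETAILER = {"appointments frontend": 95.0, "bedroom frontend": 99.0}

if __name__ == "__main__":
    for service, target in FURNITURE_RETAILER.items():
        print(f"{service}: {target}% -> {out_of_hours_schedule(target).value}")
```

In working hours the answer is always a team schedule, so the lookup only needs to decide the out of hours level.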
Don't do out of hours on-call for your digital services when you have these desired outcomes:
- Weekly to daily deployments, or more.
- 95.0% availability protection.
- 9 hours of tolerable unavailability per week.
Reduce on-call standby costs while incentivising product teams to care about operability by promoting working hours ownership and on-call.
If a production incident happens during working hours, there is an immediate callout to the owning product team, and they respond to the incident based on their in-incident calculation of financial loss. If an incident happens out of hours, the callout is suppressed until the start of the next working day. This incentivises a product team to build operability into a digital service without out of hours support, in order to avoid a production incident spilling over into the next working day.
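The callout rule above can be sketched as a small function that either passes an alert through immediately or holds it until the next working day starts. This is an illustrative sketch only: the working hours of Monday to Friday, 09:00 to 17:30, and the function names are assumptions, and a real alerting tool would express this in its own schedule configuration.

```python
from datetime import datetime, time, timedelta

# Assumed working hours: Monday to Friday, 09:00 to 17:30.
WORK_START = time(9, 0)
WORK_END = time(17, 30)


def is_working_hours(moment: datetime) -> bool:
    """True if the moment falls inside the assumed working hours."""
    return moment.weekday() < 5 and WORK_START <= moment.time() < WORK_END


def callout_time(incident_at: datetime) -> datetime:
    """Return when the owning product team is called out for an incident."""
    if is_working_hours(incident_at):
        return incident_at  # immediate callout in working hours

    # Out of hours: suppress the callout until the next working day starts.
    candidate = incident_at
    if candidate.time() >= WORK_START:
        # Today's start time has already passed, so move to the following day.
        candidate += timedelta(days=1)
    candidate = datetime.combine(
        candidate.date(), WORK_START, tzinfo=incident_at.tzinfo
    )
    while candidate.weekday() >= 5:  # skip Saturday and Sunday
        candidate += timedelta(days=1)
    return candidate
```

Under these assumptions, an incident at 22:00 on a Friday is not called out until 09:00 on the following Monday, which is the spillover a product team is incentivised to avoid.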
For the furniture retailer, the appointments frontend has a 95.0% availability target, which matches no out of hours schedule.
It's important to protect operability incentives for product teams who are only on-call during working hours. If your organisation has Ops Run It for foundational systems, ensure that digital services cannot be covered out of hours by that operating model.
Share out of hours on-call schedules between sibling product teams for your digital services when you have these desired outcomes:
- Weekly, daily, or more frequent deployments.
- 99.0% availability protection.
- 2 hours of tolerable unavailability per week.
Reduce on-call standby costs while incentivising product teams to care about operability, even when they're on-call infrequently.
A domain is a logical grouping of digital services with an established affinity, and a domain schedule is an out of hours on-call schedule shared by the product teams that own those services. The owning product teams are considered to be siblings. The domain construct needs to minimise on-call cognitive load, simplify knowledge sharing between teams, and focus on business outcomes. We recommend either of the following:
- Product domains grouped by customer journey
- Architectural domains grouped by technology capabilities
We don't recommend geographic domains grouped by region, or technology domains grouped by tool choices. Either approach produces a confusing jumble of digital services, with cross-cutting product boundaries and a high cognitive load. This has a negative impact on the time to diagnose and resolve production incidents.
For the furniture retailer, the bedroom frontend has a 99.0% availability target, which matches domain out of hours on-call. The owning product team has to identify other digital services in the same product domain, and work with those product teams to establish a shared domain on-call schedule.
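One way to picture a shared domain schedule is a rota that hands out of hours cover to each sibling team in turn. The sketch below is an assumption-laden illustration: the weekly round-robin rotation, the function name, and the team names are all hypothetical, and the source does not prescribe how a domain rota is built.

```python
from datetime import date, timedelta


def domain_rota(
    teams: list[str], first_monday: date, weeks: int
) -> list[tuple[date, str]]:
    """Assign each week's out of hours cover to one sibling team, round-robin."""
    if not teams:
        raise ValueError("a domain schedule needs at least one product team")
    return [
        (first_monday + timedelta(weeks=week), teams[week % len(teams)])
        for week in range(weeks)
    ]


if __name__ == "__main__":
    # Hypothetical sibling teams in the same product domain.
    siblings = ["bedroom-team", "sibling-team-a", "sibling-team-b"]
    for week_start, team in domain_rota(siblings, date(2024, 1, 1), 6):
        print(f"Week of {week_start.isoformat()}: {team}")
```

With three sibling teams, each team covers out of hours for one week in three, which is how a domain schedule lowers standby costs while keeping every team on-call some of the time.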
Domain schedules balance strong operability incentives with run costs. They aren't perfect, and have their own complications:
- Domain ambiguity. Product teams may disagree on what a domain is, if there isn't a consistent, rich language in a particular organisational context.
- Domain affinity. Product teams may disagree on which digital services comprise a domain, or wish to run their own team schedules regardless of availability targets.
- Domain knowledge synchronisation costs. The cost of sharing knowledge about multiple digital services between multiple product teams can be high. Knowledge sharing across a business context can be difficult. The cost of knowledge sharing across technology choices can be reduced if teams are encouraged to use the same tools for their digital services wherever possible.
- Domain on-call funding. Budget holders for different product teams in the same domain need to choose one of them as the sole budget holder for on-call funding.
These can be mitigated by ensuring product teams are aware of domains and on-call responsibilities from day one. The sooner a product team is aware they're due to share a domain schedule with another team, the easier that eventual process will be.
Do out of hours on-call at scale for your digital services when you have these desired outcomes:
- Weekly to daily deployments, or more.
- 99.9% or more availability protection.
- 10 minutes or less of tolerable unavailability per week.
Maximise incentives for product teams to care about operability.
This is no different from the standard You Build It You Run It. There should be a small number of digital services that require this on-call level. A high number of product teams with their own team schedules is a weak signal that something is wrong in availability target selection.