Service reliability in You Build It You Run It
An on-call product team offers 24/7 production support, and can modify all aspects of a digital service, including:
- Alert definitions
- Infrastructure definitions
- Monitoring dashboards
You Build It You Run It co-exists in a hybrid operating model with Ops Run It, and it's important to have specialist operational teams as well as generalist, cross-functional product teams. The same operational enablers offer their scarce, deep expertise to on-call product teams for digital services, and to the application support team for foundational systems. For example, a shared DBA team can assist with database provisioning, performance, and operations.
For governance, the product teams are responsible for day-to-day work in availability protection and availability restoration. They track their costs on a team-by-team basis, and make those costs visible to senior leaders.
An on-call product team is responsible for availability protection. This means proactively monitoring service telemetry and updating digital services during working hours. A product team observes service health checks, logs, and metrics plumbed into different dashboards, in telemetry tools such as AWS CloudWatch or Grafana. They can alter the monitorable events whenever necessary, to improve the information value of their telemetry data.
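As a rough illustration of what monitoring health checks against an availability target can look like, the sketch below computes availability as the fraction of passing health checks and compares it with a target. The `HealthCheck` shape, the function names, and the example target are assumptions made for illustration. They are not part of AWS CloudWatch, Grafana, or any specific telemetry tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HealthCheck:
    """One health check result (hypothetical shape, not a vendor type)."""
    timestamp: int  # seconds since epoch
    healthy: bool


def availability(checks: list[HealthCheck]) -> float:
    """Return the fraction of health checks that passed, from 0.0 to 1.0."""
    if not checks:
        raise ValueError("no health checks to evaluate")
    passed = sum(1 for check in checks if check.healthy)
    return passed / len(checks)


def breaches_target(checks: list[HealthCheck], target: float) -> bool:
    """True when measured availability falls below the availability target."""
    return availability(checks) < target


# Example: 998 passing checks and 2 failing checks in the window.
window = [HealthCheck(t, True) for t in range(998)]
window += [HealthCheck(t, False) for t in range(998, 1000)]
print(availability(window))            # fraction of checks that passed
print(breaches_target(window, 0.999))  # compared against an assumed 99.9% target
```

In practice a team would usually express this as a query or alert rule inside its telemetry tool. The point of the sketch is only that the target is an explicit number the team owns and can change.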
A product team also updates their digital services by adding infrastructure capacity, making configuration changes, applying fixes, and updating alerts as necessary. This is prioritised alongside feature development, as it is the product team themselves who are on-call.
An on-call product team is responsible for availability restoration. This means reactively responding to production alerts in and out of working hours, including evenings, weekends, and bank holidays. All service alerts are expected to be resolved by the on-call product team, and incident response is consistently prioritised over feature development.
When an availability target is breached, a developer in the on-call product team receives an automated alert via an incident response platform such as PagerDuty or VictorOps. The responder acknowledges the alert, and the incident response platform automatically creates a ticket in a ticketing system such as ServiceNow. The responder then classifies and prioritises the incident, and adds more team members to the incident as required.
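The alert flow above can be sketched as a small state model: an incident is triggered, acknowledged by a responder (at which point a ticket is created), and then classified. Everything here is hypothetical, including the `Incident` fields, the `TICKET-n` identifiers, and the priority labels. It does not use the PagerDuty, VictorOps, or ServiceNow APIs.

```python
import itertools
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional


class Status(Enum):
    TRIGGERED = "triggered"
    ACKNOWLEDGED = "acknowledged"


@dataclass
class Incident:
    service: str
    status: Status = Status.TRIGGERED
    ticket_id: Optional[str] = None
    priority: Optional[str] = None
    responders: list[str] = field(default_factory=list)


# Stand-in for the ticketing integration: hands out sequential ticket numbers.
_ticket_numbers = itertools.count(1)


def acknowledge(incident: Incident, responder: str) -> None:
    """Record the acknowledgement and create a ticket, as the platform would."""
    if incident.status is not Status.TRIGGERED:
        raise ValueError("only a triggered incident can be acknowledged")
    incident.status = Status.ACKNOWLEDGED
    incident.responders.append(responder)
    incident.ticket_id = f"TICKET-{next(_ticket_numbers)}"


def classify(incident: Incident, priority: str, extra_responders=()) -> None:
    """Set the priority and add more team members to the incident."""
    if incident.status is not Status.ACKNOWLEDGED:
        raise ValueError("acknowledge the incident before classifying it")
    incident.priority = priority
    incident.responders.extend(extra_responders)


incident = Incident(service="checkout")
acknowledge(incident, "alice")
classify(incident, "P1", ["bob"])
print(incident)
```

Modelling the order of steps explicitly makes it harder to classify an incident that nobody has acknowledged, which mirrors the sequence described in the text.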
Responders observe their real-time service telemetry data, to understand the drift from normal to abnormal operating conditions. They diagnose the incident via their heuristics, innate service knowledge, and telemetry data. They attempt to restore availability via additional infrastructure, code changes, data fixes, configuration changes, rollbacks, and telemetry changes.
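One simple way to make "drift from normal to abnormal" concrete is to compare a current reading with a baseline of recent readings. The sketch below flags a value that sits more than a chosen number of standard deviations from the baseline mean. The function name, the threshold of 3, and the sample error rates are illustrative assumptions. Real responders also rely on heuristics and service knowledge that a single statistic does not capture.

```python
from statistics import mean, stdev


def is_abnormal(baseline: list[float], current: float, threshold: float = 3.0) -> bool:
    """True when `current` is more than `threshold` standard deviations from the baseline mean."""
    if len(baseline) < 2:
        raise ValueError("need at least two baseline samples")
    centre = mean(baseline)
    spread = stdev(baseline)
    if spread == 0:
        # A perfectly flat baseline: any change at all counts as drift.
        return current != centre
    return abs(current - centre) / spread > threshold


# Hypothetical error rates sampled under normal operating conditions.
normal_error_rates = [0.010, 0.012, 0.011, 0.009, 0.010]
print(is_abnormal(normal_error_rates, 0.011))  # a reading close to the baseline
print(is_abnormal(normal_error_rates, 0.080))  # a reading far above the baseline
```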
For a high-priority incident, the entire team may swarm on incident response, to minimise service unavailability. One team member acts as incident commander, to coordinate with other product teams involved in incident response, and to manage communications with senior stakeholders. Alternatively, You Build It You Run It is fully compatible with ITIL v3, and the team could invite an incident manager to act as incident commander for the duration of an incident.
Launch costs incurred in
Launch costs incurred in
Regular costs incurred in product team time for
Incident response costs incurred in product team time for
Low to medium
Can be measured as the cost of delay between incident start and incident finish. Caused by service unavailability, missed opportunities with customers, and delays in further feature development
Revenue loss and costs incurred
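The cost-of-delay measure mentioned above, taken between incident start and incident finish, can be sketched as elapsed time multiplied by an hourly cost. The hourly figure is an assumption each organisation would have to estimate for itself from lost revenue, missed customer opportunities, and delayed feature work. The function name and the example numbers are hypothetical.

```python
from datetime import datetime, timedelta


def cost_of_delay(started: datetime, finished: datetime, cost_per_hour: float) -> float:
    """Estimate the cost of an incident as its duration in hours times an hourly cost."""
    if finished < started:
        raise ValueError("incident cannot finish before it starts")
    hours = (finished - started) / timedelta(hours=1)
    return hours * cost_per_hour


# Example: a 90-minute incident at an assumed cost of 2,000 per hour.
start = datetime(2024, 1, 1, 22, 0)
finish = datetime(2024, 1, 1, 23, 30)
print(cost_of_delay(start, finish, 2000.0))
```

Summing this figure per team over a period gives one way to track costs on a team-by-team basis.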