Service reliability in You Build It You Run It
An on-call product team provides 24/7 production support, and can modify every aspect of its digital service:
  • Alert definitions
  • Code
  • Configuration
  • Data
  • Deployments
  • Infrastructure definitions
  • Logging
  • Monitoring dashboards
You Build It You Run It co-exists with Ops Run It in a hybrid operating model, which needs specialist operational teams as well as generalist, cross-functional product teams. The same operational enablers offer their scarce, deep expertise both to on-call product teams for digital services and to the application support team for foundational systems. For example, a shared DBA team can assist with database provisioning, performance, and operations.
For governance, the product teams are responsible for the day-to-day work of availability protection and availability restoration. They track their costs on a team-by-team basis, and make those costs visible to senior leaders.

Availability protection in You Build It You Run It

An on-call product team is responsible for availability protection. This means proactively monitoring service telemetry and updating digital services during working hours. A product team observes service health checks, logs, and metrics plumbed into different dashboards, in telemetry tools such as AWS CloudWatch or Grafana. They can alter the monitorable events whenever necessary, to improve the information value of their telemetry data.
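As a minimal sketch of the kind of check this telemetry supports, the snippet below computes availability from a window of health check results and compares it with a target. The `HealthCheckResult` type, the `availability` and `breaches_target` helpers, and the 99.9% target are hypothetical illustrations; they are not part of AWS CloudWatch, Grafana, or any team's actual setup.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class HealthCheckResult:
    """One health check sample (hypothetical shape, for illustration only)."""
    timestamp: int  # seconds since the start of the window
    healthy: bool


def availability(results: List[HealthCheckResult]) -> float:
    """Return the fraction of healthy samples; an empty window counts as fully available."""
    if not results:
        return 1.0
    healthy_count = sum(1 for result in results if result.healthy)
    return healthy_count / len(results)


def breaches_target(results: List[HealthCheckResult], target: float = 0.999) -> bool:
    """Return True when measured availability falls below the (assumed) target."""
    return availability(results) < target


# A window of 1000 samples, two of which failed.
window = [HealthCheckResult(t, healthy=(t not in (10, 20))) for t in range(1000)]
print(availability(window))     # 0.998
print(breaches_target(window))  # True
```

Counting samples is only one way to measure availability. A team could equally measure successful requests or elapsed downtime, depending on which monitorable events carry the most information value.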
A product team also updates their digital services by adding infrastructure capacity, making configuration changes, applying fixes, and updating alerts as necessary. This is prioritised alongside feature development, as it is the product team themselves who are on-call.

Availability restoration in You Build It You Run It

An on-call product team is responsible for availability restoration. This means reactively responding to production alerts in and out of working hours, including evenings, weekends, and bank holidays. All service alerts are expected to be resolved by the on-call product team, and incident response is consistently prioritised over feature development.
When an availability target is breached, a developer on the on-call product team receives an automated alert via an incident response platform such as PagerDuty or VictorOps. The responder acknowledges the alert, and the incident response platform automatically creates a ticket in a ticketing system such as ServiceNow. The responder classifies and prioritises the incident, and adds more team members to the incident as required.
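The flow described above (alert, acknowledgement, automatic ticket creation, classification, adding responders) can be sketched as a small state model. This is an illustrative sketch only: the `Incident` and `Priority` types, the function names, the responder names, and the ticket ID format are all hypothetical, and none of it reflects the PagerDuty, VictorOps, or ServiceNow APIs.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Iterable, List, Optional


class Priority(Enum):
    P1 = 1  # highest priority
    P2 = 2
    P3 = 3


@dataclass
class Incident:
    """An alert raised for a service (hypothetical model, not a vendor schema)."""
    service: str
    acknowledged_by: Optional[str] = None
    ticket_id: Optional[str] = None
    priority: Optional[Priority] = None
    responders: List[str] = field(default_factory=list)


def acknowledge(incident: Incident, responder: str, ticket_number: int) -> Incident:
    """Record the acknowledgement and simulate the automatic ticket creation."""
    incident.acknowledged_by = responder
    incident.responders.append(responder)
    incident.ticket_id = f"INC{ticket_number:07d}"  # made-up ID format
    return incident


def classify(incident: Incident, priority: Priority,
             extra_responders: Iterable[str] = ()) -> Incident:
    """Set the priority and add more team members as required."""
    if incident.acknowledged_by is None:
        raise ValueError("acknowledge the alert before classifying the incident")
    incident.priority = priority
    for name in extra_responders:
        if name not in incident.responders:
            incident.responders.append(name)
    return incident


incident = Incident(service="checkout")
acknowledge(incident, responder="asha", ticket_number=42)
classify(incident, Priority.P1, extra_responders=["ben", "chen"])
print(incident.ticket_id)   # INC0000042
print(incident.responders)  # ['asha', 'ben', 'chen']
```

Requiring acknowledgement before classification mirrors the order of steps in the text; a real incident response platform may enforce a different sequence.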
Responders observe their real-time service telemetry data, to understand the drift from normal to abnormal operating conditions. They diagnose the incident via their heuristics, innate service knowledge, and telemetry data. They attempt to restore availability via additional infrastructure, code changes, data fixes, configuration changes, rollbacks, and telemetry changes.
For a high priority incident, the entire team may swarm on incident response, to minimise service unavailability. One team member acts as incident commander, to coordinate with other product teams involved in incident response, and to manage communications with senior stakeholders. Alternatively, You Build It You Run It is 100% compatible with ITIL v3, and an incident manager could be invited by the team to act as incident commander for the duration of an incident.

Service reliability costs in You Build It You Run It

Cost Type: Setup cost
Frequency: One-off
Description: Launch costs incurred in
  • License purchases
  • Product team time for telemetry install
  • Product team time for on-call schedule agreement
  • Product team time to set up live access
  • Product team time for any operational training necessary
Impact: Capex cost
TCO %: Medium

Cost Type: Transition cost
Frequency: One-off
Description: Launch costs incurred in
  • Product team time for runbooks
Impact: Capex cost
TCO %: Low

Cost Type: Run cost
Frequency: Ongoing
Description: Regular costs incurred in product team time for
  • Deploying code changes
  • Applying data fixes
  • Adding infrastructure capacity
  • Monitoring operating conditions
  • Performing rollbacks
  • Updating telemetry tools
  • Doing on-call standby out of hours
Impact: Capex cost
TCO %: Medium

Cost Type: Incident cost
Frequency: Per incident
Description: Incident response costs incurred in product team time for
  • Investigating and diagnosing problems
  • Identifying and agreeing on solutions, including code changes, configuration updates, and adding infrastructure capacity
  • On-call callout out of hours
Impact: Capex cost
TCO %: Low to medium

Cost Type: Opportunity cost
Frequency: Per incident
Description: Can be measured as the cost of delay between incident start and incident finish. Caused by service unavailability, missed opportunities with customers, and delays in further feature development.
Impact: Revenue loss and costs incurred
TCO %: Low
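
The opportunity cost described above, measured as the cost of delay between incident start and incident finish, can be sketched as a simple calculation. This sketch assumes a constant cost per hour of unavailability, which is a simplification; the `cost_of_delay` function, the timestamps, and the hourly figure are all hypothetical.

```python
from datetime import datetime


def cost_of_delay(started_at: datetime, finished_at: datetime,
                  cost_per_hour: float) -> float:
    """Estimate opportunity cost as incident duration times an assumed hourly cost."""
    if finished_at < started_at:
        raise ValueError("incident cannot finish before it starts")
    hours = (finished_at - started_at).total_seconds() / 3600
    return hours * cost_per_hour


# Made-up incident: 90 minutes of unavailability at an assumed 2000 per hour.
start = datetime(2024, 3, 1, 22, 15)
finish = datetime(2024, 3, 1, 23, 45)
print(cost_of_delay(start, finish, cost_per_hour=2000.0))  # 3000.0
```

In practice the hourly cost is unlikely to be constant, since lost revenue and delayed feature work vary by service and time of day, so a team would substitute its own estimate.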