
SLOs, SLIs, and Error Budgets

Status: Stub Category: Observability Default enforcement: Advisory Author: PushBackLog team


Tags

  • Topic: observability, reliability, sre
  • Skillset: devops, platform
  • Technology: generic
  • Stage: operations, architecture

Summary

Service Level Indicators (SLIs) are quantitative measures of service behaviour. Service Level Objectives (SLOs) are target thresholds for those measures. Error budgets are the permitted amount of unreliability before corrective action is required. Together, SLIs, SLOs, and error budgets provide a data-driven framework for making explicit trade-offs between reliability, feature velocity, and operational investment.


Rationale

Without defined reliability targets, every outage is equally critical and every deployment is equally risky. Teams without SLOs either over-react to minor degradations (treating everything as an emergency) or under-react to genuine reliability problems (because there is no agreed standard to compare against).

SLOs externalise reliability expectations as shared agreements. They enable rational decision-making: when the error budget is healthy, the team can ship features; when the error budget is depleted, reliability work takes priority. This makes reliability a technical and business conversation rather than a purely reactive operational one.


Guidance

Choosing SLIs

An SLI is a carefully defined quantitative measure of the property you care about. Good SLIs:

  • Are measurable from data you already collect (or can collect cheaply)
  • Reflect the user experience — not internal implementation details
  • Often map to one of the four golden signals: latency, traffic, errors, and saturation

Common SLIs:

Service typeTypical SLIs
HTTP APIRequest success rate, p95 latency, p99 latency
Async workerJob completion rate, processing lag
Batch jobCompletion rate, duration vs. target
Data pipelineData freshness, record error rate

Setting SLOs

An SLO is a target percentile or rate for an SLI over a rolling window. SLOs should:

  • Be set based on actual user impact, not aspirationally
  • Begin conservatively — it is easier to tighten an SLO than to explain why you missed one
  • Be agreed upon by both engineering and the business/product teams it serves
  • Reflect a rolling window (28-day rolling is common) rather than a calendar month

Example SLO definition:

“99.5% of requests to the /api/tasks endpoint return a successful response (HTTP 2xx) within 300ms over a rolling 28-day window.”

An SLO is not an SLA. An SLA (Service Level Agreement) is an external commitment with financial consequences. An SLO is an internal operational target. SLOs should be slightly more ambitious than SLAs to preserve headroom.
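The example SLO above reduces to a simple ratio check over counted events. A minimal sketch, assuming "good" means HTTP 2xx served within 300ms and that the counts already cover the rolling 28-day window:

```python
def slo_met(good_events: int, total_events: int, target: float = 0.995) -> bool:
    """True if the observed good-event ratio meets the SLO target.

    good_events: requests that were both 2xx AND under 300 ms (per the
    example SLO); total_events: all requests in the rolling window.
    """
    if total_events == 0:
        return True  # no traffic: treat the SLO as trivially met
    return good_events / total_events >= target

# 10,000,000 requests in the window, 45,000 of them bad:
print(slo_met(10_000_000 - 45_000, 10_000_000))  # True: 99.55% >= 99.5%
```

Note that the 300ms latency condition is folded into the definition of a "good" event rather than tracked as a separate objective — this keeps the SLO a single number that both engineering and product can reason about.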

Error budgets

The error budget is the complement of the SLO: an SLO of 99.5% implies a 0.5% error budget — the permitted proportion of “bad” events before the SLO is breached. With 10 million requests in the window, that budget permits 50,000 failed requests.

Error budgets answer: “how much unreliability do we have remaining this period?”

When error budget is healthy (> 50% remaining):

  • Feature development can proceed at normal pace
  • Risky deployments (large changes, schema migrations) are acceptable

When error budget is tight (< 25% remaining):

  • Reliability improvements are prioritised in the backlog
  • Risky deployments are deferred or require additional validation

When error budget is exhausted:

  • A feature freeze on risky changes is triggered
  • Reliability work takes priority until the budget recovers
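The thresholds above can be expressed as a small policy function. A sketch, noting that the 50%/25% cut-offs come from the policy above (they are a team choice, not a standard), and the behaviour in the 25–50% band is an assumption since the text does not specify it:

```python
def budget_remaining(slo: float, good: int, total: int) -> float:
    """Fraction of the error budget still unspent (negative if overspent)."""
    allowed_bad = (1 - slo) * total   # e.g. 0.5% of all events for a 99.5% SLO
    if allowed_bad == 0:
        return 0.0
    actual_bad = total - good
    return 1 - actual_bad / allowed_bad

def policy(remaining: float) -> str:
    """Map remaining budget fraction to the actions in the policy above."""
    if remaining <= 0:
        return "freeze risky changes; prioritise reliability work"
    if remaining < 0.25:
        return "defer risky deploys; pull reliability work forward"
    if remaining > 0.50:
        return "normal feature velocity; risky deploys acceptable"
    return "proceed with added scrutiny"  # 25-50% band: assumed middle ground

# 99.5% SLO, 10M requests, 30,000 bad: 60% of the budget spent, 40% left.
print(policy(budget_remaining(0.995, 9_970_000, 10_000_000)))
```

Encoding the policy as code makes the freeze decision mechanical rather than negotiable in the heat of an incident, which is the point of having an error budget policy at all.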

Burn rate alerts

Rather than alerting when an SLO is breached (too late), alert on burn rate: how fast the error budget is being consumed relative to the expected pace.

| Burn rate | Meaning | Alert action |
| --- | --- | --- |
| 1× | Budget consumed at exactly the expected rate | No alert |
| 6× | Budget will be exhausted in 4–5 days | PagerDuty / ticket |
| 14.4× | Budget will be exhausted in ~2 days | Wake someone up |

Multi-window burn rate alerts (compare a short window against a longer window) reduce false positives from brief spikes.
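A minimal sketch of that multi-window check, assuming a 99.5% SLO and the 14.4× page threshold from the table above (the 5-minute/1-hour window pairing is a common convention, not a requirement):

```python
def burn_rate(bad: int, total: int, slo: float) -> float:
    """How fast the budget burns relative to the sustainable rate.

    1.0 means the budget lasts exactly the SLO window; 14.4 means a
    28-day budget would be gone in about two days.
    """
    if total == 0:
        return 0.0
    return (bad / total) / (1 - slo)

def should_page(short_bad: int, short_total: int,
                long_bad: int, long_total: int,
                slo: float = 0.995, threshold: float = 14.4) -> bool:
    """Page only if BOTH windows exceed the threshold.

    The short window (e.g. 5 min) makes the alert stop quickly once the
    problem is fixed; the long window (e.g. 1 h) filters brief spikes.
    """
    return (burn_rate(short_bad, short_total, slo) >= threshold and
            burn_rate(long_bad, long_total, slo) >= threshold)

# A brief spike: the 5-minute window is hot, but the 1-hour window is not.
print(should_page(80, 1000, 120, 60000))  # False
```

Requiring both windows to agree is what suppresses the false positives: a 30-second blip can push the short window far past the threshold while barely moving the long one.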


Common failure modes

| Failure | Description |
| --- | --- |
| Aspirational SLOs | SLOs set to 99.99% “because that sounds good”; never achievable; error budget always depleted |
| Measuring the wrong thing | SLIs track internal metrics (CPU, memory) rather than user-visible outcomes |
| SLO as a pass/fail grade | Teams optimise for SLO compliance rather than actual reliability |
| No error budget policy | Error budgets calculated but no agreed process for what to do when depleted |
| Alerting on SLO breach | Alerts fire only after the SLO is already breached, providing no time to react |

Part of the PushBackLog Best Practices Library