SLOs, SLIs, and Error Budgets
Status: Stub Category: Observability Default enforcement: Advisory Author: PushBackLog team
Tags
- Topic: observability, reliability, sre
- Skillset: devops, platform
- Technology: generic
- Stage: operations, architecture
Summary
Service Level Indicators (SLIs) are quantitative measures of service behaviour. Service Level Objectives (SLOs) are target thresholds for those measures. Error budgets are the permitted amount of unreliability before corrective action is required. Together, SLIs, SLOs, and error budgets provide a data-driven framework for making explicit trade-offs between reliability, feature velocity, and operational investment.
Rationale
Without defined reliability targets, every outage is equally critical and every deployment is equally risky. Teams without SLOs either over-react to minor degradations (treating everything as an emergency) or under-react to genuine reliability problems (because there is no agreed standard to compare against).
SLOs externalise reliability expectations as shared agreements. They enable rational decision-making: when the error budget is healthy, the team can ship features; when the error budget is depleted, reliability work takes priority. This makes reliability a technical and business conversation rather than a purely reactive operational one.
Guidance
Choosing SLIs
An SLI is a carefully defined quantitative measure of the property you care about. Good SLIs:
- Are measurable from data you already collect (or can collect cheaply)
- Reflect the user experience — not internal implementation details
- Cover the dimensions users notice — availability, latency, throughput, and error rate — which overlap with, but are not identical to, the four golden signals (latency, traffic, errors, and saturation)
Common SLIs:
| Service type | Typical SLIs |
|---|---|
| HTTP API | Request success rate, p95 latency, p99 latency |
| Async worker | Job completion rate, processing lag |
| Batch job | Completion rate, duration vs. target |
| Data pipeline | Data freshness, record error rate |
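As a concrete illustration, the request-success and latency SLIs for an HTTP API can be computed directly from request records. This is a minimal sketch; the `Request` record, its field names, and the nearest-rank percentile method are illustrative assumptions, not a prescribed schema:

```python
from dataclasses import dataclass

# Hypothetical request record; the fields are illustrative assumptions.
@dataclass
class Request:
    status: int        # HTTP status code
    latency_ms: float  # observed latency in milliseconds

def success_rate_sli(requests: list[Request]) -> float:
    """Fraction of requests that returned HTTP 2xx."""
    if not requests:
        return 1.0
    good = sum(1 for r in requests if 200 <= r.status < 300)
    return good / len(requests)

def latency_percentile(requests: list[Request], p: float) -> float:
    """p-th percentile latency, using the nearest-rank method."""
    latencies = sorted(r.latency_ms for r in requests)
    idx = max(0, round(p / 100 * len(latencies)) - 1)
    return latencies[idx]

reqs = [Request(200, 120), Request(200, 240), Request(500, 80), Request(200, 310)]
print(success_rate_sli(reqs))        # 0.75
print(latency_percentile(reqs, 95))  # 310
```

In production these values would come from a metrics system rather than in-memory lists, but the definitions — what counts as "good", which percentile is tracked — are the part that must be pinned down explicitly.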
Setting SLOs
An SLO is a target percentile or rate for an SLI over a rolling window. SLOs should:
- Be set based on actual user impact, not aspirationally
- Begin conservatively — it is easier to tighten an SLO than to explain why you missed one
- Be agreed between engineering and the business/product teams the service supports
- Reflect a rolling window (28-day rolling is common) rather than a calendar month
Example SLO definition:
“99.5% of requests to the /api/tasks endpoint return a successful response (HTTP 2xx) within 300ms over a rolling 28-day window.”
An SLO is not an SLA. An SLA (Service Level Agreement) is an external commitment with financial consequences. An SLO is an internal operational target. SLOs should be slightly more ambitious than SLAs to preserve headroom.
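An SLO of this shape can be checked mechanically: an event is "good" only if it is both a 2xx and within the latency threshold. A minimal sketch, assuming events arrive as `(http_status, latency_ms)` pairs (the tuple layout is an assumption, not a prescribed schema):

```python
def slo_attainment(events: list[tuple[int, float]],
                   latency_slo_ms: float = 300.0) -> float:
    """Fraction of events that are 'good': HTTP 2xx AND within the
    latency threshold. Note that a slow 200 counts against the SLO."""
    if not events:
        return 1.0
    good = sum(1 for status, latency in events
               if 200 <= status < 300 and latency <= latency_slo_ms)
    return good / len(events)

events = [(200, 120), (200, 350), (503, 90), (200, 250)]
print(slo_attainment(events))           # 0.5: the slow 200 and the 503 are both bad
print(slo_attainment(events) >= 0.995)  # False: this window misses the SLO
```

The key design choice is that latency and success are folded into a single good/bad classification per event, which keeps the error-budget arithmetic in the next section simple.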
Error budgets
The error budget is the complement of the SLO: an SLO of 99.5% implies a 0.5% error budget — the permitted proportion of “bad” events before the SLO is breached.
Error budgets answer: “how much unreliability do we have remaining this period?”
When error budget is healthy (> 50% remaining):
- Feature development can proceed at normal pace
- Risky deployments (large changes, schema migrations) are acceptable
When error budget is tight (< 25% remaining):
- Reliability improvements are prioritised in the backlog
- Risky deployments are deferred or require additional validation
When error budget is exhausted:
- A feature freeze on risky changes is triggered
- Reliability work takes priority until the budget recovers
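The policy bands above can be expressed as a small function over the remaining budget. This is a sketch: the function names are illustrative, and folding the unspecified 25–50% band into "tight" is an assumption, since the policy text only defines the >50%, <25%, and exhausted cases:

```python
def error_budget_remaining(slo_target: float, attainment: float) -> float:
    """Remaining budget as a fraction of the window's total budget.
    A 99.5% SLO gives a 0.5% budget; if 0.35% of events were bad,
    70% of the budget is spent and 30% remains."""
    budget = 1.0 - slo_target
    burned = 1.0 - attainment
    return max(0.0, 1.0 - burned / budget)

def policy_action(remaining: float) -> str:
    # Assumption: the 25-50% band is treated as "tight" here.
    if remaining > 0.50:
        return "healthy: feature work proceeds at normal pace"
    if remaining > 0.0:
        return "tight: prioritise reliability, defer risky deploys"
    return "exhausted: freeze risky changes until the budget recovers"

remaining = error_budget_remaining(0.995, 0.9965)
print(round(remaining, 3), "->", policy_action(remaining))
```

The point of encoding the policy, even this crudely, is that the response to budget depletion is agreed in advance rather than negotiated mid-incident.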
Burn rate alerts
Rather than alerting when an SLO is breached (too late), alert on burn rate: how fast the error budget is being consumed relative to the expected pace.
| Burn rate | Meaning | Alert action |
|---|---|---|
| 1× | Budget consumed at exactly the expected rate | No alert |
| 6× | Budget will be exhausted in 4–5 days | PagerDuty / ticket |
| 14.4× | Budget will be exhausted in under 2 days | Wake someone up |
Multi-window burn rate alerts (compare a short window against a longer window) reduce false positives from brief spikes.
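The burn-rate arithmetic and the multi-window check can be sketched in a few lines. The function names and the choice of window pair are illustrative assumptions:

```python
def burn_rate(bad_fraction: float, slo_target: float) -> float:
    """Budget consumption rate relative to the sustainable pace.
    At 1.0, the budget lasts exactly the full window."""
    return bad_fraction / (1.0 - slo_target)

def days_to_exhaustion(rate: float, window_days: int = 28) -> float:
    """How long a full budget survives at a constant burn rate."""
    return window_days / rate

def should_page(short_bad: float, long_bad: float,
                slo_target: float, threshold: float = 14.4) -> bool:
    """Multi-window rule: page only when BOTH windows (e.g. 5 min
    and 1 h) exceed the threshold, so a brief spike that has already
    subsided does not wake anyone."""
    return (burn_rate(short_bad, slo_target) >= threshold and
            burn_rate(long_bad, slo_target) >= threshold)

# 3% bad requests against a 99.5% SLO is a 6x burn:
rate = burn_rate(0.03, 0.995)
print(round(rate, 1), round(days_to_exhaustion(rate), 1))  # 6.0 4.7
```

Monitoring systems with SLO support compute these rates from time-series data directly; the sketch only shows where the table's numbers come from.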
Common failure modes
| Failure | Description |
|---|---|
| Aspirational SLOs | SLOs set to 99.99% “because that sounds good”; never achievable; error budget always depleted |
| Measuring the wrong thing | SLIs track internal metrics (CPU, memory) rather than user-visible outcomes |
| SLO as a pass/fail grade | Teams optimise for SLO compliance rather than actual reliability |
| No error budget policy | Error budgets calculated but no agreed process for what to do when depleted |
| Alerting on SLO breach | Alerts fire only after the SLO is already breached, providing no time to react |