SLOs, SLIs, and Error Budgets
Status: Stub Category: Observability Default enforcement: Advisory Author: PushBackLog team
Tags
- Topic: observability, reliability, sre
- Skillset: devops, platform
- Technology: generic
- Stage: operations, architecture
Summary
Service Level Indicators (SLIs) are quantitative measures of service behaviour. Service Level Objectives (SLOs) are target thresholds for those measures. Error budgets are the permitted amount of unreliability before corrective action is required. Together, SLIs, SLOs, and error budgets provide a data-driven framework for making explicit trade-offs between reliability, feature velocity, and operational investment.
Rationale
Without defined reliability targets, every outage is equally critical and every deployment is equally risky. Teams without SLOs either over-react to minor degradations (treating everything as an emergency) or under-react to genuine reliability problems (because there is no agreed standard to compare against).
SLOs externalise reliability expectations as shared agreements. They enable rational decision-making: when the error budget is healthy, the team can ship features; when the error budget is depleted, reliability work takes priority. This makes reliability a technical and business conversation rather than a purely reactive operational one.
Guidance
Choosing SLIs
An SLI is a carefully defined quantitative measure of the property you care about. Good SLIs:
- Are measurable from data you already collect (or can collect cheaply)
- Reflect the user experience — not internal implementation details
- Cover the dimensions users notice — availability, latency, throughput, and error rate — which overlap with, but are not identical to, the four golden signals (latency, traffic, errors, and saturation)
Common SLIs:
| Service type | Typical SLIs |
|---|---|
| HTTP API | Request success rate, p95 latency, p99 latency |
| Async worker | Job completion rate, processing lag |
| Batch job | Completion rate, duration vs. target |
| Data pipeline | Data freshness, record error rate |
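As a concrete illustration, the request-success and latency SLIs for an HTTP API can be computed directly from request records. This is a minimal sketch; the `Request` record, its field names, and the nearest-rank percentile method are illustrative assumptions, not a prescribed schema:

```python
from dataclasses import dataclass

# Hypothetical request record; the fields are illustrative assumptions.
@dataclass
class Request:
    status: int        # HTTP status code
    latency_ms: float  # observed latency in milliseconds

def success_rate_sli(requests: list[Request]) -> float:
    """Fraction of requests that returned HTTP 2xx."""
    if not requests:
        return 1.0
    good = sum(1 for r in requests if 200 <= r.status < 300)
    return good / len(requests)

def latency_percentile(requests: list[Request], p: float) -> float:
    """p-th percentile latency, using the nearest-rank method."""
    latencies = sorted(r.latency_ms for r in requests)
    idx = max(0, round(p / 100 * len(latencies)) - 1)
    return latencies[idx]

reqs = [Request(200, 120), Request(200, 240), Request(500, 80), Request(200, 310)]
print(success_rate_sli(reqs))        # 0.75
print(latency_percentile(reqs, 95))  # 310
```

In production these values would come from a metrics system rather than in-memory lists, but the definitions — what counts as "good", which percentile is tracked — are the part that must be pinned down explicitly.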
Setting SLOs
An SLO is a target percentile or rate for an SLI over a rolling window. SLOs should:
- Be set based on actual user impact, not aspirationally
- Begin conservatively — it is easier to tighten an SLO than to explain why you missed one
- Be agreed between engineering and the business/product teams the service supports
- Reflect a rolling window (28-day rolling is common) rather than a calendar month
Example SLO definition:
“99.5% of requests to the /api/tasks endpoint return a successful response (HTTP 2xx) within 300ms over a rolling 28-day window.”
An SLO is not an SLA. An SLA (Service Level Agreement) is an external commitment with financial consequences. An SLO is an internal operational target. SLOs should be slightly more ambitious than SLAs to preserve headroom.
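An SLO of this shape can be checked mechanically: an event is "good" only if it is both a 2xx and within the latency threshold. A minimal sketch, assuming events arrive as `(http_status, latency_ms)` pairs (the tuple layout is an assumption, not a prescribed schema):

```python
def slo_attainment(events: list[tuple[int, float]],
                   latency_slo_ms: float = 300.0) -> float:
    """Fraction of events that are 'good': HTTP 2xx AND within the
    latency threshold. Note that a slow 200 counts against the SLO."""
    if not events:
        return 1.0
    good = sum(1 for status, latency in events
               if 200 <= status < 300 and latency <= latency_slo_ms)
    return good / len(events)

events = [(200, 120), (200, 350), (503, 90), (200, 250)]
print(slo_attainment(events))           # 0.5: the slow 200 and the 503 are both bad
print(slo_attainment(events) >= 0.995)  # False: this window misses the SLO
```

The key design choice is that latency and success are folded into a single good/bad classification per event, which keeps the error-budget arithmetic in the next section simple.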
Error budgets
The error budget is the complement of the SLO: an SLO of 99.5% implies a 0.5% error budget — the permitted proportion of “bad” events before the SLO is breached.
Error budgets answer: “how much unreliability do we have remaining this period?”
When error budget is healthy (> 50% remaining):
- Feature development can proceed at normal pace
- Risky deployments (large changes, schema migrations) are acceptable
When error budget is tight (< 25% remaining):
- Reliability improvements are prioritised in the backlog
- Risky deployments are deferred or require additional validation
When error budget is exhausted:
- A feature freeze on risky changes is triggered
- Reliability work takes priority until the budget recovers
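The policy bands above can be expressed as a small function over the remaining budget. This is a sketch: the function names are illustrative, and folding the unspecified 25–50% band into "tight" is an assumption, since the policy text only defines the >50%, <25%, and exhausted cases:

```python
def error_budget_remaining(slo_target: float, attainment: float) -> float:
    """Remaining budget as a fraction of the window's total budget.
    A 99.5% SLO gives a 0.5% budget; if 0.35% of events were bad,
    70% of the budget is spent and 30% remains."""
    budget = 1.0 - slo_target
    burned = 1.0 - attainment
    return max(0.0, 1.0 - burned / budget)

def policy_action(remaining: float) -> str:
    # Assumption: the 25-50% band is treated as "tight" here.
    if remaining > 0.50:
        return "healthy: feature work proceeds at normal pace"
    if remaining > 0.0:
        return "tight: prioritise reliability, defer risky deploys"
    return "exhausted: freeze risky changes until the budget recovers"

remaining = error_budget_remaining(0.995, 0.9965)
print(round(remaining, 3), "->", policy_action(remaining))
```

The point of encoding the policy, even this crudely, is that the response to budget depletion is agreed in advance rather than negotiated mid-incident.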
Burn rate alerts
Rather than alerting when an SLO is breached (too late), alert on burn rate: how fast the error budget is being consumed relative to the expected pace.
| Burn rate | Meaning | Alert action |
|---|---|---|
| 1× | Budget consumed at exactly the expected rate | No alert |
| 6× | Budget will be exhausted in 4–5 days | PagerDuty / ticket |
| 14.4× | Budget will be exhausted in under 2 days | Wake someone up |
Multi-window burn rate alerts (compare a short window against a longer window) reduce false positives from brief spikes.
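The burn-rate arithmetic and the multi-window check can be sketched in a few lines. The function names and the choice of window pair are illustrative assumptions:

```python
def burn_rate(bad_fraction: float, slo_target: float) -> float:
    """Budget consumption rate relative to the sustainable pace.
    At 1.0, the budget lasts exactly the full window."""
    return bad_fraction / (1.0 - slo_target)

def days_to_exhaustion(rate: float, window_days: int = 28) -> float:
    """How long a full budget survives at a constant burn rate."""
    return window_days / rate

def should_page(short_bad: float, long_bad: float,
                slo_target: float, threshold: float = 14.4) -> bool:
    """Multi-window rule: page only when BOTH windows (e.g. 5 min
    and 1 h) exceed the threshold, so a brief spike that has already
    subsided does not wake anyone."""
    return (burn_rate(short_bad, slo_target) >= threshold and
            burn_rate(long_bad, slo_target) >= threshold)

# 3% bad requests against a 99.5% SLO is a 6x burn:
rate = burn_rate(0.03, 0.995)
print(round(rate, 1), round(days_to_exhaustion(rate), 1))  # 6.0 4.7
```

Monitoring systems with SLO support compute these rates from time-series data directly; the sketch only shows where the table's numbers come from.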
Common failure modes
| Failure | Description |
|---|---|
| Aspirational SLOs | SLOs set to 99.99% “because that sounds good”; never achievable; error budget always depleted |
| Measuring the wrong thing | SLIs track internal metrics (CPU, memory) rather than user-visible outcomes |
| SLO as a pass/fail grade | Teams optimise for SLO compliance rather than actual reliability |
| No error budget policy | Error budgets calculated but no agreed process for what to do when depleted |
| Alerting on SLO breach | Alerts fire only after the SLO is already breached, providing no time to react |