Alerting Principles
Status: Complete
Category: Observability
Default enforcement: Advisory
Author: PushBackLog team
Tags
- Topic: observability, quality
- Skillset: backend, devops
- Technology: generic
- Stage: deployment
Summary
Good alerts notify the right people about conditions that require human action, at the right time, with enough context to act immediately. Alert fatigue — the result of too many low-quality, noisy, or unactionable alerts — is one of the leading causes of missed real incidents.
Rationale
Alert fatigue is a patient safety and reliability problem
Research from healthcare ICUs and SRE practice converges on the same finding: when alerts fire frequently and are frequently false positives or unactionable, on-call engineers begin to ignore them. The alert that actually matters gets missed in the noise. Worse, teams silence alerts that have cried wolf too many times, removing visibility entirely.
SRE practitioners (particularly Google’s SRE book) frame this as: every alert that pages a human should be actionable. If there is no action to take, the alert should not page.
Alerting on symptoms, not causes
The most common alerting anti-pattern is alerting on causes (CPU > 80%, error rate > 0, memory > 70%) rather than symptoms (users can’t complete checkout, p99 latency exceeded SLO). Cause-based alerts are noisy (CPU spikes don’t always affect users) and miss novel failure modes. Symptom-based alerts — rooted in Service Level Objectives — alert on what users actually experience.
The SLO-based alerting model
A Service Level Objective (SLO) defines the expected reliability of a service (e.g. “checkout endpoint succeeds for 99.5% of requests over a 30-day rolling window”). Alerts fire when the error budget is burning faster than it can be replenished:
- Fast burn alert: the error budget is being consumed at many times the sustainable rate (e.g. 14.4x, measured over a 1-hour window, which would exhaust the 30-day budget in about 2 days) → page immediately
- Slow burn alert: the budget is being consumed steadily at or above the sustainable rate, measured over a multi-day window (e.g. 3 days) → ticket or non-urgent notification
This model reduces noise (transient spikes consume only a negligible slice of the budget, so they don't trip a burn-rate threshold) while ensuring sustained degradation is caught early.
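The two tiers can be sketched as Prometheus alerting rules. This is a simplified single-window sketch, not the full multiwindow form from the Google SRE Workbook; the 14.4x/1h and 1x/3d thresholds follow that common pattern, the 0.005 factor is the error budget of a 99.5% SLO, and the `http_requests_total` metric with a `path` label is a hypothetical from this document's examples:

```yaml
groups:
  - name: checkout-slo-burn
    rules:
      # Fast burn: error ratio over 1h exceeds 14.4x the budget rate -> page
      - alert: CheckoutErrorBudgetFastBurn
        expr: |
          (
            1 - sum(rate(http_requests_total{path="/checkout",status=~"2.."}[1h]))
              / sum(rate(http_requests_total{path="/checkout"}[1h]))
          ) > (14.4 * 0.005)
        for: 2m
        labels:
          severity: critical   # routes to the paging channel
      # Slow burn: error ratio over 3d exceeds the sustainable budget rate -> ticket
      - alert: CheckoutErrorBudgetSlowBurn
        expr: |
          (
            1 - sum(rate(http_requests_total{path="/checkout",status=~"2.."}[3d]))
              / sum(rate(http_requests_total{path="/checkout"}[3d]))
          ) > (1.0 * 0.005)
        for: 1h
        labels:
          severity: warning    # routes to a ticket or Slack notification
```

Production setups usually pair each long window with a short one (e.g. 1h AND 5m) so the alert resolves quickly once the error rate recovers.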
Guidance
Alert quality checklist
Before creating or keeping an alert:
| Question | If No: action |
|---|---|
| Does this alert indicate a user-impacting condition? | Delete or downgrade to informational |
| Is there a specific action the on-call engineer should take? | Delete or fix the runbook |
| Does the alert include a runbook link? | Add one before enabling |
| Could this alert fire spuriously (transient spike)? | Add minimum duration or burn-rate threshold |
| Is this alert firing more than once a week without action? | Tune or delete |
| Is anyone actually reading alerts from this channel? | Reassign or route to incident management tool |
Alert routing and urgency
| Urgency | Criteria | Channel |
|---|---|---|
| Page (wake-up) | User-impacting, requires immediate action in < 5 mins | PagerDuty/OpsGenie with escalation |
| Urgent notification | Degradation detected, action required within hours | Slack #incidents or ticketing |
| Informational | Trend worth monitoring, no immediate action | Dashboard / daily digest |
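One way to wire the routing table above into Alertmanager, a sketch in which receiver names, integration keys, and URLs are all placeholders:

```yaml
route:
  receiver: daily-digest            # default: informational alerts
  group_by: ['alertname']
  routes:
    - matchers: ['severity="critical"']
      receiver: pagerduty           # page (wake-up) with escalation
    - matchers: ['severity="warning"']
      receiver: slack-incidents     # urgent but not page-worthy

receivers:
  - name: pagerduty
    pagerduty_configs:
      - routing_key: REPLACE_WITH_INTEGRATION_KEY   # placeholder
  - name: slack-incidents
    slack_configs:
      - channel: '#incidents'
        api_url: https://hooks.slack.com/services/REPLACE_ME   # placeholder
  - name: daily-digest
    webhook_configs:
      - url: https://digest.internal/hook           # hypothetical digest service
```

The severity label set on each alerting rule drives the routing, so urgency decisions live with the alert definition rather than in the notification tool.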
Runbook structure
Every paging alert should link directly to a runbook with:
- What this alert means (plain English, one sentence)
- Immediate triage steps (numbered, actionable)
- Common causes and fixes
- Escalation path (who to page next if unresolved at 15 / 30 / 60 minutes)
- Link to relevant dashboards and runbook history
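A minimal runbook skeleton following that structure; the alert name, commands, and links are illustrative, reusing the hypothetical checkout example from this document:

```markdown
# Runbook: CheckoutSuccessRateBelowSLO

**What it means:** Checkout requests are failing above the 0.5% error budget;
users may be unable to complete payment.

## Immediate triage
1. Open the checkout dashboard: https://grafana.internal/d/checkout
2. Check whether a deploy landed in the last hour; roll back if so.
3. Inspect error logs for the dominant status code or exception.

## Common causes and fixes
- Payment-provider outage → check the provider status page, enable fallback
- Bad deploy → roll back to the previous release

## Escalation
- 15 min unresolved: page secondary on-call
- 30 min: page payments team lead
- 60 min: page engineering manager

## Links
- Dashboard: https://grafana.internal/d/checkout
- Previous incidents involving this alert
```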
Examples
Symptom-based SLO alert (Prometheus / Alertmanager)
```yaml
# Alert when checkout success rate falls below SLO
- alert: CheckoutSuccessRateBelowSLO
  expr: |
    sum(rate(http_requests_total{path="/checkout",status=~"2.."}[5m]))
    /
    sum(rate(http_requests_total{path="/checkout"}[5m])) < 0.995
  for: 2m  # Must persist for 2 minutes to avoid transient spikes
  labels:
    severity: critical
  annotations:
    summary: "Checkout success rate below 99.5% SLO"
    runbook: "https://runbooks.internal/checkout-error-rate"
    dashboard: "https://grafana.internal/d/checkout"
```
Cause-based alert (avoid this pattern)
```yaml
# Avoid: a CPU spike doesn't mean users are affected
- alert: HighCPU
  expr: node_cpu_usage > 80
  # No runbook, no user-impact framing, fires constantly
```
Anti-patterns
1. Alerting on every error regardless of severity
A single 500 response is not an incident. Rate-based and budget-based alerting distinguishes transient from sustained failure. Alert on percentages and burn rates, not raw counts.
2. Alerts with no runbook
An engineer paged at 2am with no context on what to do is more likely to acknowledge and ignore the alert than to resolve the incident. Every paging alert must link to an actionable runbook.
3. Alerts sent to a channel nobody reads
Alerts routed to a high-volume Slack channel or an email inbox become invisible through volume. Use a dedicated incident management tool (PagerDuty, OpsGenie) for pages with escalation policies.
4. Alert silencing instead of fixing
Silencing a noisy alert during an incident is appropriate; leaving that silence in place indefinitely creates systemic blindness. Track silenced alerts and either tune them or fix the underlying condition within a sprint.
5. No on-call escalation policy
An alert that pages a single engineer with no escalation path fails silently: a page missed at 3am means no response at all. Every paging alert needs an escalation chain (primary on-call → secondary → manager) with a defined delay between levels.
Related practices
Part of the PushBackLog Best Practices Library.