
Alerting Principles

Status: Complete
Category: Observability
Default enforcement: Advisory
Author: PushBackLog team


Tags

  • Topic: observability, quality
  • Skillset: backend, devops
  • Technology: generic
  • Stage: deployment

Summary

Good alerts notify the right people about conditions that require human action, at the right time, with enough context to act immediately. Alert fatigue — the result of too many low-quality, noisy, or unactionable alerts — is one of the leading causes of missed real incidents.


Rationale

Alert fatigue is a safety and reliability problem

Research from healthcare ICUs and SRE practice converges on the same finding: when alerts fire frequently and are frequently false positives or unactionable, on-call engineers begin to ignore them. The alert that actually matters gets missed in the noise. Worse, teams silence alerts that have cried wolf too many times, removing visibility entirely.

SRE practitioners (particularly Google’s SRE book) frame this as: every alert that pages a human should be actionable. If there is no action to take, the alert should not page.

Alerting on symptoms, not causes

The most common alerting anti-pattern is alerting on causes (CPU > 80%, error rate > 0, memory > 70%) rather than symptoms (users can’t complete checkout, p99 latency exceeded SLO). Cause-based alerts are noisy (CPU spikes don’t always affect users) and miss novel failure modes. Symptom-based alerts — rooted in Service Level Objectives — alert on what users actually experience.

The SLO-based alerting model

A Service Level Objective (SLO) defines the expected reliability of a service (e.g. “checkout endpoint succeeds for 99.5% of requests over a 30-day rolling window”). Alerts fire when the error budget is burning faster than it can be replenished:

  • Fast burn alert: the current error rate would exhaust the 30-day budget in under 1 hour → page immediately
  • Slow burn alert: the current error rate would exhaust the budget in under 3 days → ticket or non-urgent notification

This model reduces noise (a brief spike consumes only a sliver of the budget) while ensuring sustained degradation is caught early.
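In Prometheus terms, the fast/slow split is usually implemented as multiwindow burn-rate rules. The sketch below assumes the same `http_requests_total` metric and 99.5% SLO (error budget 0.005) used elsewhere in this document; the 14.4 and 6 multipliers are the conventional burn-rate factors from SRE practice, not values from this document:

```yaml
# Fast burn: an error rate high enough to spend the 30-day budget in hours.
# Both a short (5m) and a long (1h) window must agree, so a brief blip
# does not page anyone.
- alert: CheckoutErrorBudgetFastBurn
  expr: |
    (
      1 - sum(rate(http_requests_total{path="/checkout",status=~"2.."}[5m]))
        / sum(rate(http_requests_total{path="/checkout"}[5m]))
    ) > (14.4 * 0.005)
    and
    (
      1 - sum(rate(http_requests_total{path="/checkout",status=~"2.."}[1h]))
        / sum(rate(http_requests_total{path="/checkout"}[1h]))
    ) > (14.4 * 0.005)
  labels:
    severity: page

# Slow burn: a lower rate sustained over days -> ticket, not a page.
- alert: CheckoutErrorBudgetSlowBurn
  expr: |
    (
      1 - sum(rate(http_requests_total{path="/checkout",status=~"2.."}[6h]))
        / sum(rate(http_requests_total{path="/checkout"}[6h]))
    ) > (6 * 0.005)
  labels:
    severity: ticket
```

Requiring both windows to breach the threshold is what makes the fast-burn rule page only on real, ongoing degradation rather than a single bad scrape interval.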


Guidance

Alert quality checklist

Before creating or keeping an alert:

| Question | Action if "no" |
| --- | --- |
| Does this alert indicate a user-impacting condition? | Delete or downgrade to informational |
| Is there a specific action the on-call engineer should take? | Delete, or fix the runbook |
| Does the alert include a runbook link? | Add one before enabling |
| Is the alert protected against transient spikes? | Add a minimum duration or burn-rate threshold |
| Has every recent firing led to action (rather than firing weekly with none)? | Tune or delete |
| Is anyone actually reading alerts from this channel? | Reassign, or route to an incident management tool |

Alert routing and urgency

| Urgency | Criteria | Channel |
| --- | --- | --- |
| Page (wake-up) | User-impacting, requires immediate action in < 5 min | PagerDuty/OpsGenie with escalation |
| Urgent notification | Degradation detected, action required within hours | Slack #incidents or ticketing |
| Informational | Trend worth monitoring, no immediate action | Dashboard / daily digest |
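With Prometheus Alertmanager, this routing table maps naturally onto a severity-label route tree. The sketch below is illustrative: the receiver names, integration key placeholder, webhook URL, and `severity` label values are assumptions, not part of this document:

```yaml
route:
  receiver: daily-digest            # default: informational alerts
  routes:
    - matchers: ['severity="page"']
      receiver: pagerduty-oncall    # escalation policy lives in PagerDuty
    - matchers: ['severity="urgent"']
      receiver: slack-incidents

receivers:
  - name: pagerduty-oncall
    pagerduty_configs:
      - routing_key: '<integration-key>'
  - name: slack-incidents
    slack_configs:
      - channel: '#incidents'
  - name: daily-digest
    webhook_configs:
      - url: 'https://dashboards.internal/digest-hook'
```

Keeping the default receiver at the lowest urgency means a mislabeled alert degrades to a digest rather than waking someone up.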

Runbook structure

Every paging alert should link directly to a runbook with:

  1. What this alert means (plain English, one sentence)
  2. Immediate triage steps (numbered, actionable)
  3. Common causes and fixes
  4. Escalation path (who to page next if unresolved at 15 / 30 / 60 minutes)
  5. Link to relevant dashboards and runbook history

Examples

Symptom-based SLO alert (Prometheus / Alertmanager)

```yaml
# Alert when checkout success rate falls below SLO
- alert: CheckoutSuccessRateBelowSLO
  expr: |
    sum(rate(http_requests_total{path="/checkout",status=~"2.."}[5m]))
    /
    sum(rate(http_requests_total{path="/checkout"}[5m])) < 0.995
  for: 2m      # Must persist for 2 minutes to avoid transient spikes
  labels:
    severity: critical
  annotations:
    summary: "Checkout success rate below 99.5% SLO"
    runbook: "https://runbooks.internal/checkout-error-rate"
    dashboard: "https://grafana.internal/d/checkout"
```

Cause-based alert (avoid this pattern)

```yaml
# Avoid: CPU spike doesn't mean users are affected
- alert: HighCPU
  expr: node_cpu_usage > 80
  # No runbook, no user-impact framing, fires constantly
```

Anti-patterns

1. Alerting on every error regardless of severity

A single 500 response is not an incident. Rate-based and budget-based alerting distinguishes transient from sustained failure. Alert on percentages and burn rates, not raw counts.
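The difference can be sketched as a rule pair (illustrative PromQL, reusing the metric from the examples above; the 5% threshold and 10-minute duration are assumptions):

```yaml
# Noisy: any single 5xx in the window fires the alert
- alert: AnyServerError
  expr: increase(http_requests_total{status=~"5.."}[5m]) > 0

# Better: a sustained error *ratio*, held for a minimum duration
- alert: SustainedHighErrorRatio
  expr: |
    sum(rate(http_requests_total{status=~"5.."}[5m]))
    /
    sum(rate(http_requests_total[5m])) > 0.05
  for: 10m
  labels:
    severity: urgent
```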

2. Alerts with no runbook

An engineer paged at 2am with no context on what to do is more likely to acknowledge and ignore the alert than to resolve the incident. Every paging alert must link to an actionable runbook.

3. Alerts sent to a channel nobody reads

Alerts routed to a high-volume Slack channel or an email inbox become invisible through volume. Use a dedicated incident management tool (PagerDuty, OpsGenie) for pages with escalation policies.

4. Alert silencing instead of fixing

Silencing a noisy alert during an incident is appropriate; leaving that silence in place indefinitely creates systemic blindness. Track silenced alerts, and either tune them or fix the underlying condition within a sprint.

5. No on-call escalation policy

If an alert pages a single engineer with no escalation path, one missed page at 3am means no response at all. Every paging alert needs an escalation chain (primary on-call → secondary → manager) with a defined delay between levels.



Part of the PushBackLog Best Practices Library.