# On-Call Best Practices
Status: Complete
Category: Observability
Default enforcement: Soft
Author: PushBackLog team
## Tags
- Topic: observability, operations, reliability
- Skillset: devops, engineering, engineering-management
- Technology: PagerDuty, OpsGenie
- Stage: operations
## Summary
On-call is a structured rotation in which engineers are available outside normal working hours to respond to production incidents. Done well, on-call is a manageable burden that keeps engineers accountable for the reliability of the systems they build. Done poorly — with too many pages, poor tooling, and no support — on-call destroys morale, causes burnout, and increases attrition. The engineering leader’s job is to make on-call sustainable: configure alerting to fire only on real, actionable problems, ensure responders have runbooks and context, and treat every page as a signal about system quality.
## Rationale
### Alerting quality directly determines on-call quality
The most common on-call failure mode is alert fatigue: too many pages, most of them non-actionable or intermittent, to the point where responders begin ignoring alerts or treating all pages as noise. A single false-positive alert at 3 AM erodes trust in the alerting system. A pattern of false positives trains responders to assume alerts are false — at which point a real incident is missed.
Every alert should be actionable: a human being woken at 3 AM should have exactly one question — “what do I do?” — and the answer should be documented.
### Engineers who own what they build make better systems
“You build it, you run it” is the principle that engineers on-call for their own services have a direct incentive to make those services reliable and observable. When operations is a separate team that gets paged for things engineers built, the feedback loop is broken — engineers don’t experience the consequences of their reliability decisions. Teams that are responsible for their own on-call tend to invest in better alerting, more runbooks, and more reliable systems.
## Guidance
### On-call rotation structure
- Primary: Engineer A (first responder, 24/7 availability during rotation)
- Secondary: Engineer B (backup if the primary doesn't acknowledge within 15 minutes)
- Manager: Sandra W (escalation point for major incidents; provides context and authority)
- Rotation cycle: 1 week
- Handoff: Mondays at 09:00 local time
A one-week rotation is standard. Two-week rotations cause more fatigue; one-day rotations carry too much handoff overhead.
Team size minimum for sustainable on-call:
- With 1-week rotations, 4+ engineers means each person is on-call at most one week in four
- Fewer than 4 engineers on a rotation creates chronic fatigue; merge rotations with a neighbouring team or prioritise hiring before running a standalone rotation
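The rotation structure above can be sketched as a small scheduler. This is a minimal illustration (the roster names and epoch date are hypothetical) that computes primary and secondary responders for any date, counting handoffs in whole weeks from a Monday epoch:

```python
from datetime import date

# Hypothetical roster of 4+ engineers (the sustainable minimum above).
ENGINEERS = ["alice", "bob", "carol", "dave"]
EPOCH = date(2024, 1, 1)  # a Monday; handoffs happen on Mondays

def rotation_week(day: date) -> int:
    """Whole weeks elapsed since the rotation epoch."""
    return (day - EPOCH).days // 7

def on_call(day: date) -> tuple[str, str]:
    """Return (primary, secondary) for a date.

    The secondary is the next engineer in the cycle, so each engineer
    serves a secondary week immediately before their primary week.
    Time of day (the 09:00 handoff) is not modelled here.
    """
    week = rotation_week(day)
    primary = ENGINEERS[week % len(ENGINEERS)]
    secondary = ENGINEERS[(week + 1) % len(ENGINEERS)]
    return primary, secondary

print(on_call(date(2024, 1, 3)))   # week 0 -> ('alice', 'bob')
print(on_call(date(2024, 1, 10)))  # week 1 -> ('bob', 'carol')
```

With four engineers, each person is primary once every four weeks, matching the team-size guidance above.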
### Alert quality standards
Alerts should meet the following criteria before being enabled in production:
| Criterion | Description |
|---|---|
| Actionable | The responder must be able to do something in response |
| Accurate | The alert triggers for real problems, not transient noise |
| Documented | A runbook exists that explains how to respond |
| Appropriate severity | P1/page-now vs P2/business-hours distinction is correct |
| Customer-impacting | The alert represents genuine user impact, not internal implementation detail |
An alert that is consistently false-positive, consistently resolved without action, or consistently safe to ignore until morning should be:
- Converted to a non-paging ticket; or
- Fixed so it only fires when it should; or
- Deleted
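The triage rule above can be expressed as a small function over an alert's recent history. The thresholds here are illustrative assumptions, not values from any alerting product:

```python
def triage_alert(fired: int, actionable: int, resolved_without_action: int) -> str:
    """Decide what to do with a paging alert based on its recent history.

    fired: times the alert paged in the review window
    actionable: pages where the responder actually had to intervene
    resolved_without_action: pages that auto-resolved or were safe to ignore

    Thresholds are illustrative; tune them per team.
    """
    if fired == 0:
        return "delete"  # never fires: dead weight in the alert catalogue
    if actionable / fired >= 0.5:
        return "keep paging"
    if resolved_without_action / fired >= 0.8:
        return "convert to non-paging ticket"
    return "fix trigger condition"

print(triage_alert(fired=10, actionable=8, resolved_without_action=1))
print(triage_alert(fired=10, actionable=1, resolved_without_action=9))
```

Running this over a quarter of paging history gives a concrete worklist for the quarterly threshold review.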
### Severity levels
| Level | Criteria | Response time | Communication |
|---|---|---|---|
| P1 - Critical | Service down, data loss, major revenue impact | Immediate (< 5 min) | Status page update within 15 min |
| P2 - High | Degraded service, significant user impact | 30 minutes | Status page update |
| P3 - Medium | Minor degradation, workaround exists | Business hours | Ticket |
| P4 - Low | No current user impact | Scheduled | Backlog |
Only P1/P2 should page on-call out of hours.
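The table's paging and acknowledgement rules can be encoded directly. A minimal sketch (the SLA mapping mirrors the table above; the function names are illustrative):

```python
# Acknowledgement SLAs from the severity table, in minutes.
# None means the incident is ticketed rather than paged.
ACK_SLA_MINUTES = {"P1": 5, "P2": 30, "P3": None, "P4": None}

def pages_out_of_hours(severity: str) -> bool:
    """Only P1 and P2 wake the on-call outside business hours."""
    return severity in ("P1", "P2")

def ack_within_sla(severity: str, ack_minutes: float) -> bool:
    """True if the responder acknowledged within the table's SLA.

    P3/P4 incidents are handled as tickets or backlog items, so any
    acknowledgement time is acceptable for them.
    """
    sla = ACK_SLA_MINUTES.get(severity)
    return sla is None or ack_minutes <= sla

print(pages_out_of_hours("P2"))   # True
print(ack_within_sla("P1", 4))    # True: within the 5-minute SLA
print(ack_within_sla("P1", 12))   # False: P1 needs immediate response
```

Encoding the policy in one place keeps routing configuration and the documented table from drifting apart.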
### Runbook structure
Every production alert must have a linked runbook:
# [Alert Name] Runbook
## Trigger condition
High error rate on POST /api/payments — 5xx rate > 1% for 5 minutes
## User impact
Users unable to complete purchases. Revenue impact: ~$X/minute.
## First steps (under 2 minutes)
1. Check dashboard: [link]
2. Check recent deployments: [link]
3. Check downstream dependency status: [link]
## Common causes and resolution
### Stripe API timeout
**Symptom**: High share of `PaymentGatewayTimeoutError` in logs
**Resolution**: Check Stripe status page. If degraded, enable fallback to secondary processor:
```bash
feature-flags set stripe-primary-enabled false
```

### Database connection exhaustion
**Symptom**: `PoolExhaustedError` in logs, DB pool metric at max
**Resolution**: Investigate long-running queries:
```sql
SELECT pid, now() - query_start AS duration, query
FROM pg_stat_activity
WHERE state = 'active';
```

## Escalation
- If unresolved after 30 minutes, escalate to: [secondary on-call]
- Payment team: [contact]
### Reducing toil
Toil is repetitive, manual operational work that does not produce lasting value. Measuring and reducing toil is a core SRE practice:
- **Automated alert response**: for alerts with a known, safe automated fix (e.g., cache flush, pod restart), use runbook automation (PagerDuty Runbook Automation, AWS Systems Manager Automation) to perform the initial response automatically
- **Track pages per rotation**: high page rates per engineer per week (> 5 pages per shift) are a signal of systemic alerting or reliability issues — treat it as a bug
- **Post-incident toil tracking**: after each incident, note manual steps that could be automated in the next sprint
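Tracking pages per rotation is simple to automate. A sketch that flags shifts over the 5-page threshold (the page-log format and shift identifiers are hypothetical):

```python
from collections import Counter

# Hypothetical page log: one (shift_id, alert_name) entry per page delivered.
PAGE_LOG = [
    ("2024-W01", "payments-5xx"), ("2024-W01", "db-pool"),
    ("2024-W02", "payments-5xx"), ("2024-W02", "payments-5xx"),
    ("2024-W02", "db-pool"), ("2024-W02", "cache-miss"),
    ("2024-W02", "cache-miss"), ("2024-W02", "payments-5xx"),
]

def noisy_shifts(log, threshold: int = 5):
    """Return shifts whose page count exceeds the threshold.

    Per the guidance above, more than 5 pages per shift is treated
    as a reliability bug, not as normal on-call load.
    """
    counts = Counter(shift for shift, _ in log)
    return {shift: n for shift, n in counts.items() if n > threshold}

print(noisy_shifts(PAGE_LOG))  # 2024-W02 had 6 pages
```

Feeding this report into sprint planning turns "on-call felt rough" into a measurable trigger for a reliability review.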
### Post-incident process
After every P1/P2 incident:
1. **Timeline document**: what happened, chronologically
2. **5 Whys** or equivalent root cause analysis
3. **Action items**: concrete tasks to prevent recurrence
4. **Blameless review**: share findings with the team, not attributed to individuals
Track open action items from incidents as first-class work — deprioritising them is a silent acceptance of future recurrence.
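Treating action items as first-class work can be enforced mechanically. A sketch that flags incident action items left open past an age limit (the fields and the 30-day default are illustrative assumptions):

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class ActionItem:
    incident: str
    description: str
    opened: date
    done: bool = False

def overdue(items, today: date, max_age_days: int = 30):
    """Action items still open past the age limit.

    Surfacing these in planning makes silent deprioritisation
    visible; the 30-day limit is an illustrative default.
    """
    return [i for i in items
            if not i.done and (today - i.opened).days > max_age_days]

items = [
    ActionItem("INC-101", "add retry to payment client", date(2024, 1, 5)),
    ActionItem("INC-101", "alert on pool saturation", date(2024, 1, 5), done=True),
    ActionItem("INC-102", "automate cache flush", date(2024, 2, 20)),
]
print([i.description for i in overdue(items, today=date(2024, 3, 1))])
```

Running a report like this each sprint keeps recurrence risk visible long after the incident review ends.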
### Review checklist
- [ ] Every production alert has a documented runbook
- [ ] On-call rotation has at least one backup (secondary)
- [ ] Alert threshold review conducted quarterly — false-positive alerts are fixed or removed
- [ ] Pages per shift are tracked; > 5 actionable pages/shift triggers a reliability review
- [ ] Engineers have time during business hours to work on reliability improvements from incidents
- [ ] Manager escalation path exists for major incidents