# On-Call Best Practices
Status: Complete
Category: Observability
Default enforcement: Soft
Author: PushBackLog team
## Tags
- Topic: observability, operations, reliability
- Skillset: devops, engineering, engineering-management
- Technology: PagerDuty, OpsGenie
- Stage: operations
## Summary
On-call is a structured rotation in which engineers are available outside normal working hours to respond to production incidents. Done well, on-call is a manageable burden that keeps engineers accountable for the reliability of the systems they build. Done poorly — with too many pages, poor tooling, and no support — on-call destroys morale, causes burnout, and increases attrition. The engineering leader’s job is to make on-call sustainable: configure alerting to fire only on real, actionable problems, ensure responders have runbooks and context, and treat every page as a signal about system quality.
## Rationale
### Alerting quality directly determines on-call quality
The most common on-call failure mode is alert fatigue: too many pages, most of them non-actionable or intermittent, to the point where responders begin ignoring alerts or treating all pages as noise. A single false-positive alert at 3 AM erodes trust in the alerting system. A pattern of false positives trains responders to assume alerts are false — at which point a real incident is missed.
Every alert should be actionable: a human being woken at 3 AM should have exactly one question — “what do I do?” — and the answer should be documented.
### Engineers who own what they build make better systems
“You build it, you run it” is the principle that engineers on-call for their own services have a direct incentive to make those services reliable and observable. When operations is a separate team that gets paged for things engineers built, the feedback loop is broken — engineers don’t experience the consequences of their reliability decisions. Teams that are responsible for their own on-call tend to invest in better alerting, more runbooks, and more reliable systems.
## Guidance
### On-call rotation structure
- Primary: Engineer A (first responder, 24/7 availability during rotation)
- Secondary: Engineer B (backup if the primary doesn't acknowledge within 15 minutes)
- Manager: Sandra W (escalation point for major incidents; provides context and authority)
- Rotation cycle: 1 week
- Handoff: Mondays at 09:00 local time
A one-week rotation is standard. Two-week rotations cause more fatigue; one-day rotations carry too much handoff overhead.
Team size minimum for sustainable on-call:
- With 1-week rotations, 4+ engineers means each person is on-call at most one week in four
- Fewer than 4 engineers on a rotation creates chronic fatigue; merge rotations with a neighbouring team or prioritise hiring before running a standalone rotation
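The rotation structure above can be sketched as a small scheduler. This is a minimal illustration (the roster names and epoch date are hypothetical) that computes primary and secondary responders for any date, counting handoffs in whole weeks from a Monday epoch:

```python
from datetime import date

# Hypothetical roster of 4+ engineers (the sustainable minimum above).
ENGINEERS = ["alice", "bob", "carol", "dave"]
EPOCH = date(2024, 1, 1)  # a Monday; handoffs happen on Mondays

def rotation_week(day: date) -> int:
    """Whole weeks elapsed since the rotation epoch."""
    return (day - EPOCH).days // 7

def on_call(day: date) -> tuple[str, str]:
    """Return (primary, secondary) for a date.

    The secondary is the next engineer in the cycle, so each engineer
    serves a secondary week immediately before their primary week.
    Time of day (the 09:00 handoff) is not modelled here.
    """
    week = rotation_week(day)
    primary = ENGINEERS[week % len(ENGINEERS)]
    secondary = ENGINEERS[(week + 1) % len(ENGINEERS)]
    return primary, secondary

print(on_call(date(2024, 1, 3)))   # week 0 -> ('alice', 'bob')
print(on_call(date(2024, 1, 10)))  # week 1 -> ('bob', 'carol')
```

With four engineers, each person is primary once every four weeks, matching the team-size guidance above.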
### Alert quality standards
Alerts should meet the following criteria before being enabled in production:
| Criterion | Description |
|---|---|
| Actionable | The responder must be able to do something in response |
| Accurate | The alert triggers for real problems, not transient noise |
| Documented | A runbook exists that explains how to respond |
| Appropriate severity | P1/page-now vs P2/business-hours distinction is correct |
| Customer-impacting | The alert represents genuine user impact, not internal implementation detail |
An alert that is consistently false-positive, consistently resolved without action, or consistently safe to ignore until morning should be:
- Converted to a non-paging ticket; or
- Fixed so it only fires when it should; or
- Deleted
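The triage rule above can be expressed as a small function over an alert's recent history. The thresholds here are illustrative assumptions, not values from any alerting product:

```python
def triage_alert(fired: int, actionable: int, resolved_without_action: int) -> str:
    """Decide what to do with a paging alert based on its recent history.

    fired: times the alert paged in the review window
    actionable: pages where the responder actually had to intervene
    resolved_without_action: pages that auto-resolved or were safe to ignore

    Thresholds are illustrative; tune them per team.
    """
    if fired == 0:
        return "delete"  # never fires: dead weight in the alert catalogue
    if actionable / fired >= 0.5:
        return "keep paging"
    if resolved_without_action / fired >= 0.8:
        return "convert to non-paging ticket"
    return "fix trigger condition"

print(triage_alert(fired=10, actionable=8, resolved_without_action=1))
print(triage_alert(fired=10, actionable=1, resolved_without_action=9))
```

Running this over a quarter of paging history gives a concrete worklist for the quarterly threshold review.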
### Severity levels
| Level | Criteria | Response time | Communication |
|---|---|---|---|
| P1 - Critical | Service down, data loss, major revenue impact | Immediate (< 5 min) | Status page update within 15 min |
| P2 - High | Degraded service, significant user impact | 30 minutes | Status page update |
| P3 - Medium | Minor degradation, workaround exists | Business hours | Ticket |
| P4 - Low | No current user impact | Scheduled | Backlog |
Only P1/P2 should page on-call out of hours.
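The table's paging and acknowledgement rules can be encoded directly. A minimal sketch (the SLA mapping mirrors the table above; the function names are illustrative):

```python
# Acknowledgement SLAs from the severity table, in minutes.
# None means the incident is ticketed rather than paged.
ACK_SLA_MINUTES = {"P1": 5, "P2": 30, "P3": None, "P4": None}

def pages_out_of_hours(severity: str) -> bool:
    """Only P1 and P2 wake the on-call outside business hours."""
    return severity in ("P1", "P2")

def ack_within_sla(severity: str, ack_minutes: float) -> bool:
    """True if the responder acknowledged within the table's SLA.

    P3/P4 incidents are handled as tickets or backlog items, so any
    acknowledgement time is acceptable for them.
    """
    sla = ACK_SLA_MINUTES.get(severity)
    return sla is None or ack_minutes <= sla

print(pages_out_of_hours("P2"))   # True
print(ack_within_sla("P1", 4))    # True: within the 5-minute SLA
print(ack_within_sla("P1", 12))   # False: P1 needs immediate response
```

Encoding the policy in one place keeps routing configuration and the documented table from drifting apart.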
### Runbook structure
Every production alert must have a linked runbook:
# [Alert Name] Runbook
## Trigger condition
High error rate on POST /api/payments — 5xx rate > 1% for 5 minutes
## User impact
Users unable to complete purchases. Revenue impact: ~$X/minute.
## First steps (under 2 minutes)
1. Check dashboard: [link]
2. Check recent deployments: [link]
3. Check downstream dependency status: [link]
## Common causes and resolution
### Stripe API timeout
**Symptom**: High share of `PaymentGatewayTimeoutError` in logs
**Resolution**: Check Stripe status page. If degraded, enable fallback to secondary processor:
```bash
feature-flags set stripe-primary-enabled false
```

### Database connection exhaustion
**Symptom**: `PoolExhaustedError` in logs, DB pool metric at max
**Resolution**: Investigate long-running queries:
```sql
SELECT pid, now() - query_start AS duration, query
FROM pg_stat_activity
WHERE state = 'active';
```

## Escalation
- If unresolved after 30 minutes, escalate to: [secondary on-call]
- Payment team: [contact]
### Reducing toil
Toil is repetitive, manual operational work that does not produce lasting value. Measuring and reducing toil is a core SRE practice:
- **Automated alert response**: for alerts with a known, safe automated fix (e.g., cache flush, pod restart), use runbook automation (PagerDuty Runbook Automation, AWS Systems Manager Automation) to perform the initial response automatically
- **Track pages per rotation**: high page rates per engineer per week (> 5 pages per shift) are a signal of systemic alerting or reliability issues — treat it as a bug
- **Post-incident toil tracking**: after each incident, note manual steps that could be automated in the next sprint
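Tracking pages per rotation is simple to automate. A sketch that flags shifts over the 5-page threshold (the page-log format and shift identifiers are hypothetical):

```python
from collections import Counter

# Hypothetical page log: one (shift_id, alert_name) entry per page delivered.
PAGE_LOG = [
    ("2024-W01", "payments-5xx"), ("2024-W01", "db-pool"),
    ("2024-W02", "payments-5xx"), ("2024-W02", "payments-5xx"),
    ("2024-W02", "db-pool"), ("2024-W02", "cache-miss"),
    ("2024-W02", "cache-miss"), ("2024-W02", "payments-5xx"),
]

def noisy_shifts(log, threshold: int = 5):
    """Return shifts whose page count exceeds the threshold.

    Per the guidance above, more than 5 pages per shift is treated
    as a reliability bug, not as normal on-call load.
    """
    counts = Counter(shift for shift, _ in log)
    return {shift: n for shift, n in counts.items() if n > threshold}

print(noisy_shifts(PAGE_LOG))  # 2024-W02 had 6 pages
```

Feeding this report into sprint planning turns "on-call felt rough" into a measurable trigger for a reliability review.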
### Post-incident process
After every P1/P2 incident:
1. **Timeline document**: what happened, chronologically
2. **5 Whys** or equivalent root cause analysis
3. **Action items**: concrete tasks to prevent recurrence
4. **Blameless review**: share findings with the team, not attributed to individuals
Track open action items from incidents as first-class work — deprioritising them is a silent acceptance of future recurrence.
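Treating action items as first-class work can be enforced mechanically. A sketch that flags incident action items left open past an age limit (the fields and the 30-day default are illustrative assumptions):

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class ActionItem:
    incident: str
    description: str
    opened: date
    done: bool = False

def overdue(items, today: date, max_age_days: int = 30):
    """Action items still open past the age limit.

    Surfacing these in planning makes silent deprioritisation
    visible; the 30-day limit is an illustrative default.
    """
    return [i for i in items
            if not i.done and (today - i.opened).days > max_age_days]

items = [
    ActionItem("INC-101", "add retry to payment client", date(2024, 1, 5)),
    ActionItem("INC-101", "alert on pool saturation", date(2024, 1, 5), done=True),
    ActionItem("INC-102", "automate cache flush", date(2024, 2, 20)),
]
print([i.description for i in overdue(items, today=date(2024, 3, 1))])
```

Running a report like this each sprint keeps recurrence risk visible long after the incident review ends.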
### Review checklist
- [ ] Every production alert has a documented runbook
- [ ] On-call rotation has at least one backup (secondary)
- [ ] Alert threshold review conducted quarterly — false-positive alerts are fixed or removed
- [ ] Pages per shift are tracked; > 5 actionable pages/shift triggers a reliability review
- [ ] Engineers have time during business hours to work on reliability improvements from incidents
- [ ] Manager escalation path exists for major incidents