PushBackLog

# On-Call Best Practices



Status: Complete
Category: Observability
Default enforcement: Soft
Author: PushBackLog team


## Tags

  • Topic: observability, operations, reliability
  • Skillset: devops, engineering, engineering-management
  • Technology: PagerDuty, OpsGenie
  • Stage: operations

## Summary

On-call is a structured rotation in which engineers are available outside normal working hours to respond to production incidents. Done well, on-call is a manageable burden that keeps engineers accountable for the reliability of the systems they build. Done poorly — with too many pages, poor tooling, and no support — on-call destroys morale, causes burnout, and increases attrition. The engineering leader’s job is to make on-call sustainable: configure alerting to fire only on real, actionable problems, ensure responders have runbooks and context, and treat every page as a signal about system quality.


## Rationale

### Alerting quality directly determines on-call quality

The most common on-call failure mode is alert fatigue: too many pages, most of them non-actionable or intermittent, to the point where responders begin ignoring alerts or treating all pages as noise. A single false-positive alert at 3 AM erodes trust in the alerting system. A pattern of false positives trains responders to assume alerts are false — at which point a real incident is missed.

Every alert should be actionable: a human being woken at 3 AM should have exactly one question — “what do I do?” — and the answer should be documented.

### Engineers who own what they build make better systems

“You build it, you run it” is the principle that engineers on-call for their own services have a direct incentive to make those services reliable and observable. When operations is a separate team that gets paged for things engineers built, the feedback loop is broken — engineers don’t experience the consequences of their reliability decisions. Teams that are responsible for their own on-call tend to invest in better alerting, more runbooks, and more reliable systems.


## Guidance

### On-call rotation structure

```
Primary:   Engineer A  (first responder, 24/7 availability during rotation)
Secondary: Engineer B  (backup if primary doesn't acknowledge in 15 minutes)
Manager:   Sandra W    (escalation for major incidents, context and authority)

Rotation cycle: 1 week
Handoff: Mondays at 09:00 local time
```

A one-week rotation is standard. Two-week rotations exist but cause more fatigue; one-day rotations carry too much handoff overhead.

Team size minimum for sustainable on-call:

  • With 1-week rotations: 4+ engineers means each person is on-call every 4+ weeks
  • < 4 engineers on a rotation creates chronic fatigue; delay hiring or merge rotations
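The arithmetic behind the four-engineer minimum is simple enough to sketch. A minimal illustration (function names are ours, and the thresholds mirror the guidance above):

```python
def shifts_per_year(team_size: int, rotation_weeks: int = 1) -> float:
    """Number of on-call shifts each engineer serves per year."""
    return 52 / (team_size * rotation_weeks)

def is_sustainable(team_size: int, rotation_weeks: int = 1) -> bool:
    """Sustainable when each engineer is on call at most one week in four."""
    return team_size * rotation_weeks >= 4
```

With three engineers, each serves roughly 17 weeks of on-call per year; at four, that drops to 13 and the rotation meets the one-week-in-four bar.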

### Alert quality standards

Alerts should meet the following criteria before being enabled in production:

| Criterion | Description |
| --- | --- |
| Actionable | The responder must be able to do something in response |
| Accurate | The alert triggers for real problems, not transient noise |
| Documented | A runbook exists that explains how to respond |
| Appropriate severity | The P1/page-now vs P2/business-hours distinction is correct |
| Customer-impacting | The alert represents genuine user impact, not an internal implementation detail |

An alert that is consistently false-positive, consistently resolved without action, or consistently safe to ignore until morning should be:

  1. Converted to a non-paging ticket; or
  2. Fixed so it only fires when it should; or
  3. Deleted
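This triage rule can be mechanised from page history. A hedged sketch — the record shape and the 50% thresholds are our assumptions, not a PagerDuty or OpsGenie API:

```python
def triage_alert(pages: list[dict]) -> str:
    """Decide what to do with an alert based on its recent page history.

    Each record: {"true_positive": bool, "required_action": bool}.
    """
    if not pages:
        return "keep"  # no history yet, nothing to judge
    n = len(pages)
    false_rate = sum(not p["true_positive"] for p in pages) / n
    no_action_rate = sum(not p["required_action"] for p in pages) / n
    if false_rate > 0.5:
        return "fix-or-delete"      # options 2/3: fires when it shouldn't
    if no_action_rate > 0.5:
        return "convert-to-ticket"  # option 1: real, but safe until morning
    return "keep"
```

Running this over each alert's last quarter of pages turns the quarterly threshold review into a short, data-backed list rather than a debate.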

### Severity levels

| Level | Criteria | Response time | Communication |
| --- | --- | --- | --- |
| P1 - Critical | Service down, data loss, major revenue impact | Immediate (< 5 min) | Status page update within 15 min |
| P2 - High | Degraded service, significant user impact | 30 minutes | Status page update |
| P3 - Medium | Minor degradation, workaround exists | Business hours | Ticket |
| P4 - Low | No current user impact | Scheduled | Backlog |

Only P1/P2 should page on-call out of hours.
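That paging rule reduces to a one-line routing decision. A minimal sketch mirroring the table (the channel names are illustrative, not a tooling API):

```python
def route(severity: str) -> str:
    """Map a severity level to a delivery channel per the table above."""
    if severity in {"P1", "P2"}:
        return "page"     # wakes the on-call engineer, any hour
    if severity == "P3":
        return "ticket"   # handled during business hours
    return "backlog"      # P4: scheduled work, no interruption
```

Encoding this in the alerting pipeline (rather than in per-alert judgment calls) is what keeps the P3/P4 tail from paging anyone at night.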

### Runbook structure

Every production alert must have a linked runbook:

# [Alert Name] Runbook

## Trigger condition
High error rate on POST /api/payments — 5xx rate > 1% for 5 minutes

## User impact
Users unable to complete purchases. Revenue impact: ~$X/minute.

## First steps (under 2 minutes)
1. Check dashboard: [link]
2. Check recent deployments: [link]
3. Check downstream dependency status: [link]

## Common causes and resolution

### Stripe API timeout
**Symptom**: High share of `PaymentGatewayTimeoutError` in logs  
**Resolution**: Check Stripe status page. If degraded, enable fallback to secondary processor:
```bash
feature-flags set stripe-primary-enabled false
```

### Database connection exhaustion
**Symptom**: `PoolExhaustedError` in logs, DB pool metric at max  
**Resolution**: Investigate long-running queries:
```sql
-- pg_stat_activity has no duration column; derive it from query_start
SELECT pid, now() - query_start AS duration, query
FROM pg_stat_activity
WHERE state = 'active';
```

## Escalation

- If unresolved after 30 minutes, escalate to: [secondary on-call]
- Payment team: [contact]

### Reducing toil

Toil is repetitive, manual operational work that does not produce lasting value. Measuring and reducing toil is a core SRE practice:

- **Automated alert response**: for alerts with a known, safe automated fix (e.g., cache flush, pod restart), use runbook automation (PagerDuty Runbook Automation, AWS Systems Manager Automation) to perform the initial response automatically
- **Track pages per rotation**: high page rates per engineer per week (> 5 pages per shift) are a signal of systemic alerting or reliability issues — treat it as a bug
- **Post-incident toil tracking**: after each incident, note manual steps that could be automated in the next sprint
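Tracking the pages-per-shift signal takes little more than a counter over incident exports. A sketch, assuming a flat list of page records keyed by shift — the field names are illustrative, not a real PagerDuty/OpsGenie export schema:

```python
from collections import Counter

def noisy_shifts(pages: list[dict], threshold: int = 5) -> list[str]:
    """Return shift identifiers whose page count exceeds the threshold.

    Each record: {"shift": "2024-W18", "alert": "api-5xx"}.
    """
    counts = Counter(p["shift"] for p in pages)
    return sorted(s for s, n in counts.items() if n > threshold)
```

Reviewing this list in the weekly handoff makes "treat it as a bug" concrete: any shift it returns gets a reliability work item, not just sympathy.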

### Post-incident process

After every P1/P2 incident:
1. **Timeline document**: what happened, chronologically
2. **5 Whys** or equivalent root cause analysis
3. **Action items**: concrete tasks to prevent recurrence
4. **Blameless review**: share findings with the team, not attributed to individuals

Track open action items from incidents as first-class work — deprioritising them is a silent acceptance of future recurrence.

### Review checklist

- [ ] Every production alert has a documented runbook
- [ ] On-call rotation has at least one backup (secondary)
- [ ] Alert threshold review conducted quarterly — false-positive alerts are fixed or removed
- [ ] Pages per shift are tracked; > 5 actionable pages/shift triggers a reliability review
- [ ] Engineers have time during business hours to work on reliability improvements from incidents
- [ ] Manager escalation path exists for major incidents