PushBackLog

Engineering Metrics


Status: Stub • Category: Management • Default enforcement: Soft • Author: PushBackLog team


Tags

  • Topic: management, metrics, delivery
  • Skillset: management
  • Technology: generic
  • Stage: planning, operations

Summary

Engineering metrics are quantitative signals used to understand and improve the effectiveness of software delivery. When chosen carefully, metrics expose bottlenecks, predict reliability risk, and guide improvement. When chosen carelessly — or used as individual performance measures — they are gamed, misleading, and harmful to team culture.


Rationale

You cannot improve what you do not measure. But you will also not improve what you measure badly. Engineering metrics provide a feedback loop between team practices and delivery outcomes — but only if the team measures outcomes (deployment frequency, lead time, incident recovery) rather than proxies (lines of code, story points closed, tickets resolved).

The DORA research programme (DevOps Research and Assessment) has produced the most robust empirical evidence for which metrics predict high-performing software delivery. DORA’s four key metrics are the recommended starting point for any engineering metrics programme.


Guidance

DORA’s four key metrics

| Metric | What it measures | Top performer benchmark |
| --- | --- | --- |
| Deployment frequency | How often code is deployed to production | On demand (multiple times per day) |
| Lead time for changes | Time from commit to production | Less than one hour |
| Change failure rate | Percentage of deployments causing incidents / rollbacks | 0–15% |
| Time to restore service (MTTR) | How long to recover from a production incident | Less than one hour |

These four metrics form two pairs: the throughput metrics (deployment frequency + lead time) and the stability metrics (change failure rate + MTTR). Elite teams score well on all four simultaneously. The data shows throughput and stability are not in tension — high-performing teams achieve both.
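All four metrics can be derived from two event streams: a deployment log and an incident log. As an illustrative sketch (the record shapes and values here are hypothetical, not a real tool's schema), the two pairs can be computed like this:

```python
from datetime import datetime, timedelta

# Hypothetical deployment records: commit and deploy timestamps, plus a
# flag for whether the deploy caused an incident or rollback.
deployments = [
    {"committed": datetime(2024, 3, 1, 9, 0),  "deployed": datetime(2024, 3, 1, 9, 40),  "failed": False},
    {"committed": datetime(2024, 3, 1, 11, 0), "deployed": datetime(2024, 3, 1, 12, 15), "failed": True},
    {"committed": datetime(2024, 3, 2, 10, 0), "deployed": datetime(2024, 3, 2, 10, 30), "failed": False},
]
# Hypothetical incident records: start and resolution times.
incidents = [
    {"started": datetime(2024, 3, 1, 12, 20), "resolved": datetime(2024, 3, 1, 13, 5)},
]

window_days = 2  # observation window covered by the records above

# Throughput pair: deployment frequency and lead time for changes.
deployment_frequency = len(deployments) / window_days  # deploys per day
lead_times = [d["deployed"] - d["committed"] for d in deployments]
avg_lead_time = sum(lead_times, timedelta()) / len(lead_times)

# Stability pair: change failure rate and time to restore service.
change_failure_rate = sum(d["failed"] for d in deployments) / len(deployments)
restore_times = [i["resolved"] - i["started"] for i in incidents]
mttr = sum(restore_times, timedelta()) / len(restore_times)

print(f"Deployment frequency: {deployment_frequency:.1f}/day")
print(f"Average lead time:    {avg_lead_time}")
print(f"Change failure rate:  {change_failure_rate:.0%}")
print(f"MTTR:                 {mttr}")
```

The point of the sketch is that none of this requires manual bookkeeping: a CI/CD pipeline and an incident tracker already emit these timestamps.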

Tracking and baselines

Before optimising, establish baselines. Collect at least 90 days of data before drawing conclusions about trends. Without a baseline, metrics support only relative comparisons (“better or worse than last sprint”), not absolute ones (“are we high-performing relative to industry benchmarks?”).

Measurement tooling should be automated and continuous, not collected manually — manual data collection introduces bias and is unsustainable.
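A baseline can be as simple as summary statistics over the most recent 90-day window, refusing to report until enough history exists. A minimal sketch, assuming a hypothetical series of automatically collected daily lead-time samples:

```python
from statistics import median, quantiles

# Hypothetical daily lead-time samples (hours), as an automated pipeline
# might emit them; 90 entries = the minimum 90-day baseline window.
daily_lead_time_hours = [2 + (i % 7) * 0.5 for i in range(90)]

def baseline(series, min_days=90):
    """Summarise a metric series, but only once enough history exists."""
    if len(series) < min_days:
        raise ValueError(f"need {min_days} days of data, have {len(series)}")
    window = series[-min_days:]          # most recent 90 days only
    p90 = quantiles(window, n=10)[-1]    # 90th percentile of the window
    return {"median": median(window), "p90": p90}

print(baseline(daily_lead_time_hours))
```

Reporting a percentile alongside the median guards against a few outliers masquerading as a trend.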

Additional signals

Beyond DORA, other signals can expose specific problem areas:

| Signal | What it detects |
| --- | --- |
| Test coverage trend | Whether test confidence is growing or eroding |
| Build/pipeline duration | CI feedback loop quality |
| Alert noise ratio (alerts fired per actionable alert) | On-call sustainability |
| P90/P99 latency trend | User-visible performance degradation |
| Escaped defect rate | Defects reaching production that should have been caught earlier |
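Of these, the alert noise ratio is often the cheapest to start tracking: tag each alert as actionable or not during on-call handover. A sketch, with a hypothetical alert log:

```python
# Hypothetical on-call alert log: each alert flagged actionable or not
# during handover review.
alerts = [
    {"name": "disk_full",         "actionable": True},
    {"name": "cpu_spike",         "actionable": False},
    {"name": "cpu_spike",         "actionable": False},
    {"name": "error_rate",        "actionable": True},
    {"name": "flaky_healthcheck", "actionable": False},
]

actionable = sum(a["actionable"] for a in alerts)
# Alerts fired per actionable alert; infinite if nothing was actionable.
noise_ratio = len(alerts) / actionable if actionable else float("inf")
print(f"Alert noise ratio: {noise_ratio:.1f} alerts per actionable alert")
```

A ratio trending upwards means on-call is drowning in noise, regardless of how the DORA numbers look.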

Goodhart’s Law and gaming

“When a measure becomes a target, it ceases to be a good measure.” — Goodhart’s Law

Metrics used to evaluate individual engineers or teams invite gaming:

  • Deployment frequency is inflated by deploying no-ops
  • Story points closed increases by inflating estimates
  • Test coverage improves by adding trivial tests that do not validate behaviour

Engineering metrics should be used by the team for self-diagnosis, not by management to score individuals. Publish metrics at the team and organisational level; do not attach them to performance reviews.

Reviewing metrics

A monthly or quarterly metrics review with the team serves several purposes:

  • Identifies genuine bottlenecks (slow CI, high change failure rate in one service)
  • Celebrates genuine improvements
  • Keeps the metric set fresh — stop measuring things that are no longer informative
  • Grounds improvement initiatives in data rather than intuition

Common failure modes

| Failure | Description |
| --- | --- |
| Proxy metrics as goals | Story points, lines of code, and test count used as performance targets |
| Individual-level measurement | Metrics used to evaluate or rank individual engineers |
| No baselines | Teams track metrics but have no context for whether numbers are good or bad |
| Data collected manually | Manual collection is biased, inconsistent, and abandoned under pressure |
| Vanity metrics | Numbers that look good but do not reflect delivery health (e.g., total commits) |

Part of the PushBackLog Best Practices Library