Engineering Metrics
Status: Stub
Category: Management
Default enforcement: Soft
Author: PushBackLog team
Tags
- Topic: management, metrics, delivery
- Skillset: management
- Technology: generic
- Stage: planning, operations
Summary
Engineering metrics are quantitative signals used to understand and improve the effectiveness of software delivery. When chosen carefully, metrics expose bottlenecks, predict reliability risk, and guide improvement. When chosen carelessly — or used as individual performance measures — they invite gaming, mislead decision-making, and damage team culture.
Rationale
You cannot improve what you do not measure. But you will also not improve what you measure badly. Engineering metrics provide a feedback loop between team practices and delivery outcomes — but only if the team measures outcomes (deployment frequency, lead time, incident recovery) rather than proxies (lines of code, story points closed, tickets resolved).
The DORA research programme (DevOps Research and Assessment) has produced the most robust empirical evidence for which metrics predict high-performing software delivery. DORA’s four key metrics are the recommended starting point for any engineering metrics programme.
Guidance
DORA’s four key metrics
| Metric | What it measures | Top performer benchmark |
|---|---|---|
| Deployment frequency | How often code is deployed to production | On-demand (multiple times per day) |
| Lead time for changes | Time from commit to production | Less than one hour |
| Change failure rate | Percentage of deployments causing incidents / rollbacks | 0–15% |
| Time to restore service (MTTR) | How long to recover from a production incident | Less than one hour |
These four metrics form two pairs: the throughput metrics (deployment frequency + lead time) and the stability metrics (change failure rate + MTTR). Elite teams score well on all four simultaneously. The data shows throughput and stability are not in tension — high-performing teams achieve both.
Tracking and baselines
Before optimising, establish baselines. Collect at least 90 days of data before drawing conclusions about trends. Without baselines, metrics support only noisy relative comparisons (“better or worse than last sprint”), not absolute judgements (“are we high-performing relative to industry benchmarks?”).
Measurement should be automated and continuous, not manual — manual data collection introduces bias and is unsustainable.
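A baseline comparison of this kind reduces to a median-over-window calculation. The function name and sample values below are illustrative:

```python
from statistics import median

def relative_change(baseline: list[float], current: list[float]) -> float:
    """Relative change of the current window's median versus the baseline median.

    Positive means the metric went up; whether that is an improvement depends
    on the metric (up is good for deployment frequency, bad for lead time).
    """
    b, c = median(baseline), median(current)
    return (c - b) / b

# Example: lead time in hours, 90-day baseline vs. the most recent month.
baseline_lead_times = [20.0, 24.0, 22.0, 30.0, 26.0]
recent_lead_times = [18.0, 16.0, 20.0]
change = relative_change(baseline_lead_times, recent_lead_times)  # -0.25, i.e. 25% faster
```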
Additional signals
Beyond DORA, other signals can expose specific problem areas:
| Signal | What it detects |
|---|---|
| Test coverage trend | Whether test confidence is growing or eroding |
| Build/pipeline duration | CI feedback loop quality |
| Alert noise ratio (alerts fired per actionable alert) | On-call sustainability |
| P90/P99 latency trend | User-visible performance degradation |
| Escaped defect rate | Defects reaching production that should have been caught earlier |
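Two of the signals above reduce to simple ratios. These helper names are illustrative, not part of any standard tooling:

```python
def alert_noise_ratio(alerts_fired: int, actionable_alerts: int) -> float:
    """Alerts fired per actionable alert; 1.0 is ideal, higher means noise."""
    return alerts_fired / actionable_alerts

def escaped_defect_rate(found_in_production: int, found_total: int) -> float:
    """Share of all defects that escaped to production instead of being
    caught earlier by tests, review, or staging."""
    return found_in_production / found_total
```

For example, 40 alerts in a week of which only 8 were actionable gives a noise ratio of 5 — a sign the on-call rotation is being paged largely by noise.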
Goodhart’s Law and gaming
“When a measure becomes a target, it ceases to be a good measure.” — Goodhart’s Law
Metrics used to evaluate individual engineers or teams invite gaming:
- Deployment frequency is inflated by deploying no-ops
- Story points closed increases by inflating estimates
- Test coverage improves by adding trivial tests that do not validate behaviour
Engineering metrics should be used by the team for self-diagnosis, not by management to score individuals. Publish metrics at the team and organisational level; do not attach them to performance reviews.
Reviewing metrics
A monthly or quarterly metrics review with the team serves several purposes:
- Identifies genuine bottlenecks (slow CI, high change failure rate in one service)
- Celebrates genuine improvements
- Keeps the metric set fresh — stop measuring things that are no longer informative
- Grounds improvement initiatives in data rather than intuition
Common failure modes
| Failure | Description |
|---|---|
| Proxy metrics as goals | Story points, lines of code, and test count used as performance targets |
| Individual-level measurement | Metrics used to evaluate or rank individual engineers |
| No baselines | Teams track metrics but have no context for whether numbers are good or bad |
| Data collected manually | Manual collection is biased, inconsistent, and abandoned under pressure |
| Vanity metrics | Numbers that look good but do not reflect delivery health (e.g., total commits) |