Dashboard Design
Status: Complete
Category: Observability
Default enforcement: Advisory
Author: PushBackLog team
Tags
- Topic: observability, monitoring
- Skillset: devops, engineering, engineering-management
- Technology: Grafana, CloudWatch, Datadog
- Stage: operations
Summary
An observability dashboard communicates the health of a system at a glance to its intended audience. Effective dashboards are designed around specific questions — “is this service healthy right now?” or “why is latency elevated?” — not as an exhaustive collection of every available metric. Decorative dashboards that display many metrics but communicate nothing actionable are common and counterproductive; they create the illusion of observability without the substance.
Rationale
Dashboards answer specific questions for specific audiences
A dashboard designed for a senior engineer debugging a performance issue looks very different from a dashboard designed for a CTO reviewing monthly reliability trends. Mixing audiences or purposes in one dashboard produces a dashboard that serves neither well. The first design decision for any dashboard should be: who will use this, and what question does it answer?
The golden signals provide a universal starting point
Google’s four golden signals (Latency, Traffic, Errors, Saturation) apply to almost every service. A service health dashboard built on these four signals is actionable, universally understood, and sufficient to detect the vast majority of production issues. Start with the golden signals before adding supporting metrics.
Guidance
The four golden signals
| Signal | What it measures | Typical metric |
|---|---|---|
| Latency | How long requests take | p50, p95, p99 response time |
| Traffic | Volume of requests | Requests per second |
| Errors | Rate of failed requests | 5xx rate, error rate by type |
| Saturation | How “full” the system is | CPU %, memory %, connection pool % |
A dashboard that shows these four signals for a service will answer “is this service healthy?” for any engineer.
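As a concrete starting point, the golden signals for an HTTP service instrumented with Prometheus might look like the queries below. The metric names (`http_request_duration_seconds_bucket`, `http_requests_total`, `node_cpu_seconds_total`) and the `job` label are illustrative assumptions; substitute whatever your instrumentation emits.

```promql
# Latency: p95 response time (assumes a Prometheus histogram)
histogram_quantile(0.95,
  sum(rate(http_request_duration_seconds_bucket{job="api"}[5m])) by (le))

# Traffic: requests per second
sum(rate(http_requests_total{job="api"}[5m]))

# Errors: fraction of requests returning 5xx
sum(rate(http_requests_total{job="api",status=~"5.."}[5m]))
  / sum(rate(http_requests_total{job="api"}[5m]))

# Saturation: e.g., CPU utilisation from node_exporter
1 - avg(rate(node_cpu_seconds_total{mode="idle"}[5m]))
```

Four panels built on these queries are usually enough for a first-pass service health view; everything else is supporting detail.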
Dashboard hierarchy
Structure dashboards in layers:
Level 1: Business health (for leadership)
- Active users, transactions per minute, revenue indicators, SLO status
Level 2: Service health (for on-call, engineers)
- Golden signals per service, SLO burn rate, alert state
Level 3: Deep-dive (for debugging)
- Internal metrics, per-endpoint breakdown, trace samples, DB query times
Link dashboards: a Level 2 dashboard in a degraded state should link directly to the relevant Level 3 dashboards. Engineers should be able to go from “something is wrong” to “here is why” by following dashboard links, not by remembering query syntax.
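In Grafana, these cross-dashboard links can be declared in the dashboard JSON itself rather than maintained by hand. A minimal sketch (the tags and runbook URL are placeholders, not a convention this document defines):

```json
{
  "links": [
    {
      "title": "API deep-dive",
      "type": "dashboards",
      "tags": ["api", "level-3"],
      "asDropdown": true
    },
    {
      "title": "Runbook: elevated latency",
      "type": "link",
      "url": "https://wiki.example.com/runbooks/api-latency"
    }
  ]
}
```

Tag-based links (`"type": "dashboards"`) keep working as Level 3 dashboards are added or renamed, as long as the tags stay consistent.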
Grafana dashboard best practices
// Panel design principles
// 1. Always set a panel title and description
{
  "title": "API p95 Latency",
  "description": "95th percentile response time for all API endpoints. Alert threshold: 500ms."
}
// 2. Set fixed Y-axis bounds; don't let latency panels autoscale
{
  "yaxes": [
    { "min": 0, "max": 1000, "format": "ms" }
  ]
}
// 3. Mark thresholds — show alert lines on graphs
{
  "thresholds": [
    { "value": 500, "color": "yellow", "op": "gt" },
    { "value": 1000, "color": "red", "op": "gt" }
  ]
}
Common dashboard anti-patterns
| Anti-pattern | Problem | Fix |
|---|---|---|
| Decorative metrics | Metrics that are always green and never alert; provide false confidence | Remove or replace with metrics that reflect real health |
| Auto-scaling axes | A latency spike from 10ms to 100ms looks flat because the scale adjusts | Set fixed Y-axis ranges |
| Too many panels | Engineers scan but don’t read; important signals are missed | Keep service health dashboards to < 20 panels |
| No alert thresholds | Engineers can’t tell what “healthy” vs “degraded” looks like | Add threshold lines matching alert config |
| Unknown owners | Nobody knows who to ask about a metric | Add Team annotations; link to runbooks |
| Stale dashboards | Panels show metrics from services that no longer exist | Review and prune dashboards quarterly |
SLO burn rate dashboard
A burn rate panel is more actionable than a raw error rate panel:
# Error budget burn rate (1 hour window)
# Shows how fast the error budget is being consumed relative to target
(
sum(rate(http_requests_total{job="api",status=~"5.."}[1h]))
/
sum(rate(http_requests_total{job="api"}[1h]))
) / 0.001 # 0.001 = 1 - SLO target (e.g., 99.9%)
A burn rate > 1 means the error budget is being consumed faster than the SLO allows. A sustained burn rate of 14.4 exhausts a 30-day error budget in roughly two days (30 / 14.4 ≈ 2.1), making it a common threshold for an immediate page.
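To page on fast burn without flapping on brief spikes, the Google SRE Workbook pairs a long and a short window at the same burn-rate threshold: the long window confirms the budget impact, the short window confirms the problem is still happening. A sketch of the alert expression, reusing the assumed metric and job names from the query above:

```promql
# Fast-burn page: 14.4x burn over both the 1h and 5m windows
(
  sum(rate(http_requests_total{job="api",status=~"5.."}[1h]))
  / sum(rate(http_requests_total{job="api"}[1h]))
) > (14.4 * 0.001)
and
(
  sum(rate(http_requests_total{job="api",status=~"5.."}[5m]))
  / sum(rate(http_requests_total{job="api"}[5m]))
) > (14.4 * 0.001)
```

The same pattern with lower multipliers (e.g., 6x over 6h, 1x over 3d) covers slower burns as tickets rather than pages.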
Audience-specific design
On-call engineer dashboard:
- Big, prominent current status (red/amber/green indicators)
- Recent deployments timeline
- Alert state panel
- Links to runbooks
Engineering team dashboard:
- Error budget status per service (for SLO tracking)
- Deployment frequency and MTTR trends (DORA metrics)
- Key business KPIs
Leadership dashboard:
- SLO compliance (availability, latency SLO) over the month
- Business metrics (transactions, active users)
- Incident count and MTTR trends
- Availability % vs target
Review checklist
- Every dashboard has a stated purpose and named audience
- Service health dashboards include all four golden signals
- Alert thresholds are marked on time-series panels
- Dashboards link to related dashboards and runbooks
- Y-axis ranges are fixed for SLI metrics — not auto-scaled
- Dashboards are versioned in code (Grafana-as-code, Terraform) — not manually maintained
- Stale/unused dashboards are reviewed and removed quarterly