Dashboard Design
Status: Complete
Category: Observability
Default enforcement: Advisory
Author: PushBackLog team
Tags
- Topic: observability, monitoring
- Skillset: devops, engineering, engineering-management
- Technology: Grafana, CloudWatch, Datadog
- Stage: operations
Summary
An observability dashboard communicates the health of a system at a glance to its intended audience. Effective dashboards are designed around specific questions — “is this service healthy right now?” or “why is latency elevated?” — not as an exhaustive collection of every available metric. Decorative dashboards that display many metrics but communicate nothing actionable are common and counterproductive; they create the illusion of observability without the substance.
Rationale
Dashboards answer specific questions for specific audiences
A dashboard designed for a senior engineer debugging a performance issue looks very different from a dashboard designed for a CTO reviewing monthly reliability trends. Mixing audiences or purposes in one dashboard produces a dashboard that serves neither well. The first design decision for any dashboard should be: who will use this, and what question does it answer?
The golden signals provide a universal starting point
Google’s four golden signals (Latency, Traffic, Errors, Saturation) apply to almost every service. A service health dashboard built on these four signals is actionable, universally understood, and sufficient to detect the vast majority of production issues. Start with the golden signals before adding supporting metrics.
Guidance
The four golden signals
| Signal | What it measures | Typical metric |
|---|---|---|
| Latency | How long requests take | p50, p95, p99 response time |
| Traffic | Volume of requests | Requests per second |
| Errors | Rate of failed requests | 5xx rate, error rate by type |
| Saturation | How “full” the system is | CPU %, memory %, connection pool % |
A dashboard that shows these four signals for a service will answer “is this service healthy?” for any engineer.
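As a concrete starting point, the golden signals for an HTTP service instrumented with Prometheus might look like the queries below. The metric names (`http_request_duration_seconds_bucket`, `http_requests_total`, `node_cpu_seconds_total`) and the `job` label are illustrative assumptions; substitute whatever your instrumentation emits.

```promql
# Latency: p95 response time (assumes a Prometheus histogram)
histogram_quantile(0.95,
  sum(rate(http_request_duration_seconds_bucket{job="api"}[5m])) by (le))

# Traffic: requests per second
sum(rate(http_requests_total{job="api"}[5m]))

# Errors: fraction of requests returning 5xx
sum(rate(http_requests_total{job="api",status=~"5.."}[5m]))
  / sum(rate(http_requests_total{job="api"}[5m]))

# Saturation: e.g., CPU utilisation from node_exporter
1 - avg(rate(node_cpu_seconds_total{mode="idle"}[5m]))
```

Four panels built on these queries are usually enough for a first-pass service health view; everything else is supporting detail.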
Dashboard hierarchy
Structure dashboards in layers:
Level 1: Business health (for leadership)
- Active users, transactions per minute, revenue indicators, SLO status
Level 2: Service health (for on-call, engineers)
- Golden signals per service, SLO burn rate, alert state
Level 3: Deep-dive (for debugging)
- Internal metrics, per-endpoint breakdown, trace samples, DB query times
Link dashboards: a Level 2 dashboard in a degraded state should link directly to the relevant Level 3 dashboards. Engineers should be able to go from “something is wrong” to “here is why” by following dashboard links, not by remembering query syntax.
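In Grafana, these cross-dashboard links can be declared in the dashboard JSON itself rather than maintained by hand. A minimal sketch (the tags and runbook URL are placeholders, not a convention this document defines):

```json
{
  "links": [
    {
      "title": "API deep-dive",
      "type": "dashboards",
      "tags": ["api", "level-3"],
      "asDropdown": true
    },
    {
      "title": "Runbook: elevated latency",
      "type": "link",
      "url": "https://wiki.example.com/runbooks/api-latency"
    }
  ]
}
```

Tag-based links (`"type": "dashboards"`) keep working as Level 3 dashboards are added or renamed, as long as the tags stay consistent.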
Grafana dashboard best practices
// Panel design principles
// 1. Always set a panel title and description
{
  "title": "API p95 Latency",
  "description": "95th percentile response time for all API endpoints. Alert threshold: 500ms."
}
// 2. Set fixed Y-axis bounds; don't let latency panels autoscale
{
  "yaxes": [
    { "min": 0, "max": 1000, "format": "ms" }
  ]
}
// 3. Mark thresholds — show alert lines on graphs
{
  "thresholds": [
    { "value": 500, "color": "yellow", "op": "gt" },
    { "value": 1000, "color": "red", "op": "gt" }
  ]
}
Common dashboard anti-patterns
| Anti-pattern | Problem | Fix |
|---|---|---|
| Decorative metrics | Metrics that are always green and never alert; provide false confidence | Remove or replace with metrics that reflect real health |
| Auto-scaling axes | A latency spike from 10ms to 100ms looks flat because the scale adjusts | Set fixed Y-axis ranges |
| Too many panels | Engineers scan but don’t read; important signals are missed | Keep service health dashboards to < 20 panels |
| No alert thresholds | Engineers can’t tell what “healthy” vs “degraded” looks like | Add threshold lines matching alert config |
| Unknown owners | Nobody knows who to ask about a metric | Add Team annotations; link to runbooks |
| Stale dashboards | Panels show metrics from services that no longer exist | Review and prune dashboards quarterly |
SLO burn rate dashboard
A burn rate panel is more actionable than a raw error rate panel:
# Error budget burn rate (1 hour window)
# Shows how fast the error budget is being consumed relative to target
(
sum(rate(http_requests_total{job="api",status=~"5.."}[1h]))
/
sum(rate(http_requests_total{job="api"}[1h]))
) / 0.001 # 0.001 = 1 - SLO target (e.g., 99.9%)
A burn rate > 1 means the error budget is being consumed faster than the SLO allows. A sustained burn rate of 14.4 exhausts a 30-day error budget in roughly two days (30 / 14.4 ≈ 2.1), making it a common threshold for an immediate page.
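To page on fast burn without flapping on brief spikes, the Google SRE Workbook pairs a long and a short window at the same burn-rate threshold: the long window confirms the budget impact, the short window confirms the problem is still happening. A sketch of the alert expression, reusing the assumed metric and job names from the query above:

```promql
# Fast-burn page: 14.4x burn over both the 1h and 5m windows
(
  sum(rate(http_requests_total{job="api",status=~"5.."}[1h]))
  / sum(rate(http_requests_total{job="api"}[1h]))
) > (14.4 * 0.001)
and
(
  sum(rate(http_requests_total{job="api",status=~"5.."}[5m]))
  / sum(rate(http_requests_total{job="api"}[5m]))
) > (14.4 * 0.001)
```

The same pattern with lower multipliers (e.g., 6x over 6h, 1x over 3d) covers slower burns as tickets rather than pages.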
Audience-specific design
On-call engineer dashboard:
- Big, prominent current status (red/amber/green indicators)
- Recent deployments timeline
- Alert state panel
- Links to runbooks
Engineering team dashboard:
- Error budget status per service (for SLO tracking)
- Deployment frequency and MTTR trends (DORA metrics)
- Key business KPIs
Leadership dashboard:
- SLO compliance (availability, latency SLO) over the month
- Business metrics (transactions, active users)
- Incident count and MTTR trends
- Availability % vs target
Review checklist
- Every dashboard has a stated purpose and named audience
- Service health dashboards include all four golden signals
- Alert thresholds are marked on time-series panels
- Dashboards link to related dashboards and runbooks
- Y-axis ranges are fixed for SLI metrics — not auto-scaled
- Dashboards are versioned in code (Grafana-as-code, Terraform) — not manually maintained
- Stale/unused dashboards are reviewed and removed quarterly