PushBackLog

Dashboard Design

Status: Complete
Category: Observability
Default enforcement: Advisory
Author: PushBackLog team


Tags

  • Topic: observability, monitoring
  • Skillset: devops, engineering, engineering-management
  • Technology: Grafana, CloudWatch, Datadog
  • Stage: operations

Summary

An observability dashboard communicates the health of a system at a glance to its intended audience. Effective dashboards are designed around specific questions — “is this service healthy right now?” or “why is latency elevated?” — not as an exhaustive collection of every available metric. Decorative dashboards that display many metrics but communicate nothing actionable are common and counterproductive; they create the illusion of observability without the substance.


Rationale

Dashboards answer specific questions for specific audiences

A dashboard designed for a senior engineer debugging a performance issue looks very different from a dashboard designed for a CTO reviewing monthly reliability trends. Mixing audiences or purposes in one dashboard produces a dashboard that serves neither well. The first design decision for any dashboard should be: who will use this, and what question does it answer?

The golden signals provide a universal starting point

Google’s four golden signals (Latency, Traffic, Errors, Saturation) apply to almost every service. A service health dashboard built on these four signals is actionable, universally understood, and sufficient to detect the vast majority of production issues. Start with the golden signals before adding supporting metrics.


Guidance

The four golden signals

Signal     | What it measures         | Typical metric
Latency    | How long requests take   | p50, p95, p99 response time
Traffic    | Volume of requests       | Requests per second
Errors     | Rate of failed requests  | 5xx rate, error rate by type
Saturation | How “full” the system is | CPU %, memory %, connection pool %

A dashboard that shows these four signals for a service answers “is this service healthy?” for any engineer.
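For intuition about what each signal is made of, the four values can be computed from one window of raw request data. This is a sketch with an invented record schema (duration in ms, HTTP status), not any vendor's API:

```python
from statistics import quantiles

def golden_signals(requests, window_seconds, cpu_percent):
    """Compute the four golden signals over one window of request records.

    `requests` is a list of (duration_ms, status_code) tuples -- an
    invented schema for illustration, not a real metrics API.
    """
    durations = sorted(d for d, _ in requests)
    # quantiles(..., n=100) returns 99 cut points; index 94 is the 95th percentile
    p95 = quantiles(durations, n=100)[94] if len(durations) > 1 else durations[0]
    return {
        "latency_p95_ms": p95,
        "traffic_rps": len(requests) / window_seconds,
        "error_rate": sum(1 for _, s in requests if s >= 500) / len(requests),
        "saturation_pct": cpu_percent,  # stand-in for "how full" the system is
    }
```

In practice these come from the metrics backend (PromQL, CloudWatch metric math, Datadog queries) rather than raw records; the sketch only shows what each signal measures.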

Dashboard hierarchy

Structure dashboards in layers:

Level 1: Business health (for leadership)
  - Active users, transactions per minute, revenue indicators, SLO status

Level 2: Service health (for on-call, engineers)
  - Golden signals per service, SLO burn rate, alert state

Level 3: Deep-dive (for debugging)
  - Internal metrics, per-endpoint breakdown, trace samples, DB query times

Link dashboards: a Level 2 dashboard in a degraded state should link directly to the relevant Level 3 dashboards. Engineers should be able to go from “something is wrong” to “here is why” by following dashboard links, not by remembering query syntax.
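As a sketch of the linking idea, a Level 2 panel can carry its drill-down link directly in its definition. The dict shape mirrors Grafana panel JSON, but the helper, titles, and uid here are illustrative:

```python
def service_health_panel(service, drilldown_uid):
    """Build a Level 2 panel dict that links straight to its Level 3 deep-dive.

    Hypothetical helper; the field shape follows Grafana panel JSON,
    the uid and titles are invented.
    """
    return {
        "title": f"{service} p95 latency",
        "type": "timeseries",
        "links": [{
            "title": f"{service} deep-dive",
            "url": f"/d/{drilldown_uid}",  # "something is wrong" -> "here is why"
        }],
    }
```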

Grafana dashboard best practices

// Panel design principles

// 1. Always set panel title + description
{
  "title": "API p95 Latency",
  "description": "95th percentile response time for all API endpoints. Alert threshold: 500ms."
}

// 2. Set fixed Y-axis bounds; auto-scaling hides the size of spikes
{
  "yaxes": [
    { "min": 0, "max": 1000, "format": "ms" }
  ]
}

// 3. Mark thresholds — show alert lines on graphs
{
  "thresholds": [
    { "value": 500, "color": "yellow", "op": "gt" },
    { "value": 1000, "color": "red", "op": "gt" }
  ]
}
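The three principles above can be checked mechanically. A hypothetical lint pass over classic-graph panel JSON (field names as in the snippets above; newer Grafana schema versions name these fields differently):

```python
def lint_panel(panel):
    """Return violations of the three panel-design principles above.

    A sketch against classic-graph panel JSON fields; not a complete or
    official validator.
    """
    problems = []
    if not panel.get("title"):
        problems.append("missing title")
    if not panel.get("description"):
        problems.append("missing description")
    if not any("max" in axis for axis in panel.get("yaxes", [])):
        problems.append("no fixed Y-axis bound (auto-scaling)")
    if not panel.get("thresholds"):
        problems.append("no threshold markers")
    return problems
```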

Common dashboard anti-patterns

Anti-pattern        | Problem                                                                 | Fix
Decorative metrics  | Metrics that are always green and never alert; provide false confidence | Remove or replace with metrics that reflect real health
Auto-scaling axes   | A latency spike from 10ms to 100ms looks flat because the scale adjusts | Set fixed Y-axis ranges
Too many panels     | Engineers scan but don’t read; important signals are missed             | Keep service health dashboards to < 20 panels
No alert thresholds | Engineers can’t tell what “healthy” vs “degraded” looks like            | Add threshold lines matching alert config
Unknown owners      | Nobody knows who to ask about a metric                                  | Add Team annotations; link to runbooks
Stale dashboards    | Panels show metrics from services that no longer exist                  | Review and prune dashboards quarterly
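The stale-dashboard review can be partly automated. A minimal sketch, assuming you can obtain a last-viewed timestamp per dashboard (the data structure here is invented; sourcing it from usage analytics is left to the reader):

```python
from datetime import datetime, timedelta

def stale_dashboards(last_viewed, now, max_age_days=90):
    """Flag dashboards not viewed within `max_age_days` for quarterly review.

    `last_viewed` maps dashboard title -> last-viewed datetime.
    """
    cutoff = now - timedelta(days=max_age_days)
    return sorted(title for title, seen in last_viewed.items() if seen < cutoff)
```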

SLO burn rate dashboard

A burn rate panel is more actionable than a raw error rate panel:

# Error budget burn rate (1 hour window)
# Shows how fast the error budget is being consumed relative to target
(
  sum(rate(http_requests_total{job="api",status=~"5.."}[1h]))
  /
  sum(rate(http_requests_total{job="api"}[1h]))
) / 0.001  # 0.001 = 1 - SLO target (e.g., 99.9%)

A burn rate > 1 means the budget is being consumed faster than the SLO allows. A burn rate > 14.4 means the monthly (30-day) error budget would be exhausted in about two days, making it an immediate P1 trigger.
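The burn-rate arithmetic is simple enough to sanity-check directly (a 30-day budget window is assumed; adjust for your SLO period):

```python
def burn_rate(error_rate, slo_target):
    """Observed error rate divided by the allowed error rate (1 - SLO)."""
    return error_rate / (1.0 - slo_target)

def days_to_exhaustion(rate, budget_days=30):
    """Days until the error budget is gone at a sustained burn rate."""
    return budget_days / rate
```

A sustained 1.44% error rate against a 99.9% SLO gives a burn rate of 14.4, which exhausts a 30-day budget in roughly 2.1 days.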

Audience-specific design

On-call engineer dashboard:

  • Big, prominent current status (red/amber/green indicators)
  • Recent deployments timeline
  • Alert state panel
  • Links to runbooks

Engineering team dashboard:

  • Error budget status per service (for SLO tracking)
  • Deployment frequency and MTTR trends (DORA metrics)
  • Key business KPIs

Leadership dashboard:

  • SLO compliance (availability, latency SLO) over the month
  • Business metrics (transactions, active users)
  • Incident count and MTTR trends
  • Availability % vs target
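The “availability % vs target” figure is a one-line calculation (a 30-day reporting period is assumed here):

```python
def availability_pct(downtime_minutes, period_days=30):
    """Availability % over a reporting period, given total downtime minutes."""
    total_minutes = period_days * 24 * 60
    return 100.0 * (total_minutes - downtime_minutes) / total_minutes
```

For example, 43.2 minutes of downtime in a 30-day month is exactly the 99.9% budget.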

Review checklist

  • Every dashboard has a stated purpose and named audience
  • Service health dashboards include all four golden signals
  • Alert thresholds are marked on time-series panels
  • Dashboard links to related dashboards and runbooks
  • Y-axis ranges are fixed for SLI metrics — not auto-scaled
  • Dashboards are versioned in code (Grafana-as-code, Terraform) — not manually maintained
  • Stale/unused dashboards are reviewed and removed quarterly
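For the dashboards-as-code item, even a minimal export step keeps dashboards reviewable in version control. A sketch covering only the export half; applying the file (Grafana HTTP API, Terraform, etc.) belongs in CI and is omitted:

```python
import json

def export_dashboard(dashboard, path):
    """Write dashboard JSON with stable key order so diffs stay reviewable."""
    with open(path, "w") as f:
        json.dump(dashboard, f, indent=2, sort_keys=True)
        f.write("\n")
```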