Canary Releases
Status: Complete
Category: Infrastructure
Default enforcement: Advisory
Author: PushBackLog team
Tags
- Topic: infrastructure, delivery, reliability
- Skillset: devops, engineering
- Technology: generic
- Stage: delivery, operations
Summary
A canary release routes a small percentage of production traffic to a new version of a service before promoting it to serve all users. Named after the “canary in a coal mine” — a small, early-warning signal — the practice uses a real subset of production traffic to validate the new version under real conditions, while limiting the blast radius of any failure to a fraction of users. Canary releases sit between blue/green deployments (one instant full cutover) and an unstaged all-at-once rollout: traffic shifts to the new version incrementally, with each increase gated on observed health.
Rationale
Staged confidence reduces risk
No amount of staging environment testing perfectly replicates production traffic patterns, user behaviour, and data. Some failures only surface under specific conditions that occur infrequently in synthetic tests but regularly in production. A canary release exposes the new version to real production conditions at limited scale — catching these issues before they affect all users.
Real traffic beats synthetic tests
A canary receiving 5% of production traffic will encounter edge cases, unusual request patterns, and production data shapes that a staging environment with fabricated test data never will. The canary window provides confidence that is qualitatively different from pre-production testing.
Guidance
Canary release process
```
All traffic → v1.4.2 (stable)
      │
      │ Deploy v1.5.0
      ▼
 5% traffic → v1.5.0 (canary)
95% traffic → v1.4.2 (stable)
      │
      │ Monitor for 15-30 min
      ▼
 25% → v1.5.0
 75% → v1.4.2
      │
      │ Monitor for 30-60 min
      ▼
100% → v1.5.0 (promote)
v1.4.2 decommissioned
```
Promote automatically when metrics stay healthy at each step; roll back automatically when any metric breaches its error threshold.
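The promote-or-rollback loop can be sketched as below. This is an illustrative sketch, not any vendor's controller: `set_weight` and `fetch_error_rate` are hypothetical hooks you would wire to your traffic router and metrics backend, and the weight schedule mirrors the diagram above.

```python
import time

# Weight schedule from the diagram above: (canary %, monitoring window in seconds).
STEPS = [(5, 15 * 60), (25, 30 * 60), (50, 30 * 60)]
ERROR_THRESHOLD = 0.001  # roll back if canary 5xx rate exceeds 0.1%

def run_canary(set_weight, fetch_error_rate, sleep=time.sleep):
    """Walk the weight schedule; return 'promoted' or 'rolled-back'.

    set_weight(pct) shifts pct% of traffic to the canary;
    fetch_error_rate() returns the canary's current 5xx rate.
    Both are hooks you supply (assumptions, not a real API).
    """
    for weight, window in STEPS:
        set_weight(weight)
        sleep(window)                      # let the canary absorb real traffic
        if fetch_error_rate() > ERROR_THRESHOLD:
            set_weight(0)                  # automatic rollback: all traffic to stable
            return "rolled-back"
    set_weight(100)                        # every step healthy: promote
    return "promoted"
```

In practice a platform-level controller (Argo Rollouts, Flagger, or a load-balancer weight API) plays this role; the point is that both promotion and rollback are mechanical decisions driven by metrics, not manual judgment calls.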
Kubernetes: traffic splitting with Argo Rollouts
```yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: api-service
spec:
  replicas: 10
  strategy:
    canary:
      steps:
      - setWeight: 5            # 5% to canary
      - pause: {duration: 15m}  # monitor for 15 minutes
      - setWeight: 25
      - pause: {duration: 30m}
      - setWeight: 50
      - pause: {duration: 30m}
      - setWeight: 100
      analysis:
        templates:
        - templateName: success-rate
        startingStep: 1
        args:
        - name: service-name
          value: api-service-canary
```
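With the Rollout in place, the Argo Rollouts kubectl plugin can observe and override the progression (commands shown against the `api-service` Rollout above):

```shell
# Watch the rollout step through its weight schedule
kubectl argo rollouts get rollout api-service --watch

# Skip remaining pauses and promote immediately (if the canary looks healthy)
kubectl argo rollouts promote api-service

# Abort: shift all traffic back to the stable version
kubectl argo rollouts abort api-service
```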
AWS: weighted target groups
```shell
# Shift 10% of traffic to the canary target group
aws elbv2 modify-rule \
  --rule-arn "$RULE_ARN" \
  --actions '[
    {
      "Type": "forward",
      "ForwardConfig": {
        "TargetGroups": [
          {"TargetGroupArn": "'"$STABLE_TG_ARN"'", "Weight": 90},
          {"TargetGroupArn": "'"$CANARY_TG_ARN"'", "Weight": 10}
        ]
      }
    }
  ]'
```
Key metrics to monitor during canary
| Metric | Threshold example | What it catches |
|---|---|---|
| 5xx error rate | < 0.1% (or same as stable) | Unhandled exceptions, downstream failures |
| p99 latency | ≤ 120% of stable | Performance regressions |
| 4xx error rate | No significant increase | Validation regressions, breaking changes |
| Business KPIs | No statistically significant change | Functional regressions (conversions, add-to-cart, etc.) |
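The comparisons in the table can be sketched as a health check over paired canary/stable samples. The field names (`5xx_rate`, `p99_ms`, `4xx_rate`) and the 10% tolerance on 4xx are illustrative assumptions; adapt them to your metrics store.

```python
def canary_healthy(canary, stable, max_5xx=0.001, latency_ratio=1.2):
    """Compare canary metrics against the stable baseline.

    canary/stable are dicts with '5xx_rate', 'p99_ms', '4xx_rate'
    (hypothetical field names). Returns (healthy, per-check results).
    """
    checks = {
        # 5xx: under the absolute cap, or no worse than stable
        "5xx": canary["5xx_rate"] <= max(max_5xx, stable["5xx_rate"]),
        # p99 latency: at most 120% of the stable baseline
        "p99": canary["p99_ms"] <= latency_ratio * stable["p99_ms"],
        # 4xx: no significant increase (10% tolerance, an arbitrary choice)
        "4xx": canary["4xx_rate"] <= 1.1 * stable["4xx_rate"],
    }
    return all(checks.values()), checks
```

Note that every comparison is against the *stable version right now*, not a historical baseline — this automatically controls for time-of-day and traffic-mix effects that both versions experience equally.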
Automated rollback triggers based on these metrics:
```yaml
# Argo Rollouts AnalysisTemplate
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: success-rate
spec:
  args:
  - name: service-name          # supplied by the Rollout's analysis.args
  metrics:
  - name: success-rate
    interval: 5m
    successCondition: result[0] >= 0.95
    failureLimit: 3
    provider:
      prometheus:
        address: http://prometheus:9090
        query: |
          sum(rate(http_requests_total{job="{{args.service-name}}", status!~"5.."}[5m]))
          /
          sum(rate(http_requests_total{job="{{args.service-name}}"}[5m]))
```
Target selection strategies
The canary cohort does not have to be a random sample. You can target:
- Cookie/header-based: specific users opted in to early access
- User-ID modulo: users with ID % 20 == 0 get the canary
- Geography-based: one region gets the canary first
- Internal users: employees before external customers
Targeted canaries allow testing with a group whose experience you can monitor closely (e.g., your own employees).
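A minimal sketch of deterministic cohort assignment. Hashing the user ID (rather than the plain `id % 20` above) keeps assignment stable per user while avoiding bias from sequential or auto-incremented IDs; the function name and 5% default are illustrative.

```python
import hashlib

def in_canary(user_id: str, percent: int = 5) -> bool:
    """Deterministic bucket assignment: the same user always gets the
    same answer, so their experience doesn't flip between versions
    mid-session. percent is the share of users routed to the canary."""
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100
    return bucket < percent
```

Cookie-, geography-, or employee-based targeting works the same way, just with a different predicate: the essential property is that assignment is sticky, so a given user sees one version consistently during the canary window.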
Canary vs. blue/green
| Factor | Canary | Blue/Green |
|---|---|---|
| Blast radius | Fraction of users | All users at once |
| Rollback speed | Traffic re-weighting (seconds) | Load balancer switch (seconds) |
| Infrastructure cost | Proportional to traffic split | ~2× cost during switch |
| Complexity | Higher | Medium |
| Real-traffic validation | Yes — before full rollout | No — full traffic immediately |
Review checklist
- Canary percentage starts small (1–10%) for high-risk changes
- Automated analysis evaluates error rate and latency against stable baseline
- Automated rollback triggers when analysis fails
- Business KPIs are monitored alongside technical metrics
- Duration at each step is appropriate for traffic volume (low-traffic services may need longer windows for statistical significance)
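The last checklist item can be made concrete with a back-of-envelope sample-size estimate. This is a rough two-proportion calculation (normal approximation, ~95% confidence / 80% power), not a full statistical test; the function name is illustrative.

```python
import math

def min_canary_requests(p_stable: float, p_canary: float,
                        z_alpha: float = 1.96, z_beta: float = 0.84) -> int:
    """Approximate number of requests the canary must serve to
    distinguish error rate p_canary from baseline p_stable.
    Standard two-proportion sample-size formula (normal approximation)."""
    p_bar = (p_stable + p_canary) / 2
    numerator = (z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
                 + z_beta * math.sqrt(p_stable * (1 - p_stable)
                                      + p_canary * (1 - p_canary))) ** 2
    return math.ceil(numerator / (p_stable - p_canary) ** 2)
```

Distinguishing a 0.2% canary error rate from a 0.1% baseline needs tens of thousands of canary requests, so a 5% canary on a low-traffic service may need hours per step rather than the 15-30 minutes shown above.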