PushBackLog

Canary Releases


Status: Complete
Category: Infrastructure
Default enforcement: Advisory
Author: PushBackLog team


Tags

  • Topic: infrastructure, delivery, reliability
  • Skillset: devops, engineering
  • Technology: generic
  • Stage: delivery, operations

Summary

A canary release routes a small percentage of production traffic to a new version of a service before promoting it to serve all users. The name comes from the “canary in a coal mine”, a small early-warning signal: the practice validates the new version against a real subset of production traffic while limiting the blast radius of any failure to a fraction of users. Canary releases differ from blue/green deployments (an instant, full switch) and from plain continuous deployment (an immediate full rollout) in that exposure to the new version grows in stages.


Rationale

Staged confidence reduces risk

No amount of staging environment testing perfectly replicates production traffic patterns, user behaviour, and data. Some failures only surface under specific conditions that occur infrequently in synthetic tests but regularly in production. A canary release exposes the new version to real production conditions at limited scale — catching these issues before they affect all users.

Real traffic beats synthetic tests

A canary receiving 5% of production traffic will encounter edge cases, unusual request patterns, and production data shapes that a staging environment with fabricated test data never will. The canary window provides confidence that is qualitatively different from pre-production testing.


Guidance

Canary release process

    All traffic → v1.4.2 (stable)

           │ Deploy v1.5.0

    5% traffic → v1.5.0 (canary)
   95% traffic → v1.4.2 (stable)

           │ Monitor for 15-30 min

    25% → v1.5.0
    75% → v1.4.2

           │ Monitor for 30-60 min

   100% → v1.5.0 (promote)
   v1.4.2 decommissioned

Automatically promote if metrics are healthy. Automatically roll back if metrics exceed error thresholds.
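
The promote-or-roll-back flow in the diagram can be sketched as a small loop. This is a minimal illustration, not a production controller: `Step`, `run_canary`, `set_weight`, and `is_healthy` are hypothetical names, and `is_healthy` is assumed to wait out the monitoring window before reporting.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class Step:
    weight: int        # percent of traffic sent to the canary
    soak_minutes: int  # how long to observe before moving on


# Mirrors the diagram: 5% -> 25% -> 100%, using the lower bound of each window.
STEPS = (Step(5, 15), Step(25, 30), Step(100, 0))


def run_canary(
    steps: Sequence[Step],
    set_weight: Callable[[int], None],
    is_healthy: Callable[[Step], bool],
) -> str:
    """Walk the steps in order; roll back on the first unhealthy observation."""
    for step in steps:
        set_weight(step.weight)
        if not is_healthy(step):
            set_weight(0)  # send all traffic back to the stable version
            return "rolled_back"
    return "promoted"
```

Passing the traffic-shifting and health-check functions in as arguments keeps the loop independent of any particular load balancer or metrics backend.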

Kubernetes: traffic splitting with Argo Rollouts

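apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: api-service
spec:
  replicas: 10
  # selector and template are omitted here; they take the same form as in a Deployment
  strategy:
    canary:
      # Without a trafficRouting block, weights are approximated by pod counts,
      # so with 10 replicas the smallest canary is one pod (about 10% of traffic).
      steps:
        - setWeight: 5            # 5% to canary
        - pause: {duration: 15m}  # Monitor for 15 minutes
        - setWeight: 25
        - pause: {duration: 30m}
        - setWeight: 50
        - pause: {duration: 30m}
        - setWeight: 100
      analysis:                   # background analysis; a failed run aborts the rollout
        templates:
          - templateName: success-rate
        startingStep: 1           # delay analysis until step index 1 (the first pause)
        args:
          - name: service-name
            value: api-service-canary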

AWS: weighted target groups

# Shift 10% to the new target group
# (variables are double-quoted so the ARNs are passed through intact)
aws elbv2 modify-rule \
  --rule-arn "$RULE_ARN" \
  --actions '[
    {
      "Type": "forward",
      "ForwardConfig": {
        "TargetGroups": [
          {"TargetGroupArn": "'"$STABLE_TG_ARN"'", "Weight": 90},
          {"TargetGroupArn": "'"$CANARY_TG_ARN"'", "Weight": 10}
        ]
      }
    }
  ]'
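
When the weight changes at every step, it can be easier to generate the `--actions` payload than to edit it by hand. The sketch below builds the same JSON structure as the command above; `forward_action` is a hypothetical helper, and the ARNs passed to it are whatever your stable and canary target groups use.

```python
import json


def forward_action(stable_tg_arn: str, canary_tg_arn: str, canary_percent: int) -> str:
    """Return the JSON for --actions with the given canary share."""
    if not 0 <= canary_percent <= 100:
        raise ValueError("canary_percent must be between 0 and 100")
    action = {
        "Type": "forward",
        "ForwardConfig": {
            "TargetGroups": [
                # The two weights always sum to 100, so they read as percentages.
                {"TargetGroupArn": stable_tg_arn, "Weight": 100 - canary_percent},
                {"TargetGroupArn": canary_tg_arn, "Weight": canary_percent},
            ]
        },
    }
    return json.dumps([action])
```

The returned string can be passed as the `--actions` argument in place of the inline JSON.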

Key metrics to monitor during canary

| Metric | Threshold example | What it catches |
|---|---|---|
| 5xx error rate | < 0.1% (or same as stable) | Unhandled exceptions, downstream failures |
| p99 latency | ≤ 120% of stable | Performance regressions |
| 4xx error rate | No significant increase | Validation regressions, breaking changes |
| Business KPIs | No statistically significant change | Functional regressions (conversions, add-to-cart, etc.) |

Automated rollback triggers based on these metrics:

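# Argo Rollouts AnalysisTemplate
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: success-rate
spec:
  args:
    - name: service-name         # supplied by the Rollout's analysis args
  metrics:
    - name: success-rate
      interval: 5m
      # Example threshold: at least 95% non-5xx responses. Tighten it
      # (for example to 0.999) to match a 0.1% error-rate target.
      successCondition: result[0] >= 0.95
      failureLimit: 3            # tolerate up to 3 failed measurements
      provider:
        prometheus:
          address: http://prometheus:9090
          query: |
            sum(rate(http_requests_total{job="{{args.service-name}}",status!~"5.."}[5m]))
            /
            sum(rate(http_requests_total{job="{{args.service-name}}"}[5m]))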
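
Outside Argo Rollouts, the two technical thresholds from the metrics table (5xx error rate and p99 latency) can be checked with a plain function. This is a sketch under assumptions: `Metrics` and `canary_violations` are hypothetical names, the numbers are the table's example thresholds, and the 4xx and business-KPI checks are left out because they need a statistical comparison rather than a fixed cutoff.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Metrics:
    error_rate_5xx: float  # fraction of requests, e.g. 0.001 == 0.1%
    p99_latency_ms: float


def canary_violations(stable: Metrics, canary: Metrics) -> list[str]:
    """Return the thresholds the canary breaks; an empty list means healthy."""
    violations = []
    # 5xx: acceptable if under 0.1%, or if no worse than stable.
    if canary.error_rate_5xx >= 0.001 and canary.error_rate_5xx > stable.error_rate_5xx:
        violations.append("5xx error rate")
    # p99: acceptable if at most 120% of stable.
    if canary.p99_latency_ms > 1.2 * stable.p99_latency_ms:
        violations.append("p99 latency")
    return violations
```

Returning the list of broken thresholds, rather than a bare boolean, gives the rollback log a reason to record.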

Target selection strategies

The canary cohort does not have to be a random 5% of users. You can target:

  • Cookie/header-based: specific users opted in to early access
  • User-ID modulo: users with ID % 20 == 0 get the canary
  • Geography-based: one region gets the canary first
  • Internal users: employees before external customers

Targeted canaries allow testing with a group whose experience you can monitor closely (e.g., your own employees).
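
A few of these strategies can be combined in one routing predicate. The sketch below is illustrative only: `in_canary` and `in_canary_hashed` are hypothetical helpers, and the hashed variant is an addition for cases where user IDs are not numeric or not evenly distributed.

```python
import hashlib


def in_canary(user_id: int, *, is_employee: bool = False,
              opted_in: bool = False, modulus: int = 20) -> bool:
    """Decide whether a request from this user goes to the canary."""
    if is_employee or opted_in:
        return True                   # internal users and early-access opt-ins
    return user_id % modulus == 0     # ID % 20 == 0 selects 1 in 20 users (5%)


def in_canary_hashed(user_key: str, percent: int = 5) -> bool:
    """Variant for arbitrary string keys: hash, then bucket into 0..99."""
    digest = hashlib.sha256(user_key.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100
    return bucket < percent
```

Both functions are deterministic, so a given user stays on the same version for the whole canary window instead of flipping between versions per request.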

Canary vs. blue/green

| Factor | Canary | Blue/Green |
|---|---|---|
| Blast radius | Fraction of users | All users at once |
| Rollback speed | Traffic re-weighting (seconds) | Load balancer switch (seconds) |
| Infrastructure cost | Proportional to traffic split | ~2× cost during switch |
| Complexity | Higher | Medium |
| Real-traffic validation | Yes, before full rollout | No, full traffic immediately |

Review checklist

  • Canary percentage starts small (1–10%) for high-risk changes
  • Automated analysis evaluates error rate and latency against stable baseline
  • Automated rollback triggers when analysis fails
  • Business KPIs are monitored alongside technical metrics
  • Duration at each step is appropriate for traffic volume (low-traffic services may need longer windows for statistical significance)
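
The last checklist item comes down to simple arithmetic: the canary only sees its share of the traffic, so the window must be long enough for that share to add up to a meaningful sample. The sketch below assumes a hypothetical target of a fixed minimum number of canary requests; the right minimum depends on the error rates you need to distinguish and is not specified here.

```python
import math


def min_window_minutes(requests_per_second: float, canary_percent: float,
                       min_canary_requests: int) -> int:
    """Minutes needed for the canary to receive at least min_canary_requests."""
    canary_rps = requests_per_second * canary_percent / 100
    if canary_rps <= 0:
        raise ValueError("canary must receive some traffic")
    return math.ceil(min_canary_requests / canary_rps / 60)
```

For example, with an illustrative target of 10,000 canary requests, a service at 200 requests per second with a 5% canary needs about 17 minutes, while a service at 2 requests per second needs roughly 28 hours at the same weight.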