Canary Releases
Status: Complete
Category: Infrastructure
Default enforcement: Advisory
Author: PushBackLog team
Tags
- Topic: infrastructure, delivery, reliability
- Skillset: devops, engineering
- Technology: generic
- Stage: delivery, operations
Summary
A canary release routes a small percentage of production traffic to a new version of a service before promoting it to serve all users. Named after the “canary in a coal mine” — a small, early-warning signal — the practice uses a real subset of production traffic to validate the new version under real conditions, while limiting the blast radius of any failure to a fraction of users. Canary releases sit between blue/green deployments (one instant full cutover) and an unstaged all-at-once rollout: traffic shifts to the new version incrementally, with each increase gated on observed health.
Rationale
Staged confidence reduces risk
No amount of staging environment testing perfectly replicates production traffic patterns, user behaviour, and data. Some failures only surface under specific conditions that occur infrequently in synthetic tests but regularly in production. A canary release exposes the new version to real production conditions at limited scale — catching these issues before they affect all users.
Real traffic beats synthetic tests
A canary receiving 5% of production traffic will encounter edge cases, unusual request patterns, and production data shapes that a staging environment with fabricated test data never will. The canary window provides confidence that is qualitatively different from pre-production testing.
Guidance
Canary release process
```
All traffic → v1.4.2 (stable)
      │
      │ Deploy v1.5.0
      ▼
 5% traffic → v1.5.0 (canary)
95% traffic → v1.4.2 (stable)
      │
      │ Monitor for 15-30 min
      ▼
 25% → v1.5.0
 75% → v1.4.2
      │
      │ Monitor for 30-60 min
      ▼
100% → v1.5.0 (promote)
v1.4.2 decommissioned
```
Promote automatically when metrics stay healthy at each step; roll back automatically when any metric breaches its error threshold.
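The promote-or-rollback loop can be sketched as below. This is an illustrative sketch, not any vendor's controller: `set_weight` and `fetch_error_rate` are hypothetical hooks you would wire to your traffic router and metrics backend, and the weight schedule mirrors the diagram above.

```python
import time

# Weight schedule from the diagram above: (canary %, monitoring window in seconds).
STEPS = [(5, 15 * 60), (25, 30 * 60), (50, 30 * 60)]
ERROR_THRESHOLD = 0.001  # roll back if canary 5xx rate exceeds 0.1%

def run_canary(set_weight, fetch_error_rate, sleep=time.sleep):
    """Walk the weight schedule; return 'promoted' or 'rolled-back'.

    set_weight(pct) shifts pct% of traffic to the canary;
    fetch_error_rate() returns the canary's current 5xx rate.
    Both are hooks you supply (assumptions, not a real API).
    """
    for weight, window in STEPS:
        set_weight(weight)
        sleep(window)                      # let the canary absorb real traffic
        if fetch_error_rate() > ERROR_THRESHOLD:
            set_weight(0)                  # automatic rollback: all traffic to stable
            return "rolled-back"
    set_weight(100)                        # every step healthy: promote
    return "promoted"
```

In practice a platform-level controller (Argo Rollouts, Flagger, or a load-balancer weight API) plays this role; the point is that both promotion and rollback are mechanical decisions driven by metrics, not manual judgment calls.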
Kubernetes: traffic splitting with Argo Rollouts
```yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: api-service
spec:
  replicas: 10
  strategy:
    canary:
      steps:
      - setWeight: 5            # 5% to canary
      - pause: {duration: 15m}  # monitor for 15 minutes
      - setWeight: 25
      - pause: {duration: 30m}
      - setWeight: 50
      - pause: {duration: 30m}
      - setWeight: 100
      analysis:
        templates:
        - templateName: success-rate
        startingStep: 1
        args:
        - name: service-name
          value: api-service-canary
```
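With the Rollout in place, the Argo Rollouts kubectl plugin can observe and override the progression (commands shown against the `api-service` Rollout above):

```shell
# Watch the rollout step through its weight schedule
kubectl argo rollouts get rollout api-service --watch

# Skip remaining pauses and promote immediately (if the canary looks healthy)
kubectl argo rollouts promote api-service

# Abort: shift all traffic back to the stable version
kubectl argo rollouts abort api-service
```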
AWS: weighted target groups
```shell
# Shift 10% of traffic to the canary target group
aws elbv2 modify-rule \
  --rule-arn "$RULE_ARN" \
  --actions '[
    {
      "Type": "forward",
      "ForwardConfig": {
        "TargetGroups": [
          {"TargetGroupArn": "'"$STABLE_TG_ARN"'", "Weight": 90},
          {"TargetGroupArn": "'"$CANARY_TG_ARN"'", "Weight": 10}
        ]
      }
    }
  ]'
```
Key metrics to monitor during canary
| Metric | Threshold example | What it catches |
|---|---|---|
| 5xx error rate | < 0.1% (or same as stable) | Unhandled exceptions, downstream failures |
| p99 latency | ≤ 120% of stable | Performance regressions |
| 4xx error rate | No significant increase | Validation regressions, breaking changes |
| Business KPIs | No statistically significant change | Functional regressions (conversions, add-to-cart, etc.) |
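The comparisons in the table can be sketched as a health check over paired canary/stable samples. The field names (`5xx_rate`, `p99_ms`, `4xx_rate`) and the 10% tolerance on 4xx are illustrative assumptions; adapt them to your metrics store.

```python
def canary_healthy(canary, stable, max_5xx=0.001, latency_ratio=1.2):
    """Compare canary metrics against the stable baseline.

    canary/stable are dicts with '5xx_rate', 'p99_ms', '4xx_rate'
    (hypothetical field names). Returns (healthy, per-check results).
    """
    checks = {
        # 5xx: under the absolute cap, or no worse than stable
        "5xx": canary["5xx_rate"] <= max(max_5xx, stable["5xx_rate"]),
        # p99 latency: at most 120% of the stable baseline
        "p99": canary["p99_ms"] <= latency_ratio * stable["p99_ms"],
        # 4xx: no significant increase (10% tolerance, an arbitrary choice)
        "4xx": canary["4xx_rate"] <= 1.1 * stable["4xx_rate"],
    }
    return all(checks.values()), checks
```

Note that every comparison is against the *stable version right now*, not a historical baseline — this automatically controls for time-of-day and traffic-mix effects that both versions experience equally.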
Automated rollback triggers based on these metrics:
```yaml
# Argo Rollouts AnalysisTemplate
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: success-rate
spec:
  args:
  - name: service-name          # supplied by the Rollout's analysis.args
  metrics:
  - name: success-rate
    interval: 5m
    successCondition: result[0] >= 0.95
    failureLimit: 3
    provider:
      prometheus:
        address: http://prometheus:9090
        query: |
          sum(rate(http_requests_total{job="{{args.service-name}}", status!~"5.."}[5m]))
          /
          sum(rate(http_requests_total{job="{{args.service-name}}"}[5m]))
```
Target selection strategies
The canary cohort does not have to be a random sample. You can target:
- Cookie/header-based: specific users opted in to early access
- User-ID modulo: users with ID % 20 == 0 get the canary
- Geography-based: one region gets the canary first
- Internal users: employees before external customers
Targeted canaries allow testing with a group whose experience you can monitor closely (e.g., your own employees).
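A minimal sketch of deterministic cohort assignment. Hashing the user ID (rather than the plain `id % 20` above) keeps assignment stable per user while avoiding bias from sequential or auto-incremented IDs; the function name and 5% default are illustrative.

```python
import hashlib

def in_canary(user_id: str, percent: int = 5) -> bool:
    """Deterministic bucket assignment: the same user always gets the
    same answer, so their experience doesn't flip between versions
    mid-session. percent is the share of users routed to the canary."""
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100
    return bucket < percent
```

Cookie-, geography-, or employee-based targeting works the same way, just with a different predicate: the essential property is that assignment is sticky, so a given user sees one version consistently during the canary window.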
Canary vs. blue/green
| Factor | Canary | Blue/Green |
|---|---|---|
| Blast radius | Fraction of users | All users at once |
| Rollback speed | Traffic re-weighting (seconds) | Load balancer switch (seconds) |
| Infrastructure cost | Proportional to traffic split | ~2× cost during switch |
| Complexity | Higher | Medium |
| Real-traffic validation | Yes — before full rollout | No — full traffic immediately |
Review checklist
- Canary percentage starts small (1–10%) for high-risk changes
- Automated analysis evaluates error rate and latency against stable baseline
- Automated rollback triggers when analysis fails
- Business KPIs are monitored alongside technical metrics
- Duration at each step is appropriate for traffic volume (low-traffic services may need longer windows for statistical significance)
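The last checklist item can be made concrete with a back-of-envelope sample-size estimate. This is a rough two-proportion calculation (normal approximation, ~95% confidence / 80% power), not a full statistical test; the function name is illustrative.

```python
import math

def min_canary_requests(p_stable: float, p_canary: float,
                        z_alpha: float = 1.96, z_beta: float = 0.84) -> int:
    """Approximate number of requests the canary must serve to
    distinguish error rate p_canary from baseline p_stable.
    Standard two-proportion sample-size formula (normal approximation)."""
    p_bar = (p_stable + p_canary) / 2
    numerator = (z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
                 + z_beta * math.sqrt(p_stable * (1 - p_stable)
                                      + p_canary * (1 - p_canary))) ** 2
    return math.ceil(numerator / (p_stable - p_canary) ** 2)
```

Distinguishing a 0.2% canary error rate from a 0.1% baseline needs tens of thousands of canary requests, so a 5% canary on a low-traffic service may need hours per step rather than the 15-30 minutes shown above.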