Problem Statement
Explain Canary release implementation including traffic routing, metrics monitoring, automated promotion/rollback, and integration with service mesh.
Explanation
Canary releases progressively roll out new version monitoring metrics to detect issues early. Implementation requires: traffic routing capability, comprehensive monitoring, automated decision making, rollback mechanism.
Traffic routing methods:
1. Load balancer weight-based routing:
```yaml
# AWS ALB target groups
resource "aws_lb_listener_rule" "canary" {
listener_arn = aws_lb_listener.main.arn
action {
type = "forward"
forward {
target_group {
arn = aws_lb_target_group.stable.arn
weight = 90
}
target_group {
arn = aws_lb_target_group.canary.arn
weight = 10
}
}
}
}
```
2. Kubernetes with Flagger (service mesh):
```yaml
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
name: myapp
spec:
targetRef:
apiVersion: apps/v1
kind: Deployment
name: myapp
service:
port: 80
analysis:
interval: 1m
threshold: 5
maxWeight: 50
stepWeight: 10
metrics:
- name: request-success-rate
thresholdRange:
min: 99
interval: 1m
- name: request-duration
thresholdRange:
max: 500
interval: 1m
webhooks:
- name: load-test
url: http://flagger-loadtester/
timeout: 5s
metadata:
cmd: "hey -z 1m -q 10 -c 2 http://myapp-canary/"
```
3. Istio traffic splitting:
```yaml
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
name: myapp
spec:
hosts:
- myapp
http:
- match:
- headers:
x-canary:
exact: "true"
route:
- destination:
host: myapp
subset: canary
- route:
- destination:
host: myapp
subset: stable
weight: 90
- destination:
host: myapp
subset: canary
weight: 10
---
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
name: myapp
spec:
host: myapp
subsets:
- name: stable
labels:
version: stable
- name: canary
labels:
version: canary
```
Metrics monitoring for canary analysis:
```python
class CanaryMetrics:
def __init__(self, stable_version, canary_version):
self.stable = stable_version
self.canary = canary_version
def analyze(self):
metrics = {
'error_rate': self.compare_error_rate(),
'latency_p95': self.compare_latency(),
'latency_p99': self.compare_latency(percentile=99),
'throughput': self.compare_throughput(),
'cpu_usage': self.compare_cpu(),
'memory_usage': self.compare_memory()
}
# Business metrics
business_metrics = {
'conversion_rate': self.compare_conversion(),
'cart_abandonment': self.compare_abandonment(),
'avg_order_value': self.compare_aov()
}
return self.evaluate(metrics, business_metrics)
def compare_error_rate(self):
stable_errors = self.get_metric('error_rate', self.stable)
canary_errors = self.get_metric('error_rate', self.canary)
# Canary error rate should not exceed stable by more than 1%
threshold = stable_errors + 1.0
return {
'stable': stable_errors,
'canary': canary_errors,
'threshold': threshold,
'passed': canary_errors <= threshold
}
def evaluate(self, metrics, business_metrics):
failed = []
for name, result in {**metrics, **business_metrics}.items():
if not result['passed']:
failed.append(f"{name}: {result['canary']} vs {result['stable']}")
if failed:
return {'decision': 'ROLLBACK', 'reasons': failed}
else:
return {'decision': 'PROMOTE', 'reasons': []}
```
Automated canary progression:
```bash
#!/bin/bash
# canary-deploy.sh
VERSION=$1
STAGES=(10 25 50 75 100)
function deploy_canary() {
local weight=$1
echo "Setting canary weight to $weight%"
kubectl patch virtualservice myapp --type=json -p='[
{"op": "replace", "path": "/spec/http/0/route/0/weight", "value": '$((100-weight))'},
{"op": "replace", "path": "/spec/http/0/route/1/weight", "value": '$weight'}
]'
}
function analyze_canary() {
python3 analyze-canary.py --stable=stable --canary=canary --duration=5m
return $?
}
function rollback() {
echo "Rolling back canary"
deploy_canary 0
kubectl delete deployment myapp-canary
exit 1
}
# Deploy canary deployment
kubectl apply -f k8s/canary-deployment.yaml
# Progressive rollout
for stage in "${STAGES[@]}"; do
echo "\n=== Canary stage: $stage% ==="
deploy_canary $stage
echo "Monitoring for 5 minutes..."
if ! analyze_canary; then
echo "Canary analysis failed"
rollback
fi
if [ $stage -eq 100 ]; then
echo "Canary fully deployed"
kubectl delete deployment myapp-stable
kubectl apply -f k8s/stable-deployment.yaml
fi
done
```
Flagger automated canary:
```yaml
# Flagger automatically handles progressive rollout
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
name: myapp
spec:
progressDeadlineSeconds: 600
service:
port: 80
analysis:
interval: 1m
threshold: 5
iterations: 10
match:
- headers:
x-canary:
exact: "insider"
metrics:
- name: error-rate
templateRef:
name: error-rate
thresholdRange:
max: 1
- name: latency
templateRef:
name: latency
thresholdRange:
max: 500
webhooks:
- name: acceptance-test
type: pre-rollout
url: http://flagger-loadtester/
timeout: 30s
metadata:
type: bash
cmd: "curl -sd 'test' http://myapp-canary/token | grep token"
- name: load-test
type: rollout
url: http://flagger-loadtester/
metadata:
cmd: "hey -z 2m -q 10 -c 2 http://myapp-canary/"
```
Monitoring dashboard configuration:
```json
{
"dashboard": {
"title": "Canary Deployment",
"panels": [
{
"title": "Error Rate",
"targets": [
{
"expr": "sum(rate(http_requests_total{status=~\"5..\", version=\"stable\"}[5m])) / sum(rate(http_requests_total{version=\"stable\"}[5m])) * 100",
"legendFormat": "Stable"
},
{
"expr": "sum(rate(http_requests_total{status=~\"5..\", version=\"canary\"}[5m])) / sum(rate(http_requests_total{version=\"canary\"}[5m])) * 100",
"legendFormat": "Canary"
}
]
},
{
"title": "Latency P95",
"targets": [
{
"expr": "histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket{version=\"stable\"}[5m])) by (le))",
"legendFormat": "Stable"
},
{
"expr": "histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket{version=\"canary\"}[5m])) by (le))",
"legendFormat": "Canary"
}
]
}
]
}
}
```
CI/CD integration:
```yaml
stages:
- build
- deploy-canary
- analyze
- promote
deploy-canary:
stage: deploy-canary
script:
- kubectl apply -f k8s/canary/
- kubectl annotate canary/myapp flagger.app/trigger="$CI_COMMIT_SHA"
only:
- main
monitor-canary:
stage: analyze
script:
- ./wait-for-canary.sh myapp
timeout: 30m
promote-canary:
stage: promote
script:
- kubectl get canary myapp -o json | jq '.status.phase'
when: on_success
```
Best practices: start with small percentage (5-10%), monitor comprehensive metrics (technical and business), implement automated analysis, use service mesh for advanced routing, test rollback procedures, implement gradual rollout stages, use statistical significance testing, combine with feature flags for additional control, maintain observability throughout rollout. Understanding canary releases enables safe, data-driven progressive deployments.