
Canary Deployment


Overview


Deploy new versions gradually to a small percentage of users, monitor metrics for issues, and automatically rollback or proceed based on predefined thresholds.
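The control loop described above (shift traffic in steps, check a metric, roll back on breach) can be sketched in a few lines of shell. This is a minimal illustration, not a production controller: `check_error_rate` is a stand-in for a real metrics query, and the step size and threshold are placeholder values.

```shell
#!/usr/bin/env bash
# Minimal sketch of the canary control loop: shift traffic in steps,
# check a metric after each step, and roll back if it breaches the threshold.
set -euo pipefail

STEP_WEIGHT=10        # percentage points added per step (illustrative)
MAX_ERROR_RATE=0.05   # rollback threshold: 5% errors (illustrative)

# Stand-in for a real metrics query (Prometheus, Datadog, ...).
check_error_rate() {
  echo "0.01"
}

run_canary() {
  local weight=$STEP_WEIGHT rate
  while [ "$weight" -le 100 ]; do
    rate=$(check_error_rate)
    # awk exits 0 (truthy for `if`) when the rate exceeds the threshold
    if awk -v r="$rate" -v t="$MAX_ERROR_RATE" 'BEGIN { exit !(r > t) }'; then
      echo "rollback at ${weight}%"
      return 1
    fi
    echo "canary at ${weight}%"
    weight=$((weight + STEP_WEIGHT))
  done
  echo "promoted to stable"
}

run_canary
```

In a real rollout, the `echo` lines would be replaced by traffic-weight updates and a rollback command; the Istio and script examples below show concrete versions of both.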

When to Use


  • Low-risk gradual rollouts
  • Real-world testing with live traffic
  • Automatic rollback on errors
  • User impact minimization
  • A/B testing integration
  • Metrics-driven deployments
  • High-traffic services

Implementation Examples


1. Istio-based Canary Deployment


yaml

canary-deployment-istio.yaml

apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp-v1
  namespace: production
spec:
  replicas: 3
  selector:
    matchLabels:
      app: myapp
      version: v1
  template:
    metadata:
      labels:
        app: myapp
        version: v1
    spec:
      containers:
        - name: myapp
          image: myrepo/myapp:1.0.0
          ports:
            - containerPort: 8080
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp-v2
  namespace: production
spec:
  replicas: 1  # Start with minimal replicas for the canary
  selector:
    matchLabels:
      app: myapp
      version: v2
  template:
    metadata:
      labels:
        app: myapp
        version: v2
    spec:
      containers:
        - name: myapp
          image: myrepo/myapp:2.0.0
          ports:
            - containerPort: 8080
---
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: myapp
  namespace: production
spec:
  hosts:
    - myapp
  http:
    # Header-based canary: route Chrome traffic straight to v2
    - match:
        - headers:
            user-agent:
              regex: ".*Chrome.*"  # Test with Chrome
      route:
        - destination:
            host: myapp
            subset: v2
          weight: 100
      timeout: 10s
    # Default route with traffic split: 95% to v1, 5% to v2
    - route:
        - destination:
            host: myapp
            subset: v1
          weight: 95
        - destination:
            host: myapp
            subset: v2
          weight: 5
      timeout: 10s
---
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: myapp
  namespace: production
spec:
  host: myapp
  trafficPolicy:
    connectionPool:
      http:
        http1MaxPendingRequests: 100
        maxRequestsPerConnection: 2
  subsets:
    - name: v1
      labels:
        version: v1
    - name: v2
      labels:
        version: v2
      trafficPolicy:
        outlierDetection:
          consecutive5xxErrors: 3
          interval: 30s
          baseEjectionTime: 30s
---
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: myapp
  namespace: production
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: myapp
  progressDeadlineSeconds: 300
  service:
    port: 80
  analysis:
    interval: 1m
    threshold: 5
    maxWeight: 50
    stepWeight: 5
    stepWeightPromotion: 10
    metrics:
      - name: request-success-rate
        thresholdRange:
          min: 99
        interval: 1m
      - name: request-duration
        thresholdRange:
          max: 500
        interval: 30s
    webhooks:
      - name: acceptance-test
        url: http://flagger-loadtester/
        timeout: 30s
        metadata:
          type: smoke
          cmd: "curl -sd 'test' http://myapp-canary/api/test"
      - name: load-test
        url: http://flagger-loadtester/
        timeout: 5s
        metadata:
          cmd: "hey -z 1m -q 10 -c 2 http://myapp-canary/"
          logCmdOutput: "true"
  # Automatic rollback on failure
  skipAnalysis: false

2. Kubernetes Native Canary Script


bash

#!/bin/bash
# canary-rollout.sh - Canary deployment with k8s native tools

set -euo pipefail

NAMESPACE="${1:-production}"
DEPLOYMENT="${2:-myapp}"
NEW_VERSION="${3:-latest}"
CANARY_WEIGHT=10
MAX_WEIGHT=100
STEP_WEIGHT=10
CHECK_INTERVAL=60
MAX_ERROR_RATE=0.05

echo "Starting canary deployment for $DEPLOYMENT with version $NEW_VERSION"

# Get current replicas
CURRENT_REPLICAS=$(kubectl get deployment "$DEPLOYMENT" -n "$NAMESPACE" \
  -o jsonpath='{.spec.replicas}')
CANARY_REPLICAS=$((CURRENT_REPLICAS / 10 + 1))
echo "Current replicas: $CURRENT_REPLICAS, Canary replicas: $CANARY_REPLICAS"

# Create canary deployment
kubectl set image "deployment/${DEPLOYMENT}-canary" \
  "${DEPLOYMENT}=myrepo/${DEPLOYMENT}:${NEW_VERSION}" \
  -n "$NAMESPACE"

# Scale up canary gradually
CURRENT_WEIGHT=$CANARY_WEIGHT
while [ "$CURRENT_WEIGHT" -le "$MAX_WEIGHT" ]; do
  echo "Setting traffic to canary: ${CURRENT_WEIGHT}%"

  # Update ingress or service to split traffic
  kubectl patch virtualservice "${DEPLOYMENT}" -n "$NAMESPACE" --type merge \
    -p '{"spec":{"http":[{"route":[{"destination":{"host":"'"${DEPLOYMENT}"'-stable"},"weight":'"$((100 - CURRENT_WEIGHT))"'},{"destination":{"host":"'"${DEPLOYMENT}"'-canary"},"weight":'"${CURRENT_WEIGHT}"'}]}]}}'

  # Wait and check metrics
  echo "Monitoring metrics for ${CHECK_INTERVAL}s..."
  sleep "$CHECK_INTERVAL"

  # Check error rate
  # NOTE: placeholder metric; substitute a real error-rate query in production
  ERROR_RATE=$(kubectl exec "deployment/${DEPLOYMENT}-canary" -n "$NAMESPACE" -- \
    curl -s http://localhost:8080/metrics | grep http_requests_total | \
    awk '{print $2}' || echo "0")

  if (( $(echo "$ERROR_RATE > $MAX_ERROR_RATE" | bc -l) )); then
    echo "ERROR: Error rate exceeded threshold: $ERROR_RATE"
    echo "Rolling back canary deployment..."
    kubectl patch virtualservice "${DEPLOYMENT}" -n "$NAMESPACE" --type merge \
      -p '{"spec":{"http":[{"route":[{"destination":{"host":"'"${DEPLOYMENT}"'-stable"},"weight":100}]}]}}'
    exit 1
  fi

  CURRENT_WEIGHT=$((CURRENT_WEIGHT + STEP_WEIGHT))
done

# Promote canary to stable
echo "Canary deployment successful! Promoting to stable..."
kubectl set image "deployment/${DEPLOYMENT}" \
  "${DEPLOYMENT}=myrepo/${DEPLOYMENT}:${NEW_VERSION}" \
  -n "$NAMESPACE"
kubectl rollout status "deployment/${DEPLOYMENT}" -n "$NAMESPACE" --timeout=5m
echo "Canary deployment complete!"

3. Metrics-Based Canary Analysis


yaml

canary-monitoring.yaml

apiVersion: v1
kind: ConfigMap
metadata:
  name: canary-analysis
  namespace: production
data:
  analyze.sh: |
    #!/bin/bash
    set -euo pipefail

    CANARY_DEPLOYMENT="${1:-myapp-canary}"
    STABLE_DEPLOYMENT="${2:-myapp-stable}"
    THRESHOLD="${3:-0.05}"  # 5% error rate threshold
    NAMESPACE="production"

    echo "Analyzing canary metrics..."

    # Query Prometheus for metrics (the // "0" fallback guards empty results)
    CANARY_ERROR_RATE=$(curl -s 'http://prometheus:9090/api/v1/query' \
      --data-urlencode 'query=rate(http_requests_total{status=~"5..",deployment="'${CANARY_DEPLOYMENT}'"}[5m])' | \
      jq -r '.data.result[0].value[1] // "0"')

    STABLE_ERROR_RATE=$(curl -s 'http://prometheus:9090/api/v1/query' \
      --data-urlencode 'query=rate(http_requests_total{status=~"5..",deployment="'${STABLE_DEPLOYMENT}'"}[5m])' | \
      jq -r '.data.result[0].value[1] // "0"')

    CANARY_LATENCY=$(curl -s 'http://prometheus:9090/api/v1/query' \
      --data-urlencode 'query=histogram_quantile(0.95, rate(http_request_duration_seconds_bucket{deployment="'${CANARY_DEPLOYMENT}'"}[5m]))' | \
      jq -r '.data.result[0].value[1] // "0"')

    STABLE_LATENCY=$(curl -s 'http://prometheus:9090/api/v1/query' \
      --data-urlencode 'query=histogram_quantile(0.95, rate(http_request_duration_seconds_bucket{deployment="'${STABLE_DEPLOYMENT}'"}[5m]))' | \
      jq -r '.data.result[0].value[1] // "0"')

    echo "Canary Error Rate: $CANARY_ERROR_RATE"
    echo "Stable Error Rate: $STABLE_ERROR_RATE"
    echo "Canary P95 Latency: ${CANARY_LATENCY}s"
    echo "Stable P95 Latency: ${STABLE_LATENCY}s"

    # Check if canary is within acceptable range
    if (( $(echo "$CANARY_ERROR_RATE > $THRESHOLD" | bc -l) )); then
      echo "FAIL: Canary error rate exceeds threshold"
      exit 1
    fi

    if (( $(echo "$CANARY_LATENCY > $STABLE_LATENCY * 1.2" | bc -l) )); then
      echo "FAIL: Canary latency is more than 20% higher than stable"
      exit 1
    fi

    echo "PASS: Canary meets quality criteria"
    exit 0
---
apiVersion: batch/v1
kind: Job
metadata:
  name: canary-analysis
  namespace: production
spec:
  template:
    spec:
      containers:
        - name: analyzer
          image: curlimages/curl:latest
          command:
            - sh
            - -c
            - |
              apk add --no-cache bc jq bash
              bash /scripts/analyze.sh
          volumeMounts:
            - name: scripts
              mountPath: /scripts
      volumes:
        - name: scripts
          configMap:
            name: canary-analysis
      restartPolicy: Never

4. Automated Canary Promotion


bash

#!/bin/bash
# promote-canary.sh - Automatically promote a successful canary

set -euo pipefail

NAMESPACE="${1:-production}"
DEPLOYMENT="${2:-myapp}"
MAX_DURATION="${3:-600}"  # Max 10 minutes for the canary phase

start_time=$(date +%s)
echo "Starting automated canary promotion for $DEPLOYMENT"

while true; do
  current_time=$(date +%s)
  elapsed=$((current_time - start_time))

  if [ "$elapsed" -gt "$MAX_DURATION" ]; then
    echo "ERROR: Canary exceeded max duration"
    exit 1
  fi

  # Check canary health
  CANARY_REPLICAS=$(kubectl get deployment "${DEPLOYMENT}-canary" -n "$NAMESPACE" \
    -o jsonpath='{.status.readyReplicas}')
  CANARY_DESIRED=$(kubectl get deployment "${DEPLOYMENT}-canary" -n "$NAMESPACE" \
    -o jsonpath='{.spec.replicas}')

  if [ "$CANARY_REPLICAS" -ne "$CANARY_DESIRED" ]; then
    echo "Waiting for canary pods to be ready..."
    sleep 10
    continue
  fi

  # Run analysis
  if bash /scripts/analyze.sh "$DEPLOYMENT-canary" "$DEPLOYMENT-stable"; then
    echo "Canary analysis passed! Promoting to stable..."

    # Promote the canary image tag to the stable deployment
    kubectl set image "deployment/${DEPLOYMENT}" \
      "${DEPLOYMENT}=myrepo/${DEPLOYMENT}:$(kubectl get deployment "${DEPLOYMENT}-canary" -n "$NAMESPACE" \
        -o jsonpath='{.spec.template.spec.containers[0].image}' | cut -d: -f2)" \
      -n "$NAMESPACE"

    kubectl rollout status "deployment/${DEPLOYMENT}" -n "$NAMESPACE" --timeout=5m

    echo "Canary promoted successfully!"
    exit 0
  else
    echo "Canary analysis failed. Rolling back..."
    exit 1
  fi
done

Canary Best Practices


✅ DO


  • Start with small traffic percentage (5-10%)
  • Monitor key metrics continuously
  • Increase gradually based on metrics
  • Implement automatic rollback
  • Run load tests on canary
  • Test with real user traffic
  • Set appropriate thresholds
  • Document rollback procedures
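The last practice above (documenting rollback procedures) is easiest to honor when the rollback command is pre-baked rather than typed under pressure. A small sketch that builds the same VirtualService merge patch used by the scripts in this document; the `-stable` subset naming follows this document's convention:

```shell
#!/usr/bin/env bash
# Emit the JSON merge patch that routes 100% of traffic back to the
# stable subset; feed it to `kubectl patch` during a rollback.
build_rollback_patch() {
  local deployment="$1"
  printf '{"spec":{"http":[{"route":[{"destination":{"host":"%s-stable"},"weight":100}]}]}}' \
    "$deployment"
}

# Usage (requires a cluster; shown for illustration):
# kubectl patch virtualservice myapp -n production --type merge \
#   -p "$(build_rollback_patch myapp)"
```

Keeping the patch in a function (or a checked-in script) means the documented rollback procedure and the executed one cannot drift apart.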

❌ DON'T


  • Rush through canary phases
  • Ignore metrics
  • Mix canary and stable versions
  • Deploy to all users at once
  • Skip rollback testing
  • Use artificial load only
  • Set unrealistic thresholds
  • Deploy unvalidated changes

Metrics to Monitor


  • Error Rate: increase in 5xx responses
  • Latency: P95/P99 response time
  • Throughput: requests per second
  • Resource Usage: CPU and memory
  • Business Metrics: conversion rate, revenue
  • User Experience: session duration, bounce rate
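The error-rate, latency, and throughput metrics listed above map directly onto PromQL. A sketch of Prometheus recording rules, reusing the metric and label names assumed by the analyze.sh example earlier in this document (`http_requests_total`, `http_request_duration_seconds_bucket`, a `deployment` label); adjust the names to your own instrumentation:

```yaml
# canary-recording-rules.yaml (illustrative; metric and label names follow
# the analyze.sh example above and may differ in your setup)
groups:
  - name: canary-analysis
    rules:
      # Fraction of canary requests returning 5xx over the last 5 minutes
      - record: canary:error_ratio:5m
        expr: |
          sum(rate(http_requests_total{deployment="myapp-canary",status=~"5.."}[5m]))
            /
          sum(rate(http_requests_total{deployment="myapp-canary"}[5m]))
      # P95 request latency for the canary
      - record: canary:latency_p95:5m
        expr: |
          histogram_quantile(0.95,
            sum(rate(http_request_duration_seconds_bucket{deployment="myapp-canary"}[5m])) by (le))
      # Canary throughput in requests per second
      - record: canary:rps:5m
        expr: sum(rate(http_requests_total{deployment="myapp-canary"}[5m]))
```

Recording these as named series keeps the canary analysis scripts to simple threshold comparisons instead of embedding long queries inline.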

Resources
