Canary Deployments

Get started with canary deployment on Kubernetes using Flagger.

note

The walkthrough below uses NGINX Gateway Fabric and the Kubernetes Gateway API as one concrete example. SecureAuth does not require a specific gateway or ingress implementation – Flagger supports several providers (Gateway API, Istio, Linkerd, NGINX Ingress, App Mesh, Contour, Kuma, Open Service Mesh, Traefik). Pick whichever matches your cluster and adapt the manifests accordingly.

Overview

This documentation guides you through setting up a Flagger canary deployment for SecureAuth. Flagger is a progressive delivery tool that automates the release process using canary deployments. It provides observability, control, and operational simplicity when deploying microservices on Kubernetes.

A canary deployment balances risk against progress: a new version is exposed to a small share of production traffic, evaluated, and then either rolled out gradually or rolled back. Each step of the canary is described below.

  • User makes a request: This is the initial user action. It could be accessing a website, using a feature of an application, etc.

  • Request reaches the gateway: The gateway (an Ingress controller, a Gateway API implementation, or a service mesh ingress) routes incoming requests based on configured rules.

  • Majority of requests are routed to the stable version (V1.0): To minimize risk, the gateway initially directs the majority of user requests to the current stable version of the software.

  • A few canary users are routed to the new version (V2.0): A small percentage of users (the "canary users") are directed to the new version, so that it is tested in a live production environment with a limited user base.

  • Performance evaluation: The new version's performance is monitored closely.

  • If it performs badly: If any significant issues or performance degradation are detected with the new version, the canary deployment is aborted and all traffic returns to the stable version.

  • If it performs well: If the new version operates as expected, the percentage of user traffic directed to the new version is gradually increased over time.

  • Gradual replacement of the stable version: As the new version proves to be stable, it gradually takes over the entire user traffic.
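The decision loop above can be sketched as a small shell function. This is an illustrative model only, not Flagger's implementation: the function name simulate_canary, its arguments, and its output lines are hypothetical, and it rolls back as soon as the failure count is reached, whereas a real controller may act on its next evaluation tick.

```shell
# Hypothetical model of the canary loop described above; not Flagger's code.
# simulate_canary MAX THRESHOLD "WEIGHTS" "SAMPLES"
#   MAX        highest acceptable metric value (for example latency in seconds)
#   THRESHOLD  number of failed checks that triggers a rollback
#   WEIGHTS    space-separated canary traffic percentages
#   SAMPLES    space-separated metric values, one per evaluation
simulate_canary() {
  local max="$1" threshold="$2" weights="$3" samples="$4"
  local failed=0 current=0 sample
  for sample in $samples; do
    # awk does the floating-point comparison that the shell lacks.
    if awk -v s="$sample" -v m="$max" 'BEGIN { exit !(s + 0 > m + 0) }'; then
      failed=$((failed + 1))
      echo "halt weight=$current value=$sample"
      if [ "$failed" -ge "$threshold" ]; then
        echo "rollback"         # all traffic returns to the stable version
        return 1
      fi
    elif [ -z "$weights" ]; then
      echo "promote"            # every weight passed: the canary replaces stable
      return 0
    else
      current=${weights%% *}    # next weight in the list
      case $weights in
        *" "*) weights=${weights#* } ;;
        *)     weights="" ;;
      esac
      echo "advance weight=$current"
    fi
  done
  echo "in-progress weight=$current"
}

# Example: two slow checks exhaust a threshold of 2.
# simulate_canary 0.25 2 "5 10 15" "0.10 0.12 0.60 0.70"
```

A healthy check moves the canary to the next weight, a failed check holds the current weight, and the rollout ends either in promotion or in rollback.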

Prerequisites

Before proceeding, ensure that you have the following tools installed on your system:

  • Kind v0.13.0+ (Kubernetes cluster v1.16+)

  • Helm v3.0+

Set Up Kubernetes Kind Cluster

If you don't have a Kubernetes cluster, you can set up a local one using Kind. This step is optional if you already have a Kubernetes cluster.

Create a configuration file named kind-config.yaml with the following content:

kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
- role: control-plane
  extraPortMappings:
  - containerPort: 30443
    hostPort: 8443
    protocol: TCP

This maps port 30443 inside the cluster to port 8443 on the host so you can reach the gateway on localhost.

Create the cluster:

kind create cluster --name=my-cluster --config=kind-config.yaml

To learn more, visit the Kind official documentation.

Install the Gateway API CRDs

Flagger's Gateway API provider needs the upstream Gateway API CRDs installed in the cluster:

kubectl apply -f https://github.com/kubernetes-sigs/gateway-api/releases/download/v1.2.0/standard-install.yaml

Install NGINX Gateway Fabric (example gateway)

This walkthrough uses NGINX Gateway Fabric as the Gateway API implementation. Any conformant Gateway API implementation (Istio, Envoy Gateway, Contour, Traefik, Cilium, Kong, etc.) will work – adjust the install step to match.

helm install ngf oci://ghcr.io/nginx/charts/nginx-gateway-fabric \
  --create-namespace --namespace nginx-gateway \
  --set service.type=NodePort \
  --set 'service.ports[0].port=443' \
  --set 'service.ports[0].nodePort=30443' \
  --set 'service.ports[0].protocol=TCP'

Create a Gateway so Flagger has something to bind canary routes to:

apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: acp-gateway
  namespace: nginx-gateway
spec:
  gatewayClassName: nginx
  listeners:
  - name: https
    protocol: HTTPS
    port: 443
    hostname: "acp.local"
    tls:
      mode: Terminate
      certificateRefs:
      - kind: Secret
        name: acp-server-tls
    allowedRoutes:
      namespaces:
        from: All

Save the manifest as gateway.yaml and apply it with kubectl apply -f gateway.yaml.

To learn more, visit the NGINX Gateway Fabric documentation or the Gateway API specification.

Install Prometheus Operator

Prometheus collects metrics that Flagger evaluates during canary analysis. If the canary metrics breach the configured thresholds, Flagger halts the rollout and reverts to the stable version.

helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update prometheus-community
helm install prometheus prometheus-community/kube-prometheus-stack --create-namespace --namespace monitoring --set prometheus.prometheusSpec.serviceMonitorSelectorNilUsesHelmValues=false
  • --set prometheus.prometheusSpec.serviceMonitorSelectorNilUsesHelmValues=false lets the Prometheus Operator pick up ServiceMonitors created outside its own Helm release (necessary to monitor SecureAuth).

To learn more, visit the Prometheus Operator Helm Chart repository.

Install Flagger with the Gateway API Provider

Flagger has first-class support for Gateway API since v1.27. Install it configured to use the Gateway API provider and to query the Prometheus instance deployed above:

helm repo add flagger https://flagger.app
helm repo update flagger
helm upgrade -i flagger flagger/flagger \
  --create-namespace --namespace=flagger \
  --set meshProvider=gatewayapi:v1 \
  --set metricsServer=http://prometheus-kube-prometheus-prometheus.monitoring:9090
helm upgrade -i loadtester flagger/loadtester --namespace=flagger
  • --set meshProvider=gatewayapi:v1 selects the Gateway API provider. Use nginx, istio, linkerd, etc. if your environment uses a different one.
  • --set metricsServer=... points Flagger at the Prometheus you installed.

To learn more, visit the Flagger Gateway API tutorial.

Install SecureAuth Helm chart

In this step, install SecureAuth on Kubernetes using the kube-acp-stack Helm Chart. See Kubernetes deployment for the full guide.

Define a Flagger Canary for SecureAuth

The Canary custom resource describes the desired state of the canary deployment and is the entry point for the automated rollout process.

Parameter | Description
provider | The Flagger provider. gatewayapi:v1 selects the Gateway API provider.
targetRef | The SecureAuth Deployment that Flagger manages.
service.gatewayRefs | The Gateway (and, optionally, the section/listener) the canary is bound to.
service.port / targetPort / hosts | Ports and hostnames of the canary HTTPRoute.
analysis.interval | How often Flagger evaluates metrics.
analysis.stepWeights | Canary traffic weights (in percent) applied at successive steps.
analysis.threshold | Number of failed evaluations before rollback.
analysis.webhooks | Hooks run during the rollout, such as pre-rollout, rollout, and post-rollout.
analysis.metrics | Prometheus-backed metric checks gating each step.

MetricTemplate resources describe how to query Prometheus; Flagger uses them in analysis.metrics. A ServiceMonitor exposes the canary's metrics so the Prometheus Operator can scrape them.

note

The configuration below is illustrative. For a complete list of recommended metrics to monitor during a canary release of SecureAuth, see Recommended SecureAuth Metrics to be Monitored.

apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: acp
spec:
  provider: "gatewayapi:v1"
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: acp
  service:
    port: 8443
    targetPort: http
    portDiscovery: true
    hosts:
    - "acp.local"
    gatewayRefs:
    - name: acp-gateway
      namespace: nginx-gateway
  analysis:
    interval: 1m
    threshold: 2
    stepWeights: [5, 10, 15, 20, 25, 35, 45, 60, 75, 90]
    webhooks:
    - name: "check alive"
      type: pre-rollout
      url: http://loadtester/
      timeout: 15s
      metadata:
        type: bash
        cmd: "curl -k https://acp-canary.acp:8443/alive"
    metrics:
    - name: "ACP P95 Latency"
      templateRef:
        name: acp-request-duration
      thresholdRange:
        max: 0.25
      interval: 60s
---
apiVersion: flagger.app/v1beta1
kind: MetricTemplate
metadata:
  name: acp-request-duration
spec:
  provider:
    type: prometheus
    address: http://prometheus-kube-prometheus-prometheus.monitoring:9090
  query: |
    avg(histogram_quantile(0.95, rate(acp_http_duration_seconds_bucket{job="acp-canary"}[{{ interval }}])) > 0) or on() vector(0)
---
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: acp-canary
spec:
  endpoints:
  - port: metrics
    interval: 10s
  selector:
    matchLabels:
      app.kubernetes.io/name: acp-canary

Save the above YAML as acp-canary.yaml and apply it:

kubectl apply -f acp-canary.yaml --namespace acp

Applying the Canary makes Flagger create the acp-primary Deployment, the primary and canary Services, and an HTTPRoute that splits traffic between them. Check the Canary state with:

kubectl get canaries --namespace acp

The status should reach initialized.

Triggering Canary Release, Simulating Latency and Observing Canary Failures

In this step, you trigger a canary release by changing the SecureAuth Helm chart, then simulate latency to make the canary fail and observe Flagger's reaction.

Access the Gateway on Kind with a Local Domain

Map acp.local to your loopback so the gateway is reachable from your machine:

sudo vi /etc/hosts

Add:

127.0.0.1       acp.local

You can now reach the gateway at https://acp.local:8443.
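As an alternative to editing the file by hand, the entry can be added idempotently with a small helper. This is a minimal sketch: the function name ensure_hosts_entry is hypothetical, and writing to /etc/hosts itself requires running it from a root shell.

```shell
# ensure_hosts_entry FILE IP HOSTNAME
# Appends "IP<TAB>HOSTNAME" to FILE unless HOSTNAME is already mapped there.
ensure_hosts_entry() {
  local file="$1" ip="$2" host="$3"
  # Look for HOSTNAME as a whole field on any non-comment line.
  if awk -v h="$host" '
       /^[ \t]*#/ { next }
       { for (i = 2; i <= NF; i++) if ($i == h) found = 1 }
       END { exit !found }' "$file"; then
    return 0                      # already present, nothing to do
  fi
  printf '%s\t%s\n' "$ip" "$host" >> "$file"
}

# Example (as root):
# ensure_hosts_entry /etc/hosts 127.0.0.1 acp.local
```

Running it twice leaves a single acp.local line, which also makes the clean-up step at the end of this guide a one-line removal.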

Make a Change in the SecureAuth Helm Chart to Trigger Canary Release

helm upgrade acp cloudentity/kube-acp-stack --namespace acp \
--set acp.serviceMonitor.enabled=true \
--set acp.config.data.logging.level=warn

This starts a canary version of SecureAuth:

kubectl get pods -n acp -l 'app.kubernetes.io/name in (acp, acp-primary)'

NAME                           READY   STATUS    RESTARTS   AGE
acp-69cdf99895-b6xcb           1/1     Running   0          3m6s
acp-primary-868d5889d6-5xj59   1/1     Running   0          11m
  • acp-<id> is the canary version under analysis.

  • acp-primary-<id> is the currently deployed stable version.

Simulate Latency Using tc

Add latency to the canary pod only, to simulate a regression in the newly deployed version.

  1. Get the SecureAuth pod network interface index (here 48, as indicated by eth0@if48):

    kubectl exec acp-69cdf99895-b6xcb --namespace acp -- ip link show eth0

    2: eth0@if48: <BROADCAST,MULTICAST,UP,LOWER_UP,M-DOWN> mtu 1500 qdisc noqueue state UP
    link/ether 76:07:42:2a:3e:e0 brd ff:ff:ff:ff:ff:ff
  2. Identify the pod's host-side interface name on the Kind node. Look for the entry whose index matches the number from the previous step (here 48: veth3ab723f4):

    docker exec my-cluster-control-plane ip link show

    48: veth3ab723f4@if2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default
    link/ether da:1f:26:cf:a2:e9 brd ff:ff:ff:ff:ff:ff link-netns cni-43b9239b-e0d1-b12f-16a6-0fcfc72e8df0
  3. Apply latency:

    docker exec my-cluster-control-plane tc qdisc add dev veth3ab723f4 root netem delay 10ms

This affects every packet, so multi-packet requests accumulate the delay.
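The interface lookup in steps 1 and 2 can be scripted by parsing the ip link output. The sketch below assumes the output format shown above; the function names peer_ifindex and host_ifname are hypothetical helpers, not part of any tool.

```shell
# Print the peer interface index from `ip link show eth0` output on stdin,
# e.g. "2: eth0@if48: ..." -> 48
peer_ifindex() {
  sed -n 's/^ *[0-9]*: eth0@if\([0-9]*\):.*/\1/p'
}

# Print the name of the interface whose index is $1, reading `ip link show`
# output on stdin, e.g. "48: veth3ab723f4@if2: ..." -> veth3ab723f4
host_ifname() {
  awk -v idx="$1" -F': ' '$1 == idx { sub(/@.*/, "", $2); print $2 }'
}

# Example, using the pod and node names from this walkthrough:
# idx=$(kubectl exec acp-69cdf99895-b6xcb --namespace acp -- ip link show eth0 | peer_ifindex)
# dev=$(docker exec my-cluster-control-plane ip link show | host_ifname "$idx")
# docker exec my-cluster-control-plane tc qdisc add dev "$dev" root netem delay 10ms
```

Both helpers only read text from standard input, so they can be checked against saved command output before being pointed at a live cluster.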

Observe the Canary Process

During the canary deployment, Flagger gradually shifts traffic from primary to canary while monitoring the metrics you defined. Observe it with:

kubectl -n acp describe canary/acp
kubectl logs deployment/flagger --namespace flagger --follow

Drive traffic at the gateway:

while true; do curl -o /dev/null -k -s -w 'Response time: %{time_total}s\n' https://acp.local:8443/health && sleep 1; done

Response time: 0.006966s
Response time: 0.119855s <- canary version (static delay of 10ms)
Response time: 0.068919s

Increase the latency to push it past the threshold:

docker exec my-cluster-control-plane tc qdisc del dev veth3ab723f4 root
docker exec my-cluster-control-plane tc qdisc add dev veth3ab723f4 root netem delay 100ms

Response times exceed the 250 ms threshold:

while true; do curl -o /dev/null -k -s -w 'Response time: %{time_total}s\n' https://acp.local:8443/health && sleep 1; done
Response time: 0.607907s
Response time: 0.010327s <- primary version (no delay)
Response time: 0.608345s
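To compare the samples above with the threshold, you can compute a client-side percentile from the curl output. This is a rough sketch using the nearest-rank method; the function name response_percentile is hypothetical, and the value is measured at the client, so it will not match the server-side histogram that Flagger queries from Prometheus.

```shell
# response_percentile [P]
# Reads "Response time: <seconds>s" lines on stdin and prints the P-th
# percentile (default 95, integer P) using the nearest-rank method.
response_percentile() {
  sed -n 's/^Response time: \([0-9.]*\)s.*/\1/p' |
    LC_ALL=C sort -n |
    awk -v p="${1:-95}" '
      { v[NR] = $1 }
      END {
        if (NR == 0) exit 1
        rank = int((p * NR + 99) / 100)   # ceil(p * NR / 100) for integer p
        if (rank < 1) rank = 1
        print v[rank]
      }'
}

# Example: collect 60 samples, then summarize them.
# for i in $(seq 60); do
#   curl -o /dev/null -k -s -w 'Response time: %{time_total}s\n' https://acp.local:8443/health
#   sleep 1
# done | response_percentile 95
```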

When the P95 latency check fails analysis.threshold times (2 in this example), Flagger halts the rollout and rolls back:

{"level":"info","ts":"2023-05-18T14:20:58.628Z","caller":"controller/events.go:33","msg":"New revision detected! Scaling up acp.acp","canary":"acp.acp"}
{"level":"info","ts":"2023-05-18T14:21:58.617Z","caller":"controller/events.go:33","msg":"Starting canary analysis for acp.acp","canary":"acp.acp"}
{"level":"info","ts":"2023-05-18T14:21:58.734Z","caller":"controller/events.go:33","msg":"Pre-rollout check check alive passed","canary":"acp.acp"}
{"level":"info","ts":"2023-05-18T14:21:58.898Z","caller":"controller/events.go:33","msg":"Advance acp.acp canary weight 5","canary":"acp.acp"}
{"level":"info","ts":"2023-05-18T14:22:58.761Z","caller":"controller/events.go:33","msg":"Advance acp.acp canary weight 10","canary":"acp.acp"}
{"level":"info","ts":"2023-05-18T14:23:58.755Z","caller":"controller/events.go:33","msg":"Advance acp.acp canary weight 15","canary":"acp.acp"}
{"level":"info","ts":"2023-05-18T14:24:58.754Z","caller":"controller/events.go:33","msg":"Advance acp.acp canary weight 20","canary":"acp.acp"}
{"level":"info","ts":"2023-05-18T14:25:58.747Z","caller":"controller/events.go:33","msg":"Advance acp.acp canary weight 25","canary":"acp.acp"}
{"level":"info","ts":"2023-05-18T14:26:58.751Z","caller":"controller/events.go:33","msg":"Advance acp.acp canary weight 35","canary":"acp.acp"}
{"level":"info","ts":"2023-05-18T14:27:58.751Z","caller":"controller/events.go:33","msg":"Advance acp.acp canary weight 45","canary":"acp.acp"}
{"level":"info","ts":"2023-05-18T14:28:58.752Z","caller":"controller/events.go:33","msg":"Advance acp.acp canary weight 60","canary":"acp.acp"}
{"level":"info","ts":"2023-05-18T14:31:58.617Z","caller":"controller/events.go:45","msg":"Halt acp.acp advancement ACP P95 Latency 1.34 > 0.25","canary":"acp.acp"}
{"level":"info","ts":"2023-05-18T14:32:58.618Z","caller":"controller/events.go:45","msg":"Halt acp.acp advancement ACP P95 Latency 0.97 > 0.25","canary":"acp.acp"}
{"level":"info","ts":"2023-05-18T14:33:58.616Z","caller":"controller/events.go:45","msg":"Rolling back acp.acp failed checks threshold reached 2","canary":"acp.acp"}
{"level":"info","ts":"2023-05-18T14:33:58.749Z","caller":"controller/events.go:45","msg":"Canary failed! Scaling down acp.acp","canary":"acp.acp"}
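The JSON log lines above can be condensed into a readable timeline with standard text tools. This is a minimal sketch that assumes the field order shown in the log excerpt ("ts" before "msg"); the function names flagger_events and last_canary_weight are hypothetical.

```shell
# Print "<timestamp> <message>" for each Flagger JSON log line on stdin.
flagger_events() {
  sed -n 's/.*"ts":"\([^"]*\)".*"msg":"\([^"]*\)".*/\1 \2/p'
}

# Print the last canary weight that was reached before the log ends.
last_canary_weight() {
  sed -n 's/.*"msg":"Advance [^"]* canary weight \([0-9]*\)".*/\1/p' | tail -n 1
}

# Example:
# kubectl logs deployment/flagger --namespace flagger | flagger_events
# kubectl logs deployment/flagger --namespace flagger | last_canary_weight
```

For the excerpt above, last_canary_weight reports 60, the weight at which the latency checks started to fail.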

Clean Up

Delete the Kind cluster to remove everything:

kind delete cluster --name=my-cluster

Remove the acp.local entry from your /etc/hosts.

Recommended SecureAuth Metrics to be Monitored

During a canary deployment, monitor the following metrics to verify that the new version performs as expected.

Error Rates

The share of requests that end in an HTTP 5xx server error. A rising error rate points to a defect in the new version.

apiVersion: flagger.app/v1beta1
kind: MetricTemplate
metadata:
  name: acp-error-rate
spec:
  provider:
    type: prometheus
    address: http://prometheus-kube-prometheus-prometheus.monitoring:9090
  query: |
    (100 - sum(rate(acp_http_duration_seconds_count{job="acp-canary", status_code!~"5.."}[{{ interval }}])) / sum(rate(acp_http_duration_seconds_count{job="acp-canary"}[{{ interval }}])) * 100) or on() vector(0)
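The arithmetic in the query above is 100 minus the percentage of non-5xx requests. As a worked example with made-up rates, the hypothetical helper below reproduces it outside Prometheus; returning 0 when there is no traffic is only similar in spirit to the vector(0) fallback, not an exact equivalent.

```shell
# error_rate_percent NON_5XX_RATE TOTAL_RATE
# Prints the percentage of requests that ended in a 5xx status.
error_rate_percent() {
  awk -v ok="$1" -v total="$2" 'BEGIN {
    if (total + 0 == 0) { print 0; exit }   # no traffic: report 0
    printf "%.2f\n", 100 - ok / total * 100
  }'
}

# Example: 95 of 100 requests per second succeed, so 5% are server errors.
# error_rate_percent 95 100
```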

Request Duration

The time taken to serve a request – useful for spotting performance regressions in the new version.

apiVersion: flagger.app/v1beta1
kind: MetricTemplate
metadata:
  name: acp-request-duration
spec:
  provider:
    type: prometheus
    address: http://prometheus-kube-prometheus-prometheus.monitoring:9090
  query: |
    avg(histogram_quantile(0.95, rate(acp_http_duration_seconds_bucket{job="acp-canary"}[{{ interval }}])) > 0) or on() vector(0)
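To see how a P95 value is derived from histogram buckets, the sketch below applies the linear interpolation that histogram_quantile uses within a bucket. It is a simplified model with made-up bucket data, not Prometheus code: the function name is a hypothetical shell helper, and it ignores the rate() and avg() steps of the query above.

```shell
# histogram_quantile Q
# Reads "<le> <cumulative_count>" lines on stdin, sorted by ascending le and
# ending with the +Inf bucket, and prints the estimated Q-quantile.
histogram_quantile() {
  awk -v q="$1" '
    { le[NR] = $1; cnt[NR] = $2 + 0 }
    END {
      if (NR < 2 || cnt[NR] == 0) exit 1
      rank = q * cnt[NR]
      for (i = 1; i < NR; i++) if (cnt[i] >= rank) break
      # Quantile falls in the +Inf bucket: report the highest finite bound.
      if (i == NR) { print le[NR - 1]; exit }
      lo   = (i == 1) ? 0 : le[i - 1]    # lower bound of the matching bucket
      prev = (i == 1) ? 0 : cnt[i - 1]   # observations below that bucket
      printf "%.4f\n", lo + (le[i] - lo) * (rank - prev) / (cnt[i] - prev)
    }'
}

# Example: 100 observations, 90 of them at or below 0.25 s.
# printf '0.1 50\n0.25 90\n0.5 100\n+Inf 100\n' | histogram_quantile 0.95
```

With the example data the 95th observation lies halfway through the 0.25 to 0.5 bucket, so the estimate is 0.375 s, above the 0.25 threshold used in this guide. The result can never be finer than the bucket boundaries allow.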

Queue Pending Messages

Number of messages currently pending in the queue. A sudden increase suggests the new version isn't draining the queue.

apiVersion: flagger.app/v1beta1
kind: MetricTemplate
metadata:
  name: acp-pending-messages
spec:
  provider:
    type: prometheus
    address: http://prometheus-kube-prometheus-prometheus.monitoring:9090
  query: |
    sum(acp_redis_error_count{job="acp-canary"}) or on() vector(0)

Queue Processing Time

Time it takes to process a queue message. Increased processing time can indicate performance issues with the new version.

apiVersion: flagger.app/v1beta1
kind: MetricTemplate
metadata:
  name: acp-lag-messages
spec:
  provider:
    type: prometheus
    address: http://prometheus-kube-prometheus-prometheus.monitoring:9090
  query: |
    avg(histogram_quantile(0.95, rate(acp_redis_lag_seconds_bucket{job="acp-canary"}[{{ interval }}])) > 0) by (group, stream) or on() vector(0)