Canary Deployments
Get started with canary deployment on Kubernetes using Flagger.
Overview
This documentation guides you through the process of setting up Flagger Canary deployment with NGINX for SecureAuth. Flagger is a progressive delivery tool that automates the release process using canary deployments. It provides observability, control, and operational simplicity when deploying microservices on Kubernetes.
Canary deployment is a process that provides a balance between risk and progress, ensuring that new versions can be tested and rolled out in a controlled manner. Each step of the canary process is described below.
User makes a request: This is the initial user action. It could be accessing a website, using a feature of an application, etc.
Request reaches the Ingress: The Ingress is like a traffic controller that manages incoming requests. It decides where to route each user request based on certain rules or strategies.
Majority of requests are routed to the stable version (V1.0): To minimize risk, the Ingress initially directs the majority of user requests to the current stable version of the software.
Few Canary Users are routed to the new version (V2.0): A small percentage of users (the "canary users") are directed to the new version of the software. The purpose of this is to test the new version in a live production environment with a limited user base, reducing potential impact in case of unforeseen issues.
Performance evaluation: The new version's performance is monitored closely.
If it performs badly: If any significant issues or performance degradation are detected with the new version, the canary deployment is aborted. In this case, all users are then routed back to the stable version until the issues with the new version are resolved.
If it performs well: If the new version operates as expected and no significant issues are found, the percentage of user traffic directed to the new version is gradually increased over time. This is represented by the "Increase Canary Users over Time" step.
Gradual replacement of the stable version: As the new version proves to be stable and efficient, it gradually takes over the entire user traffic from the stable version, thus completing the canary deployment process.
Note
For a complete and ready-to-use solution, consider exploring our SecureAuth on Kubernetes via the GitOps approach. Get started with our quickstart guide, and delve deeper with the deployment configuration details.
Prerequisites
Before proceeding, ensure that you have the following tools installed on your system: Docker, Kind, kubectl, Helm, and curl.
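You can quickly confirm that the tools are available from your shell. The version checks below are only a sanity check; any reasonably recent versions should work.

```bash
# Confirm the prerequisite tools are installed and on your PATH.
docker --version
kind version
kubectl version --client
helm version --short
curl --version
```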
Set Up Kubernetes Kind Cluster
If you don't have a Kubernetes cluster, you can set up a local one using Kind. Kind allows you to run Kubernetes on your local machine. This step is optional if you already have a Kubernetes cluster set up.
Create a configuration file named kind-config.yaml with the following content:
```yaml
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
  - role: control-plane
    extraPortMappings:
      - containerPort: 30443
        hostPort: 8443
        protocol: TCP
```
This configuration creates a new Kind cluster with a single control-plane node and maps port 30443 from the container to port 8443 on the host. The port mapping is used to access the NGINX ingress on localhost.
Now, you can create a new Kind cluster with this configuration:
```bash
kind create cluster --name=my-cluster --config=kind-config.yaml
```
To learn more, visit the Kind official documentation.
Install NGINX Ingress Controller
Flagger uses the NGINX Ingress Controller to control the traffic routing during the canary deployment. Flagger modifies the Ingress resource to gradually shift traffic from the stable version of the ACP app to the canary version. This allows us to monitor how the system behaves under the canary version without fully committing to it.
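Under the hood, Flagger with the NGINX provider manages a separate canary Ingress and adjusts the standard ingress-nginx canary annotations on it as the analysis progresses. The snippet below is a simplified illustration of such a generated Ingress; the resource name, host, and weight are examples only.

```yaml
# Simplified illustration of a Flagger-managed canary Ingress.
# The nginx.ingress.kubernetes.io/canary* annotations tell the NGINX
# Ingress Controller to send the given percentage of traffic to the
# canary backend; Flagger updates canary-weight at each iteration.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: acp-canary                 # example name
  annotations:
    nginx.ingress.kubernetes.io/canary: "true"
    nginx.ingress.kubernetes.io/canary-weight: "5"
spec:
  rules:
    - host: acp.local              # example host
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: acp-canary
                port:
                  number: 8443
```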
To install the NGINX Ingress Controller in the nginx namespace, use the following commands:
```bash
helm repo add ingress-nginx https://kubernetes.github.io/ingress-nginx
helm repo update ingress-nginx
helm install nginx-ingress ingress-nginx/ingress-nginx \
  --create-namespace \
  --namespace nginx \
  --set controller.service.type=NodePort \
  --set controller.service.nodePorts.https=30443
```
- `--set controller.service.type=NodePort` sets the type of the service to NodePort, allowing it to be accessible on a static port on the cluster.
- `--set controller.service.nodePorts.https=30443` specifies the node port for HTTPS traffic as 30443. Any HTTPS traffic sent to this port is forwarded to the NGINX service.
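After the installation completes, you can verify that the controller service exposes the expected NodePort. This is an optional sanity check; the exact service name depends on the chart release name.

```bash
# List services in the nginx namespace and confirm the HTTPS NodePort is 30443.
kubectl get svc --namespace nginx
```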
To learn more, visit the NGINX Ingress Controller official documentation.
Install Prometheus Operator
Prometheus Operator simplifies the deployment and configuration of Prometheus, a monitoring system and time series database. Prometheus collects metrics from monitored targets by scraping metrics HTTP endpoints on these targets.
Flagger uses Prometheus to retrieve metrics about SecureAuth and uses this information to make decisions during the canary deployment. If the metrics indicate that the canary version is causing issues, Flagger will halt the rollout and revert to the stable version.
To install the Prometheus stack in the monitoring namespace, use the following commands:
```bash
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update prometheus-community
helm install prometheus prometheus-community/kube-prometheus-stack \
  --create-namespace \
  --namespace monitoring \
  --set prometheus.prometheusSpec.serviceMonitorSelectorNilUsesHelmValues=false
```
- `--set prometheus.prometheusSpec.serviceMonitorSelectorNilUsesHelmValues=false` — by default, the Prometheus Operator only selects ServiceMonitors within its own release. This flag ensures that the Prometheus Operator also selects ServiceMonitors outside of its own release, which is necessary when you want to monitor services in other namespaces or from other releases, like SecureAuth.
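Once the stack is up, you can check that the Prometheus pods are running and, optionally, port-forward the Prometheus service (whose address is reused later as Flagger's metrics server) to browse scrape targets locally. This is an optional check.

```bash
# Check that the kube-prometheus-stack pods are up.
kubectl get pods --namespace monitoring

# Optionally expose the Prometheus UI on http://localhost:9090
# to inspect scrape targets and run queries.
kubectl port-forward --namespace monitoring svc/prometheus-kube-prometheus-prometheus 9090:9090
```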
To learn more, visit the Prometheus Operator Helm Chart repository.
Install Flagger with NGINX Backend
Flagger is a Kubernetes operator that automates the promotion of canary deployments using various service mesh providers, including NGINX. Flagger requires a running Kubernetes cluster and uses Prometheus for monitoring and collecting metrics during the canary deployment process. In this step, install Flagger with the NGINX backend:
```bash
helm repo add flagger https://flagger.app
helm repo update flagger
helm upgrade -i flagger flagger/flagger \
  --create-namespace \
  --namespace=flagger \
  --set meshProvider=nginx \
  --set metricsServer=http://prometheus-kube-prometheus-prometheus.monitoring:9090
helm upgrade -i loadtester flagger/loadtester --namespace=flagger
```
- `--set meshProvider=nginx` — sets the service mesh provider to NGINX. Flagger supports multiple service mesh providers, and in this case you're specifying that you're using NGINX.
- `--set metricsServer=http://prometheus-kube-prometheus-prometheus.monitoring:9090` — sets the URL of the metrics server from which Flagger fetches metrics during canary analysis. In this case, the Prometheus server's service address in the monitoring namespace is used.
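To confirm that Flagger and the load tester are running, you can inspect the pods and the operator logs (an optional check; connection errors to the metrics server would show up in the logs):

```bash
# Verify the Flagger operator and load tester pods.
kubectl get pods --namespace flagger

# Follow the operator logs.
kubectl logs deployment/flagger --namespace flagger --follow
```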
To learn more, visit the Flagger official documentation.
Install SecureAuth Helm chart
In this step, you need to install SecureAuth on Kubernetes using the kube-acp-stack Helm Chart.
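The exact values depend on your environment; as a rough sketch, the installation follows the usual Helm flow. The chart repository URL below is a placeholder, and the release name and namespace match those used later in this guide.

```bash
# Sketch only: add the Cloudentity chart repository (replace the placeholder
# URL with the actual repository address) and install the kube-acp-stack chart
# into the acp namespace.
helm repo add cloudentity <cloudentity-charts-repo-url>
helm repo update cloudentity
helm install acp cloudentity/kube-acp-stack \
  --create-namespace \
  --namespace acp
```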
Define Flagger Canary Custom Resource for SecureAuth
In this step, you'll define a Flagger Canary custom resource. This resource describes the desired state for SecureAuth deployment and plays a key role in enabling the automated canary deployment process.
This custom resource is broken down into the following sections:
The `Canary` resource includes configuration about the deployment and service to watch, the ingress reference, service port details, the analysis, and related webhooks. Its parameters are described in the table below.

| Parameter | Description |
|-----------|-------------|
| `provider` | Specifies the service mesh provider, in this case, NGINX. |
| `targetRef` | Identifies the SecureAuth Deployment object that Flagger will manage. |
| `ingressRef` | Identifies the SecureAuth Ingress object that Flagger will manage. |
| `service` | Specifies the ports used by the SecureAuth service. |
| `analysis` | Defines the parameters for the canary analysis, as described below. |
| `interval` | Defines the period of a single stepWeight iteration. |
| `stepWeights` | Number of steps and their traffic routing weights for the canary service. |
| `threshold` | Number of times a canary analysis is allowed to fail before it is rolled back. |
| `webhooks` | Used for running validation tests before a canary is started. In this case, a pre-rollout webhook is configured to check the "alive" status. |
| `metrics` | Defines the metrics that are checked at the end of each iteration. Each metric includes a reference to a MetricTemplate, a threshold, and an interval for fetching the metric. |

The `MetricTemplate` describes how to fetch the SecureAuth metrics from Prometheus. It contains a custom query that Flagger uses to fetch and analyze metrics from Prometheus.

The `ServiceMonitor` is configured to monitor the canary version of SecureAuth. This enables Flagger and Prometheus to monitor the performance of the canary version during the canary analysis.
Note
The configuration below is an example for the purposes of this article. For a complete list of recommended metrics to monitor during a canary release of SecureAuth, refer to the Recommended SecureAuth Metrics to be Monitored section.
```yaml
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: acp
spec:
  provider: nginx
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: acp
  ingressRef:
    apiVersion: networking.k8s.io/v1
    kind: Ingress
    name: acp
  service:
    port: 8443
    targetPort: http
    portDiscovery: true
  analysis:
    interval: 1m
    threshold: 2
    stepWeights: [5, 10, 15, 20, 25, 35, 45, 60, 75, 90]
    webhooks:
      - name: "check alive"
        type: pre-rollout
        url: http://loadtester/
        timeout: 15s
        metadata:
          type: bash
          cmd: "curl -k https://acp-canary.acp:8443/alive"
    metrics:
      - name: "ACP P95 Latency"
        templateRef:
          name: acp-request-duration
        thresholdRange:
          max: 0.25
        interval: 60s
---
apiVersion: flagger.app/v1beta1
kind: MetricTemplate
metadata:
  name: acp-request-duration
spec:
  provider:
    type: prometheus
    address: http://prometheus-kube-prometheus-prometheus.monitoring:9090
  query: |
    avg(histogram_quantile(0.95,
      rate(acp_http_duration_seconds_bucket{job="acp-canary"}[{{ interval }}])) > 0)
    or on() vector(0)
---
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: acp-canary
spec:
  endpoints:
    - port: metrics
      interval: 10s
  selector:
    matchLabels:
      app.kubernetes.io/name: acp-canary
```
Save the above YAML configuration in a file named acp-canary.yaml and apply it to your cluster:
```bash
kubectl apply -f acp-canary.yaml --namespace acp
```
Applying these resources creates a second SecureAuth ingress, along with additional services for the canary and primary deployments. Check whether the Canary resource has been successfully initialized by running:
```bash
kubectl get canaries --namespace acp
```
The status should be `Initialized`.
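You can also list the additional objects that Flagger generated alongside the original deployment (the primary and canary services and the second ingress mentioned above):

```bash
# Inspect the ingresses and services Flagger created in the acp namespace.
kubectl get ingress,svc --namespace acp
```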
To learn more, visit the NGINX Canary Deployments documentation.
Triggering Canary Release, Simulating Latency and Observing Canary Failures
In this step, we will trigger a canary release by making a change in the SecureAuth Helm chart. We will then simulate latency to cause the canary process to fail. We will also learn how to observe the canary process.
Access Ingress on Kind with Local Domain
In order to access the ingress on Kind using the local domain acp.local, you need to map the domain to localhost in your hosts file.
Open the hosts file in a text editor.
```bash
sudo vi /etc/hosts
```
Add the following line to the file:
```
127.0.0.1 acp.local
```
Save your changes and exit the text editor.
Now, you should be able to access the ingress on the Kind cluster via https://acp.local:8443 from your browser.
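You can also verify the mapping from the command line before moving on, for example by calling the health endpoint used later in this guide (the -k flag skips TLS certificate verification, matching the curl examples below):

```bash
# Confirm that the NGINX ingress answers on the mapped port for acp.local.
curl -k https://acp.local:8443/health
```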
Make a Change in the SecureAuth Helm Chart to Trigger Canary Release
```bash
helm upgrade acp cloudentity/kube-acp-stack \
  --namespace acp \
  --set acp.serviceMonitor.enabled=true \
  --set acp.config.data.logging.level=warn
```
This upgrade starts a canary version of SecureAuth:
```bash
kubectl get pods -n acp -l 'app.kubernetes.io/name in (acp, acp-primary)'
NAME                           READY   STATUS    RESTARTS   AGE
acp-69cdf99895-b6xcb           1/1     Running   0          3m6s
acp-primary-868d5889d6-5xj59   1/1     Running   0          11m
```
- The pod named `acp-<id>` is the canary version of SecureAuth, which is analyzed before promotion.
- The pod named `acp-primary-<id>` is the currently deployed version of SecureAuth.
Simulate Latency Using the tc Command
We want to add latency just to the canary pod in order to simulate issues with the deployed new version.
Get the SecureAuth pod's network interface index. In the case below it is `48`, as indicated by `eth0@if48`:

```bash
kubectl exec acp-69cdf99895-b6xcb --namespace acp -- ip link show eth0
2: eth0@if48: <BROADCAST,MULTICAST,UP,LOWER_UP,M-DOWN> mtu 1500 qdisc noqueue state UP
    link/ether 76:07:42:2a:3e:e0 brd ff:ff:ff:ff:ff:ff
```
Identify the pod interface name in the Kind cluster. Using the index from the previous command, in this case the interface is `veth3ab723f4`, as indicated by the number `48` next to it (note that the `@if2` suffix is not part of the name):

```bash
docker exec my-cluster-control-plane ip link show
48: veth3ab723f4@if2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default
    link/ether da:1f:26:cf:a2:e9 brd ff:ff:ff:ff:ff:ff link-netns cni-43b9239b-e0d1-b12f-16a6-0fcfc72e8df0
```
Set latency on the interface using its name with the command below:

```bash
docker exec my-cluster-control-plane tc qdisc add dev veth3ab723f4 root netem delay 10ms
```
Note that adding latency this way affects every individual packet. If a request consists of multiple packets, the latency adds up, which can be observed later on.
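You can confirm that the netem qdisc is in place on the veth interface before continuing:

```bash
# Show the queueing discipline attached to the canary pod's host-side interface;
# the output should list a netem qdisc with the configured 10ms delay.
docker exec my-cluster-control-plane tc qdisc show dev veth3ab723f4
```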
Observe the Canary Process
During the canary deployment, Flagger gradually shifts traffic from the old version to the new version of SecureAuth while monitoring the defined metrics. You can observe the progress of the canary deployment by querying the Flagger resources and logs with kubectl.
```bash
kubectl -n acp describe canary/acp
kubectl logs flagger-5d4f4f785-gjdt6 --namespace flagger --follow
```
We will start a simple curl loop to connect to SecureAuth. As we set the latency to 10ms, responses from the canary version are delayed. Over time, you can observe more and more requests hitting the canary version of SecureAuth.

If you used the canary manifest from the previous steps, the canary analysis lasts for 10 minutes and progressively shifts traffic to the canary version, starting at 5% in the first iteration and reaching 90% in the last iteration.
```bash
while true; do curl -o /dev/null -k -s -w 'Response time: %{time_total}s\n' https://acp.local:8443/health && sleep 1; done
Response time: 0.006966s
Response time: 0.119855s <- canary version (static delay of 10ms)
Response time: 0.068919s
```
Now, let's increase the latency even more, causing the canary release to fail:
```bash
docker exec my-cluster-control-plane tc qdisc del dev veth3ab723f4 root
docker exec my-cluster-control-plane tc qdisc add dev veth3ab723f4 root netem delay 100ms
```
The canary response times are now above the 250ms threshold:
```bash
while true; do curl -o /dev/null -k -s -w 'Response time: %{time_total}s\n' https://acp.local:8443/health && sleep 1; done
Response time: 0.607907s
Response time: 0.010327s <- primary version (no delay)
Response time: 0.608345s
```
If the measured latency stays above the threshold, you can see the canary process failing in the Flagger logs:
{"level":"info","ts":"2023-05-18T14:20:58.628Z","caller":"controller/events.go:33","msg":"New revision detected! Scaling up acp.acp","canary":"acp.acp"} {"level":"info","ts":"2023-05-18T14:21:58.617Z","caller":"controller/events.go:33","msg":"Starting canary analysis for acp.acp","canary":"acp.acp"} {"level":"info","ts":"2023-05-18T14:21:58.734Z","caller":"controller/events.go:33","msg":"Pre-rollout check check alive passed","canary":"acp.acp"} {"level":"info","ts":"2023-05-18T14:21:58.898Z","caller":"controller/events.go:33","msg":"Advance acp.acp canary weight 5","canary":"acp.acp"} {"level":"info","ts":"2023-05-18T14:22:58.761Z","caller":"controller/events.go:33","msg":"Advance acp.acp canary weight 10","canary":"acp.acp"} {"level":"info","ts":"2023-05-18T14:23:58.755Z","caller":"controller/events.go:33","msg":"Advance acp.acp canary weight 15","canary":"acp.acp"} {"level":"info","ts":"2023-05-18T14:24:58.754Z","caller":"controller/events.go:33","msg":"Advance acp.acp canary weight 20","canary":"acp.acp"} {"level":"info","ts":"2023-05-18T14:25:58.747Z","caller":"controller/events.go:33","msg":"Advance acp.acp canary weight 25","canary":"acp.acp"} {"level":"info","ts":"2023-05-18T14:26:58.751Z","caller":"controller/events.go:33","msg":"Advance acp.acp canary weight 35","canary":"acp.acp"} {"level":"info","ts":"2023-05-18T14:27:58.751Z","caller":"controller/events.go:33","msg":"Advance acp.acp canary weight 45","canary":"acp.acp"} {"level":"info","ts":"2023-05-18T14:28:58.752Z","caller":"controller/events.go:33","msg":"Advance acp.acp canary weight 60","canary":"acp.acp"} {"level":"info","ts":"2023-05-18T14:31:58.617Z","caller":"controller/events.go:45","msg":"Halt acp.acp advancement ACP P95 Latency 1.34 > 0.25","canary":"acp.acp"} {"level":"info","ts":"2023-05-18T14:32:58.618Z","caller":"controller/events.go:45","msg":"Halt acp.acp advancement ACP P95 Latency 0.97 > 0.25","canary":"acp.acp"} {"level":"info","ts":"2023-05-18T14:33:58.616Z","caller":"controller/events.go:45","msg":"Rolling back acp.acp failed checks threshold reached 2","canary":"acp.acp"} {"level":"info","ts":"2023-05-18T14:33:58.749Z","caller":"controller/events.go:45","msg":"Canary failed! Scaling down acp.acp","canary":"acp.acp"}
Clean Up
Once you're done with your testing, you may want to clean up the resources you've created. Since all our resources are created within the kind cluster, we just need to delete the kind cluster to clean up.
To delete the kind cluster, run the following command:
```bash
kind delete cluster --name=my-cluster
```
This command deletes the Kind cluster named my-cluster and, with it, all the resources within the cluster, including any applications or services you've deployed. Replace my-cluster with the name of your cluster if it is different.
Also, remember to clean up any changes you've made to your /etc/hosts file.
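For example, on Linux you could remove the entry added earlier with sed (edit the file manually on macOS, or adjust the command for your OS):

```bash
# Remove the acp.local entry that was added to /etc/hosts.
sudo sed -i '/acp\.local/d' /etc/hosts
```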
Recommended SecureAuth Metrics to be Monitored
When performing canary deployments, it's crucial to monitor specific metrics to ensure the new version is performing as expected.
Here are some recommended metrics for the SecureAuth application:
Error Rates
Monitoring the rate of various HTTP error codes (5xx) can help identify issues with the new version.
```yaml
apiVersion: flagger.app/v1beta1
kind: MetricTemplate
metadata:
  name: acp-error-rate
spec:
  provider:
    type: prometheus
    address: http://prometheus-kube-prometheus-prometheus.monitoring:9090
  query: |
    (100 - sum(rate(acp_http_duration_seconds_count{job="acp-canary", status_code!~"5.."}[{{ interval }}]))
    /
    sum(rate(acp_http_duration_seconds_count{job="acp-canary"}[{{ interval }}])) * 100)
    or on() vector(0)
```
Request Duration
This is the time taken to serve a request. It can be helpful in detecting performance regressions in the new version.
```yaml
apiVersion: flagger.app/v1beta1
kind: MetricTemplate
metadata:
  name: acp-request-duration
spec:
  provider:
    type: prometheus
    address: http://prometheus-kube-prometheus-prometheus.monitoring:9090
  query: |
    avg(histogram_quantile(0.95,
      rate(acp_http_duration_seconds_bucket{job="acp-canary"}[{{ interval }}])) > 0)
    or on() vector(0)
```
Queue Pending Messages
This metric represents the number of messages currently pending in the queue. A sudden increase might indicate a problem with processing the queue's messages.
```yaml
apiVersion: flagger.app/v1beta1
kind: MetricTemplate
metadata:
  name: acp-pending-messages
spec:
  provider:
    type: prometheus
    address: http://prometheus-kube-prometheus-prometheus.monitoring:9090
  query: |
    sum(acp_redis_error_count{job="acp-canary"}) or on() vector(0)
```
Queue Processing Time
This metric represents the time it takes to process a message from the queue. Increased processing time can indicate performance issues with the new version.
```yaml
apiVersion: flagger.app/v1beta1
kind: MetricTemplate
metadata:
  name: acp-lag-messages
spec:
  provider:
    type: prometheus
    address: http://prometheus-kube-prometheus-prometheus.monitoring:9090
  query: |
    avg(histogram_quantile(0.95,
      rate(acp_redis_lag_seconds_bucket{job="acp-canary"}[{{ interval }}])) > 0) by (group, stream)
    or on() vector(0)
```