Monitoring System
Installing and configuring a generic monitoring system.
Monitoring at SecureAuth
Monitoring is a critical aspect for us at SecureAuth, playing a key role in our SRE, DevOps, and Support operations across diverse infrastructures. Finding an ideal solution was challenging, requiring careful consideration of various factors including system architecture and configuration. This article outlines how we utilize Thanos to store Prometheus metrics in a multi-cluster AWS EKS-based setup. Additionally, we've provided ready-to-use monitoring configurations to ensure effective SecureAuth monitoring within your infrastructure.
Requirements
Here are the requirements we wanted our monitoring system to meet:
A centralized observability system that collects data from all regions' clusters and is able to handle a growing load.
The infrastructure should be highly available and resilient.
The solution should be cost-efficient.
Collected data should be kept for at least 90 days.
Alerting should prevent false negatives and focus solely on what is important.
Architecture
Components
Our monitoring system is made up of two main components:
We have local Thanos clusters deployed in each of our SaaS regions. These clusters are independent and store (and visualize) metrics from the AWS region where they are located.
The central Thanos cluster, which handles alerts and hosts the main Grafana instance, allowing us to visualize data from all regions in one convenient location (see the configuration sketch below).
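As a rough illustration of how the central cluster reaches the regional ones, the Bitnami Thanos chart lets you statically point the central Thanos Query at the Store API endpoints exposed by each local cluster. This is a minimal sketch only, assuming the chart's query.stores parameter; the endpoint addresses below are hypothetical placeholders, not our actual configuration:
# central-thanos.yaml (sketch; endpoints are placeholders)
query:
  replicaCount: 2
  # Statically configured Store API endpoints of the regional Thanos deployments
  stores:
    - thanos-query-grpc.eu-west-1.example.internal:10901
    - thanos-query-grpc.us-east-1.example.internal:10901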
Benefits
High availability - we can still perform our SRE work even if the central Thanos cluster is down, as we can access the local clusters for critical situations. In most cases, the failure of a single region is not a problem because the central Thanos cluster still has access to data collected by local instances. It's important to note that we store our data in S3 buckets.
The central Grafana instance allows us to monitor our infrastructure from a holistic perspective. This is particularly crucial as our traffic can be automatically distributed across different regions depending on the user's location.
Our solution is cost-effective as we don't have to send all the data from local clusters to the central one. Instead, the data is stored securely in S3 buckets that are localized per region. These buckets are not dependent on our application infrastructure and are difficult to remove or destroy. Additionally, the traffic between the central cluster and local clusters is limited to alert notifications and data that our engineers request through the central Grafana.
Our data is stored in S3 buckets, which provides a favorable balance between performance and costs. This enables us to retain our data for a period of 90 days or more.
The central Thanos cluster has an additional role in our company. It can monitor non-EKS-related elements, such as the state of our WAF, stats page, and official website. Additionally, this approach allows us to group alerts from different places before sending them to us via Slack or PagerDuty (see the routing sketch below).
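To give an idea of what that grouping looks like, here is a minimal Alertmanager routing sketch. It is illustrative only: the simplified local setup described later in this article disables Alertmanager, it assumes the newer matcher syntax, and the webhook URL, channel, and routing key are placeholders rather than our actual configuration:
# Illustrative Alertmanager routing sketch; receivers and credentials are placeholders
route:
  receiver: slack-default
  group_by: ['alertname', 'region']
  group_wait: 30s
  group_interval: 5m
  routes:
    - matchers:
        - severity = "critical"
      receiver: pagerduty-oncall
receivers:
  - name: slack-default
    slack_configs:
      - api_url: https://hooks.slack.com/services/REPLACE_ME   # placeholder webhook
        channel: '#alerts'
  - name: pagerduty-oncall
    pagerduty_configs:
      - routing_key: REPLACE_ME                                # placeholder key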
Note
For a complete and ready-to-use solution, consider exploring our SecureAuth on Kubernetes via the GitOps approach. Get started with our quickstart guide, and delve deeper with the deployment configuration details.
Installation
Note
This article focuses on configuring a simplified version of the monitoring based on the local (single) Thanos cluster, but it's worth noting that a similar configuration can be applied to a central cluster as well.
We utilize AWS EKS; however, the solution outlined in this article can likely be implemented on any Kubernetes cluster, such as GKE, AKS, or bare-metal. The article assumes that you possess a fundamental understanding of Helm and Prometheus.
We utilize two Helm charts: kube-prometheus-stack, which installs Prometheus, and thanos, which is installed on top of the existing Prometheus deployment.
Prerequisites
Kubernetes 1.16+
kubectl CLI
Helm 3+ (configured to work with your Kubernetes cluster)
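A quick way to confirm the tooling is in place before you start (the exact output will vary with your versions):
kubectl version        # prints client and, if the cluster is reachable, server versions
helm version           # should report Helm v3.x
kubectl cluster-info   # confirms kubectl is pointed at the intended cluster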
Configuration Files
Create the following files:
namespace.yaml:
apiVersion: v1
kind: Namespace
metadata:
  name: monitoring
Note
To demonstrate, we will use MinIO, a tool that provides simple S3-compatible storage. However, for storing production metrics, we highly recommend using AWS S3 or a similar solution.
objstorage-local.yaml (for local testing purposes):
apiVersion: v1
kind: Secret
type: Opaque
metadata:
  name: thanos-objstore
  namespace: monitoring
stringData:
  objstore.yml: |-
    type: s3
    config:
      bucket: thanos
      endpoint: thanos-minio.monitoring:9000
      access_key: admin
      secret_key: p@ssw0rd!
      insecure: true
objstorage-prod.yaml (for production purposes):
apiVersion: v1
kind: Secret
type: Opaque
metadata:
  name: thanos-objstore
  namespace: monitoring
stringData:
  objstore.yml: |-
    type: s3
    config:
      bucket: thanos-env_name-region_name
      endpoint: s3.region_name.amazonaws.com
      region: region_name
      aws_sdk_auth: true
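With aws_sdk_auth: true, Thanos and the Prometheus sidecar pick up credentials from the standard AWS SDK chain, which on EKS typically means an IAM role for the service account (IRSA). The role ARN below is a hypothetical placeholder and the exact parameter paths vary between chart versions, so treat this as a sketch and check each chart's values:
# Hypothetical IRSA annotations; adjust paths and ARNs for your chart versions.
# kube-prometheus-stack values:
prometheus:
  serviceAccount:
    annotations:
      eks.amazonaws.com/role-arn: arn:aws:iam::123456789012:role/thanos-s3-access
# Bitnami thanos values (repeat per component that touches object storage):
storegateway:
  serviceAccount:
    annotations:
      eks.amazonaws.com/role-arn: arn:aws:iam::123456789012:role/thanos-s3-access
compactor:
  serviceAccount:
    annotations:
      eks.amazonaws.com/role-arn: arn:aws:iam::123456789012:role/thanos-s3-access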
prometheus.yaml (see the kube-prometheus-stack chart documentation for the full list of available parameters):
alertmanager:
  enabled: false
kubeApiServer:
  enabled: false
kubeControllerManager:
  enabled: false
kubeEtcd:
  enabled: false
kubeScheduler:
  enabled: false
grafana:
  adminPassword: prom-operator
  sidecar:
    datasources:
      url: 'http://thanos-query-frontend:9090/'
fullnameOverride: prometheus
prometheus:
  thanosService:
    enabled: true
  thanosServiceMonitor:
    enabled: true
  prometheusSpec:
    thanos:
      objectStorageConfig:
        key: objstore.yml   # must match the key defined in the thanos-objstore secret
        name: thanos-objstore
    serviceMonitorSelectorNilUsesHelmValues: false
    podMonitorSelectorNilUsesHelmValues: false
    ruleSelectorNilUsesHelmValues: false
    prometheusExternalLabelNameClear: true
    disableCompaction: true
thanos.yaml (see the Bitnami Thanos chart documentation for the full list of available parameters):
minio:
  enabled: true
  auth:
    rootUser: admin
    rootPassword: p@ssw0rd!
metrics:
  enabled: true
  serviceMonitor:
    enabled: true
  prometheusRule:
    enabled: true
    default:
      absent_rules: true
      store_gateway: true
      compaction: true
      query: true
query:
  replicaCount: 1
  dnsDiscovery:
    sidecarsService: prometheus-thanos-discovery
    sidecarsNamespace: monitoring
compactor:
  enabled: true
  retentionResolutionRaw: 7d
  retentionResolution5m: 30d
  retentionResolution1h: 90d
  persistence:
    size: 100Gi
storegateway:
  enabled: true
  persistence:
    size: 10Gi
existingObjstoreSecret: thanos-objstore
Installation
Add the prometheus-community and thanos chart repositories to Helm:
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo add thanos https://charts.bitnami.com/bitnami
helm repo update
Create a monitoring namespace:
kubectl apply -f ./namespace.yaml
namespace/monitoring created
Create an object storage configuration:
kubectl apply -f ./objstorage-local.yaml
secret/thanos-objstore created
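Optionally, confirm that the secret landed in the right namespace:
kubectl --namespace monitoring get secret thanos-objstore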
Install Prometheus stack:
helm install prometheus prometheus-community/kube-prometheus-stack --values ./prometheus.yaml --namespace monitoring
NAME: prometheus
LAST DEPLOYED: Mon Jan 1 00:12:30 2023
NAMESPACE: monitoring
STATUS: deployed
REVISION: 1
NOTES:
kube-prometheus-stack has been installed. Check its status by running:
  kubectl --namespace monitoring get pods -l "release=prometheus"
After successfully installing Prometheus, you can access Grafana by going to http://localhost:3000. However, make sure to replace "localhost" with the appropriate IP depending on your scenario.
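If you have no ingress or load balancer in front of Grafana yet, a simple way to reach it is a port-forward. The service name below assumes the chart defaults for a release named prometheus; confirm it with kubectl get svc --namespace monitoring:
kubectl --namespace monitoring port-forward svc/prometheus-grafana 3000:80
# then open http://localhost:3000 in a browser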
Note
The default Grafana username is admin and the password is prom-operator. Please remember to change those values in a production environment.
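One way to avoid keeping the password in the values file at all is the Grafana chart's support for an existing secret. The secret name and keys below are hypothetical, and parameter names may differ between chart versions, so treat this as a sketch:
# prometheus.yaml (sketch): pull the Grafana admin credentials from a pre-created secret
grafana:
  admin:
    existingSecret: grafana-admin-credentials   # hypothetical secret name
    userKey: admin-user
    passwordKey: admin-password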
Grafana includes pre-built dashboards that allow you to monitor the current statistics of your clusters.
It also comes with a preconfigured data source (Thanos Query), but as you can see by opening a randomly chosen dashboard, no data is visible yet. You can also confirm that the data source does not pass the connection test.
Install Thanos:
helm install thanos thanos/thanos --values thanos.yaml --namespace monitoring
NAME: thanos
LAST DEPLOYED: Mon Jun 26 14:22:28 2023
NAMESPACE: monitoring
STATUS: deployed
REVISION: 1
TEST SUITE: None
NOTES:
CHART NAME: thanos
CHART VERSION: 12.8.3
APP VERSION: 0.31.0

** Please be patient while the chart is being deployed **
Thanos chart was deployed enabling the following components:
Thanos Query
Thanos Compactor
Thanos Store Gateway
Thanos Query can be accessed through following DNS name from within your cluster:
thanos-query.monitoring.svc.cluster.local (port 9090)
To access Thanos Query from outside the cluster execute the following commands:
Get the Thanos Query URL by running these commands:
export SERVICE_PORT=$(kubectl get --namespace monitoring -o jsonpath="{.spec.ports[0].port}" services thanos-query)
kubectl port-forward --namespace monitoring svc/thanos-query ${SERVICE_PORT}:${SERVICE_PORT} &
echo "http://127.0.0.1:${SERVICE_PORT}"
Open a browser and access Thanos Query using the obtained URL.
Once Thanos is installed, you can verify the connection between the Thanos Query frontend and Grafana:
If everything looks good, you can confirm that the data collected by Thanos is visible in one of the pre-deployed Grafana dashboards.
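If data still does not show up, two quick checks can help: port-forward the query frontend and look at its Stores page to see whether the Prometheus sidecar is registered, and check the sidecar logs for block uploads. The label selector and log wording below are assumptions that may vary between chart and operator versions:
# Check which Store APIs the query layer can see
kubectl --namespace monitoring port-forward svc/thanos-query-frontend 9090:9090
# then open http://localhost:9090/stores in a browser

# Check that the sidecar is shipping blocks to object storage
kubectl --namespace monitoring logs -l app.kubernetes.io/name=prometheus -c thanos-sidecar | grep -i upload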
Post-Installation Summary
You may be wondering why we chose to install Thanos alongside Prometheus. There are several reasons for this decision. Firstly, the Thanos architecture is more scalable and easier to manage. Additionally, Thanos can store your cluster data on S3-compatible backends, and the Compactor ensures constant aggregation (downsampling) of your historical data.
How to Uninstall
If you want to remove Prometheus and Thanos from your cluster, you can do so by executing:
helm uninstall thanos --namespace monitoring
helm uninstall prometheus --namespace monitoring
kubectl delete -f ./objstorage-local.yaml
kubectl delete -f ./namespace.yaml
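Note that helm uninstall leaves the Prometheus Operator CRDs behind. If you want a completely clean cluster, remove them as well; the exact list depends on the kube-prometheus-stack version you installed, so check kubectl get crd first:
kubectl delete crd alertmanagerconfigs.monitoring.coreos.com
kubectl delete crd alertmanagers.monitoring.coreos.com
kubectl delete crd podmonitors.monitoring.coreos.com
kubectl delete crd probes.monitoring.coreos.com
kubectl delete crd prometheuses.monitoring.coreos.com
kubectl delete crd prometheusrules.monitoring.coreos.com
kubectl delete crd servicemonitors.monitoring.coreos.com
kubectl delete crd thanosrulers.monitoring.coreos.com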
Summary
In conclusion, Thanos helps us achieve high availability, durability, and cost-efficiency with Prometheus. While it may require substantial effort to implement Thanos and Prometheus across multiple Kubernetes clusters, the benefits of having a highly available Prometheus make it a worthwhile endeavor. This project was one of our more challenging undertakings, and we would have appreciated having this article as a guide when we began.