Prometheus
Prometheus is a component of NKE that allows you to monitor your applications.
Availability
Prometheus is available as an optional service for NKE. It can be deployed on an existing NKE cluster using Cockpit.
Usage
Please see the following sections for an explanation of how to use Prometheus.
General information about the setup
When Prometheus is ordered, a new Prometheus instance with two replicas will be deployed in your NKE cluster in the nine-system namespace. The pods will run on the control-plane nodes, leaving your node pools fully available for your applications.
Additionally, a new Grafana datasource will be created and automatically registered in your Grafana instance (if you have one deployed).
The Prometheus instance is based on the prometheus-operator project. Therefore, you can use the following resources to create scraping configurations and recording/alerting rules:
- ServiceMonitors
- PodMonitors
- PrometheusRules
It is possible to run multiple Prometheus instances in your cluster if needed.
Exporters and Metrics
Prometheus also comes with several pre-configured metrics exporters:
- CertManager
- IngressNginx
- NodeExporter
- Kubelet
- Kubelet cAdvisor
- KubeStateMetrics
- NineControllers
- Velero
You will need to tell us which of these exporters you want to enable. In the future, you will be able to enable them yourself in Cockpit.
Note that enabling all metrics of an exporter will increase the resources Prometheus needs to run. To mitigate this, you can limit the number of metrics to track by explicitly giving us a list of the wanted metrics.
To summarise, we recommend the following workflow:
- Tell us which exporters to enable; we will then enable all of that exporter's metrics for you
- Create your dashboards/rules
- Determine which metrics you actually need and tell us; we will limit the scrape configuration to only those metrics
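To find out which metrics an enabled exporter actually provides (the last step above), you can list all metric names per scrape job in Grafana's explore view. The job label value below is illustrative and depends on your setup:

```promql
# count all time series per metric name for one job
count by (__name__) ({job="node-exporter"})
```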
Web UI
The default Prometheus web UI has been disabled. However, you can use Grafana to view your metrics in the web browser.
Instrumenting your application
Before Prometheus can scrape metrics from your application, you will need to instrument your application to export metrics in the Prometheus text exposition format. You can find information about how to do this in the official Prometheus documentation.
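As a minimal sketch of what an instrumented application looks like, the following Python snippet serves a counter in the Prometheus text exposition format on a /metrics endpoint, using only the standard library. In practice you would typically use an official Prometheus client library; the metric name and port here are illustrative, not prescribed by NKE.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Illustrative application state: a simple request counter.
REQUESTS_TOTAL = 0

def render_metrics() -> str:
    """Render the counter in the Prometheus text exposition format."""
    return (
        "# HELP myapp_http_requests_total Total HTTP requests handled.\n"
        "# TYPE myapp_http_requests_total counter\n"
        f"myapp_http_requests_total {REQUESTS_TOTAL}\n"
    )

class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        global REQUESTS_TOTAL
        REQUESTS_TOTAL += 1
        if self.path == "/metrics":
            body = render_metrics().encode()
            self.send_response(200)
            self.send_header("Content-Type", "text/plain; version=0.0.4")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_response(404)
            self.end_headers()

def main() -> None:
    # Serve until interrupted; 8080 matches the container port used
    # in the Service and Pod examples further down.
    HTTPServer(("", 8080), MetricsHandler).serve_forever()
```

Prometheus would then scrape this endpoint once a matching ServiceMonitor or PodMonitor exists.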
Adding application metrics to Prometheus
Once your application supports metrics, you can use ServiceMonitors or PodMonitors to let Prometheus scrape your application's metrics.

ServiceMonitors scrape all pods which are targeted by one or more services. This is the resource to use in most cases. You need to define a label selector in the ServiceMonitor which is used to find the wanted services. The ServiceMonitor should be created in the same namespace as the service(s) it selects. Next to the label selector, your ServiceMonitor also needs the label prometheus.nine.ch/<your prometheus name>: scrape, where <your prometheus name> is replaced with the name of your Prometheus instance. Consider the following example ServiceMonitor and Service definitions:
```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: my-app
  namespace: my-app
  labels:
    prometheus.nine.ch/myprom01: scrape
spec:
  selector:
    matchLabels:
      app: my-app
  endpoints:
  - port: web
```

```yaml
kind: Service
apiVersion: v1
metadata:
  name: my-app-service
  namespace: my-app
  labels:
    app: my-app
spec:
  selector:
    application: example-app
  ports:
  - name: web
    port: 8080
```
The given ServiceMonitor definition will select the service "my-app-service" because the label "app: my-app" exists on that service. Prometheus will then search for all pods which are targeted by this service and start scraping them for metrics on port 8080 (the ServiceMonitor defines the port in the endpoints field).
PodMonitors scrape all pods which are selected by the given label selector. They work very similarly to the ServiceMonitor resource, just without an actual Service resource. You can use the PodMonitor resource if your application does not need a Service resource for any other reason (as is the case for some exporters). The pods should run in the same namespace in which the PodMonitor is defined. Here is an example of a PodMonitor with a corresponding pod:
```yaml
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: my-pods
  namespace: my-app
  labels:
    prometheus.nine.ch/myprom01: scrape
spec:
  selector:
    matchLabels:
      application: my-app
  podMetricsEndpoints:
  - port: web
```
```yaml
apiVersion: v1
kind: Pod
metadata:
  labels:
    application: my-app
  name: my-app
  namespace: my-app
spec:
  containers:
  - image: mycompany/example-app
    name: app
    ports:
    - name: web
      containerPort: 8080
```
Based on the given PodMonitor resource, the prometheus-operator will generate a scrape config which scrapes the shown pod "my-app" on port 8080 for metrics.

Prometheus creates a job for every ServiceMonitor or PodMonitor resource you define. It also adds a job label to all metrics gathered by the corresponding job. This can be used to find out from which services or pods a given metric has been scraped.
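For example, you could inspect scrape jobs and their targets with queries like the following (the exact job label value depends on your configuration; for ServiceMonitors the prometheus-operator derives it from the selected service by default, so "my-app-service" here is an assumption):

```promql
# all targets of one job and their health (1 = up, 0 = down)
up{job="my-app-service"}

# list all jobs currently known to Prometheus
count by (job) (up)
```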
Querying for metrics
You can use PromQL to query for metrics; there are some examples on the official Prometheus page. Querying can be done in Grafana's explore view. When using Grafana, please make sure to select the data source matching your Prometheus instance. The data source name will be <YOUR PROMETHEUS NAME>/<YOUR ORG NAME>/prometheus.
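A few illustrative PromQL queries to get started (the metric names assume the Kubelet cAdvisor and NodeExporter exporters are enabled, and the namespace is an example):

```promql
# per-pod CPU usage over the last 5 minutes
sum by (pod) (rate(container_cpu_usage_seconds_total{namespace="my-app"}[5m]))

# available memory per node
node_memory_MemAvailable_bytes
```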
Adding rules to Prometheus
Prometheus supports two kinds of rules: recording rules and alerting rules. Both have a similar syntax, but a different use case.
Recording rules can be used to calculate new metrics from already existing ones. This can be useful if you use computationally expensive queries in dashboards. To speed them up you can create a recording rule which will evaluate the query in a defined interval and stores the result as a new metric. You can then use this new metric in your dashboard queries.
Alerting rules allow you to define alert conditions (based on PromQL). When those conditions are true, Prometheus will send out an alert to the connected Alertmanager instances. Alertmanager will then send notifications to users about alerts.
When creating alerting or recording rules, please make sure to add the prometheus.nine.ch/<your prometheus name>: scrape label with the name of your Prometheus instance. This will assign the created rule to your Prometheus instance.
The following example alerting rule will fire once a job cannot reach the configured pods (targets) anymore:
```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  labels:
    prometheus.nine.ch/myprom01: scrape
    role: alert-rules
  name: jobs-check
spec:
  groups:
  - name: ./example.rules
    rules:
    - alert: InstanceDown
      expr: up == 0
      for: 5m
      labels:
        severity: Critical
      annotations:
        summary: "Instance {{ $labels.instance }} down"
        description: "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 5 minutes."
```
This alerting rule definition triggers an alert once an up metric gets a value of 0. The up metric is special: Prometheus itself adds it for every job target (pod). Once a pod can no longer be scraped, the corresponding up metric turns to 0. If the up metric stays at 0 for more than 5 minutes (in this case), Prometheus triggers the alert. The specified labels and annotations can be used in Alertmanager to customize your notification messages and routing decisions.
The full spec for the PrometheusRule definition can be found in the prometheus-operator API documentation.
Here is an example of a recording rule:
```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  labels:
    prometheus.nine.ch/myprom01: scrape
    role: recording-rules
  name: cpu-per-namespace-recording
spec:
  groups:
  - name: ./example.rules
    rules:
    - record: namespace:container_cpu_usage_seconds_total:sum_rate
      expr: sum(rate(container_cpu_usage_seconds_total{job="kubelet", metrics_path="/metrics/cadvisor", image!="", container!="POD"}[5m])) by (namespace)
```
This recording rule creates a new metric called namespace:container_cpu_usage_seconds_total:sum_rate which shows the sum of CPU used by all containers per namespace. This metric can easily be shown in a Grafana dashboard to give an overview of the CPU usage of all pods per namespace.
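In a dashboard panel you can then query the cheap, pre-computed metric directly instead of the expensive expression, for example (the namespace value is illustrative):

```promql
namespace:container_cpu_usage_seconds_total:sum_rate{namespace="my-app"}
```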
The kubernetes-mixins project contains sample alerts and rules for various exporters. It is a good place to get some inspiration for alerting and recording rules.
Documentation and Links
Video Guide
Check out our video guide series for GKE Application Monitoring. While the videos were recorded on our GKE product, the concepts are the same.