Queries

A collection of queries for inspecting data stored in a Prometheus server. Most of these queries should work in vanilla Prometheus; however, a small number may contain a few “Grafana-isms”.

The following variables are used in the examples

Name        Description
$interval   A time interval, e.g. 1m, 5m or 1h

Nodes

Queries relating to the state of the “physical” hardware that is hosting the cluster

The following metrics are used in the examples

Name                                   Description
node_cpu_seconds_total                 CPU time broken down into modes, e.g. idle, system, user
node_load1, node_load5, node_load15    1m, 5m and 15m load averages

CPU

CPU Utilisation, broken down by node

sum(rate(node_cpu_seconds_total{mode!="idle", mode!="iowait"}[$interval])) by (instance)
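To make the arithmetic concrete, here is a minimal sketch of what the query computes for a single instance, using two invented counter samples taken 60 seconds apart. The sample values and the helper name `cpu_utilisation` are assumptions for illustration only, and the real rate() also handles counter resets and extrapolation, which this ignores.

```python
# Hypothetical node_cpu_seconds_total samples for one instance,
# keyed by (cpu, mode). All values are invented for illustration.
earlier = {
    ("0", "user"): 100.0, ("0", "system"): 40.0, ("0", "idle"): 860.0, ("0", "iowait"): 5.0,
    ("1", "user"): 90.0, ("1", "system"): 30.0, ("1", "idle"): 880.0, ("1", "iowait"): 2.0,
}
later = {
    ("0", "user"): 112.0, ("0", "system"): 43.0, ("0", "idle"): 905.0, ("0", "iowait"): 5.0,
    ("1", "user"): 96.0, ("1", "system"): 33.0, ("1", "idle"): 931.0, ("1", "iowait"): 2.0,
}


def cpu_utilisation(earlier, later, window_seconds, excluded=("idle", "iowait")):
    """Sum the per-second increase of every series whose mode is not excluded."""
    total = 0.0
    for (cpu, mode), end_value in later.items():
        if mode in excluded:
            continue
        total += (end_value - earlier[(cpu, mode)]) / window_seconds
    return total


busy_cores = cpu_utilisation(earlier, later, window_seconds=60)
print(round(busy_cores, 3))  # 0.4
```

The result is expressed in CPU cores kept busy (0.4 of a core here), not as a percentage, so it is worth dividing by the core count if a percentage is wanted.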

1m CPU load average, broken down by node and normalised by the number of cores (as a percentage)

sum(node_load1) by (instance) / count(node_cpu_seconds_total{mode="system"}) by (instance) * 100
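The count(...) part works as a core count because the metric has one series per CPU for each mode. The sketch below mimics the calculation with made-up label sets and load values; the instance names and numbers are assumptions, not real data.

```python
from collections import Counter

# Hypothetical label sets for node_cpu_seconds_total{mode="system"}:
# one series exists per CPU core on each instance.
system_series = [
    {"instance": "node-a", "cpu": "0"},
    {"instance": "node-a", "cpu": "1"},
    {"instance": "node-a", "cpu": "2"},
    {"instance": "node-a", "cpu": "3"},
    {"instance": "node-b", "cpu": "0"},
    {"instance": "node-b", "cpu": "1"},
]

# Hypothetical node_load1 values per instance.
load1 = {"node-a": 2.0, "node-b": 3.0}

# count(...) by (instance): number of series, i.e. cores, per instance.
cores = Counter(series["instance"] for series in system_series)

# load / cores * 100, evaluated left to right as in the query.
normalised = {instance: load1[instance] / cores[instance] * 100 for instance in load1}
print(normalised)  # {'node-a': 50.0, 'node-b': 150.0}
```

A value above 100 (node-b here) suggests more runnable work than there are cores to run it on.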

Kubernetes

A collection of queries specific to Kubernetes clusters

Pods

Useful for alerts, this query returns the pods that have no ready containers

sum(kube_pod_container_status_ready) by (pod) < 1
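A small sketch of the grouping this query performs, using invented samples (the pod and container names are hypothetical). It also shows a limitation worth knowing about: a pod with at least one ready container sums to 1 or more and is not reported, even if another of its containers is not ready.

```python
from collections import defaultdict

# Hypothetical kube_pod_container_status_ready samples: (pod, container, value).
samples = [
    ("api-1", "app", 1), ("api-1", "sidecar", 0),
    ("worker-1", "app", 0),
    ("db-1", "app", 1),
]

# sum(...) by (pod): add up the ready flags of each pod's containers.
ready_per_pod = defaultdict(int)
for pod, _container, value in samples:
    ready_per_pod[pod] += value

# "< 1" keeps only pods where the sum is zero.
not_ready = sorted(pod for pod, ready in ready_per_pod.items() if ready < 1)
print(not_ready)  # ['worker-1']
```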

This returns the number of times each container in a pod has restarted

kube_pod_container_status_restarts_total

Useful for alerts, this query will return the pods that are waiting and the reason for the delay

sum(kube_pod_container_status_waiting_reason) by (pod, reason) > 0

Services

A collection of queries for metrics useful for monitoring services

The following metrics will be used as examples

Name                       Description
service_requests           A counter that counts the number of requests received
service_requests_latency   A histogram that records the response times for a request
service_info               A counter that encodes information about the service in the metric labels

The number of requests received per second

sum(rate(service_requests[$interval]))

The number of server errors (5xx responses), again per second

sum(rate(service_requests{code=~"5.*"}[$interval]))

The number of user errors (4xx responses), per second

sum(rate(service_requests{code=~"4.*"}[$interval]))
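Prometheus anchors label regexes at both ends, so `5.*` means "starts with 5" rather than "contains 5". The sketch below imitates that with Python's `re.fullmatch`; Python's regex engine is not the one Prometheus uses, so treat it as an approximation for simple patterns like these. The list of codes and the helper `matches_code` are made up for illustration.

```python
import re


def matches_code(pattern, code):
    # re.fullmatch requires the whole string to match, like an anchored regex.
    return re.fullmatch(pattern, code) is not None


codes = ["200", "404", "500", "503", "150", "5"]

loose = [c for c in codes if matches_code("5.*", c)]
strict = [c for c in codes if matches_code("5..", c)]
print(loose)   # ['500', '503', '5']
print(strict)  # ['500', '503']
```

If the code label could ever hold something other than a three-character status, a stricter pattern such as `5..` may be a safer choice.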

Each of the above can be broken down by service (assuming the existence of a service label in the scraped data) as follows

sum(rate(service_requests[$interval])) by (service)

Request latencies are recorded as a histogram, so we can only collect aggregate values. As far as I can tell, the usual way to do this is to generate the following data points:

  • 95th/99th percentile
  • 50th percentile / median
  • average response time

The Xth percentile tells us the upper bound on the amount of time in which X% of requests are processed. E.g. if the 95th percentile is 500ms, then 95% of requests are handled in 500ms or less.
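As a quick illustration of that definition, the sketch below computes a percentile from raw latencies with the nearest-rank method. The latency values are invented, and the nearest-rank method is only one of several percentile definitions; Prometheus itself estimates percentiles from histogram buckets rather than raw samples.

```python
import math


def nearest_rank_percentile(values, percent):
    """Smallest value such that at least `percent`% of values are <= it."""
    ordered = sorted(values)
    rank = math.ceil(percent * len(ordered) / 100)  # 1-based rank
    return ordered[max(rank, 1) - 1]


# Hypothetical request latencies in milliseconds.
latencies_ms = [120, 80, 95, 110, 500, 70, 130, 90, 100, 105,
                85, 115, 125, 75, 140, 60, 150, 135, 145, 2000]

p95 = nearest_rank_percentile(latencies_ms, 95)
share = sum(1 for v in latencies_ms if v <= p95) / len(latencies_ms)
print(p95, share)  # 500 0.95
```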

Using these metrics we can get a feel for the distribution of the request times:

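  • If the median coincides with the mean then the distribution is roughly symmetric (a normal distribution is one such case, but symmetry alone does not show that the response times are normally distributed)
  • If the mean < median then the distribution has a tail towards zero, i.e. a few unusually fast requests pull the mean down while the majority of requests take longer than the mean
  • If the mean > median then the distribution has a tail of slow requests, i.e. the majority of requests are processed quicker than the mean while a few slow ones pull it up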
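A small sketch of how a few slow requests move the mean but barely move the median. The latencies are invented for illustration.

```python
from statistics import mean, median

# Hypothetical latencies in milliseconds: mostly fast, with two slow outliers.
latencies_ms = [80, 90, 95, 100, 100, 105, 110, 120, 900, 2000]

avg = mean(latencies_ms)
mid = median(latencies_ms)
faster_than_mean = sum(1 for v in latencies_ms if v < avg)

print(avg, mid, faster_than_mean)  # 370 102.5 8
```

Here the mean is well above the median, and 8 of the 10 requests are faster than the mean: the two slow requests are what drag it upwards.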

To generate the Xth percentile

histogram_quantile(0.X, sum(rate(service_requests_latency_bucket[$interval])) by (le))

NOTE: This will only work if the bucket label is called le
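Because only bucket counts are stored, the percentile is an estimate obtained by interpolating linearly inside the bucket that contains the target rank. The sketch below is a simplified reimplementation of that idea for classic histograms, not the Prometheus source: the function name `estimate_quantile` and the bucket values are assumptions, and it expects 0 < q <= 1, a non-empty histogram, and a final +Inf bucket.

```python
def estimate_quantile(q, buckets):
    """buckets: (upper_bound, cumulative_count) pairs sorted by bound, ending with +Inf."""
    total = buckets[-1][1]
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if bound == float("inf"):
                # The rank falls in the +Inf bucket: report the highest finite bound.
                return buckets[-2][0]
            in_bucket = count - prev_count
            # Linear interpolation between the bucket's lower and upper bound.
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / in_bucket
        prev_bound, prev_count = bound, count


# Hypothetical cumulative bucket counts, keyed by the le label (seconds).
buckets = [(0.1, 50), (0.25, 80), (0.5, 95), (1.0, 99), (float("inf"), 100)]

print(estimate_quantile(0.5, buckets))             # 0.1
print(round(estimate_quantile(0.9, buckets), 4))   # 0.4167
print(estimate_quantile(0.995, buckets))           # 1.0
```

The query feeds per-second bucket rates rather than raw counts into the function, but the proportions between buckets, and so the estimate, work the same way. The accuracy depends on how well the bucket boundaries match the real latencies.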

To get the average

sum(rate(service_requests_latency_sum[$interval])) / sum(rate(service_requests_latency_count[$interval]))
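The division works because both rates cover the same window, so the window length cancels and what remains is time spent divided by requests handled. A minimal sketch with invented scrape values, assuming the histogram records latency in seconds:

```python
# Hypothetical scrapes of the _sum and _count series, 60 seconds apart.
window_seconds = 60
sum_start, sum_end = 120.0, 150.0      # total seconds spent handling requests
count_start, count_end = 1000, 1200    # total requests handled

sum_rate = (sum_end - sum_start) / window_seconds        # seconds of work per second
count_rate = (count_end - count_start) / window_seconds  # requests per second

average_latency = sum_rate / count_rate
print(round(average_latency, 3))  # 0.15
```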

To get a count of the number of services

count(sum(service_info) by (service))

Assuming the existence of a service label