Add Operator Observability Best Practices
In this document we will outline what operators
require in order to meet the "Deep Insights"
capability level and provide best practices and
examples for creating metrics, recording rules and alerts.

Signed-off-by: Shirly Radco <[email protected]>
sradco committed Aug 8, 2022
1 parent 87cdc50 commit 393a6e0
Showing 1 changed file with 255 additions and 0 deletions.
255 changes: 255 additions & 0 deletions website/content/en/docs/best-practices/observability-best-practices.md
@@ -0,0 +1,255 @@
---
title: "Operator Observability Best Practices"
linkTitle: "Observability Best practices"
weight: 6
description: This guide describes best practices for adding observability to operators.
---

## Operator Observability Best Practices

In this document we will outline what operators require in order to meet the "Deep Insights" capability level and provide best practices and examples for creating metrics, recording rules and alerts. It is based on the general guidelines in [Operator Capability Levels](https://sdk.operatorframework.io/docs/overview/operator-capabilities/).

**Note:** For technical documentation of how to add metrics to your operator, please read the [Metrics](https://book.kubebuilder.io/reference/metrics.html) section of the Kubebuilder documentation.

### Deep Insights capability level requirements

1. **Health and Performance metrics** for all of the operator components - Implemented based on the guidelines below.
2. **Metrics Documentation** - All metrics should have documentation.
3. **Alerts** for when things are not working as expected for each of the operator's components - Implemented based on the guidelines below.
4. **Alerts Runbooks** - Each alert MUST include a runbook_url annotation and a runbook.
5. **Alerts and Metrics Tests** - E2E Testing for metrics and alerts and unit tests for alerts.
6. **Events** - Custom Resources MUST emit custom events for the operations taking place.
7. **Metering** - Operator leverages Operator Metering.

**Note:** Metering is not currently mandatory for meeting the "Deep Insights" capability level requirements.

### Metrics Guidelines

1. The metric `Help` message should be verbose, since it can be used to generate documentation automatically. For example, the [KubeVirt metrics](https://github.com/kubevirt/kubevirt/blob/main/docs/metrics.md) documentation is generated by the [KubeVirt metrics doc generator](https://github.com/kubevirt/kubevirt/blob/main/tools/doc-generator/doc-generator.go).

   The `Help` message should include the following details:
   - What the metric measures.
   - The unit of measurement. See the [Prometheus base units](https://prometheus.io/docs/practices/naming/#base-units) and [Understanding metric types](https://prometheus.io/docs/tutorials/understanding_metric_types/#types-of-metrics).
   - What the output means.
   - The [metric type](https://prometheus.io/docs/concepts/metric_types/#metric-types), for example `Gauge`, `Counter` or `Histogram`.
   - Which unique labels the metric uses, if applicable. See the [Prometheus label naming guidelines](https://prometheus.io/docs/practices/naming/#labels).

2. When creating a new metric or recording rule that reports a resource such as a `pod` or a `container` name, make sure that the `namespace` label is included so that the resource can be uniquely identified.

   **Note:** The `namespace` label is usually populated via service discovery, but in some cases, most often for recording rules, it must be added explicitly.

#### Metrics Naming
Your operator metrics should align with the Kubernetes metrics names.

Your operator users should get the same experience when searching for a metric across Kubernetes operators, resources and custom resources.

1. Check whether a similar Kubernetes metric for a node, container or pod exists, and try to align with it.
2. The metrics search list in the Prometheus and Grafana UIs is sorted alphabetically.
   When searching for a metric, it should be easy to identify the metrics that belong to a specific operator.
   We therefore recommend that operator metric names follow this format:
   `Operator name` prefix + `Sub operator name` or `entity` + `metric name`, based on the [Prometheus naming conventions](https://prometheus.io/docs/practices/naming/).

**Note:** In Kubernetes, related metrics are split into separate metric names, for example:
- node_network_**receive**_packets_total
- node_network_**transmit**_packets_total

In this example the metrics are split by `receive` and `transmit`.

Please follow the same principle and do not encode such details as labels, so that users get a consistent experience.
An example of this in an operator:
- kubevirt_vmi_network_**receive**_errors_total
- kubevirt_vmi_network_**transmit**_bytes_total
- kubevirt_migrate_vmi_**data_processed**_bytes
- kubevirt_migrate_vmi_**data_remaining**_bytes

#### Recording Rules Naming
As per [Prometheus](https://prometheus.io/docs/prometheus) documentation, [Recording rules](https://prometheus.io/docs/prometheus/latest/configuration/recording_rules/#recording-rules) allow you to precompute frequently needed or computationally expensive expressions and save their result as a new set of time series.

**Note:** Prometheus recording rules appear in the Prometheus UI as metrics.
To make your operator's recording rules easy to identify, their names should follow the same naming guidelines as the metrics.
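
As an illustration of the naming guideline, here is a minimal sketch of a rule group with one recording rule. The operator name `example-operator`, the `example-database` deployment and the rule name are hypothetical; `kube_deployment_status_replicas_ready` is assumed to be available from kube-state-metrics. In a `PrometheusRule` custom resource the same `groups` list would go under `spec`.

```yaml
groups:
  - name: example-operator.rules
    rules:
      # Hypothetical recording rule: the name carries the operator prefix
      # (example_operator), the entity (database) and the metric name.
      - record: example_operator_database_ready_replicas
        # "sum by (namespace)" keeps the namespace label on the result,
        # so the recorded series can still be uniquely identified.
        expr: |
          sum by (namespace) (kube_deployment_status_replicas_ready{deployment="example-database"})
```

Because the recorded series starts with the operator prefix, it sorts next to the operator's own metrics in the Prometheus and Grafana metric search lists.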

### Alerts Guidelines
Clear and actionable alerts are a key component of a smooth operational
experience. Ensuring we have clear and concise guidelines for developers and
administrators creating new alerts will result in a better
experience for end users.

There should be clear guidance aligning critical, warning, and info
alert severities with expected outcomes across components, to avoid alert fatigue for administrators.
There must be acceptance criteria for critical alerts.

#### Recommended Reading

A list of references on good alerting practices:

* [Google SRE Book - Monitoring Distributed Systems](https://sre.google/sre-book/monitoring-distributed-systems/)
* [Prometheus Alerting Documentation](https://prometheus.io/docs/practices/alerting/)
* [Alerting for Distributed Systems](https://www.usenix.org/sites/default/files/conference/protected-files/srecon16europe_slides_rabenstein.pdf)

#### Alert Ownership

Individual teams are responsible for writing and maintaining alerting rules for
their components, i.e. their operators and operands.

Teams should also take into consideration how their components interact with
existing monitoring and alerting. As an example, if your operator deploys a
service which creates one or more `PersistentVolume` resources, and these
volumes are expected to be mostly full as part of normal operation, it's likely
that this will cause unnecessary `KubePersistentVolumeFillingUp` alerts to fire.
You should work with the monitoring team to find a solution to avoid triggering
these alerts if they are not actionable.

#### Alerts Style Guide

* Alert names MUST be CamelCase, e.g.: `PrometheusRuleFailures`
* Alert names SHOULD be prefixed with a component, e.g.: `AlertmanagerFailedReload`
  * There may be exceptions for some broadly scoped alerts, e.g.: `TargetDown`
* Alerts MUST include a `severity` label indicating the alert's urgency.
  * Valid severities are: `critical`, `warning`, or `info`. See below for
    guidelines on writing alerts of each severity.
* Alerts MUST include `summary` and `description` annotations.
  * Think of `summary` as the first line of a commit message, or an email
    subject line. It should be brief but informative. The `description` is the
    longer, more detailed explanation of the alert.
* Alerts SHOULD include a `namespace` label indicating the source of the alert.
  * Many alerts will include this by virtue of the fact that their PromQL
    expressions result in a namespace label. Others may require a static
    namespace label. See, for example, the [KubeCPUOvercommit](https://github.com/openshift/cluster-monitoring-operator/blob/79cdf68/assets/control-plane/prometheus-rule.yaml#L235-L247) alert.
* All critical alerts MUST include a `runbook_url` annotation.
  * Runbook style documentation for resolving critical alerts is required.
    Your operator alert runbooks can be saved in your operator repository,
    at [OpenShift Runbooks](https://github.com/openshift/runbooks) if your operator is shipped with OpenShift,
    or in another location that fits your operator.
  * If you are using GitHub, you can use [GitHub Pages](https://pages.github.com/) for a better view of the runbooks.
* Alerts SHOULD include a `kubernetes_operator_part_of` label indicating the operator name. The label name is based on the [Kubernetes Recommended Labels](https://kubernetes.io/docs/concepts/overview/working-with-objects/common-labels/#labels).

**Optional Alerts Labels**
* `priority` label indicating the alert's level of importance and the order in which it should be fixed.
  * Valid priorities are: `high`, `medium`, or `low`.
    The higher the priority, the sooner the alert should be resolved.
  * If the alert doesn't include a `priority` label, it is assumed to be a `medium` priority alert.

This label will usually be used for alerts with `warning` severity, to indicate the order in which alerts should be addressed, even though they don't require immediate action.
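
Putting the style guide together, an alert could look roughly like the sketch below. Everything specific to the operator is hypothetical: the `ExampleOperatorDown` name, the `example-system` namespace, the `example-operator` deployment and the `example.com` runbook URL. The expression assumes that kube-state-metrics exposes `kube_deployment_status_replicas_available`.

```yaml
groups:
  - name: example-operator.alerts
    rules:
      # CamelCase name, prefixed with the (hypothetical) component name.
      - alert: ExampleOperatorDown
        # The result keeps the namespace label of the matched series,
        # so no static namespace label is needed here.
        expr: |
          kube_deployment_status_replicas_available{namespace="example-system", deployment="example-operator"} == 0
        for: 10m
        labels:
          severity: warning
          # Optional label; treated as medium when omitted.
          priority: high
          kubernetes_operator_part_of: example-operator
        annotations:
          summary: The example-operator deployment has no available replicas.
          description: No example-operator replica has been available for 10 minutes, so custom resources are not being reconciled.
          runbook_url: https://example.com/runbooks/ExampleOperatorDown.md
```

The `runbook_url` annotation is only mandatory for critical alerts under the style guide above, but adding it to warnings as well keeps the alert actionable.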

#### Alerts Severity
##### Critical Alerts

TL/DR: For alerting current and impending disaster situations. These alerts
page an SRE. The situation should warrant waking someone in the middle of the
night.

Timeline: ~5 minutes.

Reserve critical level alerts only for reporting conditions that may lead to
loss of data or inability to deliver service for the cluster as a whole.
Failures of most individual components should not trigger critical level alerts,
unless they would result in either of those conditions. Configure critical level
alerts so they fire before the situation becomes irrecoverable. Expect users to
be notified of a critical alert within a short period of time after it fires so
they can respond with corrective action quickly.

Example critical alert: [KubeAPIDown](https://github.com/openshift/cluster-monitoring-operator/blob/79cdf68/assets/control-plane/prometheus-rule.yaml#L412-L421)

```yaml
- alert: KubeAPIDown
  annotations:
    summary: Target disappeared from Prometheus target discovery.
    description: KubeAPI has disappeared from Prometheus target discovery.
    runbook_url: https://github.com/openshift/runbooks/blob/master/alerts/cluster-monitoring-operator/KubeAPIDown.md
  expr: |
    absent(up{job="apiserver"} == 1)
  for: 15m
  labels:
    severity: critical
```

This alert fires if no Kubernetes API server instance has reported metrics
successfully in the last 15 minutes. This is a clear example of a critical
control-plane issue that represents a threat to the operability of the cluster
as a whole, and likely warrants paging someone. The alert has clear summary and
description annotations, and it links to a runbook with information on
investigating and resolving the issue.

The group of critical alerts should be small, very well defined, highly
documented, polished and with a high bar set for entry. This includes a
mandatory review of a proposed critical alert by the Red Hat SRE team.

##### Warning Alerts

TL/DR: The vast majority of alerts should use this severity. Issues at the
warning level should be addressed in a timely manner, but don't pose an
immediate threat to the operation of the cluster as a whole.

Timeline: ~60 minutes

If your alert does not meet the criteria in "Critical Alerts" above, it belongs
to the warning level or lower.

Use warning level alerts for reporting conditions that may lead to inability to
deliver individual features of the cluster, but not service for the cluster as a
whole. Most alerts are likely to be warnings. Configure warning level alerts so
that they do not fire until components have sufficient time to try to recover
from the interruption automatically. Expect users to be notified of a warning,
but for them not to respond with corrective action immediately.

Example warning alert: [ClusterNotUpgradeable](https://github.com/openshift/cluster-version-operator/blob/513a2fc/install/0000_90_cluster-version-operator_02_servicemonitor.yaml#L68-L76)

```yaml
- alert: ClusterNotUpgradeable
  annotations:
    summary: One or more cluster operators have been blocking minor version cluster upgrades for at least an hour.
    description: In most cases, you will still be able to apply patch releases.
      Reason {{ "{{ with $cluster_operator_conditions := \"cluster_operator_conditions\" | query}}{{range $value := .}}{{if and (eq (label \"name\" $value) \"version\") (eq (label \"condition\" $value) \"Upgradeable\") (eq (label \"endpoint\" $value) \"metrics\") (eq (value $value) 0.0) (ne (len (label \"reason\" $value)) 0) }}{{label \"reason\" $value}}.{{end}}{{end}}{{end}}"}}
      For more information refer to 'oc adm upgrade'{{ "{{ with $console_url := \"console_url\" | query }}{{ if ne (len (label \"url\" (first $console_url ) ) ) 0}} or {{ label \"url\" (first $console_url ) }}/settings/cluster/{{ end }}{{ end }}" }}.
  expr: |
    max by (name, condition, endpoint) (cluster_operator_conditions{name="version", condition="Upgradeable", endpoint="metrics"} == 0)
  for: 60m
  labels:
    severity: warning
```
This alert fires if one or more operators have not reported their `Upgradeable`
condition as true for more than an hour. The alert has a clear name and
informative summary and description annotations. The timeline is appropriate
for allowing the operator a chance to resolve the issue automatically, avoiding
the need to alert an administrator.

##### Info Alerts

TL/DR: Info level alerts represent situations an administrator should be aware
of, but that don't necessarily require any action. Use these sparingly, and
consider instead reporting this information via Kubernetes events.

Example info alert: [MultipleContainersOOMKilled](https://github.com/openshift/cluster-monitoring-operator/blob/79cdf68/assets/cluster-monitoring-operator/prometheus-rule.yaml#L326-L338)

```yaml
- alert: MultipleContainersOOMKilled
  annotations:
    description: Multiple containers were out of memory killed within the past
      15 minutes. There are many potential causes of OOM errors, however issues
      on a specific node or containers breaching their limits is common.
    summary: Containers are being killed due to OOM
  expr: sum(max by(namespace, container, pod) (increase(kube_pod_container_status_restarts_total[12m]))
    and max by(namespace, container, pod) (kube_pod_container_status_last_terminated_reason{reason="OOMKilled"}) == 1) > 5
  for: 15m
  labels:
    namespace: kube-system
    severity: info
```

This alert fires if multiple containers have been terminated due to out of
memory conditions in the last 15 minutes. This is something the administrator
should be aware of, but may not require immediate action.

### Alerts, Metrics and Recording Rules Tests

1. Add tests for alerts that validate that:
   - Each alert includes all mandatory fields.
   - Each `runbook_url` link is valid.
   - Each alert that includes a `pod` or a `container` also includes the `namespace`.
2. Add e2e tests that inspect the alerts during upgrade and make sure that the alerts don’t fire when they shouldn’t (Zero noise).
3. Add tests for metrics/recording rules that validate that:
   - Metric / Recording rule exists
   - Metric / Recording rule value is as expected
   - Metric / Recording rule name follows the best practices guideline
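
One possible way to write unit tests for alerts is a `promtool test rules` file, sketched below. It assumes a rule file named `example-operator-rules.yaml` (a hypothetical name) that defines a hypothetical `ExampleOperatorDown` alert with the expression `kube_deployment_status_replicas_available{namespace="example-system", deployment="example-operator"} == 0`, `for: 10m`, and exactly the labels and annotations listed under `exp_labels` and `exp_annotations`. Adjust these to match your own rules.

```yaml
# Run with: promtool test rules example-operator-tests.yaml
rule_files:
  - example-operator-rules.yaml   # hypothetical rule file under test

evaluation_interval: 1m

tests:
  # Case 1: the deployment has had no available replicas for 20 minutes.
  - interval: 1m
    input_series:
      - series: 'kube_deployment_status_replicas_available{namespace="example-system", deployment="example-operator"}'
        values: '0+0x20'
    alert_rule_test:
      - eval_time: 15m            # past the assumed 10m "for" duration
        alertname: ExampleOperatorDown
        exp_alerts:
          - exp_labels:
              severity: warning
              priority: high
              kubernetes_operator_part_of: example-operator
              namespace: example-system
              deployment: example-operator
            exp_annotations:
              summary: The example-operator deployment has no available replicas.
              description: No example-operator replica has been available for 10 minutes, so custom resources are not being reconciled.
              runbook_url: https://example.com/runbooks/ExampleOperatorDown.md

  # Case 2: one replica stays available, so the alert must stay silent.
  - interval: 1m
    input_series:
      - series: 'kube_deployment_status_replicas_available{namespace="example-system", deployment="example-operator"}'
        values: '1+0x20'
    alert_rule_test:
      - eval_time: 15m
        alertname: ExampleOperatorDown
        exp_alerts: []
```

The first case checks that the mandatory labels and annotations are present on the firing alert; the second is a small-scale version of the zero-noise check that the e2e tests perform during upgrades.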

### Test Plan

Automated tests enforcing acceptance criteria for critical alerts and basic style linting will be added to the [openshift/origin](https://github.com/openshift/origin) end-to-end test suite.
The monitoring team will work with anyone shipping existing critical alerts that
don't meet these criteria in order to resolve the issue before enabling the tests.
