
Add Operator Observability Best Practices #5975

Merged

Conversation

Contributor

@sradco sradco commented Aug 7, 2022

Signed-off-by: Shirly Radco [email protected]

Description of the change:
In this document we outline what operators require in order to meet the "Deep Insights" capability level, and provide best practices and examples for creating metrics, recording rules, and alerts.

Motivation for the change:
This best practices guide is meant to help operator developers who want to add or improve their operator's observability.
By following these guidelines, developers should gain a clear understanding of the different observability-related components and how to implement them correctly.
It will also give end users a better experience when they need to work with the operator's metrics and alerts.

Checklist

If the pull request includes user-facing changes, extra documentation is required:

@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Aug 7, 2022
@sradco sradco force-pushed the create_monitoring_best_practices branch from 3bfca0a to 393a6e0 Compare August 8, 2022 08:13
- What is the unit of measurement? See the [Prometheus base units](https://prometheus.io/docs/practices/naming/#base-units) and [Understanding metrics types](https://prometheus.io/docs/tutorials/understanding_metric_types/#types-of-metrics).
- What does the output mean?
When creating a new metric or recording rule that reports on a resource like a `pod` or a `container` name, please make sure that the `namespace` is included, so that the resource can be uniquely identified (see the sketch below).
**Note:** Usually the `namespace` label is populated via service discovery, but there are cases where it should be added explicitly; this usually happens for recording rules.
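By way of illustration, here is a minimal sketch of such a metric in Go, assuming an operator built on controller-runtime and the Prometheus client library; the metric name `myoperator_managed_pod_restart_count` and its labels are hypothetical, not part of the document:

```go
package metrics

import (
	"github.com/prometheus/client_golang/prometheus"
	ctrlmetrics "sigs.k8s.io/controller-runtime/pkg/metrics"
)

// Including "namespace" in the label set ensures that pods with the same
// name in different namespaces produce distinct, identifiable series.
var podRestartCount = prometheus.NewGaugeVec(
	prometheus.GaugeOpts{
		Name: "myoperator_managed_pod_restart_count",
		Help: "Number of restarts of a pod managed by the operator, labeled by pod name and namespace.",
	},
	[]string{"namespace", "pod"},
)

func init() {
	// Register with the controller-runtime global registry so the metric
	// is exposed on the operator's /metrics endpoint.
	ctrlmetrics.Registry.MustRegister(podRestartCount)
}
```

A caller would then record a value with `podRestartCount.WithLabelValues(pod.Namespace, pod.Name).Set(float64(restarts))`, giving a 1:1 match between the series and the pod.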
Contributor

Also, note that Kubernetes has a doc about label good practices, and the namespace is not part of them; see: https://kubernetes.io/docs/concepts/overview/working-with-objects/common-labels/

Why would we need to have a label with the namespace?

Contributor Author

I hit this specific issue. It is specific to Kubernetes: if you want to identify a container/pod, you must have its namespace, since a pod/container with the same name can live in more than one namespace. To get a 1:1 match you must have the namespace.


@sradco are you suggesting labelling metrics with the namespace? (Where the user does not necessarily need to add a namespace label/annotation on the Kubernetes resources themselves.)

Contributor Author

@sradco sradco Dec 4, 2022

@Kavinjsir If a metric includes a pod or another resource that is tied to a namespace, and its name can be the same in different namespaces, then the developer should make sure the metric includes the namespace, so that the user can identify the correct resource in question.

### Metrics Guidelines

1. Metrics `Help` messages should be verbose, since they can be used to create auto-generated documentation, as is done for example with the [KubeVirt metrics](https://github.com/kubevirt/kubevirt/blob/main/docs/metrics.md) generated by the [KubeVirt metrics doc generator](https://github.com/kubevirt/kubevirt/blob/main/tools/doc-generator/doc-generator.go). A sketch of such a `Help` string follows.
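For instance, here is a hedged sketch of a counter whose `Help` text is verbose enough to be lifted into generated docs; the name `myoperator_reconcile_failures_total` and the wording are illustrative:

```go
package metrics

import "github.com/prometheus/client_golang/prometheus"

// The Help text below is written as documentation: it states what the
// metric counts and how a reader should interpret movement in it.
var reconcileFailures = prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Name: "myoperator_reconcile_failures_total",
		Help: "Total number of reconciliation failures, per controller. " +
			"A steadily increasing value usually points to a persistent " +
			"problem, such as a misconfigured custom resource or missing " +
			"RBAC permissions, rather than a transient API error.",
	},
	[]string{"controller"},
)
```

It would be registered the same way as in the previous sketch, via the controller-runtime metrics registry.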

Contributor

@sradco one thing that would be great to add here, or in a third doc that we link to, is some examples of how to work with events (how to raise an event) and how to deal with status conditions.

How to raise an event:

a) You need to pass in the recorder when you set up the controller from which the event will be raised; see: https://github.com/kubernetes-sigs/kubebuilder/blob/master/testdata/project-v4-with-deploy-image/main.go#L103

b) You need to add the markers that grant the RBAC permissions allowing your operator/controller to raise the event (see: https://github.com/kubernetes-sigs/kubebuilder/blob/master/testdata/project-v4-with-deploy-image/controllers/memcached_controller.go#L66) and run `make manifests`.

c) You can check an example of the event being raised in: https://github.com/kubernetes-sigs/kubebuilder/blob/master/testdata/project-v4-with-deploy-image/controllers/memcached_controller.go#L299-L303 (a sketch combining these steps follows below).
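Putting the three steps together, a minimal sketch: the `MemcachedReconciler` name follows the linked testdata, but the use of a `Pod` as the reconciled object is a stand-in for the actual custom resource type.

```go
package controllers

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/tools/record"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// Recorder is wired in from main.go via
// mgr.GetEventRecorderFor("memcached-controller") (step a).
type MemcachedReconciler struct {
	client.Client
	Recorder record.EventRecorder
}

// Step b: this marker generates the RBAC rule that lets the controller
// create and patch Events; remember to run `make manifests` afterwards.
//+kubebuilder:rbac:groups=core,resources=events,verbs=create;patch

func (r *MemcachedReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	obj := &corev1.Pod{} // stand-in for the reconciled custom resource
	if err := r.Get(ctx, req.NamespacedName, obj); err != nil {
		return ctrl.Result{}, client.IgnoreNotFound(err)
	}
	// Step c: raise a Normal event attached to the reconciled object.
	r.Recorder.Event(obj, corev1.EventTypeNormal, "Reconciled",
		"Resources for the custom resource were reconciled successfully")
	return ctrl.Result{}, nil
}
```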

How to work with status:

a) It is recommended to use status conditions; see an example: https://github.com/kubernetes-sigs/kubebuilder/blob/7cd3532662567e0a7568415e271f0b29cece202c/testdata/project-v4-with-deploy-image/api/v1alpha1/busybox_types.go#L57-L64

b) Then you can update the status during reconciliation, as is done here: https://github.com/kubernetes-sigs/kubebuilder/blob/7cd3532662567e0a7568415e271f0b29cece202c/testdata/project-v4-with-deploy-image/controllers/busybox_controller.go#L204-L207

c) Note that you also need to add the marker that grants the permissions to manage the status (see the sketch below): https://github.com/kubernetes-sigs/kubebuilder/blob/7cd3532662567e0a7568415e271f0b29cece202c/testdata/project-v4-with-deploy-image/controllers/busybox_controller.go#L64

The above code examples are part of the deploy-image plugin; more info: https://book.kubebuilder.io/plugins/deploy-image-plugin-v1-alpha.html
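A condensed sketch of the condition-update step, assuming the CR's status struct carries a `Conditions []metav1.Condition` field as in the linked `busybox_types.go`; the helper name `markAvailable` is hypothetical:

```go
package controllers

import (
	"context"

	"k8s.io/apimachinery/pkg/api/meta"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// markAvailable is a hypothetical helper: obj is the reconciled CR and
// conditions points at its Status.Conditions slice.
func markAvailable(ctx context.Context, c client.Client, obj client.Object, conditions *[]metav1.Condition) error {
	// SetStatusCondition inserts the condition or updates it in place,
	// bumping LastTransitionTime only when the Status value changes.
	meta.SetStatusCondition(conditions, metav1.Condition{
		Type:    "Available",
		Status:  metav1.ConditionTrue,
		Reason:  "Reconciled",
		Message: "Deployment for the custom resource is up to date",
	})
	// Write through the status subresource, not a regular Update.
	return c.Status().Update(ctx, obj)
}
```

Writing through `Status().Update` targets the `/status` subresource, which is exactly what the RBAC marker in (c) authorizes.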

Contributor Author

@camilamacedo86 I believe we should add this data to https://book.kubebuilder.io/reference/. We should create an "observability" section and have alerts, metrics, events and logs under it.

Contributor

I think both of these would be great there.
We can create one section for each, e.g. events and status conditions, and then we can link to them.

Member

@camilamacedo86 @sradco just curious, are raising events and having a recorder to capture the event relevant to observability? They are generic practices used to capture an event (a change) in the controller. Similarly, having a status condition is the recommended best practice for conveying the change made by the sync loop (or controller) between components. These are definitely useful, but I'm just wondering whether the "monitoring" section is the right place for them, since we would be talking more about how to monitor and collect metrics in a controller, not about capturing events.

@sradco sradco force-pushed the create_monitoring_best_practices branch from 393a6e0 to 0c60979 Compare August 9, 2022 10:00
@sradco sradco force-pushed the create_monitoring_best_practices branch 2 times, most recently from 314e064 to af8e638 Compare August 22, 2022 09:09
@openshift-bot

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

@openshift-ci openshift-ci bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Nov 21, 2022
@sradco
Contributor Author

sradco commented Nov 21, 2022

/remove-lifecycle stale

Waiting for the memcached operator metrics, alerts, and runbooks to replace the examples here with the new ones.

@openshift-ci openshift-ci bot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Nov 21, 2022
@sradco sradco force-pushed the create_monitoring_best_practices branch 2 times, most recently from 7c4f080 to f09f19a Compare December 4, 2022 12:52
@sradco
Contributor Author

sradco commented Dec 4, 2022

@Kavinjsir I updated the document. Please review.


@Kavinjsir Kavinjsir left a comment


Apart from the official Prometheus documentation, I like that the sections here provide examples of alerting and metrics scoped to the operator.

For alerting, I suggest adding instructions on Alertmanager, to handle alerting more efficiently at runtime.

For metrics, I suggest adding instructions on the different usages of Kubernetes events and metrics.

In the future, I would expect this document to become more general observability guidance; one suggestion would be to look at OpenTelemetry for how traces, metrics, and logs are instrumented.

and these volumes are expected to be mostly full as part of normal operation, it's likely
that this will cause unnecessary `KubePersistentVolumeFillingUp` alerts to fire.

You should work to find a solution to avoid triggering these alerts if they are not actionable.


This seems related to inhibition and silences.

I think it is necessary to tell operator authors about these two techniques, since they can greatly improve the efficiency of alerting.
Also, I'm not sure whether it would be good to provide a group manifest template focused on alert management for operators.

Contributor Author

@Kavinjsir I'll be happy if you can add this part to the doc after it is merged.
I would consider this an advanced topic.


Yeah, that sounds good.
I just mentioned it to see whether it was worth covering.

@sradco sradco force-pushed the create_monitoring_best_practices branch 4 times, most recently from 9d79a8f to 1a17fcc Compare December 6, 2022 12:12
@sradco
Contributor Author

sradco commented Dec 6, 2022

> @camilamacedo86 @sradco just curious, are raising events and having a recorder to capture the event relevant to observability? They are generic practices used to capture an event (a change) in the controller. Similarly, having a status condition is the recommended best practice for conveying the change made by the sync loop (or controller) between components. These are definitely useful, but I'm just wondering whether the "monitoring" section is the right place for them, since we would be talking more about how to monitor and collect metrics in a controller, not about capturing events.

@varshaprasad96 Events are part of the level 4 capabilities, but they are not really part of this best practices doc. I can remove the reference completely or keep the link to the other doc.

> @sradco I personally think it may be worthwhile to also discuss the difference between instrumenting operators through metrics vs. Kubernetes events here. I guess some information is more suitable to be "monitored" in the form of Kubernetes events. For instance, the native Kubernetes APIs provide events to observe scaling status. As @varshaprasad96 mentioned, this is also strongly related to the Kubernetes status subresource.
>
> Apparently, this information can also be instrumented through metrics. A common case is that, by having an event, it may not be necessary to additionally define metrics for that part; some solutions may be:
>
> - introducing kube-state-metrics to generate metrics based on the APIs' native status, without additional metric definitions.
> - using logging to record and aggregate events.
> - using 3rd-party tools to collect events, such as k8s-event-exporter.
>
> In short, when observing a CR, it may be good to audit events for its status, and to define metrics for other perspectives such as CPU, memory, disk, reconciliation seconds, ...

@Kavinjsir I agree that this is important. Will you be able to add this information to the doc after it is merged, please?

@sradco sradco force-pushed the create_monitoring_best_practices branch 3 times, most recently from f9e0f04 to 779c404 Compare December 6, 2022 14:26
In this document we will outline what operators
require in order to meet the "Deep Insights"
capability level and provide best practices and
examples for creating metrics, recording rules and alerts.

Signed-off-by: Shirly Radco <[email protected]>
@sradco sradco force-pushed the create_monitoring_best_practices branch from 779c404 to d751f5b Compare December 6, 2022 14:30
@Kavinjsir

> @Kavinjsir I agree that this is important. Will you be able to add this information to the doc after it is merged, please?

@sradco Yeah, for sure. This can be a somewhat deep topic to discuss; I'd be happy to go through it in a follow-up.

Contributor

@avlitman avlitman left a comment
LGTM

Member

@varshaprasad96 varshaprasad96 left a comment

/approve
/lgtm
The PR looks good to me! Thanks for the detailed contribution on metrics and alerts, @sradco.

It would be nice to get more sets of eyes on this before merging. cc: @everettraven @camilamacedo86 @Kavinjsir @umangachapagain

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Dec 7, 2022
@varshaprasad96 varshaprasad96 added this to the v1.27.0 milestone Dec 7, 2022
@jberkhahn jberkhahn merged commit e0f197c into operator-framework:master Dec 7, 2022