
Add Operator Observability Best Practices #5975

Merged

Conversation

Contributor

@sradco sradco commented Aug 7, 2022

Signed-off-by: Shirly Radco [email protected]

Description of the change:
In this document we outline what operators require in order to meet the "Deep Insights" capability level, and provide best practices and examples for creating metrics, recording rules, and alerts.

Motivation for the change:
This best practices guide is meant to help operator developers who want to add or improve their operator's observability.
By following these guidelines, developers should gain a clear understanding of the different observability-related components and how to implement them correctly.
It will also give end users a better experience when they need to work with the operator's metrics and alerts.

Checklist

If the pull request includes user-facing changes, extra documentation is required:

@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Aug 7, 2022
@sradco sradco force-pushed the create_monitoring_best_practices branch from 3bfca0a to 393a6e0 Compare August 8, 2022 08:13
- What is the unit of measurement? See the [Prometheus base units](https://prometheus.io/docs/practices/naming/#base-units) and [Understanding metrics types](https://prometheus.io/docs/tutorials/understanding_metric_types/#types-of-metrics).
- What does the output mean?
When creating a new metric or recording rule that reports on a resource like a `pod` or a `container` name, please make sure that the `namespace` is included, so that the resource can be uniquely identified (see the sketch below).
**Note:** Usually the `namespace` label is populated via service discovery, but there are cases where it should be added explicitly; this usually happens for recording rules.
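By way of illustration, here is a minimal sketch of such a metric in Go, assuming an operator built on controller-runtime and the Prometheus client library; the metric name `myoperator_managed_pod_restart_count` and its labels are hypothetical, not part of the document:

```go
package metrics

import (
	"github.com/prometheus/client_golang/prometheus"
	ctrlmetrics "sigs.k8s.io/controller-runtime/pkg/metrics"
)

// Including "namespace" in the label set ensures that pods with the same
// name in different namespaces produce distinct, identifiable series.
var podRestartCount = prometheus.NewGaugeVec(
	prometheus.GaugeOpts{
		Name: "myoperator_managed_pod_restart_count",
		Help: "Number of restarts of a pod managed by the operator, labeled by pod name and namespace.",
	},
	[]string{"namespace", "pod"},
)

func init() {
	// Register with the controller-runtime global registry so the metric
	// is exposed on the operator's /metrics endpoint.
	ctrlmetrics.Registry.MustRegister(podRestartCount)
}
```

A caller would then record a value with `podRestartCount.WithLabelValues(pod.Namespace, pod.Name).Set(float64(restarts))`, giving a 1:1 match between the series and the pod.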
Contributor

Also, note that Kubernetes has a doc about label good practices, and the namespace is not part of them; see: https://kubernetes.io/docs/concepts/overview/working-with-objects/common-labels/

Why would we need to have a label with the namespace?

Contributor Author

I hit this specific issue. It is specific to Kubernetes: if you want to identify a container/pod, you must have its namespace, since a pod/container with the same name can live in more than one namespace. To get a 1:1 match you must have the namespace.


@sradco are you suggesting labelling metrics with the namespace? (Where the user does not necessarily need to add a namespace label/annotation on the Kubernetes resources themselves.)

Contributor Author

@sradco sradco Dec 4, 2022

@Kavinjsir If a metric includes a pod or another resource that is tied to a namespace, and its name can be the same in different namespaces, then the developer should make sure the metric includes the namespace, so that the user can identify the correct resource in question.

### Metrics Guidelines

1. Metrics `Help` messages should be verbose, since they can be used to create auto-generated documentation, as is done for example with the [KubeVirt metrics](https://github.com/kubevirt/kubevirt/blob/main/docs/metrics.md) generated by the [KubeVirt metrics doc generator](https://github.com/kubevirt/kubevirt/blob/main/tools/doc-generator/doc-generator.go). A sketch of such a `Help` string follows.
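For instance, here is a hedged sketch of a counter whose `Help` text is verbose enough to be lifted into generated docs; the name `myoperator_reconcile_failures_total` and the wording are illustrative:

```go
package metrics

import "github.com/prometheus/client_golang/prometheus"

// The Help text below is written as documentation: it states what the
// metric counts and how a reader should interpret movement in it.
var reconcileFailures = prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Name: "myoperator_reconcile_failures_total",
		Help: "Total number of reconciliation failures, per controller. " +
			"A steadily increasing value usually points to a persistent " +
			"problem, such as a misconfigured custom resource or missing " +
			"RBAC permissions, rather than a transient API error.",
	},
	[]string{"controller"},
)
```

It would be registered the same way as in the previous sketch, via the controller-runtime metrics registry.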

Contributor

@sradco one thing that would be great to add here, or in a third doc that we link to, is some examples of how to work with events (how to raise an event) and how to deal with status conditions.

How to raise an event:

a) You need to pass in the recorder when you set up the controller from which the event will be raised; see: https://github.com/kubernetes-sigs/kubebuilder/blob/master/testdata/project-v4-with-deploy-image/main.go#L103

b) You need to add the markers that grant the RBAC permissions allowing your operator/controller to raise the event (see: https://github.com/kubernetes-sigs/kubebuilder/blob/master/testdata/project-v4-with-deploy-image/controllers/memcached_controller.go#L66) and run `make manifests`.

c) You can check an example of the event being raised in: https://github.com/kubernetes-sigs/kubebuilder/blob/master/testdata/project-v4-with-deploy-image/controllers/memcached_controller.go#L299-L303 (a sketch combining these steps follows below).
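Putting the three steps together, a minimal sketch: the `MemcachedReconciler` name follows the linked testdata, but the use of a `Pod` as the reconciled object is a stand-in for the actual custom resource type.

```go
package controllers

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/tools/record"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// Recorder is wired in from main.go via
// mgr.GetEventRecorderFor("memcached-controller") (step a).
type MemcachedReconciler struct {
	client.Client
	Recorder record.EventRecorder
}

// Step b: this marker generates the RBAC rule that lets the controller
// create and patch Events; remember to run `make manifests` afterwards.
//+kubebuilder:rbac:groups=core,resources=events,verbs=create;patch

func (r *MemcachedReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	obj := &corev1.Pod{} // stand-in for the reconciled custom resource
	if err := r.Get(ctx, req.NamespacedName, obj); err != nil {
		return ctrl.Result{}, client.IgnoreNotFound(err)
	}
	// Step c: raise a Normal event attached to the reconciled object.
	r.Recorder.Event(obj, corev1.EventTypeNormal, "Reconciled",
		"Resources for the custom resource were reconciled successfully")
	return ctrl.Result{}, nil
}
```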

How to work with status:

a) It is recommended to use status conditions; see an example: https://github.com/kubernetes-sigs/kubebuilder/blob/7cd3532662567e0a7568415e271f0b29cece202c/testdata/project-v4-with-deploy-image/api/v1alpha1/busybox_types.go#L57-L64

b) Then you can update the status during reconciliation, as is done here: https://github.com/kubernetes-sigs/kubebuilder/blob/7cd3532662567e0a7568415e271f0b29cece202c/testdata/project-v4-with-deploy-image/controllers/busybox_controller.go#L204-L207

c) Note that you also need to add the marker that grants the permissions to manage the status (see the sketch below): https://github.com/kubernetes-sigs/kubebuilder/blob/7cd3532662567e0a7568415e271f0b29cece202c/testdata/project-v4-with-deploy-image/controllers/busybox_controller.go#L64

The above code examples are part of the deploy-image plugin; more info: https://book.kubebuilder.io/plugins/deploy-image-plugin-v1-alpha.html
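A condensed sketch of the condition-update step, assuming the CR's status struct carries a `Conditions []metav1.Condition` field as in the linked `busybox_types.go`; the helper name `markAvailable` is hypothetical:

```go
package controllers

import (
	"context"

	"k8s.io/apimachinery/pkg/api/meta"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// markAvailable is a hypothetical helper: obj is the reconciled CR and
// conditions points at its Status.Conditions slice.
func markAvailable(ctx context.Context, c client.Client, obj client.Object, conditions *[]metav1.Condition) error {
	// SetStatusCondition inserts the condition or updates it in place,
	// bumping LastTransitionTime only when the Status value changes.
	meta.SetStatusCondition(conditions, metav1.Condition{
		Type:    "Available",
		Status:  metav1.ConditionTrue,
		Reason:  "Reconciled",
		Message: "Deployment for the custom resource is up to date",
	})
	// Write through the status subresource, not a regular Update.
	return c.Status().Update(ctx, obj)
}
```

Writing through `Status().Update` targets the `/status` subresource, which is exactly what the RBAC marker in (c) authorizes.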

Contributor Author

@camilamacedo86 I believe we should add this data to https://book.kubebuilder.io/reference/. We should create an "observability" section and have alerts, metrics, events and logs under it.

Contributor

I think both of these would be great there.
We can create one section for each, e.g. events and status conditions, and then we can link to them.

Member

@camilamacedo86 @sradco just curious, are raising events and having a recorder to capture the event relevant to observability? They are generic practices used to capture an event (a change) in the controller. Similarly, having a status condition is the recommended best practice for conveying the change made by the sync loop (or controller) between components. These are definitely useful, but I'm just wondering whether the "monitoring" section is the right place for them, since we would be talking more about how to monitor and collect metrics in a controller, not about capturing events.

@sradco sradco force-pushed the create_monitoring_best_practices branch from 393a6e0 to 0c60979 Compare August 9, 2022 10:00
@sradco sradco force-pushed the create_monitoring_best_practices branch 2 times, most recently from 314e064 to af8e638 Compare August 22, 2022 09:09
@openshift-bot

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

@openshift-ci openshift-ci bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Nov 21, 2022
@sradco
Contributor Author

sradco commented Nov 21, 2022

/remove-lifecycle stale

Waiting for the memcached operator metrics, alerts, and runbooks to replace the examples here with the new ones.

@openshift-ci openshift-ci bot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Nov 21, 2022
@sradco sradco force-pushed the create_monitoring_best_practices branch 2 times, most recently from 7c4f080 to f09f19a Compare December 4, 2022 12:52
@sradco
Contributor Author

sradco commented Dec 4, 2022

@Kavinjsir I updated the document. Please review.


@Kavinjsir Kavinjsir left a comment


Apart from the official Prometheus documentation, I like that the sections here provide examples of alerting and metrics scoped to the operator.

For alerting, I suggest adding instructions on Alertmanager, to handle alerting more efficiently at runtime.

For metrics, I suggest adding instructions on the different usages of Kubernetes events and metrics.

In the future, I would expect this document to become more general observability guidance; one suggestion would be to look at OpenTelemetry for how traces, metrics, and logs are instrumented.

and these volumes are expected to be mostly full as part of normal operation, it's likely
that this will cause unnecessary `KubePersistentVolumeFillingUp` alerts to fire.

You should work to find a solution to avoid triggering these alerts if they are not actionable.


This seems related to inhibition and silences.

I think it is necessary to tell operator authors about these two techniques, since they can greatly improve the efficiency of alerting.
Also, I'm not sure whether it would be good to provide a group manifest template focused on alert management for operators.

Contributor Author

@Kavinjsir I'll be happy if you can add this part to the doc after it is merged.
I would consider this an advanced topic.


Yeah, that sounds good.
I just mentioned it to see whether it was worth covering.

@sradco sradco force-pushed the create_monitoring_best_practices branch 4 times, most recently from 9d79a8f to 1a17fcc Compare December 6, 2022 12:12
@sradco
Contributor Author

sradco commented Dec 6, 2022

> @camilamacedo86 @sradco just curious, are raising events and having a recorder to capture the event relevant to observability? They are generic practices used to capture an event (a change) in the controller. Similarly, having a status condition is the recommended best practice for conveying the change made by the sync loop (or controller) between components. These are definitely useful, but I'm just wondering whether the "monitoring" section is the right place for them, since we would be talking more about how to monitor and collect metrics in a controller, not about capturing events.

@varshaprasad96 Events are part of the level 4 capabilities, but they are not really part of this best practices doc. I can remove the reference completely or keep the link to the other doc.

> @sradco I personally think it may be worthwhile to also discuss the difference between instrumenting operators through metrics vs. Kubernetes events here. I guess some information is more suitable to be "monitored" in the form of Kubernetes events. For instance, the native Kubernetes APIs provide events to observe scaling status. As @varshaprasad96 mentioned, this is also strongly related to the Kubernetes status subresource.
>
> Apparently, this information can also be instrumented through metrics. A common case is that, by having an event, it may not be necessary to additionally define metrics for that part; some solutions may be:
>
> - introducing kube-state-metrics to generate metrics based on the APIs' native status, without additional metric definitions.
> - using logging to record and aggregate events.
> - using 3rd-party tools to collect events, such as k8s-event-exporter.
>
> In short, when observing a CR, it may be good to audit events for its status, and to define metrics for other perspectives such as CPU, memory, disk, reconciliation seconds, ...

@Kavinjsir I agree that this is important. Will you be able to add this information to the doc after it is merged, please?

@sradco sradco force-pushed the create_monitoring_best_practices branch 3 times, most recently from f9e0f04 to 779c404 Compare December 6, 2022 14:26
In this document we will outline what operators
require in order to meet the "Deep Insights"
capability level and provide best practices and
examples for creating metrics, recording rules and alerts.

Signed-off-by: Shirly Radco <[email protected]>
@sradco sradco force-pushed the create_monitoring_best_practices branch from 779c404 to d751f5b Compare December 6, 2022 14:30
@Kavinjsir

> @Kavinjsir I agree that this is important. Will you be able to add this information to the doc after it is merged, please?

@sradco Yeah, for sure. This can be a somewhat deep topic to discuss; I'd be happy to go through it in a follow-up.

Contributor

@avlitman avlitman left a comment
LGTM

Member

@varshaprasad96 varshaprasad96 left a comment

/approve
/lgtm
The PR looks good to me! Thanks for the detailed contribution on metrics and alerts, @sradco.

It would be nice to get more sets of eyes on this before merging. cc: @everettraven @camilamacedo86 @Kavinjsir @umangachapagain

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Dec 7, 2022
@varshaprasad96 varshaprasad96 added this to the v1.27.0 milestone Dec 7, 2022
@jberkhahn jberkhahn merged commit e0f197c into operator-framework:master Dec 7, 2022