Add Operator Observability Best Practices #5975
website/content/en/docs/best-practices/observability-best-practices.md
- What is the unit of measurement? See the [Prometheus base units](https://prometheus.io/docs/practices/naming/#base-units) and [Understanding metrics types](https://prometheus.io/docs/tutorials/understanding_metric_types/#types-of-metrics).
- What does the output mean?
When creating a new metric or recording rule that reports a resource such as a `pod` or `container` name, make sure that the `namespace` label is included so that the resource can be uniquely identified.
**Note:** The `namespace` label is usually populated via service discovery, but in some cases it must be added explicitly; this usually happens with recording rules.
Also, note that Kubernetes has a doc on recommended labels, and the namespace is not one of them; see: https://kubernetes.io/docs/concepts/overview/working-with-objects/common-labels/
Why would we need a label with the namespace?
I hit this specific issue. It is specific to k8s. If you want to identify a container/pod, you must have its namespace, since a pod/container with the same name can live in more than one namespace. To get a 1:1 match you must have the namespace.
@sradco are you suggesting adding a `namespace` label to metrics (so that the user does not necessarily need to add a namespace label/annotation to the K8s resources)?
@Kavinjsir If a metric includes a pod or another resource that is tied to a namespace, and the same name can exist in different namespaces, then the developer should make sure the metric includes the namespace so that the user can identify the correct resource.
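The point made in this thread can be illustrated with a minimal, stdlib-only Go sketch. It does not use the Prometheus client library; the `seriesKey` helper, the metric name, and the pod and namespace names are all hypothetical, and the sketch only shows why a pod name alone is ambiguous as a series identity.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// seriesKey renders a metric name and its labels in the style of the
// Prometheus text format, e.g. name{a="x",b="y"}. Labels are sorted so the
// same label set always yields the same key. Hypothetical helper, for
// illustration only.
func seriesKey(name string, labels map[string]string) string {
	keys := make([]string, 0, len(labels))
	for k := range labels {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	pairs := make([]string, 0, len(keys))
	for _, k := range keys {
		pairs = append(pairs, fmt.Sprintf("%s=%q", k, labels[k]))
	}
	return name + "{" + strings.Join(pairs, ",") + "}"
}

func main() {
	// Two pods with the same name living in different namespaces.
	pods := []struct{ namespace, pod string }{
		{"team-a", "memcached-0"},
		{"team-b", "memcached-0"},
	}
	withoutNS := map[string]float64{}
	withNS := map[string]float64{}
	for _, p := range pods {
		withoutNS[seriesKey("example_pod_restarts_total",
			map[string]string{"pod": p.pod})]++
		withNS[seriesKey("example_pod_restarts_total",
			map[string]string{"namespace": p.namespace, "pod": p.pod})]++
	}
	// Without the namespace label both pods collapse into one series;
	// with it, each pod keeps its own series.
	fmt.Println("series without namespace:", len(withoutNS))
	fmt.Println("series with namespace:", len(withNS))
}
```

In a real operator the same idea would apply to the label set declared on the metric itself or produced by a recording rule.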
### Metrics Guidelines

1. The metric `Help` message should be verbose, since it can be used to create auto-generated documentation, as is done for the [KubeVirt metrics](https://github.com/kubevirt/kubevirt/blob/main/docs/metrics.md), which are generated by the [KubeVirt metrics doc generator](https://github.com/kubevirt/kubevirt/blob/main/tools/doc-generator/doc-generator.go).
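As a rough sketch of why verbose `Help` text pays off, the stdlib-only Go example below renders a Markdown metrics page from a list of metric descriptions. It is not the KubeVirt doc generator; the `metricDoc` type, the `renderDocs` function, and the metric names are hypothetical.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// metricDoc holds the fields a documentation generator typically needs.
// Hypothetical type, for illustration only.
type metricDoc struct {
	Name string
	Type string // e.g. "Counter" or "Gauge"
	Help string
}

// renderDocs produces one Markdown section per metric, sorted by name so
// that the output is stable between runs.
func renderDocs(metrics []metricDoc) string {
	sorted := append([]metricDoc(nil), metrics...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i].Name < sorted[j].Name })

	var b strings.Builder
	b.WriteString("# Operator Metrics\n")
	for _, m := range sorted {
		// The Help text becomes the whole description, so a terse Help
		// message results in terse documentation.
		fmt.Fprintf(&b, "\n### %s\n%s. Type: %s.\n",
			m.Name, strings.TrimSuffix(m.Help, "."), m.Type)
	}
	return b.String()
}

func main() {
	fmt.Print(renderDocs([]metricDoc{
		{
			Name: "example_operator_reconcile_errors_total",
			Type: "Counter",
			Help: "Total number of reconciliation errors, by controller",
		},
		{
			Name: "example_operator_managed_resources",
			Type: "Gauge",
			Help: "Number of custom resources currently managed by the operator",
		},
	}))
}
```

Because the generated page contains nothing beyond the name, type, and `Help` string, the `Help` string is the only place to explain units and meaning.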
@sradco one thing that would be great to add here, or in a third doc that we link to, is some examples of how to work with events (how to raise an event) and how to deal with status conditions.
How to raise an event:
a) When you set up the controller that will raise the event, you need to pass it the recorder; see: https://github.com/kubernetes-sigs/kubebuilder/blob/master/testdata/project-v4-with-deploy-image/main.go#L103
b) You need to add the markers that grant the RBAC permissions allowing your Operator/controller to raise the event, see: https://github.com/kubernetes-sigs/kubebuilder/blob/master/testdata/project-v4-with-deploy-image/controllers/memcached_controller.go#L66 and run make manifests
c) You can check an example of an event being raised in: https://github.com/kubernetes-sigs/kubebuilder/blob/master/testdata/project-v4-with-deploy-image/controllers/memcached_controller.go#L299-L303
How to work with status:
a) It is recommended to use status conditions; see an example: https://github.com/kubernetes-sigs/kubebuilder/blob/7cd3532662567e0a7568415e271f0b29cece202c/testdata/project-v4-with-deploy-image/api/v1alpha1/busybox_types.go#L57-L64
b) Then you can update the status during reconciliation, as in this example: https://github.com/kubernetes-sigs/kubebuilder/blob/7cd3532662567e0a7568415e271f0b29cece202c/testdata/project-v4-with-deploy-image/controllers/busybox_controller.go#L204-L207
c) Note that you also need to add the marker that grants permission to manage the status; see: https://github.com/kubernetes-sigs/kubebuilder/blob/7cd3532662567e0a7568415e271f0b29cece202c/testdata/project-v4-with-deploy-image/controllers/busybox_controller.go#L64
The code examples above are part of the deploy-image plugin; more info: https://book.kubebuilder.io/plugins/deploy-image-plugin-v1-alpha.html
@camilamacedo86 I believe we should add this data to https://book.kubebuilder.io/reference/. We should create an "observability" section and have alerts, metrics, events and logs under it.
I think both of these would be great there.
We can create one section for each, e.g. one for events and another for status conditions, and then link to them.
@camilamacedo86 @sradco just curious, are raising events and having a recorder to capture them relevant to observability? They are generic practices used to capture an event (change) in the controller. Similarly, having a status condition is the recommended best practice for conveying the changes made by the sync loop (or controller) between components. These are definitely useful, but I'm just wondering whether a "monitoring" section is the right place for them, since there we would be talking more about how to monitor and collect metrics in a controller, not about capturing events.
Issues go stale after 90d of inactivity. Mark the issue as fresh by commenting. If this issue is safe to close now please do so.
/lifecycle stale
/remove-lifecycle stale
Waiting for the memcached operator metrics, alerts and runbooks to replace the examples here with the new ones.
@Kavinjsir I updated the document. Please review.
Apart from the official Prometheus documentation, I like that the sections here provide examples of alerting and metrics scoped to operators.
For alerting, I suggest adding guidance on Alertmanager, to deal with alerts more efficiently at run time.
For metrics, I suggest adding guidance on the different usages of k8s events and metrics.
In the future, I would expect this document to cover observability more generally; one suggestion is to check OpenTelemetry for how traces, metrics, and logs are instrumented.
and these volumes are expected to be mostly full as part of normal operation, it's likely
that this will cause unnecessary `KubePersistentVolumeFillingUp` alerts to fire.

You should work to find a solution to avoid triggering these alerts if they are not actionable.
This seems related to inhibition and silences.
I think it is necessary to tell operator authors about these two techniques, since they can greatly improve the efficiency of alerting.
Also, I am not sure whether it would be good to provide a group manifest template focused on alert management for operators.
@Kavinjsir I'll be happy if you can add this part to the doc after it is merged.
I would consider this as an advanced topic.
Yeah, that sounds good.
I just mentioned it to see whether it was worth including.
@Kavinjsir I agree that this is important. Will you be able to add this information to the doc after it is merged please?
In this document we will outline what operators require in order to meet the "Deep Insights" capability level and provide best practices and examples for creating metrics, recording rules and alerts.
Signed-off-by: Shirly Radco <[email protected]>
@sradco Yeah, for sure. This can be a fairly deep topic to discuss; I'd be happy to cover it in a follow-up.
LGTM
/approve
/lgtm
The PR looks good to me! Thanks for the detailed contribution on metrics and alerts - @sradco
It would be nice to get more sets of eyes on this before merging. cc: @everettraven @camilamacedo86 @Kavinjsir @umangachapagain
Signed-off-by: Shirly Radco [email protected]
Description of the change:
In this document we provide best practices and examples for creating metrics, recording rules and alerts.
Motivation for the change:
This best practices guide is meant to help operator developers who want to add or improve their operator's observability.
By following these guidelines, developers should gain a clear understanding of the different observability-related components and how to implement them correctly.
It will also give end users a better experience when they need to use the operator's metrics and alerts.
Checklist
If the pull request includes user-facing changes, extra documentation is required:
- `changelog/fragments` (see `changelog/fragments/00-template.yaml`)
- `website/content/en/docs`