docs(readme): Adding docs for high prometheus memory usage, queue size and k8s support matrix #2967
base: main
Conversation
Force-pushed from 677edd1 to 2b65886
> Please consult this when diagnosing issues before diving into Kubernetes directly.
>
> [Zaidan Collection](https://stagdata.long.sumologic.net/ui/#/library/folder/20836981)
This is not accessible
right! I was thinking about support when I added this, but we should probably only add links here that the customer can access.
Yes, this is not internal documentation.
> When memory usage is higher than ~`95%` of the Prometheus container's memory limit (in the above case it's not: none of the containers exceeded `95% * 20 = 19Gi` of used memory), remove WAL from within the container:
I don't think this is the proper solution. I would rather say that the customer should increase memory requests/limits for Prometheus, or reduce the number of metrics being scraped by Prometheus. Removing the WAL is more of a temporary solution.
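For context, a rough sketch of what raising the Prometheus memory requests/limits could look like in user-supplied Helm values. The `kube-prometheus-stack` key path and the sizes are assumptions and depend on the chart version:

```yaml
# user-values.yaml -- sketch only; the values path and sizes are assumptions
kube-prometheus-stack:
  prometheus:
    prometheusSpec:
      resources:
        requests:
          memory: 20Gi
        limits:
          memory: 30Gi   # give Prometheus more headroom instead of clearing the WAL
```

Applied with the usual `helm upgrade ... -f user-values.yaml`.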
should we list removing WAL as the last resort, if increasing memory or reducing metrics isn't feasible?
Yes, especially because it is a temporary solution, or a cleanup step during/after a metrics spike.
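For reference, a minimal last-resort sketch of clearing the WAL from within the container. The pod name, namespace, and data path are assumptions; Prometheus managed by the operator typically stores its data under `/prometheus`:

```sh
# Last resort only: this discards samples that have not yet been flushed to blocks.
# Pod and namespace names are examples; adjust to your release.
kubectl exec -n <namespace> prometheus-<release>-prometheus-0 -c prometheus -- \
  rm -rf /prometheus/wal

# Restart the pod so Prometheus starts with a fresh WAL.
kubectl delete pod -n <namespace> prometheus-<release>-prometheus-0
```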
> ### Otelcol enqueue failures
>
> Enqueue failures happen when otelcol can't write to its persistent queue. They signify data loss, as otelcol is unable to buffer the data locally and is forced to drop it.
Could you add a snippet with an example error message here?
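For illustration, this failure usually surfaces in the otelcol logs as a message along the lines of "Dropping data because sending_queue is full. Try increasing queue_size." (exact wording depends on the collector version). A sketch of raising the queue size on the exporter follows; the exporter name and the numbers are assumptions:

```yaml
# Sketch only: exporter name and sizes are assumptions; verify the
# sending_queue keys against your otelcol version.
exporters:
  sumologic:
    sending_queue:
      enabled: true
      num_consumers: 10
      queue_size: 10000   # a larger queue buffers more data before enqueue failures occur
```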
```diff
@@ -86,7 +86,7 @@ The following table displays the tested Kubernetes and Helm versions.

 | Name          | Version                                  |
 | ------------- | ---------------------------------------- |
-| K8s with EKS  | 1.21<br/>1.22<br/>1.23<br/>1.24          |
+| K8s with EKS  | 1.21<br/>1.22<br/>1.23<br/>1.24<br/>1.25 |
```
@rnishtala-sumo can we move that one out to a separate commit and/or PR? 🙏
@rnishtala-sumo Could you recreate this PR for the SumoLogic docs repo?
Adding docs for high Prometheus memory usage and otelcol enqueue failures
Checklist