docs(readme): Adding docs for high prometheus memory usage, queue size and k8s support matrix #2967

Open · wants to merge 1 commit into base: main
2 changes: 1 addition & 1 deletion docs/README.md
@@ -86,7 +86,7 @@ The following table displays the tested Kubernetes and Helm versions.

| Name | Version |
| ------------- | ---------------------------------------- |
| K8s with EKS | 1.21<br/>1.22<br/>1.23<br/>1.24 |
| K8s with EKS | 1.21<br/>1.22<br/>1.23<br/>1.24<br/>1.25 |
Contributor:
@rnishtala-sumo can we move that one out to a separate commit and/or PR? 🙏

| K8s with Kops | 1.22<br/>1.23<br/>1.24<br/>1.25 |
| K8s with GKE | 1.21<br/>1.22<br/>1.23<br/>1.24<br/>1.25 |
| K8s with AKS | 1.23<br/>1.24<br/>1.25 |
92 changes: 92 additions & 0 deletions docs/troubleshoot-collection.md
@@ -4,6 +4,7 @@

- [Troubleshooting Installation](#troubleshooting-installation)
- [Namespace configuration](#namespace-configuration)
- [Collection Dashboards](#collection-dashboards)
- [Collecting logs](#collecting-logs)
- [Check log throttling](#check-log-throttling)
- [Check ingest budget limits](#check-ingest-budget-limits)
@@ -15,6 +16,7 @@
- [Check the `/metrics` endpoint for Kubernetes services](#check-the-metrics-endpoint-for-kubernetes-services)
- [Check the Prometheus UI](#check-the-prometheus-ui)
- [Check Prometheus Remote Storage](#check-prometheus-remote-storage)
- [Check Prometheus memory usage](#check-prometheus-memory-usage)
- [Common Issues](#common-issues)
- [Missing metrics - cannot see cluster in Explore](#missing-metrics---cannot-see-cluster-in-explore)
- [Pod stuck in `ContainerCreating` state](#pod-stuck-in-containercreating-state)
@@ -27,6 +29,7 @@
- [Falco and Google Kubernetes Engine (GKE)](#falco-and-google-kubernetes-engine-gke)
- [Falco and OpenShift](#falco-and-openshift)
- [Out of memory (OOM) failures for Prometheus Pod](#out-of-memory-oom-failures-for-prometheus-pod)
- [Otelcol enqueue failures](#otelcol-enqueue-failures)
- [Prometheus: server returned HTTP status 404 Not Found: 404 page not found](#prometheus-server-returned-http-status-404-not-found-404-page-not-found)
- [OpenTelemetry: dial tcp: lookup collection-sumologic-metadata-logs.sumologic.svc.cluster.local.: device or resource busy](#opentelemetry-dial-tcp-lookup-collection-sumologic-metadata-logssumologicsvcclusterlocal-device-or-resource-busy)

@@ -49,6 +52,12 @@ To set your namespace context more permanently, you can run
kubectl config set-context $(kubectl config current-context) --namespace=sumologic
```

## Collection Dashboards

Please consult these dashboards when diagnosing issues, before diving into Kubernetes directly.

[Zaidan Collection](https://stagdata.long.sumologic.net/ui/#/library/folder/20836981)
Contributor:
This is not accessible

Contributor Author:
right! I was thinking about support when I added this, but we should probably only add links here that the customer can access.

Contributor (@sumo-drosiek, Apr 3, 2023):
Yes, this is not internal documentation.


## Collecting logs

If you cannot see logs in Sumo that you expect to be there, here are the things to check.
@@ -239,6 +248,63 @@ You [check Prometheus logs](#prometheus-logs) to verify there are no errors duri

You can also check `prometheus_remote_storage_.*` metrics to look for success/failure attempts.
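
For instance, a hedged pair of queries (metric names can differ slightly between Prometheus versions) that compare failed vs. total samples sent to remote storage:

```bash
# A sketch, not chart documentation: port-forward the Prometheus UI and query its remote-write metrics.
# Assumes the default Prometheus port 9090 and the metric names used by recent Prometheus versions.
kubectl -n <DEP>-collections port-forward pod/prometheus-<DEP>-collection-prometheus-<POD_SUFFIX> 9090:9090 &

# Failed vs. total samples pushed to remote storage over the last 5 minutes:
curl -s 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=rate(prometheus_remote_storage_samples_failed_total[5m])'
curl -s 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=rate(prometheus_remote_storage_samples_total[5m])'
```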

### Check Prometheus memory usage

Verify memory usage with `kubectl`. First get the actual usage with:

```
kubectl top pod -n <DEP>-collections -l app.kubernetes.io/name=prometheus
NAME CPU(cores) MEMORY(bytes)
prometheus-stag-collection-prometheus-0-state-0 89m 2853Mi
prometheus-stag-collection-prometheus-1-controller-0 146m 2991Mi
prometheus-stag-collection-prometheus-2-kubelet-0 82m 2845Mi
prometheus-stag-collection-prometheus-3-container-0 131m 3045Mi
prometheus-stag-collection-prometheus-4-container-0 102m 2817Mi
prometheus-stag-collection-prometheus-5-node-0 133m 2989Mi
prometheus-stag-collection-prometheus-6-operator-r-0 182m 2968Mi
prometheus-stag-collection-prometheus-7-prometheus-0 114m 3283Mi
prometheus-stag-collection-prometheus-8-control-pl-0 121m 3017Mi
prometheus-stag-collection-prometheus-9-sumo-0 155m 2953Mi
prometheus-stag-collection-prometheus-assembly-0 50m 821Mi
```

Now get the memory limits of prometheus containers from the pods:

```
kubectl get pod -n <DEP>-collections -l app.kubernetes.io/name=prometheus -o "custom-columns=NAME:.metadata.name,CONTAINERS:.spec.containers[*].name,MEM_LIMIT:.spec.containers[*].resources.limits.memory"
NAME CONTAINERS MEM_LIMIT
prometheus-stag-collection-prometheus-0-state-0 prometheus,config-reloader 20Gi,50Mi
prometheus-stag-collection-prometheus-1-controller-0 prometheus,config-reloader 20Gi,50Mi
prometheus-stag-collection-prometheus-2-kubelet-0 prometheus,config-reloader 20Gi,50Mi
prometheus-stag-collection-prometheus-3-container-0 prometheus,config-reloader 20Gi,50Mi
prometheus-stag-collection-prometheus-4-container-0 prometheus,config-reloader 20Gi,50Mi
prometheus-stag-collection-prometheus-5-node-0 prometheus,config-reloader 20Gi,50Mi
prometheus-stag-collection-prometheus-6-operator-r-0 prometheus,config-reloader 20Gi,50Mi
prometheus-stag-collection-prometheus-7-prometheus-0 prometheus,config-reloader 20Gi,50Mi
prometheus-stag-collection-prometheus-8-control-pl-0 prometheus,config-reloader 20Gi,50Mi
prometheus-stag-collection-prometheus-9-sumo-0 prometheus,config-reloader 20Gi,50Mi
prometheus-stag-collection-prometheus-assembly-0 prometheus,config-reloader 20Gi,50Mi
```

Note that pods can contain more than one container, so in this example the prometheus containers have a 20Gi memory limit and the config-reloader
containers have a 50Mi memory limit.

When memory usage is higher than ~`95%` of the Prometheus container's memory limit (in the example above it isn't: none of the containers exceeded
`95% * 20Gi = 19Gi` of used memory), remove the WAL from within the container:
Comment on lines +292 to +293
Contributor:
I don't think this is the proper solution. I would rather say that the customer should increase memory requests/limits for Prometheus or reduce the number of metrics being scraped by Prometheus. Removing the WAL is more of a temporary solution.

Contributor Author:
Should we list removing the WAL as a last resort, if increasing memory or reducing metrics isn't feasible?

Contributor:
Yes, especially because it is a temporary solution, or a cleanup after/during a metrics spike.


```bash
kubectl -n <DEP>-collections \
exec -t prometheus-<DEP>-collection-prometheus-<POD_SUFFIX> \
-c prometheus -- sh -c "rm -rf /prometheus/*"
```

and restart it:

```bash
kubectl -n <DEP>-collections \
delete pod prometheus-<DEP>-collection-prometheus-<POD_SUFFIX>
```
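
As the review discussion above points out, removing the WAL is only a temporary measure; the longer-term fix is to increase Prometheus memory requests/limits or to reduce the number of metrics Prometheus scrapes. A hedged sketch of raising the limits via Helm, assuming a release installed from the `sumologic/sumologic` chart with the bundled kube-prometheus-stack subchart (adjust key paths and values to your chart version):

```bash
# Hypothetical example: <RELEASE> and 30Gi are placeholders, and the key path assumes the kube-prometheus-stack subchart.
helm upgrade <RELEASE> sumologic/sumologic -n <DEP>-collections --reuse-values \
  --set kube-prometheus-stack.prometheus.prometheusSpec.resources.requests.memory=30Gi \
  --set kube-prometheus-stack.prometheus.prometheusSpec.resources.limits.memory=30Gi
```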

## Common Issues

### Missing metrics - cannot see cluster in Explore
@@ -405,6 +471,32 @@ The duplicated pod deletion command is there to make sure the pod is not stuck i
If you observe that the Prometheus Pod needs more and more resources (out of memory failures - OOM killed Prometheus) and you are not able to
increase them, you may need to horizontally scale Prometheus. :construction: Add link to Prometheus sharding doc here.
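
Until that link is added, a hedged sketch of what horizontal scaling could look like, assuming the bundled kube-prometheus-stack subchart exposes Prometheus sharding via `prometheusSpec.shards` (verify against your chart version and deployment layout):

```bash
# Hypothetical example: split scraping across two Prometheus shards instead of raising memory limits further.
helm upgrade <RELEASE> sumologic/sumologic -n <DEP>-collections --reuse-values \
  --set kube-prometheus-stack.prometheus.prometheusSpec.shards=2
```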

### Otelcol enqueue failures

Enqueue failures happen when otelcol can't write to its persistent queue. They signify data loss, as otelcol is unable to buffer the data locally
and is forced to drop it.
Comment on lines +474 to +477
Contributor:
Could you add a snippet with an example error message here?
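
Pending the example error snippet requested above, a hedged way to confirm enqueue failures from the collector's own telemetry (log wording and metric names can vary between otelcol versions):

```bash
# Look for queue-full / enqueue errors in the otelcol logs; the exact message wording varies by version.
kubectl -n <DEP>-collections logs <OTELCOL_POD> | grep -iE "sending_queue is full|enqueue"

# Check the collector's self-telemetry for enqueue failure counters.
# Assumes the default telemetry port 8888; if curl is not available in the image, port-forward 8888 and query locally.
kubectl -n <DEP>-collections exec <OTELCOL_POD> -- \
  curl -s http://localhost:8888/metrics | grep otelcol_exporter_enqueue_failed
```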


#### Queue full

Most of the time, this happens because the queue is full, which is almost always caused by otelcol not being able to send data to Sumo, and
should be accompanied by other alerts. Resolving the underlying issue will resolve the alert - in this case, this alert is informational.

The queue size can be checked on the Otelcol dashboard, available in [the collection dashboards folder](#collection-dashboards).
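
If the dashboards are not reachable, the queue length is also exposed by the collector's self-telemetry; a hedged check (default telemetry port 8888 assumed; metric names may vary by otelcol version):

```bash
# Current queue length (and, in newer otelcol versions, capacity) per exporter.
kubectl -n <DEP>-collections exec <OTELCOL_POD> -- \
  curl -s http://localhost:8888/metrics | grep -E "otelcol_exporter_queue_(size|capacity)"
```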

#### Persistent Volume failure

If there is no accompanying queue growth, this alert signifies that something is wrong with the underlying Persistent Volume. If the alert
drilldown only shows one timeseries, this is almost surely the case - sending failures affect all Pods equally. The short-term solution
involves deleting the corresponding PersistentVolumeClaim and restarting the Pod:

```bash
POD_NAME=DEP-collection-sumologic-otelcol-metrics-0
kubectl -n DEP-collections delete pvc "file-storage-${POD_NAME}" &
kubectl -n DEP-collections delete pod ${POD_NAME}
# delete pod again due to race condition (new pod is trying to use old pvc)
kubectl -n DEP-collections delete pod ${POD_NAME}
```
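
Afterwards, you can verify that a fresh PersistentVolumeClaim was created and bound for the restarted Pod:

```bash
kubectl -n DEP-collections get pvc "file-storage-${POD_NAME}"
kubectl -n DEP-collections get pod "${POD_NAME}"
```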

### Prometheus: server returned HTTP status 404 Not Found: 404 page not found

If you see the following error in Prometheus logs: