
scrape/scrape.go:1313 Scrape commit failed "error": "data refused due to high memory usage" #8217

Open
pilot513 opened this issue Aug 10, 2023 · 15 comments
Labels
bug Something isn't working


@pilot513

pilot513 commented Aug 10, 2023

Describe the bug
The OTel collector can't scrape pod metrics.

Steps to reproduce
Configure prometheus exporter with prometheus endpoint

What did you expect to see?
Metrics are scraped from the pods and forwarded to 'prometheusremotewrite'.

What did you see instead?
error scrape/scrape.go:1313 Scrape commit failed {"kind": "receiver", "name": "prometheus", "data_type": "metrics", "scrape_pool": "otel_kubernetes_podscraper", "target": "http://ip:port/metrics", "error": "data refused due to high memory usage"}

What version did you use?
Version: 0.82

What config did you use?
Config:

...
  prometheus:
    endpoint: 0.0.0.0:port
    metric_expiration: 120m
    resource_to_telemetry_conversion:
      enabled: true
    send_timestamps: true
  prometheusremotewrite:
    endpoint: http://hostname/prometheus/api/v1/write
extensions:
  health_check: {}
  memory_ballast: {}
processors:
  batch: {}
  memory_limiter:
    check_interval: 3s
    limit_mib: 6553
    spike_limit_mib: 2048
...

Environment
k8s Pod (from helm chart) with Limit 8G 2cores
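
For context, the error string comes from the collector's memory_limiter processor, which starts refusing data once memory use crosses a soft limit defined as limit_mib minus spike_limit_mib. Below is a minimal sketch of that arithmetic for the settings above. The function name is made up for illustration and is not part of the collector.

```python
def memory_limiter_thresholds(limit_mib: int, spike_limit_mib: int) -> tuple[int, int]:
    """Return (soft_limit_mib, hard_limit_mib) for memory_limiter settings.

    Hypothetical helper: it only mirrors the documented relationship
    soft limit = limit_mib - spike_limit_mib.
    """
    if spike_limit_mib >= limit_mib:
        raise ValueError("spike_limit_mib must be smaller than limit_mib")
    return limit_mib - spike_limit_mib, limit_mib


# Values taken from the config in this report.
soft, hard = memory_limiter_thresholds(limit_mib=6553, spike_limit_mib=2048)
print(f"soft limit: {soft} MiB, hard limit: {hard} MiB")
# prints "soft limit: 4505 MiB, hard limit: 6553 MiB"
```

With these numbers, scrapes can be refused at roughly 4.4 GiB of memory use, well below the 8G pod limit, so the error does not by itself mean the pod is close to being OOM-killed.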

@pilot513 pilot513 added the bug Something isn't working label Aug 10, 2023
@CarlosLanderas

CarlosLanderas commented Aug 10, 2023

I'm having the same problem with the prometheus receiver: scrapes fail from the moment the container starts. I have tried different collector versions and settings:

resources:
  requests:
    cpu: 500m
    memory: 4096Mi
  limits:
    memory: 4096Mi
processors:
  memory_limiter:
    check_interval: 1s
    limit_mib: 4000
    spike_limit_mib: 800
  batch:
    send_batch_size: 1000
    timeout: 1s
    send_batch_max_size: 1500
extensions:
  memory_ballast:
    size_mib: 2000

@pilot513
Author

pilot513 commented Aug 10, 2023

These are my settings:

resources:
  requests:
    cpu: 2
    memory: 4096Mi
  limits:
    memory: 8192Mi
processors:
  memory_limiter:
    check_interval: 3s
    limit_mib: 6553
    spike_limit_mib: 2048

I should try memory_ballast as well ...

@pilot513
Author

Does anyone have normal scrape results from pods under load?

@pilot513
Author

I added settings for the batch processor:

      batch:
        send_batch_size: 1000
        timeout: 1s
        send_batch_max_size: 1500

Let's see how it goes.

@pilot513
Author

At first it worked fine. And then the problems started again:
[four images attached]

@pilot513
Author

pilot513 commented Aug 20, 2023

After about five days of running, the problem came back.

@pilot513
Author

[two images attached]

@bhupeshpadiyar

Hey @pilot513, did you manage to fix this issue somehow?
I am facing the same issue when exporting histogram metrics. If you found a solution, please share it.

Thanks

@pilot513
Author

In my case, I noticed that the number of metrics was constantly growing. When I looked into it, I discovered that one application was producing a steadily increasing number of unique metrics, which should not happen. I pointed this out to the developers, and they fixed it: their code for exposing metrics was incorrect. As soon as I reinstalled the application, the problem went away.
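
One rough way to spot this kind of growth is to count the distinct series per metric name in two scrapes of the same /metrics endpoint and compare them. The sketch below uses only the standard library. The function names and the sample payloads are hypothetical, and the parser is deliberately simplified (it ignores exemplars and other OpenMetrics extras).

```python
from collections import Counter


def series_per_metric(exposition_text: str) -> Counter:
    """Count distinct series (metric name plus label set) per metric name."""
    seen = set()
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        if "}" in line:
            # Series identity is everything up to the closing brace of the labels.
            series = line[: line.rindex("}") + 1]
        else:
            # No labels: the first whitespace-separated token is the metric name.
            series = line.split()[0]
        seen.add(series)
    counts = Counter()
    for series in seen:
        counts[series.split("{", 1)[0]] += 1
    return counts


def growing_metrics(before: Counter, after: Counter) -> dict[str, int]:
    """Return metric names whose series count increased, with the increase."""
    return {name: after[name] - before[name]
            for name in after if after[name] > before[name]}


# Made-up sample payloads standing in for two scrapes of one target.
SCRAPE_1 = """
# TYPE http_requests_total counter
http_requests_total{path="/a"} 3
http_requests_total{path="/b"} 5
up 1
"""
SCRAPE_2 = SCRAPE_1 + (
    'http_requests_total{path="/user/123"} 1\n'
    'http_requests_total{path="/user/456"} 1\n'
)

print(growing_metrics(series_per_metric(SCRAPE_1), series_per_metric(SCRAPE_2)))
# prints {'http_requests_total': 2}
```

A metric whose series count keeps rising between scrapes, for example because a label carries a user ID or a full URL path, is a likely candidate for the kind of bug described above.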

@martinohansen

I'm seeing the same thing: memory usage keeps going up until the receiver starts failing. At that point I begin to see export failures, and the export queue grows as well. We're sending around 35K data points per second across 350 scrape targets.

Apr 10 06:34:35 otel-collector-7855596df8-lz6cv otel-collector error scrape/scrape.go:1351	Scrape commit failed	{"kind": "receiver", "name": "prometheus", "data_type": "metrics", "scrape_pool": "k8s", "target": "http://10.30.28.208:6666/metrics", "error": "data refused due to high memory usage"}
Apr 10 06:34:35 otel-collector-7855596df8-lz6cv otel-collector github.com/prometheus/prometheus/scrape.(*scrapeLoop).scrapeAndReport.func1
	github.com/prometheus/[email protected]/scrape/scrape.go:1351
Apr 10 06:34:35 otel-collector-7855596df8-lz6cv otel-collector github.com/prometheus/prometheus/scrape.(*scrapeLoop).scrapeAndReport
	github.com/prometheus/[email protected]/scrape/scrape.go:1429
Apr 10 06:34:35 otel-collector-7855596df8-lz6cv otel-collector github.com/prometheus/prometheus/scrape.(*scrapeLoop).run
	github.com/prometheus/[email protected]/scrape/scrape.go:1306 
[image attached]
receivers:
  prometheus:
    config:
      scrape_configs:
        - job_name: k8s
          tls_config:
            insecure_skip_verify: true
          scrape_interval: 10s
          ...
processors:
  batch:
  memory_limiter:
    check_interval: 1s
    limit_percentage: 80
    spike_limit_percentage: 20
  k8sattributes:
    extract:
      metadata:
        - k8s.container.name
        - k8s.namespace.name
        - k8s.pod.name
        - k8s.deployment.name
        - k8s.replicaset.name
        - k8s.node.name
        - k8s.daemonset.name
        - k8s.cronjob.name
        - k8s.job.name
        - k8s.statefulset.name
      labels:
        - tag_name: k8s.pod.label.app
          key: app
          from: pod
        - tag_name: k8s.pod.label.component
          key: component
          from: pod
        - tag_name: k8s.pod.label.zone
          key: zone
          from: pod
    pod_association:
      - sources:
        - from: resource_attribute
          name: k8s.pod.ip
      - sources:
        - from: resource_attribute
          name: k8s.pod.uid
      - sources:
        - from: connection
  transform/add-workload-label:
    metric_statements:
      - context: datapoint
        statements:
        - set(attributes["kube_workload_name"], resource.attributes["k8s.deployment.name"])
        - set(attributes["kube_workload_name"], resource.attributes["k8s.statefulset.name"])
        - set(attributes["kube_workload_type"], "deployment") where resource.attributes["k8s.deployment.name"] != nil
        - set(attributes["kube_workload_type"], "statefulset") where resource.attributes["k8s.statefulset.name"] != nil
exporters:
  prometheusremotewrite:
    endpoint: ${env:PROMETHEUSREMOTEWRITE_ENDPOINT}
    headers:
      Authorization: ${env:PROMETHEUSREMOTEWRITE_TOKEN}
    resource_to_telemetry_conversion:
      enabled: true
    max_batch_size_bytes: 2000000
service:
  pipelines:
    metrics:
      receivers: [prometheus]
      processors: [memory_limiter, batch, k8sattributes, transform/add-workload-label]
      exporters: [prometheusremotewrite]
containers:
- command:
  - /otelcol-contrib
  - --config=/conf/otel-collector-config.yaml
  image: otel/opentelemetry-collector-contrib:0.96.0
  imagePullPolicy: IfNotPresent
  name: otel-collector
  ports:
  - containerPort: 55679
    protocol: TCP
  - containerPort: 4317
    protocol: TCP
  - containerPort: 4318
    protocol: TCP
  - containerPort: 14250
    protocol: TCP
  - containerPort: 14268
    protocol: TCP
  - containerPort: 9411
    protocol: TCP
  - containerPort: 8888
    protocol: TCP
  resources:
    limits:
      cpu: "4"
      memory: 16Gi
    requests:
      cpu: "4"
      memory: 16Gi
  volumeMounts:
  - mountPath: /conf
    name: otel-collector-config-vol
  env:
  - name: "GOMEMLIMIT"
    value: "12GiB" # 75% of the 16Gi memory request
  envFrom:
  - secretRef:
      name: otel-collector
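
To see where the refusals would start with the percentage-based settings above, the same soft-limit arithmetic can be applied to the 16Gi container. This is a sketch under two assumptions: that the limiter resolves the percentages against the 16Gi container memory, and that integer MiB rounding is close enough for illustration. The helper name is hypothetical.

```python
def percentage_thresholds_mib(total_mib: int,
                              limit_percentage: int,
                              spike_limit_percentage: int) -> tuple[int, int]:
    """Return (soft_limit_mib, hard_limit_mib) for percentage-based settings.

    Hypothetical helper; assumes percentages are taken of total_mib.
    """
    hard = total_mib * limit_percentage // 100
    spike = total_mib * spike_limit_percentage // 100
    return hard - spike, hard


TOTAL_MIB = 16 * 1024       # 16Gi container memory from the pod spec
GOMEMLIMIT_MIB = 12 * 1024  # GOMEMLIMIT=12GiB from the env section

soft, hard = percentage_thresholds_mib(TOTAL_MIB, 80, 20)
print(f"soft={soft} MiB hard={hard} MiB GOMEMLIMIT={GOMEMLIMIT_MIB} MiB")
# prints "soft=9831 MiB hard=13107 MiB GOMEMLIMIT=12288 MiB"
```

Under these assumptions the limiter begins refusing scrapes at about 60% of the container memory (roughly 9.6 GiB), while GOMEMLIMIT (12 GiB, which is 75% of 16 GiB) sits between the soft and hard limits.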

@martinohansen

martinohansen commented Apr 10, 2024

I just tested on v0.97 and saw the same failure pattern. I also noticed this error message:

Apr 10 07:45:24 otel-collector-7cbf899d45-cswl5 otel-collector error exporterhelper/queue_sender.go:101	Exporting failed. Dropping data.	{"kind": "exporter", "data_type": "metrics", "name": "prometheusremotewrite", "error": "Permanent error: Permanent error: context deadline exceeded; Permanent error: Permanent error: context deadline exceeded; Permanent error: Permanent error: context deadline exceeded", "errorCauses": [{"error": "Permanent error: Permanent error: context deadline exceeded"}, {"error": "Permanent error: Permanent error: context deadline exceeded"}, {"error": "Permanent error: Permanent error: context deadline exceeded"}], "dropped_items": 108155}
Apr 10 07:45:24 otel-collector-7cbf899d45-cswl5 otel-collector go.opentelemetry.io/collector/exporter/exporterhelper.newQueueSender.func1
	go.opentelemetry.io/collector/[email protected]/exporterhelper/queue_sender.go:101
Apr 10 07:45:24 otel-collector-7cbf899d45-cswl5 otel-collector go.opentelemetry.io/collector/exporter/internal/queue.(*boundedMemoryQueue[...]).Consume
	go.opentelemetry.io/collector/[email protected]/internal/queue/bounded_memory_queue.go:57
Apr 10 07:45:24 otel-collector-7cbf899d45-cswl5 otel-collector go.opentelemetry.io/collector/exporter/internal/queue.(*Consumers[...]).Start.func1
	go.opentelemetry.io/collector/[email protected]/internal/queue/consumers.go:43
Apr 10 07:45:34 otel-collector-7cbf899d45-cswl5 otel-collector error exporterhelper/queue_sender.go:101	Exporting failed. Dropping data.	{"kind": "exporter", "data_type": "metrics", "name": "prometheusremotewrite", "error": "Permanent error: Permanent error: context deadline exceeded", "dropped_items": 108341}
Apr 10 07:45:34 otel-collector-7cbf899d45-cswl5 otel-collector go.opentelemetry.io/collector/exporter/exporterhelper.newQueueSender.func1
	go.opentelemetry.io/collector/[email protected]/exporterhelper/queue_sender.go:101
Apr 10 07:45:34 otel-collector-7cbf899d45-cswl5 otel-collector go.opentelemetry.io/collector/exporter/internal/queue.(*boundedMemoryQueue[...]).Consume
	go.opentelemetry.io/collector/[email protected]/internal/queue/bounded_memory_queue.go:57
Apr 10 07:45:34 otel-collector-7cbf899d45-cswl5 otel-collector go.opentelemetry.io/collector/exporter/internal/queue.(*Consumers[...]).Start.func1
	go.opentelemetry.io/collector/[email protected]/internal/queue/consumers.go:43 

@chenlujjj

I hit a similar error on v0.97.

@garg031

garg031 commented Jul 22, 2024

I am facing a similar issue.

Scrapes continuously fail with the error below:

github.com/prometheus/[email protected]/scrape/scrape.go:1306
github.com/prometheus/prometheus/scrape.(*scrapeLoop).run
github.com/prometheus/[email protected]/scrape/scrape.go:1429
github.com/prometheus/prometheus/scrape.(*scrapeLoop).scrapeAndReport
github.com/prometheus/[email protected]/scrape/scrape.go:1351
github.com/prometheus/prometheus/scrape.(*scrapeLoop).scrapeAndReport.func1
error scrape/scrape.go:1351 Scrape commit failed {"kind": "receiver", "name": "prometheus", "data_type": "metrics", "scrape_pool": "itomperftesting-otel-collector-job", "target": "http://0.0.0.0:8888/metrics", "error": "data refused due to high memory usage"}

This also leads to high memory usage in the otel-collector.

Is there any workaround for this?

@benzch

benzch commented Oct 29, 2024

Same issue on version 0.100. Is there any workaround?

@jbk-descript

My org is also experiencing this issue on v0.100.
