Gathered hostmetrics process shown in console but not as metric in prometheus #36496

Closed
securom1987 opened this issue Nov 22, 2024 · 16 comments

Component(s)

receiver/hostmetrics

What happened?

Description

I am using OTel Collector v0.114 with the hostmetrics receiver and its process scraper on Ubuntu Linux.
I want to scrape per-process information, and the corresponding metrics do appear in the debug/console output (see the Log output section below), for example for the loki process.

The collector runs with the configuration shown in the "OpenTelemetry Collector configuration" section below.

The problem:

The metrics that are written to the console do not show up in Prometheus.

Collector version

v0.114.0

Environment information

Environment

OS: (e.g., "Ubuntu 24.04")
Compiler (if manually compiled): (e.g., "go 14.2")

OpenTelemetry Collector configuration

extensions:
  health_check:
    endpoint: 0.0.0.0:1133

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318
  hostmetrics:
    collection_interval: 10s
    scrapers:
      # CPU utilization metrics
      #cpu:
      # Disk I/O metrics
      # disk:
      # File System utilization metrics
      #filesystem:
      # CPU load metrics
      #load:
      # Memory utilization metrics
      #memory:
      # Network interface I/O metrics & TCP connection metrics
      #network:
      # Paging/Swap space utilization and I/O metrics
      #paging:
      # Process count metrics
      process:
      # Per process CPU, Memory, and Disk I/O metrics
      processes:

processors:
  batch:
  resource:
    attributes:
      - action: insert
        key: service.name           ## sets job=HOST1 on the metric in Grafana
        value: NUC-CLOUD

exporters:
  debug:
    verbosity: detailed
  prometheus:
    endpoint: 0.0.0.0:8889

service:
  extensions: [health_check]
  pipelines:
    metrics:
      receivers: [otlp, hostmetrics]
      processors: [resource, batch]
      exporters: [debug, prometheus]

Log output

Nov 22 09:37:35 nuc-cloud otelcol-contrib[1156080]:      -> process.owner: Str(root)
Nov 22 09:37:35 nuc-cloud otelcol-contrib[1156080]: InstrumentationScope github.com/open-telemetry/opentelemetry-collector-contrib/receiver/hostmetricsreceiver/internal/scraper/processscraper 0.114.0
Nov 22 09:37:35 nuc-cloud otelcol-contrib[1156080]:      -> Name: process.cpu.time
Nov 22 09:37:35 nuc-cloud otelcol-contrib[1156080]:      -> Name: process.memory.usage
Nov 22 09:37:35 nuc-cloud otelcol-contrib[1156080]:      -> Name: process.memory.virtual
Nov 22 09:37:35 nuc-cloud otelcol-contrib[1156080]:      -> process.pid: Int(616072)
Nov 22 09:37:35 nuc-cloud otelcol-contrib[1156080]:      -> process.parent_pid: Int(1)
Nov 22 09:37:35 nuc-cloud otelcol-contrib[1156080]:      -> process.executable.name: Str(loki)
Nov 22 09:37:35 nuc-cloud otelcol-contrib[1156080]:      -> process.executable.path: Str()
Nov 22 09:37:35 nuc-cloud otelcol-contrib[1156080]:      -> process.command: Str(/usr/bin/loki)
Nov 22 09:37:35 nuc-cloud otelcol-contrib[1156080]:      -> process.command_line: Str(/usr/bin/loki -config.file /etc/loki/config.yml)
Nov 22 09:37:35 nuc-cloud otelcol-contrib[1156080]:      -> process.owner: Str(loki)

Additional context

Metrics that appear in the console/debug log are not shown in Prometheus.
For example, the loki process in the log output above.

securom1987 added the bug and needs triage labels on Nov 22, 2024

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

@securom1987 (Author) commented Nov 22, 2024

/help-wanted receiver/hostmetrics

@VihasMakwana (Contributor)

@securom1987 do you see any errors logged from the prometheus exporter?

Can you:

  • Disable the debug exporter (as you've confirmed that metrics get logged)
  • Use service::telemetry::logs::level: info and see if you get any hints? (See the sketch below for where this setting goes.)
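
For reference, a minimal sketch of where that logging level lives in the collector config, assuming the standard service::telemetry options and the pipeline from the issue:

service:
  telemetry:
    logs:
      level: info          # raise to "debug" for even more detail
  extensions: [health_check]
  pipelines:
    metrics:
      receivers: [otlp, hostmetrics]
      processors: [resource, batch]
      exporters: [prometheus]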

@tdg5 commented Nov 22, 2024

@securom1987, the format of metric names in OTLP differs from the format of metric names in Prometheus.

I have no experience with the prometheus exporter, but the prometheusremotewrite exporter has a config option that handles translating the OTLP metric names to Prometheus-friendly names, so you might try the prometheusremotewrite exporter instead. Alternatively, you could look for a similar option on the prometheus exporter.

It's clunkier, but this workaround would probably also work if you don't mind explicitly mapping each metric tag.
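
A minimal sketch of what trying that exporter could look like, assuming a local Prometheus started with its remote-write receiver enabled (e.g. via --web.enable-remote-write-receiver); the exporter would then be referenced in the metrics pipeline's exporters list:

exporters:
  prometheusremotewrite:
    # assumed local Prometheus remote-write endpoint
    endpoint: "http://localhost:9090/api/v1/write"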

@securom1987 (Author)

> @securom1987 do you see any errors logged from the prometheus exporter?
>
> Can you:
>
> * Disable the `debug` exporter (as you've confirmed that metrics get logged)
> * Use `service::telemetry::logs::level: info` and see if you get any hints?

Hi, thank you for your reply:

Here is the output written to syslog:
Nov 25 10:24:57 nuc-cloud otelcol-contrib[1376848]: 2024-11-25T10:24:57.173+0100#011error#011scraperhelper/scrapercontroller.go:206#011
Error scraping metrics#011{"kind": "receiver", "name": "hostmetrics", "data_type": "metrics", "error":
"error reading process executable for pid 1: readlink /proc/1/exe: permission denied;
error reading process executable for pid 2: readlink /proc/2/exe: permission denied;
error reading process executable for pid 3: readlink /proc/3/exe: permission denied;
error reading process executable for pid 4: readlink /proc/4/exe: permission denied;
error reading process executable for pid 5: readlink /proc/5/exe: permission denied;
error reading process executable for pid 6: readlink /proc/6/exe: permission denied;
error reading process executable for pid 8: readlink /proc/8/exe: permission denied;
....
error reading process executable for pid 231: readlink /proc/231/exe: permission denied;
Nov 25 10:24:57 nuc-cloud otelcol-contrib[1376848]: go.opentelemetry.io/collector/receiver/scraperhelper.(*controller).scrapeMetricsAndReport
Nov 25 10:24:57 nuc-cloud otelcol-contrib[1376848]: #011go.opentelemetry.io/collector/[email protected]/scraperhelper/scrapercontroller.go:206
Nov 25 10:24:57 nuc-cloud otelcol-contrib[1376848]: go.opentelemetry.io/collector/receiver/scraperhelper.(*controller).startScraping.func1
Nov 25 10:24:57 nuc-cloud otelcol-contrib[1376848]: #011go.opentelemetry.io/collector/[email protected]/scraperhelper/scrapercontroller.go:183

In my opinion, only the processes listed in those /proc permission errors cannot be scraped.
With the debug exporter switched back on, the other user processes are scraped, as shown in the console output.
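
Those readlink failures are permission errors: the collector's user cannot read /proc/<pid>/exe for processes owned by other users. If running the collector as root is not an option, the process scraper documents mute options for these errors; a sketch, assuming the option names from the hostmetrics receiver README:

receivers:
  hostmetrics:
    collection_interval: 10s
    scrapers:
      process:
        mute_process_exe_error: true      # suppress the "error reading process executable" errors
        # mute_process_all_errors: true   # or suppress all per-process scrape errors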


Pinging code owners for exporter/prometheus: @Aneurysm9 @dashpole @ArthurSens. See Adding Labels via Comments if you do not have permissions to add labels yourself. For example, comment '/label priority:p2 -needs-triaged' to set the priority and remove the needs-triaged label.

@braydonk (Contributor)

Can you send an example of the Prometheus scrape? Is it empty?

@securom1987 (Author) commented Nov 25, 2024

> Can you send an example of the Prometheus scrape? Is it empty?

Do you mean its config file?

# my global config
global:
  scrape_interval: 15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).

# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
          # - alertmanager:9093

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  # - "first_rules.yml"
  # - "second_rules.yml"

# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: "prometheus"

    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.

    static_configs:
      - targets: ["localhost:9090"]

  - job_name: "otel-collector-Gateway"
    scrape_interval: 5s
    honor_labels: true
    static_configs:
      - targets: ["localhost:8889"]

VihasMakwana removed the needs triage label on Nov 25, 2024
@braydonk (Contributor)

I did not mean the scrape config, though that's good to know anyway. What I meant was: if you curl the Prometheus endpoint that the collector started, what is the output?

@securom1987 (Author) commented Nov 25, 2024

After curling the Prometheus remote-write endpoint with "curl -X POST http://nuc-cloud:9090/api/v1/write", I get "snappy: corrupt input" as the answer (the remote-write endpoint expects snappy-compressed protobuf, so an empty POST is rejected), which at least shows the endpoint is reachable. I think I have to translate the OTLP metrics into Prometheus-friendly metrics, as tdg5 already mentioned, but I have no clue how to do that.

@braydonk (Contributor)

I am referring to the otel-collector-Gateway target in your scrape config. In the OpenTelemetry Collector configuration from your initial issue comment, your Prometheus exporter listens on port 8889. I'm not sure how your networking is set up, but if I were going to get the metrics from this exporter locally, I would do this:

curl http://localhost:8889/metrics

The localhost may need to be a different host name depending on your setup. But what I primarily want to know is how you know the metrics aren't showing up, which is why I'd like to see the raw output of that Prometheus exporter.

@securom1987 (Author)

The OTel Collector and Prometheus are running on the same host.
This is the result of curling my collector's metrics endpoint from my initial config:

curl http://nuc-cloud:8889/metrics
# HELP process_cpu_time_seconds_total Total CPU seconds broken down by different states.
# TYPE process_cpu_time_seconds_total counter
process_cpu_time_seconds_total{job="NUC-CLOUD",state="system"} 0
process_cpu_time_seconds_total{job="NUC-CLOUD",state="user"} 0
process_cpu_time_seconds_total{job="NUC-CLOUD",state="wait"} 0
# HELP process_disk_io_bytes_total Disk bytes transferred.
# TYPE process_disk_io_bytes_total counter
process_disk_io_bytes_total{direction="read",job="NUC-CLOUD"} 1.32905e+06
process_disk_io_bytes_total{direction="write",job="NUC-CLOUD"} 3807
# HELP process_memory_usage_bytes The amount of physical memory in use.
# TYPE process_memory_usage_bytes gauge
process_memory_usage_bytes{job="NUC-CLOUD"} 1.077248e+06
# HELP process_memory_virtual_bytes Virtual memory size.
# TYPE process_memory_virtual_bytes gauge
process_memory_virtual_bytes{job="NUC-CLOUD"} 5.873664e+06
# HELP system_processes_count Total number of processes in each state.
# TYPE system_processes_count gauge
system_processes_count{job="NUC-CLOUD",status="blocked"} 0
system_processes_count{job="NUC-CLOUD",status="idle"} 76
system_processes_count{job="NUC-CLOUD",status="running"} 1
system_processes_count{job="NUC-CLOUD",status="sleeping"} 137
# HELP system_processes_created_total Total number of created processes.
# TYPE system_processes_created_total counter
system_processes_created_total{job="NUC-CLOUD"} 1.471905e+06

@braydonk (Contributor)

I understand the problem now.

The process scraper structures the process metrics as a collection of Resources, with the attributes that identify the process going in the resource, and the metrics under that resource being the actual metrics for that process. The prometheus exporter handles resources in a very specific way by default that isn't compatible with the way most of the metrics produced by the hostmetrics receiver are structured.

The prometheus exporter has a config option called resource_to_telemetry_conversion that will flatten all the resource attributes into each metric itself. This will have the effect you're after.

Try changing your prometheus exporter config to the following:

prometheus:
  resource_to_telemetry_conversion:
    enabled: true
  endpoint: 0.0.0.0:8889

For reference, I used this config locally to verify:

receivers:
  hostmetrics:
    scrapers:
      process:

exporters:
  prometheus:
    resource_to_telemetry_conversion:
      enabled: true
    endpoint: "localhost:9090"

service:
  pipelines:
    metrics:
      receivers: [hostmetrics]
      exporters: [prometheus]

@ArthurSens (Member)

@braydonk's suggestion should do the trick! Another approach, if you prefer, is to send OTLP directly to Prometheus: https://prometheus.io/docs/guides/opentelemetry/

If you go in that direction, you'll want to take a look at promote_resource_attributes, which turns specific resource attributes into Prometheus labels.
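
A rough sketch of that route, with illustrative attribute names: promote_resource_attributes lives under the otlp block of prometheus.yml, the OTLP receiver has to be enabled (e.g. --enable-feature=otlp-write-receiver on Prometheus 2.x or --web.enable-otlp-receiver on 3.x), and the collector would then export via otlphttp instead of the prometheus exporter:

# prometheus.yml (assumed attribute names)
otlp:
  promote_resource_attributes:
    - service.name
    - process.pid
    - process.executable.name

# collector config
exporters:
  otlphttp:
    endpoint: "http://localhost:9090/api/v1/otlp"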

@securom1987 (Author)


I had already tried this with the prometheusremotewrite exporter before, and it worked very well; the addition of

resource_to_telemetry_conversion:
  enabled: true

did the trick!
It also works in combination with the prometheus exporter, so I will stick to the prometheus exporter only and drop prometheusremotewrite.

The following configuration works nearly the same:

exporters:
  debug:                        
    verbosity: detailed
  prometheus:                   
    resource_to_telemetry_conversion:
      enabled: true
    endpoint: 0.0.0.0:8889
  #prometheusremotewrite:
    #endpoint: "http://nuc-cloud:9090/api/v1/write"
    #resource_to_telemetry_conversion:
      #enabled: true 

Thank you for your help!

@securom1987 (Author)

Works as described in the last comment.
