Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[exporter/prometheusremotewriteexporter] Prometheus remote write exporter does not retry on 5xx #20304

Closed
rapphil opened this issue Mar 24, 2023 · 8 comments
Labels

Comments

@rapphil
Copy link
Contributor

rapphil commented Mar 24, 2023

Component(s)

exporter/prometheusremotewrite

What happened?

Description

The prometheusremotewriteexporter is failing the conformance tests of prometheus.

--- FAIL: TestRemoteWrite (24.70s)
    --- FAIL: TestRemoteWrite/otelcollector (0.01s)
        --- PASS: TestRemoteWrite/otelcollector/Up (10.02s)
        --- PASS: TestRemoteWrite/otelcollector/Counter (10.02s)
        --- PASS: TestRemoteWrite/otelcollector/JobLabel (10.02s)
        --- PASS: TestRemoteWrite/otelcollector/Histogram (10.02s)
        --- PASS: TestRemoteWrite/otelcollector/Staleness (10.02s)
        --- PASS: TestRemoteWrite/otelcollector/Invalid (10.03s)
        --- PASS: TestRemoteWrite/otelcollector/EmptyLabels (10.03s)
        --- PASS: TestRemoteWrite/otelcollector/Headers (10.03s)
        --- PASS: TestRemoteWrite/otelcollector/HonorLabels (10.05s)
        --- PASS: TestRemoteWrite/otelcollector/NameLabel (10.05s)
        --- PASS: TestRemoteWrite/otelcollector/Timestamp (10.05s)
        --- PASS: TestRemoteWrite/otelcollector/Summary (10.06s)
        --- PASS: TestRemoteWrite/otelcollector/SortedLabels (10.01s)
        --- FAIL: TestRemoteWrite/otelcollector/Retries500 (10.01s)
        --- PASS: TestRemoteWrite/otelcollector/RepeatedLabels (10.02s)
        --- PASS: TestRemoteWrite/otelcollector/Retries400 (10.01s)
        --- PASS: TestRemoteWrite/otelcollector/InstanceLabel (10.02s)
        --- PASS: TestRemoteWrite/otelcollector/Gauge (10.02s)
        --- PASS: TestRemoteWrite/otelcollector/Ordering (14.66s)
FAIL
FAIL    github.com/prometheus/compliance/remote_write   25.036s
FAIL

Steps to Reproduce

[email protected]:prometheus/compliance.git
cd compliance/remote_write_sender
go test --tags=compliance -run "TestRemoteWrite/otelcollector/.+" -v ./

Expected Result

The exporter should retry on 5xx errors.

Actual Result

The exporter does not retry 5xx. It is treating it in the same way as 4xx.

Collector version

this regression was inserted starting with v0.45.0

Environment information

RHEL based.

OpenTelemetry Collector configuration

NA

Log output

NA

Additional context

NA

@rapphil rapphil added bug Something isn't working needs triage New item requiring triage labels Mar 24, 2023
@github-actions
Copy link
Contributor

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

@Aneurysm9 Aneurysm9 added priority:p2 Medium and removed needs triage New item requiring triage labels Mar 24, 2023
@rapphil
Copy link
Contributor Author

rapphil commented Apr 13, 2023

I would like to propose that we add a retry in the moment a batch of metrics is exported: https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/exporter/prometheusremotewriteexporter/exporter.go#L250

This would increase the risk of out of order samples, because the component can operate with multiple workers that export the batches in parallel: https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/exporter/prometheusremotewriteexporter/exporter.go#L202

To eliminate the risk of out of order samples, I also propose that we change how we submit data in parallel. Instead of just submitting the batches in parallel, we should try to make sure that a single time series is always submitted to prometheus sequentially. The idea is that you can only submit samples for different time series in parallel. Samples for the same time series should always be written sequentially.

To achieve this I propose that:

  • We create one queue per worker. Each worker will consume from its own queue.
  • We partition the data that will be submitted to each of the queues so that a single time series will only be submitted to a specific queue. This can be a hash(labelset) modulo NUMBER_QUEUES.
  • each worker can retry independently on a 5xx status code.

@github-actions
Copy link
Contributor

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

@github-actions
Copy link
Contributor

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

@github-actions github-actions bot added the Stale label Aug 14, 2023
@rapphil
Copy link
Contributor Author

rapphil commented Aug 14, 2023

There is still work being done for this issue in #23842

@rapphil
Copy link
Contributor Author

rapphil commented Aug 14, 2023

/label -stale

@github-actions github-actions bot removed the Stale label Aug 15, 2023
dmitryax pushed a commit that referenced this issue Aug 29, 2023
…ing `cenkalti/backoff` client (#23842)

**Description:** Fix the Retry on 5xx status code using
`cenkalti/backoff` package and added unit test.

Thanks to @rapphil for proposing the
[solution](#20304 (comment)).

**Link to tracking Issue:** #20304 

**Testing:** Added unit tests for retryOn5xx and noRetryOn4xx

Prometheus compliance test results -

```
--- PASS: TestRemoteWrite (20.10s)
    --- PASS: TestRemoteWrite/otelcollector (0.01s)
        --- PASS: TestRemoteWrite/otelcollector/NameLabel (10.05s)
        --- PASS: TestRemoteWrite/otelcollector/Invalid (10.06s)
        --- PASS: TestRemoteWrite/otelcollector/RepeatedLabels (10.06s)
        --- PASS: TestRemoteWrite/otelcollector/Retries400 (10.06s)
        --- PASS: TestRemoteWrite/otelcollector/SortedLabels (10.06s)
        --- PASS: TestRemoteWrite/otelcollector/Up (10.06s)
        --- PASS: TestRemoteWrite/otelcollector/JobLabel (10.06s)
        --- PASS: TestRemoteWrite/otelcollector/HonorLabels (10.06s)
        --- PASS: TestRemoteWrite/otelcollector/EmptyLabels (10.06s)
        --- PASS: TestRemoteWrite/otelcollector/Counter (10.06s)
        --- PASS: TestRemoteWrite/otelcollector/InstanceLabel (10.06s)
        --- PASS: TestRemoteWrite/otelcollector/Ordering (17.10s)
        --- PASS: TestRemoteWrite/otelcollector/Histogram (10.02s)
        --- PASS: TestRemoteWrite/otelcollector/Gauge (10.02s)
        --- PASS: TestRemoteWrite/otelcollector/Headers (10.02s)
        --- PASS: TestRemoteWrite/otelcollector/Retries500 (10.03s)
        --- PASS: TestRemoteWrite/otelcollector/Staleness (10.03s)
        --- PASS: TestRemoteWrite/otelcollector/Summary (10.03s)
        --- PASS: TestRemoteWrite/otelcollector/Timestamp (10.03s)
PASS
ok      github.com/prometheus/compliance/remote_write   20.368s
```

---------

Co-authored-by: Anthony Mirabella <[email protected]>
@vasireddy99
Copy link
Contributor

The PR is merged - #23842 and this issue can be closed .

@rapphil
Copy link
Contributor Author

rapphil commented Sep 15, 2023

Closing as the PR was merged.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

3 participants