
Add alternative prometheus remote write implementation to prometheusremotewriteexporter #37284

Open
mattdurham opened this issue Jan 17, 2025 · 8 comments

Comments

@mattdurham

Component(s)

exporter/prometheusremotewrite

Is your feature request related to a problem? Please describe.

For Grafana Alloy I have been working on an alternative approach to the prometheus remote write WAL.

Describe the solution you'd like

Would it be reasonable to add this new remote write implementation to prometheusremotewriteexporter? The config options would be significantly different from the current WAL config, so only one or the other could be declared. It's currently experimental in Alloy, but we have been using it internally, alongside several other users.
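As a rough illustration, the two backends could be made mutually exclusive with something like the following. This is a hypothetical sketch only; none of these type or field names are the exporter's actual config.

```go
// Hypothetical sketch; not the exporter's real configuration types.
package sketch

import "errors"

// LegacyWALConfig stands in for the existing WAL settings.
type LegacyWALConfig struct {
	Directory string `mapstructure:"directory"`
}

// WALQueueConfig stands in for whatever settings the grafana/walqueue backend would need.
type WALQueueConfig struct {
	Directory   string `mapstructure:"directory"`
	Parallelism int    `mapstructure:"parallelism"`
}

// Config exposes both blocks as optional pointers so that at most one is set.
type Config struct {
	WAL      *LegacyWALConfig `mapstructure:"wal"`
	WALQueue *WALQueueConfig  `mapstructure:"wal_queue"`
}

// Validate rejects configurations that declare both backends at once.
func (c *Config) Validate() error {
	if c.WAL != nil && c.WALQueue != nil {
		return errors.New("wal and wal_queue are mutually exclusive; configure only one")
	}
	return nil
}
```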

We haven't had any issues scaling it to a few million series writes per second. We are also looking into support for the new remote write format, which would only require updating the library, along with general support.

The underlying code can be found at https://github.com/grafana/walqueue

Describe alternatives you've considered

No response

Additional context

No response

@mattdurham added the enhancement (New feature or request) and needs triage (New item requiring triage) labels on Jan 17, 2025

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

@dashpole
Contributor

A few questions/points:

  • What are the benefits of the new WAL compared to the existing WAL implementation? We should spell those out clearly.
  • We should not maintain multiple WAL implementations long-term. So if we plan to adopt a new WAL, we should plan for it to replace the existing WAL implementation.
  • The above means we need a migration plan. If someone updates their collector version without changing configuration, we should not drop telemetry stored in the "legacy" WAL.
  • I think the WAL was initially implemented because we needed in-order points and the exporter helper's persistent queue doesn't preserve order.
    • Given Prometheus has started supporting OOO samples, do we still need a WAL? (I think the answer is yes, but we should think this through.)
    • Should the WAL be added to exporterhelper?

The effort to add support for PRW 2.0 is being worked on by @jmichalek132. I would sync up with him on that. If we can keep that effort decoupled from the WAL implementation, that would be preferable.

@dashpole removed the needs triage (New item requiring triage) label on Jan 17, 2025
@mattdurham
Author

I was trying to benchmark the existing WAL implementation against the Grafana walqueue. I ran into several issues getting it to work properly, and I'm trying to figure out whether it's my tests or underlying code issues. If I turn off the WAL, the benchmark runs fine. My quick-and-dirty code is here. It's not optimized for OTel ingestion but is likely good enough for a rough comparison.

I saw some issues where it seems to either continually send the same data when no new data is written and/or reset the index. I will spend some time to see if I can get an apples-to-apples comparison.

I will write up a longer-form comparison as I dig into the code, covering some of the differences and pros/cons.

Thank you for the response!

@mattdurham
Author

This might be worthy of a separate thread, but I noticed that in exportThenFrontTruncateWAL, after the request is exported, retrieveWALIndices closes and reopens the WAL and then sets the indices. This appears to cause issues. I added a print statement and can see it roll the rIndex back to a previous number, which feels wrong: rIndex was 11 and is now 10.
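To illustrate the symptom with a contrived sketch (this is not the exporter's actual code, just the pattern I think I'm seeing):

```go
// Contrived illustration: if the in-memory read index is re-seeded from a
// persisted value that lags behind it, the index moves backwards on reopen
// and already-exported entries get read again.
package sketch

import "fmt"

type walIndices struct {
	persistedRead uint64 // last read index that was flushed to disk
	rIndex        uint64 // in-memory read position
}

// reopen simulates re-reading the indices after a close/open cycle.
func (w *walIndices) reopen() {
	w.rIndex = w.persistedRead
}

func Demo() {
	w := &walIndices{persistedRead: 10, rIndex: 11}
	fmt.Println("rIndex before reopen:", w.rIndex) // 11
	w.reopen()
	fmt.Println("rIndex after reopen:", w.rIndex) // 10, rolled back
}
```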

@mattdurham
Author

mattdurham commented Jan 20, 2025

Pros and Cons of the Grafana WAL Queue

Pros

  • Alloy uses it and will transition to it as the default Prometheus remote write. This means it will be kept up to date, heavily used, and well tested, and it is used within Grafana internally, meaning bugs, issues, and CVEs will be handled quickly.
  • It is performant and handles large scale; for instance, elastic scaling based on metrics, incoming flow, and outgoing flow is going to be added soon.
  • Things like OOO samples are handled natively, along with native histograms, though I am unsure yet whether that maps.
  • It does not live inside Alloy itself but in a separate repository, to keep dependencies minimal.

Cons

  • It's another dependency.
  • The dependency requires manual updating.
  • It's more complex; hopefully that is hidden behind a simple API layer, but underneath it is harder to follow.
  • Config mapping is always a pain.

Overview

The WAL Queue is built to be a more generic Prometheus metric replacement. The previous Prometheus WAL has tight coupling between the scrape, WAL, and remote write that makes certain use cases hard. It also has memory-growth issues at high cardinality. The WAL Queue is meant to avoid those problems in a way not too different in concept from the currently employed queue. It uses a push-based rather than polling-based system to reduce CPU, but fundamentally it writes to a disk queue and then deserializes that disk queue.
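As a simplified sketch of the push-based idea (not the walqueue API, just the concept of signaling the reader instead of polling):

```go
// Simplified sketch of a push-based disk queue: the writer appends and nudges
// a channel, so the reader wakes only when there is new data instead of
// waking on a timer.
package sketch

import "sync"

type diskQueue struct {
	mu      sync.Mutex
	entries [][]byte      // stands in for on-disk segments
	notify  chan struct{} // signaled on every append
}

func newDiskQueue() *diskQueue {
	return &diskQueue{notify: make(chan struct{}, 1)}
}

// Push persists an entry and signals the reader without blocking.
func (q *diskQueue) Push(b []byte) {
	q.mu.Lock()
	q.entries = append(q.entries, b)
	q.mu.Unlock()
	select {
	case q.notify <- struct{}{}:
	default: // a wakeup is already pending
	}
}

// Pop blocks until an entry is available, then returns the oldest one.
func (q *diskQueue) Pop() []byte {
	for {
		q.mu.Lock()
		if len(q.entries) > 0 {
			b := q.entries[0]
			q.entries = q.entries[1:]
			q.mu.Unlock()
			return b
		}
		q.mu.Unlock()
		<-q.notify // sleep until the writer signals; no polling loop
	}
}
```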

There is support for multiple underlying data formats. The disk writes are wrapped in an envelope format that adds a metadata dictionary for things like file format, compression, number of records, and other fields. This means the on-disk file format or compression config can be changed and it will still work with whatever is already written on disk, since the metadata determines how to deserialize each record.
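For illustration only, the envelope idea looks roughly like this (the actual on-disk format is different; this just shows dispatching the decoder on the metadata):

```go
// Hypothetical envelope sketch: each record carries self-describing metadata
// so the reader can decode it even if the configured format or compression
// has changed since it was written.
package sketch

import (
	"encoding/json"
	"fmt"
)

type envelope struct {
	Meta map[string]string `json:"meta"` // e.g. "format", "compression", "record_count"
	Data []byte            `json:"data"`
}

func encode(format, compression string, records int, payload []byte) ([]byte, error) {
	return json.Marshal(envelope{
		Meta: map[string]string{
			"format":       format,
			"compression":  compression,
			"record_count": fmt.Sprint(records),
		},
		Data: payload,
	})
}

// decode reads the metadata first and picks the decoder from it, so records
// written under an older config remain readable.
func decode(b []byte) ([]byte, error) {
	var e envelope
	if err := json.Unmarshal(b, &e); err != nil {
		return nil, err
	}
	switch e.Meta["compression"] {
	case "", "none":
		return e.Data, nil
	default:
		return nil, fmt.Errorf("unsupported compression %q", e.Meta["compression"])
	}
}
```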

It is written to minimize allocations, memory, and CPU pressure, and fully supports:

  • TTL
  • Replayability
  • Several Auth Schemes
  • Autoscaling (Coming soon)

In addition, it handles writes very similarly to the traditional Prometheus remote write in terms of 5xx responses, 429s, the Retry-After header, round-robin dialing, and so on.
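That retry behavior has roughly the following shape (a simplified sketch, not the actual implementation):

```go
// Simplified sketch of the retry policy described above: retry on 429 and 5xx,
// honor the Retry-After header (seconds form) when present, otherwise back off
// exponentially; other 4xx responses are treated as permanent failures.
package sketch

import (
	"bytes"
	"fmt"
	"net/http"
	"strconv"
	"time"
)

func sendWithRetry(client *http.Client, url string, body []byte, maxRetries int) error {
	backoff := time.Second
	var lastErr error
	for attempt := 0; attempt <= maxRetries; attempt++ {
		resp, err := client.Post(url, "application/x-protobuf", bytes.NewReader(body))
		if err == nil {
			resp.Body.Close()
			switch {
			case resp.StatusCode < 300:
				return nil // delivered
			case resp.StatusCode == http.StatusTooManyRequests || resp.StatusCode >= 500:
				// Retryable; prefer the server-provided Retry-After delay.
				if s := resp.Header.Get("Retry-After"); s != "" {
					if secs, convErr := strconv.Atoi(s); convErr == nil {
						backoff = time.Duration(secs) * time.Second
					}
				}
				lastErr = fmt.Errorf("retryable status %d", resp.StatusCode)
			default:
				return fmt.Errorf("permanent status %d, dropping batch", resp.StatusCode)
			}
		} else {
			lastErr = err
		}
		time.Sleep(backoff)
		backoff *= 2
	}
	return lastErr
}
```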

I am personally more than happy to be involved with the updating and maintenance of the dependency, configuration and issues related to it.

@mattdurham
Author

For specific answers:

  1. I think we can transition relatively easily, either via a new config section or possibly by mapping the config along with some new optional configuration. If need be, we can read the old WAL and import it into the new one, since it stores prompb.WriteRequest payloads (see the sketch after this list).
  2. A WAL is good for handling network downtime and restart replayability, so it is still needed.
  3. I'm unsure what you mean about the exporter helper; are you speaking of the prometheus exporter?
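A hypothetical sketch of that import path; readLegacyEntries and the Append interface are made-up names, not real APIs. The only real piece is that the legacy WAL stores encoded prompb.WriteRequest payloads:

```go
// Hypothetical migration sketch: decode the raw entries left in the legacy WAL
// and re-append them to the new backend instead of dropping them.
package sketch

import "github.com/prometheus/prometheus/prompb"

// newBackend stands in for whatever append API the new WAL queue exposes.
type newBackend interface {
	Append(req *prompb.WriteRequest) error
}

// migrateLegacyWAL drains pending entries (oldest first) from the old WAL and
// pushes them into the new backend. readLegacyEntries is assumed to return the
// raw protobuf bytes of each pending entry.
func migrateLegacyWAL(readLegacyEntries func() ([][]byte, error), dst newBackend) error {
	entries, err := readLegacyEntries()
	if err != nil {
		return err
	}
	for _, raw := range entries {
		req := &prompb.WriteRequest{}
		if err := req.Unmarshal(raw); err != nil { // gogo-generated Unmarshal
			return err
		}
		if err := dst.Append(req); err != nil {
			return err
		}
	}
	return nil
}
```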

@ArthurSens
Member

I'm unsure what you mean about the exporter helper; are you speaking of the prometheus exporter?

The PrometheusRemoteWrite exporter uses a library developed in collector-core called exporter-helper to coordinate parallelism, batching, retries, persistence on disk, and other things that collector exporters usually need. But since the exporter-helper couldn't maintain datapoint ordering and Prometheus didn't have OOO support in the past, we were forced to use another WAL implementation (the one we've got today).

Given Prometheus has started supporting OOO samples, do we still need a WAL?

I believe so, yes. Even though Prometheus now supports OOO, the performance is worse than in the usual in-order scenario. As long as exporter-helper doesn't support ordering, using an in-order WAL will be much better for Prometheus.

we should not drop telemetry stored in the "legacy" WAL.

Would it make sense to use feature flags to coordinate the transition? We keep both while we verify that the new one works well, then replace the old one once it's stable?

@dashpole
Contributor

Would it make sense to use feature flags to coordinate the transition? We keep both while we verify that the new one works well, then replace the old one once it's stable?

I could see something like this. If both are enabled, only the new WAL would be written to, and the old one would only be read from. We can leave it in that state for a few releases before removing the old WAL.

I would really like to get input from @Aneurysm9, since he implemented the previous WAL in #7304 and open-telemetry/opentelemetry-collector#3597.
