Add alternative prometheus remote write implementation to prometheusremotewriteexporter
#37284
Pinging code owners:
See Adding Labels via Comments if you do not have permissions to add labels yourself.
A few questions/points:
The effort to add support for PRW 2.0 is being worked on by @jmichalek132. I would sync up with him on that. If we can keep that effort decoupled from the WAL implementation, that would be preferable.
I was trying to benchmark the performance of the existing WAL implementation versus the grafana walqueue. I ran into several issues getting it to work properly, and I'm trying to figure out whether it's my tests or underlying code issues. If I turn off the WAL, the benchmark runs fine. My quick and dirty code is here. It's not optimized for otel ingestion but is likely good enough for a rough comparison. I saw some issues where it seems to either continually send the same data if no new data is written and/or reset the index. I will spend some time to see if I can get an apples-to-apples comparison, and will write up a longer-form comparison of the differences and pros/cons as I dig into the code. Thank you for the response!
Might be worthy of a separate thread, but I noticed the
Pros and Cons of Grafana Wal Queue
Pros
Cons
Overview
The Wal Queue is built to be a more generic replacement for the Prometheus metric WAL. The previous Prometheus WAL has tight coupling between the scrape, WAL, and remote write that makes certain use cases hard, and it also has memory-growth issues at high cardinality. The Wal Queue is meant to avoid those problems in a way not too different in concept from the currently employed queue. It uses a push-based rather than polling-based system to reduce CPU, but fundamentally it writes to a disk queue and then deserializes that disk queue. There is support for multiple underlying data formats; the disk writes are wrapped in a format that adds a metadata dictionary for things like file format, compression, number of records, and other fields. This means the on-disk file format or compression config can be changed and reads will still work with whatever is already written on disk, since the metadata determines how to deserialize each record. It is written in a way that minimizes allocations, memory, and CPU pressure, and it fully supports:
In addition, it handles writes very similarly to traditional Prometheus remote write in terms of 5xx handling, 429s, the Retry-After header, round-robin dialing, and so on. I am personally more than happy to be involved with updating and maintaining the dependency, its configuration, and issues related to it.
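To make the metadata-wrapper idea above concrete, here is a minimal Go sketch. The type and field names are hypothetical and are not the actual walqueue API; it only illustrates writing a small self-describing header ahead of each record so a reader can pick the right decoder even after the file format or compression config has changed:

```go
// Illustrative sketch only: hypothetical names, not the actual walqueue types.
package main

import (
	"bytes"
	"encoding/binary"
	"encoding/json"
	"fmt"
)

// Metadata describes how a record on disk was written, so readers can
// decode it even if the writer's config has changed since.
type Metadata struct {
	FileFormat  string `json:"file_format"`
	Compression string `json:"compression"`
	Records     int    `json:"records"`
}

// wrap prefixes the payload with a length-delimited JSON metadata header.
func wrap(meta Metadata, payload []byte) ([]byte, error) {
	header, err := json.Marshal(meta)
	if err != nil {
		return nil, err
	}
	buf := &bytes.Buffer{}
	if err := binary.Write(buf, binary.BigEndian, uint32(len(header))); err != nil {
		return nil, err
	}
	buf.Write(header)
	buf.Write(payload)
	return buf.Bytes(), nil
}

// unwrap reads the metadata first, then returns it with the raw payload;
// the caller picks a decoder based on meta.FileFormat and meta.Compression.
func unwrap(b []byte) (Metadata, []byte, error) {
	var meta Metadata
	if len(b) < 4 {
		return meta, nil, fmt.Errorf("short record")
	}
	n := binary.BigEndian.Uint32(b[:4])
	if uint32(len(b)) < 4+n {
		return meta, nil, fmt.Errorf("truncated header")
	}
	if err := json.Unmarshal(b[4:4+n], &meta); err != nil {
		return meta, nil, err
	}
	return meta, b[4+n:], nil
}

func main() {
	rec, _ := wrap(Metadata{FileFormat: "v2", Compression: "snappy", Records: 100}, []byte("serialized series"))
	meta, payload, _ := unwrap(rec)
	fmt.Println(meta.FileFormat, meta.Compression, meta.Records, len(payload))
}
```

Because the header travels with every record, old segments keep decoding correctly after a config change; only newly written records use the new format.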
For specific answers:
The PrometheusRemoteWrite exporter used a library developed in collector core called exporter-helper to coordinate parallelism, batching, retries, persistence on disk, and other things that collector exporters usually need. But since exporter-helper couldn't maintain datapoint ordering and Prometheus didn't have out-of-order (OOO) support in the past, we were forced to use another WAL implementation (the one we've got today).
I believe so, yes. Even though Prometheus now supports OOO, the performance is worse than in the usual in-order scenario. While exporter-helper doesn't support ordering, using an in-order WAL will be much better for Prometheus.
Would it make sense to use feature flags to coordinate the transition? We keep both while we verify the new one works well, then replace the old one once it's stable?
I could see something like this. If both are enabled, only the new WAL would be written to, and the old one would only be read from. We can leave it in that state for a few releases before removing the old WAL. I would really like to get input from @Aneurysm9, since he implemented the previous WAL in #7304 and open-telemetry/opentelemetry-collector#3597.
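If the feature-flag route is taken, one plausible shape is the collector's featuregate package, which users can already toggle at runtime with --feature-gates. The sketch below is illustrative only; the gate ID and wiring are hypothetical, not an agreed-upon name or design:

```go
// Sketch: hypothetical gate ID and behavior, not an agreed-upon design.
package prometheusremotewriteexporter

import "go.opentelemetry.io/collector/featuregate"

var newWALGate = featuregate.GlobalRegistry().MustRegister(
	"exporter.prometheusremotewrite.useWALQueue", // hypothetical ID
	featuregate.StageAlpha,
	featuregate.WithRegisterDescription("When enabled, writes go to the new walqueue-based WAL while the legacy WAL is drained read-only."),
)

// useNewWAL would be consulted when constructing the exporter's WAL backend.
func useNewWAL() bool {
	return newWALGate.IsEnabled()
}
```

Users could then opt in per collector instance with `--feature-gates=exporter.prometheusremotewrite.useWALQueue` during the transition releases, and the gate could graduate (or be removed along with the old WAL) once the new path is considered stable.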
Component(s)
exporter/prometheusremotewrite
Is your feature request related to a problem? Please describe.
For Grafana Alloy, I have been working on an alternative approach to the Prometheus remote write WAL.
Describe the solution you'd like
Would it be reasonable to add this new remote write implementation to prometheusremotewriteexporter? The config options would be significantly different from the current WAL config, so only one or the other could be declared (a rough sketch of what that might look like follows below). It's currently experimental in Alloy, but we have been using it internally, alongside several other users. We haven't had any issues scaling it to a few million series written per second. I am also looking into support for the new RW format, which would only require updating the library along with general support.
The underlying code can be found at https://github.com/grafana/walqueue
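As a rough illustration of "only declare one or the other", the alternative backend could get its own config block next to the existing wal one, with validation rejecting a config that sets both. The struct below is a hypothetical sketch, not the actual walqueue configuration surface:

```go
// Sketch: hypothetical fields, not the actual walqueue options.
package prometheusremotewriteexporter

import (
	"errors"
	"time"
)

// WALQueueConfig is an illustrative stand-in for a config block that would be
// mutually exclusive with the legacy `wal` settings.
type WALQueueConfig struct {
	Directory     string        `mapstructure:"directory"`       // where segments are written
	MaxSegmentAge time.Duration `mapstructure:"max_segment_age"` // how long unsent data is retained
	Parallelism   int           `mapstructure:"parallelism"`     // concurrent outgoing write loops
}

// validateWALChoice expresses the "one or the other" rule: reject configs that set both.
func validateWALChoice(legacyWALSet bool, q *WALQueueConfig) error {
	if legacyWALSet && q != nil {
		return errors.New("configure either wal or the walqueue-based backend, not both")
	}
	return nil
}
```

Keeping the two blocks mutually exclusive avoids ambiguity about which WAL owns the data on disk during the transition period discussed above.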
Describe alternatives you've considered
No response
Additional context
No response