Switch distributor->ingester communication to more efficient PushBytes method #430
What this PR does:
This PR switches distributor->ingester communication of trace data to a more efficient PushBytes method. This API differs in that it carries a slice of pre-marshaled batches as byte slices ([][]byte). The distributor now marshals the trace data to a byte slice only once (instead of once per ingester), and all data is delivered to the ingester in a single gRPC call (instead of one per trace). This yields large improvements in both CPU and memory when the replication factor is >= 2, and a non-trivial improvement when the replication factor is 1.
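A minimal sketch of the request shape might look like the following; the type, field, and interface names here are illustrative assumptions, not the exact generated proto code:

```go
// Hypothetical illustration only: the names below sketch the shape of the
// new API described above, not the actual generated types.
package tempopb

import "context"

// PushBytesRequest carries pre-marshaled batches: each element is the
// proto-encoded bytes of one trace batch ([][]byte overall).
type PushBytesRequest struct {
	Requests [][]byte
}

type PushBytesResponse struct{}

// PusherClient sketches the client-side interface the distributor would
// call once per ingester, instead of once per trace.
type PusherClient interface {
	PushBytes(ctx context.Context, req *PushBytesRequest) (*PushBytesResponse, error)
}
```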
Background
The main driver of Tempo TCO is compute, as the object storage itself is very cost effective. Compute is split roughly 50/50 between the distributor and ingester layers. Since the distributor is mainly a proxy that replicates traffic to the ingesters according to the replication factor, its share was higher than expected and looked like a good area for improvement. Profiling the distributor with pprof showed that most CPU and memory usage was related to proto marshalling and compression. Reviewing distributor.Push, the current implementation has several deficiencies. For example, with a replication factor of 2, a gRPC call is made to each ingester for every trace it owns, which means each trace is marshalled, compressed, and sent in its own gRPC call once per ingester.
In theory, the new API signature reduces this to one marshal per trace and a single gRPC call per ingester.
In practice the savings are larger than expected, as less memory churn also means less garbage collection. A sketch of the resulting distributor-side flow follows.
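The sketch below continues the illustrative types from the previous snippet and shows the general shape of the change (the trace type and ring lookup are simplified stand-ins, not the real implementation): marshal each trace once, group the bytes by destination ingester, then issue a single PushBytes call per ingester.

```go
// Continuing the illustrative package above.
package tempopb

// trace is a stand-in for the real trace type; all that matters here is
// that it can marshal itself to proto bytes.
type trace interface {
	Marshal() ([]byte, error)
}

// pushToIngesters sketches the new distributor-side flow.
func pushToIngesters(
	ctx context.Context,
	traces []trace,
	ingestersFor func(trace) []string, // simplified stand-in for the ring lookup
	clients map[string]PusherClient, // one client per ingester address
) error {
	batchesByIngester := map[string][][]byte{}

	for _, t := range traces {
		// Marshal once per trace, regardless of replication factor.
		b, err := t.Marshal()
		if err != nil {
			return err
		}
		// Reuse the same bytes for every replica; no re-marshalling per ingester.
		for _, addr := range ingestersFor(t) {
			batchesByIngester[addr] = append(batchesByIngester[addr], b)
		}
	}

	// One gRPC call per ingester instead of one per trace per ingester.
	for addr, batches := range batchesByIngester {
		req := &PushBytesRequest{Requests: batches}
		if _, err := clients[addr].PushBytes(ctx, req); err != nil {
			return err
		}
	}
	return nil
}
```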
Performance Analysis
Performance before and after was measured both locally and in a dev cluster. A useful docker-compose setup that configures the necessary replication factor is located here: https://github.com/mdisibio/tempo-load-test/tree/master and has also been added to the /integration/microservices/ folder. The setup includes cadvisor, grafana, and a dashboard. The main metric of interest is compute efficiency, i.e. spans / s / core across all distributor and ingester pods. This metric is computed with promQL of (simplifying):
`rate(tempo_receiver_accepted_spans) / rate(container_cpu_usage_seconds_total)`
The existing e2e and benchmarking tools do not make it straightforward to measure this metric, hence the creation of the linked compose setup.
Local testing improvements:
Screenshots:
**Before:**
**After:**
**K8s cluster**
Next Steps
Which issue(s) this PR fixes:
n/a
Checklist
- CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]