Switch distributor->ingester communication to more efficient PushBytes method #430

Merged: 4 commits, Jan 4, 2021

@mdisibio mdisibio commented Dec 23, 2020

What this PR does:
This PR switches distributor->ingester communication of trace data to a more efficient PushBytes method. The new API differs in that the request carries a slice of byte slices ([][]byte) containing pre-marshaled Batches. The distributor marshals the trace data to a byte slice only once (instead of once per ingester), and all data for an ingester is delivered in one gRPC call (instead of one call per trace). This yields large CPU and memory improvements when ReplicationFactor >= 2, and a non-trivial improvement when ReplicationFactor = 1.
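
A minimal sketch of the marshal-once idea, not the code in this PR: the names `trace`, `pushBytesRequest`, `buildRequests`, and the `owners` map are hypothetical, and JSON stands in for proto marshaling so that the example needs only the standard library.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// trace is a stand-in for a batch of spans belonging to one trace.
type trace struct {
	ID    string   `json:"id"`
	Spans []string `json:"spans"`
}

// pushBytesRequest is a hypothetical request shape: one entry per
// pre-marshaled trace, so the transport only handles opaque bytes.
type pushBytesRequest struct {
	Traces [][]byte
}

// buildRequests marshals each trace exactly once and shares the resulting
// byte slice with every ingester that should receive it. owners maps a
// trace ID to its ingesters (an assumption; a real distributor would
// derive this from its ring). It returns one request per ingester and the
// number of marshal calls performed.
func buildRequests(traces []trace, owners map[string][]string) (map[string]*pushBytesRequest, int, error) {
	reqs := make(map[string]*pushBytesRequest)
	marshals := 0
	for _, t := range traces {
		b, err := json.Marshal(t) // stand-in for proto marshaling
		if err != nil {
			return nil, 0, err
		}
		marshals++
		for _, ing := range owners[t.ID] {
			r, ok := reqs[ing]
			if !ok {
				r = &pushBytesRequest{}
				reqs[ing] = r
			}
			r.Traces = append(r.Traces, b) // same bytes, no re-marshal
		}
	}
	return reqs, marshals, nil
}

func main() {
	traces := []trace{
		{ID: "a", Spans: []string{"s1"}},
		{ID: "b", Spans: []string{"s2", "s3"}},
	}
	owners := map[string][]string{
		"a": {"ingester-1", "ingester-2"},
		"b": {"ingester-2", "ingester-3"},
	}
	reqs, marshals, err := buildRequests(traces, owners)
	if err != nil {
		panic(err)
	}
	fmt.Println("marshal calls:", marshals, "requests:", len(reqs))
}
```

With a replication factor of 2, each trace is marshaled once and its bytes are appended to two per-ingester requests, so the number of marshal calls tracks the number of traces rather than traces times replicas.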

Background
The main driver of Tempo TCO is compute, as the object storage itself is very cost effective. Compute is split roughly 50/50 between the distributor and ingester layers. Since the distributor is mainly a proxy that replicates traffic to the ingesters according to the replication factor, its share was higher than expected and seemed a good area for improvement. Profiling the distributor with pprof showed that most CPU and memory usage was related to proto marshalling and compression. A review of distributor.Push showed several deficiencies in the current implementation. For example, when ReplicationFactor = 2, a gRPC call is made to each ingester for each trace that belongs to it. This incurs:

  • 2x marshaling per trace
  • 2x compression per trace
  • 2x gRPC calls per trace

The new API signature reduces it to the following in theory:

  • 1x marshaling per trace
  • 2x compression per trace
  • 1x gRPC call per ingester
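
The two lists above can be turned into a small cost model. This is only a sketch that follows the accounting used in this description; the type and function names are hypothetical, and it assumes every ingester receives at least one trace from the batch.

```go
package main

import "fmt"

// cost counts the operations the distributor performs for one batch.
type cost struct {
	Marshals     int
	Compressions int
	GRPCCalls    int
}

// perTracePush models the old path: every trace is marshaled, compressed,
// and sent separately for each replica.
func perTracePush(traces, rf int) cost {
	n := traces * rf
	return cost{Marshals: n, Compressions: n, GRPCCalls: n}
}

// pushBytes models the new path as described above: one marshal per trace,
// compression still counted once per replica, and one gRPC call per
// ingester (assuming each ingester gets at least one trace).
func pushBytes(traces, rf, ingesters int) cost {
	return cost{Marshals: traces, Compressions: traces * rf, GRPCCalls: ingesters}
}

func main() {
	// Illustrative numbers only: 100 traces, replication factor 2, 3 ingesters.
	fmt.Printf("old: %+v\n", perTracePush(100, 2))
	fmt.Printf("new: %+v\n", pushBytes(100, 2, 3))
}
```

For 100 traces, a replication factor of 2, and 3 ingesters, the model gives 200 marshals and 200 gRPC calls on the old path against 100 marshals and 3 gRPC calls on the new one, with compression unchanged at 200.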

In practice there are larger than expected savings, as less memory churn means less garbage collection.

Performance Analysis
Performance before and after was measured locally and in a dev cluster. A useful docker-compose setup that configures the necessary replication factor is located here: https://github.com/mdisibio/tempo-load-test/tree/master (added to the /integration/microservices/ folder). This setup includes cadvisor, grafana, and a dashboard.

The main metric of interest is compute efficiency, measured as spans/s/core across all distributor and ingester pods. It is computed with PromQL along the lines of (simplifying) rate(tempo_receiver_accepted_spans) / rate(container_cpu_usage_seconds_total).
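
As a worked illustration of what that ratio means, here is a sketch of the same arithmetic in Go. The function name and the sample numbers are made up for the example and are not measurements from this PR.

```go
package main

import "fmt"

// spansPerSecondPerCore mirrors the simplified query: the rate of accepted
// spans divided by the rate of CPU seconds consumed. The rate of CPU
// seconds over a window equals the average number of cores in use.
func spansPerSecondPerCore(spansDelta, cpuSecondsDelta, windowSeconds float64) float64 {
	if windowSeconds <= 0 || cpuSecondsDelta <= 0 {
		return 0 // avoid dividing by zero when there is no data
	}
	spanRate := spansDelta / windowSeconds       // spans per second
	coresUsed := cpuSecondsDelta / windowSeconds // average cores busy
	return spanRate / coresUsed
}

func main() {
	// Illustrative numbers: 120000 spans accepted over 60 s while the pods
	// burned 30 CPU-seconds, i.e. 2000 spans/s on half a core.
	fmt.Println(spansPerSecondPerCore(120000, 30, 60))
}
```

The window length cancels out, so the metric reduces to spans accepted per CPU-second consumed, which is what makes it comparable across differently sized deployments.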

The current e2e and benchmarking tools did not make this straightforward to measure, hence the creation of the linked compose setup.

Improvements:

  • In local testing, spans/s/core went from 2700 -> 10-15K, 5x+
  • In the dev k8s cluster, spans/s/core went from 2200 -> 5K, 2.5x+

Screenshots:

**Before:**
[Screenshot: Screen Shot 2020-12-23 at 2 55 13 PM]

**After:**
[Screenshot: Screen Shot 2020-12-23 at 3 06 10 PM]

**K8s cluster:**
[Screenshot: Screen Shot 2020-12-23 at 4 53 06 PM]

Next Steps

  • Think more about why the savings are so significant. Does gRPC really behave that differently when operating on a complex object graph instead of a simple [][]byte?
  • There is possibly a more elegant solution using the experimental gRPC PreparedMsg API, which would allow both marshalling and compression to be done once per trace. Discussion: "Sending large complex message limits throughput" (grpc/grpc-go#1879). However, this requires a dependency update that we are blocked on.
  • Think about how to incorporate measurement of the spans/s/core metric into the CI pipeline and this main repo.
  • Think about how to better test the microservices mode, with separate distributors and ingesters and replication factor > 1.

Which issue(s) this PR fixes:
n/a

Checklist

  • Tests updated
  • Documentation added
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]

@joe-elliott
Member

This looks good to me. Excellent performance improvements. Is there a reason not to PR the microservices docker-compose file to this repo?

Since we are 0.x I think we should aggressively remove the old proto/GRPC calls and leave these in place. Perhaps cut a release with this change and then one with the old endpoint removed?

In the changelog can you add a brief description of how to migrate to this new setup? Roll ingesters first completely and then roll distributors?

@mdisibio
Contributor Author

mdisibio commented Jan 4, 2021

> This looks good to me. Excellent performance improvements. Is there a reason not to PR the microservices docker-compose file to this repo?

The profiling setup didn't seem to fit well within the existing docker-compose examples. What about adding it to a new /integration/profiling/ folder?

> Since we are 0.x I think we should aggressively remove the old proto/GRPC calls and leave these in place. Perhaps cut a release with this change and then one with the old endpoint removed?
> In the changelog can you add a brief description of how to migrate to this new setup? Roll ingesters first completely and then roll distributors?

Agreed, a two-phase release sounds ideal. I will add that info.

@mdisibio mdisibio changed the title from "WIP: Switch distributor->ingester communication to more efficient PushBytes method" to "Switch distributor->ingester communication to more efficient PushBytes method" Jan 4, 2021
@joe-elliott joe-elliott merged commit 9e0e05a into grafana:master Jan 4, 2021
@mdisibio mdisibio deleted the push-bytes branch February 3, 2021 18:08