Kafka receiver is restarted when Tempo receives traces that are too large #1944
Comments
Thank you for the report! I'm going to dump what we discovered in our Slack convo here.

Tempo returns errors to the OTel receiver here: tempo/modules/distributor/receiver/shim.go, lines 282 to 288 in 93f1211.

The OTel receiver then exits its loop, which is immediately restarted by the Kafka client library, causing the "Starting consumer group" log message. It is unclear to me what behavior would be correct here. It feels likely that the OTel receiver loop should not exit on error and should simply continue attempting to process messages.

Unfortunately, there is a deeper issue here: the OTel Kafka receiver code does not provide a way to distinguish between errors that should and should not be retried. Errors such as TRACE_TOO_LARGE will never succeed on retry, whereas transient failures would.

The best immediate solution I can suggest is to run an OTel Collector between Kafka and Tempo: Kafka -> OTel Collector (kafka receiver) -> Tempo (OTLP). The OTel Collector will correctly respect the gRPC error codes returned by Tempo and either drop batches or retry them based on the returned error.
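A minimal collector pipeline along those lines might look like the sketch below; the broker address, topic, protocol version, and Tempo endpoint are assumptions to adapt to your deployment:

```yaml
receivers:
  kafka:
    brokers: ["kafka:9092"]    # assumption: your broker address
    topic: otlp_spans          # assumption: your traces topic
    protocol_version: 2.0.0    # assumption: your broker's protocol version
    encoding: otlp_proto

exporters:
  otlp:
    endpoint: tempo:4317       # assumption: Tempo's OTLP gRPC endpoint
    tls:
      insecure: true           # assumption: no TLS inside the cluster

service:
  pipelines:
    traces:
      receivers: [kafka]
      exporters: [otlp]
```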
This issue has been automatically marked as stale because it has not had any activity in the past 60 days.
@joe-elliott We're also having this issue on 2.0.1. Is the OTel Collector in between Kafka and Tempo still the only option?
Unless the OTel Collector code has changed, yes, the best option is still to run the collector between Kafka and Tempo.
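For illustration, the drop-versus-retry decision the collector makes based on Tempo's gRPC status codes could look like the rough Go sketch below. This is not the collector's actual code, and the exact code-to-outcome mapping is an assumption:

```go
package main

import (
	"fmt"

	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// shouldRetry reports whether a failed write is worth retrying.
// Permanent client-side errors (such as a batch rejected because the
// trace is too large) will never succeed, so the batch should be
// dropped; transient server-side errors should be retried.
func shouldRetry(err error) bool {
	st, ok := status.FromError(err)
	if !ok {
		return false // not a gRPC status error; treat as permanent
	}
	switch st.Code() {
	case codes.InvalidArgument, codes.FailedPrecondition, codes.ResourceExhausted:
		// Assumption: TRACE_TOO_LARGE surfaces as one of these
		// non-retryable codes.
		return false
	case codes.Unavailable, codes.DeadlineExceeded:
		return true // transient: the backend may recover
	default:
		return false
	}
}

func main() {
	err := status.Error(codes.ResourceExhausted, "TRACE_TOO_LARGE: max size of trace exceeded")
	fmt.Println("retry?", shouldRetry(err)) // retry? false
}
```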
Describe the bug
When Tempo receives a trace that exceeds the `max_bytes_per_trace` setting and generates a `TRACE_TOO_LARGE` error, the Kafka receiver is restarted, leading to latency in trace ingestion.
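For reference, the limit in question is configured in Tempo's overrides block; a minimal sketch (the value shown is an example, not a recommendation):

```yaml
overrides:
  max_bytes_per_trace: 5000000  # maximum trace size in bytes; 0 disables the check
```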
By "restarted" I mean that the consumer restarts; it can be observed via the repeated "Starting consumer group" messages in the logs.
So, say you keep receiving spans belonging to one large trace: Tempo raises the error, and the consumer restarts, every time a span is appended to that trace. Each restart takes time, and Tempo ends up lagging behind.

Having large traces is of course an issue in itself, but it can happen, and it should ideally not affect Tempo's behavior.
I understand that the issue lies between Tempo and the OTel `kafkareceiver`, but I felt that creating the issue here was appropriate, as the error handling is probably meant to be controlled by Tempo, since it is Tempo that raises the error in the first place.

Debugging locally, I was able to avoid the problem by ignoring the error in https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/receiver/kafkareceiver/kafka_receiver.go#L448. This is just to point at the code where the error handling happens; see the sketch below. Unfortunately, I'm not familiar enough with the codebase to propose a fix. In general, how is error handling performed in receivers?
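As a point of reference, here is a minimal, hypothetical Go sketch of a sarama consumer-group handler of the same shape as the linked code; identifiers and the exact control flow are assumptions, not the upstream implementation:

```go
package sketch

import (
	"github.com/IBM/sarama" // assumption: import path varies by sarama version
	"go.opentelemetry.io/collector/consumer"
	"go.opentelemetry.io/collector/pdata/ptrace"
)

// tracesHandler is an illustrative stand-in for the receiver's
// consumer-group handler.
type tracesHandler struct {
	unmarshaler  *ptrace.ProtoUnmarshaler
	nextConsumer consumer.Traces
}

func (h *tracesHandler) Setup(sarama.ConsumerGroupSession) error   { return nil }
func (h *tracesHandler) Cleanup(sarama.ConsumerGroupSession) error { return nil }

func (h *tracesHandler) ConsumeClaim(
	session sarama.ConsumerGroupSession,
	claim sarama.ConsumerGroupClaim,
) error {
	for message := range claim.Messages() {
		traces, err := h.unmarshaler.UnmarshalTraces(message.Value)
		if err != nil {
			return err
		}
		if err := h.nextConsumer.ConsumeTraces(session.Context(), traces); err != nil {
			// Returning a non-nil error ends the consumer-group session;
			// sarama then rejoins the group, which is the restart seen in
			// the "Starting consumer group" logs. The local workaround was
			// to log and `continue` here instead of returning.
			return err
		}
		session.MarkMessage(message, "")
	}
	return nil
}
```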
To Reproduce
Steps to reproduce the behavior:
Expected behavior
`TRACE_TOO_LARGE` errors should not impact Tempo's performance; the error should simply be reported and the offending span ignored.
Environment:
Additional Context