Kafka receiver is restarted when Tempo receives traces that are too large #1944

Closed
youribonnaffe opened this issue Dec 9, 2022 · 4 comments
Labels: stale (used for stale issues / PRs)

Comments

@youribonnaffe

Describe the bug
When Tempo receives a trace that exceeds the max_bytes_per_trace setting and generates a TRACE_TOO_LARGE error,
the Kafka receiver is restarted, leading to latency in trace ingestion.
By restarting I mean that the consumer restarts; this can be observed in the following logs:

level=warn ts=2022-12-09T15:40:50.176677Z caller=grpc_logging.go:43 method=/tempopb.Pusher/PushBytesV2 duration=74.333µs err="rpc error: code = FailedPrecondition desc = TRACE_TOO_LARGE: max size of trace (300) exceeded while adding 277 bytes to trace 09e8d40e0c59e5d35360167e1f8725a1 for tenant single-tenant" msg=gRPC
level=error ts=2022-12-09T15:40:50.176906Z caller=rate_limited_logger.go:27 msg="pusher failed to consume trace data" err="rpc error: code = FailedPrecondition desc = TRACE_TOO_LARGE: max size of trace (300) exceeded while adding 277 bytes to trace 09e8d40e0c59e5d35360167e1f8725a1 for tenant single-tenant"
ts=2022-12-09T15:40:51Z level=info msg="Starting consumer group" component=tempo partition=0

So if you receive spans from a large trace, Tempo raises the error and restarts the consumer every time a span is added to that trace. Each restart takes time and leads to Tempo lagging behind on ingestion.

Having large traces is of course an issue in itself, but it can happen, and it should ideally not affect Tempo's behavior.

I understand that the issue lies between Tempo and the OTel kafkareceiver, but I felt that creating the issue here was appropriate since the error handling is probably meant to be controlled by Tempo, which raises the error in the first place.

Debugging locally, I was able to avoid the problem by ignoring the error in https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/receiver/kafkareceiver/kafka_receiver.go#L448. This is just to point at the code where the error handling happens; unfortunately I'm not familiar enough with the codebase to propose a fix. In general, how is error handling performed in receivers?

To Reproduce
Steps to reproduce the behavior:

  1. Minimal configuration using a Kafka receiver (a config sketch is shown after these steps)
  2. Set max_bytes_per_trace to a very small value
  3. Start Tempo (#964abf16fc09b6aee09c3cbe574f0ae3456880eb)
  4. Generate a trace with a few spans
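
For reference, a minimal sketch of steps 1 and 2 might look like the fragment below. The broker address, topic, and values are illustrative assumptions, required sections such as server and storage are omitted, and field names may differ by Tempo version.

# Hypothetical minimal Tempo config fragment (not a complete config)
distributor:
  receivers:
    kafka:
      brokers:
        - localhost:9092      # assumed local Kafka broker
      topic: otlp_spans
      protocol_version: 2.0.0

overrides:
  max_bytes_per_trace: 300    # artificially small so a few spans trigger TRACE_TOO_LARGE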

Expected behavior
TRACE_TOO_LARGE errors should not impact Tempo's ingestion performance; they should simply be reported and the offending spans ignored.

Environment:

  • Infrastructure: laptop
  • Deployment tool: local build of Tempo

Additional Context

@joe-elliott
Member

Thank you for the report! I'm going to dump what we discovered in our Slack conversation here.

Tempo returns errors to the OTel receiver here:

_, err = r.pusher.PushTraces(ctx, td)
metricPushDuration.Observe(time.Since(start).Seconds())
if err != nil {
	r.logger.Log("msg", "pusher failed to consume trace data", "err", err)
}
return err // returning the error makes the receiver's consume loop exit

The OTel receiver then exits its consume loop, which is immediately restarted by the Kafka client library, producing the "Starting consumer group" log message:
https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/08cd036e9da7fa0678402fe8ce1d4d7bd3234a6a/receiver/kafkareceiver/kafka_receiver.go#L443-L449

It is unclear to me what behavior would be correct here. It feels likely that the OTel receiver loop should not exit on error and should simply continue attempting to process messages.

Unfortunately, there is a deeper issue here. The OTel kafka receiver code does not provide a way to distinguish between errors that should and should not be retried. Errors such as TRACE_TOO_LARGE should not be retried, whereas LIVE_TRACES_EXCEEDED should. So even with the above fix, the Kafka receiver would repeatedly slam Tempo with batches that should be dropped (depending on settings).
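
To illustrate the distinction, here is a sketch only (not actual kafkareceiver code; the function and type names are made up) of how a consume loop could inspect the gRPC status code of the push error and only propagate errors worth retrying. The logs above show that Tempo reports TRACE_TOO_LARGE with code FailedPrecondition:

package main

import (
	"context"
	"log"

	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// pushFunc stands in for whatever forwards the decoded batch to Tempo.
type pushFunc func(ctx context.Context) error

// consumeMessage is a hypothetical message handler: it drops batches that can
// never succeed and only returns errors that are worth retrying.
func consumeMessage(ctx context.Context, push pushFunc) error {
	err := push(ctx)
	if err == nil {
		return nil
	}
	// Tempo reports TRACE_TOO_LARGE with gRPC code FailedPrecondition (see the
	// logs above). Retrying such a batch cannot succeed, so log it and move on
	// instead of returning the error, which would restart the consumer group.
	if status.Code(err) == codes.FailedPrecondition {
		log.Printf("dropping batch, non-retryable push error: %v", err)
		return nil
	}
	// Everything else is propagated so the existing retry/restart behavior
	// still applies.
	return err
}

func main() {
	// Simulate a push that fails the way Tempo reports TRACE_TOO_LARGE.
	push := func(ctx context.Context) error {
		return status.Error(codes.FailedPrecondition, "TRACE_TOO_LARGE: max size of trace exceeded")
	}
	if err := consumeMessage(context.Background(), push); err != nil {
		log.Fatalf("unexpected retryable error: %v", err)
	}
}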

The best immediate solution I can suggest is to run:
Kafka -> OTel Collector -(gRPC)-> Tempo

The OTel Collector will correctly respect the gRPC error codes returned by Tempo and either drop batches or retry them based on the returned error.
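
A rough sketch of that intermediate Collector configuration, assuming a local Kafka broker, an otlp_spans topic, and Tempo's OTLP/gRPC endpoint at tempo:4317 (all of these are assumptions to adapt to your deployment):

# Hypothetical OTel Collector config sitting between Kafka and Tempo
receivers:
  kafka:
    brokers:
      - localhost:9092       # assumed Kafka broker address
    topic: otlp_spans
    protocol_version: 2.0.0

exporters:
  otlp:
    endpoint: tempo:4317     # assumed Tempo OTLP/gRPC endpoint
    tls:
      insecure: true

service:
  pipelines:
    traces:
      receivers: [kafka]
      exporters: [otlp]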

@github-actions
Contributor

github-actions bot commented Feb 8, 2023

This issue has been automatically marked as stale because it has not had any activity in the past 60 days.
The next time this stale check runs, the stale label will be removed if there is new activity. The issue will be closed after 15 days if there is no new activity.
Please apply the keepalive label to exempt this issue.

@github-actions github-actions bot added the stale label on Feb 8, 2023
@github-actions github-actions bot closed this as not planned (won't fix, can't repro, duplicate, stale) on Feb 23, 2023
@Symbianx

Symbianx commented May 5, 2023

@joe-elliott We're also having this issue on 2.0.1. Is the OTel Collector between Kafka and Tempo still the only option?

@joe-elliott
Member

Unless the OTel Collector code has changed, then yes, the best option is to use the Collector between Kafka and Tempo.
