Kafka receiver is restarted when Tempo receives traces that are too large #1944

Closed
youribonnaffe opened this issue Dec 9, 2022 · 4 comments
Labels: stale (used for stale issues / PRs)

Comments

@youribonnaffe

Describe the bug
When Tempo receives a trace that exceeds the max_bytes_per_trace setting and generates a TRACE_TOO_LARGE error,
the Kafka receiver is restarted, leading to latency in trace ingestion.
By restarting I mean that the consumer restarts; this can be observed in the following logs:

level=warn ts=2022-12-09T15:40:50.176677Z caller=grpc_logging.go:43 method=/tempopb.Pusher/PushBytesV2 duration=74.333µs err="rpc error: code = FailedPrecondition desc = TRACE_TOO_LARGE: max size of trace (300) exceeded while adding 277 bytes to trace 09e8d40e0c59e5d35360167e1f8725a1 for tenant single-tenant" msg=gRPC
level=error ts=2022-12-09T15:40:50.176906Z caller=rate_limited_logger.go:27 msg="pusher failed to consume trace data" err="rpc error: code = FailedPrecondition desc = TRACE_TOO_LARGE: max size of trace (300) exceeded while adding 277 bytes to trace 09e8d40e0c59e5d35360167e1f8725a1 for tenant single-tenant"
ts=2022-12-09T15:40:51Z level=info msg="Starting consumer group" component=tempo partition=0

So if you receive spans from a large trace, Tempo raises the error and restarts the consumer every time a span is added to that trace. Each restart takes time and leads to Tempo lagging behind on ingestion.

Having large traces is of course an issue in itself, but it can happen, and it should ideally not affect Tempo's behavior.

I understand that the issue lies between Tempo and the OTel kafkareceiver, but I felt that creating the issue here was appropriate since the error handling is probably meant to be controlled by Tempo, which raises the error in the first place.

Debugging locally, I was able to avoid the problem by ignoring the error in https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/receiver/kafkareceiver/kafka_receiver.go#L448. This is just to point at the code where the error handling happens; unfortunately I'm not familiar enough with the codebase to propose a fix. In general, how is error handling performed in receivers?

To Reproduce
Steps to reproduce the behavior:

  1. Minimal configuration using a Kafka receiver (a config sketch is shown after these steps)
  2. Set max_bytes_per_trace to a very small value
  3. Start Tempo (#964abf16fc09b6aee09c3cbe574f0ae3456880eb)
  4. Generate a trace with a few spans
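
For reference, a minimal sketch of steps 1 and 2 might look like the fragment below. The broker address, topic, and values are illustrative assumptions, required sections such as server and storage are omitted, and field names may differ by Tempo version.

# Hypothetical minimal Tempo config fragment (not a complete config)
distributor:
  receivers:
    kafka:
      brokers:
        - localhost:9092      # assumed local Kafka broker
      topic: otlp_spans
      protocol_version: 2.0.0

overrides:
  max_bytes_per_trace: 300    # artificially small so a few spans trigger TRACE_TOO_LARGE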

Expected behavior
TRACE_TOO_LARGE errors should not impact Tempo's ingestion performance; they should simply be reported and the offending spans ignored.

Environment:

  • Infrastructure: laptop
  • Deployment tool: local build of Tempo

Additional Context

@joe-elliott
Member

Thank you for the report! I'm going to dump what we discovered in our Slack conversation here.

Tempo returns errors to the OTel receiver here:

_, err = r.pusher.PushTraces(ctx, td)
metricPushDuration.Observe(time.Since(start).Seconds())
if err != nil {
	r.logger.Log("msg", "pusher failed to consume trace data", "err", err)
}
return err // returning the error makes the receiver's consume loop exit

The OTel receiver then exits its consume loop, which is immediately restarted by the Kafka client library, producing the "Starting consumer group" log message:
https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/08cd036e9da7fa0678402fe8ce1d4d7bd3234a6a/receiver/kafkareceiver/kafka_receiver.go#L443-L449

It is unclear to me what behavior would be correct here. It feels likely that the OTel receiver loop should not exit on error and should simply continue attempting to process messages.

Unfortunately, there is a deeper issue here. The OTel kafka receiver code does not provide a way to distinguish between errors that should and should not be retried. Errors such as TRACE_TOO_LARGE should not be retried, whereas LIVE_TRACES_EXCEEDED should. So even with the above fix, the Kafka receiver would repeatedly slam Tempo with batches that should be dropped (depending on settings).
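
To illustrate the distinction, here is a sketch only (not actual kafkareceiver code; the function and type names are made up) of how a consume loop could inspect the gRPC status code of the push error and only propagate errors worth retrying. The logs above show that Tempo reports TRACE_TOO_LARGE with code FailedPrecondition:

package main

import (
	"context"
	"log"

	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// pushFunc stands in for whatever forwards the decoded batch to Tempo.
type pushFunc func(ctx context.Context) error

// consumeMessage is a hypothetical message handler: it drops batches that can
// never succeed and only returns errors that are worth retrying.
func consumeMessage(ctx context.Context, push pushFunc) error {
	err := push(ctx)
	if err == nil {
		return nil
	}
	// Tempo reports TRACE_TOO_LARGE with gRPC code FailedPrecondition (see the
	// logs above). Retrying such a batch cannot succeed, so log it and move on
	// instead of returning the error, which would restart the consumer group.
	if status.Code(err) == codes.FailedPrecondition {
		log.Printf("dropping batch, non-retryable push error: %v", err)
		return nil
	}
	// Everything else is propagated so the existing retry/restart behavior
	// still applies.
	return err
}

func main() {
	// Simulate a push that fails the way Tempo reports TRACE_TOO_LARGE.
	push := func(ctx context.Context) error {
		return status.Error(codes.FailedPrecondition, "TRACE_TOO_LARGE: max size of trace exceeded")
	}
	if err := consumeMessage(context.Background(), push); err != nil {
		log.Fatalf("unexpected retryable error: %v", err)
	}
}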

The best immediate solution I can suggest is to run:
Kafka -> OTel Collector -(gRPC)-> Tempo

The OTel Collector will correctly respect the gRPC error codes returned by Tempo and either drop batches or retry them based on the returned error.
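
A rough sketch of that intermediate Collector configuration, assuming a local Kafka broker, an otlp_spans topic, and Tempo's OTLP/gRPC endpoint at tempo:4317 (all of these are assumptions to adapt to your deployment):

# Hypothetical OTel Collector config sitting between Kafka and Tempo
receivers:
  kafka:
    brokers:
      - localhost:9092       # assumed Kafka broker address
    topic: otlp_spans
    protocol_version: 2.0.0

exporters:
  otlp:
    endpoint: tempo:4317     # assumed Tempo OTLP/gRPC endpoint
    tls:
      insecure: true

service:
  pipelines:
    traces:
      receivers: [kafka]
      exporters: [otlp]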

@github-actions
Contributor

github-actions bot commented Feb 8, 2023

This issue has been automatically marked as stale because it has not had any activity in the past 60 days.
The next time this stale check runs, the stale label will be removed if there is new activity. The issue will be closed after 15 days if there is no new activity.
Please apply the keepalive label to exempt this issue.

@github-actions github-actions bot added the stale label on Feb 8, 2023
@github-actions github-actions bot closed this as not planned (won't fix, can't repro, duplicate, stale) on Feb 23, 2023
@Symbianx

Symbianx commented May 5, 2023

@joe-elliott We're also having this issue on 2.0.1. Is the OTel Collector between Kafka and Tempo still the only option?

@joe-elliott
Member

Unless the OTel Collector code has changed, then yes, the best option is to use the Collector between Kafka and Tempo.
