Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Batch processor concurrency and error transmission #11308

Open
jmacd opened this issue Sep 30, 2024 · 2 comments · May be fixed by #11947
Open

Batch processor concurrency and error transmission #11308

jmacd opened this issue Sep 30, 2024 · 2 comments · May be fixed by #11947

Comments

@jmacd
Copy link
Contributor

jmacd commented Sep 30, 2024

Description
The OTel-Arrow project forked the batch processor component and extended it with support for concurrency and error transmission. We have been satisfied with this approach and wish to contribute the whole set of features back to the core repository and use feature-flags to transition away from the legacy behaviors.

The functional changes include:

  • Tracing support: new root spans are linked with incoming requests; incoming requests are linked without outgoing exports
  • Concurrent export behavior
  • Synchronous error transmission

The existing component differs from our desired behavior in the second two of these points--it is limited to 1 concurrent export and does not transmit errors back. In this issue, we propose the use of feature-flags to transition from both of these legacy behaviors, which run counter to our performance and reliability goals.

The proposed configuration changes:

  • early_return (current default: true; eventual default: false)
  • max_concurrency (current default: 1; eventual default: unlimited)

The changes here were developed downstream, see https://github.com/open-telemetry/otel-arrow/blob/main/collector/processor/concurrentbatchprocessor

Link to tracking issue
Here are a number of issues that are connected with this one, both issues pointing to unmet needs in the exporterhelper batcher and missing features in the legacy batcher. Accepting these changes will allow significantly improved batching support in the interim period until the new batching support is complete.

Part of #10825 -- until this features is complete, users who depend on metadata_keys in the batch processor will not be able to upgrade
Part of #10368 -- I see this as the root cause, we haven't been able to introduce concurrency to the exporterhelper w/o also introducing a queue, which interferes with error transmission
Fixes #8245 -- my original report about the problem solved here -- we add concurrency with batching and error transmission and do not depend on a queue (persistent or in-memory)
Part of #9591 -- users must use one of the available memory limiting mechanisms in conjunction with the batch processor
Part of #8122 -- until this is finished, users depend on the original batch processor
Part of #7460 -- another statement of #8245; the batch processor does not propagate errors today, and this fixes the batch processor's contribution to the problem.

bogdandrutu added a commit that referenced this issue Sep 30, 2024
#### Description

Split the internal `batcher.export()` interface into three methods. This
is a refactoring that was applied in
https://github.com/open-telemetry/otel-arrow/tree/main/collector/processor/concurrentbatchprocessor
and is being back-ported as part of #11308. The reason this refactoring
is needed is that the parent context of the export() request will be
manipulated in common code (vs signal-specific code) for tracing
support.

#### Link to tracking issue
Part of #11308

#### Testing
Existing tests cover this.

---------

Co-authored-by: Bogdan Drutu <[email protected]>
bogdandrutu pushed a commit that referenced this issue Oct 1, 2024
…d names (#11314)

#### Description
There are two implementations of the `batcher` interface with several
inconsistencies. Notably, the single-shard case would start a goroutine
before `Start()` is called, which can be confusing when the goroutine
leak checker notices it. This makes the single-shard batcher wait until
Start() to create its single shard. A confusing field name `batcher` in
this case becomes `single`, and a new field is added for use in start
named `processor` to create the new shard.

For the multi-shard batcher, `start()` does nothing, but this structure
confusingly embeds the batchProcessor. Rename the field `processor` for
consistency with the single-shard case.

#### Link to tracking issue
Part of #11308.
jackgopack4 pushed a commit to jackgopack4/opentelemetry-collector that referenced this issue Oct 8, 2024
#### Description

Split the internal `batcher.export()` interface into three methods. This
is a refactoring that was applied in
https://github.com/open-telemetry/otel-arrow/tree/main/collector/processor/concurrentbatchprocessor
and is being back-ported as part of open-telemetry#11308. The reason this refactoring
is needed is that the parent context of the export() request will be
manipulated in common code (vs signal-specific code) for tracing
support.

#### Link to tracking issue
Part of open-telemetry#11308

#### Testing
Existing tests cover this.

---------

Co-authored-by: Bogdan Drutu <[email protected]>
jackgopack4 pushed a commit to jackgopack4/opentelemetry-collector that referenced this issue Oct 8, 2024
…d names (open-telemetry#11314)

#### Description
There are two implementations of the `batcher` interface with several
inconsistencies. Notably, the single-shard case would start a goroutine
before `Start()` is called, which can be confusing when the goroutine
leak checker notices it. This makes the single-shard batcher wait until
Start() to create its single shard. A confusing field name `batcher` in
this case becomes `single`, and a new field is added for use in start
named `processor` to create the new shard.

For the multi-shard batcher, `start()` does nothing, but this structure
confusingly embeds the batchProcessor. Rename the field `processor` for
consistency with the single-shard case.

#### Link to tracking issue
Part of open-telemetry#11308.
@bogdandrutu
Copy link
Member

As one of the maintainer who tries to review this changes, I am sure everyone understands why the multi-consumer helps and is a useful capability to support, but I cannot understand why the Synchronous error transmission is important to support and who would benefit.

This issue comes to the community as we did "foo", it works, you should use it and support it. I would like to better understand what we try to achieve, and how this will help our customers.

Thanks for the detailed issue.

@bogdandrutu
Copy link
Member

To clarify a bit more, I had a conversation about this with @jmacd and what I understood was that we want to "block" the incoming request to signal LBs that this instance is busy and potentially trigger auto-scale or LB marking this as busy to avoid overwhelming.

Then, @jmacd send a overly complicated implementation to achieve just this (so, unless I better understand other reasons I think the proposed solution is too complicated): The proposed solution also tries to count the number of "failed to emit data". This number cannot be correctly calculated as proposed, because data may be filtered, or request may be queued and drop later, etc. This can happen even without filter processor, only with a OTLP exporter, because the destination may return that 2 out of 1000 (which is formed out of 4 incoming requests) spans were not accepted you cannot know which of the incoming request is responsible for the errors, etc.

Currently, it is not in the scope of the collector to correctly calculate the correct number of "failed requests" because of components like filter processor, or fanout components (on one exporter everything is fine, on another is not), etc. Unless we design a solution that works in all these case (which should be a different effort and should be done across all the components), it is an unnecessary complication to try to calculate the number of "failed requests".

@jmacd jmacd linked a pull request Dec 18, 2024 that will close this issue
HongChenTW pushed a commit to HongChenTW/opentelemetry-collector that referenced this issue Dec 19, 2024
#### Description

Split the internal `batcher.export()` interface into three methods. This
is a refactoring that was applied in
https://github.com/open-telemetry/otel-arrow/tree/main/collector/processor/concurrentbatchprocessor
and is being back-ported as part of open-telemetry#11308. The reason this refactoring
is needed is that the parent context of the export() request will be
manipulated in common code (vs signal-specific code) for tracing
support.

#### Link to tracking issue
Part of open-telemetry#11308

#### Testing
Existing tests cover this.

---------

Co-authored-by: Bogdan Drutu <[email protected]>
HongChenTW pushed a commit to HongChenTW/opentelemetry-collector that referenced this issue Dec 19, 2024
…d names (open-telemetry#11314)

#### Description
There are two implementations of the `batcher` interface with several
inconsistencies. Notably, the single-shard case would start a goroutine
before `Start()` is called, which can be confusing when the goroutine
leak checker notices it. This makes the single-shard batcher wait until
Start() to create its single shard. A confusing field name `batcher` in
this case becomes `single`, and a new field is added for use in start
named `processor` to create the new shard.

For the multi-shard batcher, `start()` does nothing, but this structure
confusingly embeds the batchProcessor. Rename the field `processor` for
consistency with the single-shard case.

#### Link to tracking issue
Part of open-telemetry#11308.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants