
Feature Request: Memory Limiter Processor opt-in configuration to drop data instead of refusing it #11726

Open
blumamir opened this issue Nov 21, 2024 · 0 comments

Comments

blumamir (Member) commented Nov 21, 2024

TL;DR: when the collectors are under pressure, I want that pressure to build up inside the collectors, and I want data to be dropped if it cannot be consumed due to memory limits on the first-layer collector that serves applications. This protects the instrumented runtime itself from building up pressure, which is problematic. I want to add an opt-in option to the memory limiter processor configuration to drop data instead of refusing it.

Is your feature request related to a problem? Please describe.

I am maintaining the Odigos project, which deploys collectors in Kubernetes environments to set up a telemetry pipeline for collecting, processing, and exporting data to various destinations.

Odigos uses a two-layer collector design:

  1. DaemonSets (Node-Level Collectors): Handle telemetry locally on each node.
  2. Cluster Collectors: Auto-scaled Deployments for centralized processing.

We utilize node-level collectors to ensure local data export and offload concerns like batching, retries, buffering, and cluster-wide networking from users' applications. However, the pipeline can experience pressure under specific conditions:

  • Downstream Backpressure: If a downstream component refuses data, queues grow, leading to increased memory and CPU usage.
  • Data Bursts: Sudden traffic spikes may overwhelm node collectors before cluster collectors scale.
  • Bugs or Configuration Issues: Errors or specific data patterns (e.g., large spans) can cause inefficiencies in handling the load.

Our objective is to buffer and retry within the collectors during transient failures or bursts, preventing backpressure from impacting users' applications. However, if memory pressure builds up, we want to avoid returning retryable errors to applications, which could inadvertently increase their resource usage.

Describe the solution you'd like

To address this, I propose enhancing the Memory Limiter Processor with a new configuration option:

  • New Option: Introduce a boolean flag to control whether the processor should drop data instead of returning retryable errors during memory pressure.
  • Default Behavior: Maintain the current behavior (returning retryable errors).
  • Opt-In Behavior: When enabled, the processor would drop data under memory pressure rather than propagating errors back to applications.

This change involves adding the new configuration option and updating the processor's consume logic for each signal (traces, metrics, and logs), enabling it to either refuse data or drop it based on the setting.
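
For illustration, here is a rough sketch of what the opt-in could look like in the collector configuration; the flag name (`drop_when_over_limit`) is hypothetical and only a placeholder for whatever name is agreed on:

```yaml
processors:
  memory_limiter:
    check_interval: 1s
    limit_mib: 512
    spike_limit_mib: 128
    # Hypothetical opt-in flag (name is a placeholder): when true, data received
    # while the limiter is over its soft limit is dropped instead of being refused
    # with a retryable error. Defaults to false to keep the current behavior.
    drop_when_over_limit: true
```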

Describe alternatives you've considered

Wondering if it makes sense to position the memory limiter after a batch processor, so that if memory is too high the batch is dropped, but the upstream consumer always receives a response indicating the data was pushed to the batch successfully. I guess the downside is that memory pressure may still build up in the batch processor itself, which can eat into the safety reserves while the memory pressure is active.
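
For reference, a minimal sketch of that alternative ordering (traces pipeline only; receiver and exporter names are placeholders), which reverses the usual recommendation of placing the memory limiter first:

```yaml
service:
  pipelines:
    traces:
      receivers: [otlp]
      # memory_limiter placed after batch, so the caller gets a success response
      # once data is handed to the batch processor, even if the limiter later
      # refuses or drops the batch.
      processors: [batch, memory_limiter]
      exporters: [otlp]
```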

Additional context

If there is support for this feature, I am willing to contribute by creating a PR.

blumamir added a commit to odigos-io/odigos that referenced this issue Dec 4, 2024
…1827)

## Problem

At the moment, if there is pressure in the pipeline for any reason and
batches fail to export, they will start building up in the queues of the
collector exporters and memory will grow unboundedly.

Since we don't set any memory request or limit on the node collectors
DaemonSet, they will just go on to consume more and more of the available
memory on the node, which will:

1. Show up as a spike in resource consumption in the cluster metrics.
2. Starve other pods on the same node, which now have less spare memory
to grow into.
3. Keep memory increasing over time if the issue is not transient.
4. Keep the CPU busy retrying the rejected or unsuccessful batches
sitting in the retry buffers.

## Levels of Protection

To prevent the above issues, we apply a few levels of protection, listed
from first line of defense to last resort:

1. Setting GOMEMLIMIT to a (currently hardcoded) `352MiB`. At this point,
the Go runtime GC should kick in and start reclaiming memory aggressively.
2. Setting the otel collector soft limit to a (currently hardcoded)
`384MiB`. When heap allocations reach this amount, the collector will
start dropping batches of data after they are emitted from the `batch`
processor, instead of streaming them down the pipeline.
3. Setting the otel collector hard limit to `512MiB`. When the heap
reaches this number, a forced GC is performed.
4. Setting the memory request to `256MiB`. This ensures we have at least
this amount of memory to handle normal traffic, with some slack for
spikes, without running into OOM. The rest of the memory is consumed from
the available memory on the node, which comes in handy for extra
buffering but may also cause OOM if the node runs out of resources.
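
A rough sketch of how these numbers map onto a collector config and pod spec (not the actual Odigos manifests; the collector expresses the soft limit as `limit_mib - spike_limit_mib`):

```yaml
# Collector config: hard limit 512MiB, soft limit 512 - 128 = 384MiB.
processors:
  memory_limiter:
    check_interval: 1s
    limit_mib: 512        # hard limit: forced GC when the heap reaches this
    spike_limit_mib: 128  # soft limit = limit_mib - spike_limit_mib = 384MiB
---
# Pod spec fragment for the node collector container.
spec:
  containers:
    - name: odigos-node-collector   # placeholder container name
      env:
        - name: GOMEMLIMIT
          value: "352MiB"           # Go GC starts reclaiming aggressively here
      resources:
        requests:
          memory: 256Mi             # guaranteed memory for normal traffic
```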

## Future Work

- Add configuration options to set these values, preferably as a
spectrum of trade-offs: "resource-stability", "resource-spikecapacity".

- Drop the data as it is received, not after it is batched -
open-telemetry/opentelemetry-collector#11726

- Drop data at the receiver once that is implemented in the collector -
open-telemetry/opentelemetry-collector#9591