External aggregation reserves more memory than actual usage #13089

2010YOUY01 · 2024-10-24T09:42:51Z

Describe the bug

The query below requires 65M of memory to run. With the memory limit set to 50M, it cannot run successfully.
Run in datafusion-cli:

cargo run -- --mem-pool-type fair -m 50M -c "
select t1.v1,  sum(t2.v1)
from
unnest(generate_series(1,1000)) as t1(v1)
, unnest(generate_series(1,1000)) as t2(v1)
group by t1.v1, t2.v1"

Error: External error: Resources exhausted: Failed to allocate additional 47616 bytes for GroupedHashAggregateStream[0] with 3995896 bytes already allocated for this reservation - 4031073 bytes remain available for the total pool

The issue is that memory usage is over-estimated when doing the sort-merge:

datafusion/datafusion/physical-plan/src/sorts/builder.rs, line 72 in f2da32b:

self.reservation.try_grow(batch.get_array_memory_size())?;

For example, for a RecordBatch with 3 arrays that all share the same buffers, record_batch.get_array_memory_size() will estimate 3X the actual memory consumption.
(The original RecordBatches passing through DataFusion operators don't share Buffers between different columns, but in spilling queries RecordBatches are first written to disk and then read back, and the batches read back reuse Buffers among different column arrays.)
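
To make the double counting concrete, here is a minimal std-only sketch. It does not use the arrow or DataFusion APIs: `ToyArray`, `ToyReservation`, `naive_size` and `deduplicated_size` are hypothetical names invented for illustration, and the sizes are made up. It only models the idea that summing per-column sizes counts a shared buffer once per column, which can make a `try_grow`-style reservation fail even though the real footprint fits.

```rust
use std::collections::HashSet;
use std::rc::Rc;

/// Toy stand-in for a column array that points at a (possibly shared) buffer.
/// Hypothetical type, not the arrow `Array`.
struct ToyArray {
    buffer: Rc<Vec<u8>>,
}

/// Naive estimate: adds up the buffer size of every column, so a buffer
/// shared by N columns is counted N times.
fn naive_size(columns: &[ToyArray]) -> usize {
    columns.iter().map(|c| c.buffer.len()).sum()
}

/// Deduplicated estimate: counts each distinct allocation once,
/// identified by the address of the shared buffer.
fn deduplicated_size(columns: &[ToyArray]) -> usize {
    let mut seen = HashSet::new();
    columns
        .iter()
        .filter(|c| seen.insert(Rc::as_ptr(&c.buffer)))
        .map(|c| c.buffer.len())
        .sum()
}

/// Toy memory reservation with a hard limit (hypothetical, only loosely
/// modeled on the `try_grow` call quoted above).
struct ToyReservation {
    used: usize,
    limit: usize,
}

impl ToyReservation {
    fn try_grow(&mut self, additional: usize) -> Result<(), String> {
        if self.used + additional > self.limit {
            return Err(format!(
                "Failed to allocate additional {} bytes with {} bytes already allocated",
                additional, self.used
            ));
        }
        self.used += additional;
        Ok(())
    }
}

fn main() {
    // Three columns backed by one shared 1024-byte buffer, as can happen
    // after a batch is spilled to disk and read back.
    let shared = Rc::new(vec![0u8; 1024]);
    let columns: Vec<ToyArray> = (0..3)
        .map(|_| ToyArray { buffer: Rc::clone(&shared) })
        .collect();

    let naive = naive_size(&columns);
    let actual = deduplicated_size(&columns);
    println!("naive estimate: {naive} bytes, deduplicated: {actual} bytes");

    // With a 2048-byte limit the naive estimate is rejected,
    // while the deduplicated one fits.
    let mut reservation = ToyReservation { used: 0, limit: 2048 };
    assert!(reservation.try_grow(naive).is_err());
    assert!(reservation.try_grow(actual).is_ok());
}
```

In this toy model the naive estimate is exactly 3X the deduplicated one, matching the 3-array example above. Whether the real fix should deduplicate by buffer pointer or account for memory some other way is a design choice for arrow-rs and DataFusion; the sketch takes no position on that.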

The root cause is already reported in arrow-rs: apache/arrow-rs#6363
Once it's fixed in arrow, we should check whether this aggregation query can run successfully, and also add tests.

To Reproduce

No response

Expected behavior

No response

Additional context

No response


2010YOUY01 added the bug label Oct 24, 2024

2010YOUY01 mentioned this issue Oct 24, 2024

Add benchmark for memory-limited aggregation #13090

Merged

1 task

This was referenced Nov 12, 2024

Fix record batch memory size double counting #13377

Merged

Replace record_batch.get_array_memory_size() in spilling operators #13430

Open

comphead closed this as completed in #13377 Nov 15, 2024
