Fix record batch memory size double counting #13377
@@ -20,14 +20,16 @@
use std::fs::File;
use std::io::BufReader;
use std::path::{Path, PathBuf};
use std::ptr::NonNull;

use arrow::array::ArrayData;
use arrow::datatypes::SchemaRef;
use arrow::ipc::reader::FileReader;
use arrow::record_batch::RecordBatch;
use log::debug;
use tokio::sync::mpsc::Sender;

-use datafusion_common::{exec_datafusion_err, Result};
+use datafusion_common::{exec_datafusion_err, HashSet, Result};
use datafusion_execution::disk_manager::RefCountedTempFile;
use datafusion_execution::memory_pool::human_readable_size;
use datafusion_execution::SendableRecordBatchStream;

@@ -109,10 +111,83 @@ pub fn spill_record_batch_by_size(
    Ok(())
}

/// Calculate total used memory of this batch.

> 💯 for this comment

///
/// This function is used to estimate the physical memory usage of the `RecordBatch`.
/// It only counts the memory of large data `Buffer`s, and ignores metadata like
/// types and pointers.
/// The implementation adds up each unique `Buffer`'s memory size, because:
/// - The data pointers inside `Buffer`s are memory regions returned by the global
///   memory allocator; those regions cannot overlap.
/// - The actually used ranges of the `ArrayRef`s inside a `RecordBatch` can overlap
///   or reuse the same `Buffer`, for example when taking a slice from an `Array`.
///
/// Example:
/// For a `RecordBatch` with two columns `col1` and `col2`, both columns point
/// to a sub-region of the same buffer.
///
/// {xxxxxxxxxxxxxxxxxxx} <--- buffer
///       ^    ^  ^    ^
///       |    |  |    |
/// col1->{    }  |    |
/// col2--------->{    }
///
/// In the above case, `get_record_batch_memory_size` will return the size of
/// the buffer, instead of the sum of `col1`'s and `col2`'s actual memory sizes.
///
/// Note: the current `RecordBatch::get_array_memory_size()` will double count a
/// buffer's memory size if multiple arrays within the batch share the same
/// `Buffer`. This method provides a temporary fix until the issue is resolved:
/// <https://github.com/apache/arrow-rs/issues/6439>
pub fn get_record_batch_memory_size(batch: &RecordBatch) -> usize {

> in TopK,

> I think they should all be changed, however after changing them in

> Cool -- can you possibly file a ticket to track any work that you know about? I can help file it / with the explanation as well

    // Store pointers to `Buffer`'s start memory address (instead of the actual
    // used data region's pointer represented by the current `Array`)
    let mut counted_buffers: HashSet<NonNull<u8>> = HashSet::new();
    let mut total_size = 0;

    for array in batch.columns() {
        let array_data = array.to_data();
        count_array_data_memory_size(&array_data, &mut counted_buffers, &mut total_size);
    }

    total_size
}

/// Count the memory usage of `array_data` and its children recursively.
fn count_array_data_memory_size(
    array_data: &ArrayData,
    counted_buffers: &mut HashSet<NonNull<u8>>,
    total_size: &mut usize,
) {
    // Count memory usage for `array_data`

> nit, but you can probably add size of

> I think this approach also misses several other pieces of metadata (like the datatype and buffer pointers); they will be included in the more comprehensive fix on the arrow side.

    for buffer in array_data.buffers() {
        if counted_buffers.insert(buffer.data_ptr()) {
            *total_size += buffer.capacity();
        } // Otherwise the buffer's memory is already counted
    }

    if let Some(null_buffer) = array_data.nulls() {
        if counted_buffers.insert(null_buffer.inner().inner().data_ptr()) {
            *total_size += null_buffer.inner().inner().capacity();
        }
    }

    // Count all children `ArrayData` recursively
    for child in array_data.child_data() {
        count_array_data_memory_size(child, counted_buffers, total_size);
    }
}

> Comment on lines +176 to +179: Does it make sense to use

> I've learned something new today.

> yes I agree we don't need to annotate all recursive function calls -- only the ones that will become very large/deep

#[cfg(test)]
mod tests {
    use super::*;
    use crate::spill::{spill_record_batch_by_size, spill_record_batches};
    use crate::test::build_table_i32;
    use arrow::array::{Float64Array, Int32Array};
    use arrow::datatypes::{DataType, Field, Int32Type, Schema};
    use arrow::record_batch::RecordBatch;
    use arrow_array::ListArray;
    use datafusion_common::Result;
    use datafusion_execution::disk_manager::DiskManagerConfig;
    use datafusion_execution::DiskManager;

@@ -147,7 +222,7 @@ mod tests {
        assert_eq!(cnt.unwrap(), num_rows);

        let file = BufReader::new(File::open(spill_file.path())?);
-       let reader = arrow::ipc::reader::FileReader::try_new(file, None)?;
+       let reader = FileReader::try_new(file, None)?;

        assert_eq!(reader.num_batches(), 2);
        assert_eq!(reader.schema(), schema);

@@ -175,11 +250,138 @@ mod tests {
        )?;

        let file = BufReader::new(File::open(spill_file.path())?);
-       let reader = arrow::ipc::reader::FileReader::try_new(file, None)?;
+       let reader = FileReader::try_new(file, None)?;

        assert_eq!(reader.num_batches(), 4);
        assert_eq!(reader.schema(), schema);

        Ok(())
    }

    #[test]
    fn test_get_record_batch_memory_size() {
        // Create a simple record batch with two columns
        let schema = Arc::new(Schema::new(vec![
            Field::new("ints", DataType::Int32, true),
            Field::new("float64", DataType::Float64, false),
        ]));

        let int_array =
            Int32Array::from(vec![Some(1), Some(2), Some(3), Some(4), Some(5)]);
        let float64_array = Float64Array::from(vec![1.0, 2.0, 3.0, 4.0, 5.0]);

        let batch = RecordBatch::try_new(
            schema,
            vec![Arc::new(int_array), Arc::new(float64_array)],
        )
        .unwrap();

        let size = get_record_batch_memory_size(&batch);
        assert_eq!(size, 60);
    }

> My only concern with this PR is that the result of [...] This could be dangerous because the project would end up with two different methods of calculating memory sizes. I can imagine a scenario in the future where we reserve memory based on one calculation method and shrink it using the result from the other. While the difference may not be large each time, over many repetitions or a large dataset, it could behave almost like a memory leak (but without actual memory), making debugging very challenging...

> Should we completely switch to the new method, blocking usage of the old one? Should we try to make the two numbers match closely?

> This is a great point. I also feel that this manual memory accounting is complex and error-prone. We'd better change all of it. (Maybe also use some RAII in the implementation, instead of manually growing and shrinking memory usage as we're doing right now.)

> Finding a way to automatically update the memory accounting is certainly a good idea in my mind. As we have mentioned, I think the most important thing will be to find a way to account for arrow buffers completely. Then we can work it into DataFusion.

    #[test]
    fn test_get_record_batch_memory_size_with_null() {
        // Create a simple record batch with two columns
        let schema = Arc::new(Schema::new(vec![
            Field::new("ints", DataType::Int32, true),
            Field::new("float64", DataType::Float64, false),
        ]));

        let int_array = Int32Array::from(vec![None, Some(2), Some(3)]);
        let float64_array = Float64Array::from(vec![1.0, 2.0, 3.0]);

        let batch = RecordBatch::try_new(
            schema,
            vec![Arc::new(int_array), Arc::new(float64_array)],
        )
        .unwrap();

        let size = get_record_batch_memory_size(&batch);
        assert_eq!(size, 100);
    }

    #[test]
    fn test_get_record_batch_memory_size_empty() {
        // Test with an empty record batch
        let schema = Arc::new(Schema::new(vec![Field::new(
            "ints",
            DataType::Int32,
            false,
        )]));

        let int_array: Int32Array = Int32Array::from(vec![] as Vec<i32>);
        let batch = RecordBatch::try_new(schema, vec![Arc::new(int_array)]).unwrap();

        let size = get_record_batch_memory_size(&batch);
        assert_eq!(size, 0, "Empty batch should have 0 memory size");
    }

    #[test]
    fn test_get_record_batch_memory_size_shared_buffer() {
        // Test with slices that share the same underlying buffer
        let original = Int32Array::from(vec![1, 2, 3, 4, 5]);
        let slice1 = original.slice(0, 3);
        let slice2 = original.slice(2, 3);

        // `RecordBatch` with the `original` array
        // ----
        let schema_origin = Arc::new(Schema::new(vec![Field::new(
            "origin_col",
            DataType::Int32,
            false,
        )]));
        let batch_origin =
            RecordBatch::try_new(schema_origin, vec![Arc::new(original)]).unwrap();

        // `RecordBatch` whose columns are all references to the `original` array
        // ----
        let schema = Arc::new(Schema::new(vec![
            Field::new("slice1", DataType::Int32, false),
            Field::new("slice2", DataType::Int32, false),
        ]));

        let batch_sliced =
            RecordBatch::try_new(schema, vec![Arc::new(slice1), Arc::new(slice2)])
                .unwrap();

        // Both sizes should count only the buffer in the `original` array
        let size_origin = get_record_batch_memory_size(&batch_origin);
        let size_sliced = get_record_batch_memory_size(&batch_sliced);

        assert_eq!(size_origin, size_sliced);
    }

    #[test]
    fn test_get_record_batch_memory_size_nested_array() {
        let schema = Arc::new(Schema::new(vec![
            Field::new(
                "nested_int",
                DataType::List(Arc::new(Field::new("item", DataType::Int32, true))),
                false,
            ),
            Field::new(
                "nested_int2",
                DataType::List(Arc::new(Field::new("item", DataType::Int32, true))),
                false,
            ),
        ]));

        let int_list_array = ListArray::from_iter_primitive::<Int32Type, _, _>(vec![
            Some(vec![Some(1), Some(2), Some(3)]),
        ]);

        let int_list_array2 = ListArray::from_iter_primitive::<Int32Type, _, _>(vec![
            Some(vec![Some(4), Some(5), Some(6)]),
        ]);

        let batch = RecordBatch::try_new(
            schema,
            vec![Arc::new(int_list_array), Arc::new(int_list_array2)],
        )
        .unwrap();

        let size = get_record_batch_memory_size(&batch);
        assert_eq!(size, 8320);
    }
}
> Agreed.

> This could be introduced as a DataFusion parameter so the user can configure the memory-accounting behavior. I have a feeling the memory usage is data dependent, varying with the datatypes and data being processed.