Is there a good way to make max_bytes_per_file used in Stable data storage version #3393

Open

SaintBacchus opened this issue Jan 18, 2025 · 5 comments · May be fixed by #3435
Labels
bug Something isn't working

Comments

@SaintBacchus
Collaborator

With the Stable data storage version, the max_bytes_per_file option does not affect the data file size.

In the example below, the resulting data file is 9000569 bytes.

import lance
import pyarrow as pa

def generate_large_table():
    data = []
    for i in range(1000000):
        if i % 2 == 0:
            data.append({"name": "Alice", "age": 20})
        else:
            data.append({"name": "Bob", "age": 30})
    table = pa.Table.from_pylist(data)
    return table

table = generate_large_table()

# Requests files of at most 512 bytes, but the limit is not enforced with the stable format
lance.write_dataset(table, "./alice_and_bob.lance", max_bytes_per_file=512)

@westonpace is there a good way to limit the data file size?

SaintBacchus added the bug label on Jan 18, 2025
@wjones127
Contributor

So the logic for enforcing max_bytes_per_file looks like:

for batch in data:
    writer.write(batch)
    bytes_written = writer.tell()
    if bytes_written > max_bytes_per_file:
        writer.close()
        writer = new_file()

while let Some(batch_chunk) = buffered_reader.next().await {
    let batch_chunk = batch_chunk?;
    if writer.is_none() {
        let (new_writer, new_fragment) = writer_generator.new_writer().await?;
        params.progress.begin(&new_fragment).await?;
        writer = Some(new_writer);
        fragments.push(new_fragment);
    }
    writer.as_mut().unwrap().write(&batch_chunk).await?;
    for batch in batch_chunk {
        num_rows_in_current_file += batch.num_rows() as u32;
    }
    if num_rows_in_current_file >= params.max_rows_per_file as u32
        || writer.as_mut().unwrap().tell().await? >= params.max_bytes_per_file as u64
    {
        let (num_rows, data_file) = writer.take().unwrap().finish().await?;
        debug_assert_eq!(num_rows, num_rows_in_current_file);
        params.progress.complete(fragments.last().unwrap()).await?;
        let last_fragment = fragments.last_mut().unwrap();
        last_fragment.physical_rows = Some(num_rows as usize);
        last_fragment.files.push(data_file);
        num_rows_in_current_file = 0;
    }
}

So I think the problem you are encountering is that if the input data is one large batch, there is only one iteration of that loop, so by the time we realize we've written too much data, it's too late.

To enforce max_rows_per_file, we split the input data into batches of at most max_rows_per_file rows each.

break_stream(data, params.max_rows_per_file)
    .map_ok(|batch| vec![batch])
    .boxed()

// Given a stream of record batches, and a desired break point, this will
// make sure that a new record batch is emitted every time `break_point` rows
// have passed.
//
// This method will not combine record batches in any way. For example, if
// the input lengths are [3, 5, 8, 3, 5], and the break point is 10 then the
// output batches will be [3, 5, 2 (break inserted) 6, 3, 1 (break inserted) 4]
pub fn break_stream(
    stream: SendableRecordBatchStream,
    max_chunk_size: usize,
) -> Pin<Box<dyn Stream<Item = Result<RecordBatch>> + Send>> {

To do something equivalent for bytes, I think the best we could do is attempt to split the in-memory stream into batches based on a computed average bytes per row, targeting some increment like 10 MB batches, so that we don't ever overshoot by too much. What do you think of that?
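
A minimal sketch of that idea, assuming a hypothetical helper break_batch_by_bytes (not part of Lance) that slices a single Arrow RecordBatch into zero-copy chunks of roughly some target byte size, estimated from its average in-memory bytes per row:

use arrow_array::RecordBatch;

// Hypothetical helper (not Lance code): split one batch into zero-copy slices
// of roughly `target_bytes` each, estimated from average bytes per row.
fn break_batch_by_bytes(batch: &RecordBatch, target_bytes: usize) -> Vec<RecordBatch> {
    let num_rows = batch.num_rows();
    if num_rows == 0 {
        return vec![batch.clone()];
    }
    // Use the in-memory size as a proxy for the eventual on-disk size.
    let bytes_per_row = (batch.get_array_memory_size() / num_rows).max(1);
    let rows_per_chunk = (target_bytes / bytes_per_row).max(1);

    let mut chunks = Vec::new();
    let mut offset = 0;
    while offset < num_rows {
        let len = rows_per_chunk.min(num_rows - offset);
        // `slice` is zero-copy: each chunk shares buffers with the input batch.
        chunks.push(batch.slice(offset, len));
        offset += len;
    }
    chunks
}

Since the on-disk size differs from the in-memory estimate, files could still overshoot the limit a little; the goal is only to avoid overshooting by too much rather than to enforce a hard cap.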

@SaintBacchus
Collaborator Author

Yes, right now break_stream only depends on max_rows_per_file, which can allocate a lot of memory.

It would be better to limit by both max_rows_per_file and max_bytes_per_file.

@wjones127
Contributor

wjones127 commented Jan 21, 2025

Yes, right now break_stream only depends on max_rows_per_file, which can allocate a lot of memory.

It would be better to limit by both max_rows_per_file and max_bytes_per_file.

I'm not sure you understood me. break_stream is how we successfully enforce max_rows_per_file. To enforce max_bytes_per_file, we'll want to modify break_stream() to also attempt to break the batches up based on approximate byte size.

And it doesn't allocate significantly more memory. The smaller batches are zero-copy views into the larger batches.
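
For instance, slicing an Arrow batch only creates a view over the parent's buffers; a standalone illustration (not Lance code):

use std::sync::Arc;
use arrow_array::{ArrayRef, Int32Array, RecordBatch};
use arrow_schema::{DataType, Field, Schema};

fn main() {
    let schema = Arc::new(Schema::new(vec![Field::new("age", DataType::Int32, false)]));
    let ages: ArrayRef = Arc::new(Int32Array::from(vec![20, 30, 20, 30]));
    let batch = RecordBatch::try_new(schema, vec![ages]).unwrap();

    // The slice is a view over the same buffers; no values are copied.
    let first_half = batch.slice(0, 2);
    assert_eq!(first_half.num_rows(), 2);
}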

@SaintBacchus
Collaborator Author

OK, I get it now.

@westonpace
Contributor

Having a good "bytes chunker" would also allow us to remove a somewhat ugly hack further down in the writer, so I'd be in favor of the idea.

SaintBacchus linked a pull request on Feb 7, 2025 that will close this issue