Is there a good way to make max_bytes_per_file work in the Stable data storage version? #3393
So the logic for enforcing `max_bytes_per_file` is roughly:

```
for batch in data:
    writer.write(batch)
    bytes_written = writer.tell()
    if bytes_written > max_bytes_per_file:
        writer.close()
        writer = new_file()
```

(lance/rust/lance/src/dataset/write.rs, lines 256 to 282 in 2b784b3)
So I think the problem you are encountering is that if the input data is one large batch, then there is only one iteration of that loop, and by the time we realize we've written too much data it's too late. To enforce the row-based limits we instead chunk the incoming stream before it reaches the writer (lance/rust/lance/src/dataset/write.rs, lines 247 to 249 in 2b784b3):
(lance/rust/lance-datafusion/src/chunker.rs, lines 136 to 146 in 2b784b3)
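For reference, that row-based chunking amounts to something like the sketch below (Python with pyarrow for illustration only; `chunk_by_rows` is a made-up name, and the real Rust chunker also buffers and coalesces small batches, which this sketch skips):

```python
import pyarrow as pa

def chunk_by_rows(batches, max_rows):
    """Re-slice a stream of RecordBatches so no output exceeds max_rows.

    Each slice is a zero-copy view into its source batch.
    """
    for batch in batches:
        offset = 0
        while offset < batch.num_rows:
            length = min(max_rows, batch.num_rows - offset)
            yield batch.slice(offset, length)
            offset += length
```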
To do something equivalent for bytes, I think the best we could do is attempt to split the in-memory stream into batches based on a computed average bytes per row, targeting some increment like 10 MB batches, so that we don't ever overshoot by too much. What do you think of that?
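Concretely, the proposed bytes-based splitting might look like the following sketch (same illustrative Python; the `chunk_by_bytes` name and the use of pyarrow's `RecordBatch.nbytes` as the size estimate are my assumptions):

```python
import pyarrow as pa

TARGET_BATCH_BYTES = 10 * 1024 * 1024  # ~10 MB increments, per the proposal above

def chunk_by_bytes(batches, target_bytes=TARGET_BATCH_BYTES):
    """Split each RecordBatch into ~target_bytes slices, using the
    batch's average bytes per row as the size estimate."""
    for batch in batches:
        if batch.num_rows == 0:
            continue
        avg_bytes_per_row = max(1, batch.nbytes // batch.num_rows)
        rows_per_slice = max(1, target_bytes // avg_bytes_per_row)
        offset = 0
        while offset < batch.num_rows:
            length = min(rows_per_slice, batch.num_rows - offset)
            yield batch.slice(offset, length)
            offset += length
```

With slices capped near 10 MB, the `writer.tell()` check in the loop above fires at most one slice past `max_bytes_per_file`, instead of one arbitrarily large input batch past it.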
Yes, now the […]. It's better to be limited by […].
I'm not sure you understood me. And it doesn't allocate significantly more memory. The smaller batches are zero-copy views into the larger batches.
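To illustrate the zero-copy point, a quick pyarrow check (whether the Rust writer behaves identically is an assumption based on Arrow's slicing semantics):

```python
import pyarrow as pa

batch = pa.RecordBatch.from_pydict({"x": list(range(1_000_000))})
view = batch.slice(0, 1000)

# A slice shares its parent's buffers rather than copying them, so
# chunking a large batch only adds small per-slice metadata.
parent_buf = batch.column(0).buffers()[1]  # buffers()[0] is the validity bitmap
view_buf = view.column(0).buffers()[1]
print(parent_buf.address == view_buf.address)  # True: same underlying memory
```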
OK, I get it now.
Having a good "bytes chunker" would allow us to remove a sort of ugly hack we have further down in the writer too, so I'd be in favor of the idea.
Now, with the Stable data storage version, the `max_bytes_per_file` option does not affect the data file size. In the example below, the data file is 9000569 bytes. @westonpace, is there a good way to limit the data file size?
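The original example didn't survive in this thread, but a minimal reproduction along these lines (assuming pylance's `write_dataset` accepts `max_bytes_per_file` and a `data_storage_version="stable"` option, and that data files land under the dataset's `data/` directory) would be:

```python
import os
import pyarrow as pa
import lance

# Hypothetical reproduction: write ~8 MB of int64 data with a 1 MB
# per-file limit, then check whether the output was actually split.
data = pa.table({"x": pa.array(range(1_000_000), type=pa.int64())})

lance.write_dataset(
    data,
    "/tmp/bytes_limit_demo.lance",
    max_bytes_per_file=1 * 1024 * 1024,   # 1 MB target per data file
    data_storage_version="stable",        # assumed spelling of the stable-format option
)

data_dir = "/tmp/bytes_limit_demo.lance/data"
for name in sorted(os.listdir(data_dir)):
    path = os.path.join(data_dir, name)
    print(name, os.path.getsize(path), "bytes")
```

If the limit were honored we would expect several ~1 MB files here; the report above is that a single ~9 MB file appears instead.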