
Feather chunksize doesn't round-trip #45422

Open
alippai opened this issue Feb 4, 2025 · 1 comment

alippai commented Feb 4, 2025

Describe the usage question you have. Please include as many useful details as possible.

I tried this with pyarrow 19:

import pyarrow.feather as pf
t = ...  # some large pyarrow Table
pf.write_feather(t, 'test.feather', chunksize=1024*1024)
len(pf.read_table('test.feather').to_batches()[0]) # 65536 rows
pf.write_feather(t, 'test2.feather', chunksize=256*1024)
len(pf.read_table('test2.feather').to_batches()[0]) # 65536 rows

I expected the files to be different (different compressed sizes), but they are byte-by-byte identical. As a consequence the batch sizes are lost when reading the data back.
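
(For reference, a quick way to confirm that the two outputs really are byte-for-byte identical, using the file names from the snippet above:)

import hashlib
from pathlib import Path

# Compare the two written files via their SHA-256 digests.
for name in ('test.feather', 'test2.feather'):
    print(name, hashlib.sha256(Path(name).read_bytes()).hexdigest())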

Do I assume correctly that the file should consist of chunksize-long buffers for each column (per record batch), and that these buffers are independently compressed using LZ4 or ZSTD?
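
(Since Feather v2 files are Arrow IPC files, the record batches that were actually written can be inspected with the IPC reader, e.g.:)

import pyarrow.ipc as ipc

# List how many record batches the file contains and how many rows each holds.
reader = ipc.open_file('test.feather')
for i in range(reader.num_record_batches):
    print(i, reader.get_batch(i).num_rows)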

Component(s)

Python, C++, Format


alippai commented Feb 4, 2025

Is this the equivalent?

import pyarrow as pa
import pyarrow.feather as pf

BATCH_SIZE = 1024*1024

# Collapse to a single chunk first so to_batches() can re-split it evenly.
if len(t.to_batches()) > 1:
    t = t.combine_chunks()
with pa.OSFile('test3.feather', 'wb') as sink:
    with pa.ipc.new_file(sink, t.schema, options=pa.ipc.IpcWriteOptions(compression='lz4')) as writer:
        for batch in t.to_batches(max_chunksize=BATCH_SIZE):
            writer.write(batch)
len(pf.read_table('test3.feather').to_batches()[0]) # 1024*1024 rows
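
(And to double-check that every batch, not just the first one, keeps the requested size:)

print([b.num_rows for b in pf.read_table('test3.feather').to_batches()])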
