
Feather chunksize doesn't round-trip #45422

Open
alippai opened this issue Feb 4, 2025 · 1 comment

alippai commented Feb 4, 2025

Describe the usage question you have. Please include as many useful details as possible.

I tried this with pyarrow 19:

import pyarrow.feather as pf
t = ...  # some large pyarrow Table
pf.write_feather(t, 'test.feather', chunksize=1024*1024)
len(pf.read_table('test.feather').to_batches()[0]) # 65536 rows
pf.write_feather(t, 'test2.feather', chunksize=256*1024)
len(pf.read_table('test2.feather').to_batches()[0]) # 65536 rows

I expected the files to be different (different compressed sizes), but they are byte-by-byte identical. As a consequence the batch sizes are lost when reading the data back.
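
(For reference, a quick way to confirm that the two outputs really are byte-for-byte identical, using the file names from the snippet above:)

import hashlib
from pathlib import Path

# Compare the two written files via their SHA-256 digests.
for name in ('test.feather', 'test2.feather'):
    print(name, hashlib.sha256(Path(name).read_bytes()).hexdigest())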

Do I assume correctly that the file should consist of chunksize-long buffers for each column (per record batch), and that these buffers are independently compressed using LZ4 or ZSTD?
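
(Since Feather v2 files are Arrow IPC files, the record batches that were actually written can be inspected with the IPC reader, e.g.:)

import pyarrow.ipc as ipc

# List how many record batches the file contains and how many rows each holds.
reader = ipc.open_file('test.feather')
for i in range(reader.num_record_batches):
    print(i, reader.get_batch(i).num_rows)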

Component(s)

Python, C++, Format


alippai commented Feb 4, 2025

Is this the equivalent?

import pyarrow as pa
import pyarrow.feather as pf

BATCH_SIZE = 1024*1024

# Collapse to a single chunk first so to_batches() can re-split it evenly.
if len(t.to_batches()) > 1:
    t = t.combine_chunks()
with pa.OSFile('test3.feather', 'wb') as sink:
    with pa.ipc.new_file(sink, t.schema, options=pa.ipc.IpcWriteOptions(compression='lz4')) as writer:
        for batch in t.to_batches(max_chunksize=BATCH_SIZE):
            writer.write(batch)
len(pf.read_table('test3.feather').to_batches()[0]) # 1024*1024 rows
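
(And to double-check that every batch, not just the first one, keeps the requested size:)

print([b.num_rows for b in pf.read_table('test3.feather').to_batches()])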
