What happened:
I wrote a monotonically incrementing sequence into a deltalake table using the pyarrow engine.
When reading this deltalake table, the data is no longer monotonically incrementing.
What you expected to happen: I expect the data to be monotonically incrementing.
The rust engine appears to work as expected, but the pyarrow engine appears to reorder the data.
How to reproduce it:
Minimal example which reproduces the bug consistently on my laptop.
# out_of_order.py
import argparse
import os

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
from deltalake import write_deltalake, DeltaTable


def write_data_file(file, schema, length, batch_size):
    # Write a monotonically increasing int64 column to a plain Parquet file.
    with pq.ParquetWriter(file,
                          schema=schema,
                          compression='gzip',
                          compression_level=6) as writer:
        for i in range(0, length, batch_size):
            rows = min(i + batch_size, length) - i
            df = pd.DataFrame(range(i, i + rows, 1), columns=['increment'])
            batch = pa.record_batch(schema=schema, data=df)
            writer.write_batch(batch)
    # Sanity check: the source file itself must be in order.
    df = pd.read_parquet(file)
    assert df['increment'].is_monotonic_increasing, 'data file not monotonic'


def write_delta(engine, uri, schema, file, batch_size):
    # Stream the Parquet file into a Delta table using the given writer engine.
    with pq.ParquetFile(file) as data:
        write_deltalake(table_or_uri=uri,
                        data=data.iter_batches(batch_size=batch_size),
                        schema=schema,
                        mode='overwrite',
                        engine=engine)


def assert_monotonic(engine, uri):
    # Read the whole table back and check that the order was preserved.
    dt = DeltaTable(uri)
    assert dt.to_pandas()['increment'].is_monotonic_increasing, f'{engine} not monotonic'


if __name__ == '__main__':
    parser = argparse.ArgumentParser(usage=__doc__, formatter_class=argparse.ArgumentDefaultsHelpFormatter)
    parser.add_argument('--path', required=True, help='Deltalake table path')
    parser.add_argument('--length', default=62914561, type=int, help='Dataset length')
    parser.add_argument('--batch-size', default=100_000, type=int, help='Batch size')
    args = parser.parse_args()

    schema = pa.schema([
        pa.field('increment', pa.int64(), nullable=False),
    ])

    os.makedirs(args.path, exist_ok=True)
    file = args.path + '/monotonic'
    write_data_file(file, schema, args.length, args.batch_size)

    uri = args.path + '/rust'
    write_delta('rust', uri, schema, file, args.batch_size)
    assert_monotonic('rust', uri)

    uri = args.path + '/pyarrow'
    write_delta('pyarrow', uri, schema, file, args.batch_size)
    assert_monotonic('pyarrow', uri)
Run using the command:
python out_of_order.py --path $PWD/out_of_order
Following exception is raised:
Traceback (most recent call last):
  File ".../out_of_order.py", line 64, in <module>
    assert_monotonic('pyarrow', uri)
  File ".../out_of_order.py", line 40, in assert_monotonic
    assert dt.to_pandas()['increment'].is_monotonic_increasing, f'{engine} not monotonic'
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AssertionError: pyarrow not monotonic
More details:
The file which appears to be out of order on my machine is part-5 ($PWD/out_of_order/pyarrow/0-6e748f6b-f69e-47dd-8857-dc652c73cfef-5.parquet).
The failure reproduces on every run, which suggests it is at least partly deterministic. However, a different "row group" appears to be out of order each time: when I inspect the file, the increment value at which the sequence stops being monotonic differs between runs.
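For anyone who wants to repeat the inspection, here is a minimal sketch of how the break points can be located. It is not part of the original script; `find_order_breaks` is a hypothetical helper name, and it uses the same non-strict definition as pandas' `is_monotonic_increasing` (equal neighbours are allowed, only a decrease counts as a break).

```python
from typing import Iterable, List, Tuple


def find_order_breaks(values: Iterable[int]) -> List[Tuple[int, int, int]]:
    """Return (index, previous, current) for every position where the
    current value is smaller than the one before it."""
    breaks = []
    previous = None
    for index, current in enumerate(values):
        if previous is not None and current < previous:
            breaks.append((index, previous, current))
        previous = current
    return breaks


# Small in-memory demonstration: the sequence drops from 5 to 3 at index 4.
sample = [0, 1, 2, 5, 3, 4, 6]
print(find_order_breaks(sample))
```

On a real part file the values could be loaded with something like `pq.read_table(path)['increment'].to_pylist()` and passed to the helper. Comparing the reported indices between runs should show whether the break really moves to a different position each time.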
Looks like this is mainly due to a known issue in pyarrow: apache/arrow#39030
However, I also found that a _delta_log transaction with multiple add actions can list its part files in unsorted order. This contributes to the problem: if the add action files are read in "transaction order", the data can appear unsorted even though the data within each individual file is ordered.
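A rough sketch of how the add-action order can be checked, using only the standard library. It assumes the commit file is newline-delimited JSON with one action per line, and that part files end in `-<N>.parquet`, a pattern inferred only from the single file name quoted in this report rather than from any documented contract. The helper names `add_paths`, `part_index` and `in_part_order` are hypothetical.

```python
import json
import re
from typing import List

# Assumed naming pattern: "<prefix>-<part number>.parquet".
_PART_SUFFIX = re.compile(r"-(\d+)\.parquet$")


def add_paths(commit_text: str) -> List[str]:
    """Return the paths of all add actions, in the order they appear."""
    paths = []
    for line in commit_text.splitlines():
        if not line.strip():
            continue
        action = json.loads(line)
        if "add" in action:
            paths.append(action["add"]["path"])
    return paths


def part_index(path: str) -> int:
    """Extract the trailing part number from a part file name."""
    match = _PART_SUFFIX.search(path)
    if match is None:
        raise ValueError(f"no part number found in {path!r}")
    return int(match.group(1))


def in_part_order(paths: List[str]) -> List[str]:
    """Sort paths numerically by part number (so part 10 follows part 2)."""
    return sorted(paths, key=part_index)


# Made-up commit content for illustration only.
commit = "\n".join([
    json.dumps({"commitInfo": {"operation": "WRITE"}}),
    json.dumps({"add": {"path": "0-abc-2.parquet"}}),
    json.dumps({"add": {"path": "0-abc-10.parquet"}}),
    json.dumps({"add": {"path": "0-abc-0.parquet"}}),
])
listed = add_paths(commit)
print(listed == in_part_order(listed))
```

If `listed` differs from `in_part_order(listed)` for a real commit file, the transaction lists its parts out of order, and reading the files in part order instead would be one way to separate this effect from the pyarrow issue.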
Environment
Delta-rs version: 0.18.2
Binding: python
Environment: