Schema Mismatch Error When appending Parquet Files with Metadata using Rust Engine #2888

Closed
pyjads opened this issue Sep 18, 2024 · 3 comments
Labels
bug Something isn't working

Comments

@pyjads

pyjads commented Sep 18, 2024

Environment

Delta-rs version: 0.20.0

Binding:

Environment:

  • OS: macOS
from pathlib import Path
from deltalake import write_deltalake
from pyarrow.dataset import dataset
import pyarrow.parquet as pq
import pyarrow as pa
import shutil

delta_config = {
    "engine": "rust",
    "schema_mode": "merge",
    
}

# delta_config = {
#     "engine": "pyarrow",
# }

def write():
    
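    # Start from a clean table directory; ignore_errors avoids a failure on the first run
    shutil.rmtree(Path(".delta2"), ignore_errors=True)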
    file_extension = 'parquet'
    p3 = dataset("example_with_field_id_1.parquet", format=file_extension).to_table()
    p4 = dataset("example_with_field_id_2.parquet", format=file_extension).to_table()
    print(p3.schema)
    print('====================')
    print(p4.schema)

    write_deltalake(
        table_or_uri=".delta2",
        data=p4,
        mode="append",
        partition_by=["column_3"],
        **delta_config
    )

    write_deltalake(
        table_or_uri=".delta2",
        data=p3,
        mode="append",
        partition_by=["column_3"],
        **delta_config
    )

def create(id):

    # Define the custom schema with field_id
    schema = pa.schema([
        pa.field("column_1", pa.int32(), metadata={'parquet.field_id': '1'}),
        pa.field("column_2", pa.float64(), metadata={'parquet.field_id': '1'}),
        pa.field("column_3", pa.string(), metadata={'parquet.field_id': '1'}),
    ])

    # Create some data for the columns
    data = {
        'column_1': [1, 2, 3],
        'column_2': [1.1, 2.2, 3.3],
        'column_3': ['d', 'e', 'f']
    }

    # Convert to PyArrow table
    table = pa.Table.from_pydict(data, schema=schema)

    print(table.schema)

    # Write to Parquet file
    pq.write_table(table, f'example_with_field_id_{id}.parquet')
    

create(1)
create(2)
write()

When trying to append two Parquet files with custom field_id metadata using the rust engine, the following error is raised:

_internal.SchemaMismatchError: Schema error: Cannot merge metadata with different values for key parquet.field_id

When using the PyArrow engine, the data is written to Delta Lake successfully. It also works with a pandas DataFrame with both engines.

pyjads added the bug label and changed the title from "Schema Mismatch Error When Merging Parquet Files with Metadata using Rust Engine" to "Schema Mismatch Error When appending Parquet Files with Metadata using Rust Engine" on Sep 18, 2024.
@ion-elgreco
Collaborator

Already mentioned here: #2850, and it's already fixed in delta-kernel-rs

@nfoerster

So the fix should be in 0.20.0?

Because it still occurs in that version:

Traceback (most recent call last):
  File "/Users/nfoerster/Repos/dashboards/cpm/dbt/dt_merge.py", line 17, in <module>
    write_deltalake(
  File "/Users/nfoerster/Repos/dashboards/.venv/lib/python3.12/site-packages/deltalake/writer.py", line 323, in write_deltalake
    write_deltalake_rust(
_internal.SchemaMismatchError: Schema error: Cannot merge metadata with different values for key PARQUET:field_id
(.venv) nfoerster@Normans-MacBook-Pro dbt % pip freeze | grep deltalake
deltalake==0.20.0
duckdb_deltalake_dbt @ file:///Users/nfoerster/Repos/dashboards/duckdb_deltalake_dbt

@ion-elgreco
Collaborator

No, it will be in the next release.
