
write_deltalake identifies large_string as datatype even though string is set in schema #2374

Closed
dimitarmaya opened this issue on Apr 2, 2024 · 4 comments · Fixed by #2635
Labels
bug Something isn't working

Comments

@dimitarmaya

dimitarmaya commented Apr 2, 2024

Environment

Delta-rs version: 0.16.2

Environment:

  • Cloud provider: Azure
  • OS: Ubuntu
  • Other:

Bug

What happened: Tried to create a Delta table with data, but it threw an error about invalid types. If the schema is converted to a pyarrow schema, the issue does not exist. The error is the following: Schema of data does not match table schema.
Similar to this issue, except that the dataframe here is not empty.
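
For context, here is a minimal sketch (not from the original report) of what "converted to a pyarrow schema" could look like; Schema.to_pyarrow() and the local ./test_pyarrow path are assumptions for illustration only:

import pandas as pd
import pyarrow as pa
from deltalake import Schema, write_deltalake

json_schema = '{"type": "struct","fields": [{"name": "campaign", "type": "string", "nullable": true, "metadata": {}}]}'
table_schema = Schema.from_json(json_schema)
pa_schema = table_schema.to_pyarrow()  # pyarrow schema with plain string columns

df = pd.DataFrame({"campaign": ["124342515"]})
data = pa.Table.from_pandas(df, schema=pa_schema)  # cast the data to the pyarrow schema

write_deltalake("./test_pyarrow", data=data, schema=pa_schema)
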

What you expected to happen:

How to reproduce it: Create a schema from deltalake.Schema with primitive types, for example string, and write a dataframe with it.

More details:

dimitarmaya added the bug label on Apr 2, 2024
@ion-elgreco
Collaborator

@dimitarmaya please provide a small reproducible example.

@dimitarmaya
Author

dimitarmaya commented Apr 2, 2024

Hi @ion-elgreco,
I put together the code below so the issue can be reproduced.


import pandas as pd
from deltalake import Schema, write_deltalake
from azure.identity import DefaultAzureCredential
from azure.storage.filedatalake import DataLakeServiceClient
from decouple import config

# Delta schema with plain string columns
column_names = ['campaign', 'account']
json_schema = '{"type": "struct","fields": [{"name": "campaign", "type": "string", "nullable": true, "metadata": {}},{"name": "account", "type": "string", "nullable": true, "metadata": {}}]}'
df = pd.DataFrame(columns=column_names)
table_schema = Schema.from_json(json_schema)

default_credential = DefaultAzureCredential()

storage_options = {"account_name": config('AZURE_STORAGE_ACCOUNT'), "account_key": config('AZURE_ACCOUNT_KEY')}

# create the table from the empty dataframe
table_uri = config('AZURE_DATA_LAKE_URL') + 'test/table'
write_deltalake(table_or_uri=table_uri, data=df, schema=table_schema, storage_options=storage_options)

# add one row and write again
df.loc[-1] = ['124342515', '123125435235']
write_deltalake(table_or_uri=table_uri, data=df, schema=table_schema, storage_options=storage_options)

It yields the following outcome:
ValueError: Schema of data does not match table schema
Data schema:
campaign: large_string
account: large_string
Table Schema:
campaign: string
account: string

P.S. The main reason I want to stay with the internal schema in my real-world scenario is that I want to store column metadata in the delta log. The pyarrow schema strips the metadata info from the columns.
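
As an illustration (not from the original comment), field-level metadata can be carried in the Delta schema JSON; the "comment" key below is a made-up example of such metadata:

from deltalake import Schema

# hypothetical example: a field carrying non-empty metadata that the reporter
# wants preserved in the delta log rather than stripped by a pyarrow conversion
json_schema_with_metadata = (
    '{"type": "struct","fields": ['
    '{"name": "campaign", "type": "string", "nullable": true,'
    ' "metadata": {"comment": "marketing campaign id"}}'
    ']}'
)
table_schema = Schema.from_json(json_schema_with_metadata)
print(table_schema.fields[0].metadata)  # expected: {'comment': 'marketing campaign id'}
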

@filipkoravik

I am hitting the same issue! 🤔

@sherlockbeard
Contributor

One workaround is forcing large_dtypes=True:


import pandas as pd
from deltalake import Schema, write_deltalake

column_names = ['campaign', 'account']
json_schema = '{"type": "struct","fields": [{"name": "campaign", "type": "string", "nullable": true, "metadata": {}},{"name": "account", "type": "string", "nullable": true, "metadata": {}}]}'
df = pd.DataFrame(columns=column_names)
table_schema = Schema.from_json(json_schema)

# create the table from the empty dataframe, forcing large dtypes
write_deltalake('./test1', data=df, schema=table_schema, large_dtypes=True)

# add a row and append, again with large_dtypes=True
df.loc[-1] = ['124342515', '123125435235']
write_deltalake('./test1', data=df, schema=table_schema, mode='append', large_dtypes=True)
