
write_deltalake identifies large_string as datatype even though string is set in schema #2374

Closed
dimitarmaya opened this issue on Apr 2, 2024 · 4 comments · Fixed by #2635
Labels
bug Something isn't working

Comments

@dimitarmaya

dimitarmaya commented Apr 2, 2024

Environment

Delta-rs version: 0.16.2

Environment:

  • Cloud provider: Azure
  • OS: Ubuntu
  • Other:

Bug

What happened: Tried to create a Delta table with data, but it threw an error about invalid types. If the schema is converted to a pyarrow schema, the issue does not exist. The error is the following: Schema of data does not match table schema.
Similar to this issue, except that the dataframe here is not empty.
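
For context, here is a minimal sketch (not from the original report) of what "converted to a pyarrow schema" could look like; Schema.to_pyarrow() and the local ./test_pyarrow path are assumptions for illustration only:

import pandas as pd
import pyarrow as pa
from deltalake import Schema, write_deltalake

json_schema = '{"type": "struct","fields": [{"name": "campaign", "type": "string", "nullable": true, "metadata": {}}]}'
table_schema = Schema.from_json(json_schema)
pa_schema = table_schema.to_pyarrow()  # pyarrow schema with plain string columns

df = pd.DataFrame({"campaign": ["124342515"]})
data = pa.Table.from_pandas(df, schema=pa_schema)  # cast the data to the pyarrow schema

write_deltalake("./test_pyarrow", data=data, schema=pa_schema)
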

What you expected to happen:

How to reproduce it: Create a schema from deltalake.Schema with primitive types, for example string, and write a dataframe with it.

More details:

dimitarmaya added the bug label on Apr 2, 2024
@ion-elgreco
Collaborator

@dimitarmaya please provide a small reproducible example.

@dimitarmaya
Author

dimitarmaya commented Apr 2, 2024

Hi @ion-elgreco,
I put together the code below so the issue can be reproduced.


import pandas as pd
from deltalake import Schema, write_deltalake
from azure.identity import DefaultAzureCredential
from azure.storage.filedatalake import DataLakeServiceClient
from decouple import config

# Delta schema with plain string columns
column_names = ['campaign', 'account']
json_schema = '{"type": "struct","fields": [{"name": "campaign", "type": "string", "nullable": true, "metadata": {}},{"name": "account", "type": "string", "nullable": true, "metadata": {}}]}'
df = pd.DataFrame(columns=column_names)
table_schema = Schema.from_json(json_schema)

default_credential = DefaultAzureCredential()

storage_options = {"account_name": config('AZURE_STORAGE_ACCOUNT'), "account_key": config('AZURE_ACCOUNT_KEY')}

# create the table from the empty dataframe
table_uri = config('AZURE_DATA_LAKE_URL') + 'test/table'
write_deltalake(table_or_uri=table_uri, data=df, schema=table_schema, storage_options=storage_options)

# add one row and write again
df.loc[-1] = ['124342515', '123125435235']
write_deltalake(table_or_uri=table_uri, data=df, schema=table_schema, storage_options=storage_options)

It yields the following outcome:
ValueError: Schema of data does not match table schema
Data schema:
campaign: large_string
account: large_string
Table Schema:
campaign: string
account: string

P.S. The main reason I want to stay with the internal schema in my real-world scenario is that I want to store column metadata in the delta log. The pyarrow schema strips the metadata info from the columns.
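
As an illustration (not from the original comment), field-level metadata can be carried in the Delta schema JSON; the "comment" key below is a made-up example of such metadata:

from deltalake import Schema

# hypothetical example: a field carrying non-empty metadata that the reporter
# wants preserved in the delta log rather than stripped by a pyarrow conversion
json_schema_with_metadata = (
    '{"type": "struct","fields": ['
    '{"name": "campaign", "type": "string", "nullable": true,'
    ' "metadata": {"comment": "marketing campaign id"}}'
    ']}'
)
table_schema = Schema.from_json(json_schema_with_metadata)
print(table_schema.fields[0].metadata)  # expected: {'comment': 'marketing campaign id'}
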

@filipkoravik

I am hitting the same issue! 🤔

@sherlockbeard
Contributor

One workaround is forcing large_dtypes=True:


import pandas as pd
from deltalake import Schema, write_deltalake

column_names = ['campaign', 'account']
json_schema = '{"type": "struct","fields": [{"name": "campaign", "type": "string", "nullable": true, "metadata": {}},{"name": "account", "type": "string", "nullable": true, "metadata": {}}]}'
df = pd.DataFrame(columns=column_names)
table_schema = Schema.from_json(json_schema)

# create the table from the empty dataframe, forcing large dtypes
write_deltalake('./test1', data=df, schema=table_schema, large_dtypes=True)

# add a row and append, again with large_dtypes=True
df.loc[-1] = ['124342515', '123125435235']
write_deltalake('./test1', data=df, schema=table_schema, mode='append', large_dtypes=True)
