
Overwrite mode does not work with Azure #939

Closed
dennyglee opened this issue Nov 16, 2022 · 13 comments
Labels
bug Something isn't working

Comments

@dennyglee
Collaborator

dennyglee commented Nov 16, 2022

Environment

Delta-rs version: v0.6.3

Binding: Python 3.9.12

Environment:

  • Cloud provider: Azure, locally
  • OS: MacOS Ventura, M1
  • Other:

Bug

What happened:
Overwrite mode works when writing locally, but when writing to Azure storage it fails with a nondescript error.

What you expected to happen:
When enabling overwrite mode, the table should be overwritten.

How to reproduce it:

import pandas as pd
from deltalake.writer import write_deltalake
storage_options = {"AZURE_STORAGE_ACCOUNT_KEY": "myaccountkey"}
df = pd.DataFrame({'x': range(100)})
table_root = "abfss://[email protected]/data"
write_deltalake(table_root, df, partition_by=["x"], storage_options=storage_options, mode="overwrite")

Note: Thanks to @craustin for the code snippet from #915.

results in this error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/homebrew/lib/python3.9/site-packages/deltalake/writer.py", line 168, in write_deltalake
    storage_options = dict(
TypeError: dict() got multiple values for keyword argument 'AZURE_STORAGE_ACCOUNT_KEY'

Update: When running the same code snippet from Azure VM using Ubuntu 20.04, I get the following error:

>>> write_deltalake(table_root, df, partition_by=["x"], mode="overwrite", storage_options=storage_options, max_open_files=8)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/azureuser/.local/lib/python3.8/site-packages/deltalake/writer.py", line 254, in write_deltalake
    ds.write_dataset(
  File "/home/azureuser/.local/lib/python3.8/site-packages/pyarrow/dataset.py", line 988, in write_dataset
    _filesystemdataset_write(
  File "pyarrow/_dataset.pyx", line 2859, in pyarrow._dataset._filesystemdataset_write
deltalake.PyDeltaTableError: Generic MicrosoftAzure error: Error performing put request thingyratings/tables/test1_part/x=72/0-433a73dd-acb9-4895-ab4c-4d501c019e68-0.parquet: response error "request error", after 0 retries: error sending request for url (https://lakeything.blob.core.windows.net/root/thingyratings/tables/test1_part/x=72/0-433a73dd-acb9-4895-ab4c-4d501c019e68-0.parquet?comp=blocklist): dispatch task is gone: runtime dropped the dispatch task

More details:
Note: if you run this locally, it works as expected:

import pandas as pd
from deltalake.writer import write_deltalake
df = pd.DataFrame({'x': range(100)})
table_local = "/myworkspace/test/test0_part"
write_deltalake(table_local, df, partition_by=["x"], mode="overwrite")
@dennyglee dennyglee added the bug Something isn't working label Nov 16, 2022
@roeap
Collaborator

roeap commented Nov 16, 2022

Thanks for the detailed report! I was able to reproduce the second error message on a Linux machine. The first error message is in fact quite confusing.

After experimenting a bit, it seems this mostly occurs when we have a large number of write actions running and, for some reason, some client instances of the underlying reqwest library are dropped before they report back that the operation is done. There are some open issues in the tokio / hyper projects related to this.

In essence this needs a bit more investigating, and it also "should" not be specific to Azure. Then again, I may be on the wrong track... will keep investigating.

@dennyglee
Collaborator Author

Thanks @roeap! @craustin had mentioned the same thing: this may be related to the hyper work in #915. Let me know if I can help; I'll do a little debugging later today as well, eh?! :)

@fvaleye
Collaborator

fvaleye commented Nov 16, 2022

Hello @dennyglee 👋,

Thanks for your report!

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/homebrew/lib/python3.9/site-packages/deltalake/writer.py", line 168, in write_deltalake
    storage_options = dict(
TypeError: dict() got multiple values for keyword argument 'AZURE_STORAGE_ACCOUNT_KEY'

Looking at the first error, it seems that the storage_options parameter AZURE_STORAGE_ACCOUNT_KEY is already set in the storage options of the Delta Table. Passing the same parameter again when writing to the Delta Table triggers this error:
TypeError: dict() got multiple values for keyword argument 'AZURE_STORAGE_ACCOUNT_KEY'
It is due to the way we merge storage_options values with the same key (here): we introduce a duplicate key in the dictionary. See the sketch below.
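
A minimal sketch of that failure mode, assuming the merge double-unpacks both option dicts into dict(); the names here are illustrative, not the actual writer.py code:

# Hypothetical option dicts; the real values come from the DeltaTable and the caller.
table_options = {"AZURE_STORAGE_ACCOUNT_KEY": "key-from-table"}
caller_options = {"AZURE_STORAGE_ACCOUNT_KEY": "key-from-caller"}

try:
    # Double-unpacking duplicate keys into dict() raises:
    merged = dict(**table_options, **caller_options)
except TypeError as exc:
    print(exc)  # dict() got multiple values for keyword argument 'AZURE_STORAGE_ACCOUNT_KEY'

# A duplicate-tolerant merge, letting the caller's value win:
merged = {**table_options, **caller_options}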

I will fix this in a new PR.
In the meantime, you can work around it by writing without the storage_options argument:
write_deltalake(table_root, df, partition_by=["x"], mode="overwrite")
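
If the table was not opened with credentials already attached, the credentials can also come from environment variables. A sketch, assuming the AZURE_STORAGE_ACCOUNT_NAME / AZURE_STORAGE_ACCOUNT_KEY environment-variable fallback applies in this version (not verified here):

import os

# Hypothetical values; the backend reads these if no explicit options are given.
os.environ["AZURE_STORAGE_ACCOUNT_NAME"] = "lakeything"
os.environ["AZURE_STORAGE_ACCOUNT_KEY"] = "myaccountkey"

write_deltalake(table_root, df, partition_by=["x"], mode="overwrite")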

@dennyglee
Collaborator Author

Awesome, thanks!

@wjones127
Collaborator

@fvaleye I think I actually fixed that one in #912. I can polish that up soon.

@fvaleye
Collaborator

fvaleye commented Nov 16, 2022

Oh, right @wjones127, it's here 👍
Thanks, we will wait then!

@0xdarkman

0xdarkman commented Nov 24, 2022

write_deltalake(table_root, df, partition_by=["x"], mode="overwrite")

@fvaleye how do you then provide the storage account key?

@0xdarkman

0xdarkman commented Nov 24, 2022

@fvaleye ok, I understand. This should work:

from deltalake import DeltaTable
from deltalake.writer import write_deltalake

storage_options = {
    "AZURE_STORAGE_ACCOUNT_NAME": account_name,
    "AZURE_STORAGE_ACCOUNT_KEY": account_key,
}

# Pass the credentials when opening the table, then write through it:
table_path = "abfss://[email protected]/TABLE_NAME"
dt = DeltaTable(table_path, storage_options=storage_options)

write_deltalake(table_or_uri=dt, data=df, mode="overwrite")

@0xdarkman

0xdarkman commented Nov 24, 2022

ValueError: Schema of data does not match table schema
Table schema:
domain: string
type: string
tld_manager: string
__index_level_0__: int64
schema metadata --
pandas: '{"index_columns": ["__index_level_0__"], "column_indexes": [{"na' + 670
Data Schema:
domain: string
type: string
tld_manager: string

Is the index causing the problem? (See the illustration below.)

Also, shouldn't the table be created when it does not exist?
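
For context, pa.Table.from_pandas can serialize the pandas index as an extra __index_level_0__ column, which is what the existing table's schema contains but the new data does not. A minimal illustration (column names taken from the error above):

import pandas as pd
import pyarrow as pa

df = pd.DataFrame({"domain": ["a"], "type": ["b"], "tld_manager": ["c"]})

# Preserving the index adds it as an extra column:
pa.Table.from_pandas(df, preserve_index=True).schema.names
# ['domain', 'type', 'tld_manager', '__index_level_0__']

# Dropping the index leaves only the data columns:
pa.Table.from_pandas(df, preserve_index=False).schema.names
# ['domain', 'type', 'tld_manager']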

@0xdarkman

0xdarkman commented Nov 24, 2022

import pyarrow as pa

# preserve_index=False drops pandas' __index_level_0__ column,
# so the data schema matches the existing table schema:
tb = pa.Table.from_pandas(df, preserve_index=False)
write_deltalake(table_or_uri=dt, data=tb, mode="overwrite")

was the missing link :)

@roeap
Collaborator

roeap commented Nov 29, 2022

@0xdarkman @dennyglee - 0.6.4 hopefully fixed this ... could you verify?

@dennyglee
Collaborator Author

I just tested this with 0.6.4, but the duplicate-key issue is still there: TypeError: dict() got multiple values for keyword argument 'AZURE_STORAGE_ACCOUNT_KEY'

Per the comment above, I also tested it without the storage_options argument, but then got the following error: deltalake.PyDeltaTableError: Generic DeltaTable error: Failed to find valid credential.

@wjones127
Collaborator

Sorry, the fix for that was in #912 but didn't make it in time for 0.6.4.
