Cannot read delta table: Delta protocol violation #1557

Closed
ghost opened this issue Jul 24, 2023 · 1 comment
Labels: binding/rust (Issues for the Rust crate), bug (Something isn't working)


ghost commented Jul 24, 2023

Environment

Delta-rs version: 0.10.0

Binding: Python

Environment:

  • Cloud provider: Azure
  • OS: Linux
  • Other:

Bug

What happened:

I'm trying to read a delta table from Azure, but it throws an error:

import deltalake

path = "az://..."
storage_options = {
    "AZURE_STORAGE_TENANT_ID": "...",
    "AZURE_STORAGE_CLIENT_ID": "...",
    "AZURE_STORAGE_CLIENT_SECRET": "...",
    "AZURE_STORAGE_ACCOUNT_NAME": "...",
}

dt = deltalake.DeltaTable(path, version=None, storage_options=storage_options)
# DeltaError: Delta protocol violation: Invalid action field: type for schemaString in metaData action should be string

Full stacktrace:

---> 17 dt = deltalake.DeltaTable(full_path, version=None, storage_options=credentials)

File /local_disk0/.ephemeral_nfs/envs/pythonEnv-a6639292-dd1e-4998-a05e-f09df9265cb7/lib/python3.10/site-packages/deltalake/table.py:238, in DeltaTable.__init__(self, table_uri, version, storage_options, without_files)
    225 """
    226 Create the Delta Table from a path with an optional version.
    227 Multiple StorageBackends are currently supported: AWS S3, Azure Data Lake Storage Gen2, Google Cloud Storage (GCS) and local URI.
   (...)
    235                       DeltaTable will be loaded with a significant memory reduction.
    236 """
    237 self._storage_options = storage_options
--> 238 self._table = RawDeltaTable(
    239     str(table_uri),
    240     version=version,
    241     storage_options=storage_options,
    242     without_files=without_files,
    243 )
    244 self._metadata = Metadata(self._table)

DeltaError: Delta protocol violation: Invalid action field: type for schemaString in metaData action should be string

I was able to get details on the table I am trying to read via the Databricks Delta package:

Row(format='delta', partitionColumns=[], clusteringColumns=None, numFiles=1, sizeInBytes=13600, properties={'pipelines.pipelineId': '668e3f60-894b-44d3-a205-a5a0c08a4a8b'}, minReaderVersion=1, minWriterVersion=2, tableFeatures=['appendOnly', 'invariants'])

After some digging, it looks like our Databricks job first runs a DLT REFRESH operation, which creates version 0 of the table without any metadata. It then runs a DLT SETUP operation, which creates version 1 and adds the metadata, followed by a WRITE operation. Version 0 is therefore invalid (no metadata), and that is what crashes deltalake.
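For anyone who wants to confirm the same pattern in their own table, here is a minimal sketch of a check on a single commit file. It assumes the standard Delta log layout, where each `_delta_log/<version>.json` file holds one JSON action per line. The function name `check_commit_metadata` and the sample commit lines are made up for illustration and are not part of deltalake.

```python
import json
from typing import Iterable, List


def check_commit_metadata(lines: Iterable[str]) -> List[str]:
    """Report findings about the metaData action in one commit file.

    `lines` is the content of a single `_delta_log/<version>.json` file,
    one JSON action per line. Hypothetical helper, not a deltalake API.
    """
    findings = []
    metadata_actions = []
    for raw in lines:
        raw = raw.strip()
        if not raw:
            continue  # skip blank lines
        action = json.loads(raw)
        if "metaData" in action:
            metadata_actions.append(action["metaData"])

    if not metadata_actions:
        # Normal for later commits, suspicious for version 0.
        findings.append("no metaData action")
    for meta in metadata_actions:
        schema = meta.get("schemaString") if isinstance(meta, dict) else None
        if not isinstance(schema, str):
            findings.append(
                f"schemaString is {type(schema).__name__}, expected str"
            )
    return findings


# Made-up sample commits mirroring the situation described above.
version_0 = [json.dumps({"commitInfo": {"operation": "DLT REFRESH"}})]
version_1 = [json.dumps({"metaData": {"id": "example", "schemaString": "{}"}})]

print(check_commit_metadata(version_0))
print(check_commit_metadata(version_1))
```

A commit without a `metaData` action is only a concern for the first version of the table; later commits normally omit it unless the metadata changes.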

What you expected to happen:

I would expect to be able to read the table normally.

How to reproduce it:

I don't know how to help you reproduce this.

ghost added the bug (Something isn't working) label Jul 24, 2023
rtyler added this to the Rust v0.16 milestone Sep 15, 2023
rtyler self-assigned this Sep 20, 2023
rtyler added the binding/rust (Issues for the Rust crate) label Sep 20, 2023

rtyler (Member) commented Sep 25, 2023

Fix will be in the next release

rtyler closed this as completed Sep 25, 2023