Problem writing tables in directory named with char ~ #1806

Closed
bglezseoane opened this issue Nov 5, 2023 · 4 comments

Labels: binding/rust (Issues for the Rust crate), bug (Something isn't working)

bglezseoane commented Nov 5, 2023

Environment

Delta-rs version: Python deltalake 0.12.0.
Binding: Python deltalake 0.12.0.
Environment: Local, Python 3.11.
OS: Mac OS.


Bug

What happened:

I am working on a local macOS machine (ARM) and using the Python deltalake binding to read and write Delta tables. I need to store some Delta tables in the directory synced with iCloud, which is named $HOME/Library/Mobile Documents/com~apple~CloudDocs. I also have a symlink in my home directory, $HOME/iCloud, pointing to this location, which I typically use instead of the full path.

When I call the write_deltalake function, it raises OSError: Encountered object with invalid path: Error parsing Path "/Users/***/Library/Mobile%20Documents/com~apple~CloudDocs/testing_deltalake/df": Encountered illegal character sequence "~" whilst parsing path segment "com~apple~CloudDocs". I am using the symlink in the write_deltalake call, so it seems that:

  1. write_deltalake resolves the symlink.
  2. write_deltalake has problems handling the ~ character.

A similar error occurs when using DeltaTable to load an existing table.
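
To illustrate the first point, here is a minimal sketch of how a symlink without ~ in its name can resolve to a path that contains it. It assumes the resolution behaves like Python's os.path.realpath; the binding's actual resolution happens in the Rust layer and is not shown here. The helper name resolve_and_find_tilde_segments and the temporary directory layout are hypothetical, chosen only to mirror the iCloud setup described above.

```python
import os
import tempfile


def resolve_and_find_tilde_segments(path):
    """Resolve symlinks, then list the path segments that contain '~'."""
    resolved = os.path.realpath(path)
    segments = [s for s in resolved.split(os.sep) if s]
    return resolved, [s for s in segments if "~" in s]


with tempfile.TemporaryDirectory() as root:
    # Hypothetical stand-in for ~/Library/Mobile Documents/com~apple~CloudDocs
    target = os.path.join(root, "com~apple~CloudDocs")
    os.mkdir(target)

    # Hypothetical stand-in for the $HOME/iCloud symlink
    link = os.path.join(root, "iCloud")
    os.symlink(target, link)

    resolved, offending = resolve_and_find_tilde_segments(link)
    print(resolved)   # the real path, ending in com~apple~CloudDocs
    print(offending)  # segments a strict path parser might reject
```

The link name itself is clean, so the ~ only becomes visible once the path has been resolved, which would match the path shown in the error message.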

What you expected to happen:

Setting aside that the directory syncs with iCloud, and however unconventional that may be for Delta tables, it is a fully valid path on this OS, and I think it should be possible to read and write tables there. In fact, the error persists for any directory whose name contains ~.

I leave it to the developers to decide whether symbolic links should be resolved; I lack the context to take a position on this.

How to reproduce it:

from tempfile import mkdtemp

import pandas as pd
from deltalake import write_deltalake

df = pd.DataFrame(
    {
        "a": [1, 2, 3, 4, 5],
        "fruits": ["banana", "orange", "mango", "apple", "banana"],
    }
)

# dest = Path().home() / "iCloud" / "testing_deltalake" / "df"
dest = mkdtemp(suffix="com~apple~CloudDocs")

write_deltalake(
    table_or_uri=dest,
    data=df,
)

Error:

---------------------------------------------------------------------------
OSError                                   Traceback (most recent call last)
Cell In[13], line 16
     13 # dest = Path().home() / "iCloud" / "testing_deltalake" / "df"
     14 dest = mkdtemp(suffix="com~apple~CloudDocs")
---> 16 write_deltalake(
     17     table_or_uri=dest,
     18     data=df,
     19 )

File ~/.cache/pypoetry/virtualenvs/***-***-py3.11/lib/python3.11/site-packages/deltalake/writer.py:153, in write_deltalake(table_or_uri, data, schema, partition_by, filesystem, mode, file_options, max_partitions, max_open_files, max_rows_per_file, min_rows_per_group, max_rows_per_group, name, description, configuration, overwrite_schema, storage_options, partition_filters, large_dtypes)
    150     else:
    151         data, schema = delta_arrow_schema_from_pandas(data)
--> 153 table, table_uri = try_get_table_and_table_uri(table_or_uri, storage_options)
    155 # We need to write against the latest table version
    156 if table:

File ~/.cache/pypoetry/virtualenvs/***-***-py3.11/lib/python3.11/site-packages/deltalake/writer.py:417, in try_get_table_and_table_uri(table_or_uri, storage_options)
    414     raise ValueError("table_or_uri must be a str, Path or DeltaTable")
    416 if isinstance(table_or_uri, (str, Path)):
--> 417     table = try_get_deltatable(table_or_uri, storage_options)
    418     table_uri = str(table_or_uri)
    419 else:

File ~/.cache/pypoetry/virtualenvs/***-***-py3.11/lib/python3.11/site-packages/deltalake/writer.py:430, in try_get_deltatable(table_uri, storage_options)
    426 def try_get_deltatable(
    427     table_uri: Union[str, Path], storage_options: Optional[Dict[str, str]]
    428 ) -> Optional[DeltaTable]:
    429     try:
--> 430         return DeltaTable(table_uri, storage_options=storage_options)
    431     except TableNotFoundError:
    432         return None

File ~/.cache/pypoetry/virtualenvs/***-***-py3.11/lib/python3.11/site-packages/deltalake/table.py:250, in DeltaTable.__init__(self, table_uri, version, storage_options, without_files, log_buffer_size)
    231 """
    232 Create the Delta Table from a path with an optional version.
    233 Multiple StorageBackends are currently supported: AWS S3, Azure Data Lake Storage Gen2, Google Cloud Storage (GCS) and local URI.
   (...)
    247 
    248 """
    249 self._storage_options = storage_options
--> 250 self._table = RawDeltaTable(
    251     str(table_uri),
    252     version=version,
    253     storage_options=storage_options,
    254     without_files=without_files,
    255     log_buffer_size=log_buffer_size,
    256 )
    257 self._metadata = Metadata(self._table)

OSError: Encountered object with invalid path: Error parsing Path "/private/var/folders/2h/xg9cljwj14d_nspp3fjk4pyw0000gn/T/tmpqh27vtamcom~apple~CloudDocs": Encountered illegal character sequence "~" whilst parsing path segment "tmpqh27vtamcom~apple~CloudDocs"
bglezseoane added the bug label Nov 5, 2023

r3stl355 (Contributor) commented Nov 6, 2023

This error comes from the object_store crate in Apache Arrow (arrow-rs), so it may be necessary to raise an issue there: https://github.com/apache/arrow-rs/blob/91acfb07a9929a2d6721c5417e47c0c472372a86/object_store/src/path/parts.rs#L91C15
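
For readers who do not want to dig through the Rust source, the following is a simplified Python model of the kind of per-segment check that produces this message. It is not the crate's code. The only character confirmed as rejected by this thread is "~"; the rest of ASSUMED_ILLEGAL, the control-character check, and the function names are assumptions made for illustration.

```python
# Assumption: only "~" is confirmed by the error message in this issue.
# The other entries are hypothetical stand-ins for a strict "illegal" set.
ASSUMED_ILLEGAL = {"~", "#", "?", "*"}


def find_illegal(segment):
    """Return the first assumed-illegal character in a segment, or None."""
    for ch in segment:
        if ch in ASSUMED_ILLEGAL or ord(ch) < 0x20:
            return ch
    return None


def parse_path(path):
    """Split on '/' and reject any segment with an assumed-illegal character."""
    segments = [s for s in path.split("/") if s]
    for segment in segments:
        bad = find_illegal(segment)
        if bad is not None:
            raise ValueError(
                f'Encountered illegal character sequence "{bad}" '
                f'whilst parsing path segment "{segment}"'
            )
    return segments


try:
    parse_path("/tmp/com~apple~CloudDocs/df")
except ValueError as exc:
    print(exc)
```

Under this model the path is valid for the operating system but fails a stricter, storage-oriented validation applied one segment at a time, which would be consistent with the error naming a single path segment.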


tustvold commented Nov 8, 2023

Path safety has been relaxed in the most recent version of object_store 0.8

bglezseoane (Author) commented

> Path safety has been relaxed in the most recent version of object_store 0.8

Any idea when the update might be propagated to this library?

tustvold commented

We're working on getting DataFusion updated currently; it should land at some point in the next few weeks.

rtyler self-assigned this Jan 3, 2024
rtyler added the binding/rust label Jan 3, 2024
rtyler added this to the Rust v0.17 milestone Jan 3, 2024
rtyler added a commit to rtyler/delta-rs that referenced this issue Jan 3, 2024
This test fails on main but passes in this branch because the URL
handling logic introduced properly encodes file URLs. No need for
object_store updates here

Fixes delta-io#1806
rtyler closed this as completed in 29f46bb Jan 3, 2024
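
The commit message attributes the fix to file URLs being encoded properly. As a rough illustration of that idea only, and not of the Rust change itself, the sketch below percent-encodes a local path into a file:// URL and decodes it back using Python's urllib.parse. The helper names and the /Users/me path are hypothetical.

```python
from urllib.parse import quote, unquote, urlsplit


def path_to_file_url(posix_path):
    """Percent-encode an absolute POSIX path into a file:// URL."""
    return "file://" + quote(posix_path)


def file_url_to_path(url):
    """Decode the path component of a file:// URL back to a plain path."""
    return unquote(urlsplit(url).path)


# Hypothetical path modelled on the one reported in this issue.
original = "/Users/me/Library/Mobile Documents/com~apple~CloudDocs/df"

url = path_to_file_url(original)
print(url)
print(file_url_to_path(url) == original)
```

The point of the round trip is that encoding and decoding are applied consistently, so a path that is valid on disk comes back unchanged instead of being validated in a half-encoded form.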
r3stl355 pushed a commit to r3stl355/delta-rs that referenced this issue Jan 10, 2024