[Python] Incorrect file URIs when partition values contain escape character #1533

Closed
j-bennet opened this issue Jul 12, 2023 · 1 comment · Fixed by #1613
Labels
bug Something isn't working


j-bennet commented Jul 12, 2023

Environment

Delta-rs version: 0.10.0

Binding: Python

Environment:

  • Cloud provider:
  • OS: macOS
  • Other:

Bug

What happened:

When partition values contain an escape character (example: letter=%2F%2520%25f), deltalake returns incorrect file_uris: they appear to be percent-encoded one extra time.

What you expected to happen:

file_uris should return URIs that correspond to the actual file names on disk.

How to reproduce it:

  1. Download data from Delta Acceptance Testing (https://github.com/delta-incubator/dat/)

wget https://github.com/delta-incubator/dat/releases/download/v0.0.2/deltalake-dat-v0.0.2.tar.gz

  2. Extract the data.
  3. Check file_uris:
> ipython
In [1]: from deltalake import DeltaTable

In [2]: dt = DeltaTable("out/reader_tests/generated/multi_partitioned/delta")

In [3]: dt.file_uris()
Out[3]:
['/Users/jbennet/src/delta-rs/python/out/reader_tests/generated/multi_partitioned/delta/letter=%252F%252520%2525f/date=1970-01-01/data=hello/part-00000-a29a3f63-5a26-4127-aeff-e6d27c077917.c000.snappy.parquet',
 '/Users/jbennet/src/delta-rs/python/out/reader_tests/generated/multi_partitioned/delta/letter=b/date=1970-01-01/data=%F0%9F%98%88/part-00000-9363b3d0-34b9-4db4-baf6-9b48cf88ae5b.c000.snappy.parquet']

The actual paths on disk do not match the returned URIs:

In [4]: ! ls -l /Users/jbennet/src/delta-rs/python/out/reader_tests/generated/multi_partitioned/delta/
total 0
drwxr-xr-x  8 jbennet  staff  256 Jan 24 17:50 _delta_log
drwxr-xr-x  3 jbennet  staff   96 Jan 24 17:50 letter=%2F%2520%25f
drwxr-xr-x  3 jbennet  staff   96 Jan 24 17:50 letter=__HIVE_DEFAULT_PARTITION__
drwxr-xr-x  4 jbennet  staff  128 Jan 24 17:50 letter=a
drwxr-xr-x  4 jbennet  staff  128 Jan 24 17:50 letter=b

Actual path: letter=%2F%2520%25f
Returned by delta-rs: letter=%252F%252520%2525f
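
The difference between the two names is consistent with exactly one extra round of percent-encoding. A minimal sketch using Python's standard `urllib.parse` shows this. It only illustrates the symptom and is not the code path delta-rs uses; the two strings are copied from the output above.

```python
from urllib.parse import quote, unquote

# Partition value as it appears in the directory name on disk.
actual = "%2F%2520%25f"
# Partition value as it appears in the result of dt.file_uris().
returned = "%252F%252520%2525f"

# Encoding the on-disk name once more turns every "%" into "%25".
assert quote(actual) == returned

# Decoding the returned value exactly once recovers the on-disk name.
assert unquote(returned) == actual
```

Both assertions hold, so the returned value looks like the on-disk name encoded one time too many.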

More details:

j-bennet added the bug label on Jul 12, 2023
j-bennet changed the title from "[Python] Incorrect file URIs when partition values contain %" to "[Python] Incorrect file URIs when partition values contain escape character" on Jul 12, 2023
Contributor

sherlockbeard commented Jul 23, 2023

May be related to #1079 #1446

wjones127 added a commit that referenced this issue Sep 11, 2023
# Description

In the delta log, paths are percent encoded. We decode them here:


https://github.com/delta-io/delta-rs/blob/787c13a63efa9ada96d303c10c093424215aaa80/rust/src/action/mod.rs#L435-L437

That part is correct, but we have then been re-encoding them with `Path::from`.
This PR switches to `Path::parse` where possible. Instead of
propagating errors, we just fall back to `Path::from` for now. Read more
here:
https://docs.rs/object_store/0.7.0/object_store/path/struct.Path.html#encode
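
The decode-then-re-encode sequence described above can be modeled in a few lines of Python. This is a rough sketch, not the Rust implementation: `decode_log_path`, `store_path_reencoding`, and `store_path_verbatim` are hypothetical stand-ins for the log decoding step, the old `Path::from` behavior, and the new `Path::parse` behavior. The sample log path is an assumed value built from the directory name in the issue, and the fallback to `Path::from` on parse errors is not modeled.

```python
from urllib.parse import quote, unquote


def decode_log_path(log_path: str) -> str:
    """Decode a percent-encoded path from the Delta log exactly once."""
    return unquote(log_path)


def store_path_reencoding(decoded: str) -> str:
    """Hypothetical model of the old behavior: encode the decoded path again."""
    return quote(decoded, safe="/=")


def store_path_verbatim(decoded: str) -> str:
    """Hypothetical model of the new behavior: keep the decoded path as is."""
    return decoded


# Assumed log entry: the on-disk directory name, percent-encoded once.
log_path = "letter=%252F%252520%2525f/date=1970-01-01"
decoded = decode_log_path(log_path)

# Re-encoding reproduces the doubly encoded name reported in the issue.
assert store_path_reencoding(decoded) == "letter=%252F%252520%2525f/date=1970-01-01"

# Keeping the decoded path untouched matches the directory on disk.
assert store_path_verbatim(decoded) == "letter=%2F%2520%25f/date=1970-01-01"
```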

# Related Issue(s)

* closes #1533
* closes #1446 
* closes #1079
* closes #1393

