Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[python] Different stringification of partition values in reader and writer #1653

Closed
giacomorebecchi opened this issue Sep 21, 2023 · 2 comments · Fixed by #1661
Closed

[python] Different stringification of partition values in reader and writer #1653

giacomorebecchi opened this issue Sep 21, 2023 · 2 comments · Fixed by #1661
Labels
bug Something isn't working

Comments

@giacomorebecchi
Copy link
Contributor

Environment

Delta-rs version: python v0.10.2


Bug

The delta reader and writer treat differently partition values which are not strings. You can clearly see it from:

def __stringify_partition_values(
self, partition_filters: Optional[List[Tuple[str, str, Any]]]
) -> Optional[List[Tuple[str, str, Union[str, List[str]]]]]:
if partition_filters is None:
return partition_filters
out = []
for field, op, value in partition_filters:
str_value: Union[str, List[str]]
if isinstance(value, (list, tuple)):
str_value = [str(val) for val in value]
else:
str_value = str(value)
out.append((field, op, str_value))
return out

def __encode_partition_value(val: Any) -> str:
# Rules based on: https://github.com/delta-io/delta/blob/master/PROTOCOL.md#partition-value-serialization
if isinstance(val, bool):
return str(val).lower()
if isinstance(val, str):
return val
elif isinstance(val, (int, float)):
return str(val)
elif isinstance(val, date):
return val.isoformat()
elif isinstance(val, datetime):
return val.isoformat(sep=" ")
elif isinstance(val, bytes):
return val.decode("unicode_escape", "backslashreplace")
else:
raise ValueError(f"Could not encode partition value for type: {val}")

An easy fix would be to make the second function accessible, removing the double underscore, and using it inside the reader.

@giacomorebecchi giacomorebecchi added the bug Something isn't working label Sep 21, 2023
@wjones127
Copy link
Collaborator

Could you describe an actual error or failure you have encountered as a result of this? Or even a reproducible example?

@giacomorebecchi
Copy link
Contributor Author

Could you describe an actual error or failure you have encountered as a result of this? Or even a reproducible example?

The following is a MRE that displays that, due to the different transformation of the partition value in the writer and the reader, the file_uris method fails to find the correct file uris. I used the boolean type to display the error, because in the writer you call str(value).lower(), but the same MRE would work for any other type which has a different implementation. Note in the MRE that, since the integers are casted with str(value) in both the reader and the writer. Also, note the fact that the .to_pyarrow_table() does not suffer from this bug.

import deltalake
import pyarrow as pa

fp = "./resources/mytable"

ta = pa.Table.from_pydict(
    {
        "bool_col": [True, False, True, False],
        "int_col": [0, 1, 2, 3],
        "str_col": ["a", "b", "c", "d"],
    }
)

deltalake.write_deltalake(fp, ta, partition_by=["bool_col", "int_col"], mode="overwrite")

dt = deltalake.DeltaTable(fp)

assert dt.to_pyarrow_table(
    filters=[
        ("int_col", "=", 0),
        ("bool_col", "=", True),
    ]
).num_rows == 1

assert len(dt.file_uris(
    partition_filters=[
        ("int_col", "=", 0),
        ("bool_col", "=", "true"),
    ]
)) == 1 # finds the actual partition because it looks for "bool_col=true"

assert len(dt.file_uris(
    partition_filters=[
        ("int_col", "=", 0),
        ("bool_col", "=", True),
    ]
)) == 1  # does not find any partition because it looks for "bool_col=True"

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants