fix: more consistent handling of partition values and file paths #1661
Conversation
@roeap could https://github.com/apache/incubator-opendal perhaps be an alternative to object_store with regards to that url parsing?
Maybe :) - generally speaking though, we are very happy with object_store, and there is a high likelihood that similar things would arise with opendal, as there are even more systems that are supposed to behave the same behind a shared API ... so getting it fully integrated would likely be quite a massive amount of work.
Hi @roeap, it seems delta-rs does not encode file paths, at least not completely - when trying to add a file with path:
Thanks for the report @natinimni. The handling of path encoding has recently been consolidated a bit, but to improve testing I opened this issue. One of the issues is that the object_store crate, which actually does the encoding, took a different approach, focusing on an encoding that would produce valid paths for all object store variants. In the next release, with the next version of object_store included, the encoding moves a bit closer to at least what Spark does... That said, there is one thought I had that I'd appreciate your opinion on. Since delta does not actually use the partition values encoded in the file paths for anything, I planned on making the partitioning scheme configurable and defaulting to not including any partition values in the file paths - thereby avoiding the issue altogether...
Thank you @roeap for your prompt reply and for opening a new issue on this. Would you consider approving a PR that adds a few key characters to the "INVALID" list (the characters that undergo encoding)? As for your suggestion of not using partition values in paths - it absolutely makes sense. We actually use delta-rs only to read/write the delta log, while using other proprietary tools to write the data files and manage the table, so we won't directly benefit from such a change. I did, however, consider doing exactly that in our system, decoupling partition values from the path.
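To make the "INVALID list" idea concrete, here is a minimal sketch (not the actual delta-rs or object_store implementation; the function name and the character set are illustrative assumptions) of percent-encoding a partition value before embedding it in a file path:

```rust
// Sketch only: percent-encode characters assumed to be unsafe in
// object-store paths. The real "INVALID" list lives in object_store /
// delta-rs; this set is just an example based on the discussion above.
fn encode_partition_value(value: &str) -> String {
    const UNSAFE: &[char] = &['}', '|', '<', '>', '~', '%', '#', '?', ' '];
    let mut out = String::with_capacity(value.len());
    for ch in value.chars() {
        if UNSAFE.contains(&ch) {
            // Percent-encode each UTF-8 byte of the character.
            let mut buf = [0u8; 4];
            for byte in ch.encode_utf8(&mut buf).as_bytes() {
                out.push_str(&format!("%{:02X}", byte));
            }
        } else {
            out.push(ch);
        }
    }
    out
}

fn main() {
    // '|' is 0x7C, '~' is 0x7E.
    assert_eq!(encode_partition_value("a|b"), "a%7Cb");
    assert_eq!(encode_partition_value("x~y"), "x%7Ey");
    assert_eq!(encode_partition_value("plain"), "plain");
}
```

Adding characters to such a list is backwards-sensitive: any character added to the encoded set changes the paths written for existing partition values, which is part of why the discussion above leans toward not embedding partition values in paths at all.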
Hey @natinimni, certainly would. Unfortunately, I also realized I may have made a mistake in a recent refactor w.r.t. path encoding. Essentially, we also need to take care of these lines: delta-rs/crates/core/src/kernel/snapshot/log_data.rs, lines 161 to 168 (at 467afc5).
If you are up for it, two things would be great.
Since you mention it, I am happy to help out if this is beyond what you were planning.
Thanks @roeap - I can certainly try to include these in a PR, but I would need some more context :) My motivation for encoding
Description
This PR does some more cleanup in how we handle encoding of file paths and partition values.
Specifically, the following should be fixed:
Building on the updates in #1613, this PR moves the encoding/decoding for the `path` fields in add / remove / cdc actions directly into serde, which covers all current code paths where we create actions. Last, we never explicitly encoded the values when creating a partition path, but let `object_store::Path` handle that for us, which looks a bit different from what Spark and PyArrow are doing - not saying they are doing the same thing :).

While debugging #1591, I think I came across some situations where I do not see a clear path to making this fully work. I.e. `object_store` is more restrictive in the characters it allows in a path than Spark's implementation is. While file paths are somewhat arbitrary in delta, we cannot process paths that contain characters object_store deems illegal. As a consequence - I think - there are certain paths that Spark will write that object_store will not process. Some example characters include: `}|<>~`.

Similarly, PyArrow's partitioning can produce paths that we cannot process - i.e. if a partition value contains some of the chars above. If we create the same table using the rust writer, we can load it again, but for some reason Spark cannot.

As a follow-up we should investigate a bit deeper. While we may not be able to cover all cases, I believe there is room for improvement by fine-tuning which characters we encode and where. When we do this we can also cover all of #1215 - at least I think we can.
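The key property of moving encoding/decoding into serde is that writing and reading a `path` stay symmetric: whatever the writer escapes, the reader must unescape. A minimal stdlib-only sketch of such a pair (hypothetical names and character set; in delta-rs this would be wired in via serde attributes on the action structs, which is not shown here):

```rust
// Sketch of a symmetric encode/decode pair for a `path` field.
// The escaped-character set is illustrative, not the actual delta-rs list.

fn encode_path(path: &str) -> String {
    path.chars()
        .map(|c| match c {
            // '%' must be in the set, otherwise decoding is ambiguous.
            '%' | ' ' | '}' | '|' | '<' | '>' | '~' => {
                format!("%{:02X}", c as u32)
            }
            _ => c.to_string(),
        })
        .collect()
}

fn decode_path(encoded: &str) -> String {
    // Assumes '%' is always followed by two ASCII hex digits,
    // which holds for anything produced by encode_path above.
    let bytes = encoded.as_bytes();
    let mut out = Vec::with_capacity(bytes.len());
    let mut i = 0;
    while i < bytes.len() {
        if bytes[i] == b'%' && i + 2 < bytes.len() {
            if let Ok(b) = u8::from_str_radix(&encoded[i + 1..i + 3], 16) {
                out.push(b);
                i += 3;
                continue;
            }
        }
        out.push(bytes[i]);
        i += 1;
    }
    String::from_utf8(out).unwrap()
}

fn main() {
    let original = "year=2023/value=a|b/part-0001.parquet";
    let encoded = encode_path(original);
    // Round-trip must be lossless for the table to stay readable.
    assert_eq!(decode_path(&encoded), original);
}
```

Hooking both halves into serialization is what guarantees every code path that creates or reads actions agrees on the escaping, instead of each writer calling an encode helper ad hoc.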
Related Issue(s)