-
Notifications
You must be signed in to change notification settings - Fork 421
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Implement Z-order sorting option in Optimize operation #1127
Comments
It'd be great if someone could grab this. Z ORDERing provides a "killer feature" functionality to end users. I think this functionality will get us a lot of users. |
Design doc for Z-order is here: https://docs.google.com/document/d/1TYFxAUvhtYqQ6IHAZXjliVuitA5D1u793PMnzsH_3vs/edit#heading=h.q7ah6pca24ul Note that DataFusion has a spilling sort node, which can sort arbitrarily large inputs: apache/datafusion#1568. We should try to reuse that if possible. However, we should still get optimize working in basic cases first. |
Here's the suggested interface for Z ORDER: Z Ordering allows users to fully enjoy the power of data skipping. Even a naive implementation that simply sorts the data would be really useful for the data community. We could release a naive algorithm to start and then swap in a more fancy implementation later. Even an implementation that simply sorts on one column would be a huge win right now. |
TODO:
z-order sort key function
|
I may be completely off here, as I have not yet gone too deep into how the transformation in z-order curves actually work. at least on a high level though I do believe that the arrow-row package - at least conceptually - does something very similar, in that it reduces the multi-columnar representation to one that can be operated on (and sorted) as a single scalar-like value. Main difference would be that it aims to sort lexicographically, rather then by some other means. Thus (and agian, haven't gone deep on that one) I wonder if we can take inspiration from the arrow row crate and "just" change the kernel that handles the reduction do one dimensional space? https://arrow.apache.org/blog/2022/11/07/multi-column-sorts-in-arrow-rust-part-2/ |
# Description Implements Z-order in Rust. This is a very basic version that requires loading the whole partition into memory for sorting. In the future, we can implement a DataFusion-based code path that allows sorting with spilling to disk for even larger optimize jobs. The Z-order function here is based on Arrow's row format. We truncate to take the first 16 bytes of data (padding with zeros at end as necessary). So for variable-width columns like strings, this means that we are only using the first 15 bytes of the string (the first byte is used to differentiate null and empty strings, [see row format docs](https://docs.rs/arrow-row/40.0.0/arrow_row/struct.RowConverter.html#variable-length-bytes-including-strings-encoding)). If a user has a string column where they all share the same prefix, this z-order function won't work well for them. But in many common cases it will work. We'll also expose this in Python as a follow up. # Related Issue(s) - closes #1127 # Documentation <!--- Share links to useful documentation ---> --------- Co-authored-by: Robert Pack <[email protected]>
Description
Would like to be able to z-order tables. We probably need to implement a z-order sorting function in apache/arrow-rs first.
Use Case
Related Issue(s)
Probably should prioritize #1125 first.
The text was updated successfully, but these errors were encountered: