[WEBSITE] Blog posts on multi-column sorting implementation #264
Thanks for opening a pull request! Could you open an issue for this pull request on JIRA? Then could you also rename pull request title in the following format?
Done a first pass, looking good
Co-authored-by: Raphael Taylor-Davies <[email protected]>
│ "Bar"    │ ───────────────▶│ 01  │
└──────────┘                 └─────┘
┌──────────┐                 ┌─────┬─────┐
│"Fabulous"│ ───────────────▶│ 01  │ 02  │
If "Bar" is `01` and "Fabulous" is `01 02`, how do you distinguish between both when you encounter a `01` byte?
Since the rows are variable length, each also has a length.
Thus, in this case, since the lengths are different (and the length is stored along with the row), "Bar" (`[01]`) is shorter and thus sorts before "Fabulous" (`[01, 02]`).
Perhaps @tustvold can confirm
We should probably make it clearer in the text that the row format includes a length as well
Edit: I was incorrect
We don't store lengths along with the rows. In the case of dictionary keys, they are stored null-terminated, which is how we are able to distinguish them.
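A minimal Python sketch of the null-termination point, for illustration only (this is not the arrow-rs API). It assumes, beyond what the comment states, that the terminator is a single `0x00` byte and that key bytes are never `0x00`; `encode_key` is a hypothetical helper.

```python
TERMINATOR = b"\x00"  # assumption: the terminator is a single zero byte


def encode_key(key: bytes) -> bytes:
    """Append the terminator to an order-preserving dictionary key."""
    if 0 in key:
        # assumption: key bytes are never zero, so the terminator is unambiguous
        raise ValueError("key bytes must be non-zero")
    return key + TERMINATOR


bar = encode_key(b"\x01")           # 01 00
fabulous = encode_key(b"\x01\x02")  # 01 02 00

# Python compares bytes as unsigned values left to right, like memcmp.
# The rows first differ at the second byte, where the terminator 00 < 02.
bar_first = bar < fabulous

# The data of a following column cannot change the outcome, because the
# comparison is already decided at the terminator.
still_first = bar + b"\xff" < fabulous + b"\x00"
```

Under these assumptions no stored length is needed: a shorter key that is a prefix of a longer one always compares lower at the position of its terminator.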
One detail we have so far ignored is how to support ascending and descending sorts (e.g. `ASC` or `DESC` in SQL). The Arrow Rust row format supports these options by simply inverting the bytes of the encoded representation, except the initial byte used for nullability encoding, on a per column basis.
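The byte inversion described in this paragraph can be sketched in Python as follows. This is an illustration, not the arrow-rs implementation: the `0x01` marker for a non-null value, the big-endian unsigned layout, and the `encode_u32` helper are all assumptions made for the example.

```python
VALID = b"\x01"  # assumption: marker byte for a non-null value


def encode_u32(value: int, descending: bool = False) -> bytes:
    """Encode a non-null unsigned 32-bit value as marker byte + big-endian bytes."""
    body = value.to_bytes(4, "big")
    if descending:
        # Invert only the value bytes; the leading nullability byte is untouched.
        body = bytes(b ^ 0xFF for b in body)
    return VALID + body


values = [7, 300, 2]
ascending = sorted(values, key=encode_u32)
descending = sorted(values, key=lambda v: encode_u32(v, descending=True))
```

Sorting by the plain encoding yields ascending order, and sorting by the inverted encoding yields descending order, using the same unsigned byte comparison in both cases.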
Similarly, supporting SQL compatible sorting also requires a format that can specify the order of `NULL`s (before or after all non `NULL` values). The row format supports this option by optionally encoding nulls as `0xFF` instead of `0x00` on a per column basis.
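A small Python sketch of the null-ordering option, under stated assumptions: non-null values are marked with `0x01`, a null is zero-padded to the column width, and `encode_u8` is a hypothetical helper rather than anything from arrow-rs.

```python
from typing import Optional


def encode_u8(value: Optional[int], nulls_first: bool = True) -> bytes:
    """Encode an optional unsigned 8-bit value as sentinel byte + value byte."""
    if value is None:
        sentinel = 0x00 if nulls_first else 0xFF
        return bytes([sentinel, 0x00])  # assumption: null rows are zero-padded
    return bytes([0x01, value])  # assumption: 0x01 marks a non-null value


column = [5, None, 0, 255]
nulls_first = sorted(column, key=encode_u8)
nulls_last = sorted(column, key=lambda v: encode_u8(v, nulls_first=False))
```

Because the sentinel is always the first byte of the column, the data values `0` and `255` (bytes `00` and `FF`) are read by position and are never mistaken for a null in this sketch.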
So you have to escape `00` and `FF` bytes in the input to make sure they aren't confused with NULLs, right?
Also, do you try to handle floating-point NaNs in a specific way?
Perhaps @tustvold can weigh in here
> So you have to escape 00 and FF bytes in the input to make sure they aren't confused with NULLs, right?

The encoding is designed so that this isn't necessary: at no point is it ambiguous whether a byte is part of a sentinel (e.g. null) or value data.

> do you try to handle floating-point NaNs in a specific way?

NaNs are ordered according to the IEEE 754 (2008 revision) total order predicate.
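One common way to get a byte encoding whose unsigned comparison matches the IEEE 754 total order is to flip all bits of negative values and only the sign bit of non-negative values. The Python sketch below illustrates that technique; whether arrow-rs uses exactly this transformation is an assumption, and `encode_bits` and `encode_f64` are hypothetical helpers.

```python
import struct

SIGN_BIT = 1 << 63
ALL_BITS = 0xFFFF_FFFF_FFFF_FFFF


def encode_bits(bits: int) -> bytes:
    """Map a raw 64-bit float pattern to bytes that sort in total order."""
    if bits & SIGN_BIT:
        bits ^= ALL_BITS   # negative: flip every bit, reversing the order
    else:
        bits ^= SIGN_BIT   # non-negative: set the sign bit to sort above negatives
    return bits.to_bytes(8, "big")


def encode_f64(value: float) -> bytes:
    """Encode a Python float via its big-endian IEEE 754 bit pattern."""
    (bits,) = struct.unpack(">Q", struct.pack(">d", value))
    return encode_bits(bits)


# Quiet NaN patterns given as raw bits to avoid platform-dependent NaN signs.
negative_nan = encode_bits(0xFFF8_0000_0000_0000)
positive_nan = encode_bits(0x7FF8_0000_0000_0000)
```

With this mapping, negative NaNs sort below negative infinity, `-0.0` sorts below `0.0`, and positive NaNs sort above positive infinity, which is the order the total order predicate defines.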
I added a sentence explaining this design in 7f89c31
Co-authored-by: Paddy Horan <[email protected]> Co-authored-by: Raphael Taylor-Davies <[email protected]>
OK, I think this blog post is basically ready to go from my perspective. I'll aim to publish on Monday, Nov 7, unless people would like to provide other comments.
Co-authored-by: Sutou Kouhei <[email protected]>
This blog post describes the row format introduced in apache/arrow-rs#2593.