[WEBSITE] Blog posts on multi-column sorting implementation #264

alamb · 2022-10-30T12:31:30Z

This blog post describes the row format introduced in apache/arrow-rs#2593.

github-actions · 2022-10-30T12:31:46Z

Thanks for opening a pull request!

Could you open an issue for this pull request on JIRA?
https://issues.apache.org/jira/browse/ARROW

Then could you also rename pull request title in the following format?

ARROW-${JIRA_ID}: [${COMPONENT}] ${SUMMARY}

tustvold

Done a first pass, looking good

pitrou · 2022-11-02T08:07:10Z

_posts/2022-10-30-multi-column-sorts-in-arrow-rust-part-2.md

+│  "Bar"   │ ───────────────▶│ 01  │
+└──────────┘                 └─────┘
+┌──────────┐                 ┌─────┬─────┐
+│"Fabulous"│ ───────────────▶│ 01  │ 02  │


If "Bar" is 01 and "Fabulous" is 01 02, how do you distinguish between both when you encounter a 01 byte?

alamb · 2022-11-02

~~Since the rows are variable length, each also has a length.~~

~~Thus, in this case, since the lengths are different (and the length is stored along with the row), "Bar" ([01]) is shorter and thus sorts before "Fabulous" ([01, 02]).~~

~~Perhaps @tustvold can confirm~~

~~We should probably make it clearer in the text that the row format includes a length as well~~

Edit: I was incorrect

tustvold · 2022-11-04

We don't store lengths along with the rows. In the case of dictionary keys, they are stored null-terminated, which is how we are able to distinguish the two.
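A minimal sketch of why a terminator is enough to resolve the ambiguity raised above. It assumes, purely for illustration, that each dictionary key is a run of non-zero bytes followed by a `0x00` terminator, with the next column's bytes appended directly after. `encode_row` and the trailing column bytes are hypothetical and are not the arrow-rs API.

```rust
/// Hypothetical helper: write a dictionary key (non-zero bytes only),
/// then a 0x00 terminator, then the bytes of the following column.
fn encode_row(key: &[u8], next_column: &[u8]) -> Vec<u8> {
    assert!(key.iter().all(|b| *b != 0), "key bytes must be non-zero");
    let mut row = Vec::with_capacity(key.len() + 1 + next_column.len());
    row.extend_from_slice(key);
    row.push(0x00); // terminator: smaller than any byte a key may contain
    row.extend_from_slice(next_column);
    row
}

fn main() {
    // "Bar" -> [01] and "Fabulous" -> [01, 02], as in the diagram.
    // The next-column bytes are chosen to work against the expected order.
    let bar = encode_row(&[0x01], &[0xFF]);
    let fabulous = encode_row(&[0x01, 0x02], &[0x00]);

    assert_eq!(bar, vec![0x01, 0x00, 0xFF]);
    assert_eq!(fabulous, vec![0x01, 0x02, 0x00, 0x00]);

    // Byte-wise comparison is decided at index 1 (0x00 < 0x02), so the
    // shorter key sorts first whatever the following column holds.
    assert!(bar < fabulous);
}
```

Because the terminator is smaller than every byte a key can contain, a key that is a prefix of another always compares lower, and the comparison never reads into the next column to decide.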

pitrou · 2022-11-02T08:12:51Z

_posts/2022-10-30-multi-column-sorts-in-arrow-rust-part-2.md

+
+One detail we have so far ignored over is how to support ascending and descending sorts (e.g. `ASC` or `DESC` in SQL). The Arrow Rust row format supports these options by simply inverting the bytes of the encoded representation, except the initial byte used for nullability encoding, on a per column basis.
+
+Similarly, supporting SQL compatible sorting also requires a format that can specify the order of `NULL`s (before or after all non `NULL` values). The row format supports this option by optionally encoding nulls as `0xFF` instead of `0x00` on a per column basis.


So you have to escape 00 and FF bytes in the input to make sure they aren't confused with NULLs, right?
Also, do you try to handle floating-point NaNs in a specific way?

alamb · 2022-11-02

Perhaps @tustvold can weigh in here

tustvold · 2022-11-04

> So you have to escape 00 and FF bytes in the input to make sure they aren't confused with NULLs, right?

The encoding is designed in such a way that this isn't necessary: at no point is it ambiguous whether a byte is part of a sentinel (e.g. null) or value data.
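One way to see why no escaping is needed, as a sketch under stated assumptions: a fixed-width `u32` column where the first byte is always the nullability byte (`0x01` for a valid value is an assumed marker; `0x00` or `0xFF` for null as in the quoted text) and the remaining four bytes are always value data. `SortOptions` and `encode_u32` are illustrative names, not the arrow-rs API.

```rust
/// Hypothetical per-column sort options (names are illustrative).
#[derive(Clone, Copy)]
struct SortOptions {
    descending: bool,
    nulls_first: bool,
}

/// Encode an optional u32 as 5 bytes: one nullability byte, four value bytes.
fn encode_u32(value: Option<u32>, opts: SortOptions) -> [u8; 5] {
    let mut out = [0u8; 5];
    match value {
        None => {
            // Null sentinel: 0x00 sorts before the valid marker, 0xFF after it.
            out[0] = if opts.nulls_first { 0x00 } else { 0xFF };
        }
        Some(v) => {
            out[0] = 0x01; // valid marker (assumed value), never inverted
            let mut bytes = v.to_be_bytes(); // big-endian keeps numeric order
            if opts.descending {
                for b in bytes.iter_mut() {
                    *b = !*b; // invert value bytes only
                }
            }
            out[1..].copy_from_slice(&bytes);
        }
    }
    out
}

fn main() {
    let opts = SortOptions { descending: true, nulls_first: false };
    let mut rows = vec![
        encode_u32(Some(0), opts),
        encode_u32(None, opts),
        encode_u32(Some(u32::MAX), opts),
        encode_u32(Some(255), opts),
    ];
    rows.sort(); // plain byte-wise comparison

    // Descending values, then the null last.
    assert_eq!(rows[0], encode_u32(Some(u32::MAX), opts));
    assert_eq!(rows[1], encode_u32(Some(255), opts));
    assert_eq!(rows[2], encode_u32(Some(0), opts));
    assert_eq!(rows[3], encode_u32(None, opts));
}
```

Value bytes such as `0x00` and `0xFF` do occur in the encoded data here, but only at positions 1 to 4. Position 0 is always the nullability byte, so a reader never has to guess which role a byte plays.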

> do you try to handle floating-point NaNs in a specific way?

NaNs are ordered according to the IEEE 754 (2008 revision) total order predicate.
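For reference, a common bit trick that yields this ordering under byte-wise comparison, shown as a sketch rather than as the code in the row format itself: flip every bit of negative values, flip only the sign bit of non-negative values, then write the result big-endian. `encode_f64` is a hypothetical name; `f64::total_cmp` from the Rust standard library is used only to check the result.

```rust
/// Map an f64 to 8 bytes whose byte-wise order matches IEEE 754 totalOrder.
fn encode_f64(value: f64) -> [u8; 8] {
    let bits = value.to_bits();
    // Negative values: flip all bits so larger magnitudes sort lower.
    // Non-negative values: flip only the sign bit so they sort above negatives.
    let mask = if bits >> 63 == 1 { u64::MAX } else { 1u64 << 63 };
    (bits ^ mask).to_be_bytes()
}

fn main() {
    let negative_nan = f64::from_bits(0xFFF8_0000_0000_0000);
    let positive_nan = f64::from_bits(0x7FF8_0000_0000_0000);
    let values = [
        positive_nan,
        1.5,
        -0.0,
        0.0,
        f64::NEG_INFINITY,
        negative_nan,
        f64::INFINITY,
        -2.0,
    ];

    // Sort once by the encoded bytes and once by the standard total order.
    let mut by_bytes = values;
    by_bytes.sort_by_key(|v| encode_f64(*v));
    let mut by_total = values;
    by_total.sort_by(|a, b| a.total_cmp(b));

    // Compare bit patterns, since NaN != NaN under ordinary equality.
    let a: Vec<u64> = by_bytes.iter().map(|v| v.to_bits()).collect();
    let b: Vec<u64> = by_total.iter().map(|v| v.to_bits()).collect();
    assert_eq!(a, b);
}
```

Under this ordering negative NaNs sort below negative infinity, `-0.0` sorts below `0.0`, and positive NaNs sort above positive infinity.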

alamb · 2022-11-04

I added a sentence explaining this design in 7f89c31.

alamb

Ok, I think this blog is basically ready to go from my perspective. I'll aim for a Monday, Nov 7 publish unless there are other comments people would like to provide.

[WEBSITE] Blog posts on multi-column sorting implementation

795bdb5

alamb requested a review from tustvold October 30, 2022 12:32

tustvold reviewed Oct 30, 2022

alamb and others added 7 commits October 31, 2022 08:55

Apply suggestions from code review from @tustvold

3edbe04

Co-authored-by: Raphael Taylor-Davies <[email protected]>

Update _posts/2022-10-30-multi-column-sorts-in-arrow-rust-part-1.md

1018510

Co-authored-by: Raphael Taylor-Davies <[email protected]>

fix: Use hex in signed integer example

c9820be

fix smart quots

0a3f115

Update example to use ascending sorts

bf7760b

Update example to use ascending sorts

ea989e3

Wordsmithing, diagram tweaks, add performance summary to introduction

578c271

tustvold approved these changes Nov 1, 2022

tustvold reviewed Nov 1, 2022

pitrou reviewed Nov 2, 2022

paddyhoran reviewed Nov 2, 2022

alamb and others added 2 commits November 2, 2022 14:55

Apply suggestions from code review

df7b21e

Co-authored-by: Paddy Horan <[email protected]> Co-authored-by: Raphael Taylor-Davies <[email protected]>

Update _posts/2022-10-30-multi-column-sorts-in-arrow-rust-part-1.md

b82344c

Co-authored-by: Raphael Taylor-Davies <[email protected]>

tustvold reviewed Nov 4, 2022

tustvold reviewed Nov 4, 2022

alamb and others added 3 commits November 4, 2022 11:36

Apply suggestions from code review

3798ff5

Co-authored-by: Raphael Taylor-Davies <[email protected]>

Apply suggestions from code review

69af2ad

Co-authored-by: Raphael Taylor-Davies <[email protected]>

Add sentence motivating why escaping is unecessary in row format

7f89c31

alamb commented Nov 4, 2022

whitespace engineering

105c2ac

tustvold approved these changes Nov 4, 2022

alamb and others added 2 commits November 5, 2022 05:48

Apply suggestions from code review

9570df7

Co-authored-by: Raphael Taylor-Davies <[email protected]>

Update _posts/2022-10-30-multi-column-sorts-in-arrow-rust-part-1.md

21d7dcf

Co-authored-by: Raphael Taylor-Davies <[email protected]>

kou reviewed Nov 5, 2022

alamb and others added 3 commits November 7, 2022 07:10

Apply suggestions from code review

0d0e178

Co-authored-by: Sutou Kouhei <[email protected]>

Update date to 2022-11-07

2c76e53

Apply final edits from @tustvold

d8d2b81

alamb merged commit 4920b06 into apache:master Nov 7, 2022

alamb deleted the alamb/multi-column-sorts-part-1 branch November 7, 2022 12:18

alamb mentioned this pull request Nov 7, 2022

Fix post date for Fast and Memory Efficient Multi-Column Sorts in Apache Arrow Rust, Part 2 #268

Merged
