-
Notifications
You must be signed in to change notification settings - Fork 33
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Order dictionary indicies to prevent 2x-50x size regression #94
Open
sdruzkin
wants to merge
1
commit into
facebookincubator:main
Choose a base branch
from
sdruzkin:export-D63964981
base: main
Could not load branches
Branch not found: {{ refName }}
Loading
Could not load tags
Nothing to show
Loading
Are you sure you want to change the base?
Some commits from the old base branch may be removed from the timeline,
and old review comments may become outdated.
Conversation
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
facebook-github-bot
added
the
CLA Signed
This label is managed by the Meta Open Source bot.
label
Oct 7, 2024
This pull request was exported from Phabricator. Differential Revision: D63964981 |
sdruzkin
added a commit
to sdruzkin/nimble
that referenced
this pull request
Oct 7, 2024
…incubator#94) Summary: I checked Vader [results](https://fburl.com/daiquery/4kk9jnm1) and found significant size differences between Nimble JNI and Velox rewrites for some files. After some debugging I learned that we almost never get exactly the same Nimble files. You can do the rewrite 100 times and get 100 files with different sizes, meaning that Nimble writer has non-deterministic behavior. For example in this example P1630232570 **rewriting the same file multiple times yields files with sizes from 112MB to 364MB**. After checking the encodings and enabling NIMBLE_ENCODING_SELECTION_DEBUG in the EncodingSelectionPolicy I **root-caused it to the dictionary encoding. All rewrites of the same files always had the same alphabet in different order, and that also caused dictionary indicies to be totally different.** **Random order of the dictionary alphabet was caused by usage of the absl::flat_hash_map for the Statistic.uniqueCounts**. After switching Statistic.uniqueCounts type to F14FastMap, which has consistent iteration order, different rewrites of the same files became consistent. After a bit of digging and comparing encoding selection logs I noticed that the **main size contributor in files with different sizes is the varint encoding of dictionary indicies**. Size of the encoded data depends on the encoded numeric value, the bigger it's the more bytes it needs in the encoded form. So, **I sorted indicies by the alphabet key frequency and put the most frequent keys first**, that would mean that the most frequent keys would get the smallest indicies. In case of the first 127 indicies we would only need 1 byte, for the next 16383 need 2 bytes, then 3, etc. **It consistently produced small files, but still with different sizes overall. I also did a reverse experiment, and it consistently produced large files.** # Samples Sorting alphabet by frequency would obviously impact writer performance, but IMHO it's worth it. In some examples the size regression is up to 580x (!!!), instead of 60MB files we can get a 35GB file for ifr_test_hive_table. Checkout https://docs.google.com/spreadsheets/d/1d7m--4x6e0YddyfTdJvjrkYz-pB11O3sBdNM8usxmZM/edit?usp=sharing. A list of corresponding files can be found here P1634630740. I did testing on some top sized NImble tables and some tables where Vader shows significant size differences between JNI and Velox. `Lil` is the version from the diff, `big` is the version with reverse sorted indicies. Some notable examples - top Nimble tables: P1634619561 Some notable examples - top size differences: P1634619725 # Other Notes 1 As you can see in the spreadsheet in some cases the original Nimble file is still much smaller, it means there are still some optimization and size fixing opportunities. May be sorting the alphabet would also lead to size decrease, or may be there is something else. # Compression It looks like we are not compressing Varint encoding, we should absolutely start doing that. Differential Revision: D63964981
sdruzkin
force-pushed
the
export-D63964981
branch
from
October 7, 2024 16:29
50b7a3d
to
cb989b8
Compare
This pull request was exported from Phabricator. Differential Revision: D63964981 |
…incubator#94) Summary: I checked Vader [results](https://fburl.com/daiquery/4kk9jnm1) and found significant size differences between Nimble JNI and Velox rewrites for some files. After some debugging I learned that we almost never get exactly the same Nimble files. You can do the rewrite 100 times and get 100 files with different sizes, meaning that Nimble writer has non-deterministic behavior. For example in this example P1630232570 **rewriting the same file multiple times yields files with sizes from 112MB to 364MB**. After checking the encodings and enabling NIMBLE_ENCODING_SELECTION_DEBUG in the EncodingSelectionPolicy I **root-caused it to the dictionary encoding. All rewrites of the same files always had the same alphabet in different order, and that also caused dictionary indicies to be totally different.** **Random order of the dictionary alphabet was caused by usage of the absl::flat_hash_map for the Statistic.uniqueCounts**. After switching Statistic.uniqueCounts type to F14FastMap, which has consistent iteration order, different rewrites of the same files became consistent. After a bit of digging and comparing encoding selection logs I noticed that the **main size contributor in files with different sizes is the varint encoding of dictionary indicies**. Size of the encoded data depends on the encoded numeric value, the bigger it's the more bytes it needs in the encoded form. So, **I sorted indicies by the alphabet key frequency and put the most frequent keys first**, that would mean that the most frequent keys would get the smallest indicies. In case of the first 127 indicies we would only need 1 byte, for the next 16383 need 2 bytes, then 3, etc. **It consistently produced small files, but still with different sizes overall. I also did a reverse experiment, and it consistently produced large files.** # Samples Sorting alphabet by frequency would obviously impact writer performance, but IMHO it's worth it. In some examples the size regression is up to 580x (!!!), instead of 60MB files we can get a 35GB file for ifr_test_hive_table. Checkout https://docs.google.com/spreadsheets/d/1d7m--4x6e0YddyfTdJvjrkYz-pB11O3sBdNM8usxmZM/edit?usp=sharing. A list of corresponding files can be found here P1634630740. I did testing on some top sized NImble tables and some tables where Vader shows significant size differences between JNI and Velox. `Lil` is the version from the diff, `big` is the version with reverse sorted indicies. Some notable examples - top Nimble tables: P1634619561 Some notable examples - top size differences: P1634619725 # Other Notes 1 As you can see in the spreadsheet in some cases the original Nimble file is still much smaller, it means there are still some optimization and size fixing opportunities. May be sorting the alphabet would also lead to size decrease, or may be there is something else. # Compression It looks like we are not compressing Varint encoding, we should absolutely start doing that. Differential Revision: D63964981
sdruzkin
force-pushed
the
export-D63964981
branch
from
October 8, 2024 05:06
cb989b8
to
fb1ed44
Compare
This pull request was exported from Phabricator. Differential Revision: D63964981 |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Summary:
I checked Vader results and found significant size differences between Nimble JNI and Velox rewrites for some files.
After some debugging I learned that we almost never get exactly the same Nimble files. You can do the rewrite 100 times and get 100 files with different sizes, meaning that Nimble writer has non-deterministic behavior. For example in this example P1630232570 rewriting the same file multiple times yields files with sizes from 112MB to 364MB.
After checking the encodings and enabling NIMBLE_ENCODING_SELECTION_DEBUG in the EncodingSelectionPolicy I root-caused it to the dictionary encoding. All rewrites of the same files always had the same alphabet in different order, and that also caused dictionary indicies to be totally different.
Random order of the dictionary alphabet was caused by usage of the absl::flat_hash_map for the Statistic.uniqueCounts. After switching Statistic.uniqueCounts type to F14FastMap, which has consistent iteration order, different rewrites of the same files became consistent.
After a bit of digging and comparing encoding selection logs I noticed that the main size contributor in files with different sizes is the varint encoding of dictionary indicies. Size of the encoded data depends on the encoded numeric value, the bigger it's the more bytes it needs in the encoded form.
So, I sorted indicies by the alphabet key frequency and put the most frequent keys first, that would mean that the most frequent keys would get the smallest indicies. In case of the first 127 indicies we would only need 1 byte, for the next 16383 need 2 bytes, then 3, etc. It consistently produced small files, but still with different sizes overall. I also did a reverse experiment, and it consistently produced large files.
Samples
Sorting alphabet by frequency would obviously impact writer performance, but IMHO it's worth it. In some examples the size regression is up to 580x (!!!), instead of 60MB files we can get a 35GB file for ifr_test_hive_table.
Checkout https://docs.google.com/spreadsheets/d/1d7m--4x6e0YddyfTdJvjrkYz-pB11O3sBdNM8usxmZM/edit?usp=sharing
I did testing on some top sized NImble tables and some tables where Vader shows significant size differences between JNI and Velox.
Lil
is the version from the diff,big
is the version with reverse sorted indicies.Some notable examples - top Nimble tables:
P1634619561
Some notable examples - top size differences:
P1634619725
Other Notes 1
As you can see in the spreadsheet in some cases the original Nimble file is still much smaller, it means there are still some optimization and size fixing opportunities. May be sorting the alphabet would also lead to size decrease, or may be there is something else.
Compression
It looks like we are not compressing Varint encoding, we should absolutely start doing that.
Differential Revision: D63964981