-
Notifications
You must be signed in to change notification settings - Fork 3.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
better FrontCodedIndexed #13854
better FrontCodedIndexed #13854
Conversation
I marked this PR WIP since I'm still working on improving the performance of
sizes:
|
changes: * Adds 'useIncrementalBuckets' option to 'frontCoded' string encoding strategy, which when set is fed to FrontCodedIndexedWriter when true writes out a v1 FrontCodedIndexed which stores buckets on a prefix of the previous value instead of the first value in the bucket * Default for 'useIncrementalBuckets' is set to true * Update tests to test both modes
29aab3f
to
4efbfdc
Compare
smooshDirGeneric = FileUtils.createTempDir(); | ||
fileGeneric = File.createTempFile("genericIndexedBenchmark", "meta"); | ||
|
||
smooshDirFrontCodedIncrementalBuckets = FileUtils.createTempDir(); | ||
fileFrontCodedIncrementalBuckets = File.createTempFile("frontCodedIndexedBenchmarkIncrementalBuckets", "meta"); |
Check warning
Code scanning / CodeQL
Local information disclosure in a temporary directory
getting closer 🚀
bucket size of 4 seems equivalent (or faster?), bucket size of 16 is ~15-20ns slower which is maybe ok given that it can be a lot more effective... |
slightly better by returning the values with prefix 0 directly:
|
Should something be added here: https://github.com/apache/druid/blob/master/web-console/src/druid-models/index-spec/index-spec.tsx#L28 here https://github.com/apache/druid/blob/master/web-console/src/druid-models/index-spec/index-spec.tsx#L48 and here https://github.com/apache/druid/blob/master/web-console/src/druid-models/index-spec/index-spec.tsx#L97 ? Or do you want to do that in a follow on PR? |
The numbers for v1 seem nice enough that I'm considering just removing the option to write v0, which means no changes required here, so lets hold off until I decide for sure on that. |
the numbers on v1 look good enough to me to remove the new option for writing v0 in favor of always writing v1, will update the PR description |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM and I also agree we should always write out incremental buckets.
What are your thoughts on when front-coded should be the default encoding? |
* This method uses this shared prefix length to skip more expensive byte by byte full value comparisons when | ||
* possible by comparing the shared prefix length with the prefix length of the fragment. Since the bucket is always | ||
* sorted, prefix lengths shrink as you progress to higher indexes, and we can use this to reason that a fragment | ||
* with a longer prefix length than the shared prefix will always sort before the value we are looking for, and values | ||
* which have a shorter prefix will always be greater than the value we are looking for, so we only need to do a | ||
* full comparison if the prefix length is the same |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oops, this javadoc needs adjusted, but i will just do this on a later PR
Unfortunately I think since I just bumped the version of this thing, at earliest it could be the version after the next release to make rollback to the previous version an option when upgrading. |
Description
Iterating on #12277 to continue experimentation, this PR introduces 'V1' of
FrontCodedIndexed
, the backing structure of thefrontCoded
string encoding strategy. This version stores values in buckets based on a prefix of the previous value instead of the first value in the bucket like v0. This improves overall storage size, particularly with using larger buckets.Internally, it adds a new
useIncrementalBuckets
option toFrontCodedIndexedWriter
, which is always set to true to cause it to write out the new version ofFrontCodedIndexed
. The flag remains so tests can still write V0, and V0 coexists with V1 inside of theFrontCodedIndexed
as well for backwards compatibility.Benchmarks
I think the benchmarks look good enough (or even better in many cases) to make this new mode the only mode for front-coded (which itself is still not the default).
Segment sizes
Query times
Indexed direct measurements
Indexed sizes
Key changed/added classes in this PR
FrontCodedIndexedWriter
FrontCodedIndexed
StringEncodingStrategy
This PR has: