Add dictionary builder #853
Merged
Functional and ok, but has failure modes.
klauspost force-pushed the dict-builder branch from 431d9fe to ff10202 (August 17, 2023 10:12)
Add "lazy" matching. Should probably be a skipping attempt instead.
Group output and use offset as secondary. Add step back for long hashes.
kodiakhq bot referenced this pull request in cloudquery/filetypes (Oct 1, 2023)
This PR contains the following updates:

| Package | Type | Update | Change |
|---|---|---|---|
| [github.com/klauspost/compress](https://togithub.com/klauspost/compress) | indirect | minor | `v1.16.7` -> `v1.17.0` |

---

### Release Notes

<details>
<summary>klauspost/compress (github.com/klauspost/compress)</summary>

### [`v1.17.0`](https://togithub.com/klauspost/compress/releases/tag/v1.17.0)

[Compare Source](https://togithub.com/klauspost/compress/compare/v1.16.7...v1.17.0)

#### What's Changed

- Add dictionary builder by [@klauspost](https://togithub.com/klauspost) in [https://github.com/klauspost/compress/pull/853](https://togithub.com/klauspost/compress/pull/853)
- Add xerial snappy read/writer by [@klauspost](https://togithub.com/klauspost) in [https://github.com/klauspost/compress/pull/838](https://togithub.com/klauspost/compress/pull/838)
- flate: Add limited window compression by [@klauspost](https://togithub.com/klauspost) in [https://github.com/klauspost/compress/pull/843](https://togithub.com/klauspost/compress/pull/843)
- s2: Do 2 overlapping match checks by [@klauspost](https://togithub.com/klauspost) in [https://github.com/klauspost/compress/pull/839](https://togithub.com/klauspost/compress/pull/839)
- flate: Add amd64 assembly matchlen by [@klauspost](https://togithub.com/klauspost) in [https://github.com/klauspost/compress/pull/837](https://togithub.com/klauspost/compress/pull/837)
- gzip: Copy bufio.Reader on Reset by [@thatguystone](https://togithub.com/thatguystone) in [https://github.com/klauspost/compress/pull/860](https://togithub.com/klauspost/compress/pull/860)
- zstd: Remove offset from bitReader by [@greatroar](https://togithub.com/greatroar) in [https://github.com/klauspost/compress/pull/854](https://togithub.com/klauspost/compress/pull/854)
- fse, huff0, zstd: Remove always-nil error returns by [@greatroar](https://togithub.com/greatroar) in [https://github.com/klauspost/compress/pull/857](https://togithub.com/klauspost/compress/pull/857)
- tests: unnecessary use of fmt.Sprintf by [@testwill](https://togithub.com/testwill) in [https://github.com/klauspost/compress/pull/836](https://togithub.com/klauspost/compress/pull/836)
- tests: Fix OSS fuzzer t.Run by [@klauspost](https://togithub.com/klauspost) in [https://github.com/klauspost/compress/pull/852](https://togithub.com/klauspost/compress/pull/852)
- tests: Use Go 1.21.x by [@klauspost](https://togithub.com/klauspost) in [https://github.com/klauspost/compress/pull/851](https://togithub.com/klauspost/compress/pull/851)

#### New Contributors

- [@testwill](https://togithub.com/testwill) made their first contribution in [https://github.com/klauspost/compress/pull/836](https://togithub.com/klauspost/compress/pull/836)
- [@thatguystone](https://togithub.com/thatguystone) made their first contribution in [https://github.com/klauspost/compress/pull/860](https://togithub.com/klauspost/compress/pull/860)

**Full Changelog**: klauspost/compress@v1.16.7...v1.17.0

</details>

---

### Configuration

📅 **Schedule**: Branch creation - "before 4am on the first day of the month" (UTC), Automerge - At any time (no schedule defined).

🚦 **Automerge**: Disabled by config. Please merge this manually once you are satisfied.

♻ **Rebasing**: Whenever PR becomes conflicted, or you tick the rebase/retry checkbox.

🔕 **Ignore**: Close this PR and you won't be reminded about this update again.

---

- [ ] If you want to rebase/retry this PR, check this box

---

This PR has been generated by [Renovate Bot](https://togithub.com/renovatebot/renovate).
Dictionary builder
This is an experimental dictionary builder for Zstandard, S2, LZ4 and more.
This diverges from the Zstandard dictionary builder, and may have some failure scenarios for very small or uniform inputs.
Dictionaries returned should all be valid, but if very little data is supplied, it may not be able to generate a dictionary.
With a large, diverse sample set, it will generate a dictionary that can compete with the Zstandard dictionary builder,
but for very similar data it will not be able to generate a dictionary that is as good.
Feedback is welcome.
Usage
First, gather a collection of samples.
These samples should be representative of the input data and should not contain any complete duplicates.
Only the beginning of each sample matters, so the rest can be truncated.
Input beyond roughly 64KB per sample no longer matters.
The commandline tool can do this truncation for you.
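When building a dictionary from code rather than through the command line tool, the truncation and duplicate removal described above have to be done by the caller. The following is a minimal sketch of one way to do that using only the Go standard library. The helper name `prepareSamples` and the constant `maxSampleLen` are hypothetical and not part of the package; the limit is set to 32768 to match the documented default of the `-max` option. Dropping empty samples and comparing samples after truncation are choices made for this sketch.

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

// maxSampleLen mirrors the documented default of the -max option.
// The constant itself is an assumption of this sketch.
const maxSampleLen = 32768

// prepareSamples truncates each sample to at most maxLen bytes, skips
// empty samples, and drops any sample whose truncated content exactly
// matches an earlier one. Input order is preserved.
func prepareSamples(raw [][]byte, maxLen int) [][]byte {
	seen := make(map[[sha256.Size]byte]struct{}, len(raw))
	out := make([][]byte, 0, len(raw))
	for _, b := range raw {
		if len(b) > maxLen {
			b = b[:maxLen]
		}
		if len(b) == 0 {
			continue
		}
		sum := sha256.Sum256(b)
		if _, dup := seen[sum]; dup {
			continue
		}
		seen[sum] = struct{}{}
		out = append(out, b)
	}
	return out
}

func main() {
	raw := [][]byte{
		[]byte(`{"user":"alice","role":"admin"}`),
		[]byte(`{"user":"bob","role":"guest"}`),
		[]byte(`{"user":"alice","role":"admin"}`), // exact duplicate
	}
	samples := prepareSamples(raw, maxSampleLen)
	fmt.Printf("kept %d of %d samples\n", len(samples), len(raw))
}
```

Hashing the truncated bytes means two samples that differ only past the truncation point are treated as duplicates, which seems reasonable here since only the truncated prefix would be indexed anyway.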
Command line
To install the command line tool run:
Collect the samples in a directory, for example `samples/`.
Then run the command line tool. Basic usage is just to pass the directory with the samples:
This will build a Zstandard dictionary and write it to `dictionary.bin` in the current folder.
The dictionary can be used with the Zstandard command line tool:
Options
The command line tool has a few options:
- `-format`: Output type. "zstd", "s2" or "raw". Default "zstd".
  Output a dictionary in Zstandard format, S2 format or raw bytes.
  The raw bytes can be used with Deflate, LZ4, etc.
- `-hash`: Hash bytes match length. Minimum match length. Must be 4-8 (inclusive). Default 6.
  The hash bytes are used to define the shortest matches to look for.
  Shorter matches can generate a more fractured dictionary with less compression, but can for certain inputs be better.
  Usually lengths around 6-8 are best.
- `-len`: Specify custom output size. Default 114688.
- `-max`: Max input length to index per input file. Default 32768. All inputs are truncated to this.
- `-o`: Output name. Default `dictionary.bin`.
- `-q`: Do not print progress.
- `-dictID`: zstd dictionary ID. 0 will be random. Default 0.
- `-zcompat`: Generate dictionary compatible with zstd 1.5.5 and older. Default false.
- `-zlevel`: Zstandard compression level.
  The Zstandard compression level to use when compressing the samples.
  The dictionary will be built using the specified encoder level,
  which will reflect speed and make the dictionary tailored for that level.
  Default will use level 4 (best).
  Valid values are 1-4, where 1 = fastest, 2 = default, 3 = better, 4 = best.
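To make the documented defaults and ranges concrete, here is a hedged sketch of how the options listed above could be declared with Go's standard `flag` package. It is not taken from the tool's source: the `config` struct, its field names, the `parseConfig` function and the choice to reject out-of-range values with an error are all assumptions made for illustration. Only the flag names, defaults and valid ranges come from the list above.

```go
package main

import (
	"flag"
	"fmt"
	"io"
)

// config holds the documented options. The struct and field names are
// hypothetical and exist only for this sketch.
type config struct {
	format  string
	hash    int
	length  int
	max     int
	out     string
	quiet   bool
	dictID  uint
	zcompat bool
	zlevel  int
}

// parseConfig declares the documented flags with their documented
// defaults, parses args, and checks the documented value ranges.
func parseConfig(args []string) (config, error) {
	var c config
	fs := flag.NewFlagSet("dict-sketch", flag.ContinueOnError)
	fs.SetOutput(io.Discard) // report errors to the caller instead of printing
	fs.StringVar(&c.format, "format", "zstd", "output type: zstd, s2 or raw")
	fs.IntVar(&c.hash, "hash", 6, "minimum match length, 4-8 inclusive")
	fs.IntVar(&c.length, "len", 114688, "output size")
	fs.IntVar(&c.max, "max", 32768, "max input length to index per file")
	fs.StringVar(&c.out, "o", "dictionary.bin", "output name")
	fs.BoolVar(&c.quiet, "q", false, "do not print progress")
	fs.UintVar(&c.dictID, "dictID", 0, "zstd dictionary ID, 0 is random")
	fs.BoolVar(&c.zcompat, "zcompat", false, "compatible with zstd 1.5.5 and older")
	fs.IntVar(&c.zlevel, "zlevel", 4, "zstd level, 1 (fastest) to 4 (best)")
	if err := fs.Parse(args); err != nil {
		return c, err
	}
	switch c.format {
	case "zstd", "s2", "raw":
	default:
		return c, fmt.Errorf("unknown format %q", c.format)
	}
	if c.hash < 4 || c.hash > 8 {
		return c, fmt.Errorf("hash must be 4-8, got %d", c.hash)
	}
	if c.zlevel < 1 || c.zlevel > 4 {
		return c, fmt.Errorf("zlevel must be 1-4, got %d", c.zlevel)
	}
	return c, nil
}

func main() {
	c, err := parseConfig([]string{"-format", "raw", "-hash", "8"})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("format=%s hash=%d len=%d max=%d out=%s zlevel=%d\n",
		c.format, c.hash, c.length, c.max, c.out, c.zlevel)
}
```

Options that are not passed keep their documented defaults, so the call in `main` only overrides the format and the hash length.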
Library
The `github.com/klauspost/compress/dict` package can be used to build dictionaries in code.
The caller must supply a collection of (pre-truncated) samples, and the options to use.
The options largely correspond to the command line options.
There are similar functions for S2 and raw dictionaries (`BuildS2Dict` and `BuildRawDict`).