Allow splitting in rechunking #865
Merged
What is the problem / what does the code in this PR do
Previously, "rechunking" in strax was equivalent to merging. Say data type A depends on B. When a chunk of B is very large, the corresponding chunk of A will inevitably be large too, possibly even larger than `target_size_mb`, so `target_size_mb` has no effect.

This PR allows chunks to be split. It ALSO allows the chunks of superruns to be split, which might cause inconsistency, so I suggest a minor or even major version bump.
TODOs:
- Use `strax.chunk.DEFAULT_CHUNK_SPLIT_NS` instead of the hardcoded `safe_break_in_pulses`.

Can you briefly describe how it works?
Several things were changed to make this happen:
- `Rechunker.receive` and `Rechunker.flush`, because the splitting happens there.
- `Saver.save_from`, to accept a list of chunks from `Rechunker.receive` and `Rechunker.flush`.
- `SaverSpy._save_chunk`, to accept a list of chunks from `Rechunker.receive`.
- `_split_subruns_in_chunk`, to split the subrun information.
- `DEFAULT_CHUNK_SPLIT_NS` (default: 1000, from straxen), which is the required gap between items when splitting. Strictly speaking, only `raw_records` needs it.

Both splitting and merging happen, so that the size of each chunk stays close to `target_size_mb`.
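
The combination of merging and splitting described above can be sketched in plain Python. This is only an illustration under stated assumptions, not the strax implementation: `ToyRechunker`, `find_split_indices`, `MIN_GAP_NS`, and `ITEM_BYTES` are hypothetical names, lists of `(time, endtime)` tuples stand in for strax's structured arrays, and every item is assumed to have the same size.

```python
from itertools import accumulate

MIN_GAP_NS = 1000  # stands in for DEFAULT_CHUNK_SPLIT_NS (value quoted above)
ITEM_BYTES = 16    # hypothetical fixed size of one item


def find_split_indices(items, target_bytes, min_gap_ns=MIN_GAP_NS):
    """Return indices i at which items may be cut into items[:i] and items[i:].

    items is a time-sorted list of (time, endtime) tuples. A cut before item i
    is safe only if item i starts at least min_gap_ns after all earlier items ended.
    """
    # Running maximum of endtime, so long items overlapping later ones are respected
    latest_end = list(accumulate((end for _, end in items), max))
    splits, start = [], 0
    for i in range(1, len(items)):
        gap = items[i][0] - latest_end[i - 1]
        big_enough = (i - start) * ITEM_BYTES >= target_bytes
        if gap >= min_gap_ns and big_enough:
            splits.append(i)
            start = i
    return splits


class ToyRechunker:
    """receive() and flush() both return a list of chunks (possibly empty)."""

    def __init__(self, target_bytes):
        self.target_bytes = target_bytes
        self.buffer = []

    def receive(self, items):
        # Merge: prepend whatever was still buffered from earlier calls
        items = self.buffer + list(items)
        if len(items) * ITEM_BYTES < self.target_bytes:
            self.buffer = items
            return []
        # Split: cut at safe gaps and keep the trailing piece buffered
        bounds = [0] + find_split_indices(items, self.target_bytes) + [len(items)]
        pieces = [items[a:b] for a, b in zip(bounds, bounds[1:])]
        self.buffer = pieces[-1]
        return pieces[:-1]

    def flush(self):
        # Emit the remainder, if any, as the final chunk
        out, self.buffer = ([self.buffer] if self.buffer else []), []
        return out


times = [0, 20, 40, 5000, 5020, 5040, 10000, 10020, 12000, 12020]
items = [(t, t + 10) for t in times]
rechunker = ToyRechunker(target_bytes=3 * ITEM_BYTES)
chunks = rechunker.receive(items) + rechunker.flush()
print([len(c) for c in chunks])  # prints [3, 3, 4]
```

Because `receive` may now produce zero, one, or several chunks per call, any caller in this sketch has to loop over the returned list, which mirrors why the savers above were changed to accept a list of chunks.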
.Can you give a minimal working example (or illustrate with a figure)?
By running:
You will get split chunks in `./strax_test_data_split/012882-raw_records-z7q2d2ye2t`; in `raw_records-z7q2d2ye2t-metadata.json`:

To test whether they are the same:
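
Such a check amounts to concatenating the chunks of each version and comparing the results. A minimal sketch of that idea, with plain Python lists standing in for the stored arrays; `same_data` and the sample items are hypothetical, and this is not the snippet from the PR:

```python
def same_data(chunks_a, chunks_b):
    """True if two chunkings hold the same items in the same order."""
    flat_a = [item for chunk in chunks_a for item in chunk]
    flat_b = [item for chunk in chunks_b for item in chunk]
    return flat_a == flat_b


# Hypothetical (time, endtime) items, stored once as a single chunk and once split
unsplit = [[(0, 10), (20, 30), (5000, 5010), (5020, 5030)]]
split = [[(0, 10), (20, 30)], [(5000, 5010), (5020, 5030)]]
print(same_data(unsplit, split))  # True
```

With real data, the same comparison would be done on the arrays loaded from the two storage locations rather than on hand-written lists.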
Please include the following if applicable:
Please make sure that all automated tests have passed before asking for a review (you can save the PR as a draft otherwise).