Allow splitting in rechunking #865

Merged: 14 commits merged into master from rechunk_split on Aug 16, 2024
Conversation

@dachengx (Collaborator) commented Aug 7, 2024

What is the problem / what does the code in this PR do

Previously, "rechunk" in strax was equivalent to merging. Say data_type A depends on B. When a chunk of B is very large, the corresponding chunk of A will inevitably be large as well, possibly even larger than target_size_mb, which makes target_size_mb ineffective.

This PR allows chunks to be split. It also allows the chunks of superruns to be split, which might cause inconsistency, so I suggest a minor or even a major version bump.

TODOs:

  • More tests are needed to validate that the results are the same before and after this PR.
  • Change straxen to use strax.chunk.DEFAULT_CHUNK_SPLIT_NS instead of the hardcoded safe_break_in_pulses.

Can you briefly describe how it works?

Several things were changed to make this happen:

  1. Rechunker.receive and Rechunker.flush now return a list of chunks, because splitting can happen (see the sketch after this list).
  2. Saver.save_from now accepts a list of chunks from Rechunker.receive and Rechunker.flush.
  3. SaverSpy._save_chunk now accepts a list of chunks from Rechunker.receive.
  4. A new function _split_subruns_in_chunk splits the subrun information of a chunk accordingly.
  5. A new variable DEFAULT_CHUNK_SPLIT_NS (default: 1000 ns, taken from straxen) sets the minimum gap required between items at a split point. Strictly speaking, this is only needed for raw_records.
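
A minimal sketch of points 1 to 3 (hypothetical code, not the actual strax Saver implementation; the save_chunk callback and the loop structure are placeholders for illustration):

def save_stream(rechunker, save_chunk, source):
    # `rechunker` follows the Rechunker.receive/flush interface described
    # above; `save_chunk(chunk, chunk_i)` stands in for the saver backend.
    chunk_i = 0
    for chunk in source:
        # receive() may now return zero, one, or several chunks,
        # depending on whether data were buffered, merged, or split
        for out in rechunker.receive(chunk):
            save_chunk(out, chunk_i)
            chunk_i += 1
    # flush() returns whatever is still buffered at the end of the stream
    for out in rechunker.flush():
        save_chunk(out, chunk_i)
        chunk_i += 1
    return chunk_i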

Both splitting and merging can happen, to make sure that the size of each chunk stays close to target_size_mb.
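
To illustrate the role of DEFAULT_CHUNK_SPLIT_NS, here is a minimal sketch of the gap rule (an illustration under assumptions, not the strax implementation; computing the end time from the time, length, and dt fields is the usual strax record convention):

import numpy as np

DEFAULT_CHUNK_SPLIT_NS = 1_000  # ns; default value mentioned above

def allowed_split_indices(data, min_gap_ns=DEFAULT_CHUNK_SPLIT_NS):
    # Indices i at which `data` could be split into data[:i] and data[i:].
    # A split is only allowed where the gap between consecutive items is at
    # least `min_gap_ns`, so that no item straddles the chunk boundary.
    # `data` is assumed to be a time-sorted structured array with strax-style
    # `time`, `length`, and `dt` fields (e.g. raw_records).
    endtime = data["time"] + data["length"] * data["dt"]
    gaps = data["time"][1:] - endtime[:-1]
    return np.nonzero(gaps >= min_gap_ns)[0] + 1

Presumably, when no index satisfies the gap requirement but the chunk is still larger than target_size_mb, the CannotSplit error mentioned in the example below is raised.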

Can you give a minimal working example (or illustrate with a figure)?

By running:

import os

import strax
import straxen
from straxen.test_utils import nt_test_run_id


run_id = nt_test_run_id
st = straxen.test_utils.nt_test_context()

source_directory = os.path.join(
    st.get_source_sf(nt_test_run_id, 'raw_records')[0].path,
    str(st.key_for(nt_test_run_id, 'raw_records')),
)

strax.rechunker(
    source_directory=source_directory,
    dest_directory=source_directory.replace('strax_test_data', 'strax_test_data_split'),
    replace=False,
    target_size_mb=3,  # when setting `target_size_mb` to 1, `CannotSplit` will occur
    parallel=False,
)

You will get split chunks in ./strax_test_data_split/012882-raw_records-z7q2d2ye2t:

total 4.4M
drwxrwxr-x 2 xudc xudc 4.0K Aug  8 14:46 .
drwxrwxr-x 3 xudc xudc 4.0K Aug  8 14:46 ..
-rw-rw-r-- 1 xudc xudc 1.6M Aug  8 14:46 raw_records-z7q2d2ye2t-000000
-rw-rw-r-- 1 xudc xudc 1.5M Aug  8 14:46 raw_records-z7q2d2ye2t-000001
-rw-rw-r-- 1 xudc xudc 1.5M Aug  8 14:46 raw_records-z7q2d2ye2t-000002
-rw-rw-r-- 1 xudc xudc 2.4K Aug  8 14:46 raw_records-z7q2d2ye2t-metadata.json

In raw_records-z7q2d2ye2t-metadata.json:

"chunks": [
    {
        "chunk_i": 0,
        "end": 1874999060,
        "filename": "raw_records-z7q2d2ye2t-000000",
        "filesize": 1611567,
        "first_endtime": 125000600,
        "first_time": 124999590,
        "last_endtime": 1625812250,
        "last_time": 1625811240,
        "n": 12278,
        "nbytes": 2995832,
        "run_id": "012882",
        "start": 124900000,
        "subruns": null
    },
    {
        "chunk_i": 1,
        "end": 3376217870,
        "filename": "raw_records-z7q2d2ye2t-000001",
        "filesize": 1471147,
        "first_endtime": 1875000660,
        "first_time": 1874999560,
        "last_endtime": 3375000880,
        "last_time": 3375000850,
        "n": 11002,
        "nbytes": 2684488,
        "run_id": "012882",
        "start": 1874999060,
        "subruns": null
    },
    {
        "chunk_i": 2,
        "end": 4876677420,
        "filename": "raw_records-z7q2d2ye2t-000002",
        "filesize": 1470734,
        "first_endtime": 3376219380,
        "first_time": 3376218370,
        "last_endtime": 4876676730,
        "last_time": 4876676080,
        "n": 11165,
        "nbytes": 2724260,
        "run_id": "012882",
        "start": 3376217870,
        "subruns": null
    }
],

To test whether they are the same:

import numpy as np

import strax

raw_records = strax.dry_load_files('./strax_test_data/012882-raw_records-z7q2d2ye2t')
raw_records_split = strax.dry_load_files('./strax_test_data_split/012882-raw_records-z7q2d2ye2t')

for name in raw_records.dtype.names:
    assert np.all(raw_records[name] == raw_records_split[name])
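
As an additional sanity check (hypothetical code, not part of the PR), the chunk boundaries recorded in the new metadata should remain contiguous, i.e. each chunk starts exactly where the previous one ends, as in the metadata excerpt above:

import json

# Load the metadata written for the split chunks (path from the example above)
with open(
    './strax_test_data_split/012882-raw_records-z7q2d2ye2t/'
    'raw_records-z7q2d2ye2t-metadata.json'
) as f:
    metadata = json.load(f)

# Each chunk must start exactly where the previous one ended
chunks = metadata['chunks']
for previous, current in zip(chunks[:-1], chunks[1:]):
    assert previous['end'] == current['start']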

Please include the following if applicable:

  • Update the docstring(s)
  • Update the documentation
  • Tests to check the (new) code is working as desired.
  • Does it solve one of the open issues on GitHub?

Please make sure that all automated tests have passed before asking for a review (you can save the PR as a draft otherwise).

@coveralls commented Aug 7, 2024

Coverage Status

Coverage: 89.567% (-0.2%) from 89.762% when pulling 5206a3d on rechunk_split into 6f15645 on master.

@dachengx dachengx marked this pull request as ready for review August 7, 2024 13:49
@dachengx dachengx requested a review from MerzJohannes August 7, 2024 14:14
@dachengx dachengx requested review from WenzDaniel and yuema137 August 7, 2024 14:24
@dachengx dachengx merged commit a38d09e into master Aug 16, 2024
8 checks passed
@dachengx dachengx deleted the rechunk_split branch August 16, 2024 04:16