Allow splitting in rechunking #865

Merged: 14 commits merged into master from rechunk_split on Aug 16, 2024
Conversation

@dachengx (Collaborator) commented Aug 7, 2024

What is the problem / what does the code in this PR do

Previously, "rechunk" in strax was equivalent to merging. Say data_type A depends on B. When a chunk of B is very large, the corresponding chunk of A will inevitably be large as well, possibly even larger than target_size_mb, which makes target_size_mb ineffective.

This PR allows chunks to be split. It also allows the chunks of superruns to be split, which might cause inconsistency, so I suggest a minor or even a major version bump.

TODOs:

  • More tests are needed to validate that the results are the same before and after this PR.
  • Change straxen to use strax.chunk.DEFAULT_CHUNK_SPLIT_NS instead of the hardcoded safe_break_in_pulses.

Can you briefly describe how it works?

Several things were changed to make this happen:

  1. Rechunker.receive and Rechunker.flush now return a list of chunks, because splitting can happen (see the sketch after this list).
  2. Saver.save_from now accepts a list of chunks from Rechunker.receive and Rechunker.flush.
  3. SaverSpy._save_chunk now accepts a list of chunks from Rechunker.receive.
  4. A new function _split_subruns_in_chunk splits the subrun information of a chunk accordingly.
  5. A new variable DEFAULT_CHUNK_SPLIT_NS (default: 1000 ns, taken from straxen) sets the minimum gap required between items at a split point. Strictly speaking, this is only needed for raw_records.
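
A minimal sketch of points 1 to 3 (hypothetical code, not the actual strax Saver implementation; the save_chunk callback and the loop structure are placeholders for illustration):

def save_stream(rechunker, save_chunk, source):
    # `rechunker` follows the Rechunker.receive/flush interface described
    # above; `save_chunk(chunk, chunk_i)` stands in for the saver backend.
    chunk_i = 0
    for chunk in source:
        # receive() may now return zero, one, or several chunks,
        # depending on whether data were buffered, merged, or split
        for out in rechunker.receive(chunk):
            save_chunk(out, chunk_i)
            chunk_i += 1
    # flush() returns whatever is still buffered at the end of the stream
    for out in rechunker.flush():
        save_chunk(out, chunk_i)
        chunk_i += 1
    return chunk_i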

Both splitting and merging can happen, to make sure that the size of each chunk stays close to target_size_mb.
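
To illustrate the role of DEFAULT_CHUNK_SPLIT_NS, here is a minimal sketch of the gap rule (an illustration under assumptions, not the strax implementation; computing the end time from the time, length, and dt fields is the usual strax record convention):

import numpy as np

DEFAULT_CHUNK_SPLIT_NS = 1_000  # ns; default value mentioned above

def allowed_split_indices(data, min_gap_ns=DEFAULT_CHUNK_SPLIT_NS):
    # Indices i at which `data` could be split into data[:i] and data[i:].
    # A split is only allowed where the gap between consecutive items is at
    # least `min_gap_ns`, so that no item straddles the chunk boundary.
    # `data` is assumed to be a time-sorted structured array with strax-style
    # `time`, `length`, and `dt` fields (e.g. raw_records).
    endtime = data["time"] + data["length"] * data["dt"]
    gaps = data["time"][1:] - endtime[:-1]
    return np.nonzero(gaps >= min_gap_ns)[0] + 1

Presumably, when no index satisfies the gap requirement but the chunk is still larger than target_size_mb, the CannotSplit error mentioned in the example below is raised.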

Can you give a minimal working example (or illustrate with a figure)?

By running:

import os

import strax
import straxen
from straxen.test_utils import nt_test_run_id


run_id = nt_test_run_id
st = straxen.test_utils.nt_test_context()

source_directory = os.path.join(
    st.get_source_sf(nt_test_run_id, 'raw_records')[0].path,
    str(st.key_for(nt_test_run_id, 'raw_records')),
)

strax.rechunker(
    source_directory=source_directory,
    dest_directory=source_directory.replace('strax_test_data', 'strax_test_data_split'),
    replace=False,
    target_size_mb=3,  # when setting `target_size_mb` to 1, `CannotSplit` will occur
    parallel=False,
)

You will get split chunks in ./strax_test_data_split/012882-raw_records-z7q2d2ye2t:

total 4.4M
drwxrwxr-x 2 xudc xudc 4.0K Aug  8 14:46 .
drwxrwxr-x 3 xudc xudc 4.0K Aug  8 14:46 ..
-rw-rw-r-- 1 xudc xudc 1.6M Aug  8 14:46 raw_records-z7q2d2ye2t-000000
-rw-rw-r-- 1 xudc xudc 1.5M Aug  8 14:46 raw_records-z7q2d2ye2t-000001
-rw-rw-r-- 1 xudc xudc 1.5M Aug  8 14:46 raw_records-z7q2d2ye2t-000002
-rw-rw-r-- 1 xudc xudc 2.4K Aug  8 14:46 raw_records-z7q2d2ye2t-metadata.json

In raw_records-z7q2d2ye2t-metadata.json:

"chunks": [
    {
        "chunk_i": 0,
        "end": 1874999060,
        "filename": "raw_records-z7q2d2ye2t-000000",
        "filesize": 1611567,
        "first_endtime": 125000600,
        "first_time": 124999590,
        "last_endtime": 1625812250,
        "last_time": 1625811240,
        "n": 12278,
        "nbytes": 2995832,
        "run_id": "012882",
        "start": 124900000,
        "subruns": null
    },
    {
        "chunk_i": 1,
        "end": 3376217870,
        "filename": "raw_records-z7q2d2ye2t-000001",
        "filesize": 1471147,
        "first_endtime": 1875000660,
        "first_time": 1874999560,
        "last_endtime": 3375000880,
        "last_time": 3375000850,
        "n": 11002,
        "nbytes": 2684488,
        "run_id": "012882",
        "start": 1874999060,
        "subruns": null
    },
    {
        "chunk_i": 2,
        "end": 4876677420,
        "filename": "raw_records-z7q2d2ye2t-000002",
        "filesize": 1470734,
        "first_endtime": 3376219380,
        "first_time": 3376218370,
        "last_endtime": 4876676730,
        "last_time": 4876676080,
        "n": 11165,
        "nbytes": 2724260,
        "run_id": "012882",
        "start": 3376217870,
        "subruns": null
    }
],

To test whether they are the same:

import numpy as np

import strax

raw_records = strax.dry_load_files('./strax_test_data/012882-raw_records-z7q2d2ye2t')
raw_records_split = strax.dry_load_files('./strax_test_data_split/012882-raw_records-z7q2d2ye2t')

for name in raw_records.dtype.names:
    assert np.all(raw_records[name] == raw_records_split[name])
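
As an additional sanity check (hypothetical code, not part of the PR), the chunk boundaries recorded in the new metadata should remain contiguous, i.e. each chunk starts exactly where the previous one ends, as in the metadata excerpt above:

import json

# Load the metadata written for the split chunks (path from the example above)
with open(
    './strax_test_data_split/012882-raw_records-z7q2d2ye2t/'
    'raw_records-z7q2d2ye2t-metadata.json'
) as f:
    metadata = json.load(f)

# Each chunk must start exactly where the previous one ended
chunks = metadata['chunks']
for previous, current in zip(chunks[:-1], chunks[1:]):
    assert previous['end'] == current['start']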

Please include the following if applicable:

  • Update the docstring(s)
  • Update the documentation
  • Tests to check the (new) code is working as desired.
  • Does it solve one of the open issues on GitHub?

Please make sure that all automated tests have passed before asking for a review (you can save the PR as a draft otherwise).

@coveralls commented Aug 7, 2024

Coverage Status

Coverage: 89.567% (-0.2%) from 89.762% when pulling 5206a3d on rechunk_split into 6f15645 on master.

@dachengx dachengx marked this pull request as ready for review August 7, 2024 13:49
@dachengx dachengx requested a review from MerzJohannes August 7, 2024 14:14
@dachengx dachengx requested review from WenzDaniel and yuema137 August 7, 2024 14:24
@dachengx dachengx merged commit a38d09e into master Aug 16, 2024
8 checks passed
@dachengx dachengx deleted the rechunk_split branch August 16, 2024 04:16