Include `chunk_number` in lineage: Per chunk storage #863

dachengx · 2024-08-04T16:37:12Z

What is the problem / what does the code in this PR do

The chunk_number was passed to Context.get_iter to only load a specific chunk. But the previous implementation has weaknesses:

The chunk_number is not tracked by lineage so if you process twice with a different chunk_number, though the lineage is the same, the results are not.
You cannot specify which data_type a chunk number applies to.

This PR makes sure that the chunk_number is tracked by lineage, and that you can specify which chunks to load for each data_type.
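
To illustrate why tracking chunk_number in the lineage matters, here is a minimal sketch, not the strax implementation. The helpers `build_lineage` and `lineage_hash`, the plugin name, and the `threshold` option are all hypothetical; the only point is that once chunk_number sits in the hashed configuration, two runs with different chunk selections no longer share a hash.

```python
import hashlib
import json
from typing import Dict, List, Optional


def lineage_hash(lineage: dict) -> str:
    """Return a short deterministic hash of a JSON-serializable lineage."""
    payload = json.dumps(lineage, sort_keys=True).encode()
    return hashlib.sha1(payload).hexdigest()[:10]


def build_lineage(
    plugin_name: str,
    version: str,
    config: dict,
    chunk_number: Optional[Dict[str, List[int]]] = None,
) -> dict:
    """Toy lineage (hypothetical layout): plugin name, version and config."""
    config = dict(config)  # copy so the caller's dict is not modified
    if chunk_number is not None:
        # Storing chunk_number as a config entry makes it part of the hash.
        config["chunk_number"] = chunk_number
    return {"peaklets": [plugin_name, version, config]}


full = lineage_hash(build_lineage("Peaklets", "1.0.0", {"threshold": 5}))
only_0 = lineage_hash(
    build_lineage("Peaklets", "1.0.0", {"threshold": 5}, {"raw_records": [0]})
)
only_1 = lineage_hash(
    build_lineage("Peaklets", "1.0.0", {"threshold": 5}, {"raw_records": [1]})
)
```

In this sketch `full`, `only_0` and `only_1` are three distinct hashes, so data produced from chunk 0 alone can never be mistaken for the fully processed data_type.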

This PR follows the plan at https://xe1t-wiki.lngs.infn.it/doku.php?id=xenon:xenonnt:analysis:analysis_tools_team:sr2_processing. Later, we will use this PR to make outsource more compatible with strax.

Depends on #856

Can you briefly describe how it works?

Functionality not implemented:

The chunk_number will not be passed to run selection, because we will not send the metadata including chunk_number to DB.
Change Context.__add_lineage_to_plugin to add chunk_number as a configuration of Plugin.
chunk_number is passed to functions as the argument chunk_number: ty.Optional[ty.Dict[str, ty.List[int]]] = None, e.g. it can be {"raw_records": [0, 1]} or {"peaklets": [1], "lone_hits": [0]}. The latter example shows the functionality this PR adds: choosing chunks per data_type.
Add a function Context.merge_per_chunk_storage to combine the per chunk storage into normal storage (where chunk_number is not a configuration of Plugin).
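
The argument format described above could be checked with something like the following sketch. `check_chunk_number` is a hypothetical helper, not the check added in this PR, and rejecting empty lists and negative numbers is an assumption made for illustration.

```python
from typing import Dict, List, Optional


def check_chunk_number(chunk_number: Optional[Dict[str, List[int]]]) -> None:
    """Raise ValueError unless chunk_number is None or a {data_type: [ints]} dict."""
    if chunk_number is None:
        return  # None means: load all chunks of every data_type
    if not isinstance(chunk_number, dict):
        raise ValueError(f"chunk_number must be a dict, got {type(chunk_number)}")
    for data_type, numbers in chunk_number.items():
        if not isinstance(data_type, str):
            raise ValueError(f"data_type must be a str, got {data_type!r}")
        # Assumption: a bare int such as 0 is not accepted, only a list.
        if not isinstance(numbers, list) or len(numbers) == 0:
            raise ValueError(f"chunk numbers of {data_type} must be a non-empty list")
        for n in numbers:
            # bool is a subclass of int, so exclude it explicitly.
            if isinstance(n, bool) or not isinstance(n, int) or n < 0:
                raise ValueError(f"invalid chunk number {n!r} for {data_type}")


check_chunk_number(None)
check_chunk_number({"raw_records": [0, 1]})
check_chunk_number({"peaklets": [1], "lone_hits": [0]})
```

Requiring a list even for a single chunk keeps the stored configuration in one canonical form, so {"raw_records": [0]} cannot be written in two different ways.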

Vulnerability of this PR:

When the chunking of the dependencies changes, the result will also change, but the lineage cannot reflect this change.

Can you give a minimal working example (or illustrate with a figure)?

st.make("0", "peaklets", chunk_number={"raw_records": [0]})
st.make("0", "peak_basics", chunk_number={"peaklets": [1], "lone_hits": [0]})

Please include the following if applicable:

Update the docstring(s)
Update the documentation
Tests to check the (new) code is working as desired.
Does it solve one of the open issues on github?

Please make sure that all automated tests have passed before asking for a review (you can save the PR as a draft otherwise).

coveralls · 2024-08-04T16:54:19Z

coverage: 89.762% (-0.7%) from 90.445%
when pulling 699f24f on chunk_number_folder
into 81f4250 on master.

dachengx · 2024-08-04T18:15:42Z

@FaroutYLq no hurry, I need to test it further. Maybe you are able to hire more people to review.

dachengx · 2024-08-05T17:28:08Z

Maybe we can add the hash (maybe sha256) of the metadata of raw_records to the lineage to fix the vulnerability of this PR. I am thinking about it.
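
A rough sketch of that idea, under stated assumptions: the metadata is taken to be a dict with a "chunks" list whose entries carry "chunk_i", "start", "end" and "n", and `chunking_fingerprint` is a hypothetical helper. Only the chunk boundaries are hashed, so the fingerprint changes exactly when the chunking changes.

```python
import hashlib
import json


def chunking_fingerprint(metadata: dict) -> str:
    """Return a sha256 hex digest of the chunk boundaries in the metadata."""
    # Assumed field names; fields unrelated to chunking are ignored.
    boundaries = [
        [c["chunk_i"], c["start"], c["end"], c["n"]] for c in metadata["chunks"]
    ]
    payload = json.dumps(boundaries).encode()
    return hashlib.sha256(payload).hexdigest()


metadata_a = {
    "chunks": [
        {"chunk_i": 0, "start": 0, "end": 100, "n": 10},
        {"chunk_i": 1, "start": 100, "end": 200, "n": 12},
    ]
}
# Same data, but rechunked with the boundary moved from 100 to 120.
metadata_b = {
    "chunks": [
        {"chunk_i": 0, "start": 0, "end": 120, "n": 11},
        {"chunk_i": 1, "start": 120, "end": 200, "n": 11},
    ]
}

fingerprint_a = chunking_fingerprint(metadata_a)
fingerprint_b = chunking_fingerprint(metadata_b)
```

Adding such a fingerprint to the lineage would give rechunked raw_records a different hash, even though chunk_number itself is unchanged.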

FaroutYLq

Hi thanks for the efforts. I need a bit more time to digest but please let me start asking dumb questions:

Do we have any plugin that is supposed to have no chunks at all (only metadata) when computation is finished? I vaguely remember seeing something like this before, but I am not sure whether it was just because something failed somewhere. I don't have an example on hand unfortunately.
- If so, will this PR break things?
In this PR, is it expected to have a different hash for different chunks of the same plugin?

I am happy to review this and hope to learn more about the core, but I want to figure out these questions first before diving into it.

dachengx · 2024-08-06T23:27:54Z

Hi thanks for the efforts. I need a bit more time to digest but please let me start asking dumb questions:

Do we have any plugin that is supposed to have no chunks at all (only metadata) when computation is finished? I vaguely remember seeing something like this before, but I am not sure whether it was just because something failed somewhere. I don't have an example on hand unfortunately.

If so, will this PR break things?

In this PR, is it expected to have a different hash for different chunks of the same plugin?

I am happy to review this and hope to learn more about the core, but I want to figure out these questions first before diving into it.

chunk_number tells us which chunk to load but not which chunk to save. So if a data_type has zero chunks, an error will occur.
Yes.

WenzDaniel · 2024-08-07T07:09:05Z

Is this feature really needed? I have the feeling it will cause us more trouble than we gain. Can you give a few use cases why this feature is required?

dachengx · 2024-08-07T08:47:46Z

Is this feature really needed? I have the feeling it will cause us more trouble than we gain. Can you give a few use cases why this feature is required?

It is needed in reprocessing, when we want to process a run without waiting for all chunks to be processed in sequence. A similar homemade feature already exists in outsource, but according to @FaroutYLq, it is not very compatible with strax. So this feature is needed.
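
The use case can be sketched as follows. This is a toy illustration, not outsource or strax code: `process_chunk` and `process_run_per_chunk` are hypothetical, and a thread pool stands in for what would be separate jobs in reprocessing.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List


def process_chunk(chunk_i: int, records: List[int]) -> List[int]:
    """Hypothetical stand-in for processing a single chunk of a run."""
    return [2 * r for r in records]


def process_run_per_chunk(chunks: Dict[int, List[int]]) -> List[int]:
    """Process every chunk independently, then merge results in chunk order."""
    with ThreadPoolExecutor(max_workers=4) as pool:
        # Each chunk is submitted on its own; none waits for the previous one.
        futures = {i: pool.submit(process_chunk, i, chunks[i]) for i in chunks}
        results = {i: future.result() for i, future in futures.items()}
    merged: List[int] = []
    for i in sorted(results):  # merging restores the original chunk order
        merged.extend(results[i])
    return merged


raw = {0: [1, 2], 1: [3], 2: [4, 5, 6]}
merged = process_run_per_chunk(raw)
```

The per chunk results play the role of the per chunk storage, and the final ordered merge plays the role of Context.merge_per_chunk_storage.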

dachengx · 2024-08-08T16:23:51Z

Hey, @WenzDaniel. Do you agree with this PR now? We can also inspect it in detail together. You can also list your concerns below in the conversation. Thanks!

Include chunk_number in lineage

ebfbc29

dachengx requested a review from FaroutYLq August 4, 2024 18:14

dachengx marked this pull request as ready for review August 4, 2024 18:19

dachengx requested a review from MerzJohannes August 4, 2024 18:19

Debug for assigning 0 to chunk_number

c130ffd

dachengx changed the title ~~Include chunk_number in lineage~~ Include chunk_number in lineage: Per chunk storage Aug 5, 2024

FaroutYLq reviewed Aug 6, 2024

dachengx added 2 commits August 8, 2024 07:47

Merge per chunk processed storages

e56365e

Limit chunk_number to be list

15649e9

dachengx added 2 commits August 9, 2024 13:47

Add more check to chunk_number

d1dde6f

Merge branch 'master' into chunk_number_folder

699f24f

dachengx merged commit 6f15645 into master Aug 16, 2024
8 checks passed

dachengx deleted the chunk_number_folder branch August 16, 2024 03:52

This was referenced Aug 23, 2024

Add test about merge_per_chunk_storage #874

Closed

Prohibit usage of chunk_number for special plugins like OverlapWindowPlugin #875

Closed

dachengx mentioned this pull request Sep 27, 2024

Want to refactor runstrax.py and strax-wrapper.sh XENONnT/outsource#142

Closed
