Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Include chunk_number in lineage: Per chunk storage #863

Merged
merged 6 commits into from
Aug 16, 2024

Conversation

dachengx
Copy link
Collaborator

@dachengx dachengx commented Aug 4, 2024

What is the problem / what does the code in this PR do

The chunk_number was passed to Context.get_iter to only load a specific chunk. But the previous implementation has weaknesses:

  1. The chunk_number is not tracked by lineage so if you process twice with a different chunk_number, though the lineage is the same, the results are not.
  2. Can not assign which data_type to load by chunk number.

This PR makes sure that the chunk_number is tracked by lineage, and you can assign which chunk to load which data_type.

This PR is designed following plan: https://xe1t-wiki.lngs.infn.it/doku.php?id=xenon:xenonnt:analysis:analysis_tools_team:sr2_processing. We will later make outsource more compatible with strax via this PR.

Depends on #856

Can you briefly describe how it works?

Functionality not implemented:

  1. The chunk_number will not be passed to run selection, because we will not send the metadata including chunk_number to DB.
  2. Change Context.__add_lineage_to_plugin to add chunk_number as a configuration of Plugin.
  3. chunk_number is set to be chunk_number: ty.Optional[ty.Dict[str, ty.List[int]]] = None as argument in functions, e.g. it can be {"raw_records": [0, 1]} or {"peaklets": [1], "lone_hits": [0]}. The latter example is means this PR adds the functionality.
  4. Add a function Context.merge_per_chunk_storage to combine the per chunk storage into normal storage(where chunk_number is not a configuration of Plugin).

Vulnerability of this PR:

  1. When the chunking of dependencies changes, the result will also change. But this change can not be reflected in lineage.

Can you give a minimal working example (or illustrate with a figure)?

st.make("0", "peaklets", chunk_number={"raw_records": [0]})
st.make("0", "peak_basics", chunk_number={"peaklets": [1], "lone_hits": [0]})

Please include the following if applicable:

  • Update the docstring(s)
  • Update the documentation
  • Tests to check the (new) code is working as desired.
  • Does it solve one of the open issues on github?

Please make sure that all automated tests have passed before asking for a review (you can save the PR as a draft otherwise).

@coveralls
Copy link

coveralls commented Aug 4, 2024

Coverage Status

coverage: 89.762% (-0.7%) from 90.445%
when pulling 699f24f on chunk_number_folder
into 81f4250 on master.

@dachengx dachengx requested a review from FaroutYLq August 4, 2024 18:14
@dachengx
Copy link
Collaborator Author

dachengx commented Aug 4, 2024

@FaroutYLq no hurry, I need to test it further. Maybe you are able to hire more people to review.

@dachengx dachengx marked this pull request as ready for review August 4, 2024 18:19
@dachengx dachengx requested a review from MerzJohannes August 4, 2024 18:19
@dachengx
Copy link
Collaborator Author

dachengx commented Aug 5, 2024

Maybe we can add the hash (maybe sha256) of metadata of raw_records to lineage to fix the venerability of this PR. I am thinking about it.

@dachengx dachengx changed the title Include chunk_number in lineage Include chunk_number in lineage: Per chunk storage Aug 5, 2024
Copy link
Collaborator

@FaroutYLq FaroutYLq left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi thanks for the efforts. I need a bit more time to digest but please let me start asking dumb questions:

  • Do we have any plugin that is supposed have no chunk at all (only with metadata) when computation is finished? I vaguely remember seeing something like this before, but not sure if it is just because that it failed somewhere. I don't have an example on hand unfortunately.
    • If so will this PR breaks things?
  • In this PR, is it expected to have different hash for different chunks, in the same plugin?

I am happy to review this and hope to learn more about the core, but I want to figure out these questions first before diving into it.

@dachengx
Copy link
Collaborator Author

dachengx commented Aug 6, 2024

Hi thanks for the efforts. I need a bit more time to digest but please let me start asking dumb questions:

  • Do we have any plugin that is supposed have no chunk at all (only with metadata) when computation is finished? I vaguely remember seeing something like this before, but not sure if it is just because that it failed somewhere. I don't have an example on hand unfortunately.

    • If so will this PR breaks things?
  • In this PR, is it expected to have different hash for different chunks, in the same plugin?

I am happy to review this and hope to learn more about the core, but I want to figure out these questions first before diving into it.

  • chunk_number tells us which chunk to load but not which chunk to save. So if a data_type has zero chunk, an error will occur.
  • Yes.

@WenzDaniel
Copy link
Collaborator

Is this feature really needed? I have the feeling it will cause us more trouble than we gain. Can you give a few use cases why this feature is required?

@dachengx
Copy link
Collaborator Author

dachengx commented Aug 7, 2024

Is this feature really needed? I have the feeling it will cause us more trouble than we gain. Can you give a few use cases why this feature is required?

It is needed in reprocessing when we want to process a run but do not want to wait for all chunks to be processed in sequence. A homemade and similar feature is already in outsource but according to @FaroutYLq , it is not so compatible with strax. So this feature is needed.

@dachengx
Copy link
Collaborator Author

dachengx commented Aug 8, 2024

Hey, @WenzDaniel. Do you agree with this PR now? We can also have some detailed inspection on it together. You can also list your concerns below in the conversation. Thanks!

@dachengx dachengx merged commit 6f15645 into master Aug 16, 2024
8 checks passed
@dachengx dachengx deleted the chunk_number_folder branch August 16, 2024 03:52
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants