Remove PFNs from a final document we send to WMArchive #10998
@amaltaro based on my understanding of the codebase, we only need a trivial change. So far I have only removed the PFN part of the selected keys, which in effect removes the PFNArray and PFNArrayRef attributes from the final document. I don't know if you want to remove LFNArray too; that change would be trivial as well.
@vkuznet Valentin, can you please provide a diff of the final document with and without this patch? There might be unit tests to help you with that.
When I looked at this code, I found it very confusing, and sometimes the same key is present at multiple levels of the final dict/JSON.
The diff is tricky since the data is randomly generated, but here are the examples of docs with and without PFNs:
This data format conversion looks blurry to me, which is why I started looking at the unit tests to make more sense of the resulting document. For instance, LFNArrayRef/PFNArrayRef seem to be an index of attributes from the LFNArray/PFNArray lists. But then, when the latter is shorter than the former, how do you know which attribute type is missing in the output? Also, looking into the two JSON files that you shared, I see some strange output for the nested "pfn" key: it now lists the full PFN instead of its index. So the current fix doesn't seem to make it much better.
@vkuznet I just wanted to check whether there is any information that you need from me here. Looking at my previous reply, I think there are still some inconsistencies in the final document that would be better understood and fixed.
@amaltaro could you please review it again? As I wrote, I adjusted the code to completely remove PFNs; you can see it in the new fwjr-no-pfns.json JSON file I produced with the new code.
Valentin, thanks for providing this new approach.
However, even though it does the job, it doesn't look like an efficient way to deal with this. The reason is that we actually run the whole logic mapping FJR information into a WMArchive document, expanding everything in memory, only to then go back to the WMArchive document and remove key/value pairs that we do not need.
The proper way to fix it, IMO, would be to convert only the necessary information into a WMArchive document. This will save a lot of memory and CPU cycles (and, to be frank, it will likely make the logic less complex).
@amaltaro, do you have quantitative confirmation of an excessive memory and CPU footprint? We have run with PFNs for years and I never heard that it bothered you or others in terms of submission. My point is that I tried a simple approach without spending too much time on optimization. Unless you have specific numbers showing how much CPU and memory it takes, I think it is not worth the effort. If you still insist on doing it the proper way, I'll do it, but it will take time and a change of algorithm/code logic.
Unfortunately I do not! I also believe that collecting such metrics could take more time than the actual implementation :) Please don't take me for a picky person; I am just trying to avoid ending up with issues like these: where things are functional, but fail to scale in specific situations. A few weeks ago I looked into this code myself, and indeed it's quite complex and hard to follow, which is another reason I would be in favor of not doing unnecessary things and making it cleaner and more readable.
Alan, ok, I added new function |
Otherwise, please point me to the place where in WMCore FWJR is constructed such that we can remove PFNs (if necessary) from there.
There is nothing that needs to be done with the FJR. It's only the source information and the data structure that is parsed in order to produce a WMArchive job document.
Having said that, the optimal change here is to change the "converter" code such that it disregards input LFNs and PFNs. In other words, instead of cleaning these attributes from the WMArchive job document, we should make sure that they don't even get (temporarily) created in the document.
Alan, I hope you read my changes, they do exactly what you said. I added new function |
Alan, in addition to what I wrote yesterday, I need to say that you're mixing two different issues here. One, the issue of not sending large docs to MONIT, is addressed in this PR; the second is the (possibly) large memory footprint you pointed out. Because current code does
Therefore, we need to separate these two issues, and if the memory footprint is bothering you, I suggest opening a different ticket for it; it should be handled from the point of view of handling FWJR docs.
Yes, if lfn information is no longer in the WMArchive document, then it does solve the problem reported in the GH issue. However, to me it looks more like a workaround than a real fix.
What I would expect from this code (not necessarily your changes) is:
- load FJR -> perform fields selection and convert to WMArchive schema -> create WMArchive doc
instead, we have either (your 1st and 2nd attempt):
- load FJR -> perform fields selection and convert to WMArchive schema -> create WMArchive doc -> delete unnecessary fields -> create final WMArchive doc; OR
- load FJR -> delete specific fields -> perform fields selection and convert to WMArchive schema -> create WMArchive doc
Don't you think the very top/former model is THE correct way to fix it?
This whole code is hard to read, and the lack of unit tests makes these changes even more challenging (well, there is 1 "unit" test that checks the whole module in one shot). That's why I was advocating for an approach that would go straight to the final result, likely making it less complex.
I will try to make sense of this by Monday morning...
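The flows listed in the comment above can be contrasted with a minimal sketch. Everything here is illustrative: `convert`, `convert_then_delete`, and `select_while_converting` are hypothetical names, the record is a made-up flat dict rather than a real FJR, and the key names are borrowed from the skip list in this PR's diff.

```python
# Keys to leave out of the output document (taken from the skip list in the diff).
PFN_KEYS = {"pfn", "InputPFN", "OutputPFN", "inputpfns"}


def convert(record):
    """Hypothetical stand-in for the FJR -> WMArchive conversion: copies every field."""
    return dict(record)


def convert_then_delete(record):
    """Convert everything first, then delete the unwanted fields from the result."""
    doc = convert(record)
    for key in PFN_KEYS:
        doc.pop(key, None)
    return doc


def select_while_converting(record):
    """Skip unwanted fields during conversion, so they never reach the output."""
    return {key: value for key, value in record.items() if key not in PFN_KEYS}


record = {"lfn": "/store/a.root", "pfn": "root://host//store/a.root", "events": 10}
doc_a = convert_then_delete(record)
doc_b = select_while_converting(record)
```

Both functions yield the same document; the difference is only whether the PFN fields are ever materialised in the intermediate result. The source record still holds them in either case, which is the point raised later in the thread about memory.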
@@ -13,18 +13,17 @@
# convert data format under stpes["cmsRun1"/"logArch1"/"stageOut1"]["output"]
WMARCHIVE_REMOVE_OUTSIDE_LAYER = ["checksum", "dataset"]
# convert to list from str
WMARCHIVE_CONVERT_TO_LIST = ["OutputPFN", "lfn"]
I don't know if this information could be consumed by anyone, but I'd say OutputPFN should remain in the WMA doc, no?
data = idict.get(step, {})
for key, values in data.items():
    for elem in values:
        for skip in ['pfn', 'InputPFN', 'OutputPFN', 'inputpfns']:
Shouldn't this be defined in the WMARCHIVE_REMOVE_FIELD variable? Maybe not, because it's actually removing keys from the FJR, right? Still, this would be better to be defined at the top.
Maybe move it up in the module under FWJR_REMOVE_STEP_FIELD?
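A minimal sketch of what that suggestion could look like, assuming each step maps keys to lists of per-file dicts (an assumption about the FJR layout that this thread does not confirm). `FWJR_REMOVE_STEP_FIELD` is the name proposed in the review comment; the helper `strip_step_fields` and the sample data are hypothetical.

```python
# Module-level constant, as proposed in the review; values come from the diff hunk.
FWJR_REMOVE_STEP_FIELD = ("pfn", "InputPFN", "OutputPFN", "inputpfns")


def strip_step_fields(idict, step, fields=FWJR_REMOVE_STEP_FIELD):
    """Drop the listed keys from every dict found under idict[step], in place."""
    data = idict.get(step, {})
    for values in data.values():
        for elem in values:
            # Only dict elements can carry the keys we want to drop.
            if isinstance(elem, dict):
                for skip in fields:
                    elem.pop(skip, None)
    return idict


# Made-up sample, not a real FJR.
fjr = {"cmsRun1": {"output": [{"lfn": "/store/a.root", "pfn": "root://host//store/a.root"}]}}
strip_step_fields(fjr, "cmsRun1")
```

Keeping the field names in one constant at the top of the module means the list can be reviewed and extended without touching the loop.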
Alan, I think you contradict yourself. Originally, you asked to avoid unnecessary memory consumption, which can only be done before the load FWJR step. Once you load it, the memory is already allocated and consumed. Therefore, your desired approach:
is identical to
because the PFNs already exist in the FWJR as far as I can tell. Don't you think so? Both of the above approaches will be identical in terms of memory consumption, since I don't think that a full rewrite is the answer here. In fact, the second commit I made does exactly what you ask for: it does not delete specific fields explicitly, it skips them, i.e. it performs the fields selection and converts to the WMArchive schema. Once again, please read what Anyway, I think this issue requires a chat, since it seems to me we are stuck on understanding how to address it.
Now that I spent 30min looking into this code, my conclusion is that it's not worth it to go deeper. If we do so, then this should be refactored. Yes, I agree that everything is already loaded in memory and duplicated with the deepcopy, whatever we do now will remain in memory until things go out of scope (next cycle). Valentin, can you please look into the comments made along the code, update it if required, and provide me with a new dump of the outcome (as you've done initially in this PR)? |
Alan, I cleaned up the code, i.e. removed
Please note, this command generates several docs which are dumped to stdout. I checked them all for the appearance of pfn, which is no longer there, and put the first dict in the gist. Please review the code again.
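A check like the one described above could be automated along these lines. This is only a sketch: `find_pfn_paths` is a hypothetical helper, the match is a simple case-insensitive substring test on key names, and the sample document is invented for illustration.

```python
def find_pfn_paths(node, path=""):
    """Return the paths of all dict keys whose name contains 'pfn' (case-insensitive)."""
    hits = []
    if isinstance(node, dict):
        for key, value in node.items():
            child = f"{path}.{key}" if path else key
            if "pfn" in key.lower():
                hits.append(child)
            hits.extend(find_pfn_paths(value, child))
    elif isinstance(node, list):
        for index, item in enumerate(node):
            hits.extend(find_pfn_paths(item, f"{path}[{index}]"))
    return hits


# Invented document with PFN keys at two different nesting levels.
doc = {"steps": [{"output": [{"lfn": 0, "pfn": 1}]}], "PFNArray": []}
hits = find_pfn_paths(doc)
```

An empty result for every generated document would confirm that no PFN-named key survives at any nesting level. Note that this inspects key names only, not string values that might still embed a PFN.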
Alan, do you need anything else on this issue? |
Thanks for providing these further details, Valentin.
Looking into your gist/dict, this is the reason why I really think we should rework this document conversion:
"LFNArray": [
"/store/unmerged/logs/prod/2016/2/4/sryu_TaskChain_Data_wq_testt_160204_061048_5587/RECOCOSD/0000/0/7d7d41dc-cb02-11e5-833c-02163e00efd5-88-0-logArchive.tar.gz",
"/store/unmerged/CMSSW_7_0_0_pre11/Cosmics/ALCARECO/DtCalib-RECOCOSD_TaskChain_Data_pile_up_test-v1/00000/ECCFE421-08CB-E511-9F4C-02163E017804.root",
"/store/data/Run2011A/Cosmics/RAW/v1/000/160/960/E8099605-8853-E011-A848-0030487A18F2.root",
"/store/unmerged/CMSSW_7_0_0_pre11/Cosmics/ALCARECO/MuAlCalIsolatedMu-RECOCOSD_TaskChain_Data_pile_up_test-v1/00000/9665EB21-08CB-E511-9F4C-02163E017804.root"
],
"LFNArrayRef": [
"outputLFNs",
"lfn",
"skippedFiles",
"inputLFNs",
"fallbackFiles"
],
even if that means a refactoring of the code.
If I understand it correctly (and this output reminds me of a couple of the old PhEDEx APIs), LFNArrayRef is an ordered list of attributes to be used as a reference for the LFNArray list, such that:
- LFNArray[0] should be an outputLFNs
- LFNArray[1] should be an lfn
- LFNArray[2] should be a skippedFiles; and so on.
which clearly is not clear at all(!)
Maybe the expected outcome should be something like:
"LFNArray": [
"/store/unmerged/logs/prod/2016/2/4/sryu_TaskChain_Data_wq_testt_160204_061048_5587/RECOCOSD/0000/0/7d7d41dc-cb02-11e5-833c-02163e00efd5-88-0-logArchive.tar.gz",
"/store/unmerged/CMSSW_7_0_0_pre11/Cosmics/ALCARECO/DtCalib-RECOCOSD_TaskChain_Data_pile_up_test-v1/00000/ECCFE421-08CB-E511-9F4C-02163E017804.root",
"/store/unmerged/CMSSW_7_0_0_pre11/Cosmics/ALCARECO/MuAlCalIsolatedMu-RECOCOSD_TaskChain_Data_pile_up_test-v1/00000/9665EB21-08CB-E511-9F4C-02163E017804.root"
],
"LFNArrayRef": [
"outputLFNs",
],
thus, reporting only the output files for this job (2 ALCARECO files and the logArchive).
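The ambiguity in the positional reading described in this comment can be shown with a short sketch. It assumes that reading only for the sake of illustration (WMArchive may well interpret these fields differently), uses shortened stand-in file names instead of the full LFNs, and `pair_positionally` is a hypothetical helper.

```python
from itertools import zip_longest

# Shortened stand-ins for the four LFNs in the gist; the five reference names are as shown.
lfn_array = ["logArchive.tar.gz", "DtCalib.root", "RAW.root", "MuAlCalIsolatedMu.root"]
lfn_array_ref = ["outputLFNs", "lfn", "skippedFiles", "inputLFNs", "fallbackFiles"]


def pair_positionally(refs, values):
    """Pair refs[i] with values[i]; a missing value is reported as None."""
    return list(zip_longest(refs, values))


pairs = pair_positionally(lfn_array_ref, lfn_array)
# References left without a value under the positional reading.
unmatched = [ref for ref, value in pairs if value is None]
```

With four values and five references, one reference is left unmatched, and nothing in the document says whether the gap belongs at the end or somewhere in the middle.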
Alan, this is getting to be a longer and longer list of changes you desire, which have nothing to do with the original goal of this PR. If you agreed that it is not possible to address the memory footprint due to My original code was only stripping off PFNs from the final doc; how the rest of the structure is handled should not be the business of this PR.
Valentin, my last comment does not have anything to do with the memory footprint. In short, it says that the schema currently uploaded to WMArchive, with or without this patch, is not functional. In addition to that, I still see the input LFN in your dictionary. From the original issue description, which had some discussion and was still evolving, this: was one of the expected outcomes/behaviors for that issue.
Alan, we can't just strip out LFNs since your pointer does not address all use cases. For instance, the primary reason for keeping LFNArray is to provide logArchive look-up on HDFS via a Spark job. As such, I have no idea which LFNs are required for this use case. In addition to that, I have no clue how many types of LFNs the FWJR contains. Does your list:
represent all possible LFNs created and used in WMCore workflows? Therefore, if you want to address this properly, please provide the following:
In my view, the presence of LFNs and PFNs in WMArchive should be addressed in different PRs. This PR only addresses the removal of PFNs, which will already reduce the WMArchive size. Even though it is a partial solution, it is better to have it first, while addressing LFNs may require a much broader discussion among different groups supporting different use cases.
@amaltaro based on today's discussion, please provide clear guidelines on how to proceed with this PR. So far, it removes PFNs. But per our discussion, to avoid ambiguities, it would be easy to remove LFNs as well. If this is the case, please say so and I'll adjust the PR accordingly.
We have many FJR samples in here: which all date back 6 years. They should be fairly representative, even though I wanted to see if we could have one with In any case, would you be able to provide us with the following samples:
converted to WMArchive format a) with this patch applied; and b) without this patch? So, 4 JSON files in total.
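Since the values in these documents are randomly generated, one way to compare the "with patch" and "without patch" outputs is to compare key paths rather than values. The sketch below assumes both documents are already loaded as Python dicts; `key_paths` and `removed_keys` are hypothetical helpers and the two sample documents are invented.

```python
def key_paths(node, prefix=""):
    """Collect the set of key paths in a nested dict/list, ignoring list positions."""
    paths = set()
    if isinstance(node, dict):
        for key, value in node.items():
            child = f"{prefix}.{key}" if prefix else key
            paths.add(child)
            paths |= key_paths(value, child)
    elif isinstance(node, list):
        for item in node:
            paths |= key_paths(item, f"{prefix}[]")
    return paths


def removed_keys(before, after):
    """Key paths present in `before` but absent from `after`, sorted for stable output."""
    return sorted(key_paths(before) - key_paths(after))


# Invented stand-ins for one sample converted without and with the patch.
without_patch = {"LFNArray": ["a"], "PFNArray": ["b"], "steps": [{"output": [{"lfn": 0, "pfn": 0}]}]}
with_patch = {"LFNArray": ["a"], "steps": [{"output": [{"lfn": 0}]}]}
removed = removed_keys(without_patch, with_patch)
```

For real files, each document could be loaded with `json.load` before being passed to `removed_keys`.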
Here we go:
And here is the test code to produce the aforementioned data:
The code can be run as follows (assuming you save it as
Valentin, as discussed on Monday, this is indeed removing the PFNs from the output document and it's good to go. Eventually we should review how the list of files (input/output, lfns/pfns) is structured in the WMArchive document. We should backport this fix to 2.0.2_wmagent and patch all the production agents running the latest version.
Fixes #10879
Status
ready
Description
Remove PFNArray from final WMArchive document
Is it backward compatible (if not, which system it affects?)
YES, since we do not fill out PFNArray part
Related PRs
<If it's a follow up work; or porting a fix from a different branch, please mention them here.>
External dependencies / deployment changes
<Does it require deployment changes? Does it rely on third-party libraries?>