Remove PFNs from a final document we send to WMArchive #10998

vkuznet · 2022-02-15T15:33:09Z

Fixes #10879

Status

ready

Description

Remove PFNArray from final WMArchive document

Is it backward compatible (if not, which system it affects?)

YES, since we do not fill out PFNArray part

Related PRs

External dependencies / deployment changes

vkuznet · 2022-02-15T15:34:50Z

@amaltaro based on my understanding of the codebase we only need trivial change. So far I only removed PFN part of selected keys which in fact remove PFNArray and PFNArrayRef attributes in a final document. I don't know if you want to remove LFNArray too, the change will be trivial as well.

cmsdmwmbot · 2022-02-15T15:41:33Z

Jenkins results:

Python3 Unit tests: failed
- 1 new failures
- 4 changes in unstable tests
Python3 Pylint check: succeeded
Pylint py3k check: succeeded
Pycodestyle check: succeeded

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/12786/artifact/artifacts/PullRequestReport.html

amaltaro

@vkuznet Valentin, can you please provide a diff of the final document with and without this patch? There might be unit tests to help you with that.

When I looked at this code, I found it very confusing and sometimes the same key is present in multiple level of the final dict/json.

vkuznet · 2022-02-15T17:36:29Z

The diff is tricky since the data is randomly generated, but here are the examples of docs with and without PFNs:

amaltaro · 2022-02-15T20:23:46Z

This data format conversion looks blurred to me. Reason why I started looking at the unit tests to make more sense of the outcome document.

For instance, the LFNArrayRef/PFNArrayRef seem to be like an index of attributes from the LFNArray/PFNArray lists. But then, when the later is shorter than the former how do you know which attribute type is missing in the output.

Also looking into the two json files that you shared, I see some strange output for the nested "pfn" key. Now it lists the full PFN instead of its index. So the current fix doesn't seem to make it too much better.

amaltaro · 2022-03-10T19:35:34Z

@vkuznet I just wanted to check whether there is any information that you need from me here? Looking at my previous reply, I think there are still some inconsistencies in the final document that it would be better to understand and get it fixed.

vkuznet · 2022-03-15T12:51:28Z

@amaltaro , I added new function to remove all PFNs, please see final doc here. Now the code is ready for review again.

cmsdmwmbot · 2022-03-15T13:02:09Z

Jenkins results:

Python3 Unit tests: succeeded
- 7 tests added
- 2 changes in unstable tests
Python3 Pylint check: succeeded
Pylint py3k check: succeeded
Pycodestyle check: succeeded

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/12872/artifact/artifacts/PullRequestReport.html

vkuznet · 2022-03-17T17:46:52Z

@amaltaro could you please review it again, as I wrote I adjusted code to completely remove PFNs, you can see it in new fwjr-no-pfns.json JSON file I produced with new code.

amaltaro

Valentin, thanks for providing this new approach.

However, even though it does the job, it doesn't look an efficient way to deal with this. Reason is, we actually run the whole logic mapping FJR information into a WMArchive document - expanding everything in memory - to only then go back to the WMArchive document and remove key/value pairs that we do not need.

The proper way to fix it IMO would be, to convert only the necessary information into a WMArchive document. This will save a lot of memory and CPU cycles (and to be frank, it will likely become a less complex logic).

vkuznet · 2022-03-18T12:41:35Z

@amaltaro , do you have quantitative confirmation about excessive memory and CPU footprint? We run with PFNs for years and I never heard anything bothering you or other in terms of submission. My point is that I tried a simple approach without spending too much time on optimization. If you have specific numbers to show that it takes that much CPU and that much memory I think it is not worth the effort. If you still insist to make it in a proper way, I'll do it, but it will take time and change of algorithm/code logic.

amaltaro · 2022-03-18T14:16:34Z

Unfortunately I do not! I also believe that collecting such metrics could actually take more time than the actual implementation :)

Please don't take me as a picky person, I am just trying to avoid to end up with issues like these:
#8478
#4656

where things are functional, but fail to scale in specific situations.

A few weeks ago I looked into this code myself, and indeed it's quite complex and hard to follow. Which is another reason I would be in favor of actually not doing unnecessary things and make it cleaner and more readable.

vkuznet · 2022-03-18T17:35:23Z

Alan, ok, I added new function cleanStep which removes provided by FWJR pfn related fields. The produced document no longer have PFNs, but I'm not sure if it is sufficient to address problems you're referring since FWJR which is passed to WMCore/Services/WMArchive/DataMap.py still may contain PFNs attribute in FWJR document. I have no idea where exactly FWJR is created and who is passing it to this module. Please review and let me know if it is sufficient. Otherwise, please point me to the place where in WMCore FWJR is constructed such that we can remove PFNs (if necessary) from there.

cmsdmwmbot · 2022-03-18T17:55:40Z

Jenkins results:

Python3 Unit tests: succeeded
- 7 tests added
- 3 changes in unstable tests
Python3 Pylint check: succeeded
Pylint py3k check: succeeded
Pycodestyle check: succeeded

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/12907/artifact/artifacts/PullRequestReport.html

amaltaro

Otherwise, please point me to the place where in WMCore FWJR is constructed such that we can remove PFNs (if necessary) from there.

There is nothing that needs to be done with the FJR. It's only the source information and the data structure that is parsed in order to produce a WMArchive job document.

Having said that, the optimal change here is to change the "converter" code, such that it disregard input LFNs and PFNs. In other words, instead of cleaning these attributes from the WMArchive job document, we should actually make sure that they don't even get (temporary) created in the document.

vkuznet · 2022-03-22T23:43:47Z

Alan, I hope you read my changes, they do exactly what you said. I added new function convertStep which is called from convertSteps which by itself is called from convertToArchiverFormat (see https://github.com/dmwm/WMCore/blob/master/src/python/WMCore/Services/WMArchive/DataMap.py#L320) which by itself takes given FWJR. Unless you clarify your statement further I see no further action I need to do with this PR. Please spend time to read changes and code itself. I only left cleanup function which I introduced earlier in case if you want to do cleanup step at the end. This function can be removed once you review the proposed changes.

vkuznet · 2022-03-23T11:06:04Z

Alan, in addition what I wrote yesterday, I need to say that you're mixing two different issues here. One, which is issue of not sending large docs to MONIT is addressed in this PR, the second issue is (possible) large memory footprint you pointed out. Because current code does deepcopy at line https://github.com/dmwm/WMCore/blob/master/src/python/WMCore/Services/WMArchive/DataMap.py#L318 from a given FWJR the memory issue is still present. As I pointed out already several times if you want to address the memory issue we either need to:

remove copy.deepcopy call, but in this case I don't know what will happen with provided FWJR after data send to MONIT, i.e. if we remove certain things from provided dict and later they are used by something else in WMCore this will be wrong, or
we should modify FWJR creation process to avoid putting into its doc PFNs

Therefore, we need to separate these two issues, and if memory footprint is bothering you I suggest to open different ticket for it and it should be handled from the view of handling FWJR docs.

amaltaro

Yes, if lfn information is no longer in the WMArchive document, then it does solve the problem reported in the GH issue. However, to me it looks more like a workaround than a real fix.

What I would expect from this code (not necessarily your changes) is:

load FJR -> perform fields selection and convert to WMArchive schema -> create WMArchive doc

instead, we have either (your 1st and 2nd attempt):

load FJR -> perform fields selection and convert to WMArchive schema -> create WMArchive doc -> delete unnecessary fields -> create final WMArchive doc; OR
load FJR -> delete specific fields -> perform fields selection and convert to WMArchive schema -> create WMArchive doc

Don't you think the very top/former model is THE correct way to fix it?

This whole code is hard to read and lack of unit tests make these changes even more challenging (well, there is 1 "unit" test that checks the whole module in shot). That's why I was advocating for an approach that would go straight to the final result - likely making it less complex.

I will try to make sense of this by Monday morning...

amaltaro · 2022-03-25T20:39:21Z

src/python/WMCore/Services/WMArchive/DataMap.py

@@ -13,18 +13,17 @@
 # convert data format under stpes["cmsRun1"/"logArch1"/"stageOut1"]["output"]
 WMARCHIVE_REMOVE_OUTSIDE_LAYER = ["checksum", "dataset"]
 # convert to list from str
-WMARCHIVE_CONVERT_TO_LIST = ["OutputPFN", "lfn"]


I don't know if this information could be consumed by anyone, but I'd say OutputPFN should remain in the WMA doc, no?

amaltaro · 2022-03-25T20:41:07Z

src/python/WMCore/Services/WMArchive/DataMap.py

+        data = idict.get(step, {})
+        for key, values in data.items():
+            for elem in values:
+                for skip in ['pfn', 'InputPFN', 'OutputPFN', 'inputpfns']:


Shouldn't this be defined in the WMARCHIVE_REMOVE_FIELD variable? Maybe not, because it's actually removing keys from the FJR, right? Still, this would be better to be defined at the top.

Maybe move it up in the module under FWJR_REMOVE_STEP_FIELD?

vkuznet · 2022-03-25T22:08:31Z

Alan, I think you contradict yourself. Originally, you asked to avoid unnecessary memory consumption by which can only be done before the load FWJR step. Once you load it, the memory is already allocated and consumer. Therefore, your desired approach:

load FJR -> perform fields selection and convert to WMArchive schema -> create WMArchive doc

is identical to

load FJR -> delete specific fields -> perform fields selection and convert to WMArchive schema -> create WMArchive doc

because the PFNs already exists in FWJR as far as I can tell. Don't you think so? Both these above approaches will be identical in terms of memory consumption since deepcopy will take your RAM by the time it will load FJR.

I don't think that full re-write is answer here. In fact, the second commit I made does exactly what you ask for, it does not delete specific fields explicitly, it skips them, i.e. it is making perform fields selection and convert to WMArchive schema. Once again, please read what def cleanStep is doing and if you still believe that it deletes specific fields point me how.

Anyway, I think this issue require a chat since it seems to me stuck with understanding how to address it.

amaltaro · 2022-03-25T22:30:02Z

because the PFNs already exists in FWJR as far as I can tell. Don't you think so? Both these above approaches will be identical in terms of memory consumption since deepcopy will take your RAM by the time it will load FJR.

Now that I spent 30min looking into this code, my conclusion is that it's not worth it to go deeper. If we do so, then this should be refactored.

Yes, I agree that everything is already loaded in memory and duplicated with the deepcopy, whatever we do now will remain in memory until things go out of scope (next cycle).

Valentin, can you please look into the comments made along the code, update it if required, and provide me with a new dump of the outcome (as you've done initially in this PR)?

vkuznet · 2022-03-26T14:42:56Z

Alan, I clean-up code, i.e. removed cleanup function I committed originally, moved (per your request) cleanStep function to top of the file (after module definitions), fixed docstring. The new doc can be seen here where I posted first dict output from the following command:

python test/python/WMCore_t/Services_t/WMArchive_t/DataMap_t.py

Please note, this command generates several docs which are dumped to the stdout. I checked them all for pfn appearance which is no longer there and put to the gist the first dict.

Please review the code again.

cmsdmwmbot · 2022-03-26T14:50:39Z

Jenkins results:

Python3 Unit tests: succeeded
- 7 tests added
- 2 changes in unstable tests
Python3 Pylint check: failed
- 1 warnings and errors that must be fixed
- 1 warnings
- 31 comments to review
Pylint py3k check: succeeded
Pycodestyle check: succeeded

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/12960/artifact/artifacts/PullRequestReport.html

cmsdmwmbot · 2022-03-26T15:20:41Z

Jenkins results:

Python3 Unit tests: succeeded
- 7 tests added
- 2 changes in unstable tests
Python3 Pylint check: succeeded
- 1 warnings
- 31 comments to review
Pylint py3k check: succeeded
Pycodestyle check: succeeded

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/12961/artifact/artifacts/PullRequestReport.html

vkuznet · 2022-03-29T16:22:34Z

Alan, do you need anything else on this issue?

amaltaro

Thanks for providing these further details, Valentin.

Looking into your gist/dict, this is the reason why I really think we should rework this document conversion:

  "LFNArray": [
    "/store/unmerged/logs/prod/2016/2/4/sryu_TaskChain_Data_wq_testt_160204_061048_5587/RECOCOSD/0000/0/7d7d41dc-cb02-11e5-833c-02163e00efd5-88-0-logArchive.tar.gz",
    "/store/unmerged/CMSSW_7_0_0_pre11/Cosmics/ALCARECO/DtCalib-RECOCOSD_TaskChain_Data_pile_up_test-v1/00000/ECCFE421-08CB-E511-9F4C-02163E017804.root",
    "/store/data/Run2011A/Cosmics/RAW/v1/000/160/960/E8099605-8853-E011-A848-0030487A18F2.root",
    "/store/unmerged/CMSSW_7_0_0_pre11/Cosmics/ALCARECO/MuAlCalIsolatedMu-RECOCOSD_TaskChain_Data_pile_up_test-v1/00000/9665EB21-08CB-E511-9F4C-02163E017804.root"
  ],
  "LFNArrayRef": [
    "outputLFNs",
    "lfn",
    "skippedFiles",
    "inputLFNs",
    "fallbackFiles"
  ],

even if that means a refactoring of the code.

If I understand it correctly - and this output reminds me a couple of the old PhEDEx APIs - where LFNArrayRef is an ordered list of attributes to be used as a reference for the LFNArray list, such that:

LFNArray[0] should be a outputLFNs
LFNArray[1] should be a lfn
LFNArray[2] should be a skippedFiles; and so on.

which clearly is not clear at all(!)

Maybe the expected outcome should be something like:

  "LFNArray": [
    "/store/unmerged/logs/prod/2016/2/4/sryu_TaskChain_Data_wq_testt_160204_061048_5587/RECOCOSD/0000/0/7d7d41dc-cb02-11e5-833c-02163e00efd5-88-0-logArchive.tar.gz",
    "/store/unmerged/CMSSW_7_0_0_pre11/Cosmics/ALCARECO/DtCalib-RECOCOSD_TaskChain_Data_pile_up_test-v1/00000/ECCFE421-08CB-E511-9F4C-02163E017804.root",
    "/store/unmerged/CMSSW_7_0_0_pre11/Cosmics/ALCARECO/MuAlCalIsolatedMu-RECOCOSD_TaskChain_Data_pile_up_test-v1/00000/9665EB21-08CB-E511-9F4C-02163E017804.root"
  ],
  "LFNArrayRef": [
    "outputLFNs",
  ],

thus, reporting only the output files for this job (2 ALCARECO files and the logArchive).

vkuznet · 2022-03-29T18:54:55Z

Alan, this getting longer and longer list of changes you desire to have which have nothing to do with original goal of this PR. If you agreed that it is not possible to address the memory footprint due to deepcopy I really suggest to come back to my original cleanup function which just strip out PFNs from final document. Everything else, if necessary, should go in my opinion into separate PRs/issues. I don't want to waste time on refactoring of something which we may or may not use, and making gigantic changes and refactoring code to the issues which far beyond of the scope of original problem on a monitoring side. Don't you agree?

My original code was only stripping off PFNs from final doc, how the rest of the structure is handling should be a business of this PR.

amaltaro · 2022-03-29T19:01:21Z

Valentin, my last comment does not have anything to do with memory footprint. In short, it says that the current - and/or with this patch in - schema uploaded to WMArchive is not functional. In addition to that, I still see the input LFN in your dictionary.

From the original issue description - which had some discussion and was still evolving - this:
#10879 (comment)

was one of the expected outcome/behavior for that issue.

vkuznet · 2022-03-29T19:39:03Z

Alan, we can't just strip out LFNs since your pointer does not address all use-cases. For instance, primary reason for keeping LFNArray is to provide logArchive look-up on HDFS via Spark job. As such, I have no idea which LFNs are required for this use-case. In addition to that I have no clue how many types of LFNs FWJR contains. Does your list:

    "outputLFNs",
    "lfn",
    "skippedFiles",
    "inputLFNs",
    "fallbackFiles"

represents all possible LFNs created, used in WMCore workflows?

Therefore, if you want to address this properly please provide the following:

full schema of FWJR, especially whatever attributes related to LFNs
please outline use-cases which we should support, e.g. does LFNArrayRef should only contain outputLFNs or any other kind which will be required by use-cases we need to support
- use cases should cover existing WMCore and data-ops ones which (again) I have no idea where to get
does LFNArray should keep all used LFNs or only subset, if later, which one

In my view the LFNs and PFNs presence in WMArchive should be addressed in different PRs. This PR only address removal of PFNs which will already reduce WMArchive size. Even though it is partial solution it is better to have it first, while addressing LFNs may require much broader discussion among different groups and supporting different use-cases.

vkuznet · 2022-04-18T17:37:58Z

@amaltaro based on today's discussion please provide clear guidelines how to proceed with this PR. So far, it removes PFNs. But our discussion, to avoid ambiguities, it would be easy to remove LFNs as well. If this is the case, please say so and I'll adjust the PR accordingly.

amaltaro · 2022-04-18T17:53:44Z

We have many FJR samples in here:
https://github.com/dmwm/WMCore/tree/master/test/python/WMCore_t/Services_t/WMArchive_t/FWJRSamples

which all date back from 6 years ago. They should be fairly representative, even though I wanted to see if we could have one with skippedFiles.

In any case, would you be able to provide us with the following samples:

ProcessingSuccessFwjr.json
MergeSuccessFwjr.json

converted to WMArchive format with a) this patch applied; and b) without this patch? So, 4 json files in total.

vkuznet · 2022-04-18T18:32:54Z

Here we go:

no patch
- MergeSuccessFwjr-pfns.json
- ProcessingSuccessFwjr-pfns.json
with current PR patch
- MergeSuccessFwjr-nopfns.json
- ProcessingSuccessFwjr-nopfns.json

vkuznet · 2022-04-18T18:34:35Z

And, here is test code to produce the aforementioned data:

#!/usr/bin/env python

import sys
import json
from WMCore.Services.WMArchive.DataMap import createArchiverDoc

fwjr = json.load(open(sys.argv[1]))
doc = {'id': 123, 'doc': fwjr}
data = createArchiverDoc(doc)
print(json.dumps(data))

The code can be run as following (assuming you save it as test-wma.py):

./test-wma.py test/python/WMCore_t/Services_t/WMArchive_t/FWJRSamples/MergeSuccessFwjr.json | jq > MergeSuccessFwjr-pfns.json

amaltaro · 2022-04-21T01:50:14Z

Valentin, as discussed on Monday, this is indeed removing the PFNs from the output document and it's good to go. Eventually we should review how the list of files (input/output, lfns/pfns) is structured in the WMArchive document.

We should backport this fix to 2.0.2_wmagent and patch all the production agents running the latest version.

vkuznet requested a review from amaltaro February 15, 2022 15:35

amaltaro reviewed Feb 15, 2022

View reviewed changes

vkuznet mentioned this pull request Feb 15, 2022

Big jsons uploaded to WMArchive #10879

Closed

amaltaro added the PR: Do not merge yet label Mar 3, 2022

vkuznet force-pushed the issue-10879 branch from 9d5fe83 to df88caa Compare March 15, 2022 12:49

vkuznet requested a review from amaltaro March 15, 2022 12:51

amaltaro requested changes Mar 18, 2022

View reviewed changes

vkuznet force-pushed the issue-10879 branch from df88caa to 2245e41 Compare March 18, 2022 17:31

vkuznet requested a review from amaltaro March 18, 2022 17:35

vkuznet self-assigned this Mar 22, 2022

vkuznet added the PR: Work in progress label Mar 22, 2022

amaltaro reviewed Mar 22, 2022

View reviewed changes

amaltaro requested changes Mar 25, 2022

View reviewed changes

vkuznet force-pushed the issue-10879 branch from 2245e41 to ea66013 Compare March 26, 2022 14:40

vkuznet requested a review from amaltaro March 26, 2022 14:43

Remove PFNs from a final document we send to WMArchive

1d0ef2d

vkuznet force-pushed the issue-10879 branch from ea66013 to 1d0ef2d Compare March 26, 2022 15:12

amaltaro reviewed Mar 29, 2022

View reviewed changes

amaltaro merged commit 8c6466b into dmwm:master Apr 21, 2022

Remove PFNs from a final document we send to WMArchive #10998

Remove PFNs from a final document we send to WMArchive #10998

Conversation

vkuznet commented Feb 15, 2022

Status

Description

Is it backward compatible (if not, which system it affects?)

Related PRs

External dependencies / deployment changes

vkuznet commented Feb 15, 2022

cmsdmwmbot commented Feb 15, 2022

amaltaro left a comment

Choose a reason for hiding this comment

vkuznet commented Feb 15, 2022

amaltaro commented Feb 15, 2022

amaltaro commented Mar 10, 2022

vkuznet commented Mar 15, 2022

cmsdmwmbot commented Mar 15, 2022

vkuznet commented Mar 17, 2022

amaltaro left a comment

Choose a reason for hiding this comment

vkuznet commented Mar 18, 2022 • edited Loading

amaltaro commented Mar 18, 2022

vkuznet commented Mar 18, 2022

cmsdmwmbot commented Mar 18, 2022

amaltaro left a comment

Choose a reason for hiding this comment

vkuznet commented Mar 22, 2022

vkuznet commented Mar 23, 2022

amaltaro left a comment

Choose a reason for hiding this comment

amaltaro Mar 25, 2022

Choose a reason for hiding this comment

amaltaro Mar 25, 2022

Choose a reason for hiding this comment

amaltaro Mar 25, 2022

Choose a reason for hiding this comment

vkuznet commented Mar 25, 2022

amaltaro commented Mar 25, 2022

vkuznet commented Mar 26, 2022

cmsdmwmbot commented Mar 26, 2022

cmsdmwmbot commented Mar 26, 2022

vkuznet commented Mar 29, 2022

amaltaro left a comment

Choose a reason for hiding this comment

vkuznet commented Mar 29, 2022

amaltaro commented Mar 29, 2022

vkuznet commented Mar 29, 2022

vkuznet commented Apr 18, 2022

amaltaro commented Apr 18, 2022

vkuznet commented Apr 18, 2022

vkuznet commented Apr 18, 2022

amaltaro commented Apr 21, 2022

vkuznet commented Mar 18, 2022 •

edited

Loading