Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove PFNs from a final document we send to WMArchive #10998

Merged
merged 1 commit into from
Apr 21, 2022

Conversation

vkuznet
Copy link
Contributor

@vkuznet vkuznet commented Feb 15, 2022

Fixes #10879

Status

ready

Description

Remove PFNArray from final WMArchive document

Is it backward compatible (if not, which system it affects?)

YES, since we do not fill out PFNArray part

Related PRs

<If it's a follow up work; or porting a fix from a different branch, please mention them here.>

External dependencies / deployment changes

<Does it require deployment changes? Does it rely on third-party libraries?>

@vkuznet
Copy link
Contributor Author

vkuznet commented Feb 15, 2022

@amaltaro based on my understanding of the codebase we only need trivial change. So far I only removed PFN part of selected keys which in fact remove PFNArray and PFNArrayRef attributes in a final document. I don't know if you want to remove LFNArray too, the change will be trivial as well.

@vkuznet vkuznet requested a review from amaltaro February 15, 2022 15:35
@cmsdmwmbot
Copy link

Jenkins results:

  • Python3 Unit tests: failed
    • 1 new failures
    • 4 changes in unstable tests
  • Python3 Pylint check: succeeded
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/12786/artifact/artifacts/PullRequestReport.html

Copy link
Contributor

@amaltaro amaltaro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@vkuznet Valentin, can you please provide a diff of the final document with and without this patch? There might be unit tests to help you with that.

When I looked at this code, I found it very confusing and sometimes the same key is present in multiple level of the final dict/json.

@vkuznet
Copy link
Contributor Author

vkuznet commented Feb 15, 2022

The diff is tricky since the data is randomly generated, but here are the examples of docs with and without PFNs:

@amaltaro
Copy link
Contributor

This data format conversion looks blurred to me. Reason why I started looking at the unit tests to make more sense of the outcome document.

For instance, the LFNArrayRef/PFNArrayRef seem to be like an index of attributes from the LFNArray/PFNArray lists. But then, when the later is shorter than the former how do you know which attribute type is missing in the output.

Also looking into the two json files that you shared, I see some strange output for the nested "pfn" key. Now it lists the full PFN instead of its index. So the current fix doesn't seem to make it too much better.

@amaltaro
Copy link
Contributor

@vkuznet I just wanted to check whether there is any information that you need from me here? Looking at my previous reply, I think there are still some inconsistencies in the final document that it would be better to understand and get it fixed.

@vkuznet
Copy link
Contributor Author

vkuznet commented Mar 15, 2022

@amaltaro , I added new function to remove all PFNs, please see final doc here. Now the code is ready for review again.

@vkuznet vkuznet requested a review from amaltaro March 15, 2022 12:51
@cmsdmwmbot
Copy link

Jenkins results:

  • Python3 Unit tests: succeeded
    • 7 tests added
    • 2 changes in unstable tests
  • Python3 Pylint check: succeeded
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/12872/artifact/artifacts/PullRequestReport.html

@vkuznet
Copy link
Contributor Author

vkuznet commented Mar 17, 2022

@amaltaro could you please review it again, as I wrote I adjusted code to completely remove PFNs, you can see it in new fwjr-no-pfns.json JSON file I produced with new code.

Copy link
Contributor

@amaltaro amaltaro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Valentin, thanks for providing this new approach.

However, even though it does the job, it doesn't look an efficient way to deal with this. Reason is, we actually run the whole logic mapping FJR information into a WMArchive document - expanding everything in memory - to only then go back to the WMArchive document and remove key/value pairs that we do not need.

The proper way to fix it IMO would be, to convert only the necessary information into a WMArchive document. This will save a lot of memory and CPU cycles (and to be frank, it will likely become a less complex logic).

@vkuznet
Copy link
Contributor Author

vkuznet commented Mar 18, 2022

@amaltaro , do you have quantitative confirmation about excessive memory and CPU footprint? We run with PFNs for years and I never heard anything bothering you or other in terms of submission. My point is that I tried a simple approach without spending too much time on optimization. If you have specific numbers to show that it takes that much CPU and that much memory I think it is not worth the effort. If you still insist to make it in a proper way, I'll do it, but it will take time and change of algorithm/code logic.

@amaltaro
Copy link
Contributor

Unfortunately I do not! I also believe that collecting such metrics could actually take more time than the actual implementation :)

Please don't take me as a picky person, I am just trying to avoid to end up with issues like these:
#8478
#4656

where things are functional, but fail to scale in specific situations.

A few weeks ago I looked into this code myself, and indeed it's quite complex and hard to follow. Which is another reason I would be in favor of actually not doing unnecessary things and make it cleaner and more readable.

@vkuznet
Copy link
Contributor Author

vkuznet commented Mar 18, 2022

Alan, ok, I added new function cleanStep which removes provided by FWJR pfn related fields. The produced document no longer have PFNs, but I'm not sure if it is sufficient to address problems you're referring since FWJR which is passed to WMCore/Services/WMArchive/DataMap.py still may contain PFNs attribute in FWJR document. I have no idea where exactly FWJR is created and who is passing it to this module. Please review and let me know if it is sufficient. Otherwise, please point me to the place where in WMCore FWJR is constructed such that we can remove PFNs (if necessary) from there.

@vkuznet vkuznet requested a review from amaltaro March 18, 2022 17:35
@cmsdmwmbot
Copy link

Jenkins results:

  • Python3 Unit tests: succeeded
    • 7 tests added
    • 3 changes in unstable tests
  • Python3 Pylint check: succeeded
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/12907/artifact/artifacts/PullRequestReport.html

Copy link
Contributor

@amaltaro amaltaro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Otherwise, please point me to the place where in WMCore FWJR is constructed such that we can remove PFNs (if necessary) from there.

There is nothing that needs to be done with the FJR. It's only the source information and the data structure that is parsed in order to produce a WMArchive job document.

Having said that, the optimal change here is to change the "converter" code, such that it disregard input LFNs and PFNs. In other words, instead of cleaning these attributes from the WMArchive job document, we should actually make sure that they don't even get (temporary) created in the document.

@vkuznet
Copy link
Contributor Author

vkuznet commented Mar 22, 2022

Alan, I hope you read my changes, they do exactly what you said. I added new function convertStep which is called from convertSteps which by itself is called from convertToArchiverFormat (see https://github.com/dmwm/WMCore/blob/master/src/python/WMCore/Services/WMArchive/DataMap.py#L320) which by itself takes given FWJR. Unless you clarify your statement further I see no further action I need to do with this PR. Please spend time to read changes and code itself. I only left cleanup function which I introduced earlier in case if you want to do cleanup step at the end. This function can be removed once you review the proposed changes.

@vkuznet
Copy link
Contributor Author

vkuznet commented Mar 23, 2022

Alan, in addition what I wrote yesterday, I need to say that you're mixing two different issues here. One, which is issue of not sending large docs to MONIT is addressed in this PR, the second issue is (possible) large memory footprint you pointed out. Because current code does deepcopy at line https://github.com/dmwm/WMCore/blob/master/src/python/WMCore/Services/WMArchive/DataMap.py#L318 from a given FWJR the memory issue is still present. As I pointed out already several times if you want to address the memory issue we either need to:

  • remove copy.deepcopy call, but in this case I don't know what will happen with provided FWJR after data send to MONIT, i.e. if we remove certain things from provided dict and later they are used by something else in WMCore this will be wrong, or
  • we should modify FWJR creation process to avoid putting into its doc PFNs

Therefore, we need to separate these two issues, and if memory footprint is bothering you I suggest to open different ticket for it and it should be handled from the view of handling FWJR docs.

Copy link
Contributor

@amaltaro amaltaro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, if lfn information is no longer in the WMArchive document, then it does solve the problem reported in the GH issue. However, to me it looks more like a workaround than a real fix.

What I would expect from this code (not necessarily your changes) is:

  • load FJR -> perform fields selection and convert to WMArchive schema -> create WMArchive doc

instead, we have either (your 1st and 2nd attempt):

  • load FJR -> perform fields selection and convert to WMArchive schema -> create WMArchive doc -> delete unnecessary fields -> create final WMArchive doc; OR
  • load FJR -> delete specific fields -> perform fields selection and convert to WMArchive schema -> create WMArchive doc

Don't you think the very top/former model is THE correct way to fix it?

This whole code is hard to read and lack of unit tests make these changes even more challenging (well, there is 1 "unit" test that checks the whole module in shot). That's why I was advocating for an approach that would go straight to the final result - likely making it less complex.

I will try to make sense of this by Monday morning...

@@ -13,18 +13,17 @@
# convert data format under stpes["cmsRun1"/"logArch1"/"stageOut1"]["output"]
WMARCHIVE_REMOVE_OUTSIDE_LAYER = ["checksum", "dataset"]
# convert to list from str
WMARCHIVE_CONVERT_TO_LIST = ["OutputPFN", "lfn"]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know if this information could be consumed by anyone, but I'd say OutputPFN should remain in the WMA doc, no?

data = idict.get(step, {})
for key, values in data.items():
for elem in values:
for skip in ['pfn', 'InputPFN', 'OutputPFN', 'inputpfns']:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't this be defined in the WMARCHIVE_REMOVE_FIELD variable? Maybe not, because it's actually removing keys from the FJR, right? Still, this would be better to be defined at the top.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe move it up in the module under FWJR_REMOVE_STEP_FIELD?

@vkuznet
Copy link
Contributor Author

vkuznet commented Mar 25, 2022

Alan, I think you contradict yourself. Originally, you asked to avoid unnecessary memory consumption by which can only be done before the load FWJR step. Once you load it, the memory is already allocated and consumer. Therefore, your desired approach:

load FJR -> perform fields selection and convert to WMArchive schema -> create WMArchive doc

is identical to

load FJR -> delete specific fields -> perform fields selection and convert to WMArchive schema -> create WMArchive doc

because the PFNs already exists in FWJR as far as I can tell. Don't you think so? Both these above approaches will be identical in terms of memory consumption since deepcopy will take your RAM by the time it will load FJR.

I don't think that full re-write is answer here. In fact, the second commit I made does exactly what you ask for, it does not delete specific fields explicitly, it skips them, i.e. it is making perform fields selection and convert to WMArchive schema. Once again, please read what def cleanStep is doing and if you still believe that it deletes specific fields point me how.

Anyway, I think this issue require a chat since it seems to me stuck with understanding how to address it.

@amaltaro
Copy link
Contributor

because the PFNs already exists in FWJR as far as I can tell. Don't you think so? Both these above approaches will be identical in terms of memory consumption since deepcopy will take your RAM by the time it will load FJR.

Now that I spent 30min looking into this code, my conclusion is that it's not worth it to go deeper. If we do so, then this should be refactored.

Yes, I agree that everything is already loaded in memory and duplicated with the deepcopy, whatever we do now will remain in memory until things go out of scope (next cycle).

Valentin, can you please look into the comments made along the code, update it if required, and provide me with a new dump of the outcome (as you've done initially in this PR)?

@vkuznet
Copy link
Contributor Author

vkuznet commented Mar 26, 2022

Alan, I clean-up code, i.e. removed cleanup function I committed originally, moved (per your request) cleanStep function to top of the file (after module definitions), fixed docstring. The new doc can be seen here where I posted first dict output from the following command:

python test/python/WMCore_t/Services_t/WMArchive_t/DataMap_t.py

Please note, this command generates several docs which are dumped to the stdout. I checked them all for pfn appearance which is no longer there and put to the gist the first dict.

Please review the code again.

@vkuznet vkuznet requested a review from amaltaro March 26, 2022 14:43
@cmsdmwmbot
Copy link

Jenkins results:

  • Python3 Unit tests: succeeded
    • 7 tests added
    • 2 changes in unstable tests
  • Python3 Pylint check: failed
    • 1 warnings and errors that must be fixed
    • 1 warnings
    • 31 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/12960/artifact/artifacts/PullRequestReport.html

@cmsdmwmbot
Copy link

Jenkins results:

  • Python3 Unit tests: succeeded
    • 7 tests added
    • 2 changes in unstable tests
  • Python3 Pylint check: succeeded
    • 1 warnings
    • 31 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/12961/artifact/artifacts/PullRequestReport.html

@vkuznet
Copy link
Contributor Author

vkuznet commented Mar 29, 2022

Alan, do you need anything else on this issue?

Copy link
Contributor

@amaltaro amaltaro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for providing these further details, Valentin.

Looking into your gist/dict, this is the reason why I really think we should rework this document conversion:

  "LFNArray": [
    "/store/unmerged/logs/prod/2016/2/4/sryu_TaskChain_Data_wq_testt_160204_061048_5587/RECOCOSD/0000/0/7d7d41dc-cb02-11e5-833c-02163e00efd5-88-0-logArchive.tar.gz",
    "/store/unmerged/CMSSW_7_0_0_pre11/Cosmics/ALCARECO/DtCalib-RECOCOSD_TaskChain_Data_pile_up_test-v1/00000/ECCFE421-08CB-E511-9F4C-02163E017804.root",
    "/store/data/Run2011A/Cosmics/RAW/v1/000/160/960/E8099605-8853-E011-A848-0030487A18F2.root",
    "/store/unmerged/CMSSW_7_0_0_pre11/Cosmics/ALCARECO/MuAlCalIsolatedMu-RECOCOSD_TaskChain_Data_pile_up_test-v1/00000/9665EB21-08CB-E511-9F4C-02163E017804.root"
  ],
  "LFNArrayRef": [
    "outputLFNs",
    "lfn",
    "skippedFiles",
    "inputLFNs",
    "fallbackFiles"
  ],

even if that means a refactoring of the code.

If I understand it correctly - and this output reminds me a couple of the old PhEDEx APIs - where LFNArrayRef is an ordered list of attributes to be used as a reference for the LFNArray list, such that:

  • LFNArray[0] should be a outputLFNs
  • LFNArray[1] should be a lfn
  • LFNArray[2] should be a skippedFiles; and so on.

which clearly is not clear at all(!)

Maybe the expected outcome should be something like:

  "LFNArray": [
    "/store/unmerged/logs/prod/2016/2/4/sryu_TaskChain_Data_wq_testt_160204_061048_5587/RECOCOSD/0000/0/7d7d41dc-cb02-11e5-833c-02163e00efd5-88-0-logArchive.tar.gz",
    "/store/unmerged/CMSSW_7_0_0_pre11/Cosmics/ALCARECO/DtCalib-RECOCOSD_TaskChain_Data_pile_up_test-v1/00000/ECCFE421-08CB-E511-9F4C-02163E017804.root",
    "/store/unmerged/CMSSW_7_0_0_pre11/Cosmics/ALCARECO/MuAlCalIsolatedMu-RECOCOSD_TaskChain_Data_pile_up_test-v1/00000/9665EB21-08CB-E511-9F4C-02163E017804.root"
  ],
  "LFNArrayRef": [
    "outputLFNs",
  ],

thus, reporting only the output files for this job (2 ALCARECO files and the logArchive).

@vkuznet
Copy link
Contributor Author

vkuznet commented Mar 29, 2022

Alan, this getting longer and longer list of changes you desire to have which have nothing to do with original goal of this PR. If you agreed that it is not possible to address the memory footprint due to deepcopy I really suggest to come back to my original cleanup function which just strip out PFNs from final document. Everything else, if necessary, should go in my opinion into separate PRs/issues. I don't want to waste time on refactoring of something which we may or may not use, and making gigantic changes and refactoring code to the issues which far beyond of the scope of original problem on a monitoring side. Don't you agree?

My original code was only stripping off PFNs from final doc, how the rest of the structure is handling should be a business of this PR.

@amaltaro
Copy link
Contributor

Valentin, my last comment does not have anything to do with memory footprint. In short, it says that the current - and/or with this patch in - schema uploaded to WMArchive is not functional. In addition to that, I still see the input LFN in your dictionary.

From the original issue description - which had some discussion and was still evolving - this:
#10879 (comment)

was one of the expected outcome/behavior for that issue.

@vkuznet
Copy link
Contributor Author

vkuznet commented Mar 29, 2022

Alan, we can't just strip out LFNs since your pointer does not address all use-cases. For instance, primary reason for keeping LFNArray is to provide logArchive look-up on HDFS via Spark job. As such, I have no idea which LFNs are required for this use-case. In addition to that I have no clue how many types of LFNs FWJR contains. Does your list:

    "outputLFNs",
    "lfn",
    "skippedFiles",
    "inputLFNs",
    "fallbackFiles"

represents all possible LFNs created, used in WMCore workflows?

Therefore, if you want to address this properly please provide the following:

  • full schema of FWJR, especially whatever attributes related to LFNs
  • please outline use-cases which we should support, e.g. does LFNArrayRef should only contain outputLFNs or any other kind which will be required by use-cases we need to support
    • use cases should cover existing WMCore and data-ops ones which (again) I have no idea where to get
  • does LFNArray should keep all used LFNs or only subset, if later, which one

In my view the LFNs and PFNs presence in WMArchive should be addressed in different PRs. This PR only address removal of PFNs which will already reduce WMArchive size. Even though it is partial solution it is better to have it first, while addressing LFNs may require much broader discussion among different groups and supporting different use-cases.

@vkuznet
Copy link
Contributor Author

vkuznet commented Apr 18, 2022

@amaltaro based on today's discussion please provide clear guidelines how to proceed with this PR. So far, it removes PFNs. But our discussion, to avoid ambiguities, it would be easy to remove LFNs as well. If this is the case, please say so and I'll adjust the PR accordingly.

@amaltaro
Copy link
Contributor

We have many FJR samples in here:
https://github.com/dmwm/WMCore/tree/master/test/python/WMCore_t/Services_t/WMArchive_t/FWJRSamples

which all date back from 6 years ago. They should be fairly representative, even though I wanted to see if we could have one with skippedFiles.

In any case, would you be able to provide us with the following samples:

  • ProcessingSuccessFwjr.json
  • MergeSuccessFwjr.json

converted to WMArchive format with a) this patch applied; and b) without this patch? So, 4 json files in total.

@vkuznet
Copy link
Contributor Author

vkuznet commented Apr 18, 2022

@vkuznet
Copy link
Contributor Author

vkuznet commented Apr 18, 2022

And, here is test code to produce the aforementioned data:

#!/usr/bin/env python

import sys
import json
from WMCore.Services.WMArchive.DataMap import createArchiverDoc

fwjr = json.load(open(sys.argv[1]))
doc = {'id': 123, 'doc': fwjr}
data = createArchiverDoc(doc)
print(json.dumps(data))

The code can be run as following (assuming you save it as test-wma.py):

./test-wma.py test/python/WMCore_t/Services_t/WMArchive_t/FWJRSamples/MergeSuccessFwjr.json | jq > MergeSuccessFwjr-pfns.json

@amaltaro
Copy link
Contributor

Valentin, as discussed on Monday, this is indeed removing the PFNs from the output document and it's good to go. Eventually we should review how the list of files (input/output, lfns/pfns) is structured in the WMArchive document.

We should backport this fix to 2.0.2_wmagent and patch all the production agents running the latest version.

@amaltaro amaltaro merged commit 8c6466b into dmwm:master Apr 21, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Big jsons uploaded to WMArchive
3 participants