Fix WMTiming metrics in FJR pkl file #11726

vkuznet · 2023-09-19T12:17:33Z

Fixes #11718 (follow up from #11604)
Complement to #11656

Status

ready

Description

Fixes propagation of WMMetrics into Report pkl files

Is it backward compatible (if not, which system it affects?)

YES

Related PRs

#11656

External dependencies / deployment changes

vkuznet · 2023-09-19T12:20:42Z

I generalize adding WMTiming metrics to all Report.pkl files within condor job execution (this is done via find command which finds all Report.pkl file and adds WMTiming metrics to it).
the exit status 2 refers to missing Timestamps.py module since it is part of WMCore.zip shipped to the node. Therefore, I relocated this script under etc area of WMCore such that it will be available on worker node
Please note that on worker node we ship WMCore.zip which is appended to PYTHONPATH but since it is not guaranteed that unzip will be available on worker node we can't extract anything from this archive.
I fixed logic of Timestamps.py module which pass config into helper function but it is passed by value, i.e. it is local copy, and never being updated. I also relaxed condition to compare sizes as os.path.getsize is not perfect to determine size of the Python objects and added config may fit into padding of existing config Python object.

Here is a link to recent Report.0.pkl produced by one of the jobs I run. It properly contains WMTiming metrics in it, see

...
FrameworkJobReport.section_('WMTiming')
FrameworkJobReport.WMTiming.WMJobStart = 1695218580.0
FrameworkJobReport.WMTiming.WMTotalWallClockTime = 15.0
FrameworkJobReport.WMTiming.WMJobEnd = 1695218595.0
...

and it is properly appears in couchDB FJR (see curl xxx:xxx@localhost:5984/wmagent_jobdump%2Ffwjrs/1308-0 | jq):

    "WMTiming": {
      "WMJobStart": 1695218580,
      "WMTotalWallClockTime": 15,
      "WMJobEnd": 1695218595
    },

You can find this report over here.

And, finally I see now documents in OpenSearch/ES, e.g. here is one with WMTiming metrics.

cmsdmwmbot · 2023-09-19T12:30:26Z

Jenkins results:

Python3 Unit tests: failed
- 1 new failures
- 2 tests no longer failing
- 2 changes in unstable tests
Python3 Pylint check: succeeded
Pylint py3k check: succeeded
Pycodestyle check: succeeded

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/14489/artifact/artifacts/PullRequestReport.html

vkuznet · 2023-09-19T12:35:44Z

I found that one unit test fails:

File "/build/cmsbld/jenkins/workspace/DMWM-WMCorePy3-PR-unittests/SLICE/2/label/cms-dmwm-cc7/code/test/python/WMCore_t/WMSpec_t/Steps_t/Executors_t/CMSSW_t.py", line 264, in testSubprocessTime
    self.assertTrue(userTime1 > userTime0)

which is random failure as it depends on times. It will be trivial to fix this test by changing userTime1 > userTime0 condition to userTime1 >= userTime0 but it is out of scope of this PR. Please let me know if I should do it.

amaltaro · 2023-09-19T18:15:58Z

I generalize adding WMTiming metrics to all Report.pkl files within condor job execution (this is done via find command which finds all Report.pkl file and adds WMTiming metrics to it).

I don't know if you still has this code in, but this is an extremely wild workaround that I strongly disagree.

Given that we have already been working on this and learning many things, let us complete this phase and have the right fix at the right place. If there is really no other option, then we can re-evaluate something like that.

vkuznet · 2023-09-19T19:35:48Z

Alan, I never understand your logic of processing PRs, for those which does not have label in progress you complain about it, for those which does have this label you also review when are not explicitly asked for it. I don't understand it. This is work in progress where I collect my findings and make diary once I progress. If you want to review then wait for a request (this is the rule you insist :).

cmsdmwmbot · 2023-09-19T19:46:22Z

Jenkins results:

Python3 Unit tests: failed
- 1 new failures
- 1 tests deleted
- 1 tests no longer failing
- 2 tests added
- 1 changes in unstable tests
Python3 Pylint check: succeeded
Pylint py3k check: succeeded
Pycodestyle check: succeeded

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/14491/artifact/artifacts/PullRequestReport.html

amaltaro · 2023-09-19T21:32:39Z

Oh, it wasn't really a review. I got a notification in my inbox and decided to provide my feedback on that comment before you start relying on it as a final product :-P

cmsdmwmbot · 2023-09-20T14:42:05Z

Jenkins results:

Python3 Unit tests: succeeded
- 1 tests deleted
- 2 tests no longer failing
Python3 Pylint check: succeeded
Pylint py3k check: succeeded
Pycodestyle check: succeeded

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/14493/artifact/artifacts/PullRequestReport.html

vkuznet · 2023-09-20T14:58:43Z

@amaltaro , Alan, now this PR is ready for review. Please refer to my ticket comment #11726 (comment) which provides details of my investigation, code changes and reference to pkl json documents. Also, please refer to this comment #11726 (comment) if you want me to fix unstable CMSSW_t.py unit test (which comes and goes based on run conditions).

amaltaro · 2023-09-20T15:05:51Z

@vkuznet Valentin, yes, please provide a fix to the unstable unit test. Feel free to add it to this PR as a separate commit.

vkuznet · 2023-09-20T15:19:32Z

done withCMSSW_t.py too, will see how jenkins build will go.

cmsdmwmbot · 2023-09-20T15:29:04Z

Jenkins results:

Python3 Unit tests: succeeded
- 1 tests deleted
- 2 tests no longer failing
- 2 changes in unstable tests
Python3 Pylint check: failed
- 4 warnings and errors that must be fixed
- 4 warnings
- 22 comments to review
Pylint py3k check: succeeded
Pycodestyle check: succeeded
- 2 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/14494/artifact/artifacts/PullRequestReport.html

todor-ivanov

Thanks @vkuznet,

Modulo my comment about one redundant line, this looks good to me.

etc/submit_py3.sh

amaltaro

Please add the WMTiming section to the right Report pickle file, instead of overdoing it and adding it to every single report file that you can find in the node scratch area.

In addition, searching for the Timestamps.py file isn't optimal. Let us make it simple please and just stick to $pythonCommand Utils/Timestamps.py, provided that this path has been properly tested.

vkuznet · 2023-09-21T11:29:21Z

Alan, please clearly define what is a right file. So far I have no clue what it is. Based on my finding I see that we create multiple ones, one with Report.N.pkl where N is numerical, at top level area, others in WMTaskSpace. Based on this I did the right thing, i.e. I only modify Report.N.pkl, see

files=`find . -name "Report*.pkl" | grep -v WMTaskSpace`

but I do not touch report files in WMTaskSpace since I figured out they are intermediate ones. So, how many report files we have, or what is a range of N. So far I saw Report.0.pkl, but can we have Report.1.pkl and so on I have no idea. May be it depends on job kind or job processing status. Until you provide clear definition this is the best I can do.

Said that, now as we settle on location of Timestamps.py I can fix its path (again assuming that this path will not be changed in a future).

cmsdmwmbot · 2023-09-21T11:37:09Z

Jenkins results:

Python3 Unit tests: succeeded
- 1 tests deleted
Python3 Pylint check: failed
- 4 warnings and errors that must be fixed
- 4 warnings
- 22 comments to review
Pylint py3k check: succeeded
Pycodestyle check: succeeded
- 2 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/14496/artifact/artifacts/PullRequestReport.html

amaltaro · 2023-09-21T12:55:01Z

@vkuznet Valentin, the digit part of the Report.*.pkl file follows the actual WMAgent job retry (which is either one of the classads or a command line argument passed to the job wrapper). In other words, in a given grid job, we should NEVER see a Report.0.pkl and Report.1.pkl.

I see you already ruled out the report files under the WMTaskSpace), so we are one step closer to finally finding out which is the actual file that needs to be modified.

Please refer to my comment for further details: #11718 (comment) , which I believe our discussion was pointing that you would investigate which job report needs to be modified and make the proper fix.

vkuznet · 2023-09-21T13:39:16Z

Alan, you already confirmed that we may have multiple files, i.e. job may have multiple retries. In my view, each report pkl file in this case should contains this metrics. Once again I do not posses full knowledge about all possible use-cases of WMA workflows. I have no idea how many files should be create in different jobs, and therefore if you want to identify right series of files (since we may have multiple job retries) we need to go through all possible job workflows and identify for each workflow corresponding pkl file. This, I fear, will take long time to process and identify all use-case. Moreover, I expect that new cases may arise in a future. Bottom line, I think I provide THE BEST approach to identify top level report pkl files without subtasks accounting for multiple job retires. If you still disagree please provide the following:

outline each possible job workflow we have and provide corresponding input json for injection such that we can run those workflows and identify which files are created
please answer the question which pkl file should contain WMTiming metrics which are not since this metrics are at top level they can be applicable to all possible use-cases, e.g. full report, each individual step report, to log collection or archiving or storage steps. I do not know from monitoring point of view if we want to have timing metrics for each step, it can be, or not. Do we need these metrics for successful or aborted jobs, if latter, do we need to distinguish abort cases, e.g. failure to stage out, or failure of CMSSW job or failure of WMA component, and if this is the case do we write individual pkl files , if so which ones?

Once you clarify that I can start digging further, but I outlined that this will take lots of time, we may not cover all possible use-cases and we can't foresee all future use-cases. I still think that current changes cover everything we need without any additional overhead (neither at execution time since find comment is very precise to look for Report.*.pkl pattern without sub-tasks) or overhead of adding new metrics to pkl files since metrics are at top level and report pkl file I tested show appropriate Framework steps.

amaltaro · 2023-09-21T15:19:57Z

Valentin, please forget the job retry and re-read my previous comment.

The missing piece for WMTiming is to identify what is the actual report that needs to have this information? Again, please refer to this comment as well: #11718 (comment)

And the locations of job report is (copy/pasting from the other comment):

/srv/job/Report.0.pkl
/srv/Report.0.pkl # copied from WMTaskSpace/Report*.pkl at the end of the job wrapper
/srv/job/WMTaskSpace/Report.0.pkl
/srv/job/WMTaskSpace/STEP_NAME/Report.pkl # like cmsRun1, cmsRun2, logArch1, stageOut1

given that you are filtering out WMTaskSpace in your command, we can conclude that either /srv/Report.0.pkl or /srv/job/Report.0.pkl is responsible for carrying the WMTiming section. Which one is it?

Whether it is Report.0.pkl, or Report.1.pkl, or what, there is no need to run any find based discovery here. We have this information defined in the submit_py3 wrapper as:

outputFile="Report.$RETRY_NUM.pkl"
touch $outputFile

vkuznet · 2023-09-26T12:26:15Z

Alan, I identified that last report pkl file is the document to be used. It is copied from WMTaskSpace/Report*.pkl to top level destination. I adjusted the code (no find command anymore) and I rerun jobs. The data appears in MONIT, e.g. see this document. Or, you may check local couchDB using this workflow wmagent_small_MC_Agent0291_230921_215343_5650, e.g.

curl -s xxx:xxx@localhost:5984/wmagent_jobdump%2Ffwjrs/1363-0 | jq

cmsdmwmbot · 2023-09-26T12:30:03Z

Jenkins results:

Python3 Unit tests: failed
- 1 new failures
- 1 tests deleted
- 1 changes in unstable tests
Python3 Pylint check: failed
- 4 warnings and errors that must be fixed
- 4 warnings
- 22 comments to review
Pylint py3k check: succeeded
Pycodestyle check: succeeded
- 2 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/14514/artifact/artifacts/PullRequestReport.html

amaltaro

@vkuznet Valentin, please investigate this deleted (?) unit test reported by jenkins:

    WMCore_t.WMRuntime_t.Scripts_t.Timestamps_t.TimestampsTest:testAddTimestamps (success) was deleted

It has not been deleted, so you probably have a broken import or something like that.

vkuznet · 2023-09-26T13:45:32Z

yes, it is because we moved Timestamps.py from runtime to utils area. I fixed the test and move it too to test/python/Utils_t area.

cmsdmwmbot · 2023-09-26T13:54:36Z

Jenkins results:

Python3 Unit tests: succeeded
- 1 tests deleted
- 9 tests no longer failing
- 1 tests added
Python3 Pylint check: failed
- 4 warnings and errors that must be fixed
- 4 warnings
- 24 comments to review
Pylint py3k check: succeeded
Pycodestyle check: succeeded
- 2 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/14515/artifact/artifacts/PullRequestReport.html

amaltaro

@vkuznet Valentin, instead of moving all those lines around in the job wrapper, I think we should simply pass the correct --reportFile value when executing Timestamps.py (see comment along the code).

etc/submit_py3.sh

vkuznet · 2023-09-26T19:03:43Z

With newly applied changes I do see WMTiming in WMArchvie, see this fresh document

cmsdmwmbot · 2023-09-26T19:14:11Z

Jenkins results:

Python3 Unit tests: succeeded
- 1 tests deleted
- 9 tests no longer failing
- 1 tests added
- 1 changes in unstable tests
Python3 Pylint check: failed
- 4 warnings and errors that must be fixed
- 4 warnings
- 24 comments to review
Pylint py3k check: succeeded
Pycodestyle check: succeeded
- 2 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/14517/artifact/artifacts/PullRequestReport.html

amaltaro · 2023-09-26T19:49:01Z

@vkuznet Valentin, it's great that it works, but this is not what I suggested. I would rather not change how the wrapper script is organized right now. You should be able to accomplish the same results with simpler changes like:

--- a/etc/submit_py3.sh
+++ b/etc/submit_py3.sh
@@ -213,8 +213,10 @@ echo -e "======== WMAgent Run the job FINISH at $(TZ=GMT date) ========\n"
 
 echo "======== WMAgent add timestamps metrics to FJR $outputFile ========"
 timeEndSec=`date +%s`
-reportIn=$outputFile
-$pythonCommand Timestamps.py --reportFile=$outputFile --wmJobEnd=$timeEndSec --wmJobStart=$timeStartSec
+reportIn=WMTaskSpace/$outputFile
+tpy=$HOME/job/Utils/Timestamps.py
+echo -e "$pythonCommand $tpy --reportFile=$reportIn --wmJobEnd=$timeEndSec --wmJobStart=$timeStartSec"
+$pythonCommand $tpy --reportFile=$reportIn --wmJobEnd=$timeEndSec --wmJobStart=$timeStartSec
 status=$?
 if [ "$status" != "0" ]; then
     echo "WARNING: failed to update FJR with timestamps metrics, status=$status"
@@ -223,7 +225,7 @@ echo -e "======== WMAgent finished adding timestamps metrics to FJR ========\n"
 
 echo "WMAgent bootstrap: WMAgent finished the job, it's copying the pickled report"
 set -x
-cp WMTaskSpace/Report*.pkl ../
+cp WMTaskSpace/$outputFile ../
 ls -l WMTaskSpace
 ls -l WMTaskSpace/*

Could you please apply those and test real quick?

vkuznet · 2023-09-26T21:20:35Z

Alan, I made necessary changes. Here is one document based on this changes.

cmsdmwmbot · 2023-09-26T21:52:45Z

Jenkins results:

Python3 Unit tests: failed
- 1 new failures
- 1 tests deleted
- 9 tests no longer failing
- 5 tests added
Python3 Pylint check: failed
- 4 warnings and errors that must be fixed
- 4 warnings
- 24 comments to review
Pylint py3k check: succeeded
Pycodestyle check: succeeded
- 2 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/14519/artifact/artifacts/PullRequestReport.html

vkuznet · 2023-09-26T23:13:47Z

test this please

cmsdmwmbot · 2023-09-26T23:24:18Z

Jenkins results:

Python3 Unit tests: succeeded
- 1 tests deleted
- 9 tests no longer failing
- 5 tests added
Python3 Pylint check: failed
- 4 warnings and errors that must be fixed
- 4 warnings
- 24 comments to review
Pylint py3k check: succeeded
Pycodestyle check: succeeded
- 2 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/14521/artifact/artifacts/PullRequestReport.html

amaltaro

Thanks Valentin! It would have been better to have squashed the 2nd and 3rd commit, but I will just proceed and put it in in the upcoming release.

vkuznet added the PR: Work in progress label Sep 19, 2023

vkuznet self-assigned this Sep 19, 2023

vkuznet changed the title ~~Fix wma wmtiming~~ Fix WMTiming metrics in FJR pkl file Sep 19, 2023

vkuznet force-pushed the fix-wma-wmtiming branch from 30dd66f to 58db70c Compare September 19, 2023 19:32

vkuznet force-pushed the fix-wma-wmtiming branch from 58db70c to 4bcff1b Compare September 20, 2023 14:31

vkuznet requested a review from amaltaro September 20, 2023 14:58

vkuznet removed the PR: Work in progress label Sep 20, 2023

vkuznet requested a review from todor-ivanov September 20, 2023 14:59

todor-ivanov approved these changes Sep 20, 2023

View reviewed changes

etc/submit_py3.sh Outdated Show resolved Hide resolved

amaltaro requested changes Sep 21, 2023

View reviewed changes

amaltaro mentioned this pull request Sep 25, 2023

WMCore v2.2.4 Validation #11632

Closed

vkuznet force-pushed the fix-wma-wmtiming branch from 380fb0e to 9f5b3d2 Compare September 26, 2023 12:16

vkuznet requested a review from amaltaro September 26, 2023 12:26

amaltaro requested changes Sep 26, 2023

View reviewed changes

vkuznet requested a review from amaltaro September 26, 2023 14:13

amaltaro requested changes Sep 26, 2023

View reviewed changes

etc/submit_py3.sh Outdated Show resolved Hide resolved

amaltaro reviewed Sep 26, 2023

View reviewed changes

etc/submit_py3.sh Outdated Show resolved Hide resolved

vkuznet force-pushed the fix-wma-wmtiming branch from b734706 to 57d470a Compare September 26, 2023 19:02

vkuznet requested a review from amaltaro September 26, 2023 19:17

vkuznet added 3 commits September 26, 2023 17:19

Update WMTiming metrics in given pkl file

e564f90

Fix unit test timing issue

a319e66

Fix Timestamps unit test

a88b2a2

vkuznet force-pushed the fix-wma-wmtiming branch from 57d470a to a88b2a2 Compare September 26, 2023 21:19

amaltaro approved these changes Sep 27, 2023

View reviewed changes

amaltaro merged commit cdfa94d into dmwm:master Sep 27, 2023

amaltaro mentioned this pull request Sep 27, 2023

Correct WMTiming metrics to be collected at the job wrapper #11718

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix WMTiming metrics in FJR pkl file #11726

Fix WMTiming metrics in FJR pkl file #11726

vkuznet commented Sep 19, 2023 •

edited by amaltaro

Loading

vkuznet commented Sep 19, 2023 •

edited

Loading

cmsdmwmbot commented Sep 19, 2023

vkuznet commented Sep 19, 2023

amaltaro commented Sep 19, 2023

vkuznet commented Sep 19, 2023

cmsdmwmbot commented Sep 19, 2023

amaltaro commented Sep 19, 2023

cmsdmwmbot commented Sep 20, 2023

vkuznet commented Sep 20, 2023 •

edited

Loading

amaltaro commented Sep 20, 2023

vkuznet commented Sep 20, 2023

cmsdmwmbot commented Sep 20, 2023

todor-ivanov left a comment

amaltaro left a comment

vkuznet commented Sep 21, 2023

cmsdmwmbot commented Sep 21, 2023

amaltaro commented Sep 21, 2023

vkuznet commented Sep 21, 2023

amaltaro commented Sep 21, 2023

vkuznet commented Sep 26, 2023

cmsdmwmbot commented Sep 26, 2023

amaltaro left a comment

vkuznet commented Sep 26, 2023

cmsdmwmbot commented Sep 26, 2023

amaltaro left a comment

vkuznet commented Sep 26, 2023

cmsdmwmbot commented Sep 26, 2023

amaltaro commented Sep 26, 2023

vkuznet commented Sep 26, 2023

cmsdmwmbot commented Sep 26, 2023

vkuznet commented Sep 26, 2023

cmsdmwmbot commented Sep 26, 2023

amaltaro left a comment

Fix WMTiming metrics in FJR pkl file #11726

Fix WMTiming metrics in FJR pkl file #11726

Conversation

vkuznet commented Sep 19, 2023 • edited by amaltaro Loading

Status

Description

Is it backward compatible (if not, which system it affects?)

Related PRs

External dependencies / deployment changes

vkuznet commented Sep 19, 2023 • edited Loading

cmsdmwmbot commented Sep 19, 2023

vkuznet commented Sep 19, 2023

amaltaro commented Sep 19, 2023

vkuznet commented Sep 19, 2023

cmsdmwmbot commented Sep 19, 2023

amaltaro commented Sep 19, 2023

cmsdmwmbot commented Sep 20, 2023

vkuznet commented Sep 20, 2023 • edited Loading

amaltaro commented Sep 20, 2023

vkuznet commented Sep 20, 2023

cmsdmwmbot commented Sep 20, 2023

todor-ivanov left a comment

Choose a reason for hiding this comment

amaltaro left a comment

Choose a reason for hiding this comment

vkuznet commented Sep 21, 2023

cmsdmwmbot commented Sep 21, 2023

amaltaro commented Sep 21, 2023

vkuznet commented Sep 21, 2023

amaltaro commented Sep 21, 2023

vkuznet commented Sep 26, 2023

cmsdmwmbot commented Sep 26, 2023

amaltaro left a comment

Choose a reason for hiding this comment

vkuznet commented Sep 26, 2023

cmsdmwmbot commented Sep 26, 2023

amaltaro left a comment

Choose a reason for hiding this comment

vkuznet commented Sep 26, 2023

cmsdmwmbot commented Sep 26, 2023

amaltaro commented Sep 26, 2023

vkuznet commented Sep 26, 2023

cmsdmwmbot commented Sep 26, 2023

vkuznet commented Sep 26, 2023

cmsdmwmbot commented Sep 26, 2023

amaltaro left a comment

Choose a reason for hiding this comment

vkuznet commented Sep 19, 2023 •

edited by amaltaro

Loading

vkuznet commented Sep 19, 2023 •

edited

Loading

vkuznet commented Sep 20, 2023 •

edited

Loading