Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix WMTiming metrics in FJR pkl file #11726

Merged
merged 3 commits into from
Sep 27, 2023
Merged

Conversation

vkuznet
Copy link
Contributor

@vkuznet vkuznet commented Sep 19, 2023

Fixes #11718 (follow up from #11604)
Complement to #11656

Status

ready

Description

Fixes propagation of WMMetrics into Report pkl files

Is it backward compatible (if not, which system it affects?)

YES

Related PRs

#11656

External dependencies / deployment changes

@vkuznet
Copy link
Contributor Author

vkuznet commented Sep 19, 2023

  • I generalize adding WMTiming metrics to all Report.pkl files within condor job execution (this is done via find command which finds all Report.pkl file and adds WMTiming metrics to it).
  • the exit status 2 refers to missing Timestamps.py module since it is part of WMCore.zip shipped to the node. Therefore, I relocated this script under etc area of WMCore such that it will be available on worker node
  • Please note that on worker node we ship WMCore.zip which is appended to PYTHONPATH but since it is not guaranteed that unzip will be available on worker node we can't extract anything from this archive.
  • I fixed logic of Timestamps.py module which pass config into helper function but it is passed by value, i.e. it is local copy, and never being updated. I also relaxed condition to compare sizes as os.path.getsize is not perfect to determine size of the Python objects and added config may fit into padding of existing config Python object.

Here is a link to recent Report.0.pkl produced by one of the jobs I run. It properly contains WMTiming metrics in it, see

...
FrameworkJobReport.section_('WMTiming')
FrameworkJobReport.WMTiming.WMJobStart = 1695218580.0
FrameworkJobReport.WMTiming.WMTotalWallClockTime = 15.0
FrameworkJobReport.WMTiming.WMJobEnd = 1695218595.0
...

and it is properly appears in couchDB FJR (see curl xxx:xxx@localhost:5984/wmagent_jobdump%2Ffwjrs/1308-0 | jq):

    "WMTiming": {
      "WMJobStart": 1695218580,
      "WMTotalWallClockTime": 15,
      "WMJobEnd": 1695218595
    },

You can find this report over here.

And, finally I see now documents in OpenSearch/ES, e.g. here is one with WMTiming metrics.

@cmsdmwmbot
Copy link

Jenkins results:

  • Python3 Unit tests: failed
    • 1 new failures
    • 2 tests no longer failing
    • 2 changes in unstable tests
  • Python3 Pylint check: succeeded
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/14489/artifact/artifacts/PullRequestReport.html

@vkuznet vkuznet changed the title Fix wma wmtiming Fix WMTiming metrics in FJR pkl file Sep 19, 2023
@vkuznet
Copy link
Contributor Author

vkuznet commented Sep 19, 2023

I found that one unit test fails:

File "/build/cmsbld/jenkins/workspace/DMWM-WMCorePy3-PR-unittests/SLICE/2/label/cms-dmwm-cc7/code/test/python/WMCore_t/WMSpec_t/Steps_t/Executors_t/CMSSW_t.py", line 264, in testSubprocessTime
    self.assertTrue(userTime1 > userTime0)

which is random failure as it depends on times. It will be trivial to fix this test by changing userTime1 > userTime0 condition to userTime1 >= userTime0 but it is out of scope of this PR. Please let me know if I should do it.

@amaltaro
Copy link
Contributor

I generalize adding WMTiming metrics to all Report.pkl files within condor job execution (this is done via find command which finds all Report.pkl file and adds WMTiming metrics to it).

I don't know if you still has this code in, but this is an extremely wild workaround that I strongly disagree.

Given that we have already been working on this and learning many things, let us complete this phase and have the right fix at the right place. If there is really no other option, then we can re-evaluate something like that.

@vkuznet
Copy link
Contributor Author

vkuznet commented Sep 19, 2023

Alan, I never understand your logic of processing PRs, for those which does not have label in progress you complain about it, for those which does have this label you also review when are not explicitly asked for it. I don't understand it. This is work in progress where I collect my findings and make diary once I progress. If you want to review then wait for a request (this is the rule you insist :).

@cmsdmwmbot
Copy link

Jenkins results:

  • Python3 Unit tests: failed
    • 1 new failures
    • 1 tests deleted
    • 1 tests no longer failing
    • 2 tests added
    • 1 changes in unstable tests
  • Python3 Pylint check: succeeded
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/14491/artifact/artifacts/PullRequestReport.html

@amaltaro
Copy link
Contributor

Oh, it wasn't really a review. I got a notification in my inbox and decided to provide my feedback on that comment before you start relying on it as a final product :-P

@cmsdmwmbot
Copy link

Jenkins results:

  • Python3 Unit tests: succeeded
    • 1 tests deleted
    • 2 tests no longer failing
  • Python3 Pylint check: succeeded
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/14493/artifact/artifacts/PullRequestReport.html

@vkuznet
Copy link
Contributor Author

vkuznet commented Sep 20, 2023

@amaltaro , Alan, now this PR is ready for review. Please refer to my ticket comment #11726 (comment) which provides details of my investigation, code changes and reference to pkl json documents. Also, please refer to this comment #11726 (comment) if you want me to fix unstable CMSSW_t.py unit test (which comes and goes based on run conditions).

@amaltaro
Copy link
Contributor

@vkuznet Valentin, yes, please provide a fix to the unstable unit test. Feel free to add it to this PR as a separate commit.

@vkuznet
Copy link
Contributor Author

vkuznet commented Sep 20, 2023

done withCMSSW_t.py too, will see how jenkins build will go.

@cmsdmwmbot
Copy link

Jenkins results:

  • Python3 Unit tests: succeeded
    • 1 tests deleted
    • 2 tests no longer failing
    • 2 changes in unstable tests
  • Python3 Pylint check: failed
    • 4 warnings and errors that must be fixed
    • 4 warnings
    • 22 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded
    • 2 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/14494/artifact/artifacts/PullRequestReport.html

Copy link
Contributor

@todor-ivanov todor-ivanov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @vkuznet,

Modulo my comment about one redundant line, this looks good to me.

etc/submit_py3.sh Outdated Show resolved Hide resolved
Copy link
Contributor

@amaltaro amaltaro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add the WMTiming section to the right Report pickle file, instead of overdoing it and adding it to every single report file that you can find in the node scratch area.

In addition, searching for the Timestamps.py file isn't optimal. Let us make it simple please and just stick to $pythonCommand Utils/Timestamps.py, provided that this path has been properly tested.

@vkuznet
Copy link
Contributor Author

vkuznet commented Sep 21, 2023

Alan, please clearly define what is a right file. So far I have no clue what it is. Based on my finding I see that we create multiple ones, one with Report.N.pkl where N is numerical, at top level area, others in WMTaskSpace. Based on this I did the right thing, i.e. I only modify Report.N.pkl, see

files=`find . -name "Report*.pkl" | grep -v WMTaskSpace`

but I do not touch report files in WMTaskSpace since I figured out they are intermediate ones. So, how many report files we have, or what is a range of N. So far I saw Report.0.pkl, but can we have Report.1.pkl and so on I have no idea. May be it depends on job kind or job processing status. Until you provide clear definition this is the best I can do.

Said that, now as we settle on location of Timestamps.py I can fix its path (again assuming that this path will not be changed in a future).

@cmsdmwmbot
Copy link

Jenkins results:

  • Python3 Unit tests: succeeded
    • 1 tests deleted
  • Python3 Pylint check: failed
    • 4 warnings and errors that must be fixed
    • 4 warnings
    • 22 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded
    • 2 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/14496/artifact/artifacts/PullRequestReport.html

@amaltaro
Copy link
Contributor

@vkuznet Valentin, the digit part of the Report.*.pkl file follows the actual WMAgent job retry (which is either one of the classads or a command line argument passed to the job wrapper). In other words, in a given grid job, we should NEVER see a Report.0.pkl and Report.1.pkl.

I see you already ruled out the report files under the WMTaskSpace), so we are one step closer to finally finding out which is the actual file that needs to be modified.

Please refer to my comment for further details: #11718 (comment) , which I believe our discussion was pointing that you would investigate which job report needs to be modified and make the proper fix.

@vkuznet
Copy link
Contributor Author

vkuznet commented Sep 21, 2023

Alan, you already confirmed that we may have multiple files, i.e. job may have multiple retries. In my view, each report pkl file in this case should contains this metrics. Once again I do not posses full knowledge about all possible use-cases of WMA workflows. I have no idea how many files should be create in different jobs, and therefore if you want to identify right series of files (since we may have multiple job retries) we need to go through all possible job workflows and identify for each workflow corresponding pkl file. This, I fear, will take long time to process and identify all use-case. Moreover, I expect that new cases may arise in a future. Bottom line, I think I provide THE BEST approach to identify top level report pkl files without subtasks accounting for multiple job retires. If you still disagree please provide the following:

  • outline each possible job workflow we have and provide corresponding input json for injection such that we can run those workflows and identify which files are created
  • please answer the question which pkl file should contain WMTiming metrics which are not since this metrics are at top level they can be applicable to all possible use-cases, e.g. full report, each individual step report, to log collection or archiving or storage steps. I do not know from monitoring point of view if we want to have timing metrics for each step, it can be, or not. Do we need these metrics for successful or aborted jobs, if latter, do we need to distinguish abort cases, e.g. failure to stage out, or failure of CMSSW job or failure of WMA component, and if this is the case do we write individual pkl files , if so which ones?

Once you clarify that I can start digging further, but I outlined that this will take lots of time, we may not cover all possible use-cases and we can't foresee all future use-cases. I still think that current changes cover everything we need without any additional overhead (neither at execution time since find comment is very precise to look for Report.*.pkl pattern without sub-tasks) or overhead of adding new metrics to pkl files since metrics are at top level and report pkl file I tested show appropriate Framework steps.

@amaltaro
Copy link
Contributor

Valentin, please forget the job retry and re-read my previous comment.

The missing piece for WMTiming is to identify what is the actual report that needs to have this information? Again, please refer to this comment as well: #11718 (comment)

And the locations of job report is (copy/pasting from the other comment):

  • /srv/job/Report.0.pkl
  • /srv/Report.0.pkl # copied from WMTaskSpace/Report*.pkl at the end of the job wrapper
  • /srv/job/WMTaskSpace/Report.0.pkl
  • /srv/job/WMTaskSpace/STEP_NAME/Report.pkl # like cmsRun1, cmsRun2, logArch1, stageOut1

given that you are filtering out WMTaskSpace in your command, we can conclude that either /srv/Report.0.pkl or /srv/job/Report.0.pkl is responsible for carrying the WMTiming section. Which one is it?

Whether it is Report.0.pkl, or Report.1.pkl, or what, there is no need to run any find based discovery here. We have this information defined in the submit_py3 wrapper as:

outputFile="Report.$RETRY_NUM.pkl"
touch $outputFile

@vkuznet
Copy link
Contributor Author

vkuznet commented Sep 26, 2023

Alan, I identified that last report pkl file is the document to be used. It is copied from WMTaskSpace/Report*.pkl to top level destination. I adjusted the code (no find command anymore) and I rerun jobs. The data appears in MONIT, e.g. see this document. Or, you may check local couchDB using this workflow wmagent_small_MC_Agent0291_230921_215343_5650, e.g.

curl -s xxx:xxx@localhost:5984/wmagent_jobdump%2Ffwjrs/1363-0 | jq

@vkuznet vkuznet requested a review from amaltaro September 26, 2023 12:26
@cmsdmwmbot
Copy link

Jenkins results:

  • Python3 Unit tests: failed
    • 1 new failures
    • 1 tests deleted
    • 1 changes in unstable tests
  • Python3 Pylint check: failed
    • 4 warnings and errors that must be fixed
    • 4 warnings
    • 22 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded
    • 2 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/14514/artifact/artifacts/PullRequestReport.html

Copy link
Contributor

@amaltaro amaltaro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@vkuznet Valentin, please investigate this deleted (?) unit test reported by jenkins:

    WMCore_t.WMRuntime_t.Scripts_t.Timestamps_t.TimestampsTest:testAddTimestamps (success) was deleted

It has not been deleted, so you probably have a broken import or something like that.

@vkuznet
Copy link
Contributor Author

vkuznet commented Sep 26, 2023

yes, it is because we moved Timestamps.py from runtime to utils area. I fixed the test and move it too to test/python/Utils_t area.

@cmsdmwmbot
Copy link

Jenkins results:

  • Python3 Unit tests: succeeded
    • 1 tests deleted
    • 9 tests no longer failing
    • 1 tests added
  • Python3 Pylint check: failed
    • 4 warnings and errors that must be fixed
    • 4 warnings
    • 24 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded
    • 2 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/14515/artifact/artifacts/PullRequestReport.html

@vkuznet vkuznet requested a review from amaltaro September 26, 2023 14:13
Copy link
Contributor

@amaltaro amaltaro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@vkuznet Valentin, instead of moving all those lines around in the job wrapper, I think we should simply pass the correct --reportFile value when executing Timestamps.py (see comment along the code).

etc/submit_py3.sh Outdated Show resolved Hide resolved
etc/submit_py3.sh Outdated Show resolved Hide resolved
@vkuznet
Copy link
Contributor Author

vkuznet commented Sep 26, 2023

With newly applied changes I do see WMTiming in WMArchvie, see this fresh document

@cmsdmwmbot
Copy link

Jenkins results:

  • Python3 Unit tests: succeeded
    • 1 tests deleted
    • 9 tests no longer failing
    • 1 tests added
    • 1 changes in unstable tests
  • Python3 Pylint check: failed
    • 4 warnings and errors that must be fixed
    • 4 warnings
    • 24 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded
    • 2 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/14517/artifact/artifacts/PullRequestReport.html

@vkuznet vkuznet requested a review from amaltaro September 26, 2023 19:17
@amaltaro
Copy link
Contributor

@vkuznet Valentin, it's great that it works, but this is not what I suggested. I would rather not change how the wrapper script is organized right now. You should be able to accomplish the same results with simpler changes like:

--- a/etc/submit_py3.sh
+++ b/etc/submit_py3.sh
@@ -213,8 +213,10 @@ echo -e "======== WMAgent Run the job FINISH at $(TZ=GMT date) ========\n"
 
 echo "======== WMAgent add timestamps metrics to FJR $outputFile ========"
 timeEndSec=`date +%s`
-reportIn=$outputFile
-$pythonCommand Timestamps.py --reportFile=$outputFile --wmJobEnd=$timeEndSec --wmJobStart=$timeStartSec
+reportIn=WMTaskSpace/$outputFile
+tpy=$HOME/job/Utils/Timestamps.py
+echo -e "$pythonCommand $tpy --reportFile=$reportIn --wmJobEnd=$timeEndSec --wmJobStart=$timeStartSec"
+$pythonCommand $tpy --reportFile=$reportIn --wmJobEnd=$timeEndSec --wmJobStart=$timeStartSec
 status=$?
 if [ "$status" != "0" ]; then
     echo "WARNING: failed to update FJR with timestamps metrics, status=$status"
@@ -223,7 +225,7 @@ echo -e "======== WMAgent finished adding timestamps metrics to FJR ========\n"
 
 echo "WMAgent bootstrap: WMAgent finished the job, it's copying the pickled report"
 set -x
-cp WMTaskSpace/Report*.pkl ../
+cp WMTaskSpace/$outputFile ../
 ls -l WMTaskSpace
 ls -l WMTaskSpace/*

Could you please apply those and test real quick?

@vkuznet
Copy link
Contributor Author

vkuznet commented Sep 26, 2023

Alan, I made necessary changes. Here is one document based on this changes.

@cmsdmwmbot
Copy link

Jenkins results:

  • Python3 Unit tests: failed
    • 1 new failures
    • 1 tests deleted
    • 9 tests no longer failing
    • 5 tests added
  • Python3 Pylint check: failed
    • 4 warnings and errors that must be fixed
    • 4 warnings
    • 24 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded
    • 2 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/14519/artifact/artifacts/PullRequestReport.html

@vkuznet
Copy link
Contributor Author

vkuznet commented Sep 26, 2023

test this please

@cmsdmwmbot
Copy link

Jenkins results:

  • Python3 Unit tests: succeeded
    • 1 tests deleted
    • 9 tests no longer failing
    • 5 tests added
  • Python3 Pylint check: failed
    • 4 warnings and errors that must be fixed
    • 4 warnings
    • 24 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded
    • 2 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/14521/artifact/artifacts/PullRequestReport.html

Copy link
Contributor

@amaltaro amaltaro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Valentin! It would have been better to have squashed the 2nd and 3rd commit, but I will just proceed and put it in in the upcoming release.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Correct WMTiming metrics to be collected at the job wrapper
4 participants