-
Notifications
You must be signed in to change notification settings - Fork 108
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix WMTiming metrics in FJR pkl file #11726
Conversation
Here is a link to recent Report.0.pkl produced by one of the jobs I run. It properly contains
and it is properly appears in couchDB FJR (see
You can find this report over here. And, finally I see now documents in OpenSearch/ES, e.g. here is one with |
Jenkins results:
|
I found that one unit test fails:
which is random failure as it depends on times. It will be trivial to fix this test by changing |
I don't know if you still has this code in, but this is an extremely wild workaround that I strongly disagree. Given that we have already been working on this and learning many things, let us complete this phase and have the right fix at the right place. If there is really no other option, then we can re-evaluate something like that. |
30dd66f
to
58db70c
Compare
Alan, I never understand your logic of processing PRs, for those which does not have label in progress you complain about it, for those which does have this label you also review when are not explicitly asked for it. I don't understand it. This is work in progress where I collect my findings and make diary once I progress. If you want to review then wait for a request (this is the rule you insist :). |
Jenkins results:
|
Oh, it wasn't really a review. I got a notification in my inbox and decided to provide my feedback on that comment before you start relying on it as a final product :-P |
58db70c
to
4bcff1b
Compare
Jenkins results:
|
@amaltaro , Alan, now this PR is ready for review. Please refer to my ticket comment #11726 (comment) which provides details of my investigation, code changes and reference to pkl json documents. Also, please refer to this comment #11726 (comment) if you want me to fix unstable |
@vkuznet Valentin, yes, please provide a fix to the unstable unit test. Feel free to add it to this PR as a separate commit. |
done with |
Jenkins results:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @vkuznet,
Modulo my comment about one redundant line, this looks good to me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add the WMTiming section to the right Report pickle file, instead of overdoing it and adding it to every single report file that you can find in the node scratch area.
In addition, searching for the Timestamps.py file isn't optimal. Let us make it simple please and just stick to $pythonCommand Utils/Timestamps.py
, provided that this path has been properly tested.
Alan, please clearly define what is a right file. So far I have no clue what it is. Based on my finding I see that we create multiple ones, one with
but I do not touch report files in Said that, now as we settle on location of |
Jenkins results:
|
@vkuznet Valentin, the digit part of the Report.*.pkl file follows the actual WMAgent job retry (which is either one of the classads or a command line argument passed to the job wrapper). In other words, in a given grid job, we should NEVER see a Report.0.pkl and Report.1.pkl. I see you already ruled out the report files under the WMTaskSpace), so we are one step closer to finally finding out which is the actual file that needs to be modified. Please refer to my comment for further details: #11718 (comment) , which I believe our discussion was pointing that you would investigate which job report needs to be modified and make the proper fix. |
Alan, you already confirmed that we may have multiple files, i.e. job may have multiple retries. In my view, each report pkl file in this case should contains this metrics. Once again I do not posses full knowledge about all possible use-cases of WMA workflows. I have no idea how many files should be create in different jobs, and therefore if you want to identify right series of files (since we may have multiple job retries) we need to go through all possible job workflows and identify for each workflow corresponding pkl file. This, I fear, will take long time to process and identify all use-case. Moreover, I expect that new cases may arise in a future. Bottom line, I think I provide THE BEST approach to identify top level report pkl files without subtasks accounting for multiple job retires. If you still disagree please provide the following:
Once you clarify that I can start digging further, but I outlined that this will take lots of time, we may not cover all possible use-cases and we can't foresee all future use-cases. I still think that current changes cover everything we need without any additional overhead (neither at execution time since find comment is very precise to look for Report.*.pkl pattern without sub-tasks) or overhead of adding new metrics to pkl files since metrics are at top level and report pkl file I tested show appropriate Framework steps. |
Valentin, please forget the job retry and re-read my previous comment. The missing piece for WMTiming is to identify what is the actual report that needs to have this information? Again, please refer to this comment as well: #11718 (comment) And the locations of job report is (copy/pasting from the other comment):
given that you are filtering out WMTaskSpace in your command, we can conclude that either Whether it is Report.0.pkl, or Report.1.pkl, or what, there is no need to run any
|
380fb0e
to
9f5b3d2
Compare
Alan, I identified that last report pkl file is the document to be used. It is copied from
|
Jenkins results:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@vkuznet Valentin, please investigate this deleted (?) unit test reported by jenkins:
WMCore_t.WMRuntime_t.Scripts_t.Timestamps_t.TimestampsTest:testAddTimestamps (success) was deleted
It has not been deleted, so you probably have a broken import or something like that.
yes, it is because we moved |
Jenkins results:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@vkuznet Valentin, instead of moving all those lines around in the job wrapper, I think we should simply pass the correct --reportFile
value when executing Timestamps.py (see comment along the code).
b734706
to
57d470a
Compare
With newly applied changes I do see WMTiming in WMArchvie, see this fresh document |
Jenkins results:
|
@vkuznet Valentin, it's great that it works, but this is not what I suggested. I would rather not change how the wrapper script is organized right now. You should be able to accomplish the same results with simpler changes like:
Could you please apply those and test real quick? |
57d470a
to
a88b2a2
Compare
Alan, I made necessary changes. Here is one document based on this changes. |
Jenkins results:
|
test this please |
Jenkins results:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks Valentin! It would have been better to have squashed the 2nd and 3rd commit, but I will just proceed and put it in in the upcoming release.
Fixes #11718 (follow up from #11604)
Complement to #11656
Status
ready
Description
Fixes propagation of WMMetrics into Report pkl files
Is it backward compatible (if not, which system it affects?)
YES
Related PRs
#11656
External dependencies / deployment changes