-
Notifications
You must be signed in to change notification settings - Fork 108
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
speed up listDatasetFileDetails API #11099
Conversation
@amaltaro , @todor-ivanov , @klannon this PR speed up single (most expensive) DBS3Reader API by factor of 5 or more. The code shows that concurrent execution of calls to upstream server can provide significant improvement. I suggest that you study this pattern which can be applied to different parts of WMCore codebase where multiple calls to upstream server (DBS, Rucio, etc.) will be required. Even thought it is not in Q2 plan I decided to put this forward as I obtained significant improvements can be made which entire system can benefit from and (more importantly) will allow to work with large datasets and avoid timeouts on CMSWEB. |
Jenkins results:
|
2a48633
to
5a2ac3c
Compare
Jenkins results:
|
@amaltaro , on my second commit I got weird unit test failure which I'm 100% sure is not related to my changes but I want to point you to it since it may show how this unit test became unstable. Please see https://cmssdt.cern.ch/dmwm-jenkins/job/DMWM-WMCore-PR-test/13026/testReport/junit/WMCore_t.Services_t.UUID_t/UUIDTest/testUUID/ which shows:
I think it is random error since UUID may be close enough during random process generation. Moreover, the message of test is wrong. I suggest that we address this unit issue separately. I'll trigger test again. |
test this please |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Valentin, even though this looks like a great performance boost, I do not think this is the right implementation.
Why don't we use the pycurl_manager module for this? We already have some DBS concurrent APIs implemented in this module (inherited from your work porting Unified features into WMCore):
https://github.com/dmwm/WMCore/blob/master/src/python/WMCore/MicroService/Tools/Common.py
and I think the best way forward would be to actually create a new module under:
WMCore/Services/DBS/youNameIt.py
and put this code in there, if possible relying on pycurl_manager only instead of requests
.
If you see a need to use the straight forward requests
library, then I think we could have a chat first before putting all this effort in.
Please let me know your thoughts (BTW, it's a holiday here, so you might hear back from me only on Monday).
Jenkins results:
|
@amaltaro I explained in description that |
ok, I refactored the code using
Bottom line, I would like to keep around concurrent.features code since it demonstrates new paradigm on how concurrent programming should be done. This concept can be applied to different parts of WMCore which may benefit from it. The |
New best value I achieved with |
Jenkins results:
|
3cf4f31
to
15d0454
Compare
Jenkins results:
|
15d0454
to
7ebeaf3
Compare
Jenkins results:
|
7ebeaf3
to
f8970cb
Compare
I re-factor the code and generalize it. The new API |
Jenkins results:
|
f8970cb
to
b337ce1
Compare
Alan, let me answer your questions in order they appear. For clarify I'll repeat the list:
|
test this please |
Jenkins results:
|
Jenkins results:
|
ca701ec
to
745a33d
Compare
Jenkins results:
|
Jenkins results:
|
61f6787
to
f07765c
Compare
Jenkins results:
|
4b071e9
to
13a1e92
Compare
Jenkins results:
|
Jenkins results:
|
ffcc61e
to
d318762
Compare
Jenkins results:
|
@amaltaro , I made necessary changes and replied to all your comments (some of them were unclear and I left my questions). Meanwhile, I run code through |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Valentin, this looks good to me. I'd suggest to add the docstring I mentioned in the code though.
In addition to that, please:
- next time, please keep in mind that test/* changes should NOT go together with the source code changes, i.e., they must be provided in different commits;
- please squash these commits in a single one (if test was separated, we should squash them in 2 commits instead).
Thanks
e1fdc83
to
e2bfb55
Compare
Jenkins results:
|
c987a77
to
dfc5272
Compare
@amaltaro now you have it squashed and rebased. |
Jenkins results:
|
test this please |
Jenkins results:
|
@@ -72,17 +74,18 @@ class DBS3Reader(object): | |||
General API for reading data from DBS | |||
""" | |||
|
|||
def __init__(self, url, logger=None, **contact): | |||
def __init__(self, url, logger=None, parallel=None, **contact): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you please create a docstring for this class and specify what parallel
is for?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks good to me
Fixes #11098
Status
ready
Description
Use concurrent features to speed up
listDatasetFileDetails
API. In my local setup I achieve speed up by factor of 5 or more for a specific dataset. Here are benchmark numbers for using/VBF1Parked/Run2012D-v1/RAW
dataset. This dataset contains 594 blocks. The current codebase takes approximately 190 seconds to fetch its parents and file lumis. Using proposed solution I achieve the following numbers:Internally, I used
requests
python library instead ofpycurl_manager.py
. The latter is not suitable for concurrent execution since it is not thread safe as it holds global curl object. As such curl options are set on first tasks, but can't be changed (since code set them up) in others until first task is completed. Therequests
library does not depend on global curl object and curl library itself, and it is suitable for concurrent execution of multiple URL calls.This PR only addresses speed up of single
listDatasetFileDetails
DBS3Reader API but other APIs where multiple (sequential) calls to DBS are made can benefit from this approach.Is it backward compatible (if not, which system it affects?)
YES
Related PRs
dmwm/dbs2go#5
External dependencies / deployment changes
relies on Python
requests
library