Do we still need parentsbyLumi API? #613
Comments
@yuyiguo I don't see I think it's safe to deprecate/disable this API. |
Hi Yuyi, let me get back to this issue. As we have recently noticed, we do rely heavily on this API. Indeed, it times out for quite a few of the blocks that we're trying to process, see WMCore issue: while trying to provide a slice of LFNs (in order to work around the frontends timeout), I found the following server bug (reported as a client bug):
Do we have another option to try to make it work? The block above has 231 files, and that's already too much to fit within 5 min. |
The problem is likely somewhere in this DAO: but I'm having a hard time reading it, and I can't access the DBS Oracle account, so I can't even play with it. |
Alan, there are a lot of messages here. In simple words, I have no solution to get this to work in 5 minutes. I tried the example blocks and file lists from this ticket with bare SQL statements and, no matter what I did, the query could not finish in 5 minutes. We have big datasets now. Using an LFN list will not reduce the search time. I think the reason is that the LFNs only limit the children, while the parents are searched across the entire dataset. The children are already much smaller than the parents because we use the block to limit them.
Details: The original SQL:
),
select distinct cid, pid from children c inner join parents p on c.R = p.R and c.L = p.L
So what I did was just remove the file list input and search for the entire block. See the query below:
),
select distinct cid, pid from children c inner join parents p on c.R = p.R and c.L = p.L
Then I added the file list back to the query as below; instead of using a subquery in a WITH clause, I put the query directly there.
; |
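To make the join in the queries above concrete, here is a minimal sketch against an in-memory SQLite database. The `file_lumis` layout used here (`file_id`, `dataset`, `block`, `run`, `lumi`), the dataset and block names, and the file ids are all hypothetical stand-ins and not the real DBS schema; only the final `select distinct cid, pid ...` join mirrors the statement quoted in the comment. As described there, the children are limited to one block while the parents span the whole parent dataset.

```python
import sqlite3

# Hypothetical toy schema: one row per (file, run, lumi). NOT the real DBS tables.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE file_lumis "
    "(file_id INTEGER, dataset TEXT, block TEXT, run INTEGER, lumi INTEGER)"
)
rows = [
    (10, "parent_ds", "parent_ds#b1", 1, 1),
    (10, "parent_ds", "parent_ds#b1", 1, 2),
    (11, "parent_ds", "parent_ds#b2", 1, 3),
    (20, "child_ds", "child_ds#b1", 1, 1),
    (20, "child_ds", "child_ds#b1", 1, 2),
    (21, "child_ds", "child_ds#b1", 1, 3),
    (22, "child_ds", "child_ds#b2", 1, 4),  # different child block, ignored below
]
conn.executemany("INSERT INTO file_lumis VALUES (?, ?, ?, ?, ?)", rows)

QUERY = """
WITH children AS (
    -- children: restricted to a single block
    SELECT file_id AS cid, run AS R, lumi AS L
    FROM file_lumis WHERE block = :child_block
),
parents AS (
    -- parents: every file of the parent dataset
    SELECT file_id AS pid, run AS R, lumi AS L
    FROM file_lumis WHERE dataset = :parent_dataset
)
SELECT DISTINCT cid, pid
FROM children c INNER JOIN parents p ON c.R = p.R AND c.L = p.L
ORDER BY cid, pid
"""

pairs = conn.execute(
    QUERY, {"child_block": "child_ds#b1", "parent_dataset": "parent_ds"}
).fetchall()
print(pairs)  # [(20, 10), (21, 11)]
```

The toy data says nothing about performance; it only shows which (child, parent) pairs the run/lumi join is meant to produce.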
I see! Thank you for looking into it, Yuyi. |
Alan, |
Alan, in order to handle the data already generated in DBS, I am thinking that we may break the listFileParentsbyLumi(block_name) API into two APIs.
Once we have both sets of data, we can find the unique matches between 1 and 2 in Python. If you think this is an approach you want to pursue, I will test whether we can get all the lumi numbers for a big dataset in 5 minutes. Yuyi |
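The client-side matching step proposed above might look roughly like the sketch below. The ticket does not spell out what the two APIs would return, so the input shape is an assumption: each call is taken to yield `(file_id, run, lumi)` tuples, one list for the child block and one for the parent dataset. The function name `match_children_to_parents` is hypothetical.

```python
from collections import defaultdict


def match_children_to_parents(child_rows, parent_rows):
    """Return sorted, unique (child_id, parent_id) pairs that share a (run, lumi).

    Both inputs are assumed to be iterables of (file_id, run, lumi) tuples;
    that shape is an illustration, not a documented DBS response format.
    """
    # Index the parent files by (run, lumi) so each child row is a dict lookup.
    parents_by_lumi = defaultdict(set)
    for pid, run, lumi in parent_rows:
        parents_by_lumi[(run, lumi)].add(pid)

    pairs = set()
    for cid, run, lumi in child_rows:
        for pid in parents_by_lumi.get((run, lumi), ()):
            pairs.add((cid, pid))
    return sorted(pairs)
```

The parent index has to fit in memory for the whole parent dataset, so the child side would probably have to be fed one block at a time to keep the footprint bounded.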
I really don't know if the proposed solution will hold up, since it seems to me that sooner or later you'll hit the 5 min limit again. I consider this a temporary fix; it does not solve the problem. If the DB can't handle the parentage load the way DMWM queries it, we can't fix it in the APIs. We either need to do something at the DB level to speed up those queries (e.g. run a procedure or function to generate this info in the background), refactor the DB to handle the parentage use case, or move to another solution (HDFS) to get the parentage. But before that, I would prefer to see a full description of WHY we need to support this use case. From the ticket description it is unclear why DMWM needs this, apart from the fact that it relies heavily on it. |
Yuyi, I'm not sure I followed your suggestion. Are you saying that we could:
And fix the parentage relationship on the application side, then inject a list of parent file ids for a given child file id? If this is what you're saying, then we should likely use block-level operations to avoid eating the whole memory when parsing it. Valentin, the problem has been reported here: and this is how we decided to solve StepChain parentage handling; because things happen asynchronously, meaning that we could insert into DBS data for a NANOAODSIM dataset, while the AODSIM hasn't even been merged yet. I don't rule out the possibility of fixing it within WMAgent, but I'm pretty sure that will be a substantial change and it can't be done within a few weeks. |
Valentin, Alan, what I proposed was a temporary solution to fix the existing data that is already in DBS.
When I proposed that solution, I thought that
Maybe we should discuss the problem more before offering any fix. |
Yuyi, it actually isn't/wasn't in our plans to modify how the StepChain parentage works because I wasn't aware of such limitations. I think the question back then, when this issue was being discussed, was: how much work would it be to get this problem fixed in WMCore and in DBS? Which is still a valid question. |
Alan, WMAgents have their own databases. Can the partial data wait in the local DB? That is, WMAgent inserts the NANOAODSIM dataset into its local database while waiting for the AODSIM to be merged, and then uploads the completed dataset/block into DBS? |
Your understanding is correct, Yuyi. The merging step is asynchronous, and so is data injection against PhEDEx/DBS. So yes, StepChain output will likely always be broken. Yes, we can definitely implement what we need in the agents. However, I'd like first to explore a DBS/database-side option, if we still can find one. FYI @todor-ivanov |
Alan, we are going to partition the DBS files and file_lumis tables. I think the queries against these two tables will improve after the partitioning. But the partitioning will take some time. How long can you wait? Kate is out of office this week. I will discuss the partitioning schedule with her next week. |
We have a cherrypy thread running every 3h: https://github.com/dmwm/WMCore/blob/master/src/python/WMCore/ReqMgr/CherryPyThreads/StepChainParentageFixTask.py even though it looks like the current cycle has been running since the cmsweb production upgrade (it has been implemented sequentially). Do you think it could hurt DBS too badly if we fix the parentage with X concurrent requests against DBS (like 10 blocks concurrently)? I believe it would be okay to wait for a few weeks, but we might be unlucky and have users looking into those samples and their parents, which will increase the priority to get it fixed. I'm also about to leave on vacation (day after tomorrow). |
Alan, if a single query takes 5 min or more, then concurrent access will make things worse, since each query uses a transaction and all the concurrent queries will block DBS.
Even though a temporary solution may be put in place, you need to fix the root of the problem. From what I read and understood in this ticket, the proper solution should be done in WMCore and not in the DBS code.
|
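For reference, the "X concurrent requests (like 10 blocks concurrently)" idea discussed above could be sketched with a bounded thread pool, as below. This is only an illustration: `fix_block` is a hypothetical callable standing in for whatever per-block DBS call the parentage fix makes, and the sketch says nothing about whether DBS can sustain that load, which is exactly the open question in this thread.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def fix_blocks_concurrently(blocks, fix_block, max_workers=10):
    """Call fix_block(block) for every block, at most max_workers at a time.

    fix_block is a hypothetical per-block callable supplied by the caller.
    Returns (results, failures): two dicts keyed by block name, so that a
    timeout on one block does not abort the rest of the cycle.
    """
    results, failures = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(fix_block, block): block for block in blocks}
        for future in as_completed(futures):
            block = futures[future]
            try:
                results[block] = future.result()
            except Exception as exc:  # e.g. a frontend timeout on this block
                failures[block] = exc
    return results, failures
```

Keeping `max_workers` as a parameter makes it easy to start at 1 (today's sequential behaviour) and raise it only if the database side confirms it is safe.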
I had the impression that select/join statements wouldn't block tables; we are also not updating anything within the same transaction, so it might not hurt the database performance. Yuyi, can you please confirm that? |
But if the select requires a full table scan, it means that N queries will stall. Anyway, it's better that Yuyi confirms.
|
@amaltaro
The listFileParentsbyLumi(block_name) API was requested by Seangchan when he had to deal with a bug in the agent that left some files without parents. We had a unit test for the API. However, most of the time this API cannot finish in 5 minutes, because the blocks and the database are much bigger than they were when we created the API. This API was only created for Seangchan to do the recovery. If there is no more recovery to do, I'd like to disable it.