WMAgent fails to inject some files into rucio #9763
Comments
@FernandoGarzon found this initially while working through the filemismatch backlog in Unified.
Hi @nsmith- @FernandoGarzon, thanks for reporting it. which basically report the same temporary inconsistency between DBS and PhEDEx. I'll look into it once I finish the bug-fix for input data placement. Meanwhile, please keep this ticket up to date in case you see that things went into a consistent state.
Hi @amaltaro, rucio:
and here are the workflows that are stuck because of it:
Hi @amaltaro, here are a few more examples:
All of these are valid in DBS.
Here are a few files that are valid in DBS and in Rucio but don't have a physical replica yet:
Nick, Fernando, @nsmith- @FernandoGarzon I just had a look at vocms0251 RucioInjector logs and I see tons of failures since June 20, e.g.:
This seems to be related to the changes Eric asked us to validate a week or two ago (which somehow went fine in testbed). In short, it seems none of the NANO data has managed to get injected into Rucio since the Lexicon DID validation was put in place. Apologies for not spotting it before.
Uh, it looks to me like the adler32 is what is failing validation. Why is it not 8 characters? I would guess a leading zero is being dropped somewhere.
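The dropped-leading-zero guess is easy to reproduce in isolation. The sketch below is not the WMCore code: it only assumes the checksum is rendered with an unpadded hex format, and the "exactly 8 hex digits" rule in the hypothetical `looks_like_valid_adler32` helper is inferred from this thread, not taken from the actual Lexicon check.

```python
import re
import zlib

# Hypothetical validator: assumes a valid adler32 string is exactly 8 hex digits.
ADLER32_RE = re.compile(r"^[0-9a-fA-F]{8}$")


def looks_like_valid_adler32(cksum):
    """Return True if cksum is an 8-character hexadecimal string."""
    return bool(ADLER32_RE.match(cksum))


# A tiny input gives a small 32-bit value, so its hex form starts with zeros.
value = zlib.adler32(b"a") & 0xFFFFFFFF

unpadded = "%x" % value   # plain hex formatting drops leading zeros
padded = "%08x" % value   # zero-padded to a fixed width of 8 characters

print(unpadded, looks_like_valid_adler32(unpadded))
print(padded, looks_like_valid_adler32(padded))
```

Any checksum whose top hex digit is 0 would be affected the same way, which is roughly one file in sixteen if the values are evenly spread.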
I've been looking at how this adler32 checksum gets calculated, and I initially thought it came from the CMSSW framework. That does not seem to be the case, so I'm investigating other parts of the code.
As far as I can tell, the adler32 field validation has been in place since the beginning. I'm looking at what's been injected into Rucio by WMCore, and it seems new files with an adler32 starting with the digit 0 continue to be injected.
Alright, I think I found exactly where the adler32 checksum gets calculated, here: Now that we know where to look, I transferred one of the files that fails to get inserted into Rucio
and the file is here:
And here is my script - which executes
As for the solution, I'm going to change that function so that it adds leading zeroes and always produces an 8-character checksum. In addition, we also need to patch the RucioInjector component to add leading zeroes to what has already been calculated and persisted in the database. @nsmith- do you see any problem with this approach?
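For the already persisted values, the padding step could look roughly like the sketch below. `pad_adler32` is a hypothetical helper name, not a function from WMCore or RucioInjector, and it assumes the stored checksum is a bare hex string with no `0x` prefix.

```python
HEX_DIGITS = set("0123456789abcdef")


def pad_adler32(cksum, width=8):
    """Left-pad a hex adler32 string with zeros up to `width` characters.

    Hypothetical helper for illustration; rejects anything that is empty,
    too long, or not purely hexadecimal instead of silently padding it.
    """
    cksum = cksum.strip().lower()
    if not cksum or len(cksum) > width or not set(cksum) <= HEX_DIGITS:
        raise ValueError("not a valid adler32 hex string: %r" % cksum)
    return cksum.zfill(width)


print(pad_adler32("620062"))    # short value gets its leading zeros back
print(pad_adler32("0abc1234"))  # already 8 characters, returned unchanged
```

Padding is idempotent, so applying it both at calculation time and again before injection should be harmless.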
No, I think this is OK. But I'm a bit confused about why the leading zeros are stripped sometimes but not others, as I see recently injected files with a leading 0.
Could it be that those are not really leading zeros, and the digit just happened to be 0?
Actually, I wonder whether this modification, for already created files, could create any sort of problem in the DM system. The reason is that DBS and Rucio will have different adler32 checksum values...
Indeed DBS has a 7-digit adler32 for your example file: https://cmsweb.cern.ch/dbs/prod/global/DBSReader/files?detail=1&logical_file_name=/store/mc/RunIIFall17NanoAODv7/SMS-T1tttt_TuneCP2_13TeV-madgraphMLM-pythia8/NANOAODSIM/PUFall17Fast_Nano02Apr2020_pilot_102X_mc2017_realistic_v8_ext1-v1/10000/5CDFB6D9-D1B1-B242-9504-F5547B3DE763.root To confirm, indeed it does have a leading zero:
I checked the files in the list from Fernando above and not all of them have this issue, but a fair number do:
I looked in TMDB and I do find a lot of examples where the adler32 is shorter than 8 characters. I can only assume that everywhere it is used (essentially just FTS) it is converted back to a proper 32-bit value. One puzzle though, it seems new files are being inserted into TMDB with checksums with leading zeros even today, just as new files are going into Rucio OK. Why does this bug not affect all agents?
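If consumers do convert the string back to a 32-bit value, a 7-character and an 8-character form of the same checksum compare equal. A minimal sketch of such a comparison, with `same_adler32` as a hypothetical name and no claim that FTS, DBS or Rucio do it this way:

```python
def same_adler32(left, right):
    """Compare two hex adler32 strings by their integer value, not as text."""
    return int(left, 16) == int(right, 16)


# The padded and unpadded spellings denote the same 32-bit number.
print(same_adler32("620062", "00620062"))
# A genuinely different checksum still compares unequal.
print(same_adler32("620062", "00620063"))
```

A plain string comparison between a DBS value and a padded Rucio value would report a mismatch for these files, which is the concern raised above.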
From what I read, adler32 is calculated with successive additions of chunks of the data. So I think one possible scenario would be that the sum goes beyond Let's first clear this problem from the system, then we can investigate the other files that are apparently missing in Rucio.
The file backlog should be cleared now. @FernandoGarzon please let us know if you see further files missing there.
@FernandoGarzon @nsmith- I haven't heard anything else here since we fixed the adler32 issue. Could you please check whether things are fine on your side and, if so, close this ticket? Thanks
Hello, I've been running a consistency check every day for the last week, and I just did a final run. I haven't found a single file with the issue described. It seems fine to me.
Thanks for confirming it, Fernando. Please reopen it if the issue pops up again in the coming days/weeks.
Impact of the bug
Bug causes DBS-rucio inconsistency.
Describe the bug
Unified detected a mismatch between DBS and Rucio (well, Rucio/PhEDEx) in the number of files for a NanoAOD output dataset. Indeed, if one compares the DBS block file listing with
we see that the following two files are in the block in DBS but not in Rucio:
Both files are present on storage at T2_US_Wisconsin, the origin site for the block.
How to reproduce it
Unknown
Expected behavior
All files in DBS should be part of the block in rucio.
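The expected behavior can be phrased as a set difference between the two file listings. The sketch below assumes both listings are already available as plain lists of LFNs; `files_missing_in_rucio` and the example paths are hypothetical and do not correspond to real files or to how Unified performs its check.

```python
def files_missing_in_rucio(dbs_lfns, rucio_lfns):
    """Return the LFNs listed in DBS for a block but absent from Rucio."""
    return sorted(set(dbs_lfns) - set(rucio_lfns))


# Placeholder LFNs for illustration only.
dbs_listing = ["/store/example/a.root", "/store/example/b.root", "/store/example/c.root"]
rucio_listing = ["/store/example/a.root"]

for lfn in files_missing_in_rucio(dbs_listing, rucio_listing):
    print(lfn)
```

An empty result for every block is what a consistent DBS and Rucio state would produce.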