vocms0310.cern.ch JobSubmitter #7418

Closed
scarletnorberg opened this issue Nov 23, 2016 · 8 comments

@scarletnorberg

I think Jean Roch and I have restarted the same component twice, although I cannot be 100% sure because elog does not seem to be working.

Here is the error:

File "/data/srv/wmagent/v1.0.21.patch1/sw/slc6_amd64_gcc493/cms/wmagent/1.0.21.patch1/lib/python2.7/site-packages/WMCore/BossAir/Plugins/PyCondorPlugin.py", line 84, in submitWorker
stdout, stderr, returnCode = SubprocessAlgos.runCommand(cmd=command, shell=True, timeout=timeout)
File "/data/srv/wmagent/v1.0.21.patch1/sw/slc6_amd64_gcc493/cms/wmagent/1.0.21.patch1/lib/python2.7/site-packages/WMCore/Algorithms/SubprocessAlgos.py", line 118, in runCommand
raise SubprocessAlgoException(msg)
SubprocessAlgoException: SubprocessAlgoException
Message: Alarm sounded while running command after 400 seconds.
Command: condor_submit /data/srv/wmagent/v1.0.21.patch1/install/wmagent/JobSubmitter/submit_1418300_2772682.jdl
Raising exception
ModuleName : WMCore.Algorithms.SubprocessAlgos
MethodName : runCommand
ClassInstance : None
FileName : /data/srv/wmagent/v1.0.21.patch1/sw/slc6_amd64_gcc493/cms/wmagent/1.0.21.patch1/lib/python2.7/site-packages/WMCore/Algorithms/SubprocessAlgos.py
ClassName : None
LineNumber : 118
ErrorNr : 0

Traceback:
File "/data/srv/wmagent/v1.0.21.patch1/sw/slc6_amd64_gcc493/cms/wmagent/1.0.21.patch1/lib/python2.7/site-packages/WMCore/Algorithms/SubprocessAlgos.py", line 111, in runCommand
stdout, stderr = pipe.communicate()

File "/data/srv/wmagent/v1.0.21.patch1/sw/slc6_amd64_gcc493/external/python/2.7.11-comp/lib/python2.7/subprocess.py", line 799, in communicate
return self._communicate(input)

File "/data/srv/wmagent/v1.0.21.patch1/sw/slc6_amd64_gcc493/external/python/2.7.11-comp/lib/python2.7/subprocess.py", line 1409, in _communicate
stdout, stderr = self._communicate_with_poll(input)

File "/data/srv/wmagent/v1.0.21.patch1/sw/slc6_amd64_gcc493/external/python/2.7.11-comp/lib/python2.7/subprocess.py", line 1463, in _communicate_with_poll
ready = poller.poll()

File "/data/srv/wmagent/v1.0.21.patch1/sw/slc6_amd64_gcc493/cms/wmagent/1.0.21.patch1/lib/python2.7/site-packages/WMCore/Algorithms/Alarm.py", line 39, in alarmHandler
raise Alarm

2016-11-23 17:48:37,731:140302358836992:ERROR:SubprocessAlgos:Alarm sounded while running command after 400 seconds.
Command: condor_submit /data/srv/wmagent/v1.0.21.patch1/install/wmagent/JobSubmitter/submit_1418300_2723271.jdl
Raising exception
2016-11-23 17:48:37,734:140302358836992:ERROR:PyCondorPlugin:Critical error in subprocess while submitting to condorSubprocessAlgoException
Message: Alarm sounded while running command after 400 seconds.
Command: condor_submit /data/srv/wmagent/v1.0.21.patch1/install/wmagent/JobSubmitter/submit_1418300_2723271.jdl
Raising exception
ModuleName : WMCore.Algorithms.SubprocessAlgos
MethodName : runCommand
ClassInstance : None
FileName : /data/srv/wmagent/v1.0.21.patch1/sw/slc6_amd64_gcc493/cms/wmagent/1.0.21.patch1/lib/python2.7/site-packages/WMCore/Algorithms/SubprocessAlgos.py
ClassName : None
LineNumber : 118
ErrorNr : 0

Traceback:
File "/data/srv/wmagent/v1.0.21.patch1/sw/slc6_amd64_gcc493/cms/wmagent/1.0.21.patch1/lib/python2.7/site-packages/WMCore/Algorithms/SubprocessAlgos.py", line 111, in runCommand
stdout, stderr = pipe.communicate()

File "/data/srv/wmagent/v1.0.21.patch1/sw/slc6_amd64_gcc493/external/python/2.7.11-comp/lib/python2.7/subprocess.py", line 799, in communicate
return self._communicate(input)

File "/data/srv/wmagent/v1.0.21.patch1/sw/slc6_amd64_gcc493/external/python/2.7.11-comp/lib/python2.7/subprocess.py", line 1409, in _communicate
stdout, stderr = self._communicate_with_poll(input)

File "/data/srv/wmagent/v1.0.21.patch1/sw/slc6_amd64_gcc493/external/python/2.7.11-comp/lib/python2.7/subprocess.py", line 1463, in _communicate_with_poll
ready = poller.poll()

File "/data/srv/wmagent/v1.0.21.patch1/sw/slc6_amd64_gcc493/cms/wmagent/1.0.21.patch1/lib/python2.7/site-packages/WMCore/Algorithms/Alarm.py", line 39, in alarmHandler
raise Alarm

Traceback (most recent call last):
File "/data/srv/wmagent/v1.0.21.patch1/sw/slc6_amd64_gcc493/cms/wmagent/1.0.21.patch1/lib/python2.7/site-packages/WMCore/BossAir/Plugins/PyCondorPlugin.py", line 84, in submitWorker
stdout, stderr, returnCode = SubprocessAlgos.runCommand(cmd=command, shell=True, timeout=timeout)
File "/data/srv/wmagent/v1.0.21.patch1/sw/slc6_amd64_gcc493/cms/wmagent/1.0.21.patch1/lib/python2.7/site-packages/WMCore/Algorithms/SubprocessAlgos.py", line 118, in runCommand
raise SubprocessAlgoException(msg)
SubprocessAlgoException: SubprocessAlgoException
Message: Alarm sounded while running command after 400 seconds.
Command: condor_submit /data/srv/wmagent/v1.0.21.patch1/install/wmagent/JobSubmitter/submit_1418300_2723271.jdl
Raising exception
ModuleName : WMCore.Algorithms.SubprocessAlgos
MethodName : runCommand
ClassInstance : None
FileName : /data/srv/wmagent/v1.0.21.patch1/sw/slc6_amd64_gcc493/cms/wmagent/1.0.21.patch1/lib/python2.7/site-packages/WMCore/Algorithms/SubprocessAlgos.py
ClassName : None
LineNumber : 118
ErrorNr : 0

Please check this machine. It is green now, but since it has already been restarted twice today, I am worried it will have an issue again soon.
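
For readers unfamiliar with the mechanism in the traceback above: the exception is raised from a signal handler (`alarmHandler`) while `pipe.communicate()` is blocked waiting on `condor_submit`. The following is a minimal sketch of that pattern, not the WMCore code. The names `run_command` and `_alarm_handler` are hypothetical, it targets Python 3 on a POSIX system, and it must be called from the main thread because it installs a signal handler.

```python
import os
import signal
import subprocess


class Alarm(Exception):
    """Raised by the SIGALRM handler when the time budget is used up."""


def _alarm_handler(signum, frame):
    # Runs in the main thread and interrupts whatever call is blocking there.
    raise Alarm


def run_command(cmd, timeout):
    """Run cmd through the shell; raise RuntimeError after `timeout` seconds."""
    previous = signal.signal(signal.SIGALRM, _alarm_handler)
    # start_new_session puts the shell and its children in their own process
    # group, so the whole group can be killed if the alarm fires.
    pipe = subprocess.Popen(cmd, shell=True, stdout=subprocess.PIPE,
                            stderr=subprocess.PIPE, start_new_session=True)
    try:
        signal.setitimer(signal.ITIMER_REAL, timeout)
        stdout, stderr = pipe.communicate()
    except Alarm:
        os.killpg(pipe.pid, signal.SIGKILL)
        pipe.communicate()  # reap the process and close the pipes
        raise RuntimeError("Alarm sounded while running command after %s seconds.\n"
                           "Command: %s" % (timeout, cmd))
    finally:
        signal.setitimer(signal.ITIMER_REAL, 0)  # cancel any pending alarm
        signal.signal(signal.SIGALRM, previous)  # restore the old handler
    return stdout, stderr, pipe.returncode
```

On Python 3.3 and later, `Popen.communicate(timeout=...)` gives a similar result without signals; the sketch keeps the alarm approach only because that is what the traceback shows.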

@amaltaro
Contributor

Scarlet, I see no crashes in JobSubmitter in vocms0310. Did you want to report another node/component?

@amaltaro
Contributor

I'm updating the subject because I'm pretty sure you wanted to report vocms0130.

These failures came from condor schedd instabilities. The glidein team should have alarms in place for these cases (maybe not an actual alarm, but monitoring in gwmsmon).

I can offer you two short-term improvements:
a) crash the component and let the production team follow up on these issues with the glidein team (so no changes at all)
b) don't crash the component; simply skip it and wait for the next cycle.

I know you're probably going to ask for an alarm here. I insist the alarm should be on the glidein/global pool side.
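
To make the difference between options a) and b) concrete, here is a rough sketch. It is not taken from WMCore: `SubmitTimeout`, `run_cycle` and the `crash_on_timeout` flag are hypothetical names, and `submit` stands for any callable that performs the actual submission.

```python
import logging


class SubmitTimeout(Exception):
    """Hypothetical stand-in for the timeout error raised during submission."""


def run_cycle(submit, jobs, crash_on_timeout=False):
    """Try to submit jobs once and return the jobs that still need submitting.

    submit is any callable that takes a list of jobs and may raise SubmitTimeout.
    """
    try:
        submit(jobs)
    except SubmitTimeout as exc:
        if crash_on_timeout:
            raise  # option a): propagate the error and let the component crash
        # option b): log the problem and keep the jobs for the next cycle
        logging.warning("Submission timed out, will retry next cycle: %s", exc)
        return list(jobs)
    return []
```

With option b) the component stays up, so schedd trouble shows up only in the logs and in external monitoring, which is why the alarm question matters.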

@scarletnorberg
Author

I do not think this is exactly related, but remember I mentioned an issue where JobSubmitter was not reporting on WMstats? I think you said it was a schedd issue. It has not been happening often, but it has started happening again on vocms0304. I do not remember what came out of it, and I keep forgetting to copy the error, though I am sure it will happen again. Jean Roch wants me to put the agent in drain, which I will do now. I am not sure whether this will keep happening or start happening on another machine. Can you have a look at this agent, @amaltaro?

@amaltaro
Contributor

Scarlet, are you sure it's vocms0304 and the JobSubmitter component? I don't see a single crash in the last week, though there are several schedd unresponsive errors in the logs:

2017-04-13 20:47:18,164:140603631941376:ERROR:SimpleCondorPlugin:Failed to connect to schedd.
Traceback (most recent call last):
  File "/data/srv/wmagent/v1.1.0.patch2/sw/slc6_amd64_gcc493/cms/wmagent/1.1.0.patch2/lib/python2.7/site-packages/WMCore/BossAir/Plugins/SimpleCondorPlugin.py", line 146, in submit
    clusterId = schedd.submitMany(clusterAd, procAds)
RuntimeError: Failed to connect to schedd.

which I confirm from the numerous HTCondor alerts I received over the holidays...
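
As an illustration of how a transient "Failed to connect to schedd." error could be absorbed inside a single cycle, here is a small retry sketch. It is an assumption about one possible approach, not what SimpleCondorPlugin does; `submit_with_retry` is a hypothetical helper and `submit` is any zero-argument callable wrapping the real submission call.

```python
import time


def submit_with_retry(submit, attempts=3, delay=1.0, sleep=time.sleep):
    """Call submit() up to `attempts` times, retrying only on RuntimeError."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return submit()
        except RuntimeError as exc:
            last_error = exc
            if attempt < attempts:
                sleep(delay * attempt)  # linear back-off between attempts
    # Every attempt failed: surface the last error to the caller.
    raise last_error
```

A bounded number of attempts keeps the cycle from hanging indefinitely when the schedd stays unresponsive.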

@scarletnorberg
Author

When I saw this I thought, what holiday? But yes, I forgot CERN was on holiday. Yes, I think this is exactly related. I feel like last time I talked to you, you did something and the issue just disappeared. I am not sure what you did, or whether you actually did anything, but I figured I would try again and see what happens! I think it is a schedd issue. The component does not report an error as it normally would: it goes down with no file associated with it, and I do not think any error is shown. More details are in the page linked above.

@scarletnorberg
Author

Here are the logs for 303 and 309, 303 first, then 309:
10:19:55,185:140010949039872:INFO:JobSubmitterPoller:Have 2 packages to submit.
2017-05-14 10:19:55,185:140010949039872:INFO:JobSubmitterPoller:Have 3 jobs to submit.
2017-05-14 10:19:55,185:140010949039872:INFO:JobSubmitterPoller:Done assigning site locations.
2017-05-14 10:19:58,386:140010949039872:INFO:SimpleCondorPlugin:Done submitting jobs for this cycle in SimpleCondorPlugin
2017-05-14 10:19:58,392:140010949039872:INFO:JobSubmitterPoller:Jobs that succeeded/failed submission: 3/0.
2017-05-14 10:19:58,406:140010949039872:INFO:DashboardReporter:Handling 3 jobs
2017-05-14 10:20:00,514:140010949039872:INFO:JobSubmitterPoller:Transaction cycle successfully completed.
2017-05-14 10:22:02,750:140010949039872:INFO:JobSubmitterPoller:Refreshing priority cache with currently 19951 jobs
2017-05-14 10:22:05,320:140010949039872:INFO:JobSubmitterPoller:Found 19986 new jobs to be submitted.
2017-05-14 10:22:05,321:140010949039872:INFO:JobSubmitterPoller:Determining possible sites for new jobs...
2017-05-14 10:22:07,817:140010949039872:INFO:JobSubmitterPoller:Done pruning killed jobs, moving on to submit.
2017-05-14 10:22:09,435:140010949039872:INFO:JobSubmitterPoller:Site submission report: {'T1_IT_CNAF': Counter(
{'NoPendingSlot': 2648}
), 'T2_FR_CCIN2P3': Counter(
{'NoPendingSlot': 970}
), 'T2_CH_CERN_HLT': Counter(
{'NoPendingSlot': 1769}
), 'T1_ES_PIC': Counter(
{'NoPendingSlot': 959}
), 'T1_FR_CCIN2P3': Counter(
{'NoPendingSlot': 3818}
), 'T2_DE_RWTH': Counter(
{'NoPendingSlot': 1024}
), 'T2_US_Florida': Counter(
{'NoPendingSlot': 2312}
), 'T2_FR_GRIF_IRFU': Counter(
{'NoPendingSlot': 974}
), 'T2_US_UCSD': Counter(
{'NoPendingSlot': 1078}
), 'T2_US_Purdue': Counter(
{'NoPendingSlot': 4772}
), 'T2_UK_London_IC': Counter(
{'NoPendingSlot': 966}
), 'T2_FR_GRIF_LLR': Counter(
{'NoPendingSlot': 974}
), 'T2_US_Nebraska': Counter(
{'NoPendingSlot': 1949}
), 'T2_IT_Legnaro': Counter(
{'NoPendingSlot': 1281}
), 'T2_IT_Bari': Counter(
{'NoPendingSlot': 966}
), 'T2_ES_CIEMAT': Counter(
{'NoPendingSlot': 961}
), 'T2_FR_IPHC': Counter(
{'NoPendingSlot': 958}
), 'T1_RU_JINR': Counter(
{'NoPendingSlot': 1601}
), u'T2_US_Wisconsin': Counter(
{'NoPendingSlot': 953}
), 'T2_DE_DESY': Counter(
{'NoPendingSlot': 2436}
), 'T1_US_FNAL': Counter(
{'NoPendingSlot': 3245}
), 'T2_US_MIT': Counter(
{'NoPendingSlot': 1021}
), 'T2_BE_IIHE': Counter(
{'NoPendingSlot': 954}
), 'T2_US_Caltech': Counter(
{'NoPendingSlot': 1901}
), 'T2_UK_London_Brunel': Counter(
{'NoPendingSlot': 953, 'submitted': 2}
), 'T2_CH_CERN': Counter(
{'NoPendingSlot': 2368}
), 'T2_IT_Pisa': Counter(
{'NoPendingSlot': 1785}
), 'T2_US_Vanderbilt': Counter(
{'submitted': 3}
)}
2017-05-14 10:22:09,437:140010949039872:INFO:JobSubmitterPoller:Priority submission report: {50080000.0: Counter(
{'Total': 19}
), 90000.0: Counter(
{'Total': 13}
), 130000.0: Counter(
{'Total': 2749}
), 30080000.0: Counter(
{'Total': 48}
), 50085000.0: Counter(
{'Total': 12}
), 80000.0: Counter(
{'Total': 11154, 'submitted': 2}
), 50130000.0: Counter(
{'Total': 4101, 'submitted': 3}
), 50090000.0: Counter(
{'Total': 6}
), 30085000.0: Counter(
{'Total': 31}
), 85000.0: Counter(
{'Total': 1844}
), 30090000.0: Counter(
{'Total': 9}
)}
2017-05-14 10:22:09,438:140010949039872:INFO:JobSubmitterPoller:Have 2 packages to submit.
2017-05-14 10:22:09,438:140010949039872:INFO:JobSubmitterPoller:Have 5 jobs to submit.
2017-05-14 10:22:09,438:140010949039872:INFO:JobSubmitterPoller:Done assigning site locations.
2017-05-14 10:22:11,425:140010949039872:INFO:SimpleCondorPlugin:Done submitting jobs for this cycle in SimpleCondorPlugin
2017-05-14 10:22:11,436:140010949039872:INFO:JobSubmitterPoller:Jobs that succeeded/failed submission: 5/0.
2017-05-14 10:22:11,468:140010949039872:INFO:DashboardReporter:Handling 5 jobs
2017-05-14 10:22:13,732:140010949039872:INFO:JobSubmitterPoller:Transaction cycle successfully completed.
2017-05-14 10:24:15,302:140010949039872:INFO:JobSubmitterPoller:Refreshing priority cache with currently 19981 jobs
2017-05-14 10:24:17,911:140010949039872:INFO:JobSubmitterPoller:Found 20064 new jobs to be submitted.
2017-05-14 10:24:17,911:140010949039872:INFO:JobSubmitterPoller:Determining possible sites for new jobs...
2017-05-14 10:24:18,628:140010949039872:INFO:JobSubmitterPoller:Done pruning killed jobs, moving on to submit.
2017-05-14 10:24:20,013:140010949039872:INFO:JobSubmitterPoller:Site submission report: {'T1_IT_CNAF': Counter(
{'NoPendingSlot': 2648}
), 'T2_FR_CCIN2P3': Counter(
{'NoPendingSlot': 970}
), 'T2_CH_CERN_HLT': Counter(
{'NoPendingSlot': 1777}
), 'T1_ES_PIC': Counter(
{'NoPendingSlot': 959}
), 'T1_FR_CCIN2P3': Counter(
{'NoPendingSlot': 3818}
), 'T2_DE_RWTH': Counter(
{'NoPendingSlot': 1033}
), 'T2_US_Florida': Counter(
{'NoPendingSlot': 2316}
), 'T2_FR_GRIF_IRFU': Counter(
{'NoPendingSlot': 974}
), 'T2_US_UCSD': Counter(
{'NoPendingSlot': 1082}
), 'T2_US_Purdue': Counter(
{'NoPendingSlot': 4778}
), 'T2_UK_London_IC': Counter(
{'NoPendingSlot': 974}
), 'T2_FR_GRIF_LLR': Counter(
{'NoPendingSlot': 974}
), 'T2_US_Nebraska': Counter(
{'NoPendingSlot': 1958}
), 'T2_IT_Legnaro': Counter(
{'NoPendingSlot': 1292}
), 'T2_IT_Bari': Counter(
{'NoPendingSlot': 968}
), 'T2_ES_CIEMAT': Counter(
{'NoPendingSlot': 963}
), 'T2_FR_IPHC': Counter(
{'NoPendingSlot': 958}
), 'T1_RU_JINR': Counter(
{'NoPendingSlot': 1602}
), u'T2_US_Wisconsin': Counter(
{'NoPendingSlot': 953}
), 'T2_DE_DESY': Counter(
{'NoPendingSlot': 2436}
), 'T1_US_FNAL': Counter(
{'NoPendingSlot': 3248}
), 'T2_US_MIT': Counter(
{'NoPendingSlot': 1021}
), 'T2_BE_IIHE': Counter(
{'NoPendingSlot': 954}
), 'T2_US_Caltech': Counter(
{'NoPendingSlot': 1902}
), u'T2_UK_London_Brunel': Counter(
{'NoPendingSlot': 953}
), 'T2_CH_CERN': Counter(
{'NoPendingSlot': 2383}
), 'T2_IT_Pisa': Counter(
{'NoPendingSlot': 1785}
), 'T2_US_Vanderbilt': Counter(
{'submitted': 8}
)}
2017-05-14 10:24:20,014:140010949039872:INFO:JobSubmitterPoller:Priority submission report: {50080000.0: Counter(
{'Total': 19}
), 90000.0: Counter(
{'Total': 13}
), 130000.0: Counter(
{'Total': 2749}
), 30080000.0: Counter(
{'Total': 48}
), 50085000.0: Counter(
{'Total': 12}
), 80000.0: Counter(
{'Total': 11198, 'submitted': 6}
), 50130000.0: Counter(
{'Total': 4127, 'submitted': 1}
), 50090000.0: Counter(
{'Total': 6}
), 30085000.0: Counter(
{'Total': 31}
), 85000.0: Counter(
{'Total': 1852, 'submitted': 1}
), 30090000.0: Counter(
{'Total': 9}
)}
2017-05-14 10:24:20,015:140010949039872:INFO:JobSubmitterPoller:Have 3 packages to submit.
2017-05-14 10:24:20,015:140010949039872:INFO:JobSubmitterPoller:Have 8 jobs to submit.
2017-05-14 10:24:20,015:140010949039872:INFO:JobSubmitterPoller:Done assigning site locations.
2017-05-14 10:24:29,778:140010949039872:INFO:SimpleCondorPlugin:Done submitting jobs for this cycle in SimpleCondorPlugin
2017-05-14 10:24:29,789:140010949039872:INFO:JobSubmitterPoller:Jobs that succeeded/failed submission: 8/0.
2017-05-14 10:24:29,829:140010949039872:INFO:DashboardReporter:Handling 8 jobs
2017-05-14 10:24:34,645:140010949039872:INFO:JobSubmitterPoller:Transaction cycle successfully completed.
2017-05-14 10:26:35,570:140010949039872:INFO:JobSubmitterPoller:Refreshing priority cache with currently 20056 jobs
2017-05-14 10:26:43,459:140010949039872:INFO:JobSubmitterPoller:Found 20123 new jobs to be submitted.
2017-05-14 10:26:43,460:140010949039872:INFO:JobSubmitterPoller:Determining possible sites for new jobs...
2017-05-14 10:26:50,067:140010949039872:INFO:JobSubmitterPoller:Done pruning killed jobs, moving on to submit.
2017-05-14 10:26:53,867:140010949039872:INFO:JobSubmitterPoller:Site submission report: {'T1_IT_CNAF': Counter(
{'NoPendingSlot': 1172, 'submitted': 136}
), 'T2_CH_CERN_HLT': Counter(
{'NoPendingSlot': 797, 'submitted': 32}
), 'T1_ES_PIC': Counter(
{'submitted': 4}
), 'T1_FR_CCIN2P3': Counter(
{'NoPendingSlot': 2034, 'submitted': 3}
), 'T2_US_Florida': Counter(
{'NoPendingSlot': 1106, 'submitted': 20}
), 'T2_FR_GRIF_IRFU': Counter(
{'submitted': 6}
), 'T2_US_Purdue': Counter(
{'NoPendingSlot': 2588, 'submitted': 160}
), 'T2_UK_London_IC': Counter(
{'submitted': 6}
), 'T2_IT_Bari': Counter(
{'submitted': 4}
), 'T2_US_Nebraska': Counter(
{'NoPendingSlot': 338, 'submitted': 44}
), 'T2_DE_RWTH': Counter(
{'NoPendingSlot': 44, 'submitted': 26}
), 'T2_US_UCSD': Counter(
{'NoPendingSlot': 94, 'submitted': 37}
), 'T2_ES_CIEMAT': Counter(
{'submitted': 5}
), 'T2_FR_IPHC': Counter(
{'submitted': 2, 'NoPendingSlot': 1}
), 'T1_RU_JINR': Counter(
{'submitted': 118}
), 'T2_DE_DESY': Counter(
{'NoPendingSlot': 917, 'submitted': 17}
), 'T1_US_FNAL': Counter(
{'NoPendingSlot': 1361, 'submitted': 231}
), 'T2_US_MIT': Counter(
{'NoPendingSlot': 42, 'submitted': 26}
), 'T2_IT_Legnaro': Counter(
{'submitted': 21, 'NoPendingSlot': 2}
), 'T2_US_Caltech': Counter(
{'NoPendingSlot': 468, 'submitted': 75}
), 'T2_UK_London_Brunel': Counter(
{'submitted': 1}
), 'T2_CH_CERN': Counter(
{'NoPendingSlot': 1389, 'submitted': 19}
), 'T2_IT_Pisa': Counter(
{'NoPendingSlot': 621, 'submitted': 7}
)}
2017-05-14 10:26:54,196:140010949039872:INFO:JobSubmitterPoller:Priority submission report: {50080000.0: Counter(
{'Total': 19, 'submitted': 18}
), 90000.0: Counter(
{'Total': 13}
), 130000.0: Counter(
{'Total': 2749, 'submitted': 320}
), 30080000.0: Counter(
{'Total': 48, 'submitted': 39}
), 50085000.0: Counter(
{'Total': 12, 'submitted': 12}
), 80000.0: Counter(
{'Total': 4258, 'submitted': 251}
), 50130000.0: Counter(
{'Total': 4181, 'submitted': 269}
), 50090000.0: Counter(
{'Total': 6}
), 30085000.0: Counter(
{'Total': 31, 'submitted': 27}
), 85000.0: Counter(
{'Total': 1851, 'submitted': 59}
), 30090000.0: Counter(
{'Total': 9, 'submitted': 5}
)}
2017-05-14 10:26:54,197:140010949039872:INFO:JobSubmitterPoller:Have 92 packages to submit.
2017-05-14 10:26:54,197:140010949039872:INFO:JobSubmitterPoller:Have 1000 jobs to submit.
2017-05-14 10:26:54,197:140010949039872:INFO:JobSubmitterPoller:Done assigning site locations.
309:
2017-05-14 01:33:56,886:140650691544832:INFO:JobSubmitterPoller:Have 67 packages to submit.
2017-05-14 01:33:56,886:140650691544832:INFO:JobSubmitterPoller:Have 1000 jobs to submit.
2017-05-14 01:33:56,886:140650691544832:INFO:JobSubmitterPoller:Done assigning site locations.
2017-05-14 01:34:22,126:140650691544832:INFO:SimpleCondorPlugin:Done submitting jobs for this cycle in SimpleCondorPlugin
2017-05-14 01:34:22,762:140650691544832:INFO:JobSubmitterPoller:Jobs that succeeded/failed submission: 1000/0.
2017-05-14 01:34:25,877:140650691544832:INFO:DashboardReporter:Handling 1000 jobs
2017-05-14 01:35:31,357:140650691544832:INFO:JobSubmitterPoller:Transaction cycle successfully completed.
2017-05-14 01:37:31,858:140650691544832:INFO:JobSubmitterPoller:Refreshing priority cache with currently 23092 jobs
2017-05-14 01:37:35,592:140650691544832:INFO:JobSubmitterPoller:Found 23131 new jobs to be submitted.
2017-05-14 01:37:35,593:140650691544832:INFO:JobSubmitterPoller:Determining possible sites for new jobs...
2017-05-14 01:37:36,352:140650691544832:INFO:JobSubmitterPoller:Done pruning killed jobs, moving on to submit.
2017-05-14 01:37:38,961:140650691544832:INFO:JobSubmitterPoller:Site submission report: {'T1_IT_CNAF': Counter(
{'NoPendingSlot': 397}
), 'T2_FR_CCIN2P3': Counter(
{'NoPendingSlot': 296}
), 'T2_CH_CERN_HLT': Counter(
{'NoPendingSlot': 2619, 'submitted': 4}
), 'T1_ES_PIC': Counter(
{'NoPendingSlot': 367}
), 'T1_FR_CCIN2P3': Counter(
{'NoPendingSlot': 2901}
), 'T2_DE_RWTH': Counter(
{'NoPendingSlot': 7, 'submitted': 1}
), 'T2_US_Florida': Counter(
{'NoPendingSlot': 1175}
), 'T2_FR_GRIF_IRFU': Counter(
{'NoPendingSlot': 96}
), 'T2_US_UCSD': Counter(
{'NoPendingSlot': 691}
), 'T2_US_Purdue': Counter(
{'NoPendingSlot': 1240}
), 'T2_UK_London_IC': Counter(
{'NoPendingSlot': 133}
), u'T2_FR_GRIF_LLR': Counter(
{'submitted': 133}
), 'T2_US_Nebraska': Counter(
{'NoPendingSlot': 2355}
), 'T2_IT_Legnaro': Counter(
{'NoPendingSlot': 225}
), 'T2_IT_Bari': Counter(
{'NoPendingSlot': 16}
), 'T2_ES_CIEMAT': Counter(
{'NoPendingSlot': 251}
), 'T2_FR_IPHC': Counter(
{'NoPendingSlot': 1}
), 'T1_RU_JINR': Counter(
{'NoPendingSlot': 2020}
), 'T2_DE_DESY': Counter(
{'NoPendingSlot': 2398}
), 'T1_US_FNAL': Counter(
{'NoPendingSlot': 2513, 'submitted': 1}
), 'T2_US_MIT': Counter(
{'NoPendingSlot': 1883, 'submitted': 2}
), 'T2_BE_IIHE': Counter(
{'NoPendingSlot': 163}
), 'T2_US_Caltech': Counter(
{'NoPendingSlot': 1303}
), 'T2_UK_London_Brunel': Counter(
{'NoPendingSlot': 6}
), 'T2_CH_CERN': Counter(
{'NoPendingSlot': 2708}
), 'T1_DE_KIT': Counter(
{'NoPendingSlot': 771}
), 'T2_IT_Pisa': Counter(
{'NoPendingSlot': 55}
)}
2017-05-14 01:37:38,963:140650691544832:INFO:JobSubmitterPoller:Priority submission report: {50080000.0: Counter(
{'Total': 1}
), 50090000.0: Counter(
{'Total': 9}
), 90000.0: Counter(
{'Total': 20}
), 130000.0: Counter(
{'Total': 2970}
), 50130000.0: Counter(
{'Total': 3068, 'submitted': 7}
), 80000.0: Counter(
{'Total': 17063, 'submitted': 134}
)}
2017-05-14 01:37:38,963:140650691544832:INFO:JobSubmitterPoller:Have 6 packages to submit.
2017-05-14 01:37:38,963:140650691544832:INFO:JobSubmitterPoller:Have 141 jobs to submit.
2017-05-14 01:37:38,963:140650691544832:INFO:JobSubmitterPoller:Done assigning site locations.
2017-05-14 01:37:48,319:140650691544832:INFO:SimpleCondorPlugin:Done submitting jobs for this cycle in SimpleCondorPlugin
2017-05-14 01:37:48,430:140650691544832:INFO:JobSubmitterPoller:Jobs that succeeded/failed submission: 141/0.
2017-05-14 01:37:48,906:140650691544832:INFO:DashboardReporter:Handling 141 jobs
2017-05-14 01:38:56,026:140650691544832:INFO:JobSubmitterPoller:Transaction cycle successfully completed.
2017-05-14 01:40:56,510:140650691544832:INFO:JobSubmitterPoller:Refreshing priority cache with currently 22990 jobs
2017-05-14 01:40:59,962:140650691544832:INFO:JobSubmitterPoller:Found 23110 new jobs to be submitted.
2017-05-14 01:40:59,962:140650691544832:INFO:JobSubmitterPoller:Determining possible sites for new jobs...
2017-05-14 01:41:03,455:140650691544832:INFO:JobSubmitterPoller:Done pruning killed jobs, moving on to submit.
2017-05-14 01:41:06,678:140650691544832:INFO:JobSubmitterPoller:Site submission report: {'T1_IT_CNAF': Counter(
{'NoPendingSlot': 294}
), 'T2_FR_CCIN2P3': Counter(
{'NoPendingSlot': 307}
), 'T2_CH_CERN_HLT': Counter(
{'NoPendingSlot': 2619}
), 'T1_ES_PIC': Counter(
{'NoPendingSlot': 368}
), 'T1_FR_CCIN2P3': Counter(
{'NoPendingSlot': 2905}
), 'T2_US_Florida': Counter(
{'NoPendingSlot': 1176, 'submitted': 2}
), 'T2_FR_GRIF_IRFU': Counter(
{'NoPendingSlot': 101}
), 'T2_US_UCSD': Counter(
{'NoPendingSlot': 693, 'submitted': 2}
), 'T2_US_Purdue': Counter(
{'NoPendingSlot': 1118}
), 'T2_US_MIT': Counter(
{'NoPendingSlot': 1883, 'submitted': 2}
), 'T2_UK_SGrid_RALPP': Counter(
{'submitted': 6}
), 'T2_FR_GRIF_LLR': Counter(
{'submitted': 2}
), 'T2_UK_London_IC': Counter(
{'NoPendingSlot': 134}
), 'T2_IT_Legnaro': Counter(
{'NoPendingSlot': 227}
), 'T2_IT_Bari': Counter(
{'NoPendingSlot': 19}
), 'T2_ES_CIEMAT': Counter(
{'NoPendingSlot': 253}
), 'T2_FR_IPHC': Counter(
{'NoPendingSlot': 1}
), 'T1_RU_JINR': Counter(
{'NoPendingSlot': 2027}
), 'T2_DE_DESY': Counter(
{'NoPendingSlot': 2265}
), 'T1_US_FNAL': Counter(
{'NoPendingSlot': 2515, 'submitted': 3}
), 'T2_US_Nebraska': Counter(
{'NoPendingSlot': 2361}
), 'T2_DE_RWTH': Counter(
{'NoPendingSlot': 7, 'submitted': 3}
), 'T2_US_Caltech': Counter(
{'NoPendingSlot': 1304}
), 'T2_UK_London_Brunel': Counter(
{'NoPendingSlot': 7}
), 'T2_BE_IIHE': Counter(
{'NoPendingSlot': 169}
), 'T1_DE_KIT': Counter(
{'NoPendingSlot': 772}
), 'T2_IT_Pisa': Counter(
{'NoPendingSlot': 58}
), 'T2_CH_CERN': Counter(
{'NoPendingSlot': 2708}
)}
2017-05-14 01:41:06,679:140650691544832:INFO:JobSubmitterPoller:Priority submission report: {50080000.0: Counter(
{'Total': 2}
), 50090000.0: Counter(
{'Total': 9}
), 90000.0: Counter(
{'Total': 20}
), 130000.0: Counter(
{'Total': 2996}
), 50130000.0: Counter(
{'Total': 3082, 'submitted': 7}
), 80000.0: Counter(
{'Total': 17001, 'submitted': 13}
)}
2017-05-14 01:41:06,830:140650691544832:INFO:JobSubmitterPoller:Have 5 packages to submit.
2017-05-14 01:41:06,830:140650691544832:INFO:JobSubmitterPoller:Have 20 jobs to submit.
2017-05-14 01:41:06,830:140650691544832:INFO:JobSubmitterPoller:Done assigning site locations.
2017-05-14 01:41:08,377:140650691544832:INFO:SimpleCondorPlugin:Done submitting jobs for this cycle in SimpleCondorPlugin
2017-05-14 01:41:08,391:140650691544832:INFO:JobSubmitterPoller:Jobs that succeeded/failed submission: 20/0.
2017-05-14 01:41:08,489:140650691544832:INFO:DashboardReporter:Handling 20 jobs
2017-05-14 01:41:28,370:140650691544832:INFO:JobSubmitterPoller:Transaction cycle successfully completed.
2017-05-14 01:43:30,633:140650691544832:INFO:JobSubmitterPoller:Refreshing priority cache with currently 23090 jobs
2017-05-14 01:43:35,321:140650691544832:INFO:JobSubmitterPoller:Found 23710 new jobs to be submitted.
2017-05-14 01:43:35,322:140650691544832:INFO:JobSubmitterPoller:Determining possible sites for new jobs...
2017-05-14 01:43:35,837:140650691544832:INFO:JobSubmitterPoller:Done pruning killed jobs, moving on to submit.
2017-05-14 01:43:38,314:140650691544832:INFO:JobSubmitterPoller:Site submission report: {'T1_IT_CNAF': Counter(
{'NoPendingSlot': 294}
), 'T2_FR_CCIN2P3': Counter(
{'NoPendingSlot': 319}
), 'T2_CH_CERN_HLT': Counter(
{'NoPendingSlot': 3212, 'submitted': 2}
), 'T1_ES_PIC': Counter(
{'NoPendingSlot': 368}
), 'T1_FR_CCIN2P3': Counter(
{'NoPendingSlot': 2905}
), 'T2_US_Florida': Counter(
{'NoPendingSlot': 1176}
), 'T2_FR_GRIF_IRFU': Counter(
{'NoPendingSlot': 101}
), 'T2_US_UCSD': Counter(
{'NoPendingSlot': 693}
), 'T2_US_Purdue': Counter(
{'NoPendingSlot': 1118}
), 'T2_US_MIT': Counter(
{'NoPendingSlot': 1883, 'submitted': 4}
), 'T2_UK_SGrid_RALPP': Counter(
{'submitted': 4}
), 'T2_FR_GRIF_LLR': Counter(
{'submitted': 2}
), 'T2_UK_London_IC': Counter(
{'NoPendingSlot': 134}
), 'T2_IT_Legnaro': Counter(
{'NoPendingSlot': 227}
), 'T2_IT_Bari': Counter(
{'NoPendingSlot': 19}
), 'T2_ES_CIEMAT': Counter(
{'NoPendingSlot': 253}
), 'T2_FR_IPHC': Counter(
{'NoPendingSlot': 1}
), 'T1_RU_JINR': Counter(
{'NoPendingSlot': 2028}
), 'T2_DE_DESY': Counter(
{'NoPendingSlot': 2265}
), 'T1_US_FNAL': Counter(
{'NoPendingSlot': 2515}
), 'T2_US_Nebraska': Counter(
{'NoPendingSlot': 2361}
), 'T2_DE_RWTH': Counter(
{'NoPendingSlot': 7, 'submitted': 2}
), 'T2_US_Caltech': Counter(
{'NoPendingSlot': 1897}
), 'T2_UK_London_Brunel': Counter(
{'NoPendingSlot': 7}
), 'T2_BE_IIHE': Counter(
{'NoPendingSlot': 169}
), 'T1_DE_KIT': Counter(
{'NoPendingSlot': 772}
), 'T2_IT_Pisa': Counter(
{'NoPendingSlot': 58}
), 'T2_CH_CERN': Counter(
{'NoPendingSlot': 3301}
)}
2017-05-14 01:43:38,316:140650691544832:INFO:JobSubmitterPoller:Priority submission report: {50080000.0: Counter(
{'Total': 2}
), 50090000.0: Counter(
{'Total': 9}
), 90000.0: Counter(
{'Total': 20}
), 130000.0: Counter(
{'Total': 3589}
), 50130000.0: Counter(
{'Total': 3082, 'submitted': 6}
), 80000.0: Counter(
{'Total': 17008, 'submitted': 8}
)}
2017-05-14 01:43:38,316:140650691544832:INFO:JobSubmitterPoller:Have 3 packages to submit.
2017-05-14 01:43:38,316:140650691544832:INFO:JobSubmitterPoller:Have 14 jobs to submit.
2017-05-14 01:43:38,316:140650691544832:INFO:JobSubmitterPoller:Done assigning site locations.
2017-05-14 01:43:39,862:140650691544832:INFO:SimpleCondorPlugin:Done submitting jobs for this cycle in SimpleCondorPlugin
2017-05-14 01:43:39,888:140650691544832:INFO:JobSubmitterPoller:Jobs that succeeded/failed submission: 14/0.
2017-05-14 01:43:39,947:140650691544832:INFO:DashboardReporter:Handling 14 jobs
2017-05-14 01:43:53,895:140650691544832:INFO:JobSubmitterPoller:Transaction cycle successfully completed.
2017-05-14 01:45:55,704:140650691544832:INFO:JobSubmitterPoller:Refreshing priority cache with currently 23696 jobs
2017-05-14 01:46:00,058:140650691544832:INFO:JobSubmitterPoller:Found 26961 new jobs to be submitted.
2017-05-14 01:46:00,059:140650691544832:INFO:JobSubmitterPoller:Determining possible sites for new jobs...
2017-05-14 01:46:04,227:140650691544832:INFO:JobSubmitterPoller:Done pruning killed jobs, moving on to submit.
2017-05-14 01:46:07,211:140650691544832:INFO:JobSubmitterPoller:Site submission report: {'T1_IT_CNAF': Counter(
{'NoPendingSlot': 2361}
), 'T2_FR_CCIN2P3': Counter(
{'NoPendingSlot': 2025}
), 'T2_CH_CERN_HLT': Counter(
{'NoPendingSlot': 5933, 'submitted': 3}
), 'T1_ES_PIC': Counter(
{'NoPendingSlot': 2366}
), 'T1_FR_CCIN2P3': Counter(
{'NoPendingSlot': 4608}
), 'T2_US_Florida': Counter(
{'NoPendingSlot': 2872, 'submitted': 9}
), 'T2_FR_GRIF_IRFU': Counter(
{'NoPendingSlot': 1802}
), 'T2_US_UCSD': Counter(
{'NoPendingSlot': 2362, 'submitted': 22}
), 'T2_US_Purdue': Counter(
{'NoPendingSlot': 3137}
), 'T2_UK_London_IC': Counter(
{'NoPendingSlot': 1834}
), u'T2_FR_GRIF_LLR': Counter(
{'NoPendingSlot': 1959, 'submitted': 41}
), 'T2_US_Nebraska': Counter(
{'NoPendingSlot': 4083, 'submitted': 3}
), 'T2_IT_Legnaro': Counter(
{'NoPendingSlot': 1936}
), 'T2_IT_Bari': Counter(
{'NoPendingSlot': 1717}
), 'T2_ES_CIEMAT': Counter(
{'NoPendingSlot': 1924, 'submitted': 2}
), 'T2_FR_IPHC': Counter(
{'NoPendingSlot': 1960}
), 'T1_RU_JINR': Counter(
{'NoPendingSlot': 3711}
), u'T2_US_Wisconsin': Counter(
{'NoPendingSlot': 1526, 'submitted': 136}
), 'T2_DE_DESY': Counter(
{'NoPendingSlot': 4267}
), 'T1_US_FNAL': Counter(
{'NoPendingSlot': 4221, 'submitted': 255}
), 'T2_US_MIT': Counter(
{'NoPendingSlot': 3350, 'submitted': 80}
), 'T2_DE_RWTH': Counter(
{'NoPendingSlot': 1319, 'submitted': 144}
), 'T2_US_Caltech': Counter(
{'NoPendingSlot': 4614}
), 'T2_UK_London_Brunel': Counter(
{'NoPendingSlot': 1716}
), 'T2_BE_IIHE': Counter(
{'NoPendingSlot': 2128}
), 'T1_DE_KIT': Counter(
{'NoPendingSlot': 2460, 'submitted': 7}
), 'T2_IT_Pisa': Counter(
{'NoPendingSlot': 1764}
), 'T2_CH_CERN': Counter(
{'NoPendingSlot': 5445, 'submitted': 188}
)}
2017-05-14 01:46:07,212:140650691544832:INFO:JobSubmitterPoller:Priority submission report: {50080000.0: Counter(
{'Total': 4}
), 50090000.0: Counter(
{'Total': 9}
), 90000.0: Counter(
{'Total': 20}
), 130000.0: Counter(
{'Total': 4663}
), 50130000.0: Counter(
{'Total': 3093, 'submitted': 6}
), 80000.0: Counter(
{'Total': 19172, 'submitted': 884}
)}
2017-05-14 01:46:07,212:140650691544832:INFO:JobSubmitterPoller:Have 7 packages to submit.
2017-05-14 01:46:07,213:140650691544832:INFO:JobSubmitterPoller:Have 890 jobs to submit.
2017-05-14 01:46:07,213:140650691544832:INFO:JobSubmitterPoller:Done assigning site locations.
2017-05-14 01:46:30,948:140650691544832:INFO:SimpleCondorPlugin:Done submitting jobs for this cycle in SimpleCondorPlugin
2017-05-14 01:46:31,500:140650691544832:INFO:JobSubmitterPoller:Jobs that succeeded/failed submission: 890/0.
2017-05-14 01:46:34,733:140650691544832:INFO:DashboardReporter:Handling 890 jobs
2017-05-14 01:48:37,485:140650691544832:INFO:JobSubmitterPoller:Transaction cycle successfully completed.
2017-05-14 01:50:41,311:140650691544832:INFO:JobSubmitterPoller:Refreshing priority cache with currently 26071 jobs
2017-05-14 01:50:46,388:140650691544832:INFO:JobSubmitterPoller:Found 27284 new jobs to be submitted.
2017-05-14 01:50:46,389:140650691544832:INFO:JobSubmitterPoller:Determining possible sites for new jobs...
2017-05-14 01:50:48,323:140650691544832:INFO:JobSubmitterPoller:Done pruning killed jobs, moving on to submit.
2017-05-14 01:50:50,809:140650691544832:INFO:JobSubmitterPoller:Site submission report: {'T1_IT_CNAF': Counter(
{'submitted': 267}
), 'T2_FR_CCIN2P3': Counter(
{'submitted': 4}
), 'T1_DE_KIT': Counter(
{'NoPendingSlot': 395}
), 'T2_CH_CERN_HLT': Counter(
{'NoPendingSlot': 2790, 'submitted': 51}
), 'T1_ES_PIC': Counter(
{'submitted': 2}
), 'T1_FR_CCIN2P3': Counter(
{'NoPendingSlot': 1397, 'submitted': 5}
), 'T2_US_Florida': Counter(
{'NoPendingSlot': 923, 'submitted': 44}
), 'T2_FR_GRIF_IRFU': Counter(
{'submitted': 2}
), 'T2_US_UCSD': Counter(
{'NoPendingSlot': 289, 'submitted': 59}
), 'T2_US_Purdue': Counter(
{'NoPendingSlot': 83, 'submitted': 51}
), 'T2_FR_GRIF_LLR': Counter(
{'submitted': 1}
), 'T2_US_Nebraska': Counter(
{'submitted': 63}
), 'T2_IT_Legnaro': Counter(
{'submitted': 1}
), 'T2_IT_Bari': Counter(
{'submitted': 2}
), 'T2_ES_CIEMAT': Counter(
{'submitted': 3}
), 'T1_RU_JINR': Counter(
{'NoPendingSlot': 891, 'submitted': 35}
), 'T2_DE_DESY': Counter(
{'NoPendingSlot': 1467}
), 'T1_US_FNAL': Counter(
{'NoPendingSlot': 2480, 'submitted': 37}
), 'T2_US_MIT': Counter(
{'NoPendingSlot': 1053, 'submitted': 23}
), 'T2_DE_RWTH': Counter(
{'submitted': 7}
), 'T2_US_Caltech': Counter(
{'NoPendingSlot': 2340, 'submitted': 25}
), 'T2_BE_IIHE': Counter(
{'submitted': 36, 'NoPendingSlot': 7}
), 'T2_IT_Pisa': Counter(
{'submitted': 12}
), 'T2_CH_CERN': Counter(
{'NoPendingSlot': 2613, 'submitted': 270}
)}
2017-05-14 01:50:50,810:140650691544832:INFO:JobSubmitterPoller:Priority submission report: {50080000.0: Counter(
{'Total': 8, 'submitted': 7}
), 50090000.0: Counter(
{'Total': 9}
), 90000.0: Counter(
{'Total': 20}
), 130000.0: Counter(
{'Total': 4664, 'submitted': 562}
), 50130000.0: Counter(
{'Total': 3118, 'submitted': 247}
), 80000.0: Counter(
{'Total': 4985, 'submitted': 184}
)}
2017-05-14 01:50:50,811:140650691544832:INFO:JobSubmitterPoller:Have 117 packages to submit.
2017-05-14 01:50:50,811:140650691544832:INFO:JobSubmitterPoller:Have 1000 jobs to submit.
2017-05-14 01:50:50,811:140650691544832:INFO:JobSubmitterPoller:Done assigning site locations.

@amaltaro
Contributor

These logs look completely normal to me. If this is about JobSubmitter getting stuck while submitting jobs, then I may have an idea of how to handle that. I'll try to work on it tomorrow.
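
One thing that can be checked mechanically: both excerpts above end right after "Done assigning site locations." with no matching "Jobs that succeeded/failed submission" line. A rough sketch for scanning a log for such unfinished submissions follows. The line layout is inferred from the excerpts in this thread, and `submission_cycles` is a hypothetical helper, not part of WMCore.

```python
import re
from datetime import datetime

# Layout inferred from the excerpts: "date time,ms:thread:LEVEL:Module:message"
LINE = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),(\d{3}):\d+:\w+:(\w+):(.*)$")
START = re.compile(r"Have (\d+) jobs to submit\.")
END = re.compile(r"Jobs that succeeded/failed submission: (\d+)/(\d+)\.")


def submission_cycles(lines):
    """Yield (start_time, n_jobs, seconds) per submission; seconds is None if unfinished."""
    pending = None
    for line in lines:
        match = LINE.match(line.strip())
        if not match:
            continue  # continuation lines of the Counter reports, truncated lines
        stamp = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S")
        stamp = stamp.replace(microsecond=int(match.group(2)) * 1000)
        message = match.group(4)
        started = START.search(message)
        if started:
            if pending:
                yield pending[0], pending[1], None  # previous one never finished
            pending = (stamp, int(started.group(1)))
        elif END.search(message) and pending:
            yield pending[0], pending[1], (stamp - pending[0]).total_seconds()
            pending = None
    if pending:
        yield pending[0], pending[1], None
```

An entry with `None` at the end of a log only means the excerpt stops there; it does not by itself prove the component was stuck.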

@ticoann
Contributor

ticoann commented Jan 10, 2018

Closing this; timeout and retry clone job has been added.

@ticoann ticoann closed this as completed Jan 10, 2018