Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

JobSubmitter silently dying on vocms0304 and vocms0303 #7794

Closed
scarletnorberg opened this issue Apr 19, 2017 · 16 comments
Closed

JobSubmitter silently dying on vocms0304 and vocms0303 #7794

scarletnorberg opened this issue Apr 19, 2017 · 16 comments
Assignees
Milestone

Comments

@scarletnorberg
Copy link

JobSubmitter has gone down so many times. This is like the problem we had before were WMstats is not picking it up and I posted something about this on the other Jobsubmitter issue : #7418 but I am not sure if it is the same thing or not.

Here is the log from the last time I have restarted it:

2017-04-19 04:38:37,603:139680142771968:INFO:JobSubmitterPoller:Refreshing priority cache with currently 30015 jobs
2017-04-19 04:38:38,507:139680142771968:INFO:JobSubmitterPoller:Skipping cache update to be submitted. (30015 job in cache)
2017-04-19 04:38:38,887:139680142771968:INFO:JobSubmitterPoller:Determining possible sites for new jobs...
2017-04-19 04:38:38,888:139680142771968:INFO:JobSubmitterPoller:Done pruning killed jobs, moving on to submit.
2017-04-19 04:38:40,185:139680142771968:INFO:JobSubmitterPoller:Site submission report: {'T1_IT_CNAF': Counter({'NoPendingSlot': 2301}), 'T2_FR_CCIN2P3': Counter({'NoPendingSlot': 2773}), 'T1_DE_KIT': Counter({'NoPendingSlot': 2119}), 'T2_CH_CERN_HLT': Counter({'NoPendingSlot': 3299}), 'T1_ES_PIC': Counter({'NoPendingSlot': 3117}), 'T1_FR_CCIN2P3': Counter({'NoPendingSlot': 1968}), 'T2_US_Florida': Counter({'NoPendingSlot': 2869}), 'T2_FR_GRIF_IRFU': Counter({'NoPendingSlot': 2010}), 'T2_US_Purdue': Counter({'NoPendingSlot': 4363}), u'T0_CH_CERN': Counter({'NoPendingSlot': 4625}), 'T2_UK_London_IC': Counter({'NoPendingSlot': 1989}), 'T2_FR_GRIF_LLR': Counter({'NoPendingSlot': 1961}), 'T2_US_Nebraska': Counter({'NoPendingSlot': 2350}), 'T2_IT_Legnaro': Counter({'NoPendingSlot': 1964}), 'T2_US_UCSD': Counter({'NoPendingSlot': 1971}), 'T2_ES_CIEMAT': Counter({'NoPendingSlot': 2551}), 'T2_FR_IPHC': Counter({'NoPendingSlot': 1971}), 'T1_RU_JINR': Counter({'NoPendingSlot': 3500}), u'T2_US_Wisconsin': Counter({'NoPendingSlot': 1947}), u'T2_DE_DESY': Counter({'NoPendingSlot': 1155}), u'T1_UK_RAL': Counter({'NoPendingSlot': 2834}), 'T1_US_FNAL': Counter({'NoPendingSlot': 10130}), 'T2_US_MIT': Counter({'NoPendingSlot': 4796}), 'T2_DE_RWTH': Counter({'NoPendingSlot': 4435}), 'T2_US_Caltech': Counter({'NoPendingSlot': 2784}), 'T2_UK_London_Brunel': Counter({'NoPendingSlot': 2006}), 'T2_BE_IIHE': Counter({'NoPendingSlot': 813}), 'T2_IT_Pisa': Counter({'NoPendingSlot': 1966}), 'T2_CH_CERN': Counter({'NoPendingSlot': 3313})}
2017-04-19 04:38:40,630:139680142771968:INFO:JobSubmitterPoller:Priority submission report: {900000.0: Counter({'Total': 13}), 90000.0: Counter({'Total': 439}), 130000.0: Counter({'Total': 1987}), 50085000.0: Counter({'Total': 9}), 80000.0: Counter({'Total': 1152}), 50130000.0: Counter({'Total': 10}), 50090000.0: Counter({'Total': 30}), 85000.0: Counter({'Total': 26375})}
2017-04-19 04:38:40,631:139680142771968:INFO:JobSubmitterPoller:Have 0 packages to submit.
2017-04-19 04:38:40,631:139680142771968:INFO:JobSubmitterPoller:Have 0 jobs to submit.
2017-04-19 04:38:40,631:139680142771968:INFO:JobSubmitterPoller:Done assigning site locations.
2017-04-19 04:40:41,181:139680142771968:INFO:JobSubmitterPoller:Refreshing priority cache with currently 30015 jobs
2017-04-19 04:40:41,181:139680142771968:INFO:JobSubmitterPoller:Skipping cache update to be submitted. (30015 job in cache)
2017-04-19 04:40:41,181:139680142771968:INFO:JobSubmitterPoller:Determining possible sites for new jobs...
2017-04-19 04:40:41,182:139680142771968:INFO:JobSubmitterPoller:Done pruning killed jobs, moving on to submit.
2017-04-19 04:40:42,461:139680142771968:INFO:JobSubmitterPoller:Site submission report: {'T1_IT_CNAF': Counter({'NoPendingSlot': 2301}), 'T2_FR_CCIN2P3': Counter({'NoPendingSlot': 2773}), 'T1_DE_KIT': Counter({'NoPendingSlot': 2119}), 'T2_CH_CERN_HLT': Counter({'NoPendingSlot': 3299}), 'T1_ES_PIC': Counter({'NoPendingSlot': 3117}), 'T1_FR_CCIN2P3': Counter({'NoPendingSlot': 1968}), 'T2_US_Florida': Counter({'NoPendingSlot': 2869}), 'T2_FR_GRIF_IRFU': Counter({'NoPendingSlot': 2010}), 'T2_US_Purdue': Counter({'NoPendingSlot': 4363}), u'T0_CH_CERN': Counter({'NoPendingSlot': 4625}), 'T2_UK_London_IC': Counter({'NoPendingSlot': 1989}), 'T2_FR_GRIF_LLR': Counter({'NoPendingSlot': 1961}), 'T2_US_Nebraska': Counter({'NoPendingSlot': 2350}), 'T2_IT_Legnaro': Counter({'NoPendingSlot': 1964}), 'T2_US_UCSD': Counter({'NoPendingSlot': 1971}), 'T2_ES_CIEMAT': Counter({'NoPendingSlot': 2551}), 'T2_FR_IPHC': Counter({'NoPendingSlot': 1971}), 'T1_RU_JINR': Counter({'NoPendingSlot': 3500}), u'T2_US_Wisconsin': Counter({'NoPendingSlot': 1947}), u'T2_DE_DESY': Counter({'NoPendingSlot': 1155}), u'T1_UK_RAL': Counter({'NoPendingSlot': 2834}), 'T1_US_FNAL': Counter({'NoPendingSlot': 10130}), 'T2_US_MIT': Counter({'NoPendingSlot': 4796}), 'T2_DE_RWTH': Counter({'NoPendingSlot': 4435}), 'T2_US_Caltech': Counter({'NoPendingSlot': 2784}), 'T2_UK_London_Brunel': Counter({'NoPendingSlot': 2006}), 'T2_BE_IIHE': Counter({'NoPendingSlot': 813}), 'T2_IT_Pisa': Counter({'NoPendingSlot': 1966}), 'T2_CH_CERN': Counter({'NoPendingSlot': 3313})}
2017-04-19 04:40:42,462:139680142771968:INFO:JobSubmitterPoller:Priority submission report: {900000.0: Counter({'Total': 13}), 90000.0: Counter({'Total': 439}), 130000.0: Counter({'Total': 1987}), 50085000.0: Counter({'Total': 9}), 80000.0: Counter({'Total': 1152}), 50130000.0: Counter({'Total': 10}), 50090000.0: Counter({'Total': 30}), 85000.0: Counter({'Total': 26375})}
2017-04-19 04:40:42,463:139680142771968:INFO:JobSubmitterPoller:Have 0 packages to submit.
2017-04-19 04:40:42,463:139680142771968:INFO:JobSubmitterPoller:Have 0 jobs to submit.
2017-04-19 04:40:42,463:139680142771968:INFO:JobSubmitterPoller:Done assigning site locations.
2017-04-19 04:42:44,204:139680142771968:INFO:JobSubmitterPoller:Refreshing priority cache with currently 30015 jobs
2017-04-19 04:42:44,204:139680142771968:INFO:JobSubmitterPoller:Skipping cache update to be submitted. (30015 job in cache)
2017-04-19 04:42:44,204:139680142771968:INFO:JobSubmitterPoller:Determining possible sites for new jobs...
2017-04-19 04:42:44,205:139680142771968:INFO:JobSubmitterPoller:Done pruning killed jobs, moving on to submit.
2017-04-19 04:42:45,550:139680142771968:INFO:JobSubmitterPoller:Site submission report: {'T1_IT_CNAF': Counter({'NoPendingSlot': 2301}), 'T2_FR_CCIN2P3': Counter({'NoPendingSlot': 2773}), 'T1_DE_KIT': Counter({'NoPendingSlot': 2119}), 'T2_CH_CERN_HLT': Counter({'NoPendingSlot': 3299}), 'T1_ES_PIC': Counter({'NoPendingSlot': 3117}), 'T1_FR_CCIN2P3': Counter({'NoPendingSlot': 1968}), 'T2_US_Florida': Counter({'NoPendingSlot': 2869}), 'T2_FR_GRIF_IRFU': Counter({'NoPendingSlot': 2010}), 'T2_US_Purdue': Counter({'NoPendingSlot': 4363}), u'T0_CH_CERN': Counter({'NoPendingSlot': 4625}), 'T2_UK_London_IC': Counter({'NoPendingSlot': 1989}), 'T2_FR_GRIF_LLR': Counter({'NoPendingSlot': 1961}), 'T2_US_Nebraska': Counter({'NoPendingSlot': 2350}), 'T2_IT_Legnaro': Counter({'NoPendingSlot': 1964}), 'T2_US_UCSD': Counter({'NoPendingSlot': 1971}), 'T2_ES_CIEMAT': Counter({'NoPendingSlot': 2551}), 'T2_FR_IPHC': Counter({'NoPendingSlot': 1971}), 'T1_RU_JINR': Counter({'NoPendingSlot': 3500}), u'T2_US_Wisconsin': Counter({'NoPendingSlot': 1947}), u'T2_DE_DESY': Counter({'NoPendingSlot': 1155}), u'T1_UK_RAL': Counter({'NoPendingSlot': 2834}), 'T1_US_FNAL': Counter({'NoPendingSlot': 10130}), 'T2_US_MIT': Counter({'NoPendingSlot': 4796}), 'T2_DE_RWTH': Counter({'NoPendingSlot': 4435}), 'T2_US_Caltech': Counter({'NoPendingSlot': 2784}), 'T2_UK_London_Brunel': Counter({'NoPendingSlot': 2006}), 'T2_BE_IIHE': Counter({'NoPendingSlot': 813}), 'T2_IT_Pisa': Counter({'NoPendingSlot': 1966}), 'T2_CH_CERN': Counter({'NoPendingSlot': 3313})}
2017-04-19 04:42:45,551:139680142771968:INFO:JobSubmitterPoller:Priority submission report: {900000.0: Counter({'Total': 13}), 90000.0: Counter({'Total': 439}), 130000.0: Counter({'Total': 1987}), 50085000.0: Counter({'Total': 9}), 80000.0: Counter({'Total': 1152}), 50130000.0: Counter({'Total': 10}), 50090000.0: Counter({'Total': 30}), 85000.0: Counter({'Total': 26375})}
2017-04-19 04:42:45,552:139680142771968:INFO:JobSubmitterPoller:Have 0 packages to submit.
2017-04-19 04:42:45,552:139680142771968:INFO:JobSubmitterPoller:Have 0 jobs to submit.
2017-04-19 04:42:45,552:139680142771968:INFO:JobSubmitterPoller:Done assigning site locations.
2017-04-19 04:44:46,888:139680142771968:INFO:JobSubmitterPoller:Refreshing priority cache with currently 30015 jobs
2017-04-19 04:44:46,889:139680142771968:INFO:JobSubmitterPoller:Skipping cache update to be submitted. (30015 job in cache)
2017-04-19 04:44:46,889:139680142771968:INFO:JobSubmitterPoller:Determining possible sites for new jobs...
2017-04-19 04:44:46,889:139680142771968:INFO:JobSubmitterPoller:Done pruning killed jobs, moving on to submit.
2017-04-19 04:44:48,297:139680142771968:INFO:JobSubmitterPoller:Site submission report: {'T1_IT_CNAF': Counter({'NoPendingSlot': 2301}), 'T2_FR_CCIN2P3': Counter({'NoPendingSlot': 2773}), 'T1_DE_KIT': Counter({'NoPendingSlot': 2119}), 'T2_CH_CERN_HLT': Counter({'NoPendingSlot': 3299}), 'T1_ES_PIC': Counter({'NoPendingSlot': 3117}), 'T1_FR_CCIN2P3': Counter({'NoPendingSlot': 1968}), 'T2_US_Florida': Counter({'NoPendingSlot': 2869}), 'T2_FR_GRIF_IRFU': Counter({'NoPendingSlot': 2010}), 'T2_US_Purdue': Counter({'NoPendingSlot': 4363}), u'T0_CH_CERN': Counter({'NoPendingSlot': 4625}), 'T2_UK_London_IC': Counter({'NoPendingSlot': 1989}), 'T2_FR_GRIF_LLR': Counter({'NoPendingSlot': 1961}), 'T2_US_Nebraska': Counter({'NoPendingSlot': 2350}), 'T2_IT_Legnaro': Counter({'NoPendingSlot': 1964}), 'T2_US_UCSD': Counter({'NoPendingSlot': 1971}), 'T2_ES_CIEMAT': Counter({'NoPendingSlot': 2551}), 'T2_FR_IPHC': Counter({'NoPendingSlot': 1971}), 'T1_RU_JINR': Counter({'NoPendingSlot': 3500}), u'T2_US_Wisconsin': Counter({'NoPendingSlot': 1947}), u'T2_DE_DESY': Counter({'NoPendingSlot': 1155}), u'T1_UK_RAL': Counter({'NoPendingSlot': 2834}), 'T1_US_FNAL': Counter({'NoPendingSlot': 10130}), 'T2_US_MIT': Counter({'NoPendingSlot': 4796}), 'T2_DE_RWTH': Counter({'NoPendingSlot': 4435}), 'T2_US_Caltech': Counter({'NoPendingSlot': 2784}), 'T2_UK_London_Brunel': Counter({'NoPendingSlot': 2006}), 'T2_BE_IIHE': Counter({'NoPendingSlot': 813}), 'T2_IT_Pisa': Counter({'NoPendingSlot': 1966}), 'T2_CH_CERN': Counter({'NoPendingSlot': 3313})}
2017-04-19 04:44:48,298:139680142771968:INFO:JobSubmitterPoller:Priority submission report: {900000.0: Counter({'Total': 13}), 90000.0: Counter({'Total': 439}), 130000.0: Counter({'Total': 1987}), 50085000.0: Counter({'Total': 9}), 80000.0: Counter({'Total': 1152}), 50130000.0: Counter({'Total': 10}), 50090000.0: Counter({'Total': 30}), 85000.0: Counter({'Total': 26375})}
2017-04-19 04:44:48,299:139680142771968:INFO:JobSubmitterPoller:Have 0 packages to submit.
2017-04-19 04:44:48,299:139680142771968:INFO:JobSubmitterPoller:Have 0 jobs to submit.
2017-04-19 04:44:48,299:139680142771968:INFO:JobSubmitterPoller:Done assigning site locations.
2017-04-19 04:46:48,900:139680142771968:INFO:JobSubmitterPoller:Refreshing priority cache with currently 30015 jobs
2017-04-19 04:46:49,910:139680142771968:INFO:JobSubmitterPoller:Skipping cache update to be submitted. (30015 job in cache)
2017-04-19 04:46:49,911:139680142771968:INFO:JobSubmitterPoller:Determining possible sites for new jobs...
2017-04-19 04:46:49,911:139680142771968:INFO:JobSubmitterPoller:Done pruning killed jobs, moving on to submit.
2017-04-19 04:47:26,023:139680142771968:INFO:JobSubmitterPoller:Site submission report: {'T1_IT_CNAF': Counter({'NoPendingSlot': 261, 'submitted': 3}), 'T2_FR_CCIN2P3': Counter({'NoPendingSlot': 604, 'submitted': 126}), 'T1_DE_KIT': Counter({'NoPendingSlot': 38, 'submitted': 37}), 'T2_CH_CERN_HLT': Counter({'NoPendingSlot': 628, 'submitted': 203}), 'T1_FR_CCIN2P3': Counter({'submitted': 3}), 'T2_US_Florida': Counter({'NoPendingSlot': 761, 'submitted': 15}), 'T2_FR_GRIF_IRFU': Counter({'submitted': 15}), 'T2_US_Purdue': Counter({'NoPendingSlot': 1470, 'submitted': 292}), 'T0_CH_CERN': Counter({'NoPendingSlot': 1807, 'submitted': 41}), 'T2_UK_London_IC': Counter({'submitted': 15}), 'T2_FR_GRIF_LLR': Counter({'submitted': 15}), 'T2_US_Nebraska': Counter({'submitted': 27}), 'T2_IT_Legnaro': Counter({'submitted': 1}), 'T2_US_UCSD': Counter({'submitted': 6}), 'T2_ES_CIEMAT': Counter({'NoPendingSlot': 551, 'submitted': 1}), 'T2_FR_IPHC': Counter({'submitted': 3}), 'T1_RU_JINR': Counter({'NoPendingSlot': 1350, 'submitted': 25}), 'T1_ES_PIC': Counter({'NoPendingSlot': 1125, 'submitted': 3}), 'T1_UK_RAL': Counter({'NoPendingSlot': 538, 'submitted': 4}), 'T1_US_FNAL': Counter({'NoPendingSlot': 7739, 'submitted': 25}), 'T2_US_MIT': Counter({'NoPendingSlot': 2497, 'submitted': 40}), 'T2_DE_RWTH': Counter({'NoPendingSlot': 2401, 'submitted': 12}), 'T2_US_Caltech': Counter({'NoPendingSlot': 697, 'submitted': 6}), 'T2_UK_London_Brunel': Counter({'submitted': 43}), 'T2_BE_IIHE': Counter({'submitted': 16}), 'T2_IT_Pisa': Counter({'submitted': 17}), 'T2_CH_CERN': Counter({'NoPendingSlot': 636, 'submitted': 6})}
2017-04-19 04:47:26,993:139680142771968:INFO:JobSubmitterPoller:Priority submission report: {900000.0: Counter({'Total': 13, 'submitted': 13}), 50090000.0: Counter({'Total': 30, 'submitted': 23}), 90000.0: Counter({'Total': 439, 'submitted': 45}), 130000.0: Counter({'Total': 1987, 'submitted': 15}), 85000.0: Counter({'Total': 20983, 'submitted': 886}), 50085000.0: Counter({'Total': 9, 'submitted': 8}), 50130000.0: Counter({'Total': 10, 'submitted': 10})}
2017-04-19 04:47:26,994:139680142771968:INFO:JobSubmitterPoller:Have 43 packages to submit.
2017-04-19 04:47:26,994:139680142771968:INFO:JobSubmitterPoller:Have 1000 jobs to submit.
2017-04-19 04:47:26,994:139680142771968:INFO:JobSubmitterPoller:Done assigning site locations.

@ticoann
Copy link
Contributor

ticoann commented Apr 19, 2017

@scarletnorberg, What was the error message when it crashed?

@amaltaro
Copy link
Contributor

What you wanted to say is that JobSubmitter is silently dying (is the process really dead?) on vocms0304.
For instance, there is almost 3h of gap in the logs today

2017-04-19 04:44:48,299:139680142771968:INFO:JobSubmitterPoller:Have 0 packages to submit.
2017-04-19 04:44:48,299:139680142771968:INFO:JobSubmitterPoller:Have 0 jobs to submit.
2017-04-19 04:44:48,299:139680142771968:INFO:JobSubmitterPoller:Done assigning site locations.
2017-04-19 04:46:48,900:139680142771968:INFO:JobSubmitterPoller:Refreshing priority cache with currently 30015 jobs
2017-04-19 04:46:49,910:139680142771968:INFO:JobSubmitterPoller:Skipping cache update to be submitted. (30015 job in cache)
2017-04-19 04:46:49,911:139680142771968:INFO:JobSubmitterPoller:Determining possible sites for new jobs...
2017-04-19 04:46:49,911:139680142771968:INFO:JobSubmitterPoller:Done pruning killed jobs, moving on to submit.
2017-04-19 04:47:26,023:139680142771968:INFO:JobSubmitterPoller:Site submission report: {'T1_IT_CNAF': Counter({'NoPendingSlot': 261, 'submitted': 3}), 'T2_FR_CCIN2P3': Counter({'NoPendingSlot': 604, 'submitted': 126}), 'T1_DE_KIT': Cou
nter({'NoPendingSlot': 38, 'submitted': 37}), 'T2_CH_CERN_HLT': Counter({'NoPendingSlot': 628, 'submitted': 203}), 'T1_FR_CCIN2P3': Counter({'submitted': 3}), 'T2_US_Florida': Counter({'NoPendingSlot': 761, 'submitted': 15}), 'T2_FR_GRI
F_IRFU': Counter({'submitted': 15}), 'T2_US_Purdue': Counter({'NoPendingSlot': 1470, 'submitted': 292}), 'T0_CH_CERN': Counter({'NoPendingSlot': 1807, 'submitted': 41}), 'T2_UK_London_IC': Counter({'submitted': 15}), 'T2_FR_GRIF_LLR': C
ounter({'submitted': 15}), 'T2_US_Nebraska': Counter({'submitted': 27}), 'T2_IT_Legnaro': Counter({'submitted': 1}), 'T2_US_UCSD': Counter({'submitted': 6}), 'T2_ES_CIEMAT': Counter({'NoPendingSlot': 551, 'submitted': 1}), 'T2_FR_IPHC':
 Counter({'submitted': 3}), 'T1_RU_JINR': Counter({'NoPendingSlot': 1350, 'submitted': 25}), 'T1_ES_PIC': Counter({'NoPendingSlot': 1125, 'submitted': 3}), 'T1_UK_RAL': Counter({'NoPendingSlot': 538, 'submitted': 4}), 'T1_US_FNAL': Coun
ter({'NoPendingSlot': 7739, 'submitted': 25}), 'T2_US_MIT': Counter({'NoPendingSlot': 2497, 'submitted': 40}), 'T2_DE_RWTH': Counter({'NoPendingSlot': 2401, 'submitted': 12}), 'T2_US_Caltech': Counter({'NoPendingSlot': 697, 'submitted':
 6}), 'T2_UK_London_Brunel': Counter({'submitted': 43}), 'T2_BE_IIHE': Counter({'submitted': 16}), 'T2_IT_Pisa': Counter({'submitted': 17}), 'T2_CH_CERN': Counter({'NoPendingSlot': 636, 'submitted': 6})}
2017-04-19 04:47:26,993:139680142771968:INFO:JobSubmitterPoller:Priority submission report: {900000.0: Counter({'Total': 13, 'submitted': 13}), 50090000.0: Counter({'Total': 30, 'submitted': 23}), 90000.0: Counter({'Total': 439, 'submit
ted': 45}), 130000.0: Counter({'Total': 1987, 'submitted': 15}), 85000.0: Counter({'Total': 20983, 'submitted': 886}), 50085000.0: Counter({'Total': 9, 'submitted': 8}), 50130000.0: Counter({'Total': 10, 'submitted': 10})}
2017-04-19 04:47:26,994:139680142771968:INFO:JobSubmitterPoller:Have 43 packages to submit.
2017-04-19 04:47:26,994:139680142771968:INFO:JobSubmitterPoller:Have 1000 jobs to submit.
2017-04-19 04:47:26,994:139680142771968:INFO:JobSubmitterPoller:Done assigning site locations.
2017-04-19 07:11:19,877:139850334803712:INFO:Harness:>>>Starting: JobSubmitter<<<

Next time it happens, please do not restart the component and let us know.

@scarletnorberg
Copy link
Author

@amaltaro @ticoann It is down again. It does not have a list of what is down. It just says down I assume it is because of this problem but do not want to restart it or touch it. Please have a look JobUpdater was just down on this machine here is the JIRA ticket: https://its.cern.ch/jira/projects/CMSCOMPPR/issues/CMSCOMPPR-825?filter=allopenissues
I will not touch it. I hope someone can look at it and let me know what is going on. Thanks.

@vlimant
Copy link
Contributor

vlimant commented Apr 19, 2017

@ticoann @amaltaro looks like the schedd has been having troubles. I think the component should not die on this, but raise a warning that it cannot do its job

@ticoann
Copy link
Contributor

ticoann commented Apr 20, 2017

Although schedd connection failed we don't crash the component on that. (Alan already added that patch)

2017-04-19 14:29:27,186:139849995843328:ERROR:SimpleCondorPlugin:Failed to connect to schedd.
Traceback (most recent call last):
  File "/data/srv/wmagent/v1.1.0.patch2/sw/slc6_amd64_gcc493/cms/wmagent/1.1.0.patch2/lib/python2.7/site-packages/WMCore/BossAir/Plugins/SimpleCondorPlugin.py", line 146, in submit
    clusterId = schedd.submitMany(clusterAd, procAds)
RuntimeError: Failed to connect to schedd.
2017-04-19 14:29:27,432:139849995843328:ERROR:SimpleCondorPlugin:SimpleCondorPlugin job submission failed.
2017-04-19 14:29:27,433:139849995843328:ERROR:SimpleCondorPlugin:Moving on the the next batch of jobs and/or cycle....
2017-04-19 14:29:27,433:139849995843328:ERROR:SimpleCondorPlugin:Failed to connect to schedd.

However, thread seems to be killed silently, Here is the last part of the log.

2017-04-19 14:40:04,744:139849995843328:INFO:JobSubmitterPoller:Have 17 packages to submit.
2017-04-19 14:40:04,744:139849995843328:INFO:JobSubmitterPoller:Have 1000 jobs to submit.
2017-04-19 14:40:04,744:139849995843328:INFO:JobSubmitterPoller:Done assigning site locations.

It looks like a couple of times schedd connection failed and last try just hangs without explicit failing then killing the thread.

I change the log level to debug and restarted the component.

@amaltaro
Copy link
Contributor

amaltaro commented May 2, 2017

vocms0304 JobSubmitter

@amaltaro amaltaro added this to the WMAgent1705 milestone May 2, 2017
@scarletnorberg
Copy link
Author

vomcs 308:

2017-04-26 09:59:15,938:140351529064192:INFO:JobSubmitterPoller:Priority submission report: {50080000.0: Counter({'Total': 2, 'submitted': 2}), 90000.0: Counter({'Total': 4, 'submitted': 4}), 80000.0: Counter({'Total': 4834, 'submitted': 50})}
2017-04-26 09:59:15,938:140351529064192:INFO:JobSubmitterPoller:Have 8 packages to submit.
2017-04-26 09:59:15,939:140351529064192:INFO:JobSubmitterPoller:Have 56 jobs to submit.
2017-04-26 09:59:15,939:140351529064192:INFO:JobSubmitterPoller:Done assigning site locations.
2017-04-26 09:59:16,745:140351529064192:INFO:SimpleCondorPlugin:Done submitting jobs for this cycle in SimpleCondorPlugin
2017-04-26 09:59:16,777:140351529064192:INFO:JobSubmitterPoller:Jobs that succeeded/failed submission: 56/0.
2017-04-26 09:59:17,021:140351529064192:INFO:DashboardReporter:Handling 56 jobs
2017-04-26 09:59:28,519:140351529064192:INFO:JobSubmitterPoller:Transaction cycle successfully completed.
2017-04-26 10:01:28,757:140351529064192:INFO:JobSubmitterPoller:Refreshing priority cache with currently 4784 jobs
2017-04-26 10:01:32,105:140351529064192:INFO:JobSubmitterPoller:Found 4873 new jobs to be submitted.
2017-04-26 10:01:32,105:140351529064192:INFO:JobSubmitterPoller:Determining possible sites for new jobs...
2017-04-26 10:01:32,208:140351529064192:INFO:JobSubmitterPoller:Done pruning killed jobs, moving on to submit.
2017-04-26 10:01:36,456:140351529064192:INFO:JobSubmitterPoller:Site submission report: {'T2_DE_RWTH': Counter({'NoPendingSlot': 1744, 'submitted': 25}), 'T2_US_UCSD': Counter({'submitted': 1}), 'T2_CH_CERN_HLT': Counter({'submitted': 2}), 'T2_DE_DESY': Counter({'submitted': 9}), 'T1_US_FNAL': Counter({'submitted': 3}), 'T2_US_MIT': Counter({'submitted': 12}), 'T2_IT_Legnaro': Counter({'submitted': 1}), 'T2_US_Caltech': Counter({'NoPendingSlot': 1512, 'submitted': 34}), 'T2_US_Florida': Counter({'submitted': 11}), 'T2_US_Purdue': Counter({'submitted': 11}), 'T2_US_Nebraska': Counter({'NoPendingSlot': 1487, 'submitted': 20}), 'T2_CH_CERN': Counter({'submitted': 1})}
2017-04-26 10:01:36,456:140351529064192:INFO:JobSubmitterPoller:Priority submission report: {50090000.0: Counter({'Total': 5, 'submitted': 5}), 50080000.0: Counter({'Total': 8, 'submitted': 8}), 90000.0: Counter({'Total': 17, 'submitted': 17}), 85000.0: Counter({'Total': 1, 'submitted': 1}), 80000.0: Counter({'Total': 4842, 'submitted': 99})}
2017-04-26 10:01:36,457:140351529064192:INFO:JobSubmitterPoller:Have 14 packages to submit.
2017-04-26 10:01:36,457:140351529064192:INFO:JobSubmitterPoller:Have 130 jobs to submit.
2017-04-26 10:01:36,457:140351529064192:INFO:JobSubmitterPoller:Done assigning site locations.
2017-04-26 10:01:38,025:140351529064192:INFO:SimpleCondorPlugin:Done submitting jobs for this cycle in SimpleCondorPlugin
2017-04-26 10:01:38,119:140351529064192:INFO:JobSubmitterPoller:Jobs that succeeded/failed submission: 130/0.
2017-04-26 10:01:38,512:140351529064192:INFO:DashboardReporter:Handling 130 jobs
2017-04-26 10:01:50,442:140351529064192:INFO:JobSubmitterPoller:Transaction cycle successfully completed.
2017-04-26 10:19:21,311:140351529064192:ERROR:pool:Exception closing connection <cx_Oracle.Connection to CMS_WMBS_PROD1@CMSR>
Traceback (most recent call last):
File "/data/srv/wmagent/v1.1.0.patch2/sw/slc6_amd64_gcc493/external/py2-sqlalchemy/0.9.6-comp3/lib/python2.7/site-packages/sqlalchemy/pool.py", line 244, in _close_connection
self._dialect.do_close(connection)
File "/data/srv/wmagent/v1.1.0.patch2/sw/slc6_amd64_gcc493/external/py2-sqlalchemy/0.9.6-comp3/lib/python2.7/site-packages/sqlalchemy/engine/default.py", line 411, in do_close
dbapi_connection.close()
OperationalError: ORA-03114: not connected to ORACLE

2017-04-26 10:19:21,337:140351529064192:ERROR:BaseWorkerThread:Error in worker algorithm (1):
Backtrace:
<WMComponent.JobSubmitter.JobSubmitterPoller.JobSubmitterPoller instance at 0x7fa62bff6200> (OperationalError) ORA-03113: end-of-file on communication channel
Process ID: 95494
Session ID: 3320 Serial number: 2177
'SELECT component_id FROM wm_workers\n WHERE component_id = (SELECT id FROM wm_components\n WHERE name = :component_name)\n AND name = :worker_name\n ' {'worker_name': 'JobSubmitterPoller', 'component_name': 'JobSubmitter'} File "/data/srv/wmagent/v1.1.0.patch2/sw/slc6_amd64_gcc493/cms/wmagent/1.1.0.patch2/lib/python2.7/site-packages/WMCore/WorkerThreads/BaseWorkerThread.py", line 172, in call
myThread.getName(), "Running")
File "/data/srv/wmagent/v1.1.0.patch2/sw/slc6_amd64_gcc493/cms/wmagent/1.1.0.patch2/lib/python2.7/site-packages/WMCore/Agent/HeartbeatAPI.py", line 47, in updateWorkerHeartbeat
transaction = self.existingTransaction())
File "/data/srv/wmagent/v1.1.0.patch2/sw/slc6_amd64_gcc493/cms/wmagent/1.1.0.patch2/lib/python2.7/site-packages/WMCore/Agent/Database/MySQL/ExistWorker.py", line 37, in execute
transaction = transaction)
File "/data/srv/wmagent/v1.1.0.patch2/sw/slc6_amd64_gcc493/cms/wmagent/1.1.0.patch2/lib/python2.7/site-packages/WMCore/Database/DBCore.py", line 179, in processData
returnCursor=returnCursor)
File "/data/srv/wmagent/v1.1.0.patch2/sw/slc6_amd64_gcc493/cms/wmagent/1.1.0.patch2/lib/python2.7/site-packages/WMCore/Database/DBCore.py", line 65, in executebinds
resultProxy = connection.execute(s, b)
File "/data/srv/wmagent/v1.1.0.patch2/sw/slc6_amd64_gcc493/external/py2-sqlalchemy/0.9.6-comp3/lib/python2.7/site-packages/sqlalchemy/engine/base.py", line 719, in execute
return self._execute_text(object, multiparams, params)
File "/data/srv/wmagent/v1.1.0.patch2/sw/slc6_amd64_gcc493/external/py2-sqlalchemy/0.9.6-comp3/lib/python2.7/site-packages/sqlalchemy/engine/base.py", line 868, in _execute_text
statement, parameters
File "/data/srv/wmagent/v1.1.0.patch2/sw/slc6_amd64_gcc493/external/py2-sqlalchemy/0.9.6-comp3/lib/python2.7/site-packages/sqlalchemy/engine/base.py", line 954, in _execute_context
context)
File "/data/srv/wmagent/v1.1.0.patch2/sw/slc6_amd64_gcc493/external/py2-sqlalchemy/0.9.6-comp3/lib/python2.7/site-packages/sqlalchemy/engine/base.py", line 1116, in _handle_dbapi_exception
exc_info
File "/data/srv/wmagent/v1.1.0.patch2/sw/slc6_amd64_gcc493/external/py2-sqlalchemy/0.9.6-comp3/lib/python2.7/site-packages/sqlalchemy/util/compat.py", line 189, in raise_from_cause
reraise(type(exception), exception, tb=exc_tb)
File "/data/srv/wmagent/v1.1.0.patch2/sw/slc6_amd64_gcc493/external/py2-sqlalchemy/0.9.6-comp3/lib/python2.7/site-packages/sqlalchemy/engine/base.py", line 947, in _execute_context
context)
File "/data/srv/wmagent/v1.1.0.patch2/sw/slc6_amd64_gcc493/external/py2-sqlalchemy/0.9.6-comp3/lib/python2.7/site-packages/sqlalchemy/engine/default.py", line 435, in do_execute
cursor.execute(statement, parameters)

2017-04-26 10:19:21,338:140351529064192:INFO:Harness:>>>Terminating worker threads
2017-04-26 10:19:21,439:140351529064192:ERROR:BaseWorkerThread:Error in event loop (2): <WMComponent.JobSubmitter.JobSubmitterPoller.JobSubmitterPoller instance at 0x7fa62bff6200> (OperationalError) ORA-03113: end-of-file on communication channel
Process ID: 95494
Session ID: 3320 Serial number: 2177
'SELECT component_id FROM wm_workers\n WHERE component_id = (SELECT id FROM wm_components\n WHERE name = :component_name)\n AND name = :worker_name\n ' {'worker_name': 'JobSubmitterPoller', 'component_name': 'JobSubmitter'}
Backtrace:
File "/data/srv/wmagent/v1.1.0.patch2/sw/slc6_amd64_gcc493/cms/wmagent/1.1.0.patch2/lib/python2.7/site-packages/WMCore/WorkerThreads/BaseWorkerThread.py", line 205, in call
raise ex

2017-04-26 10:19:21,440:140351529064192:INFO:BaseWorkerThread:Worker thread <WMComponent.JobSubmitter.JobSubmitterPoller.JobSubmitterPoller instance at 0x7fa62bff6200> terminated

JobSubmitter this is the same issue that has been seen.

@amaltaro
Copy link
Contributor

amaltaro commented May 4, 2017

Scarlet, the last error report you pasted is related to the oracle CMSR intervention that brought all the cms computing software down.
The initial issue reported here was about JobSubmitter silently dying (either process running with some sort of lock or component + process down, still to be figured).

If you have components going down for different reasons (like your last comment), please either create a new issue or follow it up in an already opened one.

@amaltaro amaltaro changed the title vocms0304 JobSubmitter JobSubmitter silently dying on vocms0304 and vocms0303 May 18, 2017
@amaltaro
Copy link
Contributor

======== Better description of the subject ===========
The JobSubmitter process is really no longer running, so the component is not stuck, but it's dead for real.

Poking through the component logs, these are my findings.
stderr.log

...
INFO:root:Have 890 jobs to submit.
INFO:root:Done assigning site locations.
terminate called after throwing an instance of 'boost::python::error_already_set'

stdout.log (this seems to be a SIGABRT received and unhandled in the client (?))

...
Wed May 17 20:29:15 2017: ApMon[NOTICE]: Adding parameter taskId = wmagent_pdmvserv_task_SMP-RunIISummer15wmLHEGS-00116__v1_T_170512_165520_5450
Wed May 17 20:29:15 2017: ApMon[NOTICE]: Building XDR packet [wmagent_pdmvserv_task_SMP-RunIISummer15wmLHEGS-00116__v1_T_170512_165520_5450/cd6a0f88-3850-11e7-b9af-002590494c06-5_0] <1000/32207389> 8 params, 588 bytes.
Wed May 17 20:29:15 2017: ApMon[NOTICE]: Packet sent to 128.142.166.92:8884 
Wed May 17 20:29:15 2017: ApMon[DEBUG]: Closing UDP socket on ApMon object destroy.
ORA-24550: signal received: [si_signo=6] [si_errno=0] [si_code=-6] [si_int=0] [si_ptr=(nil)] [si_addr=0x7cd90001c972]
kpedbg_dmp_stack()+362<-kpeDbgCrash()+192<-kpeDbgSignalHandler()+119<-skgesig_sigactionHandler()+218<-__sighandler()<-gsignal()+53<-_ZN10__cxxabiv111__terminateEPFvvE()+18<-__cxa_call_terminate()+82<-__gxx_personality_v0()+1350<-_Unwind_RaiseException_Phase2()+67

ComponentLog (nothing useful here)

...
2017-05-17 20:34:00,595:140446217037568:INFO:JobSubmitterPoller:Have 890 jobs to submit.
2017-05-17 20:34:00,596:140446217037568:INFO:JobSubmitterPoller:Done assigning site locations.

which doesn't bring me to any conclusions. The unhandled signal came in 5min before the last logging in the ComponentLog, so that's not crystal clear whether the component stopped because of the oracle signal.

@amaltaro amaltaro modified the milestones: WMAgent1706, WMAgent1705 Jun 2, 2017
amaltaro referenced this issue in amaltaro/ProductionTools Jun 26, 2017
@ticoann ticoann modified the milestones: WMAgent1706, WMAgent1707 Jul 4, 2017
@amaltaro amaltaro modified the milestones: WMAgent1708, WMAgent1707 Aug 1, 2017
@amaltaro
Copy link
Contributor

amaltaro commented Aug 5, 2017

Same component went down on vocms0159 (1.1.4.patch2), no logs at all, last log is

2017-08-05 03:21:16,983:140126486722304:INFO:DashboardReporter:Handling 4 jobs
2017-08-05 03:21:17,991:140126486722304:INFO:JobSubmitterPoller:Transaction cycle successfully completed.

and wmstats does not report JobSubmitter down

    agent: vocms0159.cern.ch (1.1.4.patch2)
    agent last updated: 2017/8/5 (Sat) 06:39:39 UTC : 0 h 5 m
    data last updated: N/A
    status: Components or Thread down;
    team: testbed-vocms0159

I'll try to work on this component monitoring the coming week. It's for way too long in my todo list.

@amaltaro amaltaro self-assigned this Aug 5, 2017
@amaltaro amaltaro modified the milestones: WMAgent1709, WMAgent1708 Sep 12, 2017
@ticoann ticoann modified the milestones: WMAgent1709, WMAgent1710 Oct 24, 2017
@ticoann ticoann modified the milestones: WMAgent1710, WMAgent1712 Nov 27, 2017
@ticoann ticoann removed this from the WMAgent1712 milestone Jan 18, 2018
@ticoann ticoann added this to the WMAgent1801 milestone Jan 18, 2018
@ticoann ticoann modified the milestones: WMAgent1801, WMAgent1802 Feb 12, 2018
@scarletnorberg
Copy link
Author

https://its.cern.ch/jira/browse/CMSCOMPPR-2353
Jean-Roch noticed that 258 JobAccountant has silently crashed
We have not seen this in a while. It seems to still be happening.

@vlimant
Copy link
Contributor

vlimant commented Feb 15, 2018

Apologies, I did not keep the log, but there was indeed something like "connection refused" in there dated Feb 14 2018.
Is this supposed to be fixed the recently announced release that is resilient to oracle @amaltaro ?

@ticoann
Copy link
Contributor

ticoann commented Feb 16, 2018

I see this in the log

2018-02-14 14:15:40,518:140624510367488:ERROR:pool:Exception closing connection <cx_Oracle.Connection to CMS_WMBS_PROD8@CMSR>
OperationalError: ORA-03114: not connected to ORACLE
(OperationalError) ORA-03113: end-of-file on communication channel
2018-02-14 14:15:40,521:140624510367488:ERROR:pool:Exception closing connection <cx_Oracle.Connection to CMS_WMBS_PROD8@CMSR>
OperationalError: ORA-03113: end-of-file on communication channel

Those errors will be handled in the new agent 1.1.0.patch2 which will be starting to deploy next week.

One thing I don't understand is why WMStats doesn't report this cases. At least it should report heartbeat error.

@ticoann
Copy link
Contributor

ticoann commented Feb 16, 2018

Closing this since the patch is already in the master and wmagent branch.

@ticoann ticoann closed this as completed Feb 16, 2018
@amaltaro
Copy link
Contributor

Since it was JobAccountant - and not JobSubmitter, as the subject suggests - it might be related to this other issue: #8365

@ticoann
Copy link
Contributor

ticoann commented Feb 16, 2018

Thank Alan, so we still close this issue and leave #8365

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants