JobSubmitter silently dying on vocms0304 and vocms0303 #7794
@scarletnorberg, what was the error message when it crashed?
I think what you mean is that JobSubmitter is silently dying (is the process really dead?) on vocms0304.
Next time it happens, please do not restart the component, and let us know.
@amaltaro @ticoann It is down again. There is no list of what is down; it just says "down". I assume it is because of this problem, but I do not want to restart or touch it. Please have a look. JobUpdater was just down on this machine; here is the JIRA ticket: https://its.cern.ch/jira/projects/CMSCOMPPR/issues/CMSCOMPPR-825?filter=allopenissues
Although the schedd connection failed, we don't crash the component on that (Alan already added that patch).
However, the thread seems to be killed silently. Here is the last part of the log.
It looks like the schedd connection failed a couple of times, and the last try just hangs without failing explicitly, after which the thread is killed. I changed the log level to debug and restarted the component.
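For illustration only, here is a minimal sketch of how a call that may hang could be turned into an explicit failure that shows up in the log. This is not the WMCore code: `call_with_timeout` and `fake_schedd_query` are hypothetical names, and the real schedd call is not shown here.

```python
import concurrent.futures
import time


def call_with_timeout(func, timeout_s, *args, **kwargs):
    """Run func in a helper thread; raise RuntimeError if it does not finish in time.

    Hypothetical helper, not part of WMCore. The helper thread is not killed on
    timeout (Python cannot do that); the caller simply stops waiting for it.
    """
    executor = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = executor.submit(func, *args, **kwargs)
    try:
        return future.result(timeout=timeout_s)
    except concurrent.futures.TimeoutError:
        raise RuntimeError(
            "%s did not return within %.2fs" % (func.__name__, timeout_s)
        )
    finally:
        # Do not block here waiting for a call that may never return.
        executor.shutdown(wait=False)


def fake_schedd_query(delay_s):
    """Stand-in for a remote call; sleeps to simulate a slow or hung connection."""
    time.sleep(delay_s)
    return "ok"
```

With something like this, a hung call would surface as an exception the polling loop can log and retry, instead of the thread going quiet. The hung helper thread itself still lingers, so this only makes the failure visible; it does not fix the underlying hang.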
vocms 308:
2017-04-26 09:59:15,938:140351529064192:INFO:JobSubmitterPoller:Priority submission report: {50080000.0: Counter({'Total': 2, 'submitted': 2}), 90000.0: Counter({'Total': 4, 'submitted': 4}), 80000.0: Counter({'Total': 4834, 'submitted': 50})}
2017-04-26 10:19:21,337:140351529064192:ERROR:BaseWorkerThread:Error in worker algorithm (1):
2017-04-26 10:19:21,338:140351529064192:INFO:Harness:>>>Terminating worker threads
2017-04-26 10:19:21,440:140351529064192:INFO:BaseWorkerThread:Worker thread <WMComponent.JobSubmitter.JobSubmitterPoller.JobSubmitterPoller instance at 0x7fa62bff6200> terminated
JobSubmitter: this is the same issue that has been seen.
Scarlet, the last error report you pasted is related to the Oracle CMSR intervention that brought all the CMS computing software down. If you have components going down for different reasons (like in your last comment), please either create a new issue or follow it up in an already open one.
======== Better description of the subject =========== Poking through the component logs, these are my findings.
stdout.log (this seems to be a SIGABRT received and unhandled in the client (?))
ComponentLog (nothing useful here)
Neither brings me to any conclusion. The unhandled signal came in 5 min before the last entry in the ComponentLog, so it's not clear whether the component stopped because of the Oracle signal.
Same component went down on vocms0159 (1.1.4.patch2), no logs at all, last log is
and wmstats does not report JobSubmitter down
I'll try to work on this component monitoring in the coming week. It has been on my to-do list for far too long.
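As a rough idea of what such monitoring could check, here is a sketch that flags a component as stale when its ComponentLog has been silent for too long. It assumes the log line layout seen in the pastes in this issue (`timestamp,ms:thread:LEVEL:logger:message`); the function names and the threshold are hypothetical, and this is not how WMStats or the agent heartbeat actually works.

```python
import re
from datetime import datetime, timedelta

# Layout assumed from the log lines pasted in this issue.
LOG_LINE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),(?P<ms>\d{3}):"
    r"(?P<thread>\d+):(?P<level>[A-Z]+):(?P<logger>\w+):(?P<msg>.*)$"
)


def parse_line(line):
    """Return a dict for a well-formed log line, or None for anything else."""
    match = LOG_LINE.match(line.strip())
    if match is None:
        return None
    stamp = datetime.strptime(match.group("ts"), "%Y-%m-%d %H:%M:%S")
    stamp += timedelta(milliseconds=int(match.group("ms")))
    return {
        "time": stamp,
        "level": match.group("level"),
        "logger": match.group("logger"),
        "msg": match.group("msg"),
    }


def last_activity(lines):
    """Timestamp of the last parsable log line, or None if there is none."""
    latest = None
    for line in lines:
        record = parse_line(line)
        if record is not None:
            latest = record["time"]
    return latest


def is_stale(lines, now, max_silence):
    """True if the log has no parsable entry within max_silence before now."""
    latest = last_activity(lines)
    return latest is None or now - latest > max_silence
```

The logs in this issue show a polling cycle of roughly two minutes, so a threshold of a few cycles (say 10 minutes) might be a starting point; that number is a guess and would need tuning per component.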
https://its.cern.ch/jira/browse/CMSCOMPPR-2353
Apologies, I did not keep the log, but there was indeed something like "connection refused" in there dated Feb 14 2018.
I see this in the log:
2018-02-14 14:15:40,518:140624510367488:ERROR:pool:Exception closing connection <cx_Oracle.Connection to CMS_WMBS_PROD8@CMSR>
Those errors will be handled in the new agent 1.1.0.patch2, which we will start deploying next week. One thing I don't understand is why WMStats doesn't report these cases. At the very least it should report a heartbeat error.
Closing this since the patch is already in the master and wmagent branches.
Since it was JobAccountant - and not JobSubmitter, as the subject suggests - it might be related to this other issue: #8365
Thanks Alan, so we still close this issue and leave #8365 open.
JobSubmitter has gone down so many times. This is like the problem we had before, where WMStats was not picking it up. I posted something about this on the other JobSubmitter issue, #7418, but I am not sure whether it is the same thing.
Here is the log from the last time I restarted it:
2017-04-19 04:38:37,603:139680142771968:INFO:JobSubmitterPoller:Refreshing priority cache with currently 30015 jobs
2017-04-19 04:38:38,507:139680142771968:INFO:JobSubmitterPoller:Skipping cache update to be submitted. (30015 job in cache)
2017-04-19 04:38:38,887:139680142771968:INFO:JobSubmitterPoller:Determining possible sites for new jobs...
2017-04-19 04:38:38,888:139680142771968:INFO:JobSubmitterPoller:Done pruning killed jobs, moving on to submit.
2017-04-19 04:38:40,185:139680142771968:INFO:JobSubmitterPoller:Site submission report: {'T1_IT_CNAF': Counter({'NoPendingSlot': 2301}), 'T2_FR_CCIN2P3': Counter({'NoPendingSlot': 2773}), 'T1_DE_KIT': Counter({'NoPendingSlot': 2119}), 'T2_CH_CERN_HLT': Counter({'NoPendingSlot': 3299}), 'T1_ES_PIC': Counter({'NoPendingSlot': 3117}), 'T1_FR_CCIN2P3': Counter({'NoPendingSlot': 1968}), 'T2_US_Florida': Counter({'NoPendingSlot': 2869}), 'T2_FR_GRIF_IRFU': Counter({'NoPendingSlot': 2010}), 'T2_US_Purdue': Counter({'NoPendingSlot': 4363}), u'T0_CH_CERN': Counter({'NoPendingSlot': 4625}), 'T2_UK_London_IC': Counter({'NoPendingSlot': 1989}), 'T2_FR_GRIF_LLR': Counter({'NoPendingSlot': 1961}), 'T2_US_Nebraska': Counter({'NoPendingSlot': 2350}), 'T2_IT_Legnaro': Counter({'NoPendingSlot': 1964}), 'T2_US_UCSD': Counter({'NoPendingSlot': 1971}), 'T2_ES_CIEMAT': Counter({'NoPendingSlot': 2551}), 'T2_FR_IPHC': Counter({'NoPendingSlot': 1971}), 'T1_RU_JINR': Counter({'NoPendingSlot': 3500}), u'T2_US_Wisconsin': Counter({'NoPendingSlot': 1947}), u'T2_DE_DESY': Counter({'NoPendingSlot': 1155}), u'T1_UK_RAL': Counter({'NoPendingSlot': 2834}), 'T1_US_FNAL': Counter({'NoPendingSlot': 10130}), 'T2_US_MIT': Counter({'NoPendingSlot': 4796}), 'T2_DE_RWTH': Counter({'NoPendingSlot': 4435}), 'T2_US_Caltech': Counter({'NoPendingSlot': 2784}), 'T2_UK_London_Brunel': Counter({'NoPendingSlot': 2006}), 'T2_BE_IIHE': Counter({'NoPendingSlot': 813}), 'T2_IT_Pisa': Counter({'NoPendingSlot': 1966}), 'T2_CH_CERN': Counter({'NoPendingSlot': 3313})}
2017-04-19 04:38:40,630:139680142771968:INFO:JobSubmitterPoller:Priority submission report: {900000.0: Counter({'Total': 13}), 90000.0: Counter({'Total': 439}), 130000.0: Counter({'Total': 1987}), 50085000.0: Counter({'Total': 9}), 80000.0: Counter({'Total': 1152}), 50130000.0: Counter({'Total': 10}), 50090000.0: Counter({'Total': 30}), 85000.0: Counter({'Total': 26375})}
2017-04-19 04:38:40,631:139680142771968:INFO:JobSubmitterPoller:Have 0 packages to submit.
2017-04-19 04:38:40,631:139680142771968:INFO:JobSubmitterPoller:Have 0 jobs to submit.
2017-04-19 04:38:40,631:139680142771968:INFO:JobSubmitterPoller:Done assigning site locations.
2017-04-19 04:40:41,181:139680142771968:INFO:JobSubmitterPoller:Refreshing priority cache with currently 30015 jobs
2017-04-19 04:40:41,181:139680142771968:INFO:JobSubmitterPoller:Skipping cache update to be submitted. (30015 job in cache)
2017-04-19 04:40:41,181:139680142771968:INFO:JobSubmitterPoller:Determining possible sites for new jobs...
2017-04-19 04:40:41,182:139680142771968:INFO:JobSubmitterPoller:Done pruning killed jobs, moving on to submit.
2017-04-19 04:40:42,461:139680142771968:INFO:JobSubmitterPoller:Site submission report: {'T1_IT_CNAF': Counter({'NoPendingSlot': 2301}), 'T2_FR_CCIN2P3': Counter({'NoPendingSlot': 2773}), 'T1_DE_KIT': Counter({'NoPendingSlot': 2119}), 'T2_CH_CERN_HLT': Counter({'NoPendingSlot': 3299}), 'T1_ES_PIC': Counter({'NoPendingSlot': 3117}), 'T1_FR_CCIN2P3': Counter({'NoPendingSlot': 1968}), 'T2_US_Florida': Counter({'NoPendingSlot': 2869}), 'T2_FR_GRIF_IRFU': Counter({'NoPendingSlot': 2010}), 'T2_US_Purdue': Counter({'NoPendingSlot': 4363}), u'T0_CH_CERN': Counter({'NoPendingSlot': 4625}), 'T2_UK_London_IC': Counter({'NoPendingSlot': 1989}), 'T2_FR_GRIF_LLR': Counter({'NoPendingSlot': 1961}), 'T2_US_Nebraska': Counter({'NoPendingSlot': 2350}), 'T2_IT_Legnaro': Counter({'NoPendingSlot': 1964}), 'T2_US_UCSD': Counter({'NoPendingSlot': 1971}), 'T2_ES_CIEMAT': Counter({'NoPendingSlot': 2551}), 'T2_FR_IPHC': Counter({'NoPendingSlot': 1971}), 'T1_RU_JINR': Counter({'NoPendingSlot': 3500}), u'T2_US_Wisconsin': Counter({'NoPendingSlot': 1947}), u'T2_DE_DESY': Counter({'NoPendingSlot': 1155}), u'T1_UK_RAL': Counter({'NoPendingSlot': 2834}), 'T1_US_FNAL': Counter({'NoPendingSlot': 10130}), 'T2_US_MIT': Counter({'NoPendingSlot': 4796}), 'T2_DE_RWTH': Counter({'NoPendingSlot': 4435}), 'T2_US_Caltech': Counter({'NoPendingSlot': 2784}), 'T2_UK_London_Brunel': Counter({'NoPendingSlot': 2006}), 'T2_BE_IIHE': Counter({'NoPendingSlot': 813}), 'T2_IT_Pisa': Counter({'NoPendingSlot': 1966}), 'T2_CH_CERN': Counter({'NoPendingSlot': 3313})}
2017-04-19 04:40:42,462:139680142771968:INFO:JobSubmitterPoller:Priority submission report: {900000.0: Counter({'Total': 13}), 90000.0: Counter({'Total': 439}), 130000.0: Counter({'Total': 1987}), 50085000.0: Counter({'Total': 9}), 80000.0: Counter({'Total': 1152}), 50130000.0: Counter({'Total': 10}), 50090000.0: Counter({'Total': 30}), 85000.0: Counter({'Total': 26375})}
2017-04-19 04:40:42,463:139680142771968:INFO:JobSubmitterPoller:Have 0 packages to submit.
2017-04-19 04:40:42,463:139680142771968:INFO:JobSubmitterPoller:Have 0 jobs to submit.
2017-04-19 04:40:42,463:139680142771968:INFO:JobSubmitterPoller:Done assigning site locations.
2017-04-19 04:42:44,204:139680142771968:INFO:JobSubmitterPoller:Refreshing priority cache with currently 30015 jobs
2017-04-19 04:42:44,204:139680142771968:INFO:JobSubmitterPoller:Skipping cache update to be submitted. (30015 job in cache)
2017-04-19 04:42:44,204:139680142771968:INFO:JobSubmitterPoller:Determining possible sites for new jobs...
2017-04-19 04:42:44,205:139680142771968:INFO:JobSubmitterPoller:Done pruning killed jobs, moving on to submit.
2017-04-19 04:42:45,550:139680142771968:INFO:JobSubmitterPoller:Site submission report: {'T1_IT_CNAF': Counter({'NoPendingSlot': 2301}), 'T2_FR_CCIN2P3': Counter({'NoPendingSlot': 2773}), 'T1_DE_KIT': Counter({'NoPendingSlot': 2119}), 'T2_CH_CERN_HLT': Counter({'NoPendingSlot': 3299}), 'T1_ES_PIC': Counter({'NoPendingSlot': 3117}), 'T1_FR_CCIN2P3': Counter({'NoPendingSlot': 1968}), 'T2_US_Florida': Counter({'NoPendingSlot': 2869}), 'T2_FR_GRIF_IRFU': Counter({'NoPendingSlot': 2010}), 'T2_US_Purdue': Counter({'NoPendingSlot': 4363}), u'T0_CH_CERN': Counter({'NoPendingSlot': 4625}), 'T2_UK_London_IC': Counter({'NoPendingSlot': 1989}), 'T2_FR_GRIF_LLR': Counter({'NoPendingSlot': 1961}), 'T2_US_Nebraska': Counter({'NoPendingSlot': 2350}), 'T2_IT_Legnaro': Counter({'NoPendingSlot': 1964}), 'T2_US_UCSD': Counter({'NoPendingSlot': 1971}), 'T2_ES_CIEMAT': Counter({'NoPendingSlot': 2551}), 'T2_FR_IPHC': Counter({'NoPendingSlot': 1971}), 'T1_RU_JINR': Counter({'NoPendingSlot': 3500}), u'T2_US_Wisconsin': Counter({'NoPendingSlot': 1947}), u'T2_DE_DESY': Counter({'NoPendingSlot': 1155}), u'T1_UK_RAL': Counter({'NoPendingSlot': 2834}), 'T1_US_FNAL': Counter({'NoPendingSlot': 10130}), 'T2_US_MIT': Counter({'NoPendingSlot': 4796}), 'T2_DE_RWTH': Counter({'NoPendingSlot': 4435}), 'T2_US_Caltech': Counter({'NoPendingSlot': 2784}), 'T2_UK_London_Brunel': Counter({'NoPendingSlot': 2006}), 'T2_BE_IIHE': Counter({'NoPendingSlot': 813}), 'T2_IT_Pisa': Counter({'NoPendingSlot': 1966}), 'T2_CH_CERN': Counter({'NoPendingSlot': 3313})}
2017-04-19 04:42:45,551:139680142771968:INFO:JobSubmitterPoller:Priority submission report: {900000.0: Counter({'Total': 13}), 90000.0: Counter({'Total': 439}), 130000.0: Counter({'Total': 1987}), 50085000.0: Counter({'Total': 9}), 80000.0: Counter({'Total': 1152}), 50130000.0: Counter({'Total': 10}), 50090000.0: Counter({'Total': 30}), 85000.0: Counter({'Total': 26375})}
2017-04-19 04:42:45,552:139680142771968:INFO:JobSubmitterPoller:Have 0 packages to submit.
2017-04-19 04:42:45,552:139680142771968:INFO:JobSubmitterPoller:Have 0 jobs to submit.
2017-04-19 04:42:45,552:139680142771968:INFO:JobSubmitterPoller:Done assigning site locations.
2017-04-19 04:44:46,888:139680142771968:INFO:JobSubmitterPoller:Refreshing priority cache with currently 30015 jobs
2017-04-19 04:44:46,889:139680142771968:INFO:JobSubmitterPoller:Skipping cache update to be submitted. (30015 job in cache)
2017-04-19 04:44:46,889:139680142771968:INFO:JobSubmitterPoller:Determining possible sites for new jobs...
2017-04-19 04:44:46,889:139680142771968:INFO:JobSubmitterPoller:Done pruning killed jobs, moving on to submit.
2017-04-19 04:44:48,297:139680142771968:INFO:JobSubmitterPoller:Site submission report: {'T1_IT_CNAF': Counter({'NoPendingSlot': 2301}), 'T2_FR_CCIN2P3': Counter({'NoPendingSlot': 2773}), 'T1_DE_KIT': Counter({'NoPendingSlot': 2119}), 'T2_CH_CERN_HLT': Counter({'NoPendingSlot': 3299}), 'T1_ES_PIC': Counter({'NoPendingSlot': 3117}), 'T1_FR_CCIN2P3': Counter({'NoPendingSlot': 1968}), 'T2_US_Florida': Counter({'NoPendingSlot': 2869}), 'T2_FR_GRIF_IRFU': Counter({'NoPendingSlot': 2010}), 'T2_US_Purdue': Counter({'NoPendingSlot': 4363}), u'T0_CH_CERN': Counter({'NoPendingSlot': 4625}), 'T2_UK_London_IC': Counter({'NoPendingSlot': 1989}), 'T2_FR_GRIF_LLR': Counter({'NoPendingSlot': 1961}), 'T2_US_Nebraska': Counter({'NoPendingSlot': 2350}), 'T2_IT_Legnaro': Counter({'NoPendingSlot': 1964}), 'T2_US_UCSD': Counter({'NoPendingSlot': 1971}), 'T2_ES_CIEMAT': Counter({'NoPendingSlot': 2551}), 'T2_FR_IPHC': Counter({'NoPendingSlot': 1971}), 'T1_RU_JINR': Counter({'NoPendingSlot': 3500}), u'T2_US_Wisconsin': Counter({'NoPendingSlot': 1947}), u'T2_DE_DESY': Counter({'NoPendingSlot': 1155}), u'T1_UK_RAL': Counter({'NoPendingSlot': 2834}), 'T1_US_FNAL': Counter({'NoPendingSlot': 10130}), 'T2_US_MIT': Counter({'NoPendingSlot': 4796}), 'T2_DE_RWTH': Counter({'NoPendingSlot': 4435}), 'T2_US_Caltech': Counter({'NoPendingSlot': 2784}), 'T2_UK_London_Brunel': Counter({'NoPendingSlot': 2006}), 'T2_BE_IIHE': Counter({'NoPendingSlot': 813}), 'T2_IT_Pisa': Counter({'NoPendingSlot': 1966}), 'T2_CH_CERN': Counter({'NoPendingSlot': 3313})}
2017-04-19 04:44:48,298:139680142771968:INFO:JobSubmitterPoller:Priority submission report: {900000.0: Counter({'Total': 13}), 90000.0: Counter({'Total': 439}), 130000.0: Counter({'Total': 1987}), 50085000.0: Counter({'Total': 9}), 80000.0: Counter({'Total': 1152}), 50130000.0: Counter({'Total': 10}), 50090000.0: Counter({'Total': 30}), 85000.0: Counter({'Total': 26375})}
2017-04-19 04:44:48,299:139680142771968:INFO:JobSubmitterPoller:Have 0 packages to submit.
2017-04-19 04:44:48,299:139680142771968:INFO:JobSubmitterPoller:Have 0 jobs to submit.
2017-04-19 04:44:48,299:139680142771968:INFO:JobSubmitterPoller:Done assigning site locations.
2017-04-19 04:46:48,900:139680142771968:INFO:JobSubmitterPoller:Refreshing priority cache with currently 30015 jobs
2017-04-19 04:46:49,910:139680142771968:INFO:JobSubmitterPoller:Skipping cache update to be submitted. (30015 job in cache)
2017-04-19 04:46:49,911:139680142771968:INFO:JobSubmitterPoller:Determining possible sites for new jobs...
2017-04-19 04:46:49,911:139680142771968:INFO:JobSubmitterPoller:Done pruning killed jobs, moving on to submit.
2017-04-19 04:47:26,023:139680142771968:INFO:JobSubmitterPoller:Site submission report: {'T1_IT_CNAF': Counter({'NoPendingSlot': 261, 'submitted': 3}), 'T2_FR_CCIN2P3': Counter({'NoPendingSlot': 604, 'submitted': 126}), 'T1_DE_KIT': Counter({'NoPendingSlot': 38, 'submitted': 37}), 'T2_CH_CERN_HLT': Counter({'NoPendingSlot': 628, 'submitted': 203}), 'T1_FR_CCIN2P3': Counter({'submitted': 3}), 'T2_US_Florida': Counter({'NoPendingSlot': 761, 'submitted': 15}), 'T2_FR_GRIF_IRFU': Counter({'submitted': 15}), 'T2_US_Purdue': Counter({'NoPendingSlot': 1470, 'submitted': 292}), 'T0_CH_CERN': Counter({'NoPendingSlot': 1807, 'submitted': 41}), 'T2_UK_London_IC': Counter({'submitted': 15}), 'T2_FR_GRIF_LLR': Counter({'submitted': 15}), 'T2_US_Nebraska': Counter({'submitted': 27}), 'T2_IT_Legnaro': Counter({'submitted': 1}), 'T2_US_UCSD': Counter({'submitted': 6}), 'T2_ES_CIEMAT': Counter({'NoPendingSlot': 551, 'submitted': 1}), 'T2_FR_IPHC': Counter({'submitted': 3}), 'T1_RU_JINR': Counter({'NoPendingSlot': 1350, 'submitted': 25}), 'T1_ES_PIC': Counter({'NoPendingSlot': 1125, 'submitted': 3}), 'T1_UK_RAL': Counter({'NoPendingSlot': 538, 'submitted': 4}), 'T1_US_FNAL': Counter({'NoPendingSlot': 7739, 'submitted': 25}), 'T2_US_MIT': Counter({'NoPendingSlot': 2497, 'submitted': 40}), 'T2_DE_RWTH': Counter({'NoPendingSlot': 2401, 'submitted': 12}), 'T2_US_Caltech': Counter({'NoPendingSlot': 697, 'submitted': 6}), 'T2_UK_London_Brunel': Counter({'submitted': 43}), 'T2_BE_IIHE': Counter({'submitted': 16}), 'T2_IT_Pisa': Counter({'submitted': 17}), 'T2_CH_CERN': Counter({'NoPendingSlot': 636, 'submitted': 6})}
2017-04-19 04:47:26,993:139680142771968:INFO:JobSubmitterPoller:Priority submission report: {900000.0: Counter({'Total': 13, 'submitted': 13}), 50090000.0: Counter({'Total': 30, 'submitted': 23}), 90000.0: Counter({'Total': 439, 'submitted': 45}), 130000.0: Counter({'Total': 1987, 'submitted': 15}), 85000.0: Counter({'Total': 20983, 'submitted': 886}), 50085000.0: Counter({'Total': 9, 'submitted': 8}), 50130000.0: Counter({'Total': 10, 'submitted': 10})}
2017-04-19 04:47:26,994:139680142771968:INFO:JobSubmitterPoller:Have 43 packages to submit.
2017-04-19 04:47:26,994:139680142771968:INFO:JobSubmitterPoller:Have 1000 jobs to submit.
2017-04-19 04:47:26,994:139680142771968:INFO:JobSubmitterPoller:Done assigning site locations.