vocms0310.cern.ch JobSubmitter #7418
Comments
Scarlet, I see no crashes in JobSubmitter on vocms0310. Did you want to report another node/component?
I'm updating the subject because I'm pretty sure you meant to report vocms0130. These failures came from the condor schedd instabilities. The glidein team should have alarms in place for these cases (maybe not an actual alarm, but monitoring in gwmsmon). I can offer you two short-term improvements: I know you're probably going to ask for an alarm here. I insist the alarm belongs on the glidein/global pool side.
I do not think this is exactly related, but remember I said that we had an issue where JobSubmitter was not reporting to WMstats? I think you said it was a schedd issue. It has not been happening often, but it has started happening again on vocms0304. I do not remember what came out of it, and I keep forgetting to copy the error; I am sure this will happen again. Jean Roch wants me to put the agent in drain, which I will do now. I am not sure whether this will keep happening or start happening on another machine. Can you have a look at this agent? @amaltaro
Scarlet, are you sure it's vocms0304 and the JobSubmitter component? I don't see a single crash in the last week, though there are several schedd unresponsive errors in the logs
which I can confirm from the numerous HTCondor alerts I received over the holidays...
When I saw this I thought, what holiday? But yes, I forgot CERN was on holiday. Yes, I think this is exactly related. I feel like last time I talked to you, you did something and this issue just disappeared. I am not sure what you did, or whether you actually did anything, but I figured I would give it a try again and see what happens! I think it is a schedd issue. It does not report an error like normal: the component goes down and has no file associated with it, but I do not think an error is seen. More details are in the page linked above.
303 and 309. Here are their logs, 303 first, then 309:
These logs look completely normal to me. If it's about JobSubmitter getting stuck while submitting jobs, then I may have an idea on how to handle that. I'll try to work on it tomorrow.
Closing this: a timeout and retry for the clone job have been added.
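For readers following along, the "timeout and retry" idea mentioned in the closing comment can be sketched roughly as below. This is only an illustration in Python 3 using the standard `subprocess` module, not the actual WMCore change; the function name `run_with_retry` and its `retries` and `pause` parameters are made up for this example.

```python
import subprocess
import sys
import time


def run_with_retry(cmd, timeout, retries=2, pause=1.0):
    """Run cmd with a per-attempt timeout, retrying if an attempt times out.

    Hypothetical helper for illustration only; not part of WMCore.
    Returns (stdout, stderr, returncode) of the first attempt that finishes.
    """
    last_error = None
    for attempt in range(retries + 1):
        try:
            # subprocess.run kills the child and raises TimeoutExpired
            # if the command does not finish within `timeout` seconds.
            result = subprocess.run(cmd, capture_output=True, text=True,
                                    timeout=timeout)
            return result.stdout, result.stderr, result.returncode
        except subprocess.TimeoutExpired as exc:
            last_error = exc
            if attempt < retries:
                time.sleep(pause)  # short pause before the next attempt
    raise RuntimeError("command timed out %d times: %r"
                       % (retries + 1, cmd)) from last_error
```

A call such as `run_with_retry(["condor_submit", "submit.jdl"], timeout=400)` (with a placeholder JDL path) would then give an unresponsive schedd a couple more chances before the error reaches the component.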
I think Jean Roch and I have restarted the same component twice, although I cannot be 100% sure because elog does not seem to be working.
Here is the error:
File "/data/srv/wmagent/v1.0.21.patch1/sw/slc6_amd64_gcc493/cms/wmagent/1.0.21.patch1/lib/python2.7/site-packages/WMCore/BossAir/Plugins/PyCondorPlugin.py", line 84, in submitWorker
stdout, stderr, returnCode = SubprocessAlgos.runCommand(cmd=command, shell=True, timeout=timeout)
File "/data/srv/wmagent/v1.0.21.patch1/sw/slc6_amd64_gcc493/cms/wmagent/1.0.21.patch1/lib/python2.7/site-packages/WMCore/Algorithms/SubprocessAlgos.py", line 118, in runCommand
raise SubprocessAlgoException(msg)
SubprocessAlgoException: SubprocessAlgoException
Message: Alarm sounded while running command after 400 seconds.
Command: condor_submit /data/srv/wmagent/v1.0.21.patch1/install/wmagent/JobSubmitter/submit_1418300_2772682.jdl
Raising exception
ModuleName : WMCore.Algorithms.SubprocessAlgos
MethodName : runCommand
ClassInstance : None
FileName : /data/srv/wmagent/v1.0.21.patch1/sw/slc6_amd64_gcc493/cms/wmagent/1.0.21.patch1/lib/python2.7/site-packages/WMCore/Algorithms/SubprocessAlgos.py
ClassName : None
LineNumber : 118
ErrorNr : 0
Traceback:
File "/data/srv/wmagent/v1.0.21.patch1/sw/slc6_amd64_gcc493/cms/wmagent/1.0.21.patch1/lib/python2.7/site-packages/WMCore/Algorithms/SubprocessAlgos.py", line 111, in runCommand
stdout, stderr = pipe.communicate()
File "/data/srv/wmagent/v1.0.21.patch1/sw/slc6_amd64_gcc493/external/python/2.7.11-comp/lib/python2.7/subprocess.py", line 799, in communicate
return self._communicate(input)
File "/data/srv/wmagent/v1.0.21.patch1/sw/slc6_amd64_gcc493/external/python/2.7.11-comp/lib/python2.7/subprocess.py", line 1409, in _communicate
stdout, stderr = self._communicate_with_poll(input)
File "/data/srv/wmagent/v1.0.21.patch1/sw/slc6_amd64_gcc493/external/python/2.7.11-comp/lib/python2.7/subprocess.py", line 1463, in _communicate_with_poll
ready = poller.poll()
File "/data/srv/wmagent/v1.0.21.patch1/sw/slc6_amd64_gcc493/cms/wmagent/1.0.21.patch1/lib/python2.7/site-packages/WMCore/Algorithms/Alarm.py", line 39, in alarmHandler
raise Alarm
2016-11-23 17:48:37,731:140302358836992:ERROR:SubprocessAlgos:Alarm sounded while running command after 400 seconds.
Command: condor_submit /data/srv/wmagent/v1.0.21.patch1/install/wmagent/JobSubmitter/submit_1418300_2723271.jdl
Raising exception
2016-11-23 17:48:37,734:140302358836992:ERROR:PyCondorPlugin:Critical error in subprocess while submitting to condorSubprocessAlgoException
Message: Alarm sounded while running command after 400 seconds.
Command: condor_submit /data/srv/wmagent/v1.0.21.patch1/install/wmagent/JobSubmitter/submit_1418300_2723271.jdl
Raising exception
ModuleName : WMCore.Algorithms.SubprocessAlgos
MethodName : runCommand
ClassInstance : None
FileName : /data/srv/wmagent/v1.0.21.patch1/sw/slc6_amd64_gcc493/cms/wmagent/1.0.21.patch1/lib/python2.7/site-packages/WMCore/Algorithms/SubprocessAlgos.py
ClassName : None
LineNumber : 118
ErrorNr : 0
Traceback:
File "/data/srv/wmagent/v1.0.21.patch1/sw/slc6_amd64_gcc493/cms/wmagent/1.0.21.patch1/lib/python2.7/site-packages/WMCore/Algorithms/SubprocessAlgos.py", line 111, in runCommand
stdout, stderr = pipe.communicate()
File "/data/srv/wmagent/v1.0.21.patch1/sw/slc6_amd64_gcc493/external/python/2.7.11-comp/lib/python2.7/subprocess.py", line 799, in communicate
return self._communicate(input)
File "/data/srv/wmagent/v1.0.21.patch1/sw/slc6_amd64_gcc493/external/python/2.7.11-comp/lib/python2.7/subprocess.py", line 1409, in _communicate
stdout, stderr = self._communicate_with_poll(input)
File "/data/srv/wmagent/v1.0.21.patch1/sw/slc6_amd64_gcc493/external/python/2.7.11-comp/lib/python2.7/subprocess.py", line 1463, in _communicate_with_poll
ready = poller.poll()
File "/data/srv/wmagent/v1.0.21.patch1/sw/slc6_amd64_gcc493/cms/wmagent/1.0.21.patch1/lib/python2.7/site-packages/WMCore/Algorithms/Alarm.py", line 39, in alarmHandler
raise Alarm
Traceback (most recent call last):
File "/data/srv/wmagent/v1.0.21.patch1/sw/slc6_amd64_gcc493/cms/wmagent/1.0.21.patch1/lib/python2.7/site-packages/WMCore/BossAir/Plugins/PyCondorPlugin.py", line 84, in submitWorker
stdout, stderr, returnCode = SubprocessAlgos.runCommand(cmd=command, shell=True, timeout=timeout)
File "/data/srv/wmagent/v1.0.21.patch1/sw/slc6_amd64_gcc493/cms/wmagent/1.0.21.patch1/lib/python2.7/site-packages/WMCore/Algorithms/SubprocessAlgos.py", line 118, in runCommand
raise SubprocessAlgoException(msg)
SubprocessAlgoException: SubprocessAlgoException
Message: Alarm sounded while running command after 400 seconds.
Command: condor_submit /data/srv/wmagent/v1.0.21.patch1/install/wmagent/JobSubmitter/submit_1418300_2723271.jdl
Raising exception
ModuleName : WMCore.Algorithms.SubprocessAlgos
MethodName : runCommand
ClassInstance : None
FileName : /data/srv/wmagent/v1.0.21.patch1/sw/slc6_amd64_gcc493/cms/wmagent/1.0.21.patch1/lib/python2.7/site-packages/WMCore/Algorithms/SubprocessAlgos.py
ClassName : None
LineNumber : 118
ErrorNr : 0
Please check this machine. It is green now, but since it has already been restarted twice today I am worried it will have an issue again soon.
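As background on the error itself: the traceback shows an `Alarm` exception raised from a signal handler while `communicate()` was still waiting on `condor_submit`, which is what a SIGALRM-based timeout looks like. The snippet below is a minimal sketch of that general technique, assuming a Unix system and the main thread; it is not the code in `WMCore/Algorithms/Alarm.py`, and the names `alarm_handler` and `call_with_alarm` are invented for this illustration.

```python
import signal
import time


class Alarm(Exception):
    """Raised by the signal handler when the timer expires."""


def alarm_handler(signum, frame):
    # Runs in the main thread and interrupts whatever call is blocking there.
    raise Alarm


def call_with_alarm(func, seconds):
    """Call func(); raise Alarm if it is still running after `seconds`.

    Illustrative sketch only: works on Unix and only in the main thread.
    """
    previous = signal.signal(signal.SIGALRM, alarm_handler)
    signal.setitimer(signal.ITIMER_REAL, seconds)
    try:
        return func()
    finally:
        signal.setitimer(signal.ITIMER_REAL, 0)   # cancel any pending timer
        signal.signal(signal.SIGALRM, previous)   # restore the old handler
```

With this pattern, a command that blocks on an unresponsive schedd is interrupted once the timer fires, which matches the "Alarm sounded while running command after 400 seconds" message in the log.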