This repository has been archived by the owner on Jan 10, 2019. It is now read-only.

Cassandra tasks getting LOST status in mesos #121

Closed
AndriiOmelianenko opened this issue Jul 17, 2015 · 9 comments

@AndriiOmelianenko

I have deployed a DCOS cluster and installed cassandra and spark on it.
I'm running a spark job on one of the masters:

dcos spark run --submit-args='--class org.apache.spark.examples.SparkPi http://downloads.mesosphere.com.s3.amazonaws.com/assets/spark/spark-examples_2.10-1.4.0-SNAPSHOT.jar 10'

After it finishes execution, a few cassandra executors fail. In mesos it looks like this:

ID                                      Name                                              State Started Stopped Host
driver-20150714085321-0003              Driver for org.apache.spark.examples.SparkPi    FINISHED    2015-07-14T11:53:25+0300    2015-07-14T11:53:35+0300    dcos-slave-03.novalocal 
cassandra.dcos.node.0.executor.server   cassandra.dcos.node                                LOST 2015-07-14T11:30:45+0300    2015-07-14T11:53:40+0300    dcos-slave-01.novalocal 
cassandra.dcos.node.0.executor          cassandra.dcos.node.0.executor                 LOST 2015-07-14T11:30:43+0300    2015-07-14T11:53:40+0300    dcos-slave-01.novalocal

The Spark job itself ran successfully (stdout):

Registered executor on dcos-slave-02.novalocal
Starting task driver-20150714092001-0005
/bin/sh -c exit `docker wait mesos-80c81d68-6b06-47aa-92b2-70f908b09201` 
Forked command at 26770
Pi is roughly 3.141704
Command exited with status 0 (pid: 26770)

Some cassandra executors can't even come back up after this and keep getting LOST status every few seconds, with the following stderr:

/opt/mesosphere/packages/mesos--5018921cbb873aea2a0db00a407d77a8de419f63/libexec/mesos/mesos-fetcher: /lib64/libcurl.so.4: no version information available (required by /opt/mesosphere/packages/mesos--5018921cbb873aea2a0db00a407d77a8de419f63/libexec/mesos/mesos-fetcher)
I0714 09:22:28.348204 27718 logging.cpp:172] INFO level logging started!
I0714 09:22:28.349845 27718 fetcher.cpp:214] Fetching URI 'http://<dcos-slave-01 IP address>:10000/jre-7-linux.tar.gz'
I0714 09:22:28.349892 27718 fetcher.cpp:125] Fetching URI 'http://<dcos-slave-01 IP address>:10000/jre-7-linux.tar.gz' with os::net
I0714 09:22:28.349912 27718 fetcher.cpp:135] Downloading 'http://<dcos-slave-01 IP address>:10000/jre-7-linux.tar.gz' to '/var/lib/mesos/slave/slaves/20150713-143901-787031981-5050-14056-S6/frameworks/20150713-143901-787031981-5050-14056-0008/executors/cassandra.dcos.node.2.executor/runs/ee26d1d9-9bac-4f1d-9cb2-cc566f5faa99/jre-7-linux.tar.gz'
E0714 09:22:28.350571 27718 fetcher.cpp:138] Error downloading resource: Couldn't connect to server
Failed to fetch: http://<dcos-slave-01 IP address>:10000/jre-7-linux.tar.gz
Failed to synchronize with slave (it's probably exited)
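
For reference, the fetch error above just means the slave could not connect to the HTTP server on port 10000 (presumably the Cassandra framework's artifact server). A minimal connectivity check from the affected slave, assuming curl is available and substituting the real address from the log:

curl -sI 'http://<dcos-slave-01 IP address>:10000/jre-7-linux.tar.gz'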

Can anyone help me with this?

@BenWhitehead
Contributor

Thanks @AndriiOmelianenko, I'll try to reproduce this in the next couple of days and get back to you. I'm not sure why a spark job would kill off one of the cassandra executors.

@AndriiOmelianenko
Author

@BenWhitehead thanks.
Also, I noticed that it's not only the cassandra tasks that get this status; the hdfs and kafka tasks do too. Some tasks receive LOST status and some FAILED.

I'm running the DCOS cluster in an OpenStack environment, and everything works fine until I run a Spark job :)

@AndriiOmelianenko
Author

The logs of the cassandra/hdfs/kafka frameworks don't say anything useful; they end with a successful Framework registered with 20150727-115610

Here is /var/log/messages from the first of the slaves:

Jul 27 12:24:41 localhost mesos-slave: I0727 12:24:41.135936 11877 slave.cpp:2709] Sending acknowledgement for status update TASK_RUNNING (UUID: 5c56d66b-df19-4bbe-82a8-ab112d779b50) for task 93 of framework 20150727-115610-2703018762-5050-13855-0007 to executor(1)@<slave-01-IP>:42072
Jul 27 12:24:41 localhost mesos-slave: I0727 12:24:41.138185 11876 slave.cpp:2531] Handling status update TASK_FINISHED (UUID: 28ea5797-03ab-4b27-bdce-efa882fa93d5) for task 93 of framework 20150727-115610-2703018762-5050-13855-0007 from executor(1)@<slave-01-IP>:42072
Jul 27 12:24:41 localhost mesos-slave: I0727 12:24:41.138767 11876 docker.cpp:1009] Updated 'cpu.shares' to 1024 at /sys/fs/cgroup/cpu,cpuacct/system.slice/docker-3f0c95b0ba0bdad1fb92821c33a13a7ef4b6fb4305d080b1711dab66c668ea0f.scope for container 4354a03a-e7ff-45c9-855c-7fcf813002e5
Jul 27 12:24:41 localhost mesos-slave: I0727 12:24:41.139292 11876 docker.cpp:1044] Updated 'memory.soft_limit_in_bytes' to 896MB for container 4354a03a-e7ff-45c9-855c-7fcf813002e5
Jul 27 12:24:41 localhost mesos-slave: I0727 12:24:41.140612 11881 status_update_manager.cpp:317] Received status update TASK_FINISHED (UUID: 28ea5797-03ab-4b27-bdce-efa882fa93d5) for task 93 of framework 20150727-115610-2703018762-5050-13855-0007
Jul 27 12:24:41 localhost mesos-slave: I0727 12:24:41.140771 11876 slave.cpp:2709] Sending acknowledgement for status update TASK_FINISHED (UUID: 28ea5797-03ab-4b27-bdce-efa882fa93d5) for task 93 of framework 20150727-115610-2703018762-5050-13855-0007 to executor(1)@<slave-01-IP>:42072
Jul 27 12:24:41 localhost mesos-slave: I0727 12:24:41.144865 11878 status_update_manager.cpp:389] Received status update acknowledgement (UUID: 5c56d66b-df19-4bbe-82a8-ab112d779b50) for task 93 of framework 20150727-115610-2703018762-5050-13855-0007
Jul 27 12:24:41 localhost mesos-slave: I0727 12:24:41.145045 11878 slave.cpp:2776] Forwarding the update TASK_FINISHED (UUID: 28ea5797-03ab-4b27-bdce-efa882fa93d5) for task 93 of framework 20150727-115610-2703018762-5050-13855-0007 to master@<master-01-IP>:5050
Jul 27 12:24:41 localhost mesos-slave: I0727 12:24:41.150296 11881 status_update_manager.cpp:389] Received status update acknowledgement (UUID: 28ea5797-03ab-4b27-bdce-efa882fa93d5) for task 93 of framework 20150727-115610-2703018762-5050-13855-0007
Jul 27 12:24:54 localhost mesos-slave: I0727 12:24:54.291115 11875 slave.cpp:1768] Asked to shut down framework 20150727-115610-2703018762-5050-13855-0007 by master@<master-01-IP>:5050
Jul 27 12:24:54 localhost mesos-slave: I0727 12:24:54.291427 11875 slave.cpp:1793] Shutting down framework 20150727-115610-2703018762-5050-13855-0007
Jul 27 12:24:54 localhost mesos-slave: I0727 12:24:54.292618 11875 slave.cpp:3473] Shutting down executor '20150727-115610-2703018762-5050-13855-S2' of framework 20150727-115610-2703018762-5050-13855-0007
Jul 27 12:24:59 localhost mesos-slave: I0727 12:24:59.294889 11880 slave.cpp:3543] Killing executor '20150727-115610-2703018762-5050-13855-S2' of framework 20150727-115610-2703018762-5050-13855-0007
Jul 27 12:24:59 localhost mesos-slave: I0727 12:24:59.295073 11880 docker.cpp:1212] Destroying container '4354a03a-e7ff-45c9-855c-7fcf813002e5'
Jul 27 12:24:59 localhost mesos-slave: I0727 12:24:59.295524 11880 docker.cpp:1274] Sending SIGTERM to executor with pid: 13791
Jul 27 12:24:59 localhost mesos-slave: I0727 12:24:59.323859 11880 docker.cpp:1319] Running docker stop on container '4354a03a-e7ff-45c9-855c-7fcf813002e5'
Jul 27 12:24:59 localhost mesos-slave: I0727 12:24:59.337043 11881 docker.cpp:1404] Executor for container '4354a03a-e7ff-45c9-855c-7fcf813002e5' has exited
Jul 27 12:24:59 localhost docker: time="2015-07-27T12:24:59Z" level=info msg="POST /v1.18/containers/mesos-4354a03a-e7ff-45c9-855c-7fcf813002e5/stop?t=0"
Jul 27 12:24:59 localhost docker: time="2015-07-27T12:24:59Z" level=info msg="+job stop(mesos-4354a03a-e7ff-45c9-855c-7fcf813002e5)"
Jul 27 12:24:59 localhost docker: time="2015-07-27T12:24:59Z" level=info msg="Failed to send SIGTERM to the process, force killing"
Jul 27 12:24:59 localhost docker: Cannot stop container mesos-4354a03a-e7ff-45c9-855c-7fcf813002e5: [2] Container does not exist: container destroyed
Jul 27 12:24:59 localhost docker: time="2015-07-27T12:24:59Z" level=info msg="-job stop(mesos-4354a03a-e7ff-45c9-855c-7fcf813002e5) = ERR (1)"
Jul 27 12:24:59 localhost docker: time="2015-07-27T12:24:59Z" level=error msg="Handler for POST /containers/{name:.*}/stop returned error: Cannot stop container mesos-4354a03a-e7ff-45c9-855c-7fcf813002e5: [2] Container does not exist: container destroyed"
Jul 27 12:24:59 localhost docker: time="2015-07-27T12:24:59Z" level=error msg="HTTP Error: statusCode=500 Cannot stop container mesos-4354a03a-e7ff-45c9-855c-7fcf813002e5: [2] Container does not exist: container destroyed"
Jul 27 12:24:59 localhost docker: time="2015-07-27T12:24:59Z" level=info msg="+job log(die, 3f0c95b0ba0bdad1fb92821c33a13a7ef4b6fb4305d080b1711dab66c668ea0f, docker.io/mesosphere/spark:1.4.0-rc4-hdfs)"
Jul 27 12:24:59 localhost docker: time="2015-07-27T12:24:59Z" level=info msg="-job log(die, 3f0c95b0ba0bdad1fb92821c33a13a7ef4b6fb4305d080b1711dab66c668ea0f, docker.io/mesosphere/spark:1.4.0-rc4-hdfs) = OK (0)"
Jul 27 12:24:59 localhost docker: time="2015-07-27T12:24:59Z" level=info msg="-job logs(mesos-4354a03a-e7ff-45c9-855c-7fcf813002e5) = OK (0)"
Jul 27 12:24:59 localhost mesos-slave: E0727 12:24:59.438140 11874 slave.cpp:3207] Termination of executor '20150727-115610-2703018762-5050-13855-S2' of framework '20150727-115610-2703018762-5050-13855-0007' failed: Failed to kill the Docker container: Failed to 'docker stop -t 0 mesos-4354a03a-e7ff-45c9-855c-7fcf813002e5': exit status = exited with status 1 stderr = Error response from daemon: Cannot stop container mesos-4354a03a-e7ff-45c9-855c-7fcf813002e5: [2] Container does not exist: container destroyed
Jul 27 12:24:59 localhost mesos-slave: time="2015-07-27T12:24:59Z" level=fatal msg="Error: failed to stop one or more containers"
Jul 27 12:24:59 localhost mesos-slave: I0727 12:24:59.438586 11874 slave.cpp:3332] Cleaning up executor '20150727-115610-2703018762-5050-13855-S2' of framework 20150727-115610-2703018762-5050-13855-0007
Jul 27 12:24:59 localhost mesos-slave: I0727 12:24:59.439330 11874 slave.cpp:3411] Cleaning up framework 20150727-115610-2703018762-5050-13855-0007

mesos-slave.ERROR of slave-01

Log file created at: 2015/07/27 12:24:59
Running on machine: slave-01
Log line format: [IWEF]mmdd hh:mm:ss.uuuuuu threadid file:line] msg
E0727 12:24:59.438140 11874 slave.cpp:3207] Termination of executor '20150727-115610-2703018762-5050-13855-S2' of framework '20150727-115610-2703018762-5050-13855-0007' failed: Failed to kill the Docker container: Failed to 'docker stop -t 0 mesos-4354a03a-e7ff-45c9-855c-7fcf813002e5': exit status = exited with status 1 stderr = Error response from daemon: Cannot stop container mesos-4354a03a-e7ff-45c9-855c-7fcf813002e5: [2] Container does not exist: container destroyed
time="2015-07-27T12:24:59Z" level=fatal msg="Error: failed to stop one or more containers"

mesos-slave.WARNING of slave-01

Log file created at: 2015/07/27 12:06:13
Running on machine: slave-01
Log line format: [IWEF]mmdd hh:mm:ss.uuuuuu threadid file:line] msg
W0727 12:06:13.638355 11875 mem.cpp:215] Cgroup root for memory disappeared!
E0727 12:24:59.438140 11874 slave.cpp:3207] Termination of executor '20150727-115610-2703018762-5050-13855-S2' of framework '20150727-115610-2703018762-5050-13855-0007' failed: Failed to kill the Docker container: Failed to 'docker stop -t 0 mesos-4354a03a-e7ff-45c9-855c-7fcf813002e5': exit status = exited with status 1 stderr = Error response from daemon: Cannot stop container mesos-4354a03a-e7ff-45c9-855c-7fcf813002e5: [2] Container does not exist: container destroyed
time="2015-07-27T12:24:59Z" level=fatal msg="Error: failed to stop one or more containers"
W0727 12:25:14.287612 11877 slave.cpp:1934] Ignoring updating pid for framework 20150727-115610-2703018762-5050-13855-0002 because it does not exist

@AndriiOmelianenko
Author

/var/log/messages of slave-02. There is also info about a kafka task that this node received after the failure.

Jul 27 12:24:40 localhost mesos-slave: I0727 12:24:40.158848 11884 slave.cpp:2776] Forwarding the update TASK_FINISHED (UUID: fbf048a7-5cb7-4f22-8dc1-84efe96c0c2c) for task 92 of framework 20150727-115610-2703018762-5050-13855-0007 to master@<master-IP>:5050
Jul 27 12:24:40 localhost mesos-slave: I0727 12:24:40.159279 11884 slave.cpp:2709] Sending acknowledgement for status update TASK_FINISHED (UUID: fbf048a7-5cb7-4f22-8dc1-84efe96c0c2c) for task 92 of framework 20150727-115610-2703018762-5050-13855-0007 to executor(1)@<slave-02-IP>:44637
Jul 27 12:24:40 localhost mesos-slave: I0727 12:24:40.163899 11883 status_update_manager.cpp:389] Received status update acknowledgement (UUID: fbf048a7-5cb7-4f22-8dc1-84efe96c0c2c) for task 92 of framework 20150727-115610-2703018762-5050-13855-0007
Jul 27 12:24:54 localhost mesos-slave: I0727 12:24:54.290335 11881 slave.cpp:1768] Asked to shut down framework 20150727-115610-2703018762-5050-13855-0007 by master@<master-IP>:5050
Jul 27 12:24:54 localhost mesos-slave: I0727 12:24:54.290784 11881 slave.cpp:1793] Shutting down framework 20150727-115610-2703018762-5050-13855-0007
Jul 27 12:24:54 localhost mesos-slave: I0727 12:24:54.292229 11881 slave.cpp:3473] Shutting down executor '20150727-115610-2703018762-5050-13855-S4' of framework 20150727-115610-2703018762-5050-13855-0007
Jul 27 12:24:54 localhost docker: time="2015-07-27T12:24:54Z" level=info msg="+job log(die, 705ef82257a059c1e5a3de7b053b1acd7382472619b8fbf1364541c30592572c, docker.io/mesosphere/spark:1.4.0-rc4-hdfs)"
Jul 27 12:24:54 localhost docker: time="2015-07-27T12:24:54Z" level=info msg="-job log(die, 705ef82257a059c1e5a3de7b053b1acd7382472619b8fbf1364541c30592572c, docker.io/mesosphere/spark:1.4.0-rc4-hdfs) = OK (0)"
Jul 27 12:24:54 localhost docker: time="2015-07-27T12:24:54Z" level=info msg="-job logs(mesos-c0fbadf6-c823-4780-bb98-05ef48eeb863) = OK (0)"
Jul 27 12:24:55 localhost docker: time="2015-07-27T12:24:55Z" level=info msg="-job wait(mesos-c0fbadf6-c823-4780-bb98-05ef48eeb863) = OK (0)"
Jul 27 12:24:55 localhost docker: time="2015-07-27T12:24:55Z" level=info msg="-job wait(mesos-c0fbadf6-c823-4780-bb98-05ef48eeb863) = OK (0)"
Jul 27 12:24:55 localhost mesos-slave: I0727 12:24:55.166388 11878 slave.cpp:2531] Handling status update TASK_FINISHED (UUID: b1bc757f-6543-4145-80d2-59b50307e680) for task driver-20150727121811-0001 of framework 20150727-115610-2703018762-5050-13855-0001 from executor(1)@<slave-02-IP>:34886
Jul 27 12:24:55 localhost mesos-slave: E0727 12:24:55.167184 11878 slave.cpp:2662] Failed to update resources for container c0fbadf6-c823-4780-bb98-05ef48eeb863 of executor driver-20150727121811-0001 running task driver-20150727121811-0001 on status update for terminal task, destroying container: Failed to determine cgroup for the 'cpu' subsystem: Failed to read /proc/14144/cgroup: Failed to open file '/proc/14144/cgroup': No such file or directory
Jul 27 12:24:55 localhost mesos-slave: I0727 12:24:55.168290 11878 status_update_manager.cpp:317] Received status update TASK_FINISHED (UUID: b1bc757f-6543-4145-80d2-59b50307e680) for task driver-20150727121811-0001 of framework 20150727-115610-2703018762-5050-13855-0001
Jul 27 12:24:55 localhost mesos-slave: I0727 12:24:55.168303 11882 docker.cpp:1212] Destroying container 'c0fbadf6-c823-4780-bb98-05ef48eeb863'
Jul 27 12:24:55 localhost mesos-slave: I0727 12:24:55.170111 11882 docker.cpp:1274] Sending SIGTERM to executor with pid: 14151
Jul 27 12:24:55 localhost mesos-slave: I0727 12:24:55.168619 11878 status_update_manager.hpp:346] Checkpointing UPDATE for status update TASK_FINISHED (UUID: b1bc757f-6543-4145-80d2-59b50307e680) for task driver-20150727121811-0001 of framework 20150727-115610-2703018762-5050-13855-0001
Jul 27 12:24:55 localhost mesos-slave: I0727 12:24:55.193764 11882 docker.cpp:1319] Running docker stop on container 'c0fbadf6-c823-4780-bb98-05ef48eeb863'
Jul 27 12:24:55 localhost mesos-slave: I0727 12:24:55.200083 11878 slave.cpp:2776] Forwarding the update TASK_FINISHED (UUID: b1bc757f-6543-4145-80d2-59b50307e680) for task driver-20150727121811-0001 of framework 20150727-115610-2703018762-5050-13855-0001 to master@<master-IP>:5050
Jul 27 12:24:55 localhost mesos-slave: I0727 12:24:55.204283 11878 slave.cpp:2709] Sending acknowledgement for status update TASK_FINISHED (UUID: b1bc757f-6543-4145-80d2-59b50307e680) for task driver-20150727121811-0001 of framework 20150727-115610-2703018762-5050-13855-0001 to executor(1)@<slave-02-IP>:34886
Jul 27 12:24:55 localhost docker: time="2015-07-27T12:24:55Z" level=info msg="POST /v1.18/containers/mesos-c0fbadf6-c823-4780-bb98-05ef48eeb863/stop?t=0"
Jul 27 12:24:55 localhost docker: time="2015-07-27T12:24:55Z" level=info msg="+job stop(mesos-c0fbadf6-c823-4780-bb98-05ef48eeb863)"
Jul 27 12:24:55 localhost docker: Container already stopped
Jul 27 12:24:55 localhost docker: time="2015-07-27T12:24:55Z" level=info msg="-job stop(mesos-c0fbadf6-c823-4780-bb98-05ef48eeb863) = ERR (1)"
Jul 27 12:24:55 localhost mesos-slave: I0727 12:24:55.217799 11882 status_update_manager.cpp:389] Received status update acknowledgement (UUID: b1bc757f-6543-4145-80d2-59b50307e680) for task driver-20150727121811-0001 of framework 20150727-115610-2703018762-5050-13855-0001
Jul 27 12:24:55 localhost mesos-slave: I0727 12:24:55.217927 11882 status_update_manager.hpp:346] Checkpointing ACK for status update TASK_FINISHED (UUID: b1bc757f-6543-4145-80d2-59b50307e680) for task driver-20150727121811-0001 of framework 20150727-115610-2703018762-5050-13855-0001
Jul 27 12:24:55 localhost mesos-slave: I0727 12:24:55.218967 11883 docker.cpp:1404] Executor for container 'c0fbadf6-c823-4780-bb98-05ef48eeb863' has exited
Jul 27 12:24:55 localhost mesos-slave: I0727 12:24:55.219261 11883 slave.cpp:3223] Executor 'driver-20150727121811-0001' of framework 20150727-115610-2703018762-5050-13855-0001 terminated with signal Terminated
Jul 27 12:24:55 localhost mesos-slave: I0727 12:24:55.241356 11882 slave.cpp:3332] Cleaning up executor 'driver-20150727121811-0001' of framework 20150727-115610-2703018762-5050-13855-0001
Jul 27 12:24:55 localhost mesos-slave: I0727 12:24:55.241719 11883 gc.cpp:56] Scheduling '/var/lib/mesos/slave/slaves/20150727-115610-2703018762-5050-13855-S4/frameworks/20150727-115610-2703018762-5050-13855-0001/executors/driver-20150727121811-0001/runs/c0fbadf6-c823-4780-bb98-05ef48eeb863' for gc 6.99999720283556days in the future
Jul 27 12:24:55 localhost mesos-slave: I0727 12:24:55.241911 11883 gc.cpp:56] Scheduling '/var/lib/mesos/slave/slaves/20150727-115610-2703018762-5050-13855-S4/frameworks/20150727-115610-2703018762-5050-13855-0001/executors/driver-20150727121811-0001' for gc 6.99999720226963days in the future
Jul 27 12:24:55 localhost mesos-slave: I0727 12:24:55.242193 11883 gc.cpp:56] Scheduling '/var/lib/mesos/slave/meta/slaves/20150727-115610-2703018762-5050-13855-S4/frameworks/20150727-115610-2703018762-5050-13855-0001/executors/driver-20150727121811-0001/runs/c0fbadf6-c823-4780-bb98-05ef48eeb863' for gc 6.99999720204741days in the future
Jul 27 12:24:55 localhost mesos-slave: I0727 12:24:55.242468 11883 gc.cpp:56] Scheduling '/var/lib/mesos/slave/meta/slaves/20150727-115610-2703018762-5050-13855-S4/frameworks/20150727-115610-2703018762-5050-13855-0001/executors/driver-20150727121811-0001' for gc 6.9999972018637days in the future
Jul 27 12:24:55 localhost mesos-slave: I0727 12:24:55.241776 11882 slave.cpp:3411] Cleaning up framework 20150727-115610-2703018762-5050-13855-0001
Jul 27 12:24:55 localhost mesos-slave: I0727 12:24:55.243085 11882 status_update_manager.cpp:279] Closing status update streams for framework 20150727-115610-2703018762-5050-13855-0001
Jul 27 12:24:55 localhost mesos-slave: I0727 12:24:55.243144 11885 gc.cpp:56] Scheduling '/var/lib/mesos/slave/slaves/20150727-115610-2703018762-5050-13855-S4/frameworks/20150727-115610-2703018762-5050-13855-0001' for gc 6.99999718685037days in the future
Jul 27 12:24:55 localhost mesos-slave: I0727 12:24:55.243600 11885 gc.cpp:56] Scheduling '/var/lib/mesos/slave/meta/slaves/20150727-115610-2703018762-5050-13855-S4/frameworks/20150727-115610-2703018762-5050-13855-0001' for gc 6.99999718667852days in the future
Jul 27 12:24:59 localhost mesos-slave: I0727 12:24:59.297158 11884 slave.cpp:3543] Killing executor '20150727-115610-2703018762-5050-13855-S4' of framework 20150727-115610-2703018762-5050-13855-0007
Jul 27 12:24:59 localhost mesos-slave: I0727 12:24:59.297485 11884 docker.cpp:1212] Destroying container 'ca301d18-542d-4107-b953-95d924fb2f0b'
Jul 27 12:24:59 localhost mesos-slave: I0727 12:24:59.298411 11884 docker.cpp:1274] Sending SIGTERM to executor with pid: 14774
Jul 27 12:24:59 localhost mesos-slave: I0727 12:24:59.319341 11884 docker.cpp:1319] Running docker stop on container 'ca301d18-542d-4107-b953-95d924fb2f0b'
Jul 27 12:24:59 localhost mesos-slave: I0727 12:24:59.343113 11878 docker.cpp:1404] Executor for container 'ca301d18-542d-4107-b953-95d924fb2f0b' has exited
Jul 27 12:24:59 localhost docker: time="2015-07-27T12:24:59Z" level=info msg="POST /v1.18/containers/mesos-ca301d18-542d-4107-b953-95d924fb2f0b/stop?t=0"
Jul 27 12:24:59 localhost docker: time="2015-07-27T12:24:59Z" level=info msg="+job stop(mesos-ca301d18-542d-4107-b953-95d924fb2f0b)"
Jul 27 12:24:59 localhost docker: time="2015-07-27T12:24:59Z" level=info msg="Failed to send SIGTERM to the process, force killing"
Jul 27 12:24:59 localhost docker: Cannot stop container mesos-ca301d18-542d-4107-b953-95d924fb2f0b: [2] Container does not exist: container destroyed
Jul 27 12:24:59 localhost docker: time="2015-07-27T12:24:59Z" level=info msg="-job stop(mesos-ca301d18-542d-4107-b953-95d924fb2f0b) = ERR (1)"
Jul 27 12:24:59 localhost docker: time="2015-07-27T12:24:59Z" level=error msg="Handler for POST /containers/{name:.*}/stop returned error: Cannot stop container mesos-ca301d18-542d-4107-b953-95d924fb2f0b: [2] Container does not exist: container destroyed"
Jul 27 12:24:59 localhost docker: time="2015-07-27T12:24:59Z" level=error msg="HTTP Error: statusCode=500 Cannot stop container mesos-ca301d18-542d-4107-b953-95d924fb2f0b: [2] Container does not exist: container destroyed"
Jul 27 12:24:59 localhost docker: time="2015-07-27T12:24:59Z" level=info msg="+job log(die, c3d3d7d1d49ba8d708b651bda36d03ead1cf6186922510847f761bf2302a3edd, docker.io/mesosphere/spark:1.4.0-rc4-hdfs)"
Jul 27 12:24:59 localhost docker: time="2015-07-27T12:24:59Z" level=info msg="-job log(die, c3d3d7d1d49ba8d708b651bda36d03ead1cf6186922510847f761bf2302a3edd, docker.io/mesosphere/spark:1.4.0-rc4-hdfs) = OK (0)"
Jul 27 12:24:59 localhost docker: time="2015-07-27T12:24:59Z" level=info msg="-job logs(mesos-ca301d18-542d-4107-b953-95d924fb2f0b) = OK (0)"
Jul 27 12:24:59 localhost mesos-slave: E0727 12:24:59.444246 11884 slave.cpp:3207] Termination of executor '20150727-115610-2703018762-5050-13855-S4' of framework '20150727-115610-2703018762-5050-13855-0007' failed: Failed to kill the Docker container: Failed to 'docker stop -t 0 mesos-ca301d18-542d-4107-b953-95d924fb2f0b': exit status = exited with status 1 stderr = Error response from daemon: Cannot stop container mesos-ca301d18-542d-4107-b953-95d924fb2f0b: [2] Container does not exist: container destroyed
Jul 27 12:24:59 localhost mesos-slave: time="2015-07-27T12:24:59Z" level=fatal msg="Error: failed to stop one or more containers"
Jul 27 12:24:59 localhost mesos-slave: I0727 12:24:59.444730 11884 slave.cpp:3332] Cleaning up executor '20150727-115610-2703018762-5050-13855-S4' of framework 20150727-115610-2703018762-5050-13855-0007
Jul 27 12:24:59 localhost mesos-slave: I0727 12:24:59.446398 11879 gc.cpp:56] Scheduling '/var/lib/mesos/slave/slaves/20150727-115610-2703018762-5050-13855-S4/frameworks/20150727-115610-2703018762-5050-13855-0007/executors/20150727-115610-2703018762-5050-13855-S4/runs/ca301d18-542d-4107-b953-95d924fb2f0b' for gc 6.99999483435259days in the future
Jul 27 12:24:59 localhost mesos-slave: I0727 12:24:59.447106 11879 gc.cpp:56] Scheduling '/var/lib/mesos/slave/slaves/20150727-115610-2703018762-5050-13855-S4/frameworks/20150727-115610-2703018762-5050-13855-0007/executors/20150727-115610-2703018762-5050-13855-S4' for gc 6.99999483328593days in the future
Jul 27 12:24:59 localhost mesos-slave: I0727 12:24:59.446424 11884 slave.cpp:3411] Cleaning up framework 20150727-115610-2703018762-5050-13855-0007
Jul 27 12:24:59 localhost mesos-slave: I0727 12:24:59.448756 11884 status_update_manager.cpp:279] Closing status update streams for framework 20150727-115610-2703018762-5050-13855-0007
Jul 27 12:24:59 localhost mesos-slave: I0727 12:24:59.448973 11885 gc.cpp:56] Scheduling '/var/lib/mesos/slave/slaves/20150727-115610-2703018762-5050-13855-S4/frameworks/20150727-115610-2703018762-5050-13855-0007' for gc 6.9999948065037days in the future
Jul 27 12:24:59 localhost docker: time="2015-07-27T12:24:59Z" level=info msg="-job wait(mesos-ca301d18-542d-4107-b953-95d924fb2f0b) = OK (0)"
Jul 27 12:25:04 localhost mesos-slave: I0727 12:25:04.189869 11878 slave.cpp:1144] Got assigned task kafka.8242e0d1-345a-11e5-95d2-56847afe9799 for framework 20150727-115610-2703018762-5050-13855-0000
Jul 27 12:25:04 localhost mesos-slave: I0727 12:25:04.190600 11878 slave.cpp:1254] Launching task kafka.8242e0d1-345a-11e5-95d2-56847afe9799 for framework 20150727-115610-2703018762-5050-13855-0000
Jul 27 12:25:04 localhost mesos-slave: I0727 12:25:04.200199 11878 slave.cpp:4208] Launching executor kafka.8242e0d1-345a-11e5-95d2-56847afe9799 of framework 20150727-115610-2703018762-5050-13855-0000 in work directory '/var/lib/mesos/slave/slaves/20150727-115610-2703018762-5050-13855-S4/frameworks/20150727-115610-2703018762-5050-13855-0000/executors/kafka.8242e0d1-345a-11e5-95d2-56847afe9799/runs/741f2f6b-030c-4694-bfc1-087972a4a8f8'
Jul 27 12:25:04 localhost mesos-slave: I0727 12:25:04.200897 11885 docker.cpp:598] No container info found, skipping launch
Jul 27 12:25:04 localhost mesos-slave: I0727 12:25:04.200985 11878 slave.cpp:1401] Queuing task 'kafka.8242e0d1-345a-11e5-95d2-56847afe9799' for executor kafka.8242e0d1-345a-11e5-95d2-56847afe9799 of framework '20150727-115610-2703018762-5050-13855-0000
Jul 27 12:25:04 localhost mesos-slave: I0727 12:25:04.201521 11883 containerizer.cpp:484] Starting container '741f2f6b-030c-4694-bfc1-087972a4a8f8' for executor 'kafka.8242e0d1-345a-11e5-95d2-56847afe9799' of framework '20150727-115610-2703018762-5050-13855-0000'
Jul 27 12:25:04 localhost mesos-slave: I0727 12:25:04.205938 11883 cpushare.cpp:389] Updated 'cpu.shares' to 614 (cpus 0.6) for container 741f2f6b-030c-4694-bfc1-087972a4a8f8
Jul 27 12:25:04 localhost mesos-slave: I0727 12:25:04.209156 11883 mem.cpp:494] Started listening for OOM events for container 741f2f6b-030c-4694-bfc1-087972a4a8f8
Jul 27 12:25:04 localhost mesos-slave: I0727 12:25:04.209692 11883 mem.cpp:326] Updated 'memory.soft_limit_in_bytes' to 288MB for container 741f2f6b-030c-4694-bfc1-087972a4a8f8
Jul 27 12:25:04 localhost mesos-slave: I0727 12:25:04.211053 11883 mem.cpp:361] Updated 'memory.limit_in_bytes' to 288MB for container 741f2f6b-030c-4694-bfc1-087972a4a8f8
Jul 27 12:25:04 localhost mesos-slave: I0727 12:25:04.212498 11885 linux_launcher.cpp:213] Cloning child process with flags = 0
Jul 27 12:25:04 localhost mesos-slave: I0727 12:25:04.214762 11885 containerizer.cpp:694] Checkpointing executor's forked pid 15585 to '/var/lib/mesos/slave/meta/slaves/20150727-115610-2703018762-5050-13855-S4/frameworks/20150727-115610-2703018762-5050-13855-0000/executors/kafka.8242e0d1-345a-11e5-95d2-56847afe9799/runs/741f2f6b-030c-4694-bfc1-087972a4a8f8/pids/forked.pid'
Jul 27 12:25:04 localhost mesos-slave: I0727 12:25:04.223273 11883 fetcher.cpp:238] Fetching URIs using command '/opt/mesosphere/packages/mesos--5018921cbb873aea2a0db00a407d77a8de419f63/libexec/mesos/mesos-fetcher'
Jul 27 12:25:13 localhost mesos-slave: I0727 12:25:13.322334 11880 slave.cpp:3165] Monitoring executor 'kafka.8242e0d1-345a-11e5-95d2-56847afe9799' of framework '20150727-115610-2703018762-5050-13855-0000' in container '741f2f6b-030c-4694-bfc1-087972a4a8f8'
Jul 27 12:25:13 localhost mesos-slave: I0727 12:25:13.347836 11880 slave.cpp:2164] Got registration for executor 'kafka.8242e0d1-345a-11e5-95d2-56847afe9799' of framework 20150727-115610-2703018762-5050-13855-0000 from executor(1)@<slave-02-IP>:51938
Jul 27 12:25:13 localhost mesos-slave: I0727 12:25:13.348883 11884 cpushare.cpp:389] Updated 'cpu.shares' to 614 (cpus 0.6) for container 741f2f6b-030c-4694-bfc1-087972a4a8f8
Jul 27 12:25:13 localhost mesos-slave: I0727 12:25:13.349319 11880 mem.cpp:326] Updated 'memory.soft_limit_in_bytes' to 288MB for container 741f2f6b-030c-4694-bfc1-087972a4a8f8
Jul 27 12:25:13 localhost mesos-slave: I0727 12:25:13.350276 11880 slave.cpp:1555] Sending queued task 'kafka.8242e0d1-345a-11e5-95d2-56847afe9799' to executor 'kafka.8242e0d1-345a-11e5-95d2-56847afe9799' of framework 20150727-115610-2703018762-5050-13855-0000
Jul 27 12:25:13 localhost mesos-slave: I0727 12:25:13.353724 11884 slave.cpp:2531] Handling status update TASK_RUNNING (UUID: a840fd35-07b6-47b2-a503-6722a4a6f48f) for task kafka.8242e0d1-345a-11e5-95d2-56847afe9799 of framework 20150727-115610-2703018762-5050-13855-0000 from executor(1)@<slave-02-IP>:51938
Jul 27 12:25:13 localhost mesos-slave: I0727 12:25:13.353943 11884 status_update_manager.cpp:317] Received status update TASK_RUNNING (UUID: a840fd35-07b6-47b2-a503-6722a4a6f48f) for task kafka.8242e0d1-345a-11e5-95d2-56847afe9799 of framework 20150727-115610-2703018762-5050-13855-0000
Jul 27 12:25:13 localhost mesos-slave: I0727 12:25:13.355231 11884 status_update_manager.hpp:346] Checkpointing UPDATE for status update TASK_RUNNING (UUID: a840fd35-07b6-47b2-a503-6722a4a6f48f) for task kafka.8242e0d1-345a-11e5-95d2-56847afe9799 of framework 20150727-115610-2703018762-5050-13855-0000
Jul 27 12:25:13 localhost mesos-slave: I0727 12:25:13.400935 11884 slave.cpp:2776] Forwarding the update TASK_RUNNING (UUID: a840fd35-07b6-47b2-a503-6722a4a6f48f) for task kafka.8242e0d1-345a-11e5-95d2-56847afe9799 of framework 20150727-115610-2703018762-5050-13855-0000 to master@<master-IP>:5050
Jul 27 12:25:13 localhost mesos-slave: I0727 12:25:13.401167 11884 slave.cpp:2709] Sending acknowledgement for status update TASK_RUNNING (UUID: a840fd35-07b6-47b2-a503-6722a4a6f48f) for task kafka.8242e0d1-345a-11e5-95d2-56847afe9799 of framework 20150727-115610-2703018762-5050-13855-0000 to executor(1)@<slave-02-IP>:51938
Jul 27 12:25:13 localhost mesos-slave: I0727 12:25:13.409735 11881 status_update_manager.cpp:389] Received status update acknowledgement (UUID: a840fd35-07b6-47b2-a503-6722a4a6f48f) for task kafka.8242e0d1-345a-11e5-95d2-56847afe9799 of framework 20150727-115610-2703018762-5050-13855-0000
Jul 27 12:25:13 localhost mesos-slave: I0727 12:25:13.412237 11881 status_update_manager.hpp:346] Checkpointing ACK for status update TASK_RUNNING (UUID: a840fd35-07b6-47b2-a503-6722a4a6f48f) for task kafka.8242e0d1-345a-11e5-95d2-56847afe9799 of framework 20150727-115610-2703018762-5050-13855-0000
Jul 27 12:25:14 localhost mesos-slave: W0727 12:25:14.287214 11881 slave.cpp:1934] Ignoring updating pid for framework 20150727-115610-2703018762-5050-13855-0002 because it does not exist
Jul 27 12:25:23 localhost systemd: Starting Update systemd-resolved for mesos-dns...
Jul 27 12:25:23 localhost mesos-slave: I0727 12:25:23.148035 11883 mem.cpp:517] OOM notifier is triggered for container 21330774-54d1-41f6-a0a2-5d8271e06ec0
Jul 27 12:25:23 localhost mesos-slave: I0727 12:25:23.148144 11883 mem.cpp:536] OOM detected for container 21330774-54d1-41f6-a0a2-5d8271e06ec0
Jul 27 12:25:23 localhost systemd: Started Update systemd-resolved for mesos-dns.
Jul 27 12:25:23 localhost mesos-slave: E0727 12:25:23.159221 11883 mem.cpp:550] Failed to read 'memory.limit_in_bytes': 'mesos/21330774-54d1-41f6-a0a2-5d8271e06ec0' is not a valid cgroup
Jul 27 12:25:23 localhost mesos-slave: E0727 12:25:23.165097 11883 mem.cpp:561] Failed to read 'memory.max_usage_in_bytes': 'mesos/21330774-54d1-41f6-a0a2-5d8271e06ec0' is not a valid cgroup
Jul 27 12:25:23 localhost mesos-slave: E0727 12:25:23.174448 11883 mem.cpp:572] Failed to read 'memory.stat': 'mesos/21330774-54d1-41f6-a0a2-5d8271e06ec0' is not a valid cgroup
Jul 27 12:25:23 localhost mesos-slave: I0727 12:25:23.178088 11883 mem.cpp:577] Memory limit exceeded:
Jul 27 12:25:23 localhost mesos-slave: I0727 12:25:23.187165 11883 mem.cpp:517] OOM notifier is triggered for container c0fee87b-7db5-43b2-aca4-a27ad6a427af
Jul 27 12:25:23 localhost mesos-slave: I0727 12:25:23.191133 11883 mem.cpp:536] OOM detected for container c0fee87b-7db5-43b2-aca4-a27ad6a427af
Jul 27 12:25:23 localhost mesos-slave: E0727 12:25:23.193645 11883 mem.cpp:550] Failed to read 'memory.limit_in_bytes': 'mesos/c0fee87b-7db5-43b2-aca4-a27ad6a427af' is not a valid cgroup
Jul 27 12:25:23 localhost mesos-slave: E0727 12:25:23.198787 11883 mem.cpp:561] Failed to read 'memory.max_usage_in_bytes': 'mesos/c0fee87b-7db5-43b2-aca4-a27ad6a427af' is not a valid cgroup
Jul 27 12:25:23 localhost mesos-slave: I0727 12:25:23.187252 11879 containerizer.cpp:1140] Container 21330774-54d1-41f6-a0a2-5d8271e06ec0 has reached its limit for resource  and will be terminated
Jul 27 12:25:23 localhost mesos-slave: I0727 12:25:23.202169 11879 containerizer.cpp:918] Destroying container '21330774-54d1-41f6-a0a2-5d8271e06ec0'
Jul 27 12:25:23 localhost mesos-slave: E0727 12:25:23.199414 11883 mem.cpp:572] Failed to read 'memory.stat': 'mesos/c0fee87b-7db5-43b2-aca4-a27ad6a427af' is not a valid cgroup
Jul 27 12:25:23 localhost mesos-slave: I0727 12:25:23.212129 11883 mem.cpp:577] Memory limit exceeded:
Jul 27 12:25:23 localhost mesos-slave: I0727 12:25:23.216240 11883 containerizer.cpp:1140] Container c0fee87b-7db5-43b2-aca4-a27ad6a427af has reached its limit for resource  and will be terminated
Jul 27 12:25:23 localhost mesos-slave: I0727 12:25:23.219189 11883 containerizer.cpp:918] Destroying container 'c0fee87b-7db5-43b2-aca4-a27ad6a427af'
Jul 27 12:25:23 localhost mesos-slave: I0727 12:25:23.208132 11879 cgroups.cpp:2251] Freezing cgroup /sys/fs/cgroup/freezer/mesos/21330774-54d1-41f6-a0a2-5d8271e06ec0
Jul 27 12:25:23 localhost mesos-slave: I0727 12:25:23.221235 11881 cgroups.cpp:2251] Freezing cgroup /sys/fs/cgroup/freezer/mesos/c0fee87b-7db5-43b2-aca4-a27ad6a427af
Jul 27 12:25:23 localhost mesos-slave: I0727 12:25:23.328562 11879 cgroups.cpp:1418] Successfully froze cgroup /sys/fs/cgroup/freezer/mesos/21330774-54d1-41f6-a0a2-5d8271e06ec0 after 104.322048ms
Jul 27 12:25:23 localhost mesos-slave: I0727 12:25:23.330082 11878 cgroups.cpp:2268] Thawing cgroup /sys/fs/cgroup/freezer/mesos/21330774-54d1-41f6-a0a2-5d8271e06ec0

Jul 27 12:25:23 localhost mesos-slave: I0727 12:25:23.333230 11878 cgroups.cpp:1447] Successfullly thawed cgroup /sys/fs/cgroup/freezer/mesos/21330774-54d1-41f6-a0a2-5d8271e06ec0 after 3.068928ms
Jul 27 12:25:23 localhost mesos-slave: I0727 12:25:23.336803 11879 cgroups.cpp:1418] Successfully froze cgroup /sys/fs/cgroup/freezer/mesos/c0fee87b-7db5-43b2-aca4-a27ad6a427af after 105.551872ms
Jul 27 12:25:23 localhost mesos-slave: I0727 12:25:23.348094 11878 cgroups.cpp:2268] Thawing cgroup /sys/fs/cgroup/freezer/mesos/c0fee87b-7db5-43b2-aca4-a27ad6a427af
Jul 27 12:25:23 localhost mesos-slave: I0727 12:25:23.353806 11878 cgroups.cpp:1447] Successfullly thawed cgroup /sys/fs/cgroup/freezer/mesos/c0fee87b-7db5-43b2-aca4-a27ad6a427af after 1.758976ms

mesos-slave.WARNING of slave-02

Log file created at: 2015/07/27 12:11:44
Running on machine: slave-02
Log line format: [IWEF]mmdd hh:mm:ss.uuuuuu threadid file:line] msg
W0727 12:11:44.222558 11884 mem.cpp:215] Cgroup root for memory disappeared!
E0727 12:24:55.167184 11878 slave.cpp:2662] Failed to update resources for container c0fbadf6-c823-4780-bb98-05ef48eeb863 of executor driver-20150727121811-0001 running task driver-20150727121811-0001 on status update for terminal task, destroying container: Failed to determine cgroup for the 'cpu' subsystem: Failed to read /proc/14144/cgroup: Failed to open file '/proc/14144/cgroup': No such file or directory
E0727 12:24:59.444246 11884 slave.cpp:3207] Termination of executor '20150727-115610-2703018762-5050-13855-S4' of framework '20150727-115610-2703018762-5050-13855-0007' failed: Failed to kill the Docker container: Failed to 'docker stop -t 0 mesos-ca301d18-542d-4107-b953-95d924fb2f0b': exit status = exited with status 1 stderr = Error response from daemon: Cannot stop container mesos-ca301d18-542d-4107-b953-95d924fb2f0b: [2] Container does not exist: container destroyed
time="2015-07-27T12:24:59Z" level=fatal msg="Error: failed to stop one or more containers"
W0727 12:25:14.287214 11881 slave.cpp:1934] Ignoring updating pid for framework 20150727-115610-2703018762-5050-13855-0002 because it does not exist
E0727 12:25:23.159221 11883 mem.cpp:550] Failed to read 'memory.limit_in_bytes': 'mesos/21330774-54d1-41f6-a0a2-5d8271e06ec0' is not a valid cgroup
E0727 12:25:23.165097 11883 mem.cpp:561] Failed to read 'memory.max_usage_in_bytes': 'mesos/21330774-54d1-41f6-a0a2-5d8271e06ec0' is not a valid cgroup
E0727 12:25:23.174448 11883 mem.cpp:572] Failed to read 'memory.stat': 'mesos/21330774-54d1-41f6-a0a2-5d8271e06ec0' is not a valid cgroup
E0727 12:25:23.193645 11883 mem.cpp:550] Failed to read 'memory.limit_in_bytes': 'mesos/c0fee87b-7db5-43b2-aca4-a27ad6a427af' is not a valid cgroup
E0727 12:25:23.198787 11883 mem.cpp:561] Failed to read 'memory.max_usage_in_bytes': 'mesos/c0fee87b-7db5-43b2-aca4-a27ad6a427af' is not a valid cgroup
E0727 12:25:23.199414 11883 mem.cpp:572] Failed to read 'memory.stat': 'mesos/c0fee87b-7db5-43b2-aca4-a27ad6a427af' is not a valid cgroup
E0727 12:25:23.386461 11884 slave.cpp:3207] Termination of executor 'executor.datanode.NodeExecutor.1437999226527' of framework '20150727-115610-2703018762-5050-13855-0006' failed: Failed to clean up an isolator when destroying container '21330774-54d1-41f6-a0a2-5d8271e06ec0' :Failed to get nested cgroups: 'mesos/21330774-54d1-41f6-a0a2-5d8271e06ec0' is not a valid cgroup
E0727 12:25:23.389698 11884 slave.cpp:3207] Termination of executor 'hdfs.a56c1ba0-3458-11e5-95d2-56847afe9799' of framework '20150727-115610-2703018762-5050-13855-0000' failed: Failed to clean up an isolator when destroying container 'c0fee87b-7db5-43b2-aca4-a27ad6a427af' :Failed to get nested cgroups: 'mesos/c0fee87b-7db5-43b2-aca4-a27ad6a427af' is not a valid cgroup
W0727 12:25:23.389802 11879 containerizer.cpp:814] Ignoring update for unknown container: 21330774-54d1-41f6-a0a2-5d8271e06ec0
W0727 12:25:23.391513 11885 containerizer.cpp:814] Ignoring update for unknown container: c0fee87b-7db5-43b2-aca4-a27ad6a427af
W0727 12:25:33.430646 11885 status_update_manager.cpp:472] Resending status update TASK_LOST (UUID: 7d6bbaa9-420a-4694-9393-0864f6c9d071) for task task.datanode.datanode.NodeExecutor.1437999226527 of framework 20150727-115610-2703018762-5050-13855-0006
W0727 12:25:33.488733 11879 slave.cpp:1934] Ignoring updating pid for framework 20150727-115610-2703018762-5050-13855-0004 because it does not exist
W0727 12:25:38.017235 11878 status_update_manager.cpp:185] Resending status update TASK_LOST (UUID: 7d6bbaa9-420a-4694-9393-0864f6c9d071) for task task.datanode.datanode.NodeExecutor.1437999226527 of framework 20150727-115610-2703018762-5050-13855-0006

@AndriiOmelianenko
Author

So this is what happens when tasks get LOST status:

E0727 12:25:18.507668 11876 slave.cpp:3207] Termination of executor 'executor.journalnode.NodeExecutor.1437999147307' of framework '20150727-115610-2703018762-5050-13855-0006' failed: Failed to clean up an isolator when destroying container 'a99eacdb-d008-4d2d-9983-ea5e429bc7c6' :Failed to get nested cgroups: 'mesos/a99eacdb-d008-4d2d-9983-ea5e429bc7c6' is not a valid cgroup
I0727 12:25:18.518337 11879 cgroups.cpp:1447] Successfullly thawed cgroup /sys/fs/cgroup/freezer/mesos/be363720-66f2-4bf8-9d3a-fedad7141e15 after 1.42208ms
I0727 12:25:18.519414 11876 slave.cpp:2531] Handling status update TASK_LOST (UUID: 4eab1314-cd09-40b6-834f-7e7e5e29e173) for task task.journalnode.journalnode.NodeExecutor.1437999147307 of framework 20150727-115610-2703018762-5050-13855-0006 from @0.0.0.0:0
W0727 12:25:18.524919 11876 containerizer.cpp:814] Ignoring update for unknown container: a99eacdb-d008-4d2d-9983-ea5e429bc7c6
I0727 12:25:18.543685 11876 status_update_manager.cpp:317] Received status update TASK_LOST (UUID: 4eab1314-cd09-40b6-834f-7e7e5e29e173) for task task.journalnode.journalnode.NodeExecutor.1437999147307 of framework 20150727-115610-2703018762-5050-13855-0006
I0727 12:25:18.549226 11876 status_update_manager.hpp:346] Checkpointing UPDATE for status update TASK_LOST (UUID: 4eab1314-cd09-40b6-834f-7e7e5e29e173) for task task.journalnode.journalnode.NodeExecutor.1437999147307 of framework 20150727-115610-2703018762-5050-13855-0006
I0727 12:25:18.557823 11879 containerizer.cpp:1123] Executor for container 'be363720-66f2-4bf8-9d3a-fedad7141e15' has exited
E0727 12:25:18.565950 11879 slave.cpp:3207] Termination of executor 'executor.namenode.NameNodeExecutor.1437999176895' of framework '20150727-115610-2703018762-5050-13855-0006' failed: Failed to clean up an isolator when destroying container 'be363720-66f2-4bf8-9d3a-fedad7141e15' :Failed to get nested cgroups: 'mesos/be363720-66f2-4bf8-9d3a-fedad7141e15' is not a valid cgroup
I0727 12:25:18.570531 11879 slave.cpp:2531] Handling status update TASK_LOST (UUID: 1f586227-033a-415c-99a9-71fa6021d959) for task task.namenode.namenode.NameNodeExecutor.1437999176895 of framework 20150727-115610-2703018762-5050-13855-0006 from @0.0.0.0:0
W0727 12:25:18.575659 11878 containerizer.cpp:814] Ignoring update for unknown container: be363720-66f2-4bf8-9d3a-fedad7141e15
I0727 12:25:18.576443 11879 slave.cpp:2531] Handling status update TASK_LOST (UUID: bb124f88-c647-4a30-b9e6-ab1381fc5847) for task task.zkfc.namenode.NameNodeExecutor.1437999176895 of framework 20150727-115610-2703018762-5050-13855-0006 from @0.0.0.0:0
I0727 12:25:18.584483 11879 slave.cpp:2776] Forwarding the update TASK_LOST (UUID: 4eab1314-cd09-40b6-834f-7e7e5e29e173) for task task.journalnode.journalnode.NodeExecutor.1437999147307 of framework 20150727-115610-2703018762-5050-13855-0006 to master@<master-ip>:5050
W0727 12:25:18.586000 11875 containerizer.cpp:814] Ignoring update for unknown container: be363720-66f2-4bf8-9d3a-fedad7141e15
I0727 12:25:18.586244 11879 status_update_manager.cpp:317] Received status update TASK_LOST (UUID: 1f586227-033a-415c-99a9-71fa6021d959) for task task.namenode.namenode.NameNodeExecutor.1437999176895 of framework 20150727-115610-2703018762-5050-13855-0006
I0727 12:25:18.588321 11879 status_update_manager.hpp:346] Checkpointing UPDATE for status update TASK_LOST (UUID: 1f586227-033a-415c-99a9-71fa6021d959) for task task.namenode.namenode.NameNodeExecutor.1437999176895 of framework 20150727-115610-2703018762-5050-13855-0006
E0727 12:25:18.665886 11877 slave.cpp:3207] Termination of executor 'cassandra_dcos.d92137ff-3457-11e5-95d2-56847afe9799' of framework '20150727-115610-2703018762-5050-13855-0000' failed: Failed to clean up an isolator when destroying container 'ff542234-0f62-48f5-817c-6907bc4ef217' :Failed to get nested cgroups: 'mesos/ff542234-0f62-48f5-817c-6907bc4ef217' is not a valid cgroup
I0727 12:25:18.670913 11877 slave.cpp:2531] Handling status update TASK_FAILED (UUID: 596440db-6e40-4c60-a0cf-04060f626a93) for task cassandra_dcos.d92137ff-3457-11e5-95d2-56847afe9799 of framework 20150727-115610-2703018762-5050-13855-0000 from @0.0.0.0:0
E0727 12:25:18.671277 11877 slave.cpp:3207] Termination of executor 'cassandra.dcos.node.0.executor' of framework '20150727-115610-2703018762-5050-13855-0004' failed: Failed to clean up an isolator when destroying container 'faae2c99-ab63-4d50-aa4b-ba87b98baa9f' :Failed to get nested cgroups: 'mesos/faae2c99-ab63-4d50-aa4b-ba87b98baa9f' is not a valid cgroup
W0727 12:25:18.671308 11881 containerizer.cpp:814] Ignoring update for unknown container: ff542234-0f62-48f5-817c-6907bc4ef217
I0727 12:25:18.673607 11877 slave.cpp:2531] Handling status update TASK_LOST (UUID: bed7c77a-2888-4fdc-af74-52a0847c619f) for task cassandra.dcos.node.0.executor of framework 20150727-115610-2703018762-5050-13855-0004 from @0.0.0.0:0
W0727 12:25:18.675591 11882 containerizer.cpp:814] Ignoring update for unknown container: faae2c99-ab63-4d50-aa4b-ba87b98baa9f
I0727 12:25:18.676990 11877 slave.cpp:2531] Handling status update TASK_LOST (UUID: e22e9b3f-a13c-4064-b089-98cc5c15fe39) for task cassandra.dcos.node.0.executor.server of framework 20150727-115610-2703018762-5050-13855-0004 from @0.0.0.0:0
W0727 12:25:18.678050 11876 containerizer.cpp:814] Ignoring update for unknown container: faae2c99-ab63-4d50-aa4b-ba87b98baa9f
I0727 12:25:20.628873 11879 status_update_manager.cpp:317] Received status update TASK_LOST (UUID: bb124f88-c647-4a30-b9e6-ab1381fc5847) for task task.zkfc.namenode.NameNodeExecutor.1437999176895 of framework 20150727-115610-2703018762-5050-13855-0006
I0727 12:25:20.629257 11879 status_update_manager.hpp:346] Checkpointing UPDATE for status update TASK_LOST (UUID: bb124f88-c647-4a30-b9e6-ab1381fc5847) for task task.zkfc.namenode.NameNodeExecutor.1437999176895 of framework 20150727-115610-2703018762-5050-13855-0006
I0727 12:25:20.629066 11876 slave.cpp:2776] Forwarding the update TASK_LOST (UUID: 1f586227-033a-415c-99a9-71fa6021d959) for task task.namenode.namenode.NameNodeExecutor.1437999176895 of framework 20150727-115610-2703018762-5050-13855-0006 to master@<master-ip>:5050
I0727 12:25:20.655002 11879 status_update_manager.cpp:389] Received status update acknowledgement (UUID: 4eab1314-cd09-40b6-834f-7e7e5e29e173) for task task.journalnode.journalnode.NodeExecutor.1437999147307 of framework 20150727-115610-2703018762-5050-13855-0006
I0727 12:25:20.655347 11879 status_update_manager.hpp:346] Checkpointing ACK for status update TASK_LOST (UUID: 4eab1314-cd09-40b6-834f-7e7e5e29e173) for task task.journalnode.journalnode.NodeExecutor.1437999147307 of framework 20150727-115610-2703018762-5050-13855-0006
I0727 12:25:20.655238 11880 slave.cpp:2776] Forwarding the update TASK_LOST (UUID: bb124f88-c647-4a30-b9e6-ab1381fc5847) for task task.zkfc.namenode.NameNodeExecutor.1437999176895 of framework 20150727-115610-2703018762-5050-13855-0006 to master@<master-ip>:5050
I0727 12:25:20.680302 11879 status_update_manager.cpp:317] Received status update TASK_FAILED (UUID: 596440db-6e40-4c60-a0cf-04060f626a93) for task cassandra_dcos.d92137ff-3457-11e5-95d2-56847afe9799 of framework 20150727-115610-2703018762-5050-13855-0000

@BenWhitehead
Contributor

Hmm, thanks for the additional details @AndriiOmelianenko. Can you show me the configuration flags used for the slaves? Cassandra doesn't run in a docker container, but it looks like it may be the docker containerizer that is trying to run the tasks, which it won't be able to do.

To get the slave's flags you can hit the mesos slave HTTP API at http://<slave_host>/slave(1)/state.json; the response will contain a flags element.

I would expect to see docker,mesos for the containerizers flag.
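
For example, a quick way to pull out just that flag from a slave node (a rough sketch, assuming the default mesos slave port 5051 and that curl and python are available on the host):

curl -s 'http://<slave_host>:5051/slave(1)/state.json' \
  | python -c 'import json,sys; print(json.load(sys.stdin)["flags"]["containerizers"])'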

@AndriiOmelianenko
Author

@BenWhitehead yes, the option is set: "containerizers":"docker,mesos"

@BenWhitehead
Contributor

Thanks @AndriiOmelianenko. I spent some time trying to reproduce the behavior you're seeing, as I've never seen it before (nor have my colleagues).

I started a DCOS Cluster on AWS, then ran the following commands:

dcos package update
dcos package install cassandra --yes
dcos package install kafka --yes
dcos package install hdfs --yes
dcos kafka add 0..2
dcos kafka start 0..2
dcos package install chronos --yes

After all frameworks started their tasks and everything was healthy I ran:

dcos spark run --submit-args='-Dspark.mesos.coarse=true --driver-cores 1 --driver-memory 1024M --class org.apache.spark.examples.SparkPi http://downloads.mesosphere.com.s3.amazonaws.com/assets/spark/spark-examples_2.10-1.4.0-SNAPSHOT.jar 30'

The spark job completed successfully, everything else kept running and the cluster is still healthy.

Can you create a gist with the following info about your cluster?

Installed packages and version
  • wget $(dcos config show core.dcos_url)/pkgpanda/active.buildinfo.full.json
Installed dcos services
  • wget $(dcos config show core.dcos_url)/marathon/v2/apps

@BenWhitehead
Contributor

@AndriiOmelianenko I've finally been able to reproduce this task lost issue.

A fix is in PR #129. It will be included in version 0.2.0, which will be released soon.
