zdaemon transcript thread dies if disk is full #1

mgedmin · 2013-03-14T19:57:22Z

Two days ago I had a bit of a problem with a full disk. Today my entire web server hung and stopped processing requests.

Long story short, when zdaemon's Transcript thread gets an IOError while writing to a log file, it just dies. zdaemon itself and the child program it is managing remain running. The child's stdout/stderr are now pointing to a pipe that is still open at both ends, but now no process ever reads from it. Things seem to run fine for a couple of days, then the kernel's pipe buffer becomes full and the child blocks in write(). While holding the logging mutex. Fun for the whole family, conveniently delayed from the root cause you think you already fixed.
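
To illustrate the delayed failure, here is a minimal sketch (not zdaemon code) that measures how much a writer can push into a pipe nobody reads before it would block. The function name `fill_pipe` is made up for this example, and the write end is switched to non-blocking mode only so the demo stops instead of hanging the way the child did.

```python
import os


def fill_pipe():
    """Return how many bytes fit into a pipe that has no active reader."""
    read_fd, write_fd = os.pipe()
    # Non-blocking so a full pipe raises instead of blocking forever.
    os.set_blocking(write_fd, False)
    total = 0
    try:
        while True:
            total += os.write(write_fd, b"x" * 4096)
    except BlockingIOError:
        pass  # the kernel buffer is full; a blocking writer would hang here
    finally:
        os.close(write_fd)
        os.close(read_fd)
    return total


if __name__ == "__main__":
    print("pipe capacity in bytes:", fill_pipe())
```

On Linux the default capacity is usually 64 KiB, so a child that logs rarely can keep running for a long time before it finally blocks.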

tseaver · 2013-04-23T21:23:22Z

Would changing the code in copy()[1] to use the lock as a context manager make the pain go away? E.g.:

def copy(self):
    lock = self.lock
    i = [self.read_from]
    o = e = []
    while 1:
        ii, oo, ee = select.select(i, o, e)
        with lock:
            for fd in ii:
                self.write(os.read(fd, 8192))

[1] https://github.com/zopefoundation/zdaemon/blob/master/src/zdaemon/zdrun.py#L602

tseaver · 2013-04-23T21:23:56Z

Oops, that link should've been:

https://github.com/zopefoundation/zdaemon/blob/master/src/zdaemon/zdrun.py#L618

mgedmin · 2013-04-24T05:02:26Z

It's a good change, but it wouldn't help with this issue. The deadlock was not caused by waiting on a lock; it was caused by the child process blocking on a write to a full pipe that nobody was reading from.

What might help would be a try/finally (or a with block) that closes the self.read_from pipe. Then the child process would die with OSError(errno.EPIPE) instead of blocking some undetermined time later.
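
A rough sketch of that idea, assuming a simplified copier rather than the real Transcript class: `copy_until_failure`, `disk_full_write` and `demo` are hypothetical names, and the disk-full error is simulated. The point is only that closing the read end in a `finally` turns the writer's next write into EPIPE.

```python
import errno
import os
import threading


def copy_until_failure(read_fd, write, errors):
    """Copy pipe data to `write`; always close the read end on exit."""
    try:
        while True:
            data = os.read(read_fd, 8192)
            if not data:  # the writer closed its end
                break
            write(data)
    except OSError as exc:
        errors.append(exc)  # remember why the copier stopped
    finally:
        os.close(read_fd)  # the writer now gets EPIPE instead of blocking


def disk_full_write(data):
    # Stand-in for writing to a log file on a full disk.
    raise OSError(errno.ENOSPC, os.strerror(errno.ENOSPC))


def demo():
    read_fd, write_fd = os.pipe()
    errors = []
    thread = threading.Thread(
        target=copy_until_failure, args=(read_fd, disk_full_write, errors))
    thread.start()
    os.write(write_fd, b"first line\n")  # triggers the simulated failure
    thread.join()
    try:
        os.write(write_fd, b"second line\n")
        writer_errno = None
    except BrokenPipeError as exc:
        writer_errno = exc.errno
    finally:
        os.close(write_fd)
    return errors[0].errno, writer_errno
```

Here `demo()` plays the role of the child. A Python child sees the error as an exception because the interpreter ignores SIGPIPE; a program that leaves SIGPIPE at its default would be killed by the signal instead.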

To actually make zdaemon survive a temporary disk full condition we'd have to add a loop around self.write that retries until the write succeeds. I'm not sure that's even possible using Python file objects (do we get the number of bytes successfully written by the underlying syscall?).
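
For what it's worth, `os.write()` on a raw file descriptor does return the number of bytes written, so a retry loop is at least expressible at that level. The following is only a sketch under that assumption (bypassing Python file objects); `write_all` and its `os_write`/`sleep` parameters are hypothetical and exist mainly so the loop can be exercised without a real full disk.

```python
import errno
import os
import time


def write_all(fd, data, retry_delay=1.0, os_write=os.write, sleep=time.sleep):
    """Write all of `data` to `fd`, waiting and retrying while the disk is full."""
    view = memoryview(data)
    while view:
        try:
            written = os_write(fd, view)
        except OSError as exc:
            if exc.errno != errno.ENOSPC:
                raise  # only a full disk is treated as temporary
            sleep(retry_delay)
            continue
        view = view[written:]  # drop what the syscall accepted, retry the rest
```

While this loop is waiting, nothing drains the pipe, so the child would still block once the pipe buffer fills. The difference is that it would resume when space becomes available.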

If file permissions change, reopening will fail, and we'll end up getting an exception, a closed self.file, and a background thread that will die next time it tries to write to the now-closed file, leaving the child process writing to a pipe until the pipe buffer fills up at some point in the future (see also bug #1).

Test plan:

- mount a small tmpfs with 'sudo mount -t tmpfs none /mnt -o size=4096'
- create a zdaemon config file

    <runner>
      program yes
      transcript /mnt/transcript.log
    </runner>

- run zdaemon -C conf start
- wait a few milliseconds for /mnt to fill up
- pgrep yes

  If the 'yes' program is still running, we have a deadlock (strace and you'll find it blocked on write()). This is the situation before this patch, as described in bug #1.

  If the 'yes' program is dead, the deadlock is fixed.

- run zdaemon -C conf status

  The daemon manager should be stopped (it's not functional without the dead transcript thread).

Automating this test is left as an exercise for the reader. :(

mgedmin · 2015-04-15T17:38:14Z

I'm not 100% happy with this: the death of the transcript thread ought to produce a log message somewhere (ha! as if that's possible if the disk is full, but anyway, there might be other causes) and it should cause the daemon manager to die ASAP, instead of waiting for the child process to try and log something to stdout. But I'm out of round tuits for today: I came to fix #8, and ended up spending all day on zdaemon.
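
One possible shape for that, as a sketch only and not what zdaemon does: run the thread body in a wrapper that logs the failure and sets a flag the manager can poll. `run_and_report`, `start_watched` and the logger name are hypothetical.

```python
import logging
import threading

log = logging.getLogger("transcript-sketch")  # hypothetical logger name


def run_and_report(target, failed):
    """Run `target`; on any exception, log it and flag the failure."""
    try:
        target()
    except Exception:
        log.exception("transcript thread died")
        failed.set()


def start_watched(target):
    """Start `target` in a thread; return the thread and its failure flag."""
    failed = threading.Event()
    thread = threading.Thread(
        target=run_and_report, args=(target, failed), daemon=True)
    thread.start()
    return thread, failed
```

The manager's main loop could then check `failed.is_set()` on each iteration and shut down right away, rather than waiting for the child to write to the pipe.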

mgedmin added a commit that referenced this issue Apr 15, 2015

Use 'with lock:' instead of acquire()/release()

7f82548

This could prevent deadlocks like #1, although not that specific one.

mgedmin closed this as completed Apr 15, 2015

mgedmin reopened this Apr 15, 2015
