zdaemon transcript thread dies if disk is full #1
Comments
Would changing the code in copy()[1] to use the lock as a context manager make the pain go away? E.g.:
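The snippet that originally followed this question is not preserved here. As a rough sketch of the idea only (not zdaemon's actual code; `write_locked` and `failing_write` are made-up names), holding the lock in a `with` block guarantees it is released even when the write raises:

```python
import errno
import os
import threading


def write_locked(lock, write, data):
    """Call write(data) while holding lock (hypothetical helper).

    The with statement releases the lock even if write() raises.
    """
    with lock:
        write(data)


def failing_write(data):
    # Stand-in for a write to a log file on a full disk.
    raise OSError(errno.ENOSPC, os.strerror(errno.ENOSPC))


lock = threading.Lock()
try:
    write_locked(lock, failing_write, b"hello\n")
except OSError:
    pass

# The lock is free again despite the failed write.
lock_released = not lock.locked()
```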
[1] https://github.com/zopefoundation/zdaemon/blob/master/src/zdaemon/zdrun.py#L602
Oops, that link should've been: https://github.com/zopefoundation/zdaemon/blob/master/src/zdaemon/zdrun.py#L618
It's a good change, but it wouldn't help with this issue. The deadlock was not caused by waiting on a lock; it was caused by the child process blocking on a write to a full pipe with no readers. What might help would be a try/finally (or a with block) that closes the self.read_from pipe. Then the child process would die with OSError(errno.EPIPE) right away, instead of blocking at some undetermined later time. To actually make zdaemon survive a temporary disk-full condition, we'd have to add a loop around self.write that retries until the write succeeds. I'm not sure that's even possible using Python file objects (do we get the number of bytes successfully written by the underlying syscall?).
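To make both ideas concrete, here is a minimal sketch, not the actual zdaemon patch. `copy_until_error`, `disk_full_write` and `write_all` are hypothetical names, and the retry count and delay are arbitrary. The first part closes the read end in a `finally`, so the next write to the pipe fails with EPIPE instead of blocking later. The second part works on a raw file descriptor, where `os.write` does return the number of bytes written, which is what a retry loop needs.

```python
import errno
import os
import time


def copy_until_error(read_fd, write):
    """Copy data from read_fd via write(); always close read_fd on exit."""
    try:
        while True:
            data = os.read(read_fd, 8192)
            if not data:
                break
            write(data)
    finally:
        os.close(read_fd)


def disk_full_write(data):
    # Stand-in for a log file write that fails because the disk is full.
    raise OSError(errno.ENOSPC, os.strerror(errno.ENOSPC))


def write_all(fd, data, retries=5, delay=0.1):
    """Write all of data to fd, retrying a few times on ENOSPC."""
    offset = 0
    while offset < len(data):
        try:
            # os.write returns how many bytes were actually written.
            offset += os.write(fd, data[offset:])
        except OSError as exc:
            if exc.errno != errno.ENOSPC or retries <= 0:
                raise
            retries -= 1
            time.sleep(delay)


# Part 1: the copier dies, but it closes the read end on the way out.
r, w = os.pipe()
os.write(w, b"log line\n")
copier_errno = None
try:
    copy_until_error(r, disk_full_write)
except OSError as exc:
    copier_errno = exc.errno

# The writer now fails fast. Python ignores SIGPIPE by default, so the
# failure shows up as an OSError with errno EPIPE rather than a signal.
writer_errno = None
try:
    os.write(w, b"another line\n")
except OSError as exc:
    writer_errno = exc.errno
finally:
    os.close(w)

# Part 2: write_all on a healthy descriptor delivers every byte.
r2, w2 = os.pipe()
write_all(w2, b"abc")
echoed = os.read(r2, 10)
os.close(r2)
os.close(w2)
```

Whether retrying is the right policy is a separate question: while the loop sleeps, nothing drains the pipe, so a chatty child can still fill it.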
This could prevent deadlocks like #1, although not that specific one.
If file permissions change, reopening will fail, and we'll end up with an exception, a closed self.file, and a background thread that will die the next time it tries to write to the now-closed file. That leaves the child process writing to a pipe until the pipe buffer fills up at some point in the future (see also bug #1).
Test plan:

- mount a small tmpfs with 'sudo mount -t tmpfs none /mnt -o size=4096'
- create a zdaemon config file

      <runner>
        program yes
        transcript /mnt/transcript.log
      </runner>

- run zdaemon -C conf start
- wait a few milliseconds for /mnt to fill up
- pgrep yes

  If the 'yes' program is still running, we have a deadlock (strace and you'll find it blocked on write()). This is the situation before this patch, as described in bug #1.

  If the 'yes' program is dead, the deadlock is fixed.

- run zdaemon -C conf status

  The daemon manager should be stopped (it's not functional without the dead transcript thread).

Automating this test is left as an exercise for the reader. :(
I'm not 100% happy with this: the death of the transcript thread ought to produce a log message somewhere (ha! as if that's possible if the disk is full, but anyway, there might be other causes) and it should cause the daemon manager to die ASAP, instead of waiting for the child process to try and log something to stdout. But I'm out of round tuits for today: I came to fix #8, and ended up spending all day on zdaemon.
Two days ago I had a bit of a problem with a disk full. Today my entire web server hung and stopped processing requests.
Long story short, when zdaemon's Transcript thread gets an IOError while writing to a log file, it just dies. zdaemon itself and the child program it is managing remain running. The child's stdout/stderr are now pointing to a pipe that is still open at both ends, but now no process ever reads from it. Things seem to run fine for a couple of days, then the kernel's pipe buffer becomes full and the child blocks in write(). While holding the logging mutex. Fun for the whole family, conveniently delayed from the root cause you think you already fixed.
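The delayed failure is easy to reproduce in miniature. This is a sketch only: the write end is made non-blocking so that the demo raises instead of hanging, and the pipe capacity it reports depends on the platform. With nobody reading, writes succeed until the kernel buffer is full; a blocking writer would stall at that point, just like the child did.

```python
import os

r, w = os.pipe()
# Non-blocking mode so a full pipe raises BlockingIOError instead of hanging.
os.set_blocking(w, False)

written = 0
would_block = False
try:
    while True:
        # Nothing ever reads from r, so the kernel buffer only fills up.
        written += os.write(w, b"x" * 4096)
except BlockingIOError:
    # A blocking writer would be stuck in write() right here.
    would_block = True
finally:
    os.close(r)
    os.close(w)

print("pipe accepted", written, "bytes before a writer would block")
```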