Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Client process lingers around after ssh disconnect #1326

Closed
raphCode opened this issue Apr 14, 2022 · 4 comments · Fixed by #1433
Closed

Client process lingers around after ssh disconnect #1326

raphCode opened this issue Apr 14, 2022 · 4 comments · Fixed by #1433
Assignees

Comments

@raphCode
Copy link
Contributor

This problem was already noticed in #1029 and now moved here.

Reproduction steps:

  • ssh to remote or localhost
  • start zellij or attach to existing session (I advise using --session unique_string to find it more easily in a process listing)
  • disconnect ssh, e.g. by using <Enter>~. escape
  • Observe that the zellij client process did not exit, e.g. with ps or htop

The client process linger around, no matter what. They even do not react to SIGHUP, SIGINT or SIGTERM.

I did a bisect and found 0b74604 to be the first bad commit

@raphCode raphCode changed the title Client process linger around after ssh disconnect Client process lingers around after ssh disconnect Apr 18, 2022
@raphCode
Copy link
Contributor Author

#1414 brought to attention that this is also a problem on local machines without involving ssh at all:

  • open new terminal emulator window
  • start zellij
  • close terminal window
  • zellij client stuck as outlined in first post

Also, on shutdown/reboot systemd waits for the process to exit upon SIGTERM, which does not happen. This introduces a 90 second wait time.

@raphCode
Copy link
Contributor Author

I think there might be race-y things going on, since sometimes the client exits when I try the local reproducer.
strace when the client process exits:

futex(0x55ef8935aa58, FUTEX_WAIT_PRIVATE, 4294967295, NULL) = ? ERESTARTSYS (To be restarted if SA_RESTART is set)
--- SIGHUP {si_signo=SIGHUP, si_code=SI_USER, si_pid=83768, si_uid=1000} ---
sendto(10, "X", 1, MSG_DONTWAIT, NULL, 0) = 1
rt_sigreturn({mask=[]})                 = 202
futex(0x55ef8935aa58, FUTEX_WAIT_PRIVATE, 4294967295, NULL) = 0
write(7, "\7\0\0\0\4\0\0\0\7\0\0\0\3\0\0\0\0\0\0\0\7\0\0\0\7\0\0\0\7\0\0\0"..., 36) = 36
futex(0x7f7aca8a8910, FUTEX_WAIT_BITSET|FUTEX_CLOCK_REALTIME, 83831, NULL, FUTEX_BITSET_MATCH_ANY) = 0
write(3, "INFO   |zellij_client           "..., 122) = 122
ioctl(0, SNDCTL_TMR_START or TCSETS, {B38400 opost isig icanon echo ...}) = 0
write(1, "\33[?1006l\33[?1015l\33[?1002l\33[?1000l", 32) = 32
write(1, "\33[33;1H\n\33[?1049l\33[m\33[?25hBye fro"..., 42) = 42
sigaltstack({ss_sp=NULL, ss_flags=SS_DISABLE, ss_size=8192}, NULL) = 0
munmap(0x7f7acb1db000, 12288)           = 0
exit_group(0)                           = ?
+++ exited with 0 +++

strace when it locks up:

futex(0x5577e01d0a58, FUTEX_WAIT_PRIVATE, 4294967295, NULL) = ? ERESTARTSYS (To be restarted if SA_RESTART is set)
--- SIGHUP {si_signo=SIGHUP, si_code=SI_USER, si_pid=83913, si_uid=1000} ---
sendto(9, "X", 1, MSG_DONTWAIT, NULL, 0) = 1
rt_sigreturn({mask=[]})                 = 202
futex(0x5577e01d0a58, FUTEX_WAIT_PRIVATE, 4294967295, NULL) = 0
write(5, "\7\0\0\0\4\0\0\0\7\0\0\0\3\0\0\0\0\0\0\0\7\0\0\0\7\0\0\0\7\0\0\0"..., 36) = 36
write(3, "INFO   |zellij_client           "..., 122) = 122
ioctl(0, SNDCTL_TMR_START or TCSETS, {B38400 opost isig icanon echo ...}) = -1 EIO (Input/output error)
futex(0x5577e01db080, FUTEX_WAIT_PRIVATE, 2, NULL

@raphCode
Copy link
Contributor Author

raphCode commented May 12, 2022

So what happens is: when the pty zellij-client is running in, disappears before the client exits, it locks up:

  • trying to read from stdin causes a panic:
    let buffer = stdin.fill_buf().unwrap();
  • this triggers the custom panic handler:
    std::panic::set_hook({
    use zellij_utils::errors::handle_panic;
    let send_client_instructions = send_client_instructions.clone();
    let os_input = os_input.clone();
    Box::new(move |info| {
    os_input.unset_raw_mode(0);
    handle_panic(info, &send_client_instructions);
    })
    });
  • the unset_raw_mode(0) added in the commit found by the bisect calls through to this:
    fn unset_raw_mode(pid: RawFd, orig_termios: termios::Termios) {
    match termios::tcsetattr(pid, termios::SetArg::TCSANOW, &orig_termios) {
    Ok(_) => {}
    Err(e) => panic!("error {:?}", e),
    };
    }
  • This causes an ioctl syscall on the file descriptor of the disappeared pty, which of course fails (see strace log), causing a panic again, which locks up everything

Pinging @jaeheonji: you might be interested in this interesting and unintended side effect of your change.

I will think of a solution tomorrow.

@raphCode raphCode self-assigned this May 12, 2022
@jaeheonji
Copy link
Member

jaeheonji commented May 13, 2022

@raphCode Thank you for a detailed description.
I didn't know unset_raw_mode(0) would cause this unintended error 😱

If you need any help, please feel free to ping me.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants