Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

A recent commit I LGTM'd causes unit tests to fail on macOS but not FreeBSD #110

Closed
krader1961 opened this issue Nov 9, 2017 · 13 comments
Closed
Assignees
Labels

Comments

@krader1961
Copy link
Contributor

I gave a LGTM to commit 8ba041a since building with that change succeeded and gave me a ksh command I could run on macOS and FreeBSD. However, I didn't run the unit tests. I just noticed, as part of testing the fix for issue #7, that those tests fail on macOS but not FreeBSD. The output of bin/package test ast-ksh on macOS begins with the following and not a single test passes:

/Users/krader/projects/3rd-party/ast/arch/darwin.i386-64/bin/ksh: cannot create pipe [Protocol not supported]
package: test output captured in /Users/krader/projects/3rd-party/ast/arch/darwin.i386-64/lib/package/gen/test.out
package: test start at Wed Nov  8 20:59:43 PST 2017 in /Users/krader/projects/3rd-party/ast/arch/darwin.i386-64
++ set -
cmd/INIT:
/Users/krader/projects/3rd-party/ast/arch/darwin.i386-64/src/cmd/INIT
++ regress /Users/krader/projects/3rd-party/ast/src/cmd/INIT/iffe.tst iffe
./regress: line 236: syntax error near unexpected token `('
./regress: line 236: `                  !($KEEP))       j="$j $i" ;;'
make [cmd/INIT]: *** exit code 2 making test.iffe

That first line is probably the important one.

@krader1961 krader1961 added the bug label Nov 9, 2017
@krader1961 krader1961 self-assigned this Nov 9, 2017
@siteshwar
Copy link
Contributor

siteshwar commented Nov 9, 2017

Ksh can use sockets to implement pipes. I suggest you to look into this code.

@siteshwar
Copy link
Contributor

I think there are problems with these socket attributes on macOS.

@qbarnes
Copy link

qbarnes commented Nov 9, 2017

I'd suggest being careful with changing to using sockets. The socket interface used on RHEL 7 has been nothing but a broken mess, yet Red Hat refuses to change the ksh default on their system back to pipes out of fear, I was told:

Switching to real pipes instead of sockets will cause performance degradation. Also, despite of all test cases passing, there is still a major risk of regression with real pipes. It would be too risky to switch to real pipes instead of sockets.

Two problems I ran across with ksh on RHEL 7. First one:
$ cat /etc/passwd | head -1 /dev/stdin
head: cannot open ‘/dev/stdin’ for reading: No such device or address

Second one is a some sort of race that reproduces less frequently the more CPUs you have in the system, but locking ksh to one CPU easily shows it:
$ taskset 1 ksh
$ top
top: failed tty set: Interrupted system call

These problems reproduce with the ksh shipping with RHEL 7.4 as well as att/ast version on master branch. I have a lot more detail on both problems. Just ask if curious.

I would love to know if these problems happen on other OSes (Linux and non-Linux) and what the underlying causes are.

@siteshwar
Copy link
Contributor

$ taskset 1 ksh
$ top
top: failed tty set: Interrupted system call

This is not an issue with sockets. It is a race condtion beause ksh uses posix_spawn to fork processes. You can read the upstream discussion around it here.

@qbarnes
Copy link

qbarnes commented Nov 9, 2017

Ah, that's right. That one wasn't a socket vs. pipe problem, but the posix_spawn() race.

I take it there's no answer to Brian Ginn's question about, "Is there a reliable way to for a process to determine if it is running in the background?"

@krader1961
Copy link
Contributor Author

krader1961 commented Nov 10, 2017

The AST feature testing in this situation is so wrong it's amazing. Look at this block of code from src/lib/libast/features/fcntl.c:

#ifndef SOCK_CLOEXEC
        printf("#ifndef SOCK_CLOEXEC\n");
        printf("#define _ast_SOCK_CLOEXEC       1\n");
        printf("#define SOCK_CLOEXEC            02000000 /* %s */\n", ORIGIN_EXTENSION);
        printf("#endif\n");
#endif
#ifndef SOCK_NONBLOCK
        printf("#ifndef SOCK_NONBLOCK\n");
        printf("#define _ast_SOCK_NONBLOCK      1\n");
        printf("#define SOCK_NONBLOCK           04000   /* %s */\n", ORIGIN_EXTENSION);
        printf("#endif\n");
#endif

You can't arbitrarily define syscall constants like that! Jeebus H. Christ! Those fallback definitions are why this issue exists. This bogosity was papered over by the code in the recently removed src/lib/libast/port/intercept.c module. The removed socktype() function called fcntl() with the correct flags to emulate those incorrect SOCK_* symbols.

On macOS the correct value for SOCK_CLOEXEC (aka O_CLOEXEC) is 0x1000000, not 0x2000000. Although even that is a hack since you can't, technically, pass O_CLOEXEC to the socket() call. It does, however, work. This has been discussed in other projects such as here: https://www.redhat.com/archives/libguestfs/2015-February/msg00054.html.

@krader1961
Copy link
Contributor Author

Also, not to derail this issue, but posix_spawn() is a horrible API. AFAICT the performance improvements it provides is nowhere close to justifying its existence. Its use is also the source of some job management bugs in the Fish shell. So the fact it seems to be the source of bugs in ksh does not surprise me.

@krader1961
Copy link
Contributor Author

It turns out that passing O_CLOEXEC to the socket() call fails on macOS 10.12.6. So even that hack doesn't work. It appears that we need to reinstate the code that was removed which does not rely on SOCK_CLOEXEC.

@siteshwar
Copy link
Contributor

AFAIK ksh uses sockets instead of real pipes due to perfromance reasons. I would like to investigate if we should continue to use this approach or switch to real pipes.

@siteshwar
Copy link
Contributor

test io begins at 2017-11-10+13:33:08                                                                                 
        io.sh[255]: <# not working for pipes               
        io.sh[342]: read -n3 from pipe not working                                                                    
        io.sh[351]: read -n3 from fifo failed -- expected 'a', got 'abc'
        io.sh[354]: read -n1 from fifo failed -- expected 'b', got 'd'                                                
        io.sh[384]: should have timed out                  
        io.sh[385]: line1 should be 'prompt1: '                                                                       
        io.sh[386]: line2 should be line2                  
        io.sh[387]: line3 should be 'prompt2: '                                                                       
        io.sh[411]: LC_ALL=C read -n2 from pipe 'a bcd' failed -- expected 'a bcd', got 'ab cd'
        io.sh[411]: LC_ALL=C.UTF-8 read -n2 from pipe 'a bcd' failed -- expected 'a bcd', got 'ab cd'                 
test io failed at 2017-11-10+13:33:21 with exit code 10 [ 99 tests 10 errors ]
test io(C.UTF-8) begins at 2017-11-10+13:33:21                                                                        
        io.sh[255]: <# not working for pipes                
        io.sh[342]: read -n3 from pipe not working
        io.sh[351]: read -n3 from fifo failed -- expected 'a', got 'abc'
        io.sh[354]: read -n1 from fifo failed -- expected 'b', got 'd'
        io.sh[384]: should have timed out
        io.sh[385]: line1 should be 'prompt1: '
        io.sh[386]: line2 should be line2
        io.sh[387]: line3 should be 'prompt2: '
        io.sh[411]: LC_ALL=C read -n2 from pipe 'a bcd' failed -- expected 'a bcd', got 'ab cd'
        io.sh[411]: LC_ALL=C.UTF-8 read -n2 from pipe 'a bcd' failed -- expected 'a bcd', got 'ab cd'
test io(C.UTF-8) failed at 2017-11-10+13:33:34 with exit code 10 [ 99 tests 10 errors ]

Switching to real pipes gives these test failures. So it looks like real pipes were not well tested.

@krader1961
Copy link
Contributor Author

Sockets were definitely faster than pipes back in the days of SVR2 and BSD 4.2. Mostly because they used a fixed size 8 KiB kernel buffer to shuttle the data between the read and write endpoints and weren't bidirectional. SVR4 STREAMS pipes were much more performant but obviously never got adopted by other OS's. I have no idea what the situation is today but will try to find out.

The one thing that really annoys me (more than the stupid #define SOCK_CLOEXEC 02000000 is the #define pipe ast_pipe. If an abstraction is needed that hides whether pipes, sockets, or some other mechanism is being used at least give it a distinct name like io_channel. Confusing the reader who sees pipe(fds) into thinking that the pipe syscall is being used is an anti-pattern.

@krader1961
Copy link
Contributor Author

Also, it's interesting to note that SOCK_CLOEXEC and SOCK_NONBLOCK do not appear in the index of the current edition of the APUE book. So these are apparently fringe features.

@krader1961
Copy link
Contributor Author

It looks like the performance of pipe() is comparable to socketpair() on some systems (e.g., BSD) and up to 15% slower on others. Whether the additional complexity to detect and use socketpair() is worthwhile is debatable. If we do keep that optimization I'd love to see the code restructured to make it easier to read.

siteshwar pushed a commit that referenced this issue Nov 30, 2017
The `SOCK_CLOEXEC` symbol is not available on some of the platforms
we want to support such as macOS. It's also wrong if it's missing to
define it with a specific value. Instead just use `fcntl()` to set the
close-on-exec attribute.

Also, remove `SOCK_NONBLOCK` because it too is nonstandard and isn't
actually needed to build ksh.

Fixes #110
citrus-it pushed a commit to citrus-it/ast that referenced this issue Apr 15, 2021
Most of these fixes are for typos and extra whitespace at the
end of lines. These are the notable changes:
- Fixed a compatibility issue with how asterisks are displayed
  using certain fonts. Bug report: att#764
- Fixed a bug in the man page that caused searches for the '|'
  character to fail. Bug report: att#871
- Removed a duplicate description of 'set -B' from the man
  page. Bug report: att#789
- Added documentation for options missing from the ksh man
  page (applies to 'hist -N', 'sleep -s', 'whence -q' and
  many of ulimit's options). Bug reports:
  att#948
  att#503 (comment)
  att#507 (comment)
- Applied the following ksh2020 documentation fixes:
  att#351
  att#352
- Fixed a minor GCC -Wformat warning in procopen.c by changing
  a sentinel to NULL.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

3 participants