Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Incorrect internal escaping of glob pattern resulting from unquoted variable expansion #549

Closed
McDutchie opened this issue Oct 7, 2022 · 6 comments
Labels
bug Something is not working

Comments

@McDutchie
Copy link

This is a continuation of #488 (comment) as it is a separate bug.

The following bug shows up in pathname expansion, in every ksh93 version:

$ touch b
$ ksh -c 'p=[a\\-c]; echo $p'
b

That should not output b, as the - operator in the bracket expression is escaped (after the shell's backslash removal).

When we apply the following debug patch to libast glob():

diff --git a/src/lib/libast/misc/glob.c b/src/lib/libast/misc/glob.c
index fcae7afd6..4c05d6a77 100644
--- a/src/lib/libast/misc/glob.c
+++ b/src/lib/libast/misc/glob.c
@@ -613,6 +613,7 @@ _ast_glob(const char* pattern, int flags, int (*errfn)(const char*, int), regist
 	int			extra = 1;
 	unsigned char		intr = 0;

+error(ERROR_warn(0),"[DEBUG] pattern==%s",pattern);
 	gp->gl_rescan = 0;
 	gp->gl_error = 0;
 	gp->gl_errfn = errfn;

…we can see what actually gets passed to glob():

$ touch b
$ arch/darwin.i386-64/bin/ksh -c 'p=[a\\-c]; echo "p==$p"; echo $p'
p==[a\-c]
arch/darwin.i386-64/bin/ksh: warning: [DEBUG] pattern==[a\\-c]
b

So, the backslash is incorrectly escaped itself somewhere along the way, disabling its escaping of -.

@McDutchie McDutchie added the bug Something is not working label Oct 7, 2022
@McDutchie McDutchie changed the title There is still an omission in the [current patch](#issuecomment-1264030235) that shows up in pathname expansion: Incorrect internal escaping of glob pattern resulting from unquoted variable expansion Oct 7, 2022
@McDutchie
Copy link
Author

McDutchie commented Oct 7, 2022

I've managed to trace where the incorrect escaping happens. It is the sfputc() call here in mac_copy():

ksh/src/cmd/ksh93/sh/macro.c

Lines 2423 to 2428 in d84067a

if(n==S_ESC || n==S_EPAT)
{
/* don't allow extended patterns in this case */
mp->patfound = mp->pattern;
sfputc(stkp,ESCAPE);
}

(note that ESCAPE is defined as a backslash char value on line 78 in macro.c)

If we delete that sfputc call, this bug disappears, but the following regression test failures occur instead:

/usr/local/src/ksh93/ksh/src/cmd/ksh93/tests/basic.sh[132]: eval: syntax error at line 1: `"' unmatched
	basic.sh[134]: FAIL: eval foo=\$bar, with bar="foo\ bar" not working
	bracket.sh[167]: FAIL: extended pattern matching on command arguments
	quoting.sh[52]: FAIL: nested $(print -r - '') differs from ''
	quoting.sh[55]: FAIL: "nested $(print -r - '')" differs from ''
	quoting.sh[148]: FAIL: x="\\*";$x != \*
	quoting.sh[204]: FAIL: field splitting of a,b\,c,d with IFS=, not working
	quoting.sh[218]: FAIL: $x, where x=\\(..\\)|&\|\|\\&\\| not working
	quoting.sh[222]: FAIL: a${x}b, where x=\\( not working

@McDutchie
Copy link
Author

Patch with regression tests
diff --git a/src/cmd/ksh93/tests/glob.sh b/src/cmd/ksh93/tests/glob.sh
index 1ea2e46ad..607734402 100755
--- a/src/cmd/ksh93/tests/glob.sh
+++ b/src/cmd/ksh93/tests/glob.sh
@@ -429,6 +429,14 @@ test_glob '<[^N]>' ["^"N]
 test_glob '<[a-c]>' [a\-c]
 test_glob '<[!N]>' [\!N]
 test_glob '<[^N]>' [\^N]
+# Incorrect internal escaping of glob pattern resulting from unquoted variable expansion
+# https://github.com/ksh93/ksh/issues/549
+p='[a\-c]'; test_glob '<[a\-c]>' $p
+p='[\!N]'; test_glob '<[\!N]>' $p
+p='[\^N]'; test_glob '<[\^N]>' $p
+: > -; p='[a\-c]'; test_glob '<->' $p
+: > !; p='[\!N]'; test_glob '<!>' $p
+: > ^; p='[\^N]'; test_glob '<^>' $p
 cd ..

 # ======

Currently failing:

	glob.sh[434]: FAIL: glob -- expected '<[a\-c]>', got '<b>'
	glob.sh[437]: FAIL: glob -- expected '<->', got '<b>'

@JohnoKing
Copy link

This appears to be a regression introduced in ksh93n- 2002-06-28 by the following diff:

diff --git a/src/cmd/ksh93/sh/macro.c b/src/cmd/ksh93/sh/macro.c
index 1069f05f..132f76c8 100644
--- a/src/cmd/ksh93/sh/macro.c
+++ b/src/cmd/ksh93/sh/macro.c
@@ -1637,19 +1643,7 @@ static void mac_copy(register Mac_t *mp,register const char *str, register int s
 		}
 		while(size-->0)
 		{
-			if((n=state[c= *(unsigned char*)cp++])==S_ESC)
-			{
-				mp->patfound = mp->pattern;
-				stakputc(ESCAPE);
-				if((n=sh_lexstates[ST_MACRO][*(unsigned char*)cp])==S_PAT || n==S_DIG || *cp=='-' || *cp=='\\')
-				{
-					stakputc(ESCAPE);
-					stakputc(ESCAPE);
-					size--;
-					c = *cp++;
-				}
-			}
-			else if(n==S_EPAT)
+			if((n=state[c= *(unsigned char*)cp++])==S_ESC || n==S_EPAT)
 			{
 				/* don't allow extended patterns in this case */
 				mp->patfound = mp->pattern;

Reverting that change allows the reproducer to work correctly. Unfortunately, reverting it in 93u+m causes a few regression test failures:

test enum begins at 2022-10-13+14:26:46
/home/johno/GitRepos/KornShell/ksh/src/cmd/ksh93/tests/enum.sh[25]: Color_t: x: invalid value fooyl
test enum failed at 2022-10-13+14:26:46 with exit code 1 [ 47 tests 1 error ]
test readonly begins at 2022-10-13+14:27:37
	readonly.sh[332]: FAIL: Readonly variable changed on rtests[5]:  expected 'typeset -r -R 3 x', got 'typeset -x -r -R 3 x=oyl'
test readonly failed at 2022-10-13+14:27:37 with exit code 1 [ 12 tests 1 error ]
93u+m revert patch
diff --git a/src/cmd/ksh93/sh/macro.c b/src/cmd/ksh93/sh/macro.c
index dfe087ad4..fc487ea39 100644
--- a/src/cmd/ksh93/sh/macro.c
+++ b/src/cmd/ksh93/sh/macro.c
@@ -2441,7 +2441,19 @@ static void mac_copy(register Mac_t *mp,register const char *str, register int s
 				size -= len;
 				continue;
 			}
-			if(n==S_ESC || n==S_EPAT)
+			if(n==S_ESC)
+			{
+				mp->patfound = mp->pattern;
+				stakputc(ESCAPE);
+				if((n=sh_lexstates[ST_MACRO][*(unsigned char*)cp])==S_PAT || n==S_DIG || *cp=='-' || *cp=='\\')
+				{
+					stakputc(ESCAPE);
+					stakputc(ESCAPE);
+					size--;
+					c = *cp++;
+				}
+			}
+			else if(n==S_EPAT)
 			{
 				/* don't allow extended patterns in this case */
 				mp->patfound = mp->pattern;

@McDutchie
Copy link
Author

When I apply the revert patch and change stakputc(ESCAPE) to sfputc(stkp,ESCAPE) as in the rest of that function, I don't get any regression test failures -- and the bug is gone! Nice find!

@McDutchie
Copy link
Author

Actually, it doesn't fix it -- it just masks it.

After applying the debug patch again, we get:

$ p='[a\-c]'; echo $p
arch/darwin.i386-64/bin/ksh: warning: [DEBUG] pattern==[a\\\-c]
[a\-c]

So now there's a spurious escaped backslash included in the pattern as well as the escaped dash.

Which means...

$ touch '\'; p='[a\-c]'; echo $p
arch/darwin.i386-64/bin/ksh: warning: [DEBUG] pattern==[a\\\-c]
\

Yup, it matches a file called \ though that was not included in the bracket pattern.

Back to the drawing board.

@McDutchie
Copy link
Author

Regression tests (two currently failing):

diff --git a/src/cmd/ksh93/tests/glob.sh b/src/cmd/ksh93/tests/glob.sh
index aa06b3d2c..e82438dc9 100755
--- a/src/cmd/ksh93/tests/glob.sh
+++ b/src/cmd/ksh93/tests/glob.sh
@@ -435,5 +435,15 @@ test_glob '<[]-z]>' [']-z']
 test_glob '<[]-z]>' ["]-z"]
 cd ..
 
+# ======
+# Incorrect internal escaping of glob pattern resulting from unquoted variable expansion
+# https://github.com/ksh93/ksh/issues/549
+p='[a\-c]'; test_glob '<[a\-c]>' $p
+p='[\!N]'; test_glob '<[\!N]>' $p
+p='[\^N]'; test_glob '<[\^N]>' $p
+: > -; p='[a\-c]'; test_glob '<->' $p
+: > !; p='[\!N]'; test_glob '<!>' $p
+: > ^; p='[\^N]'; test_glob '<^>' $p
+
 # ======
 exit $((Errors<125?Errors:125))

McDutchie added a commit that referenced this issue May 27, 2023
*** Problem 1:

The following bug shows up in pathname expansion, in every ksh93
version:

    $ mkdir testdir
    $ cd testdir
    $ touch a b c ./-   # GNU touch won't accept - as file name
    $ p='[a\-c]'
    $ echo $p
    a b c

As the - in the bracket expression is backslash-escaped, it should
not be treated as a bracket expression range operator, so the
expected output is:

    - a c

Analysis: For compatibility with the Bourne/POSIX/ksh88, extended
patterns (with a syntax involving parentheses) are not expanded in
fields resulting from field splitting, e.g., from unquoted variable
expansions. The shell adds an extra level of backslash-escaping to
fields resulting from field splitting, so that characters that
indicate extended pattern syntax can be disabled by escaping them.
Of course, the backslash itself also acquires an extra backslash,
and that is the cause of the bug above.

The fix is to remove one layer of backslash-escaping from patterns
resulting from field splitting before expanding them. But we cannot
do this unconditionally; we also need a new flag to check if the
extra escaping is disallowing an extended pattern. See below.

*** Problem 2:

The fix for unwanted brace expansion of ~(E)... patterns and the
like was incomplete; it only checked for ~( at the beginning of the
pattern, but it may occur anywhere in the pattern.

src/cmd/ksh93/sh/expand.c:
- path_expand():
  - Add a new 'musttrim' paraneter.
  - If that parameter is nonzero, trim the pattern before expanding
    by copying it into sh.strbuf and calling sh_trim() on it.
  - Replace the unexpanded result by the original untrimmed pattern
    if it wasn't expanded; this avoids regressions with unquoted
    variables and comsubs.
- must_disallow_bracepat():
  - Add new 'withbackslash' parameter to handle escaped patterns.
  - Don't bother checking for initial ~ as this will now be called
    by path_generate() on encountering a ~ anywhere in the pattern.
  - Return -1 if there is no ( or if the option's don't indicate a
    change in the brace expansion compatibility state.
- path_generate():
  - Add 'musttrim' parameter to pass on to path_expand() and
    must_disallow_bracepat().
  - While scanning the pattern, when encountering '~' and we're not
    already in a brace expansion, call must_disallow_bracepat()
    there instead of at the start, so that the 'nobracepat' state
    is kept up to date throughout the scan. This fixes problem 2.

src/cmd/ksh93/sh/macro.c:
- Add two new fields to the Mac_t struct:
  1. noextpat: set to disallow extended patterns; this is set in
     addition to backslash-escaping & | ( ).
  2. wasexpan: set when varsub() is called (except in the 'nosub:'
     case) or when comsubst() is called; this should cover all the
     expansions that may result in field splitting.
- mac_copy(): When escaping to disallow extended patterns, also set
  noextpat.
- end_field(): Introduce a 'musttrim' flag that is set if wasexpan
  is set (it was a variable or comsub expansion), there is no
  disallowed extended pattern, and the string contains a backslash.
  Pass this flag on to path_expand() or path_generate(). Along with
  the changes in expand.c, this fixes Problem 1.

Resolves: #549
McDutchie added a commit that referenced this issue Jun 13, 2023
@dicktyr (Richard Taityr) reports:
> globs are not expanded with the following:
>
>   % echo "/"{bin,sbin}"/*"
>   /bin/* /sbin/*
>
>   % v=/; echo "$v"bin"/*"
>   /bin/*
>
> but globbing is unexpectedly performed if both parameter
> substitution and brace expansion are present:
>
>   % v=/; echo "$v"{bin,sbin}"/*"
>   [output omitted for the sake of brevity]

So, quoted pattern characters are expanded. No es bueno.

The closest I've been able to come to fixing this so far is:

src/cmd/ksh93/sh/macro.c: endfield():
- Do not set musttrim if mp->quoted is set, i.e., if the argument
  node contains any quoted parts.

This fixes the major bug. But it also partially reintroduces the
previously fixed bug <#549> for
cases where any part of the word is quoted -- even if the quoted
part is empty. For example:

    $ mkdir testdir
    $ cd testdir
    $ touch a b c ./-
    $ p='[a\-c]'
    $ echo $p    # OK
    - a c
    $ echo ""$p  # BUG
    a b c
    $ echo $p""  # BUG
    a b c

The fundamental problem is that, by the time endfield() is reached,
we have a complete word and we can no longer distinguish between
the quoted and unquoted parts of it. The Mac_t flags, such as
mp->quoted, apply to the entire word.

This is a fundamental flaw in the current design, where quotes are
internally translated to backslash escapes for certain characters.
To the best of my knowledge (and I've tried hard), it is not
possible to fix this without introducing other regressions.

A radical redesign of this entire mechanism is needed where this
internal backslash escaping is replaced by a way for every 'struct
argnod' argument node to track for each individual character
whether it is quoted or not, without modifying the argument string
itself. Because it is incorrect to modify the argument string.

But, for the foreseeable future, that is a pipe dream, because
there is no one who fully understands all this code with all its
myriad kludges and workarounds going decades back.

Resolves: #660
McDutchie added a commit that referenced this issue Jun 13, 2023
@dicktyr (Richard Taityr) reports:
> globs are not expanded with the following:
>
>   % echo "/"{bin,sbin}"/*"
>   /bin/* /sbin/*
>
>   % v=/; echo "$v"bin"/*"
>   /bin/*
>
> but globbing is unexpectedly performed if both parameter
> substitution and brace expansion are present:
>
>   % v=/; echo "$v"{bin,sbin}"/*"
>   [output omitted for the sake of brevity]

So, quoted pattern characters are expanded. No es bueno.

The closest I've been able to come to fixing this so far is:

src/cmd/ksh93/sh/macro.c: endfield():
- Do not set musttrim if mp->quoted is set, i.e., if the argument
  node contains any quoted parts.

This fixes the major bug. But it also partially reintroduces the
previously fixed bug <#549> for
cases where any part of the word is quoted -- even if the quoted
part is empty. For example:

    $ mkdir testdir
    $ cd testdir
    $ touch a b c ./-
    $ p='[a\-c]'
    $ echo $p    # OK
    - a c
    $ echo ""$p  # BUG
    a b c
    $ echo $p""  # BUG
    a b c

The fundamental problem is that, by the time endfield() is reached,
we have a complete word and we can no longer distinguish between
the quoted and unquoted parts of it. The Mac_t flags, such as
mp->quoted, apply to the entire word.

This is a fundamental flaw in the current design, where quotes are
internally translated to backslash escapes for certain characters.
To the best of my knowledge (and I've tried hard), it is not
possible to fix this without introducing other regressions.

A radical redesign of this entire mechanism is needed where this
internal backslash escaping is replaced by a way for every 'struct
argnod' argument node to track for each individual character
whether it is quoted or not, without modifying the argument string
itself. Because it is incorrect to modify the argument string.

But, for the foreseeable future, that is a pipe dream, because
there is no one who fully understands all this code with all its
myriad kludges and workarounds going decades back.

Resolves: #660
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something is not working
Projects
None yet
Development

No branches or pull requests

2 participants