Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

${parameter//“”/string} does not work correctly with multibyte string #813

Closed
ko1nksm opened this issue Dec 29, 2024 · 5 comments
Closed
Labels
bug Something is not working

Comments

@ko1nksm
Copy link

ko1nksm commented Dec 29, 2024

$ ksh --version
  version         sh (AT&T Research) 93u+m/1.1.0-alpha+be0cfbff 2024-12-29

$ ksh -c 'v=あいう; echo "${v//""/-}"'
-あ-�-�-い-�-�-う-�-�-

$ ksh -c 'v=あいう; echo "${v//""/-}"' | od -tx1
0000000 2d e3 81 82 2d 81 2d 82 2d e3 81 84 2d 81 2d 84
0000020 2d e3 81 86 2d 81 2d 86 2d 0a
0000032
@McDutchie McDutchie added the bug Something is not working label Dec 30, 2024
@McDutchie
Copy link

McDutchie commented Dec 30, 2024

Thanks for the report!

It works correctly in 93u+ and ksh2020, so we introduced this bug. :(

Testing my stash of compiled commits shows that the bug was introduced in commit ceae1e4.

edit: Actually, it doesn't work at all in 93u+ and ksh2020, it has no effect, but at least it doesn't output invalid characters.

@McDutchie
Copy link

${v//""/-} has no effect in bash or mksh either, it only works on zsh. So I think it is okay to revert to the previous behaviour in this case. This patch does that.

diff --git a/src/cmd/ksh93/sh/macro.c b/src/cmd/ksh93/sh/macro.c
index c99ac5719..757ea2d9b 100644
--- a/src/cmd/ksh93/sh/macro.c
+++ b/src/cmd/ksh93/sh/macro.c
@@ -1916,7 +1916,7 @@ retry2:
 							flag & STR_MAXIMAL);
 					else
 						nmatch = strngrpmatch(v, vsize,
-							*pattern ? pattern : "~(E)^",
+							*pattern || (flag & STR_MAXIMAL) ? pattern : "~(E)^",
 							(ssize_t*)match,
 							elementsof(match) / 2,
 							flag | STR_INT);

@McDutchie
Copy link

However, that fix just masks a bug in the libast regex code, which can also be triggered in the AT&T versions, like so:

$ /bin/ksh -c 'echo ${.sh.version}; v=あいう; echo "${v//~(E)^/-}"'
Version AJM 93u+ 2012-08-01
-あ-?-?-い-?-?-う-?-?-
$ ksh-2020.0.1 -c 'echo ${.sh.version}; v=あいう; echo "${v//~(E)^/-}"'
Version A 2020.0.0-8-ge4fea8c5
-?-?-?-?-?-?-?-?-?-

@ko1nksm
Copy link
Author

ko1nksm commented Dec 31, 2024

I don't mind about reverting back to the original behavior. I was wondering if this might be a feature.

I was wondering about the behavior of an empty string in a pattern, so I looked up the behavior of other languages (e.g., replaceAll, replaceFirst).Some languages, such as JavaScript and Ruby, seem to insert a character between characters (but also at the end). Frankly, I find these specifications counter-intuitive. Python does not seem to accept empty characters.

Since there is no pattern to replace, it is a reasonable enough behavior that nothing changes.

@McDutchie
Copy link

However, that fix just masks a bug in the libast regex code, which can also be triggered in the AT&T versions, like so:

That bug is in some special-casing code to prevent an infinite loop; it simply wasn't multibyte-aware. Additional patch for that bug:

diff --git a/src/cmd/ksh93/sh/macro.c b/src/cmd/ksh93/sh/macro.c
index ea936b495..50fd9e235 100644
--- a/src/cmd/ksh93/sh/macro.c
+++ b/src/cmd/ksh93/sh/macro.c
@@ -1943,9 +1943,12 @@ retry2:
 						/* avoid infinite loop */
 						if(nmatch && match[1]==0)
 						{
+							int	sz;
 							nmatch = 0;
-							mac_copy(mp,v,1);
-							v++;
+							if ((sz = mbsize(v)) < 1)
+								sz = 1;
+							mac_copy(mp, v, sz);
+							v += sz;
 						}
 						tsize -= v-oldv;
 						continue;

Reproducer:

McDutchie added a commit that referenced this issue Jan 7, 2025
Koichi Nakashima (@ko1nksm) reports:
> $ ksh -c 'v=あいう; echo "${v//""/-}"'
> -あ-�-�-い-�-�-う-�-�-
>
> $ ksh -c 'v=あいう; echo "${v//""/-}"' | od -tx1
> 0000000 2d e3 81 82 2d 81 2d 82 2d e3 81 84 2d 81 2d 84
> 0000020 2d e3 81 86 2d 81 2d 86 2d 0a
> 0000032

There are two problems here. First is that the glob pattern in the
expansion is empty (after removing the double quotes). Since an
empty glob pattern matches nothing, the output string should simply
be the value of v. However, since the referenced commit, the empty
pattern is changed to ~(E)^, an (anchored) empty ERE. Unlike an
empty glob pattern, an empty regular expression matches everything.
So a '-' should be inserted at the start, end, and between each
character. Which brings us to problem two: lack of multibyte
processing in that scenario, causing corrupted output.

As this commit, things work correctly:

  $ ksh -c 'v=あいう; echo "${v//""/-}"'
  あいう
  $ ksh -c 'v=あいう; echo "${v//~(E)/-}"'
  -あ-い-う-

src/cmd/ksh93/sh/macro.c: varsub():
- Only convert an empty pattern to ~(E)^ if the operator character
  is '#'. This should not be done for '/' or '//'. This fixes
  the first problem.
- The case of ${v//~(E)/str} brings us to some code that special-
  cases the behaviour for nmatch && match[1]==0 to avoid an
  infinite loop. Make that code multibyte-aware by calculating the
  size of each character in bytes using mbsize, and advancing by
  that number instead of 1. (In the case of an invalid multibyte
  sequence with mbsize returning -1, default to advancing by one
  byte as before.) This fixes the second problem.

Resolves: #813
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something is not working
Projects
None yet
Development

No branches or pull requests

2 participants