-
Notifications
You must be signed in to change notification settings - Fork 32
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
${parameter//“”/string} does not work correctly with multibyte string #813
Comments
Thanks for the report!
Testing my stash of compiled commits shows that the bug was introduced in commit ceae1e4. edit: Actually, it doesn't work at all in 93u+ and ksh2020, it has no effect, but at least it doesn't output invalid characters. |
diff --git a/src/cmd/ksh93/sh/macro.c b/src/cmd/ksh93/sh/macro.c
index c99ac5719..757ea2d9b 100644
--- a/src/cmd/ksh93/sh/macro.c
+++ b/src/cmd/ksh93/sh/macro.c
@@ -1916,7 +1916,7 @@ retry2:
flag & STR_MAXIMAL);
else
nmatch = strngrpmatch(v, vsize,
- *pattern ? pattern : "~(E)^",
+ *pattern || (flag & STR_MAXIMAL) ? pattern : "~(E)^",
(ssize_t*)match,
elementsof(match) / 2,
flag | STR_INT); |
However, that fix just masks a bug in the libast regex code, which can also be triggered in the AT&T versions, like so:
|
I don't mind about reverting back to the original behavior. I was wondering if this might be a feature. I was wondering about the behavior of an empty string in a pattern, so I looked up the behavior of other languages (e.g., replaceAll, replaceFirst).Some languages, such as JavaScript and Ruby, seem to insert a character between characters (but also at the end). Frankly, I find these specifications counter-intuitive. Python does not seem to accept empty characters. Since there is no pattern to replace, it is a reasonable enough behavior that nothing changes. |
That bug is in some special-casing code to prevent an infinite loop; it simply wasn't multibyte-aware. Additional patch for that bug: diff --git a/src/cmd/ksh93/sh/macro.c b/src/cmd/ksh93/sh/macro.c
index ea936b495..50fd9e235 100644
--- a/src/cmd/ksh93/sh/macro.c
+++ b/src/cmd/ksh93/sh/macro.c
@@ -1943,9 +1943,12 @@ retry2:
/* avoid infinite loop */
if(nmatch && match[1]==0)
{
+ int sz;
nmatch = 0;
- mac_copy(mp,v,1);
- v++;
+ if ((sz = mbsize(v)) < 1)
+ sz = 1;
+ mac_copy(mp, v, sz);
+ v += sz;
}
tsize -= v-oldv;
continue; Reproducer: |
Koichi Nakashima (@ko1nksm) reports: > $ ksh -c 'v=あいう; echo "${v//""/-}"' > -あ-�-�-い-�-�-う-�-�- > > $ ksh -c 'v=あいう; echo "${v//""/-}"' | od -tx1 > 0000000 2d e3 81 82 2d 81 2d 82 2d e3 81 84 2d 81 2d 84 > 0000020 2d e3 81 86 2d 81 2d 86 2d 0a > 0000032 There are two problems here. First is that the glob pattern in the expansion is empty (after removing the double quotes). Since an empty glob pattern matches nothing, the output string should simply be the value of v. However, since the referenced commit, the empty pattern is changed to ~(E)^, an (anchored) empty ERE. Unlike an empty glob pattern, an empty regular expression matches everything. So a '-' should be inserted at the start, end, and between each character. Which brings us to problem two: lack of multibyte processing in that scenario, causing corrupted output. As this commit, things work correctly: $ ksh -c 'v=あいう; echo "${v//""/-}"' あいう $ ksh -c 'v=あいう; echo "${v//~(E)/-}"' -あ-い-う- src/cmd/ksh93/sh/macro.c: varsub(): - Only convert an empty pattern to ~(E)^ if the operator character is '#'. This should not be done for '/' or '//'. This fixes the first problem. - The case of ${v//~(E)/str} brings us to some code that special- cases the behaviour for nmatch && match[1]==0 to avoid an infinite loop. Make that code multibyte-aware by calculating the size of each character in bytes using mbsize, and advancing by that number instead of 1. (In the case of an invalid multibyte sequence with mbsize returning -1, default to advancing by one byte as before.) This fixes the second problem. Resolves: #813
The text was updated successfully, but these errors were encountered: