-
Notifications
You must be signed in to change notification settings - Fork 32
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Multibyte characters get corrupted when KEYBD trap is set #307
Comments
It works for me, nothing gets mangled. What is the output of your |
Same as for all the other shells. :) So if you copy that line, and paste it into the shell, you don't get a bell and this:
Don't have to try to execute it. For
Again, don't have to try to execute it, just paste it. |
It happens with ksh93u+ too, it’s not a u+m bug. |
These are the results I get: So, it works fine on my end on both 93u+m and 93u+. There is something different about our setups and we need to find out what it is, or I'm not going to be able to try to debug this problem. @JohnoKing, @hyenias, can either of you reproduce this? |
I am unable to reproduce locally using any of my three terminal clients (iTERM, Terminal, and Alacritty). I tested with what I had on my mac at the time (Version AJMv 93u+m/1.0.0-alpha+92f7ca54 2021-05-07) and default macOS ksh. But I do not have macOS Catalina, I have 10.13.6. I am recompiling now with current master. I pasted without issue and even changed the curl command to a function call of echo of "$@" without issue. |
It's my keyboard setup script that runs during .kshrc...
It's enough that the keyboard trap is set. The case/esac in the function can be completely empty, but setting the trap is enough to cause the failure. And all you actually have to paste is the 印度是最好的 |
I think the bug @posguy99 is experiencing may be related to (or is the same as) att#197. After launching ksh with # Repeated attempts to paste `curl -d " 印度是最好的 The weather is good" -X POST http://localhost:8000/questions/21/`
# result in `curl -d " p^` instead.
$ curl -d " p^
$ questions/21/
ksh: questions/21/: not found
$ curl -d " p^
$ questions/21/
ksh: questions/21/: not found |
Agreed. I am able to reproduce the behavior meaning any attempt to paste 印度是最好的 fails when a KEYBD trap is enabled. When I turn KEYBD trap off, I can paste into the command line.
|
Confirmed. After setting that |
In fact, the keyboard trap doesn't even have to do anything. Just |
The RH proposed patch in att#83 fixes it, and doesn’t cause any regression test failures. I'm not sophisticated enough to understand what @kdudka thought was wrong with the patch, though. The trap only pulls single bytes out of the input queue and that causes the bug? (sorry for the email noise... proofread comment before pressing green button) |
I don't know what they're on about in the att#83 discussion re: compilers supporting signed chars or not. As far as I can tell, it's irrelevant, because the data type of the variable that may contain a negated character is int, not char. If anyone has any thoughts on that, I'm all ears; I may be missing something. A cleaned-up version of that patch is: --- a/src/cmd/ksh93/edit/edit.c
+++ b/src/cmd/ksh93/edit/edit.c
@@ -1150,12 +1150,9 @@ int ed_getchar(register Edit_t *ep,int mode)
break;
}
}
- if(n=keytrap(ep,readin,n,LOOKAHEAD-n,mode))
- {
- putstack(ep,readin,n,0);
- c = ep->e_lbuf[--ep->e_lookahead];
- }
else
+ c = -c;
+ if(!keytrap(ep,readin,n,LOOKAHEAD-n,mode))
c = ed_getchar(ep,mode);
ep->e_keytrap = 0;
} It does fix the problem, but introduces a new one: the Still, the correct fix is probably somewhere in this general direction. |
I played around some with that patch and I think I may have fixed it without breaking the --- a/src/cmd/ksh93/edit/edit.c
+++ b/src/cmd/ksh93/edit/edit.c
@@ -1150,10 +1150,15 @@ int ed_getchar(register Edit_t *ep,int mode)
break;
}
}
+ else
+ c = -c;
if(n=keytrap(ep,readin,n,LOOKAHEAD-n,mode))
{
+ int maybe_c;
putstack(ep,readin,n,0);
- c = ep->e_lbuf[--ep->e_lookahead];
+ maybe_c = ep->e_lbuf[--ep->e_lookahead];
+ if(maybe_c > 0)
+ c = maybe_c;
}
else
c = ed_getchar(ep,mode); |
Hmm. I can type multibyte characters (like é or ü) just fine with that patch, but pasting the Chinese characters is still broken. :/ |
Ahhh. As for some near-and-dear Japanese characters, try コーンシェル aka Kornshell. They work for me but the 印度是最好的 (India is the best) does not work correctly. All the Japanese characters for Kornshell (コーンシェル ) are Edit: No idea as to why the 3 byte Japanese UTF8 characters work and the 3 byte Chinese UTF8 characters do not when using your #307 (comment) [maybe_c version].
|
In UTF-8 they're all three bytes. |
@hyenias , so are the sample ones you posted. |
Well, I certainly stand corrected. You just helped me out on #189! Thank you so much as I thought each were two bytes in length. The cleaned up version of att#83 at #307 (comment) seems to work for me on my limited tests. |
With the patch in #307 (comment) , the example Japanese text from @hyenias works without issue, but the Chinese text does not. If you paste the Chinese characters individually, the second and last characters work, but the others have various misbehaviors. |
With the patch in #307 (comment) there is no change for the reproducer in the RH bugreport https://bugzilla.redhat.com/show_bug.cgi?id=1503922 either. The characters are spaced correctly, though, like there's some masking of high-order bits going on. They're not random. |
Here's more weirdness, if you ^V and then try to paste the first Chinese character, you get the Chinese character 印 like you should. If you just paste the first Chinese character, you get a p. KEYBD trap present for both tests. Edit: Does ^V turn off the trap? |
Even more weirdness...
And you get the Chinese characters when you paste them. |
I decided to check how old versions of ksh behaved and it appears this bug is a regression introduced in 93t+ 2009-10-07. The patch below (when applied on top of 93t+ 2009-09-17) is what introduced the bug: --- a/src/cmd/ksh93/RELEASE
+++ b/src/cmd/ksh93/RELEASE
@@ -1,4 +1,9 @@
-09-09-17 --- Release ksh93t+ ---
+09-10-07 --- Release ksh93t+ ---
+09-10-07 $PATH processing has been changed to delay dir stat() and .paths
+ lookup until the directory is needed in the path search.
+09-09-28 Call the ast setlocale() intercept on unset too.
+09-09-24 A bug in which LANG=foo; LC_ALL=foo; unset LC_ALL; did not revert
+ LC_CTYPE etc. to the LANG value has been fixed.
09-09-17 A bug in which unsetting SVLVL could cause a script invoked by
name without #! to core dump has been fixed.
09-09-16 A bug in which a pipeline in a here-document could hang when the
--- a/src/cmd/ksh93/sh/init.c
+++ b/src/cmd/ksh93/sh/init.c
@@ -355,13 +355,17 @@ static void put_cdpath(register Namval_t* np,const char *val,int flags,Namfun_t
}
#endif
- /* Trap for LC_ALL, LC_TYPE, LC_MESSAGES, LC_COLLATE and LANG */
+ /* Trap for LC_ALL, LC_CTYPE, LC_MESSAGES, LC_COLLATE and LANG */
static void put_lang(Namval_t* np,const char *val,int flags,Namfun_t *fp)
{
Shell_t *shp = nv_shell(np);
int type;
char *lc_all = nv_getval(LCALLNOD);
char *name = nv_name(np);
+ if((shp->test&1) && !val && !nv_getval(np))
+ return;
+ if(shp->test&2)
+ nv_putv(np, val, flags, fp);
if(name==(LCALLNOD)->nvname)
type = LC_ALL;
else if(name==(LCTYPENOD)->nvname)
@@ -372,22 +376,30 @@ static void put_cdpath(register Namval_t* np,const char *val,int flags,Namfun_t
type = LC_COLLATE;
else if(name==(LCNUMNOD)->nvname)
type = LC_NUMERIC;
- else if(name==(LANGNOD)->nvname && (!lc_all || *lc_all==0))
- type = LC_ALL;
+#ifdef LC_LANG
+ else if(name==(LANGNOD)->nvname)
+ type = LC_LANG;
+#else
+#define LC_LANG LC_ALL
+ else if(name==(LANGNOD)->nvname && (!lc_all || !*lc_all))
+ type = LC_LANG;
+#endif
else
type= -1;
if(sh_isstate(SH_INIT) && type>=0 && type!=LC_ALL && lc_all && *lc_all)
type= -1;
- if(type>=0 || type==LC_ALL)
+ if(type>=0 || type==LC_ALL || type==LC_LANG)
{
- if(!setlocale(type,val?val:""))
+ if(!setlocale(type,val?val:"-") && val)
{
if(!sh_isstate(SH_INIT) || shp->login_sh==0)
errormsg(SH_DICT,0,e_badlocale,val);
return;
}
}
- if(CC_NATIVE==CC_ASCII && (type==LC_ALL || type==LC_CTYPE))
+ if(!(shp->test&2))
+ nv_putv(np, val, flags, fp);
+ if(CC_NATIVE==CC_ASCII && (type==LC_ALL || type==LC_LANG || type==LC_CTYPE))
{
if(sh_lexstates[ST_BEGIN]!=sh_lexrstates[ST_BEGIN])
free((void*)sh_lexstates[ST_BEGIN]);
@@ -438,7 +450,6 @@ static void put_cdpath(register Namval_t* np,const char *val,int flags,Namfun_t
if(type==LC_ALL || type==LC_MESSAGES)
error_info.translate = msg_translate;
#endif
- nv_putv(np, val, flags, fp);
}
#endif /* _hdr_locale */
--- a/src/lib/libast/RELEASE
+++ b/src/lib/libast/RELEASE
@@ -1,3 +1,7 @@
+09-10-05 _sfopen.c: add but ignore 'F' flags for stdio compatibility
+09-09-28 fts.h,ftwalk.h,fts.c: promote { namelen pathlen level } to (s)size_t
+09-09-28 locales: add AST_LC_LANG for $LANG
+09-09-28 setlocale.c: fix logic for dynamic { LANG LC_ALL LC_* } changes
09-09-17 include/sfio.h,sfio/sfwalk.c: add sfwalk()
09-09-09 sfio/sfputr.c: add SIGPIPE hang fix
09-08-24 sfio/sfreserve.c: fix SF_UNBOUND logic with pushed streams
--- a/src/lib/libast/comp/setlocale.c
+++ b/src/lib/libast/comp/setlocale.c
@@ -570,7 +570,7 @@ static const signed char utf8tab[256] =
4, 4, 4, 4, 4, 4, 4, 4, 5, 5, 5, 5, 6, 6,-1,-1,
};
-int
+static int
utf8_mbtowc(wchar_t* wp, const char* str, size_t n)
{
register unsigned char* sp = (unsigned char*)str;
@@ -616,7 +616,7 @@ utf8_mbtowc(wchar_t* wp, const char* str, size_t n)
return -1;
}
-int
+static int
utf8_mblen(const char* str, size_t n)
{
wchar_t w;
@@ -749,6 +749,9 @@ Lc_category_t lc_categories[] =
{ "LC_PAPER", LC_PAPER, AST_LC_PAPER, 0 },
};
+static Lc_t* lang;
+static Lc_t* lc_all;
+
typedef struct Unamval_s
{
char* name;
@@ -816,7 +819,7 @@ single(int category, Lc_t* lc)
const char* sys;
int i;
- if (!lc && !(lc = lc_categories[category].prev))
+ if (!lc && !(lc = lc_all) && !(lc = lc_categories[category].prev) && !(lc = lang))
lc = lcmake(NiL);
if (locales[category] != lc)
{
@@ -982,6 +985,12 @@ composite(register const char* s, int initialize)
/*
* setlocale() intercept
+ *
+ * locale:
+ * 0 query
+ * "" initialize from environment
+ * "-" unset
+ * * set
*/
char*
@@ -1008,7 +1017,7 @@ _ast_setlocale(int category, const char* locale)
*/
compose:
- if (category != AST_LC_ALL)
+ if (category != AST_LC_ALL && category != AST_LC_LANG)
return (char*)locales[category]->name;
if (!sp && !(sp = sfstropen()))
return 0;
@@ -1043,9 +1052,12 @@ _ast_setlocale(int category, const char* locale)
return sfstruse(sp);
}
if (!ast.locale.serial++)
+ {
stropt(getenv("LC_OPTIONS"), options, sizeof(*options), setopt, NiL);
+ initialized = 0;
+ }
if (*locale)
- p = lcmake(locale);
+ p = streq(locale, "-") ? (Lc_t*)0 : lcmake(locale);
else if (!initialized)
{
char* u;
@@ -1057,63 +1069,75 @@ _ast_setlocale(int category, const char* locale)
*/
u = 0;
- if (!(a = getenv("LC_ALL")) || !*a)
+ if ((s = getenv("LANG")) && *s)
{
- for (i = 1; i < AST_LC_COUNT; i++)
- if ((s = getenv(lc_categories[i].name)) && *s)
- {
- if (streq(s, local) && (u || (u = native_locale(locale, tmp, sizeof(tmp)))))
- s = u;
- lc_categories[i].prev = lcmake(s);
- }
- a = getenv("LANG");
+ if (streq(s, local) && (u || (u = native_locale(locale, tmp, sizeof(tmp)))))
+ s = u;
+ lang = lcmake(s);
}
- if (a)
+ else
+ lang = 0;
+ if ((s = getenv("LC_ALL")) && *s)
{
- if (streq(a, local) && (u || (u = native_locale(locale, tmp, sizeof(tmp)))))
- a = u;
- if (composite(a, 1))
- a = 0;
+ if (streq(s, local) && (u || (u = native_locale(locale, tmp, sizeof(tmp)))))
+ s = u;
+ lc_all = lcmake(s);
}
- p = 0;
+ else
+ lc_all = 0;
for (i = 1; i < AST_LC_COUNT; i++)
- {
- if (!lc_categories[i].prev)
+ if ((s = getenv(lc_categories[i].name)) && *s)
{
- if (!p && !(p = lcmake(a)))
- break;
- lc_categories[i].prev = p;
+ if (streq(s, local) && (u || (u = native_locale(locale, tmp, sizeof(tmp)))))
+ s = u;
+ lc_categories[i].prev = lcmake(s);
}
- if (!single(i, lc_categories[i].prev))
+ else
+ lc_categories[i].prev = 0;
+ for (i = 1; i < AST_LC_COUNT; i++)
+ if (!single(i, lc_all ? lc_all : lc_categories[i].prev))
{
while (i--)
single(i, NiL);
return 0;
}
- }
if (ast.locale.set & AST_LC_debug)
for (i = 1; i < AST_LC_COUNT; i++)
- sfprintf(sfstderr, "locale env %17s %s\n", lc_categories[i].name, locales[i]->name);
+ sfprintf(sfstderr, "locale env %17s %16s %16s\n", lc_categories[i].name, locales[i]->name, lc_categories[i].prev ? lc_categories[i].prev->name : (char*)0);
initialized = 1;
goto compose;
}
- else if (!(p = lc_categories[category].prev))
+ else if (category == AST_LC_LANG || !(p = lc_categories[category].prev))
p = lcmake("C");
- if (category != AST_LC_ALL)
+ if (category == AST_LC_LANG)
+ {
+ if (lang != p)
+ {
+ lang = p;
+ if (!lc_all)
+ for (i = 1; i < AST_LC_COUNT; i++)
+ if (!single(i, lc_categories[i].prev))
+ {
+ while (i--)
+ single(i, NiL);
+ return 0;
+ }
+ }
+ }
+ else if (category != AST_LC_ALL)
return single(category, p);
- else if (!(i = composite(locale, 0)))
+ else if ((i = composite(locale, 0)) < 0)
+ return 0;
+ else if (lc_all != p)
{
- if (!p)
- return 0;
+ lc_all = p;
for (i = 1; i < AST_LC_COUNT; i++)
- if (!single(i, p))
+ if (!single(i, lc_all ? lc_all : lc_categories[i].prev))
{
while (i--)
single(i, NiL);
return 0;
}
}
- else if (i < 0)
- return 0;
goto compose;
}
--- a/src/lib/libast/include/ast_std.h
+++ b/src/lib/libast/include/ast_std.h
@@ -154,6 +154,7 @@ extern char* strerror(int);
#define AST_LC_MEASUREMENT 12
#define AST_LC_PAPER 13
#define AST_LC_COUNT 14
+#define AST_LC_LANG 255
#define AST_LC_find (1L<<28)
#define AST_LC_debug (1L<<29)
@@ -202,6 +203,9 @@ extern char* strerror(int);
#ifndef LC_PAPER
#define LC_PAPER (-AST_LC_PAPER)
#endif
+#ifndef LC_LANG
+#define LC_LANG (-AST_LC_LANG)
+#endif
#undef extern
--- a/src/lib/libast/port/lc.c
+++ b/src/lib/libast/port/lc.c
@@ -125,6 +125,7 @@ lcindex(int category, int min)
case LC_COLLATE: return AST_LC_COLLATE;
case LC_CTYPE: return AST_LC_CTYPE;
case LC_IDENTIFICATION: return AST_LC_IDENTIFICATION;
+ case LC_LANG: return AST_LC_LANG;
case LC_MEASUREMENT: return AST_LC_MEASUREMENT;
case LC_MESSAGES: return AST_LC_MESSAGES;
case LC_MONETARY: return AST_LC_MONETARY; |
Unsetting the $ trap : KEYBD
$ unset LANG
$ print 印度是最好的
印度是最好的 |
Unsetting
|
Yes, that is the logical consequence of unsetting |
Just for fun...
|
A fix remains elusive, but we've got to do something with this bug before releasing 1.0. Currently, if a The patch below simply refuses to execute the That effectively fixes this bug, leaving a much more benign bug or limitation in its place; in multibyte locales, the diff --git a/src/cmd/ksh93/edit/edit.c b/src/cmd/ksh93/edit/edit.c
index d9933b63b..be31ec7cb 100644
--- a/src/cmd/ksh93/edit/edit.c
+++ b/src/cmd/ksh93/edit/edit.c
@@ -1122,7 +1122,9 @@ int ed_getchar(register Edit_t *ep,int mode)
killpg(getpgrp(),SIGINT);
siglongjmp(ep->e_env, UINTR);
}
- if(mode<=0 && sh.st.trap[SH_KEYTRAP])
+ if(mode<=0 && sh.st.trap[SH_KEYTRAP]
+ /* workaround for https://github.com/ksh93/ksh/issues/307 -- ignore non-ASCII in multibyte locale */
+ && (!mbwide() || c > -128))
{
ep->e_keytrap = 1;
n=1; |
Github needs a WONTFIXRIGHTNOW. 😏 |
In UTF-8 locales, ksh breaks when a KEYBD trap is active, even a dummy no-op one like 'trap : KEYBD'. Entering multi-byte characters fails (the input is interrupted and a new prompt is displayed) and pasting content with multi-byte characters produces corrupted results. The cause is that the KEYBD trap code is not multibyte-ready. Unfortunately nobody yet understands the edit.c code well enough to implement a proper fix. Pending that, this commit implements a workaround that at least avoids breaking the shell. src/cmd/ksh93/edit/edit.c: ed_getchar(): - When a multi-byte locale is active, do not trigger the the KEYBD trap except for ASCII characters (1-127). Resolves: #307
The KEYBD trap should now be fully functional for UTF-8 and other multibyte locales. Thanks to Johnothan King for finding this fix! Analysis: The KEYBD trap code processes character code points stored in e_lbuf by ed_read(). But shell variables store bytes, not characters. So, in UTF-8 locales for example, the Unicode code points need to be converted to multibyte UTF-8 encoding. This is needed to calculate the length of each encoded character in bytes (which fixes the corruption issue) and for keytrap() to store its UTF-8 representation in ${.sh.edchar}. src/cmd/ksh93/edit/edit.c: ed_getchar(): - Remove the workaround from the referenced commit. - Use mbconv to convert innput codepoints to bytes before adding them to inbuff, a char array that is passed on to keytrap(). Related: https://bugzilla.redhat.com/show_bug.cgi?id=1503922 Related: att#197 Related: #307 Resolves: #460 Co-authored-by: Johnothan King <[email protected]>
The KEYBD trap should now be fully functional for UTF-8 and other multibyte locales. Thanks to Johnothan King for finding this fix! Analysis: The KEYBD trap code processes character code points stored in e_lbuf by ed_read(). But shell variables store bytes, not characters. So, in UTF-8 locales for example, the Unicode code points need to be converted to multibyte UTF-8 encoding. This is needed to calculate the length of each encoded character in bytes (which fixes the corruption issue) and for keytrap() to store its UTF-8 representation in ${.sh.edchar}. src/cmd/ksh93/edit/edit.c: ed_getchar(): - Remove the workaround from the referenced commit. - Use mbconv to convert innput codepoints to bytes before adding them to inbuff, a char array that is passed on to keytrap(). Related: https://bugzilla.redhat.com/show_bug.cgi?id=1503922 Related: att#197 Related: #307 Resolves: #460 Co-authored-by: Johnothan King <[email protected]>
See:
https://apple.stackexchange.com/questions/420570/i-am-typing-into-my-terminal-and-the-stdin-is-changing-and-adding-characters-on
The stackexchange posting is about zsh, but I tested all the shells I have installed on macOS Catalina. Only ksh93u+m (and 93u+) mangles the input line.
Doesn't matter in emacs mode, in vi mode, both of them off.
The text was updated successfully, but these errors were encountered: