-
Notifications
You must be signed in to change notification settings - Fork 32
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Autocomplete should not fill partial multibyte characters #223
Comments
Interestingly, if you press tab twice, it does the right thing after showing the menu. So there is some multibyte support, but it's incomplete. |
ksh/src/cmd/ksh93/edit/completion.c Lines 379 to 380 in 14352ba
|
I think that's fine, I think the bug is actually at ksh/src/cmd/ksh93/edit/completion.c Lines 81 to 82 in 14352ba
If I change this to register const char *strnext;
while((strnext=str,c= mbchar(strnext)) && (d= mbchar(newstr),charcmp(c,d,nocase)))
str=strnext; then the problem is avoided. However, |
Right, that seems pretty simple then, Declaring |
The patch below fixes it for me. I added a few missing diff --git a/src/cmd/ksh93/edit/completion.c b/src/cmd/ksh93/edit/completion.c
index a6086f9a..d6b5dee2 100644
--- a/src/cmd/ksh93/edit/completion.c
+++ b/src/cmd/ksh93/edit/completion.c
@@ -31,6 +31,13 @@
#include "edit.h"
#include "history.h"
+#if !_lib_towupper || !_lib_towlower
+#define towupper(x) toupper(x)
+#define iswupper(x) isupper(x)
+#define towlower(x) tolower(x)
+#define iswlower(x) islower(x)
+#endif
+
static char *fmtx(const char *string)
{
register const char *cp = string;
@@ -39,6 +46,7 @@ static char *fmtx(const char *string)
int offset = staktell();
if(*cp=='#' || *cp=='~')
stakputc('\\');
+ mbinit();
while((c=mbchar(cp)),(c>UCHAR_MAX)||(n=state[c])==0 || n==S_EPAT);
if(n==S_EOF && *string!='#')
return((char*)string);
@@ -62,10 +70,10 @@ static int charcmp(int a, int b, int nocase)
{
if(nocase)
{
- if(isupper(a))
- a = tolower(a);
- if(isupper(b))
- b = tolower(b);
+ if(iswupper(a))
+ a = towlower(a);
+ if(iswupper(b))
+ b = towlower(b);
}
return(a==b);
}
@@ -78,8 +86,10 @@ static int charcmp(int a, int b, int nocase)
static char *overlaid(register char *str,register const char *newstr,int nocase)
{
register int c,d;
- while((c= *(unsigned char *)str) && ((d= *(unsigned char*)newstr++),charcmp(c,d,nocase)))
- str++;
+ register char *strnext;
+ mbinit();
+ while((strnext = str, c = mbchar(strnext)) && (d = mbchar(newstr), charcmp(c,d,nocase)))
+ str = strnext;
if(*str)
*str = 0;
else if(*newstr==0)
@@ -98,6 +108,7 @@ static char *find_begin(char outbuff[], char *last, int endchar, int *type)
int mode=*type;
bp = outbuff;
*type = 0;
+ mbinit();
while(cp < last)
{
xp = cp;
@@ -500,6 +511,7 @@ int ed_expand(Edit_t *ep, char outbuff[],int *cur,int *eol,int mode, int count)
/* first re-adjust cur */
c = outbuff[*cur];
outbuff[*cur] = 0;
+ mbinit();
for(out=outbuff; *out;n++)
mbchar(out);
outbuff[*cur] = c; |
I should not have left that I don't think unconditionally calling |
I see, thanks. How does the patch below look to you? It assumes that, since ASCII is a subset of UTF-8, --- a/src/cmd/ksh93/edit/completion.c
+++ b/src/cmd/ksh93/edit/completion.c
@@ -39,6 +39,7 @@ static char *fmtx(const char *string)
int offset = staktell();
if(*cp=='#' || *cp=='~')
stakputc('\\');
+ mbinit();
while((c=mbchar(cp)),(c>UCHAR_MAX)||(n=state[c])==0 || n==S_EPAT);
if(n==S_EOF && *string!='#')
return((char*)string);
@@ -62,10 +63,22 @@ static int charcmp(int a, int b, int nocase)
{
if(nocase)
{
- if(isupper(a))
- a = tolower(a);
- if(isupper(b))
- b = tolower(b);
+#if _lib_towlower
+ if(mbwide())
+ {
+ if(iswupper((wint_t)a))
+ a = (int)towlower((wint_t)a);
+ if(iswupper((wint_t)b))
+ b = (int)towlower((wint_t)b);
+ }
+ else
+#endif
+ {
+ if(a > 0 && a <= UCHAR_MAX && isupper(a))
+ a = tolower(a);
+ if(b > 0 && b <= UCHAR_MAX && isupper(b))
+ b = tolower(b);
+ }
}
return(a==b);
}
@@ -78,8 +91,10 @@ static int charcmp(int a, int b, int nocase)
static char *overlaid(register char *str,register const char *newstr,int nocase)
{
register int c,d;
- while((c= *(unsigned char *)str) && ((d= *(unsigned char*)newstr++),charcmp(c,d,nocase)))
- str++;
+ char *strnext;
+ mbinit();
+ while((strnext = str, c = mbchar(strnext)) && (d = mbchar(newstr), charcmp(c,d,nocase)))
+ str = strnext;
if(*str)
*str = 0;
else if(*newstr==0)
@@ -98,6 +113,7 @@ static char *find_begin(char outbuff[], char *last, int endchar, int *type)
int mode=*type;
bp = outbuff;
*type = 0;
+ mbinit();
while(cp < last)
{
xp = cp;
@@ -500,6 +516,7 @@ int ed_expand(Edit_t *ep, char outbuff[],int *cur,int *eol,int mode, int count)
/* first re-adjust cur */
c = outbuff[*cur];
outbuff[*cur] = 0;
+ mbinit();
for(out=outbuff; *out;n++)
mbchar(out);
outbuff[*cur] = c; |
Yes, assuming that Elsewhere multibyte code is guarded by a Style wise, feel free to completely disregard this, I have no stake in it, but: |
Well, I just learned something… "messy" is right. ksh has explicit code to support EBCDIC in various places. It's probably broken, though, since I'm a mere mortal and don't have access to an IBM mainframe to test it. Even if it isn't broken, I don't think the current EBCDIC code could possibly coexist with UTF-8, as ASCII is a subset of UTF-8, whereas EBCDIC support is hard-coded at the preprocessor level (many variants of things like So I've been assuming an ASCII subset while fixing code that is specific to multibyte locales, e.g. here. Maybe that could be done in a better way.
I've actually removed many of the ksh/src/cmd/ksh93/include/defs.h Lines 35 to 44 in aacf0d0
Hmm. I'll have to take another look at that. Maybe a few more macros need redefining since they don't in fact all check mbwide() . But it may not matter.
Googling it gave me the impression that Thanks for the very helpful comments and suggestions. |
|
Right, I'm pretty sure that POSIX does require somewhere that each character in the portable character set (not sure exactly which characters) has the same value in all locales, so EBCDIC cannot coexist with UTF-8. However, EBCDIC could potentially coexist with UTF-EBCDIC, in which case you would probably still have a Unicode If I were in your position I would say that EBCDIC would be supported only in the non-multibyte case unless someone who wants to have it supported in the multibyte case provides you with a way to actually test that.
That's actually what I do in my shell as well, but specifically because it helps one particular character set. In ja_JP.SJIS locales (that conflict with POSIX), the code point that corresponds to
Ah, nice. That makes sense to me.
The casts just make explicit the implicit conversions that would happen anyway. For fully portable code,
I'm aware. |
Hrm. The ksh/src/cmd/ksh93/edit/completion.c Lines 375 to 376 in aacf0d0
And that variable only contains c when we're compiling on Windows (_WINIX ): ksh/src/lib/libast/port/astconf.c Lines 220 to 235 in aacf0d0
So, on every non-Windows platform, that whole case insensitivity business is currently dead code. I added However, on both Windows and macOS, it is possible to mount case-sensitive file systems, so case sensitivity can depend on the particular directory you're in. So, it looks to me like this whole approach is inherently broken, like so much of the AT&T stuff. Sigh. There is also a libast Maybe the most sensible resolution for now would be to override my no-new-features policy again and add a shell option for case insensitivity in file name completion. If so, it should be turned on by default on both Windows and macOS since case insensitivity is the default setting on those platforms. |
There is more going on. If I reduce |
The answer would be that file name completion is based on globbing, and globbing itself is made case insensitive in libast based on ksh/src/lib/libast/misc/glob.c Lines 143 to 148 in aacf0d0
So, it looks like that whole It also seems like the path does get passed as a parameter when querying the |
Don't you still need it even with case-insensitive globbing? Suppose you have files As for why it did not work on macOS, an actual probe for a case insensitive file system seems to be implemented for systems that support |
That is not possible on a case insensitive file system.
Yes, I'm trying it out (with files
It might be worth doing sometime. Or maybe there is a portable way. The globbing routine could do an access check on the first encountered file that contains at least one alphabetic character. If there are no such files, it doesn't matter anyway, so use the case sensitive default. Otherwise, change its name to all uppercase, or (if it only contains uppercase) all lowercase. Then check if they are the same file in the manner of |
It's probably the only the portable-ish way of doing it and should give accurate results, so not crazy. There may be different platform-specific solutions that you can use instead though that might be good enough. A search brings up that |
Well, that turns out to be rather simple to add. I haven't tested Cygwin yet but it seems to work like a charm on macOS. It even does the right thing for pathnames that span case-insensitive and case-sensitive file systems. The support for that was already coded, but only worked on UWIN so far. --- a/src/lib/libast/port/astconf.c
+++ b/src/lib/libast/port/astconf.c
@@ -706,6 +706,12 @@ format(register Feature_t* fp, const char* path, const char* value, unsigned int
}
*s = 0;
}
+#elif _lib_pathconf && defined(_PC_CASE_SENSITIVE)
+ /* macOS: support UWIN 'c' attribute for file system case insensitivity */
+ fp->value = pathconf(path, _PC_CASE_SENSITIVE) == 0L ? "c" : null;
+#elif _lib_pathconf && defined(_PC_CASE_INSENSITIVE)
+ /* Cygwin: support UWIN 'c' attribute for file system case insensitivity. TODO: test */
+ fp->value = pathconf(path, _PC_CASE_INSENSITIVE) > 0L ? "c" : null;
#endif
break;
|
No shell on the Mac currently does case-insensitive globbing, though. The patch above makes it do the right thing depending on the file system, but I wonder if it would surprise/upset users who are used to case-sensitive globbing even on case-insensitive file systems. This could be a problem especially when using |
Answering the question asked here , I think if you do that, you violate least surprise. The expectation is that what I typed, was used as the match. Probably not one person in 100 ever thinks about the underlying filesystem being case-insensitive, ESPECIALLY if it is case-preserving, as HFS+ and APFS are... or whack ideas like NTFS where the underlying filesystem is case-sensitive but the API calls are not. macOS Catalina:
Mint Linux:
I would not expect the second Worse would be if the behavior was different. It's an issue of "correct" vs "right". |
Thanks for your thoughtful reply. I tend to agree. But I don't want to just delete the code that does this, either. The functionality is basically done, it was clearly intended. And personally I like it. ksh would be the only shell were case-insensitive globbing properly depends on the case-insensitivity of the file system, detected separately for each pathname component so it does the right thing when mounting a case-sensitive filesystem in a directory on a case-insensitive file system, or vice versa. It's a nice selling point. So, to avoid surprises, I think a new shell option (off by default) is probably the right way forward. It's going to be a little more complicated than usual, as globbing is implemented entirely in libast, and ksh has basically nothing to do with it. So I'll need to add some way to allow ksh to influence libast's globbing behaviour. |
One of the best-kept secrets of libast/ksh93 is that the code includes support for case-insensitive file name generation (a.k.a. pathname expansion, a.k.a. globbing) as well as case-insensitive file name completion on interactive shells, depending on whether the file system is case-insensitive or not. This is transparently determined for each directory, so a path pattern that spans multiple file systems can be part case-sensitive and part case- insensitive. In more precise terms, each slash-separated path name component pattern P is treated as ~(i:P) if its parent directory exists on a case-insensitive file system. I recently discovered this while dealing with <#223>. However, that support is dead code on almost all current systems. It depends on pathconf(2) having a _PC_PATH_ATTRIBUTES selector. The 'c' attribute is supposedly returned if the given directory is on a case insensitive file system. There are other attributes as well (at least 'l', see src/lib/libcmd/rm.c). However, I have been unable to find any system, current or otherwise, that has _PC_PATH_ATTRIBUTES. Google and mailing list searches yield no relevant results at all. If anyone knows of such a system, please add a comment to this commit on GitHub, or email me. An exception is Cygwin/Windows, on which the "c" attribute was simply hardcoded, so globbing/completion is always case- insensitive. As of Windows 10, that is wrong, as it added the possibility to mount case-sensitive file systems. On the other hand, this was never activated on the Mac, even though macOS has always used a case-insensitive file like Windows. But, being UNIX, it can also mount case-sensitive file systems. Finally, Linux added the possibility to create individual case- insensitive ext4 directories fairly recently, in version 5.2. https://www.collabora.com/news-and-blog/blog/2020/08/27/using-the-linux-kernel-case-insensitive-feature-in-ext4/ So, since this functionality latently exists in the code base, and three popular OSs now have relevant file system support, we might as well make it usable on those systems. It's a nice idea, as it intuitively makes sense for globbing and completion behaviour to auto-adapt to file system case insensitivity on a per-directory basis. No other shell does this, so it's a nice selling point, too. However, the way it is coded, this is activated unconditionally on supported systems. That is not a good idea. It will surprise users. Since globbing is used with commands like 'rm', we do not want surprises. So this commit makes it conditional upon a new shell option called 'globcasedetect'. This option is only compiled into ksh on systems where we can actually detect FS case insensitivity. To implement this, libast needs some public API additions first. *** libast changes *** src/lib/libast/features/lib: - Add probes for the linux/fs.h and sys/ioctl.h headers. Linux needs these to use ioctl(2) in pathicase(3) (see below). src/lib/libast/path/pathicase.c, src/lib/libast/include/ast.h, src/lib/libast/man/path.3, src/lib/libast/Mamfile: - Add new pathicase(3) public API function. This uses whatever OS-specific method it can detect at compile time to determine if a particular path is on a case-insensitive file system. If no method is available, it only sets errno to ENOSYS and returns -1. Currently known to work on: macOS, Cygwin, Linux 5.2+, QNX 7.0+. - On systems (if any) that have the mysterious _PC_PATH_ATTRIBUTES selector for pathconf(2), call astconf(3) and check for the 'c' attribute to determine case insensitivity. This should preserve compatibility with any such system. src/lib/libast/port/astconf.c: - dynamic[]: As case-insensitive globbing is now optional on all systems, do not set the 'c' attribute by default on _WINIX (Cygwin/Windows) systems. - format(): On systems that do not have _PC_PATH_ATTRIBUTES, call pathicase(3) to determine the value for the "c" (case insensitive) attribute only. This is for compatibility as it is more efficient to call pathicase(3) directly. src/lib/libast/misc/glob.c, src/lib/libast/include/glob.h: - Add new GLOB_DCASE public API flag to glob(3). This is like GLOB_ICASE (case-insensitive matching) except it only makes the match case-insensitive if the file system for the current pathname component is determined to be case-insensitive. - gl_attr(): For efficiency, call pathicase(3) directly instead of via astconf(3). - glob_dir(): Only call gl_attr() to determine file system case insensitivity if the GLOB_DCASE flag was passed. This makes case insensitive globbing optional on all systems. - glob(): The options bitmask needs to be widened to fit the new GLOB_DCASE option. Define this centrally in a new GLOB_FLAGMASK macro so it is easy to change it along with GLOB_MAGIC (which uses the remaining bits for a sanity check bit pattern). src/lib/libast/path/pathexists.c: - For efficiency, call pathicase(3) directly instead of via astconf(3). *** ksh changes *** src/cmd/ksh93/features/options, src/cmd/ksh93/SHOPT.sh: - Add new SHOPT_GLOBCASEDET compile-time option. Set it to probe (empty) by default so that the shell option is compiled in on supported systems only, which is determined by new iffe feature test that checks if pathicase(3) returns an ENOSYS error. src/cmd/ksh93/data/options.c, src/cmd/ksh93/include/shell.h: - Add -o globcasedetect shell option if compiling with SHOPT_GLOBCASEDET. src/cmd/ksh93/sh/expand.c: path_expand(): - Pass the new GLOB_DCASE flag to glob(3) if the globcasedetect/SH_GLOBCASEDET shell option is set. src/cmd/ksh93/edit/completion.c: - While file listing/completion is based on globbing and automatically becomes case-insensitive when globbing does, it needs some additional handling to make a string comparison case-insensitive in corresponding cases. Otherwise, partial completions may be deleted from the command line upon pressing tab. This code was already in ksh 93u+ and just needs to be made conditional upon SHOPT_GLOBCASEDET and globcasedetect. - For efficiency, call pathicase(3) directly instead of via astconf(3). src/cmd/ksh93/sh.1: - Document the new globcasedetect shell option.
If I have files
'XXXá'
and'XXXë'
($'XXX\xc3\xa1'
and$'XXX\xc3\xab'
) and have typedXX
as a command argument, autocomplete should on the first Tab appendX
to showXXX
. What actually happens is that autocomplete attempts to complete to$'XXX\xc3'
, since the first byte ofá
andë
is the same, displayingXXX^
and leaving the editor in a bad state where subsequent keypresses can move to before the start of the line.The text was updated successfully, but these errors were encountered: