Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Autocomplete should not fill partial multibyte characters #223

Closed
hvdijk opened this issue Mar 14, 2021 · 21 comments
Closed

Autocomplete should not fill partial multibyte characters #223

hvdijk opened this issue Mar 14, 2021 · 21 comments
Labels
bug Something is not working

Comments

@hvdijk
Copy link

hvdijk commented Mar 14, 2021

If I have files 'XXXá' and 'XXXë' ($'XXX\xc3\xa1' and $'XXX\xc3\xab') and have typed XX as a command argument, autocomplete should on the first Tab append X to show XXX. What actually happens is that autocomplete attempts to complete to $'XXX\xc3', since the first byte of á and ë is the same, displaying XXX^ and leaving the editor in a bad state where subsequent keypresses can move to before the start of the line.

@McDutchie McDutchie added the bug Something is not working label Mar 14, 2021
@McDutchie
Copy link

Interestingly, if you press tab twice, it does the right thing after showing the menu. So there is some multibyte support, but it's incomplete.

@McDutchie
Copy link

McDutchie commented Mar 16, 2021

I think the bug may be here:

/* just expand until name is unique */
size += strlen(*com);

@hvdijk
Copy link
Author

hvdijk commented Mar 16, 2021

I think that's fine, I think the bug is actually at

while((c= *(unsigned char *)str) && ((d= *(unsigned char*)newstr++),charcmp(c,d,nocase)))
str++;

If I change this to

       register const char *strnext;
       while((strnext=str,c= mbchar(strnext)) && (d= mbchar(newstr),charcmp(c,d,nocase)))
               str=strnext;

then the problem is avoided. However, charcmp is not designed to be called with anything other than unsigned chars converted to int so this will need more work.

@McDutchie
Copy link

Right, that seems pretty simple then, charcmp just needs to use towlower(3), etc. instead of the single-byte versions.

Declaring strnext const produces a compiler warning about incompatible pointer types when assigning str to strnext. Why did you add the const and is there any harm in removing it?

@McDutchie
Copy link

McDutchie commented Mar 17, 2021

The patch below fixes it for me. I added a few missing mbinit calls as well. These are needed in case multibyte processing was previously interrupted in the middle of processing a multibyte character. (Before the fix to #5, mbinit calls were ineffective, which could cause various character garbling bugs.)

diff --git a/src/cmd/ksh93/edit/completion.c b/src/cmd/ksh93/edit/completion.c
index a6086f9a..d6b5dee2 100644
--- a/src/cmd/ksh93/edit/completion.c
+++ b/src/cmd/ksh93/edit/completion.c
@@ -31,6 +31,13 @@
 #include	"edit.h"
 #include	"history.h"
 
+#if !_lib_towupper || !_lib_towlower
+#define towupper(x)	toupper(x)
+#define iswupper(x)	isupper(x)
+#define towlower(x)	tolower(x)
+#define iswlower(x)	islower(x)
+#endif
+
 static char *fmtx(const char *string)
 {
 	register const char	*cp = string;
@@ -39,6 +46,7 @@ static char *fmtx(const char *string)
 	int offset = staktell();
 	if(*cp=='#' || *cp=='~')
 		stakputc('\\');
+	mbinit();
 	while((c=mbchar(cp)),(c>UCHAR_MAX)||(n=state[c])==0 || n==S_EPAT);
 	if(n==S_EOF && *string!='#')
 		return((char*)string);
@@ -62,10 +70,10 @@ static int charcmp(int a, int b, int nocase)
 {
 	if(nocase)
 	{
-		if(isupper(a))
-			a = tolower(a);
-		if(isupper(b))
-			b = tolower(b);
+		if(iswupper(a))
+			a = towlower(a);
+		if(iswupper(b))
+			b = towlower(b);
 	}
 	return(a==b);
 }
@@ -78,8 +86,10 @@ static int charcmp(int a, int b, int nocase)
 static char *overlaid(register char *str,register const char *newstr,int nocase)
 {
 	register int c,d;
-	while((c= *(unsigned char *)str) && ((d= *(unsigned char*)newstr++),charcmp(c,d,nocase)))
-		str++;
+	register char *strnext;
+	mbinit();
+	while((strnext = str, c = mbchar(strnext)) && (d = mbchar(newstr), charcmp(c,d,nocase)))
+		str = strnext;
 	if(*str)
 		*str = 0;
 	else if(*newstr==0)
@@ -98,6 +108,7 @@ static char *find_begin(char outbuff[], char *last, int endchar, int *type)
 	int		mode=*type;
 	bp = outbuff;
 	*type = 0;
+	mbinit();
 	while(cp < last)
 	{
 		xp = cp;
@@ -500,6 +511,7 @@ int ed_expand(Edit_t *ep, char outbuff[],int *cur,int *eol,int mode, int count)
 		/* first re-adjust cur */
 		c = outbuff[*cur];
 		outbuff[*cur] = 0;
+		mbinit();
 		for(out=outbuff; *out;n++)
 			mbchar(out);
 		outbuff[*cur] = c;

@hvdijk
Copy link
Author

hvdijk commented Mar 17, 2021

I should not have left that const in. I originally had two variables and made them const char * to match newstr, realised I did not need the second one and removed it, but failed to update the declaration properly.

I don't think unconditionally calling iswupper is right: mbchar() returns either a single byte character if the current locale has no multibyte characters or a wide character if it does. In the former case, isupper should be called, in the latter case iswupper should be called. It's messy.

@McDutchie
Copy link

McDutchie commented Mar 17, 2021

I don't think unconditionally calling iswupper is right: mbchar() returns either a single byte character if the current locale has no multibyte characters or a wide character if it does. In the former case, isupper should be called, in the latter case iswupper should be called. It's messy.

I see, thanks. How does the patch below look to you? It assumes that, since ASCII is a subset of UTF-8, iswupper, etc. will correctly handle ASCII characters in UTF-8 locales even though they are still single-byte characters. Do you think that is correct?

--- a/src/cmd/ksh93/edit/completion.c
+++ b/src/cmd/ksh93/edit/completion.c
@@ -39,6 +39,7 @@ static char *fmtx(const char *string)
 	int offset = staktell();
 	if(*cp=='#' || *cp=='~')
 		stakputc('\\');
+	mbinit();
 	while((c=mbchar(cp)),(c>UCHAR_MAX)||(n=state[c])==0 || n==S_EPAT);
 	if(n==S_EOF && *string!='#')
 		return((char*)string);
@@ -62,10 +63,22 @@ static int charcmp(int a, int b, int nocase)
 {
 	if(nocase)
 	{
-		if(isupper(a))
-			a = tolower(a);
-		if(isupper(b))
-			b = tolower(b);
+#if _lib_towlower
+		if(mbwide())
+		{
+			if(iswupper((wint_t)a))
+				a = (int)towlower((wint_t)a);
+			if(iswupper((wint_t)b))
+				b = (int)towlower((wint_t)b);
+		}
+		else
+#endif
+		{
+			if(a > 0 && a <= UCHAR_MAX && isupper(a))
+				a = tolower(a);
+			if(b > 0 && b <= UCHAR_MAX && isupper(b))
+				b = tolower(b);
+		}
 	}
 	return(a==b);
 }
@@ -78,8 +91,10 @@ static int charcmp(int a, int b, int nocase)
 static char *overlaid(register char *str,register const char *newstr,int nocase)
 {
 	register int c,d;
-	while((c= *(unsigned char *)str) && ((d= *(unsigned char*)newstr++),charcmp(c,d,nocase)))
-		str++;
+	char *strnext;
+	mbinit();
+	while((strnext = str, c = mbchar(strnext)) && (d = mbchar(newstr), charcmp(c,d,nocase)))
+		str = strnext;
 	if(*str)
 		*str = 0;
 	else if(*newstr==0)
@@ -98,6 +113,7 @@ static char *find_begin(char outbuff[], char *last, int endchar, int *type)
 	int		mode=*type;
 	bp = outbuff;
 	*type = 0;
+	mbinit();
 	while(cp < last)
 	{
 		xp = cp;
@@ -500,6 +516,7 @@ int ed_expand(Edit_t *ep, char outbuff[],int *cur,int *eol,int mode, int count)
 		/* first re-adjust cur */
 		c = outbuff[*cur];
 		outbuff[*cur] = 0;
+		mbinit();
 		for(out=outbuff; *out;n++)
 			mbchar(out);
 		outbuff[*cur] = c;

@hvdijk
Copy link
Author

hvdijk commented Mar 17, 2021

Yes, assuming that 'a' == L'a' should be fine, I'm pretty sure this assumption already exists in many other places in ksh. C doesn't guarantee this (look up __STDC_MB_MIGHT_NEQ_WC__) and I'm not sure whether POSIX does but if ksh already relies on it there's no harm in adding one more.

Elsewhere multibyte code is guarded by a SHOPT_MULTIBYTE check. I don't know whether you'd need that here. Aside from that, it looks like it should give correct results.

Style wise, feel free to completely disregard this, I have no stake in it, but: tolower and towlower are required to return the argument unchanged if it is not an uppercase letter, so the check for isupper and iswupper should not be necessary. In the non-wide case this may be done for portability concerns to non-conforming systems, but those concerns may not exist for the wide case. The casts to and from wint_t should not be necessary and are not done elsewhere in ksh. The a >= 0 && a <= UCHAR_MAX check is done elsewhere in ksh as (a&~STRIP)==0. All of those don't affect the results though, so if you prefer to leave those as they are I have no issue with that. :)

@McDutchie
Copy link

Yes, assuming that 'a' == L'a' should be fine, I'm pretty sure this assumption already exists in many other places in ksh. C doesn't guarantee this (look up __STDC_MB_MIGHT_NEQ_WC__) and I'm not sure whether POSIX does but if ksh already relies on it there's no harm in adding one more.

Well, I just learned something… "messy" is right.

ksh has explicit code to support EBCDIC in various places. It's probably broken, though, since I'm a mere mortal and don't have access to an IBM mainframe to test it.

Even if it isn't broken, I don't think the current EBCDIC code could possibly coexist with UTF-8, as ASCII is a subset of UTF-8, whereas EBCDIC support is hard-coded at the preprocessor level (many variants of things like #if (CC_NATIVE == CC_ASCII) or #if ('a'==97) in various places) so you can have either ASCII or EBCDIC but not both.

So I've been assuming an ASCII subset while fixing code that is specific to multibyte locales, e.g. here. Maybe that could be done in a better way.

Elsewhere multibyte code is guarded by a SHOPT_MULTIBYTE check. I don't know whether you'd need that here. Aside from that, it looks like it should give correct results.

I've actually removed many of the SHOPT_MULTIBYTE directives as they got on my nerves. Instead, I've used the fact that the commonly used multibyte macros directly or indirectly check if mbmax() > 1 and fall back to single-byte versions if it's equal to 1. The mbmax() macro expands to the maximum possible width in bytes of a character in the current locale. It's just a variable, so I've done this:

#if !SHOPT_MULTIBYTE
/*
* Disable multibyte without need for excessive '#if SHOPT_MULTIBYTE' peprocessor conditionals.
* If we redefine the maximum character size mbmax() as 1 byte, the mbwide() macro will always
* evaluate to 0. All the other multibyte macros have multibtye code conditional upon mbwide(),
* so the compiler should optimize all of that code away. See src/lib/libast/include/ast.h
*/
# undef mbmax
# define mbmax() 1
#endif

Hmm. I'll have to take another look at that. Maybe a few more macros need redefining since they don't in fact all check mbwide(). But it may not matter.

The casts to and from wint_t should not be necessary and are not done elsewhere in ksh.

Googling it gave me the impression that win_t could be defined as just about anything. If typecasts are not necessary then why did they bother with adding this type at all, instead of using something like int?

Thanks for the very helpful comments and suggestions.

@McDutchie
Copy link

The a >= 0 && a <= UCHAR_MAX check is done elsewhere in ksh as (a&~STRIP)==0.

a > UCHAR_MAX and/or b > UCHAR_MAX are possible if a multibyte locale is active. I added that check because there is support for systems that don't have towlower (not that I know if any such still exist), so they would fall back to tolower whose behaviour with out-of-range arguments is undefined.

@hvdijk
Copy link
Author

hvdijk commented Mar 17, 2021

Well, I just learned something… "messy" is right.

ksh has explicit code to support EBCDIC in various places. It's probably broken, though, since I'm a mere mortal and don't have access to an IBM mainframe to test it.

Even if it isn't broken, I don't think the current EBCDIC code could possibly coexist with UTF-8, as ASCII is a subset of UTF-8, whereas EBCDIC support is hard-coded at the preprocessor level (many variants of things like #if (CC_NATIVE == CC_ASCII) or #if ('a'==97) in various places) so you can have either ASCII or EBCDIC but not both.

Right, I'm pretty sure that POSIX does require somewhere that each character in the portable character set (not sure exactly which characters) has the same value in all locales, so EBCDIC cannot coexist with UTF-8. However, EBCDIC could potentially coexist with UTF-EBCDIC, in which case you would probably still have a Unicode wchar_t and hit that __STDC_MB_MIGHT_NEQ_WC__ case.

If I were in your position I would say that EBCDIC would be supported only in the non-multibyte case unless someone who wants to have it supported in the multibyte case provides you with a way to actually test that.

So I've been assuming an ASCII subset while fixing code that is specific to multibyte locales, e.g. here. Maybe that could be done in a better way.

That's actually what I do in my shell as well, but specifically because it helps one particular character set. In ja_JP.SJIS locales (that conflict with POSIX), the code point that corresponds to \ in US-ASCII is used for ¥ and converting it to a wide character produces U+00A5 rather than U+005C. There is no \ character, but ¥ fills its role everywhere. (IIRC, Japanese pre-Unicode versions of Windows also used a superset of SJIS and used ¥ as directory separator.) By adding a special hack to treat all bytes <128 as US-ASCII, that gets handled without an issue.

I've actually removed many of the SHOPT_MULTIBYTE directives as they got on my nerves. Instead, I've used the fact that the commonly used multibyte macros directly or indirectly check if mbmax() > 1 and fall back to single-byte versions if it's equal to 1. The mbmax() macro expands to the maximum possible width in bytes of a character in the current locale. It's just a variable, so I've done this:

#if !SHOPT_MULTIBYTE
/*
* Disable multibyte without need for excessive '#if SHOPT_MULTIBYTE' peprocessor conditionals.
* If we redefine the maximum character size mbmax() as 1 byte, the mbwide() macro will always
* evaluate to 0. All the other multibyte macros have multibtye code conditional upon mbwide(),
* so the compiler should optimize all of that code away. See src/lib/libast/include/ast.h
*/
# undef mbmax
# define mbmax() 1
#endif

Ah, nice. That makes sense to me.

The casts to and from wint_t should not be necessary and are not done elsewhere in ksh.

Googling it gave me the impression that win_t could be defined as just about anything. If typecasts are not necessary then why did they bother with adding this type at all, instead of using something like int?

The casts just make explicit the implicit conversions that would happen anyway.

For fully portable code, int may not be large enough to hold all wide characters, platforms that define a 16-bit int may still need a 32-bit wint_t. ksh already assumes that they do fit into int though.

a > UCHAR_MAX and/or b > UCHAR_MAX are possible if a multibyte locale is active.

I'm aware. STRIP is defined as 0377, so (a&~STRIP)==0 is equivalent to !((a<0)||(a>UCHAR_MAX)).

@McDutchie
Copy link

McDutchie commented Mar 17, 2021

Hrm. The overlaid and charcmp functions are only ever called with the nocase flag set to 1 if the readonly AST PATH_ATTRIBUTES configuration variable contains c:

if(saveout=astconf("PATH_ATTRIBUTES",saveout,(char*)0))
nocase = (strchr(saveout,'c')!=0);

And that variable only contains c when we're compiling on Windows (_WINIX):
#define OP_path_attributes 7
{
&dynamic[OP_path_attributes+1],
"PATH_ATTRIBUTES",
#if _WINIX
"c",
#else
&null[0],
#endif
&null[0],
0,
15,
CONF_AST,
CONF_READONLY,
OP_path_attributes
},

So, on every non-Windows platform, that whole case insensitivity business is currently dead code.

I added __APPLE__ to that directive, and it's working fine: I now have case-insensitive file name completion to match my case-insensitive file system.

However, on both Windows and macOS, it is possible to mount case-sensitive file systems, so case sensitivity can depend on the particular directory you're in. So, it looks to me like this whole approach is inherently broken, like so much of the AT&T stuff. Sigh.

There is also a libast pathexists function that uses PATH_ATTRIBUTES to decide whether to use case insensitive string comparisons when optimising to minimize stat(2) calls.. Thankfully, ksh doesn't use it.

Maybe the most sensible resolution for now would be to override my no-new-features policy again and add a shell option for case insensitivity in file name completion. If so, it should be turned on by default on both Windows and macOS since case insensitivity is the default setting on those platforms.

@McDutchie
Copy link

There is more going on. If I reduce charcmp to a simple return(a==b); and modify libast to include c in the PATH_ATTRIBUTES configuration variable, then I still get case-insensitive completion. So that code must be superfluous as it is apparently handled elsewhere as well. No idea where yet.

@McDutchie
Copy link

The answer would be that file name completion is based on globbing, and globbing itself is made case insensitive in libast based on PATH_ATTRIBUTES, here:

static int
gl_attr(glob_t* gp, const char* path, int flags)
{
return strchr(astconf("PATH_ATTRIBUTES", path, NiL), 'c') ? GLOB_ICASE : 0;
}

So, it looks like that whole charcmp function is superfluous and can be removed.

It also seems like the path does get passed as a parameter when querying the PATH_ATTRIBUTES variable, so that needs more investigation. Why did this not work on macOS even though its default file system is case insensitive? Maybe an actual probe for a case insensitive file system is not implemented?

@hvdijk
Copy link
Author

hvdijk commented Mar 17, 2021

Don't you still need it even with case-insensitive globbing? Suppose you have files ABC, abc, and abd. Suppose you've typed a and want to complete that. Globbing a* will give you all three files but if you then look at the glob's results in a case sensitive manner, you don't notice that they all start with ab so you don't complete it to that.

As for why it did not work on macOS, an actual probe for a case insensitive file system seems to be implemented for systems that support _PC_PATH_ATTRIBUTES, which seems to be (judging from a quick search) just AT&T's UWIN. It would require a custom implementation for other operating systems.

@McDutchie
Copy link

Don't you still need it even with case-insensitive globbing? Suppose you have files ABC, abc, and abd.

That is not possible on a case insensitive file system. ABC and abc are the same file.

Suppose you've typed a and want to complete that. Globbing a* will give you all three files but if you then look at the glob's results in a case sensitive manner, you don't notice that they all start with ab so you don't complete it to that.

Yes, I'm trying it out (with files ABCD, abc, and abd) and I see what you mean now. The menu still appears as expected on double-tab, after AB, but the initial AB completion is removed from the command line until you then pick an item from the menu.

As for why it did not work on macOS, an actual probe for a case insensitive file system seems to be implemented for systems that support _PC_PATH_ATTRIBUTES, which seems to be (judging from a quick search) just AT&T's UWIN. It would require a custom implementation for other operating systems.

It might be worth doing sometime.

Or maybe there is a portable way. The globbing routine could do an access check on the first encountered file that contains at least one alphabetic character. If there are no such files, it doesn't matter anyway, so use the case sensitive default. Otherwise, change its name to all uppercase, or (if it only contains uppercase) all lowercase. Then check if they are the same file in the manner of test x -ef X, and if so, mark the directory as case insensitive. Is this a crazy idea?

@hvdijk
Copy link
Author

hvdijk commented Mar 17, 2021

Is this a crazy idea?

It's probably the only the portable-ish way of doing it and should give accurate results, so not crazy. There may be different platform-specific solutions that you can use instead though that might be good enough. A search brings up that pathconf can take _PC_CASE_SENSITIVE on macOS or _PC_CASE_INSENSITIVE on Cygwin, though I have not tested these.

@McDutchie
Copy link

There may be different platform-specific solutions that you can use instead though that might be good enough. A search brings up that pathconf can take _PC_CASE_SENSITIVE on macOS or _PC_CASE_INSENSITIVE on Cygwin, though I have not tested these.

Well, that turns out to be rather simple to add. I haven't tested Cygwin yet but it seems to work like a charm on macOS. It even does the right thing for pathnames that span case-insensitive and case-sensitive file systems. The support for that was already coded, but only worked on UWIN so far.

--- a/src/lib/libast/port/astconf.c
+++ b/src/lib/libast/port/astconf.c
@@ -706,6 +706,12 @@ format(register Feature_t* fp, const char* path, const char* value, unsigned int
 				}
 			*s = 0;
 		}
+#elif _lib_pathconf && defined(_PC_CASE_SENSITIVE)
+		/* macOS: support UWIN 'c' attribute for file system case insensitivity */
+		fp->value = pathconf(path, _PC_CASE_SENSITIVE) == 0L ? "c" : null;
+#elif _lib_pathconf && defined(_PC_CASE_INSENSITIVE)
+		/* Cygwin: support UWIN 'c' attribute for file system case insensitivity. TODO: test */
+		fp->value = pathconf(path, _PC_CASE_INSENSITIVE) > 0L ? "c" : null;
 #endif
 		break;
 

@McDutchie
Copy link

McDutchie commented Mar 18, 2021

No shell on the Mac currently does case-insensitive globbing, though. The patch above makes it do the right thing depending on the file system, but I wonder if it would surprise/upset users who are used to case-sensitive globbing even on case-insensitive file systems. This could be a problem especially when using rm and the like.

@posguy99
Copy link

posguy99 commented Mar 19, 2021

Answering the question asked here , I think if you do that, you violate least surprise. The expectation is that what I typed, was used as the match.

Probably not one person in 100 ever thinks about the underlying filesystem being case-insensitive, ESPECIALLY if it is case-preserving, as HFS+ and APFS are... or whack ideas like NTFS where the underlying filesystem is case-sensitive but the API calls are not.

macOS Catalina:

$ cat reproducer.sh
touch A
touch AB
touch AbC

echo A*
echo AB*
$ ksh reproducer.sh 
A AB AbC
AB

Mint Linux:

mint $ cat reproducer.sh 
touch A
touch AB
touch AbC

echo A*
echo AB*
mint $ ksh reproducer.sh 
A AB AbC
AB

I would not expect the second echo to produce AB AbC on macOS and not on Mint, and I know perfectly well that one of those is case-insensitive and the other is not.

Worse would be if the behavior was different.

It's an issue of "correct" vs "right".

@McDutchie
Copy link

Thanks for your thoughtful reply. I tend to agree.

But I don't want to just delete the code that does this, either. The functionality is basically done, it was clearly intended. And personally I like it.

ksh would be the only shell were case-insensitive globbing properly depends on the case-insensitivity of the file system, detected separately for each pathname component so it does the right thing when mounting a case-sensitive filesystem in a directory on a case-insensitive file system, or vice versa. It's a nice selling point.

So, to avoid surprises, I think a new shell option (off by default) is probably the right way forward.

It's going to be a little more complicated than usual, as globbing is implemented entirely in libast, and ksh has basically nothing to do with it. So I'll need to add some way to allow ksh to influence libast's globbing behaviour.

McDutchie added a commit that referenced this issue Mar 22, 2021
One of the best-kept secrets of libast/ksh93 is that the code
includes support for case-insensitive file name generation (a.k.a.
pathname expansion, a.k.a. globbing) as well as case-insensitive
file name completion on interactive shells, depending on whether
the file system is case-insensitive or not. This is transparently
determined for each directory, so a path pattern that spans
multiple file systems can be part case-sensitive and part case-
insensitive. In more precise terms, each slash-separated path name
component pattern P is treated as ~(i:P) if its parent directory
exists on a case-insensitive file system. I recently discovered
this while dealing with <#223>.

However, that support is dead code on almost all current systems.
It depends on pathconf(2) having a _PC_PATH_ATTRIBUTES selector.
The 'c' attribute is supposedly returned if the given directory is
on a case insensitive file system. There are other attributes as
well (at least 'l', see src/lib/libcmd/rm.c). However, I have been
unable to find any system, current or otherwise, that has
_PC_PATH_ATTRIBUTES. Google and mailing list searches yield no
relevant results at all. If anyone knows of such a system, please
add a comment to this commit on GitHub, or email me.

An exception is Cygwin/Windows, on which the "c" attribute was
simply hardcoded, so globbing/completion is always case-
insensitive. As of Windows 10, that is wrong, as it added the
possibility to mount case-sensitive file systems.

On the other hand, this was never activated on the Mac, even
though macOS has always used a case-insensitive file like Windows.
But, being UNIX, it can also mount case-sensitive file systems.

Finally, Linux added the possibility to create individual case-
insensitive ext4 directories fairly recently, in version 5.2.
https://www.collabora.com/news-and-blog/blog/2020/08/27/using-the-linux-kernel-case-insensitive-feature-in-ext4/

So, since this functionality latently exists in the code base, and
three popular OSs now have relevant file system support, we might
as well make it usable on those systems. It's a nice idea, as it
intuitively makes sense for globbing and completion behaviour to
auto-adapt to file system case insensitivity on a per-directory
basis. No other shell does this, so it's a nice selling point, too.

However, the way it is coded, this is activated unconditionally on
supported systems. That is not a good idea. It will surprise users.
Since globbing is used with commands like 'rm', we do not want
surprises. So this commit makes it conditional upon a new shell
option called 'globcasedetect'. This option is only compiled into
ksh on systems where we can actually detect FS case insensitivity.

To implement this, libast needs some public API additions first.

*** libast changes ***

src/lib/libast/features/lib:
- Add probes for the linux/fs.h and sys/ioctl.h headers.
  Linux needs these to use ioctl(2) in pathicase(3) (see below).

src/lib/libast/path/pathicase.c,
src/lib/libast/include/ast.h,
src/lib/libast/man/path.3,
src/lib/libast/Mamfile:
- Add new pathicase(3) public API function. This uses whatever
  OS-specific method it can detect at compile time to determine if
  a particular path is on a case-insensitive file system. If no
  method is available, it only sets errno to ENOSYS and returns -1.
  Currently known to work on: macOS, Cygwin, Linux 5.2+, QNX 7.0+.
- On systems (if any) that have the mysterious _PC_PATH_ATTRIBUTES
  selector for pathconf(2), call astconf(3) and check for the 'c'
  attribute to determine case insensitivity. This should preserve
  compatibility with any such system.

src/lib/libast/port/astconf.c:
- dynamic[]: As case-insensitive globbing is now optional on all
  systems, do not set the 'c' attribute by default on _WINIX
  (Cygwin/Windows) systems.
- format(): On systems that do not have _PC_PATH_ATTRIBUTES, call
  pathicase(3) to determine the value for the "c" (case
  insensitive) attribute only. This is for compatibility as it is
  more efficient to call pathicase(3) directly.

src/lib/libast/misc/glob.c,
src/lib/libast/include/glob.h:
- Add new GLOB_DCASE public API flag to glob(3). This is like
  GLOB_ICASE (case-insensitive matching) except it only makes the
  match case-insensitive if the file system for the current
  pathname component is determined to be case-insensitive.
- gl_attr(): For efficiency, call pathicase(3) directly instead of
  via astconf(3).
- glob_dir(): Only call gl_attr() to determine file system case
  insensitivity if the GLOB_DCASE flag was passed. This makes case
  insensitive globbing optional on all systems.
- glob(): The options bitmask needs to be widened to fit the new
  GLOB_DCASE option. Define this centrally in a new GLOB_FLAGMASK
  macro so it is easy to change it along with GLOB_MAGIC (which
  uses the remaining bits for a sanity check bit pattern).

src/lib/libast/path/pathexists.c:
- For efficiency, call pathicase(3) directly instead of via
  astconf(3).

*** ksh changes ***

src/cmd/ksh93/features/options,
src/cmd/ksh93/SHOPT.sh:
- Add new SHOPT_GLOBCASEDET compile-time option. Set it to probe
  (empty) by default so that the shell option is compiled in on
  supported systems only, which is determined by new iffe feature
  test that checks if pathicase(3) returns an ENOSYS error.

src/cmd/ksh93/data/options.c,
src/cmd/ksh93/include/shell.h:
- Add -o globcasedetect shell option if compiling with
  SHOPT_GLOBCASEDET.

src/cmd/ksh93/sh/expand.c: path_expand():
- Pass the new GLOB_DCASE flag to glob(3) if the
  globcasedetect/SH_GLOBCASEDET shell option is set.

src/cmd/ksh93/edit/completion.c:
- While file listing/completion is based on globbing and
  automatically becomes case-insensitive when globbing does, it
  needs some additional handling to make a string comparison
  case-insensitive in corresponding cases. Otherwise, partial
  completions may be deleted from the command line upon pressing
  tab. This code was already in ksh 93u+ and just needs to be
  made conditional upon SHOPT_GLOBCASEDET and globcasedetect.
- For efficiency, call pathicase(3) directly instead of via
  astconf(3).

src/cmd/ksh93/sh.1:
- Document the new globcasedetect shell option.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something is not working
Projects
None yet
Development

No branches or pull requests

3 participants