Support Unicode filenames on Windows #537

Closed
tyomitch opened this issue Sep 13, 2020 · 9 comments

@tyomitch

As of now, fopenReadStream and fopenWriteStream call fopen(3), which on Windows maps to CreateFileA.
This makes it impossible to read or write files whose names include Unicode code points that are not representable in the user's default code page.

Is there any hope of upgrading Leptonica to use Unicode strings, either UTF-8 or UTF-16, for filenames, and to use _wfopen on Windows?
To avoid breaking backwards compatibility, each API entry point accepting a filename would need to be duplicated, e.g. as pixReadW, pixWriteW, pixaReadMultipageTiffW.
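
For illustration, the kind of wrapper this request has in mind could look roughly like the sketch below. `fopen_utf8` is a hypothetical name, not an existing Leptonica function. The sketch assumes callers pass UTF-8, converts it to UTF-16 with `MultiByteToWideChar` on Windows, and simply forwards to `fopen` on other platforms, where narrow paths are assumed to already be UTF-8.

```c
#include <assert.h>
#include <stdio.h>
#ifdef _WIN32
#include <stdlib.h>
#include <windows.h>
#endif

/* Hypothetical helper (not part of Leptonica): open a file whose name is
 * given as UTF-8. On Windows the name and mode are converted to UTF-16 and
 * passed to _wfopen; elsewhere fopen is assumed to accept UTF-8 directly. */
FILE *fopen_utf8(const char *utf8_path, const char *mode)
{
    assert(utf8_path != NULL && mode != NULL);
#ifdef _WIN32
    wchar_t wmode[16];
    wchar_t *wpath;
    FILE *fp;
    int n;

    /* With a NULL output buffer the call returns the required length in
     * wchar_t units, including the terminating null. */
    n = MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
                            utf8_path, -1, NULL, 0);
    if (n <= 0)
        return NULL;                      /* name is not valid UTF-8 */
    wpath = (wchar_t *)malloc((size_t)n * sizeof(wchar_t));
    if (wpath == NULL)
        return NULL;
    if (MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
                            utf8_path, -1, wpath, n) <= 0 ||
        MultiByteToWideChar(CP_UTF8, 0, mode, -1, wmode, 16) <= 0) {
        free(wpath);
        return NULL;
    }
    fp = _wfopen(wpath, wmode);
    free(wpath);
    return fp;
#else
    return fopen(utf8_path, mode);
#endif
}
```

With a helper like this, the narrow entry points could keep their current signatures and only change how the `char *` name is interpreted, which is one alternative to duplicating every function with a `W` suffix.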

@DanBloomberg
Owner

To answer your question, this suggestion is not practical for leptonica. Such filenames need to be converted outside the library.

Dan

@tyomitch
Author

Dan, thank you for the quick response.
The problem is that conversion of filenames is impossible outside the library because there's no way fopen can open a file whose name includes Unicode codepoints not representable in the user's default code page.
The only workaround currently possible is to rename files into temporary ASCII names before and after processing by Leptonica.
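
The renaming workaround can be sketched as follows. All names here (`rename_utf8`, `with_ascii_name`, `count_bytes_narrow`) are hypothetical, and `count_bytes_narrow` is only a stand-in for a library call that accepts a narrow filename. The sketch assumes the original name is held as UTF-8 and that the temporary ASCII name does not already exist.

```c
#include <assert.h>
#include <stdio.h>
#ifdef _WIN32
#include <windows.h>
#endif

/* Hypothetical helper: rename a file between two UTF-8 encoded names.
 * Uses _wrename on Windows and rename elsewhere. Returns 0 on success. */
int rename_utf8(const char *from_utf8, const char *to_utf8)
{
    assert(from_utf8 != NULL && to_utf8 != NULL);
#ifdef _WIN32
    wchar_t wfrom[MAX_PATH], wto[MAX_PATH];
    if (MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
                            from_utf8, -1, wfrom, MAX_PATH) <= 0 ||
        MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
                            to_utf8, -1, wto, MAX_PATH) <= 0)
        return -1;
    return _wrename(wfrom, wto);
#else
    return rename(from_utf8, to_utf8);
#endif
}

/* Stand-in for a library call that only understands narrow names:
 * returns the file size in bytes, or -1 if the file cannot be opened. */
int count_bytes_narrow(const char *ascii_path)
{
    FILE *fp = fopen(ascii_path, "rb");
    int n = 0;
    if (fp == NULL)
        return -1;
    while (fgetc(fp) != EOF)
        n++;
    fclose(fp);
    return n;
}

typedef int (*narrow_path_fn)(const char *ascii_path);

/* Move the file to an ASCII-only temporary name, run the narrow-name
 * callback on it, then move it back. Returns the callback's result,
 * or -1 if either rename fails. */
int with_ascii_name(const char *utf8_path, const char *ascii_tmp,
                    narrow_path_fn process)
{
    int result;
    if (rename_utf8(utf8_path, ascii_tmp) != 0)
        return -1;
    result = process(ascii_tmp);
    /* Restore the original name even if processing failed. */
    if (rename_utf8(ascii_tmp, utf8_path) != 0)
        return -1;
    return result;
}
```

The sketch also shows why the workaround is unattractive: the file is briefly visible under a different name, and a crash between the two renames leaves it there.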

@stweil
Collaborator

stweil commented Sep 13, 2020

@tyomitch, is this true for any fopen on Windows? That function is part of the C library, so the exact implementation might differ depending on the C library used.

Do you have a link to some documentation which explains the described restriction?

@stweil
Collaborator

stweil commented Sep 13, 2020

According to MS documentation, it is possible to set the code page to UTF-8. So any Windows program can set the desired code page and there seems to be no need to handle that mess in the Leptonica code.

@tyomitch
Author

> is this true for any fopen on Windows?

Not necessarily; as for the three stdlib implementations that README.md mentions (MSVC, MinGW, Cygwin):

MSVC reference: https://docs.microsoft.com/en-us/cpp/c-runtime-library/reference/fopen-wfopen

> The fopen function opens the file that is specified by filename. By default, a narrow filename string is interpreted using the ANSI codepage (CP_ACP). In Windows Desktop applications this can be changed to the OEM codepage (CP_OEMCP) by using the SetFileApisToOEM function. You can use the AreFileApisANSI function to determine whether filename is interpreted using the ANSI or the system default OEM codepage. _wfopen is a wide-character version of fopen; the arguments to _wfopen are wide-character strings. Otherwise, _wfopen and fopen behave identically.

MinGW, AFAICT, forwards fopen calls to the C runtime it was itself compiled with.

Cygwin uses special escape sequences for a workaround, as documented in https://fossies.org/windows/misc/cygwin-20200909-src-x86_64.tar.xz:b/cygwin-snapshot-20200909-1/winsup/cygwin/strfuncs.cc lines 386-394

> If a wide character in a filename has no representation in the current multibyte charset, then usually you wouldn't be able to access the file. To fix this problem, sys_wcstombs creates a replacement multibyte sequences for the non-representable wide-char. The sequence starts with an ASCII CAN (0x18, Ctrl-X), followed by the UTF-8 representation of the character. The sys_(cp_)mbstowcs function detects ASCII CAN characters in the input multibyte string and converts the following multibyte sequence in by treating it as an UTF-8 char. If that fails, the ASCII CAN was probably standalone and it gets just copied over as ASCII CAN.
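
To make the escape scheme concrete, here is a minimal sketch of the encoding direction only. It is an illustration, not Cygwin's actual code: it works on whole code points rather than UTF-16 units, and it assumes for simplicity that only ASCII is representable in the current charset. The function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define ASCII_CAN 0x18  /* Ctrl-X, used as the escape marker */

/* Encode one Unicode code point as UTF-8 into out (at least 4 bytes).
 * Returns the number of bytes written, or 0 for an invalid code point. */
static size_t utf8_encode(unsigned long cp, unsigned char *out)
{
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;                     /* surrogates are not characters */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp < 0x110000) {
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}

/* Write the multibyte form of cp into out (at least 5 bytes).
 * Representable (here: ASCII) characters are copied as-is; anything else
 * becomes ASCII CAN followed by its UTF-8 bytes.
 * Returns the number of bytes written, or 0 for an invalid code point. */
size_t escape_codepoint(unsigned long cp, unsigned char *out)
{
    size_t n;
    assert(out != NULL);
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    n = utf8_encode(cp, out + 1);
    if (n == 0)
        return 0;
    out[0] = ASCII_CAN;
    return n + 1;
}
```

For example, U+044F would be written as the bytes 0x18 0xD1 0x8F under these assumptions.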

> According to MS documentation, it is possible to set the code page to UTF-8. So any Windows program can set the desired code page and there seems to be no need to handle that mess in the Leptonica code.

Only available in Windows Version 1903 (May 2019 Update) or above :-(
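
Because of that version restriction, a program that wants to rely on the UTF-8 code page would probably need to check at run time whether it is actually in effect. A minimal sketch, assuming the check is done with `GetACP` and `AreFileApisANSI`; the function name `narrow_paths_are_utf8` is hypothetical, and the non-Windows branch simply assumes UTF-8:

```c
#ifdef _WIN32
#include <windows.h>
#endif

/* Returns 1 if narrow (char *) filenames are expected to be interpreted
 * as UTF-8 in this process, 0 otherwise.
 * Assumption: on non-Windows platforms narrow paths are treated as UTF-8. */
int narrow_paths_are_utf8(void)
{
#ifdef _WIN32
    /* File APIs must be in ANSI mode (not OEM), and the active ANSI
     * code page must be UTF-8 (65001). */
    return AreFileApisANSI() && GetACP() == CP_UTF8;
#else
    return 1;
#endif
}
```

If the check returns 0, the caller is back to the situation described at the top of this issue and has to fall back to something like the renaming workaround.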

@amitdo
Contributor

amitdo commented Jun 25, 2021

Dan, any reason not to close this issue ('Wontfix')?

@DanBloomberg
Owner

DanBloomberg commented Jun 25, 2021 via email

@amitdo
Contributor

amitdo commented Jun 27, 2021

You forgot to click the 'Close issue' button.

@stweil stweil closed this as completed Jun 27, 2021
@stweil stweil added the wontfix label Jun 27, 2021
@DanBloomberg
Owner

DanBloomberg commented Jun 27, 2021 via email
