Support Unicode filenames on Windows #537

tyomitch · 2020-09-13T16:34:31Z

As of now, fopenReadStream and fopenWriteStream call fopen(3) which on Windows maps to CreateFileA.
This makes it impossible to read or write files whose names include Unicode codepoints not representable in the user's default code page.

Is there any hope of upgrading Leptonica to use Unicode strings, either UTF-8 or UTF-16, for filenames, and to use _wfopen on Windows?
To avoid breaking backwards compatibility, each API entry point accepting a filename would need to be duplicated, e.g. as pixReadW, pixWriteW, pixaReadMultipageTiffW.
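For illustration, here is a minimal sketch of what a UTF-8-aware open could look like. The name `fopen_utf8` is hypothetical and is not part of Leptonica; the sketch assumes callers pass UTF-8 and that `MultiByteToWideChar` and `_wfopen` are available on the Windows side. On other platforms it simply falls through to `fopen`.

```c
#include <stdio.h>
#ifdef _WIN32
#include <stdlib.h>
#include <windows.h>
#endif

/* Hypothetical helper (not a Leptonica function): open a file whose
 * name is given as a UTF-8 string. */
FILE *fopen_utf8(const char *utf8_name, const char *mode)
{
#ifdef _WIN32
    wchar_t wmode[16];
    wchar_t *wname;
    FILE *fp;
    int len;

    /* First call: ask for the required length in wchar_t units,
     * including the terminating NUL (because the input length is -1). */
    len = MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
                              utf8_name, -1, NULL, 0);
    if (len == 0)
        return NULL;  /* invalid UTF-8 or other conversion failure */

    wname = (wchar_t *)malloc((size_t)len * sizeof(wchar_t));
    if (wname == NULL)
        return NULL;

    /* Second call: convert the name, then the (ASCII) mode string. */
    if (MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
                            utf8_name, -1, wname, len) == 0 ||
        MultiByteToWideChar(CP_UTF8, 0, mode, -1, wmode, 16) == 0) {
        free(wname);
        return NULL;
    }

    fp = _wfopen(wname, wmode);
    free(wname);
    return fp;
#else
    /* Assumption: on non-Windows systems the narrow name is passed
     * through to the filesystem unchanged. */
    return fopen(utf8_name, mode);
#endif
}
```

If filenames were defined to be UTF-8, a wrapper along these lines could sit behind the existing `char *` entry points, which might avoid duplicating every API as a `W` variant. That is a design assumption, not something the library does today.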

DanBloomberg · 2020-09-13T18:05:41Z

To answer your question, this suggestion is not practical for leptonica. Such filenames need to be converted outside the library.

Dan

tyomitch · 2020-09-13T18:21:09Z

Dan, thank you for the quick response.
The problem is that conversion of filenames is impossible outside the library because there's no way fopen can open a file whose name includes Unicode codepoints not representable in the user's default code page.
The only workaround currently possible is to rename files into temporary ASCII names before and after processing by Leptonica.
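A rough sketch of that rename workaround, to show what callers currently have to do. The names `with_ascii_alias`, `process_fn`, and `count_bytes` are hypothetical; `count_bytes` only stands in for whatever Leptonica call would process the file. On Windows the sketch assumes `_wrename` is available for the wide-character original name.

```c
#include <stdio.h>
#include <string.h>
#ifdef _WIN32
#include <wchar.h>
typedef wchar_t native_char;   /* original name is UTF-16 on Windows */
#else
typedef char native_char;      /* original name is a byte string elsewhere */
#endif

/* Hypothetical callback type: receives the temporary ASCII path. */
typedef int (*process_fn)(const char *ascii_path, void *ctx);

/* Rename real_name to ascii_alias, run fn on the alias, rename back.
 * Returns fn's result, or -1 if either rename fails. */
int with_ascii_alias(const native_char *real_name, const char *ascii_alias,
                     process_fn fn, void *ctx)
{
    int result;
#ifdef _WIN32
    wchar_t native_alias[260];
    size_t i, n = strlen(ascii_alias);

    if (n >= 260)
        return -1;
    /* ASCII maps one-to-one onto UTF-16 code units (copies the NUL too). */
    for (i = 0; i <= n; i++)
        native_alias[i] = (wchar_t)(unsigned char)ascii_alias[i];

    if (_wrename(real_name, native_alias) != 0)
        return -1;
    result = fn(ascii_alias, ctx);
    if (_wrename(native_alias, real_name) != 0)
        return -1;
#else
    if (rename(real_name, ascii_alias) != 0)
        return -1;
    result = fn(ascii_alias, ctx);
    if (rename(ascii_alias, real_name) != 0)
        return -1;
#endif
    return result;
}

/* Stand-in for real processing: counts the bytes in the file. */
int count_bytes(const char *path, void *ctx)
{
    long *count = (long *)ctx;
    FILE *fp = fopen(path, "rb");

    if (fp == NULL)
        return -1;
    *count = 0;
    while (fgetc(fp) != EOF)
        (*count)++;
    fclose(fp);
    return 0;
}
```

Note the drawbacks: the file briefly disappears under its real name, a crash between the two renames leaves it under the alias, and the approach needs write access to the directory.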

stweil · 2020-09-13T19:14:39Z

@tyomitch, is this true for any fopen on Windows? That function is part of the C library, so the exact implementation might differ depending on the C library used.

Do you have a link to some documentation which explains the described restriction?

stweil · 2020-09-13T19:30:24Z

According to MS documentation, it is possible to set the code page to UTF-8. So any Windows program can set the desired code page and there seems to be no need to handle that mess in the Leptonica code.

tyomitch · 2020-09-13T20:12:29Z

> is this true for any fopen on Windows?

Not necessarily; as for the three stdlib implementations that README.md mentions (MSVC, MinGW, Cygwin):

MSVC reference: https://docs.microsoft.com/en-us/cpp/c-runtime-library/reference/fopen-wfopen

> The fopen function opens the file that is specified by filename. By default, a narrow filename string is interpreted using the ANSI codepage (CP_ACP). In Windows Desktop applications this can be changed to the OEM codepage (CP_OEMCP) by using the SetFileApisToOEM function. You can use the AreFileApisANSI function to determine whether filename is interpreted using the ANSI or the system default OEM codepage. _wfopen is a wide-character version of fopen; the arguments to _wfopen are wide-character strings. Otherwise, _wfopen and fopen behave identically.

MinGW AFAICT forwards calls to fopen to the stdlib it was itself compiled with.

Cygwin uses special escape sequences for a workaround, as documented in https://fossies.org/windows/misc/cygwin-20200909-src-x86_64.tar.xz:b/cygwin-snapshot-20200909-1/winsup/cygwin/strfuncs.cc lines 386-394

> If a wide character in a filename has no representation in the current multibyte charset, then usually you wouldn't be able to access the file. To fix this problem, sys_wcstombs creates a replacement multibyte sequences for the non-representable wide-char. The sequence starts with an ASCII CAN (0x18, Ctrl-X), followed by the UTF-8 representation of the character. The sys_(cp_)mbstowcs function detects ASCII CAN characters in the input multibyte string and converts the following multibyte sequence in by treating it as an UTF-8 char. If that fails, the ASCII CAN was probably standalone and it gets just copied over as ASCII CAN.
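To make the escape scheme concrete, here is a small illustrative sketch. It is not Cygwin's code: `can_escape` and `can_unescape` are hypothetical names, and the sketch handles a single code point rather than a whole filename. It assumes the escaped character always needs a multibyte UTF-8 sequence, so a CAN followed by an ASCII byte is treated as a standalone CAN.

```c
#include <stddef.h>

#define ASCII_CAN 0x18  /* Ctrl-X, used as the escape marker */

/* Encode one code point as UTF-8. Returns the number of bytes written
 * (1 to 4), or 0 for surrogates and values above U+10FFFF. */
static size_t utf8_encode(unsigned long cp, unsigned char *out)
{
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}

/* Write CAN followed by the UTF-8 form of cp. out must hold 5 bytes.
 * Returns the number of bytes written, or 0 if cp is not encodable. */
size_t can_escape(unsigned long cp, unsigned char *out)
{
    size_t n = utf8_encode(cp, out + 1);

    if (n == 0)
        return 0;
    out[0] = ASCII_CAN;
    return n + 1;
}

/* Decode a sequence that starts with CAN. Returns the number of bytes
 * consumed and stores the code point in *cp. If the bytes after CAN are
 * not a well-formed multibyte UTF-8 sequence, the CAN is treated as
 * standalone: *cp is set to CAN and 1 is returned. Returns 0 if the
 * input does not start with CAN. Overlong forms are not rejected. */
size_t can_unescape(const unsigned char *in, size_t len, unsigned long *cp)
{
    size_t need, i;
    unsigned long v;

    if (len == 0 || in[0] != ASCII_CAN)
        return 0;
    *cp = ASCII_CAN;
    if (len < 2)
        return 1;

    if ((in[1] & 0xE0) == 0xC0)      { need = 2; v = in[1] & 0x1F; }
    else if ((in[1] & 0xF0) == 0xE0) { need = 3; v = in[1] & 0x0F; }
    else if ((in[1] & 0xF8) == 0xF0) { need = 4; v = in[1] & 0x07; }
    else
        return 1;

    if (len < 1 + need)
        return 1;
    for (i = 1; i < need; i++) {
        if ((in[1 + i] & 0xC0) != 0x80)
            return 1;
        v = (v << 6) | (unsigned long)(in[1 + i] & 0x3F);
    }
    *cp = v;
    return 1 + need;
}
```

For example, U+65E5 would be written as the four bytes 0x18 0xE6 0x97 0xA5, and reading those bytes back yields U+65E5 again.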

> According to MS documentation, it is possible to set the code page to UTF-8. So any Windows program can set the desired code page and there seems to be no need to handle that mess in the Leptonica code.

Only available in Windows Version 1903 (May 2019 Update) or above :-(
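Because of that version requirement, a program that wants to rely on the UTF-8 code page would probably need a runtime check. A minimal sketch, assuming the Win32 calls `GetACP` and `AreFileApisANSI`; the function name `narrow_filenames_are_utf8` is hypothetical:

```c
#ifdef _WIN32
#include <windows.h>
#endif

/* Hypothetical helper: returns 1 if narrow (char *) filenames can be
 * expected to be interpreted as UTF-8, 0 otherwise. */
int narrow_filenames_are_utf8(void)
{
#ifdef _WIN32
    /* Narrow file APIs use the ANSI code page unless the process has
     * switched them to OEM; the ANSI code page is UTF-8 only when the
     * process opted in (CP_UTF8 is 65001). */
    return AreFileApisANSI() && GetACP() == CP_UTF8;
#else
    /* Assumption: non-Windows systems pass the bytes through, and
     * filenames are UTF-8 by convention. */
    return 1;
#endif
}
```

When this returns 0 on Windows, the caller is back to the original problem and has to fall back to a wide-character path or a rename workaround.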

amitdo · 2021-06-25T15:58:01Z

Dan, any reason not to close this issue ('Wontfix')?

DanBloomberg · 2021-06-25T18:43:17Z

No reason not to. "Won't fix" is accurate.

amitdo · 2021-06-27T08:59:34Z

You forgot to click the 'Close issue' button.

DanBloomberg · 2021-06-27T17:06:34Z

Hey Amit -- I was going to give you that pleasure :-)

nguyenq mentioned this issue Nov 26, 2020

win10 chinese filename. TesseractException: Error during processing page. nguyenq/tess4j#75

Open

stweil closed this as completed Jun 27, 2021

stweil added the wontfix label Jun 27, 2021

amitdo mentioned this issue Jan 4, 2022

Add support for Unicode filenames on MS Windows tesseract-ocr/tesseract#3709

Open
