Fix iconv /w locale and UTF-8 input charset #9149

Closed

mvorisek
Contributor

@mvorisek mvorisek commented Jul 26, 2022

Can someone please help me understand what is going on with iconv when a locale is set?

demo: https://3v4l.org/RmIZA

var_dump(setlocale(LC_ALL, "en_US.utf8"));
$in_charset          = 'UTF-8';
$out_charset         = 'ASCII//TRANSLIT';
$string_to_translate = 'Žluťoučký kůň\n';

$string_out = iconv($in_charset, $out_charset, $string_to_translate);

var_dump($string_out);

var_dump(setlocale(LC_ALL, "C"));
$in_charset          = 'UTF-8';
$out_charset         = 'ASCII//TRANSLIT';
$string_to_translate = 'Žluťoučký kůň\n';

$string_out = iconv($in_charset, $out_charset, $string_to_translate);

var_dump($string_out);

On Linux, this outputs:

string(10) "en_US.utf8"
string(15) "Zlutoucky kun\n"
string(1) "C"
string(15) "?lu?ou?k? k??\n"

On Windows:

bool(false)
string(16) "Zlutouck'y kun\n"
string(1) "C"
string(16) "Zlutouck'y kun\n"

On Alpine OS/musl libc:

string(10) "en_US.utf8"
string(16) "Zlutouck'y kun\n"
string(1) "C"
string(16) "Zlutouck'y kun\n"

The input is always UTF-8, so the result should be independent of the locale. The expected output, Zlutoucky kun, consists of lower (7-bit) ASCII only, so the locale should not affect it either. But for some reason it does on Linux, and on Windows/Alpine OS/FreeBSD the result is independent of the locale, but always wrong.

It seems that iconv "parses" the input differently (or wrongly) depending on the locale, even though the input encoding is specified as UTF-8.

@mvorisek mvorisek force-pushed the iconv_must_not_depend_on_locale branch from cd13c25 to e1bfa7d Compare July 26, 2022 07:50
@mvorisek mvorisek changed the title Fix iconv /w locale conversion Fix iconv /w locale and UTF-8 input charset Jul 26, 2022
@cmb69
Member

cmb69 commented Jul 26, 2022

I'm not sure that "Zlutouck'y kun\n" is wrong. Apparently, both "Žluťoučký" and "Žluťoučky" are Czech words but with different meanings (please correct me if I'm wrong), and as such the apostrophe may be added deliberately to make it possible to distinguish the two.

Anyhow, I don't think this really depends on the operating system, but rather on the ICONV_IMPL (and maybe the ICONV_VERSION). 3v4l.org uses glibc's; our Windows builds use libiconv.
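
To see which implementation a given PHP build is linked against, the two constants mentioned above can simply be printed. This is a minimal sketch; the helper name `describe_iconv` is hypothetical and only used for illustration:

```php
<?php
// ICONV_IMPL and ICONV_VERSION are constants defined by PHP's iconv extension.
// describe_iconv() is a hypothetical helper name, not part of PHP.
function describe_iconv(): string
{
    return sprintf(
        '%s %s (PHP %s on %s)',
        ICONV_IMPL,      // e.g. "glibc" or "libiconv", depending on the build
        ICONV_VERSION,   // version string reported by that implementation
        PHP_VERSION,
        PHP_OS_FAMILY
    );
}

echo describe_iconv(), PHP_EOL;
```

Including this line in a bug report makes it easier to tell an operating system difference apart from an iconv implementation difference.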

@mvorisek
Contributor Author

I am a native Czech speaker 😃 and Ž, ť, č, ý are characters with "Czech diacritics". Such diacritics only soften or lengthen the pronunciation and carry no other meaning, so only Zlutoucky is correct: in Czech, you simply and intuitively strip the diacritics from the standard/English letters. Zlutouck'y has no interpretation in the Czech language at all; ' (single quote) is not even a valid symbol in Czech (only double quotes are officially valid).

But I opened this issue not because of this difference, although it is wrong, but because on regular Linux with glibc the output depends on the locale. As long as the input encoding is UTF-8 and the target encoding is ASCII (so the result should contain only 7-bit ASCII characters), it must not depend on the locale; see the first example:

string(10) "en_US.utf8"
string(15) "Zlutoucky kun\n"
string(1) "C"
string(15) "?lu?ou?k? k??\n"

@cmb69
Member

cmb69 commented Jul 26, 2022

I don't think this is a PHP issue, and as such we likely can't do anything about it. The locale issue is likely something that would need to be changed in glibc (or we could drop supporting anything but libiconv, but that may cause major headaches for users and distro managers). The transliteration of ý to 'y is likely a feature of libiconv; ö is transliterated to "o.

@mvorisek
Contributor Author

So even the ?lu?ou?k? k??\n with standard glibc cannot be solved in php-src? If not, close this PR.

@cmb69
Member

cmb69 commented Jul 27, 2022

> So even the ?lu?ou?k? k??\n with standard glibc cannot be solved in php-src?

I'm not 100% sure about that. One would need to debug, or at least check, the glibc iconv() implementation. It may use some ctype functions, such as isprint(3), and these are locale-aware.
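
If the behaviour does hinge on the character-type locale, a userland workaround would be to switch LC_CTYPE only around the iconv() call and restore it afterwards. The sketch below rests on that assumption (the demo above only shows a dependency on LC_ALL, not on LC_CTYPE specifically); the helper name `translit_to_ascii` and the list of candidate locale names are hypothetical, and the named locales must actually be installed on the system:

```php
<?php
// Hypothetical helper, not part of php-src or of this PR.
// Assumption: the transliteration result depends on LC_CTYPE.
function translit_to_ascii(
    string $utf8,
    array $locales = ['en_US.UTF-8', 'en_US.utf8', 'C.UTF-8']
): string|false {
    // Passing "0" only queries the current LC_CTYPE setting.
    $previous = setlocale(LC_CTYPE, '0');

    // setlocale() tries each candidate in turn and returns false
    // if none of them is available on this system.
    $switched = setlocale(LC_CTYPE, $locales);

    try {
        return iconv('UTF-8', 'ASCII//TRANSLIT', $utf8);
    } finally {
        // Restore the caller's locale so the change does not leak.
        if ($switched !== false && $previous !== false) {
            setlocale(LC_CTYPE, $previous);
        }
    }
}

var_dump(translit_to_ascii('Žluťoučký kůň'));
```

Note that setlocale() changes the locale for the whole process, so in a threaded SAPI this temporary switch can affect other requests.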

@mvorisek
Contributor Author

I am closing this PR as I cannot solve it, but the results are very strange and make iconv almost unusable in production unless a specific locale can be guaranteed.
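
For code that cannot guarantee a locale, one alternative to iconv is ICU transliteration from the intl extension. This is a sketch of a different technique, not something this PR changed; it assumes ext-intl is installed, and the helper name `ascii_fold` is hypothetical:

```php
<?php
// Requires ext-intl (ICU). An alternative to iconv(), shown for comparison.
// ascii_fold() is a hypothetical helper name used for illustration.
function ascii_fold(string $utf8): string|false
{
    // "Any-Latin" converts other scripts to Latin; "Latin-ASCII" then
    // replaces accented Latin letters with ASCII equivalents.
    return transliterator_transliterate('Any-Latin; Latin-ASCII', $utf8);
}

var_dump(ascii_fold('Žluťoučký kůň'));
```

The transform rules here come from ICU's own data rather than from the C library's iconv, which is the reason for suggesting it in this context.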

@mvorisek mvorisek closed this Apr 19, 2023
@mvorisek mvorisek deleted the iconv_must_not_depend_on_locale branch April 19, 2023 11:20