Exception in ucs2lib_utf8_encoder in _bootstrap_python #94526

Closed
pgy opened this issue Jul 3, 2022 · 2 comments
Labels: 3.11 (only security fixes), 3.12 (bugs and security fixes), interpreter-core (Objects, Python, Grammar, and Parser dirs), type-bug (An unexpected behavior, bug, or error)

pgy commented Jul 3, 2022

Bug report

I am building Python on Linux in a directory with a non-ASCII name and get the following error:

./_bootstrap_python ./Programs/_freeze_module.py abc ./Lib/abc.py Python/frozen_modules/abc.h
Exception ignored error evaluating path:
Traceback (most recent call last):
  File "<frozen getpath>", line 349, in <module>
ModuleNotFoundError: No module named 'encodings'
Fatal Python error: error evaluating path
Python runtime state: core initialized

Current thread 0x00007f16d6c10740 (most recent call first):
  <no Python frame>
make: *** [Makefile:1218: Python/frozen_modules/abc.h] Error 1

The problem seems to be that the limited environment used to execute the precompiled Modules/getpath.py does not have the encodings module.

I tracked the exception to a PyImport_ImportModule("encodings") call in Python/codecs.c. That call is a consequence of PyCodec_LookupError being invoked in ucs2lib_utf8_encoder, which I think is reached from getpath_dirname.

The original string is /home/pgy/letöltések/cpython/_bootstrap_python, and the Unicode object that ucs2lib_utf8_encoder receives as its argument is:

(rr) p PyObject_Print(unicode, stderr, 0)
'/home/pgy/let\udcc3\udcb6lt\udcc3\udca9sek/cpython/_bootstrap_python'
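
The printed object is consistent with the path bytes having been decoded with the surrogateescape error handler by a decoder that rejected the non-ASCII bytes. The following is a minimal sketch of that mapping, not the actual startup code. It assumes an ASCII decoder for illustration; which encoding the bootstrap interpreter really used is not stated in this report.

```python
# UTF-8 bytes of the directory name as the filesystem stores them.
raw_path = "/home/pgy/letöltések/cpython/_bootstrap_python".encode("utf-8")

# Assumption: the decoder cannot handle non-ASCII bytes (modeled as ASCII).
# surrogateescape turns each undecodable byte 0xNN into the lone surrogate U+DCNN.
decoded = raw_path.decode("ascii", errors="surrogateescape")
print(ascii(decoded))

# The same error handler reverses the mapping and restores the original bytes.
restored = decoded.encode("ascii", errors="surrogateescape")
```

Each of the two-byte UTF-8 sequences for ö and é becomes a pair of lone surrogates, which is why the string seen in the debugger contains \udcc3\udcb6 and \udcc3\udca9.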

Your environment

  • CPython versions tested on: 7db1d2e
  • Operating system and architecture: Linux hostname 5.15.44-1-lts #1 SMP Mon, 30 May 2022 13:45:47 +0000 x86_64 GNU/Linux

pgy added the type-bug label Jul 3, 2022
kumaraditya303 self-assigned this Sep 17, 2022
kumaraditya303 added the interpreter-core, 3.11, and 3.12 labels Sep 17, 2022

vstinner (Member) commented

I created PR #97645 to fix the root issue.

I can reproduce the issue:

  • Clone Python Git repository in a non-ASCII directory: /home/vstinner/python/mainé (é is non-ASCII)
  • Build Python: ./configure --with-pydebug && make
  • make fails with: ModuleNotFoundError: No module named 'encodings'

My locale encoding is UTF-8, my LC_CTYPE locale is fr_FR.UTF-8.

The problem is that the getpath_dirname() function of Modules/getpath.c encodes the Unicode path to UTF-8/strict by using the "s" format of PyArg_ParseTuple(), whereas a path can contain surrogate characters, which the strict error handler rejects.

The getpath_dirname() and getpath_basename() functions convert their Unicode input string to bytes only to be able to call strrchr(path, SEP), since SEP is a bytes string. There is no need to convert these strings to bytes, and the conversion is the root issue here.
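
The failure mode can be illustrated at the Python level. This is only a sketch: it uses str.encode() with the strict handler as a stand-in for the UTF-8/strict conversion described above, and try_strict_utf8 is a hypothetical helper name.

```python
# Path as seen by getpath, containing lone surrogates from surrogateescape.
path = "/home/pgy/let\udcc3\udcb6lt\udcc3\udca9sek/cpython/_bootstrap_python"


def try_strict_utf8(text):
    """Return (encoded, None) on success or (None, exception) on failure."""
    try:
        return text.encode("utf-8", errors="strict"), None
    except UnicodeEncodeError as exc:
        return None, exc


encoded, error = try_strict_utf8(path)
if error is not None:
    # error.start is the index of the first character that cannot be encoded.
    print("strict UTF-8 failed at index", error.start, "-", error.reason)
```

A pure ASCII path goes through the same helper without error, which matches the observation that the build only breaks in a directory with a non-ASCII name.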

vstinner added a commit that referenced this issue Sep 30, 2022
Fix the Python path configuration used to initialize sys.path at
Python startup. Paths are no longer encoded to UTF-8/strict, to avoid
encoding errors if they contain surrogate characters (bytes paths are
decoded with the surrogateescape error handler).

getpath_basename() and getpath_dirname() functions no longer encode
the path to UTF-8/strict, but work directly on Unicode strings. These
functions now use PyUnicode_FindChar() and PyUnicode_Substring() on
the Unicode path, rather than strrchr() on the encoded bytes string.
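
The approach in this commit message can be sketched in Python, with str.rfind() and slicing standing in for PyUnicode_FindChar() and PyUnicode_Substring(). The names dirname_unicode and basename_unicode are hypothetical, and the return values chosen for a path without a separator are an assumption of this sketch rather than something stated in the commit message.

```python
SEP = "/"  # assumption: POSIX separator, as on the reporter's Linux system


def dirname_unicode(path: str) -> str:
    """Return everything before the last separator, without encoding the path."""
    pos = path.rfind(SEP)
    if pos < 0:
        return ""  # assumed behavior when no separator is present
    return path[:pos]


def basename_unicode(path: str) -> str:
    """Return everything after the last separator, without encoding the path."""
    pos = path.rfind(SEP)
    if pos < 0:
        return path  # assumed behavior when no separator is present
    return path[pos + 1:]


# Lone surrogates are ordinary code points for str operations, so no error occurs.
exe = "/home/pgy/let\udcc3\udcb6lt\udcc3\udca9sek/cpython/_bootstrap_python"
parent = dirname_unicode(exe)
name = basename_unicode(exe)
```

Because the path is never converted to bytes, no codec or error handler is looked up, so the sketch has no dependency on the encodings module.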
miss-islington pushed a commit to miss-islington/cpython that referenced this issue Sep 30, 2022
…H-97645)

(cherry picked from commit 9f2f1dd)

Co-authored-by: Victor Stinner

kumaraditya303 removed their assignment Sep 30, 2022

vstinner (Member) commented

Fixed by PR #97645. Backport to 3.11: PR #97677.

miss-islington added a commit that referenced this issue Sep 30, 2022
(cherry picked from commit 9f2f1dd)

Co-authored-by: Victor Stinner

serhiy-storchaka pushed a commit to serhiy-storchaka/cpython that referenced this issue Oct 2, 2022
…97645)
