gh-94526: getpath_dirname() no longer encodes the path #97645

vstinner · 2022-09-29T13:11:00Z

Fix the Python path configuration used to initialize sys.path at
Python startup. The getpath_basename() and getpath_dirname() functions no
longer encode the path to UTF-8/strict, which avoids encoding errors if the
path contains surrogate characters (created by decoding a bytes path with
the surrogateescape error handler).

The functions now use PyUnicode_FindChar() and PyUnicode_Substring()
on the Unicode path, rather than strrchr() on the encoded bytes
string.
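The difference between the two approaches can be observed from Python through ctypes. This is only a sketch, assuming a CPython build where `ctypes.pythonapi` exposes the C API symbols; the sample path and the variable names are illustrative and are not taken from the PR. It shows that a surrogate-escaped path cannot be converted to a strict UTF-8 C string, while `PyUnicode_FindChar()` and `PyUnicode_Substring()` split it without any encoding step.

```python
import ctypes

api = ctypes.pythonapi  # assumption: CPython with exported C API symbols

# Py_ssize_t PyUnicode_FindChar(PyObject *, Py_UCS4, Py_ssize_t, Py_ssize_t, int)
api.PyUnicode_FindChar.argtypes = [
    ctypes.py_object, ctypes.c_uint32,
    ctypes.c_ssize_t, ctypes.c_ssize_t, ctypes.c_int,
]
api.PyUnicode_FindChar.restype = ctypes.c_ssize_t

# PyObject *PyUnicode_Substring(PyObject *, Py_ssize_t, Py_ssize_t)
api.PyUnicode_Substring.argtypes = [
    ctypes.py_object, ctypes.c_ssize_t, ctypes.c_ssize_t,
]
api.PyUnicode_Substring.restype = ctypes.py_object

# const char *PyUnicode_AsUTF8(PyObject *): strict UTF-8 conversion
api.PyUnicode_AsUTF8.argtypes = [ctypes.py_object]
api.PyUnicode_AsUTF8.restype = ctypes.c_char_p

# Illustrative bytes path holding a byte that is not valid UTF-8.
raw = b"/opt/\xff/python"
path = raw.decode("utf-8", "surrogateescape")  # the bad byte becomes a lone surrogate

# Encoding route: strict UTF-8 rejects the lone surrogate.
try:
    api.PyUnicode_AsUTF8(path)
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

# Unicode route: search backwards (direction=-1) for the last separator,
# then slice the string object directly.
end = len(path)
pos = api.PyUnicode_FindChar(path, ord("/"), 0, end, -1)
head = api.PyUnicode_Substring(path, 0, pos)
tail = api.PyUnicode_Substring(path, pos + 1, end)
```

Encoding `head` back with the surrogateescape handler restores the original bytes, so nothing is lost by staying on the Unicode object.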

Issue: Exception in ucs2lib_utf8_encoder in _bootstrap_python #94526

vstinner · 2022-09-29T13:17:31Z

Fix building Python in a non-ASCII path

Well, in fact the issue is broader: not only _bootstrap_python is affected. Any Python program is affected, since the Modules/getpath.c code is used by all Python executables.

vstinner · 2022-09-29T13:21:23Z

I rebased and updated the PR to clarify that this issue affects the Python path configuration (sys.path creation).

vstinner · 2022-09-29T13:29:39Z

Sadly, Modules/getpath.c is not a regular extension module, so it cannot be loaded in test_getpath to easily write unit tests.

Its functions are listed in getpath_methods and are injected into a namespace (dict) by the funcs_to_dict() function.

It may be interesting to convert it to a regular extension module (_getpath?).
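The injection pattern described here can be modeled in a few lines of Python. This is a rough sketch and not the C implementation: `funcs_to_dict` below is a hypothetical Python stand-in that only borrows the name of the C helper, and `getpath_dirname`, `SCRIPT` and `executable_dir` are illustrative names.

```python
def getpath_dirname(path):
    """Illustrative helper: everything before the last '/'."""
    pos = path.rfind("/")
    return path[:pos] if pos >= 0 else ""


def funcs_to_dict(namespace, funcs):
    """Hypothetical stand-in: inject helpers into a plain dict under chosen names."""
    for name, func in funcs.items():
        namespace[name] = func
    return namespace


# Illustrative script that relies on an injected name instead of an import.
SCRIPT = "prefix = dirname(executable_dir)"

namespace = {"executable_dir": "/usr/local/bin"}
funcs_to_dict(namespace, {"dirname": getpath_dirname})
exec(SCRIPT, namespace)  # the script sees 'dirname' only through the dict
```

In this model the helper is reachable only through the namespace that the script runs in, which mirrors why a test cannot simply import it the way it would import a regular extension module.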

Misc/NEWS.d/next/Core and Builtins/2022-09-29-15-19-29.gh-issue-94526.wq5m6T.rst

serhiy-storchaka · 2022-09-29T13:40:59Z

Modules/getpath.c

-    const char *path;
-    if (!PyArg_ParseTuple(args, "s", &path)) {
+    PyObject *path;
+    if (!PyArg_ParseTuple(args, "U", &path)) {


BTW, I would use METH_O and PyArg_Parse() in these functions, but this is another issue.

Why cannot they be implemented in Python?

vstinner · 2022-09-29

> BTW, I would use METH_O and PyArg_Parse() in these functions, but this is another issue.

I tried to minimize the changes.

> Why cannot they be implemented in Python?

Ask @zooba who designed this. Maybe it can be changed?

zooba · 2022-09-29

Perf, mostly. These trivial ones probably could be, but don't fall into the trap of trying to port the full ntpath/posixpath implementations into getpath - we don't have a lot of the functionality needed to handle those at this stage (e.g. no codecs, no os module).
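To make the "trivial ones probably could be" point concrete, here is a minimal sketch of the two helpers written with str methods only, so it needs neither codecs nor the os module. The separator, the function names and the empty-string result when no separator is found are assumptions made for illustration, not the behavior of the C functions.

```python
SEP = "/"  # assumption: a single POSIX-style separator


def dirname(path):
    """Return everything before the last separator ('' if there is none)."""
    pos = path.rfind(SEP)
    return path[:pos] if pos >= 0 else ""


def basename(path):
    """Return everything after the last separator (the whole path if none)."""
    pos = path.rfind(SEP)
    return path[pos + 1:] if pos >= 0 else path


# A path holding a lone surrogate, as produced by the surrogateescape
# error handler; written as a literal so that no codec is needed here.
path = "/tmp/caf\udce9/bin/python"

head = dirname(path)
tail = basename(path)
```

Because only searching and slicing are involved, the lone surrogate passes through untouched and no encoding error can occur.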

Fix the Python path configuration used to initialize sys.path at Python startup. Paths are no longer encoded to UTF-8/strict, to avoid encoding errors if they contain surrogate characters (bytes paths are decoded with the surrogateescape error handler). The getpath_basename() and getpath_dirname() functions no longer encode the path to UTF-8/strict, but work directly on Unicode strings. These functions now use PyUnicode_FindChar() and PyUnicode_Substring() on the Unicode path, rather than strrchr() on the encoded bytes string.

vstinner · 2022-09-29T13:54:31Z

@serhiy-storchaka:

> The NEWS entry is meant to be read by a common Python user, not a core dev. getpath_basename and getpath_dirname are not Python functions. Could you rewrite this?

I rephrased the NEWS entry to omit function names. Is it better? I only named functions in the commit message.

kumaraditya303 · 2022-09-29T14:48:21Z

Although this fixes the issue, this is a bit fragile since it can break if any other function were to use utf8 handler for encoding.

This PR avoids the case, can you add a comment that utf8 should be avoided here?

miss-islington · 2022-09-30T12:58:33Z

Thanks @vstinner for the PR 🌮🎉.. I'm working now to backport this PR to: 3.11.
🐍🍒⛏🤖

miss-islington · 2022-09-30T12:58:34Z

Sorry @vstinner, I had trouble checking out the 3.11 backport branch.
Please backport using cherry_picker on command line.
cherry_picker 9f2f1dd131b912e224cd0269adde8879799686c4 3.11

vstinner · 2022-09-30T13:01:12Z

> Although this fixes the issue, this is a bit fragile since it can break if any other function were to use utf8 handler for encoding.

Yes, a regression can be introduced again tomorrow. Well, we can fix it in this case :-)

> This PR avoids the case, can you add a comment that utf8 should be avoided here?

I'm not sure about the intent of a comment explaining that UTF-8 should not be used, since the modified functions now use Unicode (no encode/decode).

miss-islington · 2022-09-30T13:02:19Z

Thanks @vstinner for the PR 🌮🎉.. I'm working now to backport this PR to: 3.11.
🐍🍒⛏🤖

bedevere-bot · 2022-09-30T13:02:25Z

GH-97677 is a backport of this pull request to the 3.11 branch.

vstinner added the needs backport to 3.11 label Sep 29, 2022

bedevere-bot added the awaiting core review label Sep 29, 2022

vstinner mentioned this pull request Sep 29, 2022

Exception in ucs2lib_utf8_encoder in _bootstrap_python #94526

Closed

serhiy-storchaka approved these changes Sep 29, 2022

bedevere-bot added awaiting merge and removed awaiting core review labels Sep 29, 2022

serhiy-storchaka approved these changes Sep 29, 2022

vstinner merged commit 9f2f1dd into python:main Sep 30, 2022

bedevere-bot removed the awaiting merge label Sep 30, 2022

vstinner deleted the getpath_unicode branch September 30, 2022 12:58

miss-islington assigned vstinner Sep 30, 2022

vstinner added needs backport to 3.11 and removed needs backport to 3.11 labels Sep 30, 2022

bedevere-bot removed the needs backport to 3.11 label Sep 30, 2022

kumaraditya303 mentioned this pull request Sep 30, 2022

GH-94526: Force utf8 encoding in _bootstrap_python #96889

Closed
