Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

gh-94526: getpath_dirname() no longer encodes the path #97645

Merged
merged 1 commit into from
Sep 30, 2022
Merged

gh-94526: getpath_dirname() no longer encodes the path #97645

merged 1 commit into from
Sep 30, 2022

Conversation

vstinner
Copy link
Member

@vstinner vstinner commented Sep 29, 2022

Fix the Python path configuration used to initialized sys.path at
Python startup. getpath_basename() and getpath_dirname() functions no
longer encode the path to UTF-8/strict to avoid encoding errors if it
contains surrogate characters (created by decoding a bytes path with
the surrogateescape error handler).

The functions now use PyUnicode_FindChar() and PyUnicode_Substring()
on the Unicode path, rather than strrchr() on the encoded bytes
string.

@vstinner
Copy link
Member Author

Fix building Python in a non-ASCII path

Well. In fact, the issue is broader: no only _bootstrap_python is affected, any python program is affected since the Modules/getpath.c code is used by all Python executables.

@vstinner
Copy link
Member Author

I rebased and updated the PR to clarify that this issue affects the Python path configuration (sys.path creation).

@vstinner
Copy link
Member Author

Sadly, Modules/getpath.c is not a regular extension module, it cannot be loaded in test_getpath to easily write unit tests.

There are getpath_methods which are injected inside a namespace (dict) by funcs_to_dict() function.

It may be interesting to convert it to a regular extension module (_getpath?).

const char *path;
if (!PyArg_ParseTuple(args, "s", &path)) {
PyObject *path;
if (!PyArg_ParseTuple(args, "U", &path)) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BTW, I would use METH_O and PyArg_Parse() in these functions, but this is another issue.

Why cannot they be implemented in Python?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BTW, I would use METH_O and PyArg_Parse() in these functions, but this is another issue.

I tried to minimize the changes.

Why cannot they be implemented in Python?

Ask @zooba who designed this. Maybe it can be changed?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perf, mostly. These trivial ones probably could be, but don't fall into the trap of trying to port the full ntpath/posixpath implementations into getpath - we don't have a lot of the functionality needed to handle those at this stage (e.g. no codecs, no os module).

Fix the Python path configuration used to initialized sys.path at
Python startup. Paths are no longer encoded to UTF-8/strict to avoid
encoding errors if it contains surrogate characters (bytes paths are
decoded with the surrogateescape error handler).

getpath_basename() and getpath_dirname() functions no longer encode
the path to UTF-8/strict, but work directly on Unicode strings. These
functions now use PyUnicode_FindChar() and PyUnicode_Substring() on
the Unicode path, rather than strrchr() on the encoded bytes string.
@vstinner
Copy link
Member Author

@serhiy-storchaka:

The NEWS entry is meant to be read by a common Python user, not a core dev. getpath_basename and getpath_dirname are not Python function. Could you rewrite this?

I rephrased the NEWS entry to omit function names. Is it better? I only named functions in the commit message.

@kumaraditya303
Copy link
Contributor

Although this fixes the issue, this is a bit fragile since it can break if any other function were to use utf8 handler for encoding.

This PR avoids the case, can you add a comment that utf8 should be avoided here?

@vstinner vstinner merged commit 9f2f1dd into python:main Sep 30, 2022
@miss-islington
Copy link
Contributor

Thanks @vstinner for the PR 🌮🎉.. I'm working now to backport this PR to: 3.11.
🐍🍒⛏🤖

@miss-islington
Copy link
Contributor

Sorry @vstinner, I had trouble checking out the 3.11 backport branch.
Please backport using cherry_picker on command line.
cherry_picker 9f2f1dd131b912e224cd0269adde8879799686c4 3.11

@vstinner vstinner deleted the getpath_unicode branch September 30, 2022 12:58
@vstinner
Copy link
Member Author

Although this fixes the issue, this is a bit fragile since it can break if any other function were to use utf8 handler for encoding.

Yes, a regression can be introduced again tomorrow. Well, we can fix it in this case :-)

This PR avoids the case, can you add a comment that utf8 should be avoided here?

I'm not sure about the intent of a comment explaining that UTF-8 should not be used, since the modified functions now use Unicode (no encode/decode).

@vstinner vstinner added needs backport to 3.11 only security fixes and removed needs backport to 3.11 only security fixes labels Sep 30, 2022
@miss-islington
Copy link
Contributor

Thanks @vstinner for the PR 🌮🎉.. I'm working now to backport this PR to: 3.11.
🐍🍒⛏🤖

@bedevere-bot
Copy link

GH-97677 is a backport of this pull request to the 3.11 branch.

@bedevere-bot bedevere-bot removed the needs backport to 3.11 only security fixes label Sep 30, 2022
miss-islington pushed a commit to miss-islington/cpython that referenced this pull request Sep 30, 2022
…H-97645)

Fix the Python path configuration used to initialized sys.path at
Python startup. Paths are no longer encoded to UTF-8/strict to avoid
encoding errors if it contains surrogate characters (bytes paths are
decoded with the surrogateescape error handler).

getpath_basename() and getpath_dirname() functions no longer encode
the path to UTF-8/strict, but work directly on Unicode strings. These
functions now use PyUnicode_FindChar() and PyUnicode_Substring() on
the Unicode path, rather than strrchr() on the encoded bytes string.
(cherry picked from commit 9f2f1dd)

Co-authored-by: Victor Stinner <[email protected]>
miss-islington added a commit that referenced this pull request Sep 30, 2022
Fix the Python path configuration used to initialized sys.path at
Python startup. Paths are no longer encoded to UTF-8/strict to avoid
encoding errors if it contains surrogate characters (bytes paths are
decoded with the surrogateescape error handler).

getpath_basename() and getpath_dirname() functions no longer encode
the path to UTF-8/strict, but work directly on Unicode strings. These
functions now use PyUnicode_FindChar() and PyUnicode_Substring() on
the Unicode path, rather than strrchr() on the encoded bytes string.
(cherry picked from commit 9f2f1dd)

Co-authored-by: Victor Stinner <[email protected]>
serhiy-storchaka pushed a commit to serhiy-storchaka/cpython that referenced this pull request Oct 2, 2022
…97645)

Fix the Python path configuration used to initialized sys.path at
Python startup. Paths are no longer encoded to UTF-8/strict to avoid
encoding errors if it contains surrogate characters (bytes paths are
decoded with the surrogateescape error handler).

getpath_basename() and getpath_dirname() functions no longer encode
the path to UTF-8/strict, but work directly on Unicode strings. These
functions now use PyUnicode_FindChar() and PyUnicode_Substring() on
the Unicode path, rather than strrchr() on the encoded bytes string.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants