-
-
Notifications
You must be signed in to change notification settings - Fork 31k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
GH-94526: Force utf8 encoding in _bootstrap_python #96889
Conversation
I have to ask @tiran to review this, I don't know how that stuff works. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, with one minor suggestion.
Like @gvanrossum, I would like to know if @tiran has any concerns.
A Python core developer has requested some changes be made to your pull request before we can consider merging it. If you could please address their requests along with any other requests in other reviews from core developers that would be appreciated. Once you have made the requested changes, please leave a comment on this pull request containing the phrase |
Co-authored-by: Eric Snow <[email protected]>
Programs/_bootstrap_python.c
Outdated
PyPreConfig preconfig; | ||
PyPreConfig_InitIsolatedConfig(&preconfig); | ||
// Force utf8 encoding | ||
preconfig.utf8_mode = 1; | ||
status = Py_PreInitialize(&preconfig); | ||
if (PyStatus_Exception(status)) { | ||
Py_ExitStatusException(status); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
tl;dr this PR initializes the preconfig differently, but that might be okay (or even what we actually want).
FWIW, there is a subtle difference here.
analysis:
before:
- initialize the config (
PyConfig
) - set argv on the config
- call
PyConfig_Read()
a. call_Py_PreInitializeFromConfig()
- call
_PyPreConfig_InitFromConfig()
<-- initialize the preconfig
a. callPyPreConfig_InitIsolatedConfig()
b. call_PyPreConfig_GetConfig()
-- copies from config:parse_argv
= 1 (set on the original line 71)isolated
= 0 (set on the original line 72)use_environment
= 0 (set inPyConfig_InitIsolatedConfig()
)dev_mode
= 0 (set inPyConfig_InitIsolatedConfig()
)
- call
_Py_PreInitializeFromPyArgv()
<-- use the preconfig
a. call_PyPreConfig_Read()
<-- uses argv from config
b. call_PyPreConfig_Write()
<-- updates the runtime state
- call
- call
Py_InitializeFromConfig()
after:
- initialize the preconfig (
PyPreConfig
)
a. callPyPreConfig_InitIsolatedConfig()
<-- initialize the preconfigparse_argv
= 0 (set in_PyPreConfig_InitCompatConfig()
)isolated
= 1use_environment
= 0dev_mode
= 0
b. set utf8_mode
- call
Py_PreInitialize()
- call
_Py_PreInitializeFromPyArgv()
<-- use the preconfig
a. call_PyPreConfig_Read()
<-- argv not used
b. call_PyPreConfig_Write()
<-- updates the runtime state
- call
- initialize the config (
PyConfig
) - set argv on the config
- call
PyConfig_Read()
a. call_Py_PreInitializeFromConfig()
- skip (
runtime->preinitialized
is true)
- skip (
- call
Py_InitializeFromConfig()
The key difference is how the preconfig is initialized:
- before: with
_PyPreConfig_InitFromConfig()
(viaPyConfig_Read()
) using the already initialized config - after: with
PyPreConfig_InitIsolatedConfig()
(When the preconfig is initialized is also different, but that doesn't really affect the outcome.)
Consequently, preconfig.parse_argv
and preconfig.isolated
are different (which slightly changes how the runtime is initialized) and _PyPreCmdline_SetArgv()
is never called (in _PyPreConfig_Read()
).
If we wanted the how to be strictly equivalent, we'd initialize the preconfig after the config:
PyConfig config;
PyConfig_InitIsolatedConfig(&config);
// don't warn, pybuilddir.txt does not exist yet
config.pathconfig_warnings = 0;
// parse arguments
config.parse_argv = 1;
// add current script dir to sys.path
config.isolated = 0;
config.safe_path = 0;
#ifdef MS_WINDOWS
status = PyConfig_SetArgv(&config, argc, argv);
#else
status = PyConfig_SetBytesArgv(&config, argc, argv);
#endif
if (PyStatus_Exception(status)) {
goto error;
}
PyPreConfig preconfig;
_PyPreConfig_InitFromConfig(&preconfig, &config);
// Force utf8 encoding
preconfig.utf8_mode = 1;
_PyArgv config_args = {
.use_bytes_argv = 0,
.argc = config->argv.length,
.wchar_argv = config->argv.items};
status = _Py_PreInitializeFromPyArgv(&preconfig, &config_args);
if (PyStatus_Exception(status)) {
Py_ExitStatusException(status);
}
status = PyConfig_Read(&config);
All that said, it isn't clear that difference in runtime init is bad. In fact, it might be what we want for _bootstrap_python
. The difference might also not matter for _bootstrap_python
, especially for how we're parsing argv
. Regardless, my point is that it isn't clear that these details were considered. It would be worth making sure that the resulting runtime state is something we're okay with.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FWIW for what _bootstrap_python
is used, which is just to run one or two scripts this seems sufficient. It isn't something user facing or anything and it is "internal". I am hoping @vstinner can take a look as he authored the PyConfig
PEP https://peps.python.org/pep-0587/ and he is most familiar with these APIs.
I did this PR mostly by reading the source code and in particular this example from the PEP.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You must not and cannot pre-initialize (PyPreConfig) Python once it's already initialized (PyConfig).
PyConfig_Read() does not pre-initialize Python if it's already pre-initialized. (If it's the case, it's would a bug.)
PyStatus
_Py_PreInitializeFromConfig(const PyConfig *config,
const _PyArgv *args)
{
...
if (runtime->preinitialized) {
/* Already initialized: do nothing */
return _PyStatus_OK();
}
...
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Calling PyConfig_SetArgv() first to then call _Py_PreInitializeFromPyArgv() doesn't work, since PyConfig_SetArgv() does pre-initialize Python :-)
https://docs.python.org/dev/c-api/init_config.html#c.PyConfig.PyConfig_SetArgv
It's unfortunate that PyPreConfig and PyConfig are separated and that some options are redundant. You might have to set the same options in the two structures:
But I'm not sure why you start from an isolated config to disable isolation and use parse_argv=1. If you want a Python which looks like regular Python, use the PythonConfig rather than the IsolatedConfig. |
PyPreConfig parses argv to set isolated ( I'm curious why isolated=0 is used. Currently, _bootstrap_python.c uses:
Hum, that's the purpose of the new safe_path option. Problem: if isolated=1, safe_path is always set to 1. Currently, it's not possible to have isolated=1 and safe_path=0. Maybe the API should be enhanced to support this use case. In general, I tried to always take the highest priority for PyConfig members if a member is set explicitly. |
@vstinner: I agree with your analysis (after all you are the author :)). I was also confused Can you review the utf8 part and if it is set correctly? Thank you! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not really convinced that this change is the correct fix for the root issue. Please see my analysis in my comment #94526 (comment) and my PR #97645 which fix the root issue.
Superseded by #97645 |
IMO #97645 fixed the root issue and this change is no longer needed. I'm not convinced that _bootstrap_python requires to enforce the UTF-8 Mode, since I spent significant time over the last years to make sure that it's possible to start Python with basically any possible locale encoding ;-) During Python startup, Py_EncodeLocale() and Py_DecodeLocale() use the C library (libc) to encode/decode strings, until Python gets access to its own codecs. |
At least, for me, it was interesting to see how the Python preconfiguration API can be used :-D |
Well, I'm just glad I didn't blindly accepted it. :-) |
As long as the bug is fixed I'm happy :-) But as I pointed out in the other PR that the fix only fixed only specific cases, it is possible that this happens again if runtime initialization changes. This PR would have avoided that entirely but it is rare so it should be fine. |
Before fix:
After fix it compiles successfully in paths containing non ascii characters.
Closes #94526