Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GH-94526: Force utf8 encoding in _bootstrap_python #96889

Closed
wants to merge 5 commits into from

Conversation

kumaraditya303
Copy link
Contributor

@kumaraditya303 kumaraditya303 commented Sep 17, 2022

Before fix:

Current thread 0x00007f716f57f280 (most recent call first):
  <no Python frame>
make: *** [Makefile:1247: Python/frozen_modules/_collections_abc.h] Error 1
Exception ignored error evaluating path:
Traceback (most recent call last):
  File "<frozen getpath>", line 349, in <module>
ModuleNotFoundError: No module named 'encodings'
Fatal Python error: error evaluating path
Python runtime state: core initialized

Current thread 0x00007ff524566280 (most recent call first):
  <no Python frame>
make: *** [Makefile:1250: Python/frozen_modules/_sitebuiltins.h] Error 1

After fix it compiles successfully in paths containing non ascii characters.

Closes #94526

@kumaraditya303 kumaraditya303 added type-bug An unexpected behavior, bug, or error interpreter-core (Objects, Python, Grammar, and Parser dirs) needs backport to 3.10 only security fixes needs backport to 3.11 only security fixes and removed needs backport to 3.10 only security fixes labels Sep 17, 2022
@kumaraditya303 kumaraditya303 self-assigned this Sep 17, 2022
@gvanrossum gvanrossum requested a review from tiran September 18, 2022 03:30
@gvanrossum
Copy link
Member

I have to ask @tiran to review this, I don't know how that stuff works.

Copy link
Member

@ericsnowcurrently ericsnowcurrently left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, with one minor suggestion.

Like @gvanrossum, I would like to know if @tiran has any concerns.

Programs/_bootstrap_python.c Outdated Show resolved Hide resolved
@bedevere-bot
Copy link

A Python core developer has requested some changes be made to your pull request before we can consider merging it. If you could please address their requests along with any other requests in other reviews from core developers that would be appreciated.

Once you have made the requested changes, please leave a comment on this pull request containing the phrase I have made the requested changes; please review again. I will then notify any core developers who have left a review that you're ready for them to take another look at this pull request.

Comment on lines 66 to 73
PyPreConfig preconfig;
PyPreConfig_InitIsolatedConfig(&preconfig);
// Force utf8 encoding
preconfig.utf8_mode = 1;
status = Py_PreInitialize(&preconfig);
if (PyStatus_Exception(status)) {
Py_ExitStatusException(status);
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

tl;dr this PR initializes the preconfig differently, but that might be okay (or even what we actually want).

FWIW, there is a subtle difference here.

analysis:

before:

  1. initialize the config (PyConfig)
  2. set argv on the config
  3. call PyConfig_Read()
    a. call _Py_PreInitializeFromConfig()
    1. call _PyPreConfig_InitFromConfig() <-- initialize the preconfig
      a. call PyPreConfig_InitIsolatedConfig()
      b. call _PyPreConfig_GetConfig() -- copies from config:
      • parse_argv = 1 (set on the original line 71)
      • isolated = 0 (set on the original line 72)
      • use_environment = 0 (set in PyConfig_InitIsolatedConfig())
      • dev_mode = 0 (set in PyConfig_InitIsolatedConfig())
    2. call _Py_PreInitializeFromPyArgv() <-- use the preconfig
      a. call _PyPreConfig_Read() <-- uses argv from config
      b. call _PyPreConfig_Write() <-- updates the runtime state
  4. call Py_InitializeFromConfig()

after:

  1. initialize the preconfig (PyPreConfig)
    a. call PyPreConfig_InitIsolatedConfig() <-- initialize the preconfig
    • parse_argv = 0 (set in _PyPreConfig_InitCompatConfig())
    • isolated = 1
    • use_environment = 0
    • dev_mode = 0
      b. set utf8_mode
  2. call Py_PreInitialize()
    1. call _Py_PreInitializeFromPyArgv() <-- use the preconfig
      a. call _PyPreConfig_Read() <-- argv not used
      b. call _PyPreConfig_Write() <-- updates the runtime state
  3. initialize the config (PyConfig)
  4. set argv on the config
  5. call PyConfig_Read()
    a. call _Py_PreInitializeFromConfig()
    1. skip (runtime->preinitialized is true)
  6. call Py_InitializeFromConfig()

The key difference is how the preconfig is initialized:

  • before: with _PyPreConfig_InitFromConfig() (via PyConfig_Read()) using the already initialized config
  • after: with PyPreConfig_InitIsolatedConfig()

(When the preconfig is initialized is also different, but that doesn't really affect the outcome.)

Consequently, preconfig.parse_argv and preconfig.isolated are different (which slightly changes how the runtime is initialized) and _PyPreCmdline_SetArgv() is never called (in _PyPreConfig_Read()).

If we wanted the how to be strictly equivalent, we'd initialize the preconfig after the config:

    PyConfig config;
    PyConfig_InitIsolatedConfig(&config);
    // don't warn, pybuilddir.txt does not exist yet
    config.pathconfig_warnings = 0;
    // parse arguments
    config.parse_argv = 1;
    // add current script dir to sys.path
    config.isolated = 0;
    config.safe_path = 0;
#ifdef MS_WINDOWS
    status = PyConfig_SetArgv(&config, argc, argv);
#else
    status = PyConfig_SetBytesArgv(&config, argc, argv);
#endif
    if (PyStatus_Exception(status)) {
        goto error;
    }

    PyPreConfig preconfig;
    _PyPreConfig_InitFromConfig(&preconfig, &config);
    // Force utf8 encoding
    preconfig.utf8_mode = 1;
    _PyArgv config_args = {
        .use_bytes_argv = 0,
        .argc = config->argv.length,
        .wchar_argv = config->argv.items};
    status = _Py_PreInitializeFromPyArgv(&preconfig, &config_args);
    if (PyStatus_Exception(status)) {
        Py_ExitStatusException(status);
    }

    status = PyConfig_Read(&config);

All that said, it isn't clear that difference in runtime init is bad. In fact, it might be what we want for _bootstrap_python. The difference might also not matter for _bootstrap_python, especially for how we're parsing argv. Regardless, my point is that it isn't clear that these details were considered. It would be worth making sure that the resulting runtime state is something we're okay with.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW for what _bootstrap_python is used, which is just to run one or two scripts this seems sufficient. It isn't something user facing or anything and it is "internal". I am hoping @vstinner can take a look as he authored the PyConfig PEP https://peps.python.org/pep-0587/ and he is most familiar with these APIs.

I did this PR mostly by reading the source code and in particular this example from the PEP.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You must not and cannot pre-initialize (PyPreConfig) Python once it's already initialized (PyConfig).

PyConfig_Read() does not pre-initialize Python if it's already pre-initialized. (If it's the case, it's would a bug.)

PyStatus
_Py_PreInitializeFromConfig(const PyConfig *config,
                            const _PyArgv *args)
{
    ...
    if (runtime->preinitialized) {
        /* Already initialized: do nothing */
        return _PyStatus_OK();
    }
    ...
}

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Calling PyConfig_SetArgv() first to then call _Py_PreInitializeFromPyArgv() doesn't work, since PyConfig_SetArgv() does pre-initialize Python :-)

https://docs.python.org/dev/c-api/init_config.html#c.PyConfig.PyConfig_SetArgv

@vstinner
Copy link
Member

Consequently, preconfig.parse_argv and preconfig.isolated are different (which slightly changes how the runtime is initialized) and _PyPreCmdline_SetArgv() is never called (in _PyPreConfig_Read()).

It's unfortunate that PyPreConfig and PyConfig are separated and that some options are redundant. You might have to set the same options in the two structures:

preconfig.isolated = 0;
preconfig.parse_argv = 1;
...
config.isolated = 0;
config.parse_argv = 1;

But I'm not sure why you start from an isolated config to disable isolation and use parse_argv=1. If you want a Python which looks like regular Python, use the PythonConfig rather than the IsolatedConfig.

@vstinner
Copy link
Member

PyPreConfig parses argv to set isolated (-I), use_environment (-E) and dev_mode (-X utf8 or -X utf8=1). But it seems like _bootstrap_python should not depend on environment variables and should not enable the development mode.

I'm curious why isolated=0 is used. Currently, _bootstrap_python.c uses:

    // add current script dir to sys.path
    config.isolated = 0;

Hum, that's the purpose of the new safe_path option. Problem: if isolated=1, safe_path is always set to 1. Currently, it's not possible to have isolated=1 and safe_path=0. Maybe the API should be enhanced to support this use case. In general, I tried to always take the highest priority for PyConfig members if a member is set explicitly.

@kumaraditya303
Copy link
Contributor Author

@vstinner: I agree with your analysis (after all you are the author :)). I was also confused config.isolated = 0; but this PR needs to be backported so I left it for another PR to fix up.

Can you review the utf8 part and if it is set correctly? Thank you!

Copy link
Member

@vstinner vstinner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not really convinced that this change is the correct fix for the root issue. Please see my analysis in my comment #94526 (comment) and my PR #97645 which fix the root issue.

@kumaraditya303
Copy link
Contributor Author

Superseded by #97645

@kumaraditya303 kumaraditya303 deleted the utf8-deepfreeze branch September 30, 2022 13:06
@vstinner
Copy link
Member

IMO #97645 fixed the root issue and this change is no longer needed.

I'm not convinced that _bootstrap_python requires to enforce the UTF-8 Mode, since I spent significant time over the last years to make sure that it's possible to start Python with basically any possible locale encoding ;-) During Python startup, Py_EncodeLocale() and Py_DecodeLocale() use the C library (libc) to encode/decode strings, until Python gets access to its own codecs.

@vstinner
Copy link
Member

At least, for me, it was interesting to see how the Python preconfiguration API can be used :-D

@gvanrossum
Copy link
Member

Well, I'm just glad I didn't blindly accepted it. :-)

@kumaraditya303
Copy link
Contributor Author

As long as the bug is fixed I'm happy :-)

But as I pointed out in the other PR that the fix only fixed only specific cases, it is possible that this happens again if runtime initialization changes. This PR would have avoided that entirely but it is rare so it should be fine.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
awaiting changes interpreter-core (Objects, Python, Grammar, and Parser dirs) needs backport to 3.11 only security fixes type-bug An unexpected behavior, bug, or error
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Exception in ucs2lib_utf8_encoder in _bootstrap_python
5 participants