Speedup object creation #14091

Draft
wants to merge 5 commits into
base: master

Conversation

bonzini
Collaborator

@bonzini bonzini commented Jan 8, 2025

When building QEMU, about 10% of the time is spent in the various __init__ functions for InterpreterObject subclasses, especially primitives:

    23264    0.507 interpreter/primitives/boolean.py:23(__init__)
    17283    0.518 interpreter/primitives/array.py:30(__init__)
    11259    0.765 interpreter/primitives/integer.py:18(__init__)
   145526    0.886 interpreterbase/baseobjects.py:129(__init__) <-- ObjectHolder
   147974    1.930 interpreterbase/baseobjects.py:41(__init__)
    80265    5.413 interpreter/primitives/string.py:32(__init__)
  1642740    1.700 /usr/lib64/python3.13/enum.py:1297(__hash__)

(Of the calls to Enum.__hash__, 1530823 come from the same __init__ functions; most of the others come from InterpreterObject.operator_call).

This is because the method and operator dictionaries are rebuilt from scratch for every object. Each string creation is about 100 microseconds, but strings as well as other objects add up quickly due to _holderify calls.

Move operators and methods to class attributes instead. In the case of methods, I am only doing so for primitives to keep the pull request smaller, but it is possible (and saves a handful of lines of code) to do this for all objects.
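To illustrate the difference, here is a minimal sketch of the two layouts. The `Op` enum and the `PerInstance`/`PerClass` classes are hypothetical stand-ins, not the actual Meson classes; the sketch only assumes that operators are looked up in a dictionary keyed by an enum.

```python
from enum import Enum


class Op(Enum):
    """Hypothetical stand-in for the interpreter's operator enum."""
    PLUS = "+"
    EQUALS = "=="


class PerInstance:
    """Before: every object rebuilds its own operator table."""

    def __init__(self, value: int) -> None:
        self.value = value
        self.operators = {}
        # update() hashes the Enum keys again for every new object.
        self.operators.update({
            Op.PLUS: lambda other: self.value + other,
            Op.EQUALS: lambda other: self.value == other,
        })

    def operator_call(self, op: Op, other: int):
        return self.operators[op](other)


class PerClass:
    """After: one table per class, built once when the class is created."""

    OPERATORS = {
        Op.PLUS: lambda obj, other: obj.value + other,
        Op.EQUALS: lambda obj, other: obj.value == other,
    }

    def __init__(self, value: int) -> None:
        self.value = value

    def operator_call(self, op: Op, other: int):
        # The object is passed explicitly, so the table needs no per-object state.
        return self.OPERATORS[op](self, other)
```

With the second layout, `__init__` no longer touches the dictionary at all, which is where the `update()` and `Enum.__hash__` calls in the profile come from.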

In preparation for moving them to the class, make the operator functions
binary.  Adjust the lambdas for trivial operators, and store unbound
methods for non-trivial ones.
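A rough sketch of what "binary" means here, using a hypothetical `PathLike` class and `Op` enum rather than the real Meson code: the table is still per-instance at this stage, but no entry captures `self` any more, so the same entries can later be shared by the whole class.

```python
from enum import Enum


class Op(Enum):
    """Hypothetical stand-in for the interpreter's operator enum."""
    PLUS = "+"
    DIV = "/"


class PathLike:
    def __init__(self, value: str) -> None:
        self.value = value
        self.operators = {
            # Trivial operator: a lambda taking (object, other) instead of closing over self.
            Op.PLUS: lambda obj, other: obj.value + other,
            # Non-trivial operator: the plain function from the class, not a bound method.
            Op.DIV: PathLike.op_div,
        }

    def op_div(self, other: str) -> str:
        return f"{self.value}/{other}"

    def operator_call(self, op: Op, other: str) -> str:
        # Both kinds of entry are called the same way, with the object passed explicitly.
        return self.operators[op](self, other)
```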

Signed-off-by: Paolo Bonzini <[email protected]>

InterpreterObject and MesonInterpreterObject cannot be used directly, as they contain
nothing that the user can use.  Mark them as abstract.

Likewise for MutableInterpreterObject, which is a mixin.

Signed-off-by: Paolo Bonzini <[email protected]>

Do not call update() and Enum.__hash__ a gazillion times; trivial
operators are the same for every instance of the class.

Introduce the infrastructure to build the MRO-resolved operators (so
the outcome is the same as if one called super().__init__) for each subclass
of InterpreterObject.

Signed-off-by: Paolo Bonzini <[email protected]>

Do not call update() and Enum.__hash__ a gazillion times; operators
are the same for every instance of the class.  In order to access
the class for non-trivial operators, the operators are first marked
using a decorator, and then OPERATORS is built via __init_subclass__.
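The mechanism can be sketched roughly as follows. The names `operator_impl`, `TRIVIAL_OPERATORS`, `Base`, `StringLike` and `UpperString` are made up for this example, and the actual patches may organize things differently; the sketch only shows the general technique of tagging methods with a decorator and merging the tables along the MRO in `__init_subclass__`.

```python
from enum import Enum
from typing import Any, Callable, Dict


class Op(Enum):
    """Hypothetical stand-in for the interpreter's operator enum."""
    PLUS = "+"
    DIV = "/"
    EQUALS = "=="


def operator_impl(op: Op) -> Callable:
    """Tag a method with the operator it implements (hypothetical decorator)."""
    def wrap(func: Callable) -> Callable:
        func._operator = op
        return func
    return wrap


class Base:
    TRIVIAL_OPERATORS: Dict[Op, Callable[[Any, Any], Any]] = {
        Op.EQUALS: lambda obj, other: obj.value == other,
    }
    OPERATORS: Dict[Op, Callable[[Any, Any], Any]] = {}

    def __init_subclass__(cls, **kwargs: Any) -> None:
        super().__init_subclass__(**kwargs)
        merged: Dict[Op, Callable[[Any, Any], Any]] = {}
        # Walk from the most generic class to the most specific one, so that
        # subclasses override their parents just as chained __init__ calls would.
        for klass in reversed(cls.__mro__):
            merged.update(vars(klass).get("TRIVIAL_OPERATORS", {}))
            for attr in vars(klass).values():
                op = getattr(attr, "_operator", None)
                if op is not None:
                    merged[op] = attr
        cls.OPERATORS = merged

    def __init__(self, value: str) -> None:
        self.value = value

    def operator_call(self, op: Op, other: str) -> Any:
        return self.OPERATORS[op](self, other)


class StringLike(Base):
    TRIVIAL_OPERATORS = {Op.PLUS: lambda obj, other: obj.value + other}

    @operator_impl(Op.DIV)
    def op_div(self, other: str) -> str:
        return f"{self.value}/{other}"


class UpperString(StringLike):
    @operator_impl(Op.PLUS)
    def op_plus(self, other: str) -> str:
        return (self.value + other).upper()
```

Each subclass ends up with its own fully resolved `OPERATORS` table, computed once at class creation, and object construction no longer does any dictionary work.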

Signed-off-by: Paolo Bonzini <[email protected]>

Do not call update() and Enum.__hash__ a gazillion times; methods
are the same for every instance of the class.  In order to access
the class, just mark the methods using a decorator and build
METHODS later using __init_subclass__.

Non-primitive objects are not converted yet to keep the patch small.
They are created a lot less than other objects, especially strings
and booleans.

Signed-off-by: Paolo Bonzini <[email protected]>
@bonzini
Collaborator Author

bonzini commented Jan 9, 2025

More timings from QEMU's meson setup:

1.6.0:                  100.46user 14.95system
1.6.99:                  89.47user 14.76system
1.6.99 + #13879:         80.67user 14.59system (wow)
1.6.99 + #13879 + this:  76.91user 14.95system
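
For reference, the relative reductions in user time can be computed from the figures above; the helper name `reduction_percent` is just for this example.

```python
# User times in seconds, copied from the table above.
user_time = {
    "1.6.0": 100.46,
    "1.6.99": 89.47,
    "1.6.99 + #13879": 80.67,
    "1.6.99 + #13879 + this": 76.91,
}


def reduction_percent(before: float, after: float) -> float:
    """Relative reduction from `before` to `after`, in percent."""
    return (before - after) / before * 100


# Effect of this pull request on top of #13879.
this_pr = reduction_percent(user_time["1.6.99 + #13879"],
                            user_time["1.6.99 + #13879 + this"])

# Cumulative effect compared to the 1.6.0 release.
overall = reduction_percent(user_time["1.6.0"],
                            user_time["1.6.99 + #13879 + this"])

print(f"this PR: {this_pr:.1f}%, since 1.6.0: {overall:.1f}%")
```

That works out to a bit under 5% for this pull request alone, and a bit over 23% compared to 1.6.0.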

The main remaining hotspots for QEMU are still flatten_object_list/_determine_ext_objs, especially object_filename_from_source and canonicalize_filename, and generate_single_compile. _determine_ext_objs is roughly 10% and canonicalize_filename is about half of it. However, QEMU is a bit special and probably uses these more than anyone else.

The remaining low-hanging fruit:

  • execution and generation are roughly a 40-60 split for QEMU. Compiler/linker checks in QEMU are ~20% of the execution time. If they could be somehow done in a separate thread it could be a big win (pkg-config is probably harder but it's another 10%)

  • get_id() is probably a good use for lazy_property (1.5% run time)

  • Paths are expensive. Iterating over paths is stupidly so, due to an inefficient implementation of __getitem__. validate_within_subproject is 2% of execution time alone and can probably be kicked out of the profile (via either optimization or caching). Path manipulation (join) also appears in get_target_generated_dir.

  • the weird one: ABCMeta.__instancecheck__, costing 3%. It's worth trying to remove all abc superclass annotations, perhaps replacing them with a cheaper version that does not need __instancecheck__. Interestingly, parse nodes do not have an abstract metaclass, hence this would not negate any benefit from double dispatch in evaluate_statement()
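
As a rough illustration of the two caching ideas above (lazy computation of an id, and caching of path checks), here is a sketch using only the standard library. `Target`, its `id` property and `is_within` are hypothetical; `functools.cached_property` stands in for the lazy_property mentioned above, and none of this is the actual Meson code.

```python
from functools import cached_property, lru_cache
from pathlib import PurePosixPath


class Target:
    """Hypothetical stand-in for a build target."""

    def __init__(self, name: str, subdir: str) -> None:
        self.name = name
        self.subdir = subdir
        self.id_computations = 0  # only here to show how often the id is computed

    @cached_property
    def id(self) -> str:
        # Computed on first access, then stored on the instance.
        self.id_computations += 1
        return f"{self.subdir}/{self.name}".replace("/", "@")


@lru_cache(maxsize=None)
def is_within(root: str, candidate: str) -> bool:
    """Return True if `candidate` lies under `root`, comparing path components."""
    root_parts = PurePosixPath(root).parts
    return PurePosixPath(candidate).parts[:len(root_parts)] == root_parts
```

Keying the cache on plain strings keeps the expensive path objects out of the hot path for repeated queries; whether that is safe depends on the inputs not changing meaning during a run, which would need checking against the real code.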

Some harder possibilities:

  • isinstance is called no less than 18097946 times for a total of 10% execution time. Of those, 10% come from evaluate_statement(), where it may be possible to experiment with both if/elif ordering and double dispatch. Benchmarking might be hard.

  • A large part of get_base_compile_args is accessing options. OptionsView.__getitem__ is... interesting.

  • deepcopy() appears in the profile thanks to... create_test_serialisation. That's weird and may be worth investigating for another ~0.5% benefit.
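
To make the isinstance vs. double dispatch comparison concrete, here is a toy sketch. The node classes and the `Evaluator` are invented for this example and are much simpler than Meson's parse nodes; whether the second form is actually faster in Meson would have to be benchmarked.

```python
class Node:
    def accept(self, visitor: "Evaluator") -> int:
        raise NotImplementedError


class NumberNode(Node):
    def __init__(self, value: int) -> None:
        self.value = value

    def accept(self, visitor: "Evaluator") -> int:
        return visitor.visit_number(self)


class AddNode(Node):
    def __init__(self, left: Node, right: Node) -> None:
        self.left = left
        self.right = right

    def accept(self, visitor: "Evaluator") -> int:
        return visitor.visit_add(self)


def evaluate_isinstance(node: Node) -> int:
    """if/elif chain: the cost depends on where the node type sits in the chain."""
    if isinstance(node, NumberNode):
        return node.value
    elif isinstance(node, AddNode):
        return evaluate_isinstance(node.left) + evaluate_isinstance(node.right)
    raise TypeError(f"unknown node {node!r}")


class Evaluator:
    """Double dispatch: the node picks the visitor method, with no isinstance calls."""

    def evaluate(self, node: Node) -> int:
        return node.accept(self)

    def visit_number(self, node: NumberNode) -> int:
        return node.value

    def visit_add(self, node: AddNode) -> int:
        return self.evaluate(node.left) + self.evaluate(node.right)
```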

Focusing on such small percentages may seem weird, but it depends on how many of them you pile up. And after all, today's 1% was yesterday's 0.5%. As more optimizations are performed, one has to focus on the smaller ones.

And the places that are not interesting IMO:

  • ninja_quote() is by far the most called function in meson, but it's only 1% of the total runtime; arglist's __iadd__ method is the hottest one but I've run out of ideas there

  • regular expressions are only 3%, mostly in CLikeCompilerArgs.to_native; compilation is only 0.5%. It seems too hard for the benefit, unlike the options above.

  • many isinstance calls come from list/dict processing in functions like resolver() or _unholder() which are probably not worth it (but they may be low-hanging fruit if one day Meson starts using Cython...).

@bonzini bonzini added this to the 1.8 milestone Jan 9, 2025