Speedup object creation #14091

bonzini · 2025-01-08T10:05:59Z

When building QEMU, about 10% of the time is spent in the various __init__ functions for InterpreterObject subclasses, especially primitives:

    23264    0.507 interpreter/primitives/boolean.py:23(__init__)
    17283    0.518 interpreter/primitives/array.py:30(__init__)
    11259    0.765 interpreter/primitives/integer.py:18(__init__)
   145526    0.886 interpreterbase/baseobjects.py:129(__init__) <-- ObjectHolder
   147974    1.930 interpreterbase/baseobjects.py:41(__init__)
    80265    5.413 interpreter/primitives/string.py:32(__init__)
  1642740    1.700 /usr/lib64/python3.13/enum.py:1297(__hash__)

(Of the calls to Enum.__hash__, 1530823 come from the same __init__ functions; most of the others come from InterpreterObject.operator_call).

This is because the method and operator dictionaries are rebuilt from scratch for every object. Each string creation is about 100 microseconds, but strings as well as other objects add up quickly due to _holderify calls.

Move operators and methods to a class attribute instead. In the case of methods, I am only doing so for primitives to keep the pull request smaller, but it is possible (and saves a handful of lines of code) to do this for all objects.
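A minimal sketch of the difference between the two layouts. The class names, the `Op` enum and the `value` attribute are hypothetical stand-ins, not Meson's actual code:

```python
import enum


class Op(enum.Enum):
    # Hypothetical stand-in for the interpreter's operator enum.
    PLUS = "+"
    EQUALS = "=="


class PerInstanceString:
    """Old layout: the operator table is rebuilt for every object."""

    def __init__(self, value: str) -> None:
        self.value = value
        # Each construction builds a new dict and hashes every Enum key again.
        self.operators = {
            Op.PLUS: lambda other: self.value + other,
            Op.EQUALS: lambda other: self.value == other,
        }

    def operator_call(self, op: Op, other: str):
        return self.operators[op](other)


class ClassLevelString:
    """New layout: one table per class, built once when the class is created."""

    OPERATORS = {
        # The object is passed explicitly, so no per-instance closure is needed.
        Op.PLUS: lambda obj, other: obj.value + other,
        Op.EQUALS: lambda obj, other: obj.value == other,
    }

    def __init__(self, value: str) -> None:
        self.value = value

    def operator_call(self, op: Op, other: str):
        return self.OPERATORS[op](self, other)


old_a, old_b = PerInstanceString("foo"), PerInstanceString("bar")
new_a, new_b = ClassLevelString("foo"), ClassLevelString("bar")
```

In the first class every object owns its own dictionary; in the second all instances share the single dictionary stored on the class, so `__init__` only has to store the held value.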

In preparation for moving them to the class, make the operator functions binary. Adjust the lambdas for trivial operators, and store unbound methods for non-trivial ones.

Signed-off-by: Paolo Bonzini <[email protected]>
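To illustrate what "binary" means here, a small sketch under assumed names (`IntegerHolder`, `held`, `op_div` and `call_operator` are all hypothetical): every entry in the table takes the object and the other operand, whether it is a lambda or a method looked up on the class.

```python
import enum


class Op(enum.Enum):
    # Hypothetical operator enum for the example.
    PLUS = "+"
    DIV = "/"


class IntegerHolder:
    def __init__(self, held: int) -> None:
        self.held = held

    def op_div(self, other: int) -> int:
        # Non-trivial operator: needs validation before computing.
        if other == 0:
            raise ValueError("Tried to divide by 0")
        return self.held // other


# Both kinds of entry have the same (obj, other) signature.
TABLE = {
    # Trivial operator: a lambda that receives the object explicitly.
    Op.PLUS: lambda obj, other: obj.held + other,
    # Non-trivial operator: the function looked up on the class, not bound
    # to any instance, so it also expects the object as first argument.
    Op.DIV: IntegerHolder.op_div,
}


def call_operator(obj: IntegerHolder, op: Op, other: int) -> int:
    return TABLE[op](obj, other)
```

Because no entry captures `self`, the table no longer depends on any particular instance and can live outside `__init__`.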

InterpreterObject and MesonInterpreterObject cannot be used directly, as they contain nothing that the user can use. Mark them as abstract. Likewise for MutableInterpreterObject, which is a mixin.

Signed-off-by: Paolo Bonzini <[email protected]>

Do not call update() and Enum.__hash__ a gazillion times; trivial operators are the same for every instance of the class. Introduce the infrastructure to build the MRO-resolved operators (so the outcome is the same as if one called super().__init__) for each subclass of InterpreterObject.

Signed-off-by: Paolo Bonzini <[email protected]>
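One way such MRO resolution could look, as a sketch only. The attribute names `TRIVIAL_OPERATORS` and `OPERATORS`, the helper `_merge_operators` and the example classes are assumptions made for illustration, not the names used in the patch:

```python
import enum
from typing import Any, Callable, Dict


class Op(enum.Enum):
    PLUS = "+"
    EQUALS = "=="
    NOT_EQUALS = "!="


OperatorTable = Dict[Op, Callable[[Any, Any], Any]]


def _merge_operators(cls: type) -> OperatorTable:
    merged: OperatorTable = {}
    # Walk from the most generic class to the most specific one, so that
    # entries declared in a subclass override inherited ones, just as
    # chained super().__init__() calls followed by update() would.
    for klass in reversed(cls.__mro__):
        merged.update(vars(klass).get("TRIVIAL_OPERATORS", {}))
    return merged


class Base:
    # Operators declared directly in this class body.
    TRIVIAL_OPERATORS: OperatorTable = {
        Op.EQUALS: lambda obj, other: obj.held == other,
        Op.NOT_EQUALS: lambda obj, other: obj.held != other,
    }
    # Merged table; computed once per class, never per instance.
    OPERATORS: OperatorTable = {}

    def __init_subclass__(cls, **kwargs: Any) -> None:
        super().__init_subclass__(**kwargs)
        cls.OPERATORS = _merge_operators(cls)

    def __init__(self, held: Any) -> None:
        self.held = held


# __init_subclass__ does not run for Base itself, so fill it in by hand.
Base.OPERATORS = _merge_operators(Base)


class Str(Base):
    TRIVIAL_OPERATORS = {Op.PLUS: lambda obj, other: obj.held + other}


class LooseStr(Str):
    # Overrides the inherited EQUALS with a case-insensitive comparison.
    TRIVIAL_OPERATORS = {
        Op.EQUALS: lambda obj, other: obj.held.lower() == other.lower(),
    }
```

The merge happens when each class statement executes, so creating an object costs nothing extra no matter how deep the hierarchy is.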

Do not call update() and Enum.__hash__ a gazillion times; operators are the same for every instance of the class. In order to access the class for non-trivial operators, the operators are first marked using a decorator, and then OPERATORS is built via __init_subclass__.

Signed-off-by: Paolo Bonzini <[email protected]>
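A sketch of the mark-then-collect pattern described above. The decorator name `operator_impl`, the `_operator` tag and the example classes are hypothetical; the sketch also only inherits the table of the first base class, which a real implementation would probably handle more carefully:

```python
import enum
from typing import Any, Callable, Dict, TypeVar

F = TypeVar("F", bound=Callable[..., Any])


class Op(enum.Enum):
    DIV = "/"
    INDEX = "[]"


def operator_impl(op: Op) -> Callable[[F], F]:
    """Hypothetical decorator: tag a function with the operator it implements."""
    def mark(func: F) -> F:
        func._operator = op  # type: ignore[attr-defined]
        return func
    return mark


class Base:
    OPERATORS: Dict[Op, Callable[[Any, Any], Any]] = {}

    def __init_subclass__(cls, **kwargs: Any) -> None:
        super().__init_subclass__(**kwargs)
        # Start from the inherited table, then add every function that was
        # tagged in this class body. Inside the class body the class object
        # does not exist yet, which is why collection happens here.
        table = dict(cls.OPERATORS)
        for attr in vars(cls).values():
            op = getattr(attr, "_operator", None)
            if op is not None:
                table[op] = attr
        cls.OPERATORS = table

    def __init__(self, held: Any) -> None:
        self.held = held


class Str(Base):
    @operator_impl(Op.DIV)
    def op_div(self, other: str) -> str:
        # Path-style join, chosen only to have a non-trivial operator.
        return self.held.rstrip("/") + "/" + other

    @operator_impl(Op.INDEX)
    def op_index(self, other: int) -> str:
        return self.held[other]
```

The same pattern would apply to a per-class table of methods, with the decorator recording a method name instead of an operator.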

Do not call update() and Enum.__hash__ a gazillion times; methods are the same for every instance of the class. In order to access the class, just mark the methods using a decorator and build METHODS later using __init_subclass__. Non-primitive objects are not converted yet to keep the patch small. They are created a lot less often than other objects, especially strings and booleans.

Signed-off-by: Paolo Bonzini <[email protected]>

bonzini · 2025-01-09T11:11:18Z

More timings from QEMU's meson setup:

1.6.0:                  100.46user 14.95system
1.6.99:                  89.47user 14.76system
1.6.99 + #13879:         80.67user 14.59system (wow)
1.6.99 + #13879 + this:  76.91user 14.95system
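
For reference, the relative reductions in user time implied by the table can be computed as follows (the numbers are copied from the table; the helper name `reduction_percent` is just for this example):

```python
# User-time figures (seconds) copied from the table above.
user_seconds = {
    "1.6.0": 100.46,
    "1.6.99": 89.47,
    "1.6.99 + #13879": 80.67,
    "1.6.99 + #13879 + this": 76.91,
}


def reduction_percent(before: float, after: float) -> float:
    """Relative reduction from `before` to `after`, in percent."""
    return (before - after) / before * 100


# Effect of this pull request on top of #13879.
this_pr = reduction_percent(
    user_seconds["1.6.99 + #13879"], user_seconds["1.6.99 + #13879 + this"]
)
# Cumulative effect relative to the 1.6.0 release.
total = reduction_percent(
    user_seconds["1.6.0"], user_seconds["1.6.99 + #13879 + this"]
)
print(f"this PR: {this_pr:.1f}%, since 1.6.0: {total:.1f}%")
```

That works out to roughly 4.7% for this pull request on its own and roughly 23.4% in total since 1.6.0.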

The main remaining hotspots for QEMU are still flatten_object_list/_determine_ext_objs, especially object_filename_from_source and canonicalize_filename, and generate_single_compile. _determine_ext_objs is roughly 10% and canonicalize_filename is about half of it. However, QEMU is a bit special and probably uses these more than anyone else.

The remaining lower hanging fruit:

- Execution and generation are roughly a 40-60 split for QEMU. Compiler/linker checks in QEMU are ~20% of the execution time. If they could somehow be done in a separate thread, it could be a big win (pkg-config is probably harder, but it's another 10%).
- get_id() is probably a good use for lazy_property (1.5% of run time).
- Paths are expensive, and iterating over paths is absurdly so due to an inefficient implementation of __getitem__. validate_within_subproject alone is 2% of execution time and can probably be kicked out of the profile (via either optimization or caching). Path manipulation (join) also appears in get_target_generated_dir.
- The weird one: ABCMeta.__instancecheck__, costing 3%. It's worth trying to remove all abc superclass annotations, perhaps replacing them with a cheaper version that does not need __instancecheck__. Interestingly, parse nodes do not have an abstract metaclass, hence this would not negate any benefit from double dispatch in evaluate_statement().
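
As an illustration of the get_id() idea, here is a sketch that uses the standard library's `functools.cached_property` as a stand-in for lazy_property. The `Target` class, its fields and the ID format are made up for the example and do not reflect how Meson computes target IDs:

```python
from functools import cached_property


class Target:
    """Hypothetical stand-in for an object with an expensive, stable ID."""

    def __init__(self, subdir: str, name: str) -> None:
        self.subdir = subdir
        self.name = name
        self.computations = 0  # only here to show how often the work runs

    @cached_property
    def _id(self) -> str:
        # Computed on first access, then stored in the instance __dict__.
        self.computations += 1
        return f"{self.subdir}@@{self.name}".replace("/", "@")

    def get_id(self) -> str:
        return self._id


target = Target("hw/net", "libnet.a")
first = target.get_id()
second = target.get_id()
```

This only works if the inputs to the ID never change after the first call, which is an assumption that would have to be checked against the real code.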

Some harder possibilities:

- isinstance is called no less than 18097946 times for a total of 10% execution time. Of those, 10% come from evaluate_statement(), where it may be possible to experiment with both if/elif ordering and double dispatch. Benchmarking might be hard.
- A large part of get_base_compile_args is accessing options. OptionsView.__getitem__ is... interesting.
- deepcopy() appears in the profile thanks to... create_test_serialisation. That's weird and may be worth investigating for another ~0.5% benefit.
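
To make the evaluate_statement() idea concrete, here is a toy comparison of an isinstance chain with double dispatch. The node classes and both interpreter classes are invented for the example and are far simpler than Meson's parse nodes:

```python
class Node:
    def accept(self, interpreter: "DispatchInterpreter"):
        raise NotImplementedError


class NumberNode(Node):
    def __init__(self, value: int) -> None:
        self.value = value

    def accept(self, interpreter: "DispatchInterpreter"):
        return interpreter.evaluate_number(self)


class NegateNode(Node):
    def __init__(self, operand: Node) -> None:
        self.operand = operand

    def accept(self, interpreter: "DispatchInterpreter"):
        return interpreter.evaluate_negate(self)


class ChainInterpreter:
    """isinstance chain: the cost depends on where the matching branch sits."""

    def evaluate_statement(self, node: Node) -> int:
        if isinstance(node, NumberNode):
            return node.value
        elif isinstance(node, NegateNode):
            return -self.evaluate_statement(node.operand)
        raise TypeError(f"unknown node {type(node).__name__}")


class DispatchInterpreter:
    """Double dispatch: the node picks the interpreter method to call."""

    def evaluate_statement(self, node: Node) -> int:
        # First dispatch on the node's class (accept), second on the
        # interpreter's class (evaluate_*); no isinstance call is involved.
        return node.accept(self)

    def evaluate_number(self, node: NumberNode) -> int:
        return node.value

    def evaluate_negate(self, node: NegateNode) -> int:
        return -self.evaluate_statement(node.operand)


tree = NegateNode(NegateNode(NumberNode(21)))
```

Whether the extra method call is cheaper than the isinstance chain depends on the number of node types and their frequency, so it would need to be measured on a real project.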

Focusing on such small percentages may seem weird, but it depends on how many of them you pile up. And after all, today's 1% was yesterday's 0.5%. As more optimizations are performed, one has to focus on the smaller ones.

And the places that are not interesting IMO:

- ninja_quote() is by far the most called function in meson, but it's only 1% of the total runtime; arglist's __iadd__ method is the hottest one but I've run out of ideas there.
- Regular expressions are only 3%, mostly in CLikeCompilerArgs.to_native; compiling them is only 0.5%. It seems too hard for the benefit, unlike the other options above.
- Many isinstance calls come from list/dict processing in functions like resolver() or _unholder(), which are probably not worth it (but they may be low-hanging fruit if one day Meson starts using Cython...).

bonzini force-pushed the speedup-objects branch from 0c6b3c5 to c05fb0c Compare January 8, 2025 11:33

bonzini added 5 commits January 8, 2025 14:42

interpreter: make operator functions binary

69dc0d5

In preparation for moving them to the class, make the operator functions binary. Adjust the lambdas for trivial operators, and store unbound methods for non-trivial ones.

Signed-off-by: Paolo Bonzini <[email protected]>

bonzini force-pushed the speedup-objects branch from c05fb0c to 33367f5 Compare January 8, 2025 13:44

bonzini mentioned this pull request Jan 9, 2025

Tracking issue for performance improvements PRs #14103

Open

12 tasks

bonzini added this to the 1.8 milestone Jan 9, 2025
