This repository has been archived by the owner on Nov 15, 2023. It is now read-only.

Investigate performance impact of compiler flags #10608

Closed
ggwpez opened this issue Jan 7, 2022 · 15 comments
Assignees
Labels
I9-optimisation An enhancement to provide better overall performance in terms of time-to-completion for a task. U3-nice_to_have Issue is worth doing eventually. Z2-medium Can be fixed by a coder with good Rust knowledge but little knowledge of the codebase.

Comments

@ggwpez
Member

ggwpez commented Jan 7, 2022

Try out different compiler flags and check with the benchmarks for any substantial performance increase.
The LTO flag seems promising; it was discussed and later adopted in Polkadot: paritytech/polkadot#4311.

If you know of more flags that could help, please post them here.
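For context, a minimal sketch of what such a profile could look like in a workspace Cargo.toml — the profile name and exact values here are assumptions for illustration, not necessarily what was adopted in paritytech/polkadot#4311:

```toml
# Sketch of a Cargo profile enabling fat LTO (profile name assumed):
[profile.production]
inherits = "release"
lto = "fat"          # whole-program link-time optimization
codegen-units = 1    # single codegen unit for maximum cross-function optimization
```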

@ggwpez ggwpez added I9-optimisation An enhancement to provide better overall performance in terms of time-to-completion for a task. U3-nice_to_have Issue is worth doing eventually. Z2-medium Can be fixed by a coder with good Rust knowledge but little knowledge of the codebase. labels Jan 7, 2022
@ggwpez ggwpez self-assigned this Jan 7, 2022
@koute
Contributor

koute commented Jan 24, 2022

There's also the -C target-cpu flag (mentioned in paritytech/polkadot#4311); by default AFAIK it's set to x86-64 (run rustc -C target-cpu=help for a full list), and I think we could use x86-64-v2 by default, since even decade-old CPUs should support it at this point.

The feature levels themselves are described here:

https://gitlab.com/x86-psABIs/x86-64-ABI/-/blob/master/x86-64-ABI/low-level-sys-info.tex
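As a sketch of trying the flag out (the x86-64-v2 value comes from the comment above; the cargo invocation is an assumption and is left commented out here):

```shell
# List the CPU baselines the installed rustc knows about (output varies by toolchain):
#   rustc --print target-cpus
# Opt a release build into the x86-64-v2 baseline instead of the default x86-64:
RUSTFLAGS="-C target-cpu=x86-64-v2"
export RUSTFLAGS
# cargo build --release    # run inside the workspace; commented out in this sketch
echo "$RUSTFLAGS"
```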

@ggwpez
Member Author

ggwpez commented Jan 24, 2022

There's also the -C target-cpu flag (mentioned in paritytech/polkadot#4311); by default AFAIK it's set to x86-64 (run rustc -C target-cpu=help for a full list), and I think we could use x86-64-v2 by default, since even decade-old CPUs should support it at this point.

Thanks for the idea!
This was also investigated for wasmi: wasmi-labs/wasmi#339 (comment).
I also tested it in Substrate on the reference hardware, with no performance increase.

I'm not sure it makes sense to specify a concrete value (besides native) for it, since we are only concerned about performance on the reference hardware.

@koute
Contributor

koute commented Jan 24, 2022

That's good to know, thanks for checking!

I was wondering about the actual performance difference since the Moonbeam guys wrote that it gave them "good results"; it's a bummer you've found no increase. (Although maybe it just makes a difference when a significantly newer target is specified, instead of just slightly bumping the baseline?)

@ggwpez
Member Author

ggwpez commented Jan 24, 2022

Although maybe it just makes a difference when a significantly newer target is specified, instead of just slightly bumping the baseline?

The reference hardware is an i7-7700K from 2017, so compiling on a newer CPU should already make it faster than the weights we have specified.
I will still put this in my backlog to take another look, just to ensure that we get maximum performance.
PS: We are planning to upgrade the reference hardware soon anyway, so everything here would change again 🙈

@ggwpez
Member Author

ggwpez commented Feb 2, 2022

#10777 talks about target-cpu again.
As it turns out, we currently do not have the tooling in place to actually test this.
I just ran the benchmarks and compared the weights, but those only capture the computational complexity, excluding the DB operations.
So if the target-cpu change turns out to improve DB performance, I will test it.
Better tooling for this is on the way.

@koute
Contributor

koute commented Feb 4, 2022

Could we perhaps use the export-blocks subcommand and then do an import-blocks to indirectly also test the DB performance? (Not sure whether you've already tried it or if it'd actually be useful; just throwing the idea out.)
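A rough sketch of that idea — export a fixed block range once, then time re-importing it into a fresh database for each compiler-flag build. The binary path, chain name, and file names below are placeholders, so the actual node commands are left commented out:

```shell
# Placeholder binary path and chain; adjust for a real checkout.
NODE=./target/production/substrate
EXPORT_CMD="$NODE export-blocks --chain dev blocks.bin"
IMPORT_CMD="$NODE import-blocks --chain dev -d /tmp/fresh-db blocks.bin"
# In a real checkout you would run:
#   $EXPORT_CMD          # once, to capture a fixed workload
#   time $IMPORT_CMD     # per build, against a fresh database
echo "$IMPORT_CMD"
```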

@ggwpez
Member Author

ggwpez commented Feb 4, 2022

Could we perhaps use the export-blocks subcommand and then do an import-blocks to indirectly also test the DB performance? (Not sure whether you've already tried it nor if it'd actually be useful; just throwing the idea out.)

Yea, I actually tried that here 😄
Currently I am trying to reproduce the setup that produced these hardcoded weights.
Then I can re-run it, get proper values for DB-Read/Write, BlockBase- and ExtrinsicBase-fees, and compare them with target-cpu.

PS: I did not run the import with different profiles, so that could still be done. But it takes quite a long time to run...

@ggwpez
Member Author

ggwpez commented May 10, 2022

Does not seem to make a difference as shown here: #10777 (comment)
Closing until we find new compiler flags that we want to test.

@ggwpez ggwpez closed this as completed May 10, 2022
@MathCryptoDoc

MathCryptoDoc commented Aug 27, 2022

I wanted to comment on this issue since it seems that the production profile, which uses opt-level=3, lto=fat, codegen-units=1, is not always optimal when compiling for the native arch. See the summary; more details are in these Python notebooks, where I benchmarked basically all possible options. Quite surprisingly, opt-level=2 seems to be better.

The scores use the benchmarks for BLAKE2 and SR25519, and the timing of the remark extrinsic. Are there other good extrinsics that can be benchmarked via the polkadot binary?
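For reference, a hedged sketch of the opt-level=2 variant being compared against the production settings above — the profile name is an assumption; only the three option values come from the comment:

```toml
# Variant of the production-style profile with opt-level 2 instead of 3
# (option values taken from the comment above; profile name assumed):
[profile.production]
inherits = "release"
opt-level = 2
lto = "fat"
codegen-units = 1
```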

@nazar-pc
Contributor

If you're compiling for native then you can tweak other flags too. Default builds are not tuned for native.

@MathCryptoDoc

Ah, I didn't know this was meant only for the default build. Which flags would you use for tuning a native build then?

@nazar-pc
Contributor

I meant that I believe the flags in this repo don't impact compilation flags in projects that use it as a dependency. It is up to you to set whatever flags you want in your workspace's Cargo.toml or whatnot.
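A minimal sketch of opting into native tuning in a downstream workspace — the file is the standard Cargo config location, but the specific flag is just an example, not a recommendation from this thread:

```toml
# .cargo/config.toml in the downstream workspace (example flag only):
[build]
rustflags = ["-C", "target-cpu=native"]
```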

@ggwpez
Member Author

ggwpez commented Aug 29, 2022

I wanted to comment on this issue since it seems that the production profile that uses opt-level=3, lto=fat, codegen-units=1 is not always that good when compiling for the native arch.

Yes, possibly. But there have to be some settings for a production build, and we can't optimize for every CPU.
I think the default options are reasonable? We tried to pick options that let the compiler do all the optimizations it can.

Throwing away the builds that do not improve upon the official binary, we are left with 15, 21, 38, 40, 45 as the good builds. They all build with `target-cpu=native`.

So the default release binary should be good enough for most cases. High performance users have to build themselves anyway.

Are there other good extrinsics that can be benchmarked via the polkadot binary?

It prints all available ones if you call it with the --list flag, but currently there is only one other: transfer_keep_alive.
You can also take a look at the other benchmark commands in the README.

@MathCryptoDoc

Yes, possibly. But there have to be some settings for a production build, and we can't optimize for every CPU. I think the default options are reasonable? We tried to pick options that let the compiler do all the optimizations it can.

Sure, I agree for a generic production profile, the current settings make a lot of sense of course. I thought this issue was meant to investigate performance of compiler flags in general, including specialised builds. My mistake...

It prints all available ones if you call it with the --list flag, but currently there is only one other: transfer_keep_alive. You can also take a look at the other benchmark commands: README.

Thanks, that's convenient. Testing all of that will take a long time, but it might be interesting.

@ggwpez
Member Author

ggwpez commented Sep 2, 2022

Sure, I agree for a generic production profile, the current settings make a lot of sense of course. I thought this issue was meant to investigate performance of compiler flags in general, including specialised builds. My mistake...

We investigated the impact of the flags in order to find good default settings.
Writing some kind of blog article (like you are doing ❤️) is better suited for more advanced insights.


4 participants