
Which optimization and why? #333

Closed
KlausMeier opened this issue May 27, 2019 · 26 comments

Comments

@KlausMeier

Have you tested all your optimizations and found a benefit? My standard flags are CFLAGS="-march=native -O2 -pipe -ftree-vectorize"

Then I added -flto. The binary file size was much smaller and the performance was a little better. I tested it with lame, p7zip and unrar.

Next was the full thing with -O2: CFLAGS="-march=native -O2 ${SEMINTERPOS} ${GRAPHITE} ${IPA} ${FLTO} -fuse-linker-plugin -ftree-vectorize -pipe" File size was the same but performance was a little worse.

Finally I tested the full thing: CFLAGS="-march=native ${CFLAGS} -pipe -ftree-vectorize" The file size was bigger than with the standard optimization without LTO. For lame the performance was a little better, for p7zip it was a little worse.

I think building a whole system with -O3 is a very big disadvantage. Do you think you can copy your files faster if you build coreutils with -O3? The opposite is the case. The binary is much bigger, takes more time to load, and uses more RAM.

Yes, there are some packages which run faster with -O3 or -Ofast. But they already have these flags in Portage. Just don't use -O3 for everything.

@InBetweenNames
Owner

The link between binary size and performance is a common misconception -- only the pages from the binary that are actually used are loaded into memory. A larger binary does not necessarily mean worse performance or loading times. For example, the ArrayFire libraries can reach multiple gigabytes in size, yet never page in more than a few megabytes of memory. Intel does this a lot with Clear Linux, too -- they compile hot sections for multiple arches, which absolutely increases binary size, but the sections that aren't used simply aren't paged in.
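To make the demand-paging point concrete, here is a minimal sketch rather than a measurement of any real binary. It assumes Linux (it reads `/proc/self/statm`) and uses a sparse temporary file as a stand-in for a large executable; the names `resident_bytes` and `touch_few_pages` are made up for this illustration. A file-backed mapping is the same mechanism the loader uses for program text, so the resident set should grow only by the pages that are actually touched.

```python
import mmap
import os
import tempfile


def resident_bytes():
    """Current resident set size in bytes, read from /proc (Linux only)."""
    with open("/proc/self/statm") as f:
        return int(f.read().split()[1]) * os.sysconf("SC_PAGE_SIZE")


def touch_few_pages(file_size=256 * 1024 * 1024, touches=4):
    """Map a large file, read a handful of bytes, and report RSS growth."""
    with tempfile.TemporaryFile() as f:
        f.truncate(file_size)  # sparse file standing in for a big binary
        with mmap.mmap(f.fileno(), file_size, access=mmap.ACCESS_READ) as m:
            before = resident_bytes()
            step = file_size // touches
            for i in range(touches):
                _ = m[i * step]  # faults in the page(s) around this byte only
            after = resident_bytes()
    return file_size, after - before


if __name__ == "__main__":
    size, grown = touch_few_pages()
    print(f"mapped {size >> 20} MiB, resident set grew by about {grown >> 10} KiB")
```

The kernel may fault in a few neighbouring pages per access, so the growth is not exactly one page per touch, but it should stay far below the mapped size.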

So, I'm not worried about bigger binaries from using -O3. Based on recent benchmarks, however, I am considering removing GRAPHITE. When I get some time, I will ping the upstream GCC devs about what the plan for it is. I know it's being maintained, but I'm not sure whether it's getting more than life support. It's unfortunate, because clang DOES have an actively maintained polyhedral optimization framework integrated in master.

  • -fipa-pta and -fdevirtualize-at-ltrans should be clear wins for performance.
  • -fno-semantic-interposition should as well.
  • -fno-plt has to be weighed against lazy binding. If you prelink, it's a non-issue and a clear win.
  • -ftree-vectorize is turned on by default at -O3

Now, there are a number of processor specific flags you might consider using as well. If you use Intel, -falign-functions=32 is a win. I know of some users using -flive-range-shrinkage as well, but I haven't had a chance to mess with that one.

In the end, though, this is what make.conf.lto.defines is for. If you really do need small binaries, you can run GentooLTO using -flto -Os -fno-plt and be sure you are getting the smallest possible binaries. That's a supported configuration.

By default, the philosophy of the overlay is to let the compiler decide what to do. -O3 gives it the full freedom to do that.

@KlausMeier
Author

OK, I'll test some more. But first, I only use -O2 because it works better for me than -O3. Also, with -O3 you enable -ftree-slp-vectorize, not -ftree-vectorize. I am on a Ryzen 2600X.

I'll test the other options without GRAPHITE. Perhaps I'll be happy then. Until now I have only tested all or nothing.

@InBetweenNames
Owner

Great!

Actually, it seems -ftree-vectorize enables both -ftree-slp-vectorize and -ftree-loop-vectorize, which are in turn both enabled by -O3. Does -ftree-vectorize do anything aside from enable those two other flags?
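One way to settle this kind of question empirically is to ask the compiler itself. The sketch below assumes `gcc` is on the PATH and relies on GCC's `-Q --help=optimizers` report, which lists each optimizer flag as `[enabled]` or `[disabled]` for the given -O level; the helper names are hypothetical. The answer depends on the GCC version, since newer releases have changed which vectorizer flags are on at -O2.

```python
import shutil
import subprocess


def parse_optimizers(text):
    """Map each -f flag in `gcc -Q --help=optimizers` output to True/False."""
    states = {}
    for line in text.splitlines():
        parts = line.split()
        # Lines of interest look like: "  -ftree-slp-vectorize   [enabled]"
        if (len(parts) >= 2 and parts[0].startswith("-f")
                and parts[1] in ("[enabled]", "[disabled]")):
            states[parts[0]] = parts[1] == "[enabled]"
    return states


def optimizers_at(level, compiler="gcc"):
    """Ask the compiler which optimizer flags are on at e.g. '-O2' or '-O3'."""
    out = subprocess.run(
        [compiler, "-Q", level, "--help=optimizers"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_optimizers(out)


def newly_enabled(low="-O2", high="-O3", compiler="gcc"):
    """Flags that are off at `low` but on at `high`."""
    a, b = optimizers_at(low, compiler), optimizers_at(high, compiler)
    return sorted(flag for flag, on in b.items() if on and not a.get(flag, False))


if __name__ == "__main__":
    if shutil.which("gcc"):  # skip quietly when no compiler is installed
        for flag in newly_enabled():
            print(flag)
```

Adding `-march=native` or other flags to the command list would show how they interact with the -O level on a given machine.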

@KlausMeier
Author

Sorry, you are right. With -O3, both -ftree-slp-vectorize and -ftree-loop-vectorize are enabled.
So far I have only tested it with p7zip. I removed GRAPHITE and it ran faster. But -O3 is slower than -O2. At the moment CFLAGS="-march=native -O2 ${SEMINTERPOS} ${IPA} ${FLTO} -fuse-linker-plugin -ftree-vectorize -pipe" is the best for me: best performance and small file size. If you have better flags, show me the benchmarks.

@InBetweenNames
Owner

What are you using to do your benches?

@KlausMeier
Author

KlausMeier commented May 27, 2019

First, LTO works very well! It reduced the RAM usage of my system by about 100 MB, and I think that at startup my system only had to load 600 MB instead of 700 MB, which was quicker. In the benchmarks, LTO sometimes gave a small performance advantage, but never a disadvantage.

I tried lame, p7zip and unrar. First I compared CFLAGS="-march=native -O2 -pipe -ftree-vectorize" against CFLAGS="-march=native -O2 ${SEMINTERPOS} ${GRAPHITE} ${IPA} ${FLTO} -fuse-linker-plugin -ftree-vectorize -pipe" and CFLAGS="-march=native ${CFLAGS} -pipe".

With -O3 lame was 1% faster, while p7zip and unrar were slower. For the rest I tested only p7zip. There was no difference between CFLAGS="-march=native -O2 ${SEMINTERPOS} ${GRAPHITE} ${IPA} ${FLTO} -fuse-linker-plugin -ftree-vectorize -pipe" and CFLAGS="-march=native -O2 ${FLTO} -fuse-linker-plugin -ftree-vectorize -pipe". But then I removed GRAPHITE and it ran faster. I switched to -O3 and it was slower again.
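Differences of around 1% are easy to lose in run-to-run noise, so comparisons like these are more convincing with repeated runs. The following is a rough sketch of such a comparison, not the procedure used for the numbers above: it times a command several times, compares medians, and reports the spread. The workload in the `__main__` block is a placeholder; in practice it would be the same lame or p7zip invocation run against binaries built with each flag set.

```python
import statistics
import subprocess
import sys
import time


def time_command(cmd, runs=5):
    """Run `cmd` `runs` times and return the wall-clock seconds of each run."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        subprocess.run(cmd, stdout=subprocess.DEVNULL,
                       stderr=subprocess.DEVNULL, check=True)
        samples.append(time.perf_counter() - start)
    return samples


def relative_change(baseline, candidate):
    """Fractional change of the median; negative means the candidate is faster."""
    base = statistics.median(baseline)
    return (statistics.median(candidate) - base) / base


def spread(samples):
    """Run-to-run noise expressed as (max - min) / median."""
    return (max(samples) - min(samples)) / statistics.median(samples)


if __name__ == "__main__":
    # Placeholder workload; swap in the real binaries built with each flag set.
    cmd = [sys.executable, "-c", "sum(range(10**6))"]
    a = time_command(cmd)
    b = time_command(cmd)
    print(f"change: {relative_change(a, b):+.1%}, "
          f"noise: {spread(a):.1%} / {spread(b):.1%}")
```

If the reported change is smaller than the noise of either sample, the two builds are probably indistinguishable on that workload.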

I think it is absolutely useless to build a whole system with -O3. 90% of all applications get no advantage from it. Do you think you can copy files faster when you build coreutils or dolphin with -O3? Do you think you can edit files faster if you build nano with -O3? And the rest? It is possible that an application runs faster with -O3, but in my tests p7zip and unrar ran slower.

-O3 is only useful for an application if you can measure a benefit. Don't use it just because the application still builds.

@InBetweenNames
Owner

I'm wondering more about your methodology. Which benchmarks did you run exactly? Note that Clear Linux uses -O3 by default and consistently beats other distros in Phoronix Test Suite benchmarks. It also enables other things like -fno-math-errno and -fno-trapping-math, which we don't by default (but will have a supported configuration for). Something I have planned later this year is to run the PTS on my own system and bench it against Clear Linux. I expect we'll be trading blows or doing better.

Where -O3 genuinely performs worse than -O2, this should be reported as a compiler bug. Part of GentooLTO is to find those!

@funghetto
Contributor

@InBetweenNames are there still plans to integrate Clear Linux's patches in this overlay?

@InBetweenNames
Owner

Yes, there are -- I've started off by first merging in their CFLAGS. The Clear Linux issue should still be open. Relevant issue is #164.

Next we need to complete the refactor to support multiple configurations. The relevant issue for that is #307. The LTO patching mechanism will likely be extended to pull in Clear Linux patches after that. Perhaps a USE=clearlinux will be added to sys-config/ltoize to facilitate that.

@KlausMeier
Author

KlausMeier commented May 27, 2019

Hey, I told you. Can you tell me why I would be able to copy files faster if I build coreutils with -O3? And the rest? Have you ever tested anything? Please tell me which application runs faster with your optimizations, based on your own tests.

If my system has to load 800 MB at startup instead of 500 MB, I think it takes longer. Is that in the Phoronix benchmark? I tested with the things I actually use at the moment.

@InBetweenNames
Owner

@KlausMeier , I'll be honest with you -- I'm not sure what you're after here. Do you want me to justify the existence of GentooLTO as a project? What's your end goal with this issue? We can go back and forth all day about the finer points of optimization flags, but it seems like there is something else that's on your mind.

@KlausMeier
Author

And I think most benchmarks are bullshit for a desktop system. I don't run a database or want to compute pi. The performance of a desktop system is the time it takes to load an application. Nothing else. And there is no benchmark for this. Yes, you can optimize one application very well, but don't think your system runs faster if you use these optimizations for the whole system.

@InBetweenNames
Owner

Just trying to get your mindset here. It seems like you think GentooLTO is a waste of time?

@KlausMeier
Author

KlausMeier commented May 27, 2019

OK, I understand. It is better to leave. I want the best performance for a system. I told you that, for example, LTO works perfectly for me. Do you think I would use this overlay if I thought it was useless? But when I see that some optimizations reduce my performance? I see no benchmarks in this thread that are proof for all this shit. And when I test something, you tell me this? I spent a lot of time testing all your settings, and I reported my experiences. I don't see anybody else here doing any testing.

If you have a problem with that, have a very nice day.

@KlausMeier
Author

I tested a lot. But that was not welcome. I think you have to choose optimizations for every application separately if you want the best performance. There is no one setting for everything. But you think I consider GentooLTO a waste of time because I spend my time optimizing it? I tested some applications with a lot of optimizations, and I told you the results. And you told me "It seems like you think GentooLTO is a waste of time?".

GentooLTO is not a waste of time. But the time I spent on GentooLTO was a very big waste of time for me, when I see your answer.

@InBetweenNames
Owner

You're asking questions that we just don't have the answers for yet. I never promised absolute performance. The flags that are chosen by default should make the system faster, but that doesn't mean they will! It says it right in the third line of the README. It's theoretically maximum speed. If it's not, that's a bug in GCC. It would be nice if people reported performance degradations upstream to GCC, but I'm not sure how many do. I post all upstream issues here, for reference. Most of the time, when I look upstream for a codegen related issue, it's already been reported which is nice.

> But when I see that some optimizations reduce my performance? I see no benchmarks in this thread that are proof for all this shit.

The benchmarks are an open problem and people have been benchmarking, albeit sporadically, since the project started. Occasionally a new thread pops up with results. Most of the time they are encouraging, even if only by a few percent. But there's a long, long way to go. If there's one critical result that will come out of GentooLTO one day, it's how to benchmark an operating system. Where is the proof, you ask? Well, it's yet to be contributed. Certainly I'm not ready to spend time doing that. I have a PhD to finish, and GentooLTO works for my purposes in my domain-specific context. Will it work in yours? I don't know! You tell me.

Let me ask you this: what is the best way of benchmarking a GentooLTO system? Is it by running a couple of lame or p7zip tests locally? Is it by running apache and seeing how many requests can be handled per second? Is it startup time? Binary size? Code section size? Number of page faults? It's an open problem, and having an answer would be really nice. The Phoronix Test Suite is the best thing we have currently in that regard, and it has a long way to go, too.
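Several of the candidate metrics in that list can at least be collected cheaply. As a minimal sketch, assuming a Unix-like system where Python's `resource` module is available, the hypothetical helper below launches a command once and records wall time, minor and major page faults, and the on-disk size of the executable. It is not a proposal for how GentooLTO should be benchmarked, only an illustration of gathering the raw numbers.

```python
import os
import resource
import shutil
import subprocess
import sys
import time


def measure_launch(cmd):
    """Run `cmd` once; return wall time, page-fault counts, and binary size."""
    before = resource.getrusage(resource.RUSAGE_CHILDREN)
    start = time.perf_counter()
    subprocess.run(cmd, stdout=subprocess.DEVNULL,
                   stderr=subprocess.DEVNULL, check=True)
    wall = time.perf_counter() - start
    after = resource.getrusage(resource.RUSAGE_CHILDREN)
    binary = shutil.which(cmd[0]) or cmd[0]
    return {
        "wall_s": wall,
        # Counters for waited-for children are cumulative, so take the delta.
        "minor_faults": after.ru_minflt - before.ru_minflt,
        "major_faults": after.ru_majflt - before.ru_majflt,
        "binary_bytes": os.path.getsize(binary),
    }


if __name__ == "__main__":
    # Placeholder command; replace with the application under test.
    print(measure_launch([sys.executable, "-c", "pass"]))
```

Major faults mostly show up when the page cache is cold, so a first launch after boot and a repeated launch can give quite different pictures of "time to load an application".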

> And when I test something, you tell me this? I spent a lot of time testing all your settings, and I reported my experiences. I don't see anybody else here doing any testing.

And I really appreciate that. But I want to be clear here: this is a volunteer project, and I don't expect anyone to do anything at the end of the day, really. It's nice when people contribute things. But don't think for a second that you are obligated to do anything for this project, nor is this project obligated to do anything for you, no matter how much time you spend on the project. This is not a job, and thank god for that. You should mess around with GentooLTO if you like it and are interested in it. If you're results oriented, you should adjust your expectations accordingly and use your own configuration.

> I tested a lot. But that was not welcome. I think you have to choose optimizations for every application separately if you want the best performance. There is no one setting for everything. But you think I consider GentooLTO a waste of time because I spend my time optimizing it? I tested some applications with a lot of optimizations, and I told you the results. And you told me "It seems like you think GentooLTO is a waste of time?".

Sure, if performance is your goal and none of the other stated goals of GentooLTO, then you will need to tailor your setup specifically for that. That's what make.conf.lto.defines is for. At this point, I mostly hear from users running their own combinations of flags using that, which is why we're moving over to a configuration based system instead of a global default. Who knows, maybe a @KlausMeier configuration will find its way in there too. There will be a "LTO-only" configuration at least, where the rest is up to you.

> GentooLTO is not a waste of time. But the time I spent on GentooLTO was a very big waste of time for me, when I see your answer.

Okay, but I don't really want to spend more time arguing about the premise of the project. The goal is not "absolute performance for Gentoo", the goal is "theoretically absolute performance for gentoo, using deviations from that to file bug reports and improve open source software, including GCC".

Personally, I have observed the default configuration has improved performance on my system. Do I think it'll make files copy faster? No, that's an IO-bound process, and -O0 probably performs as well as -O2. Do I think I can edit files faster if I build nano with -O3? Probably not! But it also shouldn't be worse than -O2. If it is worse than -O2, it's a bug, full stop.
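The IO-bound argument can be put in numbers with Amdahl's law: if only the CPU-bound share of the runtime gets faster, the overall speedup is capped by that share. The sketch below uses made-up fractions purely for illustration; they are assumptions, not measurements of coreutils or any other package.

```python
def overall_speedup(cpu_fraction, cpu_speedup):
    """Amdahl's law: whole-program speedup when only the CPU-bound share gets faster."""
    if not 0.0 <= cpu_fraction <= 1.0 or cpu_speedup <= 0:
        raise ValueError("cpu_fraction must be in [0, 1] and cpu_speedup positive")
    return 1.0 / ((1.0 - cpu_fraction) + cpu_fraction / cpu_speedup)


if __name__ == "__main__":
    # Hypothetical CPU-bound shares and a hypothetical 10% codegen improvement.
    for name, fraction in [("IO-bound file copy", 0.05), ("CPU-bound encoder", 0.95)]:
        print(f"{name}: {overall_speedup(fraction, 1.10):.3f}x overall")
```

With a small CPU-bound share, even a large code-generation improvement barely moves the total, which is why a file copy is a poor test of optimization flags either way.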

Furthermore, I just don't have the time or resources to maintain a list of -O3 exceptions for the sake of practical speed. Maintaining the other workarounds is hard enough. After the refactor, you'll be able to have an LTO-only configuration as a clean slate and you can go from there.

I just so happen to be one of those users that wants to run a database and compute pi, and yeah, I benefit from GentooLTO. That's why I take the time to maintain it.

@Kokokokoka

I think that there are some good points.

  1. There are now many GCC flags that users can apply blindly without proper testing.
  2. It is rather difficult to create a proper test, as different packages generally do better with different optimisation flags.
  3. The argument about -O2 is interesting: in a different issue I saw an opt-flags testing framework, and the best performance was (as far as I can remember) with -O2 -flto and some other flags such as -malign-data=cacheline and other alignment flags.

  So:

  4. Having a @System testing bench is a good idea, but it's difficult to do.
  5. It would be interesting to adapt the Clear Linux patches.
  6. One can test with the Phoronix Test Suite (but this test is almost irrelevant).
  7. Use the flag bench from the other issue and try to populate it.

@KlausMeier
Author

I used it. I liked it. I tested it and I reported my results. And the answer to all of this was "It seems like you think GentooLTO is a waste of time?". Now I have removed it. Today I spent two hours testing. But facts are not welcome.

@Kokokokoka

> I used it. I liked it. I tested it and I reported my results. And the answer to all of this was "It seems like you think GentooLTO is a waste of time?". Now I have removed it. Today I spent two hours testing. But facts are not welcome.

I think that you're jumping to conclusions here.
Your discoveries are very interesting and I'll test my system with your config
with the addition of -malign-data and other alignment flags, plus some compiler fixes, such as:
CBR="-fwrapv -fno-delete-null-pointer-checks -fno-strict-aliasing -fno-strict-overflow"

@InBetweenNames
Owner

The issue in question is #288 and it's great work that I intend to try myself when I get a chance.

@KlausMeier the only reason I said that was because you seemed to be questioning the very premises of the GentooLTO project. It seemed like you were unhappy with the results and wanted me to change the project defaults to what you found to work best on your system. You wanted proof that the defaults were in fact the highest performance. Perhaps I could have worded my question better, but this is really what I was after.

Ultimately though, doing this would mean shifting the project focus from bug finding to "just give me the best performance", and that's something I simply can't promise anyone. The default configuration SHOULD give the best performance but that doesn't mean it does. #288 is very interesting and I think would fit really nicely into the new GentooLTO layout.

@javashin

I SMELL TROLLING...........

@javashin

javashin commented May 27, 2019

This troll account was created on May 14, 2019. Where was he before that? A new user on GitHub just to troll here...

@InBetweenNames
Owner

I think they just didn't have a GitHub account before they were interested in GentooLTO.

@javashin

javashin commented May 27, 2019

GentooLTO is the best overlay around. We need to do some benchmarking.

eix phoronix

  • app-benchmarks/phoronix-test-suite [1]
    Available versions: ()8.0.1-r1 ()8.2.0-r1 ()8.4.1 ()8.6.0 **9999 {sdl}
    Homepage: http://www.phoronix-test-suite.com
    Description: Phoronix's comprehensive, cross-platform testing and benchmark suite

[1] "bobwya" /var/lib/layman/bobwya

I'm installing it on the weekend along with Clear Linux to see how my prelinked system does against Clear Linux.

@pchome
Contributor

pchome commented May 27, 2019

@KlausMeier
The GentooLTO project is not about squeezing out maximum performance, but about LTO.
GRAPHITE, PGO and all the other additional compiler/linker flags are just a side effect, because "why not".
IMHO.

@Kokokokoka

> To have a @System testing bench is a good idea, but it's difficult to do

@system and @world, what a great github user names to track Gentoo issues 🤣

P.S. I have my whole system LTOed. Not a single performance test has been run on it yet.

@InBetweenNames
Owner

Going to close this issue as it seems nothing more is coming out of it. If it needs to be reopened, just comment.


6 participants