Integration of Clear Linux patches #164
I didn't look thoroughly, but if I'm seeing it right, these are just patches in their git repo? It would be very easy to put the patches into /etc/portage/patches. Is there some downside to this? |
It appears that way, however I'm guessing they also fine tune compiler options as well. Also, I'm unsure if they use |
Interesting flag: |
Other interesting things: they appear to enable |
It appears that |
I just tested an architecture that has AVX512 instructions:
It seems by default on |
Phoronix claims that they build using GCC/Clang (differs from package to package). |
Makes sense - |
I heard that also Solus is using some of their optimizations. |
@InBetweenNames One thing that Clear Linux does that isn't really necessary for us is function multiversioning. I think the dynamic linker picks between function variants at load time based on e.g. AVX support. Since everyone here is probably building with -march=native, we can get smaller binaries than Clear can, may have better LTO opportunities, etc. It would be interesting to see if we can get Michael to bench lto-overlay vs. Clear, especially once we steal some of their fancy tricks. Regarding -falign-functions, I'd expect this to be about aligned jumps/instruction cache line reads. Maybe RIP-relative addressing? But once you are executing in the function, instruction alignment is going to be off no matter what you do, thanks to variable-length instructions. Even if AVX cared about instruction alignment, I'm not sure this would help. Is it possible that there is a more compact way to load immediates if they are 32-aligned, so function call sites are smaller? |
Agreed -- in fact we should do even better than function multiversioning since we're compiling our system exactly tailored for the system it's running on. This means more opportunities for LTO all around. Not to mention, this should be highly portable across many architectures. If one could link their system using mainly static libraries, I bet the LTO benefits would be even more profound. You can link-optimize across static library boundaries, and you can't do that with shared objects. I don't believe this is possible as-is, however, since Portage seems to really prefer shared objects, and configure scripts, etc., also prefer shared objects. I was wondering the same about http://hpac.rwth-aachen.de/teaching/sem-accg-16/slides/08.Schmitz-GGC_Autovec.pdf I also found a StackOverflow question that indirectly touches on it:
Without delving in the GCC internals, I can't find many resources that recommend this flag. I'll see if the Intel guys will shed some light on it. |
More goodies: Line 481: https://github.com/clearlinux/autospec/blame/master/autospec/specfiles.py In that commit, I find it interesting they enable
Even with the most aggressive optimization package, this is on by default. |
Found when it was added: https://github.com/clearlinux/autospec/blame/a5260d7ce751774d46e0a957786d179456a14275/autospec/buildpattern.py It was added by @fenrus75 for "high speed cases". Interesting. |
I notice that on packages that are optimized for size, they enable |
After researching a bit more, it looks like |
OK -- so locally, I have enabled |
OK -- I think I figured it out: https://software.intel.com/en-us/forums/intel-c-compiler/topic/635646 For more info: https://lkml.org/lkml/2015/5/19/1009 It looks like the historical reason for
From Agner Fog's docs: https://www.agner.org/optimize/microarchitecture.pdf See "Instruction Fetch" sections for details. However, consider that cache lines are usually 64 bytes long -- depending on your processor. From Ingo's post:
As for why
So, for functions that are larger than the cache line size, aligning on a cache line boundary makes the most sense. For functions that are smaller than the cache line size, this isn't ideal as it wastes I$ space. Of course, when inlining is taken into account, which is much more likely since we are using system-wide LTO, this whole discussion becomes moot. However, this is still a problem for shared objects. Ideally, GCC/ld would be smarter about how it aligns functions. Work has been done to this end: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66240 So, looking at
In other words, And another goodie
This flag is off by default! Delving in the GCC source code in file
and the calling code in
So in the worst case, we still get 8-byte function alignment for functions that are smaller than
The above should guarantee the following, for a function that takes
The check is done in this order to prevent wasting space. So, it seems to me we should be using I will test out my GCC patch and if it works OK, I will submit it upstream. |
nice detective work!
…On Sat, Nov 3, 2018, 13:20 Shane Peelar ***@***.*** wrote:
OK -- I think I figured it out:
https://software.intel.com/en-us/forums/intel-c-compiler/topic/635646
For more info:
https://lkml.org/lkml/2015/5/19/1009
It looks like the historical reason for -falign-functions=16 is:
"The instruction fetch unit can fetch a maximum of 16 bytes of code per clock cycle"
From Agner Fog's docs:
https://www.agner.org/optimize/microarchitecture.pdf
See "Instruction Fetch" sections for details.
However, consider that cache lines are usually 64 bytes long -- depending on your processor. From Ingo's post:
"So based on those measurements, I think we should do the exact opposite of my original patch that reduced alignment to 1 bytes, and increase kernel function address alignment from 16 bytes to the natural cache line size (64 bytes on modern CPUs)."
As for why -falign-functions=32 was chosen? I have a feeling it's actually a compromise. See this reply by Linus:
https://lkml.org/lkml/2015/5/19/1142
"Is there some way to get gcc to take the size of the function into account? Because aligning a 16-byte or 32-byte function on a 64-byte alignment is just criminally nasty and wasteful."
So, for functions that are larger than the cache line size, aligning on a cache line boundary makes the most sense. For functions that are smaller than the cache line size, this isn't ideal as it wastes I$ space. Of course, when inlining is taken into account, which is much more likely since we are using system-wide LTO, this whole discussion becomes moot. However, this is still a problem for shared objects. Ideally, GCC/ld would be smarter about how it aligns functions.
Work has been done to this end:
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66240
It has been merged in trunk!
So, looking at -falign-functions once again:
"Align the start of functions to the next power-of-two greater than n, skipping up to n bytes. For instance, -falign-functions=32 aligns functions to the next 32-byte boundary, but -falign-functions=24 aligns to the next 32-byte boundary only if this can be done by skipping 23 bytes or less."
In other words, -falign-functions=24 will align a function to a 32-byte boundary unless that would take more than 23 bytes of padding, which happens only when the preceding code ends 1 to 8 bytes past a 32-byte boundary. On its own, this limit depends on where the previous code ends, not on the size of the function being aligned.
And another goodie, -flimit-function-alignment:
"If this option is enabled, the compiler tries to avoid unnecessarily overaligning functions. It attempts to instruct the assembler to align by the amount specified by -falign-functions, but not to skip more bytes than the size of the function."
This flag is off by default!
Delving in the GCC source code in file gcc/config/i386/x86-64.h:

    #define ASM_OUTPUT_MAX_SKIP_ALIGN(FILE,LOG,MAX_SKIP) \
      do { \
        if ((LOG) != 0) { \
          if ((MAX_SKIP) == 0) fprintf ((FILE), "\t.p2align %d\n", (LOG)); \
          else { \
            fprintf ((FILE), "\t.p2align %d,,%d\n", (LOG), (MAX_SKIP)); \
            /* Make sure that we have at least 8 byte alignment if > 8 byte \
               alignment is preferred. */ \
            if ((LOG) > 3 \
                && (1 << (LOG)) > ((MAX_SKIP) + 1) \
                && (MAX_SKIP) >= 7) \
              fputs ("\t.p2align 3\n", (FILE)); \
          } \
        } \
      } while (0)

and the calling code in gcc/varasm.c:

    ...
    #ifdef ASM_OUTPUT_MAX_SKIP_ALIGN
          int align_log = align_functions_log;
    #endif
          int max_skip = align_functions - 1;
          if (flag_limit_function_alignment && crtl->max_insn_address > 0
              && max_skip >= crtl->max_insn_address)
            max_skip = crtl->max_insn_address - 1;
    #ifdef ASM_OUTPUT_MAX_SKIP_ALIGN
          ASM_OUTPUT_MAX_SKIP_ALIGN (asm_out_file, align_log, max_skip);
    #else
          ASM_OUTPUT_ALIGN (asm_out_file, align_functions_log);
    #endif
        }

So in the worst case, we still get 8-byte function alignment for functions that are smaller than the -falign-functions value (as long as they are at least 8 bytes long, since the fallback requires MAX_SKIP >= 7). So, with the default, you get at most 16 bytes alignment and at least 8 bytes alignment with -flimit-function-alignment. It would probably make more sense to make it the L1 cache line size by default and at least 16 bytes with -flimit-function-alignment. This is a pretty trivial change to make:

    #define ASM_OUTPUT_MAX_SKIP_ALIGN(FILE,LOG,MAX_SKIP) \
      do { \
        if ((LOG) != 0) { \
          if ((MAX_SKIP) == 0) fprintf ((FILE), "\t.p2align %d\n", (LOG)); \
          else { \
            fprintf ((FILE), "\t.p2align %d,,%d\n", (LOG), (MAX_SKIP)); \
            if ((1 << (LOG)) > ((MAX_SKIP) + 1)) \
              { \
                /* Make sure that we have at least 16 byte alignment \
                   if > 16 byte alignment is preferred. */ \
                if ((LOG) > 4 && (MAX_SKIP) >= 15) \
                  fputs ("\t.p2align 4\n", (FILE)); \
                /* Make sure that we have at least 8 byte alignment if > 8 byte \
                   alignment is preferred. */ \
                else if ((LOG) > 3 && (MAX_SKIP) >= 7) \
                  fputs ("\t.p2align 3\n", (FILE)); \
              } \
          } \
        } \
      } while (0)

The above should guarantee the following, for a function that takes b bytes, with -falign-functions=n and -flimit-function-alignment:
- If b >= n: for sure will be aligned to n
- If n > 16 and 16 <= b < n: will be at least aligned to a 16 byte boundary
- Otherwise, if n > 8 and 8 <= b < n: will be at least aligned to an 8 byte boundary
The check is done in this order to prevent wasting space.
So, it seems to me we should be using -falign-functions=${L1ICACHELINESIZE} -flimit-function-alignment.
I will test out my GCC patch and if it works OK, I will submit it upstream.
|
Heh, of course the GCC devs beat me to the punch. It looks like they are reworking the |
Thanks! In GCC trunk, we have this nice thing:
So, we may be able to say something like
Obviously these values would need to be tweaked. But it would give the desired result at least. This doesn't appear to be documented anywhere however. |
Okay, it is documented and I simply didn't look hard enough. It's hard to retain the old behaviour with the new method, since the secondary alignment will only be triggered if |
Got a response from Arjan van de Ven!
Very interesting. So, Ingo's findings confirm this to a degree, and suggest even stronger alignment requirements are beneficial. He says it best here: https://lkml.org/lkml/2015/5/21/443
For fun, here's an attempt to restore the functionality in my previous patch on GCC trunk:
So, with
Examples for the above would be Obviously such a scheme would need benchmarks to show it's worth doing over the default. It could potentially waste space, too, since a function with Regardless of whether the default schemes or the one I posted above is used, based on what we have seen, As I find this issue to be very interesting, I'd like to leave it up for discussion, especially in the hopes we get some benchmarks using combinations of these flags. My diff against GCC trunk for my own alignment scheme is below, in case anyone wants to try it on GCC trunk:
|
Enable -fno-semantic-interposition by default
Add off by default define for -fno-common
Reference #164
In addition to the function alignment, there's also data-alignment which takes a cacheline option: I also build my system with: |
The Clear Linux "fast-math" options can be very beneficial to auto-vectorization, it gives many more opportunities than the default IEEE754 strict compliance. |
@sjnewbury: I got the impression that I've been hemming and hawing about the strict IEEE compliance myself, and I've decided we can support it as an opt-in enhancement. I know some users of this overlay are using it for scientific computations, and I don't want to interfere with that automatically. |
One more thing: is |
I use the code from this bug report to benchmark -falign-functions: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=58863 Also you should look into --param inline-unit-growth=5 --param max-unroll-times=2 |
Well, I don't want to use Now of course, we may want to enable these flags on a per-package basis where improvement has been proven via benchmarks (as with your php example). Otherwise, it should be the defaults with possibly a few optional tweaks on a per package basis. |
Interesting. I didn't get any error using the GNU ld (BFD), but I'm getting an error too on eudev if I switch my linker to gold. Not reproducing the issue on python however. |
Ah, of course switching the linker didn't occur to me. btrfs-progs and eudev indeed do build for me with |
As for python, turns out I get the error only when combining gold and pgo. |
This is all very good to know. Definitely will hold off on |
FWIW (Currently building gentooLTO+x32+auto-prelink+autopar+jemalloc) |
Just a heads up, I'm considering creating some ebuilds with the Clear Linux patches in my overlay. I tested their kernel patches today to great success (reduced boot time by about 25%) so I will create an ebuild for at least the patched kernel and maybe some other packages, depending on how much success I have with them. |
@aw1cks I see they change the perf_bias to default to performance instead of normal. How much does this account for? This setting makes a big difference on my Ivy Bridge laptop, but I have it set to toggle on power events to maximise battery life. |
@sjnewbury I don't use it on a laptop. I haven't got round to rebuilding my laptop with gentoo yet, but on my desktop I didn't use all the patches but rather the ones relating to boot speed & performance without much regard for the patches claiming to reduce wakelocks. As far as I can tell, in their use case with Clear Linux the difference in power is more than offset by the other tweaks which they have made. How much of this extra battery life comes from their kernel, I couldn't say, seeing as they have custom patches applied to many userland applications and even gcc itself. You would have to benchmark it to know for sure. If you want to test the kernel yourself, you can use any 4.19 series kernel and put the patches into |
for power, do look at the clr-power-tweaks package... that is where much of
the tuning happens
…On Wed, Dec 5, 2018, 16:10 Alex Wicks ***@***.*** wrote:
@sjnewbury <https://github.com/sjnewbury> I don't use it on a laptop. I
haven't got round to rebuilding my laptop with gentoo yet, but on my
desktop I didn't use all the patches but rather the ones relating to boot
speed & performance without much regard for the patches claiming to reduce
wakelocks. As far as I can tell, in their use case with Clear Linux the
difference in power is more than offset by the other tweaks which they have
made. How much of this extra battery life comes from their kernel, I
couldn't say, seeing as they have custom patches applied to many userland
applications and even gcc itself. You would have to benchmark it to know
for sure. If you want to test the kernel yourself, you can use any 4.19
series kernel and put the patches into
/etc/portage/patches/sys-kernel/${KERNEL_SOURCE_PKG_NAME}-${KERNEL_VERSION}/.
The patches are available here <https://github.com/clearlinux-pkgs/linux>.
Just as a note, I do hope to eventually create ebuilds which DO include
their userland patches for various programs available as a useflag, and
then in that case we can make a fair comparison. Additionally, I do wonder
if the use of systemd vs openRC could play a role here (in my experience, I
have had higher power consumption when using init systems other than
systemd - maybe it's something I'm doing wrong, I couldn't tell you) . If
you do find anything out, please let me know as I'm quite interested in
this myself for my laptop.
|
@fenrus75 thanks for pointing in the right direction. How can I build this package without Clear Linux userspace tools? I can't find any binaries in the repository, nor any of the releases, and the Makefile references a file not included in the repository. |
uh you might need to rpm2cpio our src.rpm
|
Great, thanks. However I'm having an issue with autoconf.
Some missing include maybe? |
Leaving a link to this new Phoronix post here, benchmarking some of the performance gains from Clear Linux, to give some extra context to this issue and get some idea of what we can hope for from following up: https://www.phoronix.com/scan.php?page=article&item=clear-faster-blas&num=1 |
I'm prelinking Gentoo too with my new install (gentoo nomultilib lto nopie nossp) |
@InBetweenNames Did you by any chance look into Even if it's not worth enabling globally, perhaps packages that can't be built with full LTO could benefit from something like |
@jelinekto Umm, https://stackoverflow.com/questions/4274804/query-on-ffunction-section-fdata-sections-options-of-gcc . This is not as straightforward as you might think. |
Integration of Clear Linux's 'multi-thread-default.patch'
By default, zstd uses one core for compression. This patch makes zstd use all physical cores detected for compression, increasing performance and reducing compression time. The results below are from zstd's built-in benchmark and show a 78.13% decrease in compression time with -T0. The change is from 1 core to 6 (physical) cores. The benefit is only apparent if compression is CPU-bound and will differ based on machine and file contents.
Default (-T1)
19#linux-5.8.tar : 983869440 -> 121381009 (8.106), 4.05 MB/s ,1771.3 MB/s
zstd -T1 -b19 -i0 --priority=rt linux-5.8.tar 244.99s user 0.46s system 99% cpu 4:05.47 total
Patched default (-T0)
19#linux-5.8.tar : 983869440 -> 121384544 (8.105), 19.2 MB/s ,1756.7 MB/s
zstd -T0 -b19 -i0 --priority=rt linux-5.8.tar 297.19s user 0.63s system 554% cpu 53.692 total
Source: https://github.com/clearlinux-pkgs/zstd/blob/1.4.5-60/multi-thread-default.patch
References: #164
So, let's make it straight: is |
I'm late to the party but I was having fun with your awesome LTO patches and various flags today, testing them with On my intel i7 7700k, Now I know this is one, very specific, maybe even a bit stupid benchmark which I used to test those flags, but it's definitely not universal to say that newer intels should use Just my 3 cents, maybe it'll help somebody, maybe it won't. I suggest running benchmarks to verify whether the flag is helping or not. Personally I dug it up due to the fact that after applying LTO flags the benchmark dropped by approx 10% compared to just |
@JustArchi Very interesting observation, I was contemplating about whether It may just be a fluke in the testing patterns of |
Is |
Clear Linux maintains a number of performance-related patches for open source projects. There are quite a few: https://github.com/clearlinux-pkgs
It would be interesting to integrate these into GentooLTO somehow, either synced into the overlay or through a user install mechanism.