Skip to content

Commit

Permalink
x86: Align jump targets to 1-byte boundaries
Browse files Browse the repository at this point in the history
The following NOP in a hot function caught my attention:

  >   5a:	66 0f 1f 44 00 00    	nopw   0x0(%rax,%rax,1)

That's a dead NOP that bloats the function a bit, added for the
default 16-byte alignment that GCC applies for jump targets.

I realize that x86 CPU manufacturers recommend 16-byte jump
target alignments (it's in the Intel optimization manual),
to help their relatively narrow decoder prefetch alignment
and uop cache constraints, but the cost of that is very
significant:

        text           data       bss         dec      filename
    12566391        1617840   1089536    15273767      vmlinux.align.16-byte
    12224951        1617840   1089536    14932327      vmlinux.align.1-byte

By using 1-byte jump target alignment (i.e. no alignment at all)
we get an almost 3% reduction in kernel size (!) - and a
probably similar reduction in I$ footprint.

Now, the usual justification for jump target alignment is the
following:

 - modern decoders tend to have 16-byte (effective) decoder
   prefetch windows. (AMD documents it higher but measurements
   suggest the effective prefetch window on curretn uarchs is
   still around 16 bytes)

 - on Intel there's also the uop-cache with cachelines that have
   16-byte granularity and limited associativity.

 - older x86 uarchs had a penalty for decoder fetches that crossed
   16-byte boundaries. These limits are mostly gone from recent
   uarchs.

So if a forward jump target is aligned to cacheline boundary then
prefetches will start from a new prefetch-cacheline and there's
higher chance for decoding in fewer steps and packing tightly.

But I think that argument is flawed for typical optimized kernel
code flows: forward jumps often go to 'cold' (uncommon) pieces
of code, and  aligning cold code to cache lines does not bring a
lot of advantages  (they are uncommon), while it causes
collateral damage:

 - their alignment 'spreads out' the cache footprint, it shifts
   followup hot code further out

 - plus it slows down even 'cold' code that immediately follows 'hot'
   code (like in the above case), which could have benefited from the
   partial cacheline that comes off the end of hot code.

But even in the cache-hot case the 16 byte alignment brings
disadvantages:

 - it spreads out the cache footprint, possibly making the code
   fall out of the L1 I$.

 - On Intel CPUs, recent microarchitectures have plenty of
   uop cache (typically doubling every 3 years) - while the
   size of the L1 cache grows much less aggressively. So
   workloads are rarely uop cache limited.

The only situation where alignment might matter are tight
loops that could fit into a single 16 byte chunk - but those
are pretty rare in the kernel: if they exist they tend
to be pointer chasing or generic memory ops, which both tend
to be cache miss (or cache allocation) intensive and are not
decoder bandwidth limited.

So the balance of arguments strongly favors packing kernel
instructions tightly versus maximizing for decoder bandwidth:
this patch changes the jump target alignment from 16 bytes
to 1 byte (tightly packed, unaligned).

Acked-by: Denys Vlasenko <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Aswin Chandramouleeswaran <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Brian Gerst <[email protected]>
Cc: Davidlohr Bueso <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Jason Low <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Paul E. McKenney <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Tim Chen <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
  • Loading branch information
Ingo Molnar committed May 15, 2015
1 parent b41e6ec commit be6cb02
Showing 1 changed file with 3 additions and 0 deletions.
3 changes: 3 additions & 0 deletions arch/x86/Makefile
Original file line number Diff line number Diff line change
Expand Up @@ -77,6 +77,9 @@ else
KBUILD_AFLAGS += -m64
KBUILD_CFLAGS += -m64

# Align jump targets to 1 byte, not the default 16 bytes:
KBUILD_CFLAGS += -falign-jumps=1

# Don't autogenerate traditional x87 instructions
KBUILD_CFLAGS += $(call cc-option,-mno-80387)
KBUILD_CFLAGS += $(call cc-option,-mno-fp-ret-in-387)
Expand Down

0 comments on commit be6cb02

Please sign in to comment.