
Uses of GlobalAlloc::realloc yield useless code in lto+opt-level=2 builds #56938

Open

glandium opened this issue Dec 18, 2018 · 12 comments

Labels: A-allocators (Area: Custom and system allocators) · C-bug (Category: This is a bug.) · I-heavy (Issue: Problems and improvements with respect to binary size of generated code.) · I-slow (Issue: Problems and improvements with respect to performance of generated code.) · T-compiler (Relevant to the compiler team, which will review and decide on the PR/issue.)

@glandium
Contributor

glandium commented Dec 18, 2018

I noticed this in Firefox (where we do build with -C lto -C opt-level=2). Essentially any use of an API that ends up calling __rust_realloc generates useless code. A small testcase is:

#[no_mangle]
pub extern "C" fn qux(v: &mut Vec<usize>) {
    v.push(42);
}

which, with the following Cargo.toml config:

[profile.release]
lto = true
opt-level = 2
panic = "abort"
codegen-units = 1
rpath = false

[lib]
crate-type = ["cdylib"]

generates the following assembly:

0000000000003200 <qux>:
    3200:       55                      push   %rbp
    3201:       41 57                   push   %r15
    3203:       41 56                   push   %r14
    3205:       41 55                   push   %r13
    3207:       41 54                   push   %r12
    3209:       53                      push   %rbx
    320a:       50                      push   %rax
    320b:       49 89 fd                mov    %rdi,%r13
    320e:       48 8b 4f 10             mov    0x10(%rdi),%rcx
    3212:       48 3b 4f 08             cmp    0x8(%rdi),%rcx
    3216:       75 55                   jne    326d <qux+0x6d>
    3218:       49 89 ce                mov    %rcx,%r14
    321b:       49 83 c6 01             add    $0x1,%r14
    321f:       0f 82 e2 00 00 00       jb     3307 <qux+0x107>
    3225:       48 8d 04 09             lea    (%rcx,%rcx,1),%rax
    3229:       49 39 c6                cmp    %rax,%r14
    322c:       4c 0f 42 f0             cmovb  %rax,%r14
    3230:       ba 08 00 00 00          mov    $0x8,%edx
    3235:       45 31 ff                xor    %r15d,%r15d
    3238:       4c 89 f0                mov    %r14,%rax
    323b:       48 f7 e2                mul    %rdx
    323e:       49 89 c4                mov    %rax,%r12
    3241:       0f 91 c0                setno  %al
    3244:       0f 80 bd 00 00 00       jo     3307 <qux+0x107>
    324a:       41 88 c7                mov    %al,%r15b
    324d:       49 c1 e7 03             shl    $0x3,%r15
    3251:       48 85 c9                test   %rcx,%rcx
    3254:       74 1d                   je     3273 <qux+0x73>
    3256:       49 8b 6d 00             mov    0x0(%r13),%rbp
    325a:       4d 85 e4                test   %r12,%r12
    325d:       74 3f                   je     329e <qux+0x9e>
    325f:       48 89 ef                mov    %rbp,%rdi
    3262:       4c 89 e6                mov    %r12,%rsi
    3265:       ff 15 35 2d 02 00       callq  *0x22d35(%rip)        # 25fa0 <realloc@GLIBC_2.2.5>
    326b:       eb 6a                   jmp    32d7 <qux+0xd7>
    326d:       49 8b 5d 00             mov    0x0(%r13),%rbx
    3271:       eb 78                   jmp    32eb <qux+0xeb>
    3273:       4d 39 e7                cmp    %r12,%r15
    3276:       76 56                   jbe    32ce <qux+0xce>
    3278:       48 c7 04 24 00 00 00    movq   $0x0,(%rsp)
    327f:       00 
    3280:       48 89 e7                mov    %rsp,%rdi
    3283:       4c 89 fe                mov    %r15,%rsi
    3286:       4c 89 e2                mov    %r12,%rdx
    3289:       ff 15 39 2d 02 00       callq  *0x22d39(%rip)        # 25fc8 <posix_memalign@GLIBC_2.2.5>
    328f:       85 c0                   test   %eax,%eax
    3291:       75 7b                   jne    330e <qux+0x10e>
    3293:       48 8b 1c 24             mov    (%rsp),%rbx
    3297:       48 85 db                test   %rbx,%rbx
    329a:       75 43                   jne    32df <qux+0xdf>
    329c:       eb 70                   jmp    330e <qux+0x10e>
    329e:       48 c7 04 24 00 00 00    movq   $0x0,(%rsp)
    32a5:       00 
    32a6:       48 89 e7                mov    %rsp,%rdi
    32a9:       be 08 00 00 00          mov    $0x8,%esi
    32ae:       31 d2                   xor    %edx,%edx
    32b0:       ff 15 12 2d 02 00       callq  *0x22d12(%rip)        # 25fc8 <posix_memalign@GLIBC_2.2.5>
    32b6:       85 c0                   test   %eax,%eax
    32b8:       75 54                   jne    330e <qux+0x10e>
    32ba:       48 8b 1c 24             mov    (%rsp),%rbx
    32be:       48 85 db                test   %rbx,%rbx
    32c1:       74 4b                   je     330e <qux+0x10e>
    32c3:       48 89 ef                mov    %rbp,%rdi
    32c6:       ff 15 d4 2b 02 00       callq  *0x22bd4(%rip)        # 25ea0 <free@GLIBC_2.2.5>
    32cc:       eb 11                   jmp    32df <qux+0xdf>
    32ce:       4c 89 e7                mov    %r12,%rdi
    32d1:       ff 15 a9 2c 02 00       callq  *0x22ca9(%rip)        # 25f80 <malloc@GLIBC_2.2.5>
    32d7:       48 89 c3                mov    %rax,%rbx
    32da:       48 85 db                test   %rbx,%rbx
    32dd:       74 2f                   je     330e <qux+0x10e>
    32df:       49 89 5d 00             mov    %rbx,0x0(%r13)
    32e3:       4d 89 75 08             mov    %r14,0x8(%r13)
    32e7:       49 8b 4d 10             mov    0x10(%r13),%rcx
    32eb:       48 c7 04 cb 2a 00 00    movq   $0x2a,(%rbx,%rcx,8)
    32f2:       00 
    32f3:       49 83 45 10 01          addq   $0x1,0x10(%r13)
    32f8:       48 83 c4 08             add    $0x8,%rsp
    32fc:       5b                      pop    %rbx
    32fd:       41 5c                   pop    %r12
    32ff:       41 5d                   pop    %r13
    3301:       41 5e                   pop    %r14
    3303:       41 5f                   pop    %r15
    3305:       5d                      pop    %rbp
    3306:       c3                      retq   
    3307:       e8 84 9f 00 00          callq  d290 <_ZN5alloc7raw_vec17capacity_overflow17h0432b49d9a8429deE>
    330c:       0f 0b                   ud2    
    330e:       4c 89 e7                mov    %r12,%rdi
    3311:       4c 89 fe                mov    %r15,%rsi
    3314:       e8 87 9f 00 00          callq  d2a0 <_ZN5alloc5alloc18handle_alloc_error17h612a63a56f8c2099E>
    3319:       0f 0b                   ud2    
    331b:       0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)

That's a lot of code. Most notably, there is a call to realloc, two calls to posix_memalign, one to free, and one to malloc.

Now, to understand what's going on, let's write our own copy of the GlobalAlloc implementation:

use std::alloc::{GlobalAlloc, Layout};
use std::cmp;
use std::ptr;
use libc;

#[no_mangle]
pub extern "C" fn qux(v: &mut Vec<usize>) {
    v.push(42);
}

struct System;

const MIN_ALIGN: usize = 16; 

unsafe impl GlobalAlloc for System {
    #[inline]
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        if layout.align() <= MIN_ALIGN && layout.align() <= layout.size() {
            libc::malloc(layout.size()) as *mut u8
        } else {
            #[cfg(target_os = "macos")]
            {
                if layout.align() > (1 << 31) {
                    return ptr::null_mut()
                }
            }
            aligned_malloc(&layout)
        }
    }   

    #[inline]
    unsafe fn alloc_zeroed(&self, layout: Layout) -> *mut u8 {
        if layout.align() <= MIN_ALIGN && layout.align() <= layout.size() {
            libc::calloc(layout.size(), 1) as *mut u8
        } else {
            let ptr = self.alloc(layout.clone());
            if !ptr.is_null() {
                ptr::write_bytes(ptr, 0, layout.size());
            }
            ptr
        }
    }   

    #[inline]
    unsafe fn dealloc(&self, ptr: *mut u8, _layout: Layout) {
        libc::free(ptr as *mut libc::c_void)
    }   

    #[inline]
    unsafe fn realloc(&self, ptr: *mut u8, layout: Layout, new_size: usize) -> *mut u8 {
        if layout.align() <= MIN_ALIGN && layout.align() <= new_size {
            libc::realloc(ptr as *mut libc::c_void, new_size) as *mut u8
        } else {
            realloc_fallback(self, ptr, layout, new_size)
        }
    }   
}

unsafe fn aligned_malloc(layout: &Layout) -> *mut u8 {
    let mut out = ptr::null_mut();
    let ret = libc::posix_memalign(&mut out, layout.align(), layout.size());
    if ret != 0 { 
        ptr::null_mut()
    } else {
        out as *mut u8
    }   
}

unsafe fn realloc_fallback(
    alloc: &System,
    ptr: *mut u8, 
    old_layout: Layout,
    new_size: usize,
) -> *mut u8 {
    // Docs for GlobalAlloc::realloc require this to be valid:
    let new_layout = Layout::from_size_align_unchecked(new_size, old_layout.align());

    let new_ptr = GlobalAlloc::alloc(alloc, new_layout);
    if !new_ptr.is_null() {
        let size = cmp::min(old_layout.size(), new_size);
        ptr::copy_nonoverlapping(ptr, new_ptr, size);
        GlobalAlloc::dealloc(alloc, ptr, old_layout);
    }   
    new_ptr
}

#[global_allocator]
static HEAP: System = System;

(this compiles to the same assembly)

What's going on here is that layout.align() is, in our inlined case, a constant: the alignment of usize, which is 8 on x86_64, so layout.align() <= MIN_ALIGN is always true. However, because the new_size given to realloc is not NonZero (even though the GlobalAlloc documentation specifies the value can never be 0), LLVM doesn't know that the condition layout.align() <= new_size is also always true, so it ends up generating code for both the libc::realloc and realloc_fallback branches.

Changing the condition in realloc so that new_size == 0 also goes through libc::realloc gets rid of the call to free and one call to posix_memalign. Sadly, doing the same with layout.size() == 0 in alloc to call libc::malloc doesn't get rid of the other posix_memalign call. I guess the compiler still thinks that the new size computed in RawVec::reserve_internal can be somewhere between layout.align() and layout.size(), which is actually not possible.

Assuming we can somehow get the compiler to get rid of that useless posix_memalign, that still leaves us with code that essentially does new_ptr = if ptr { libc::realloc(ptr, new_size) } else { libc::malloc(new_size) }, which, considering the API guarantees of libc::realloc, is one branch and one function call too many: new_ptr = libc::realloc(ptr, new_size) would work just as well, since libc::realloc(ptr::null_mut(), new_size) is equivalent to libc::malloc(new_size). Sadly, there's nothing in the Alloc API that would allow folding this.
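For reference, the libc guarantee in question is easy to exercise directly. The following is a minimal sketch (not part of the issue's code) that declares realloc and free by hand, the same way the godbolt example later in this thread does, and relies on the C standard's rule that realloc(NULL, size) behaves like malloc(size):

```rust
use std::ptr;

extern "C" {
    // The C standard guarantees realloc(NULL, size) acts as malloc(size).
    fn realloc(p: *mut std::ffi::c_void, size: usize) -> *mut std::ffi::c_void;
    fn free(p: *mut std::ffi::c_void);
}

fn main() {
    unsafe {
        // A null old pointer makes realloc allocate fresh memory,
        // so no separate malloc branch is needed.
        let p = realloc(ptr::null_mut(), 8) as *mut u64;
        assert!(!p.is_null());
        p.write(42);
        assert_eq!(p.read(), 42);
        free(p as *mut std::ffi::c_void);
    }
}
```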

Cc: @SimonSapin

@glandium
Contributor Author

A condition in alloc that gets rid of the extra posix_memalign is layout.size() % layout.align() == 0, but that would be really bad in all the other cases where the function is not inlined (which is, in fact, most cases, since inlining doesn't happen unless LTO is enabled and the optimization level is above 1), because it would mean a division by a non-constant value, and that's very costly.
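Spelled out, the alternative condition looks like the sketch below; use_system_malloc is a hypothetical helper written for illustration, not code from this issue:

```rust
use std::alloc::Layout;

const MIN_ALIGN: usize = 16;

// Hypothetical helper: decide whether plain malloc alignment suffices.
// When layout.align() is not a compile-time constant, the modulo is a
// division by a runtime value, which is the cost being discussed above.
fn use_system_malloc(layout: Layout) -> bool {
    layout.align() <= MIN_ALIGN && layout.size() % layout.align() == 0
}

fn main() {
    // A Vec<usize> request: the size is a multiple of the alignment.
    assert!(use_system_malloc(Layout::from_size_align(24, 8).unwrap()));
    // An over-aligned request must still take the posix_memalign path.
    assert!(!use_system_malloc(Layout::from_size_align(8, 32).unwrap()));
}
```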

@glandium
Contributor Author

BTW, it's worth noting that in the original assembly, one of the calls to posix_memalign is even for a fixed size of zero(!):

    32a6:       48 89 e7                mov    %rsp,%rdi
    32a9:       be 08 00 00 00          mov    $0x8,%esi
    32ae:       31 d2                   xor    %edx,%edx
    32b0:       ff 15 12 2d 02 00       callq  *0x22d12(%rip)        # 25fc8 <posix_memalign@GLIBC_2.2.5>

The signature of posix_memalign being int posix_memalign(void **memptr, size_t alignment, size_t size) and the argument/register mapping being memptr -> %rdi, alignment -> %rsi, size -> %rdx, we have memptr on the stack, alignment is 8 and size... 0.
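That call is well-defined but pointless: POSIX allows posix_memalign with a size of 0 to store either NULL or a unique, freeable pointer. A sketch exercising exactly those argument values:

```rust
extern "C" {
    // int posix_memalign(void **memptr, size_t alignment, size_t size);
    fn posix_memalign(
        memptr: *mut *mut std::ffi::c_void,
        alignment: usize,
        size: usize,
    ) -> i32;
    fn free(p: *mut std::ffi::c_void);
}

fn main() {
    let mut out: *mut std::ffi::c_void = std::ptr::null_mut();
    // alignment = 8 (a power of two and a multiple of sizeof(void*)),
    // size = 0: the exact call seen in the disassembly above.
    let ret = unsafe { posix_memalign(&mut out, 8, 0) };
    assert_eq!(ret, 0); // succeeds, but the allocation is useless
    unsafe { free(out) }; // free(NULL) is a no-op, so this is always safe
}
```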

@glandium
Contributor Author

O_o this doesn't make sense. I replaced libc::malloc with libc::realloc in alloc, which is not great, but would theoretically address the last item, but somehow the resulting code is still calling libc's malloc (and debug info attributes it to the line calling libc::realloc).

@glandium
Contributor Author

One thing that addresses the posix_memalign and realloc_fallback issues is to remove the layout.align() <= layout.size() checks entirely. That works out fine because in practice, a type with alignment x has at least size x:

#[repr(align(32))]
struct Foo(u8);

fn main() {
    println!("{}", std::mem::size_of::<Foo>());
}

prints 32, not 1.

That would need more careful consideration, though.

@glandium
Contributor Author

O_o this doesn't make sense. I replaced libc::malloc with libc::realloc in alloc, which is not great, but would theoretically address the last item, but somehow the resulting code is still calling libc's malloc (and debug info attributes it to the line calling libc::realloc).

About that...

extern "C" {
    pub fn realloc(p: *mut std::ffi::c_void, size: usize) -> *mut std::ffi::c_void;
}

pub fn foo(s: usize) -> *mut u8 {
    unsafe {
        realloc(std::ptr::null_mut(), s) as *mut u8
    }
}

compiles to

example::foo:
        jmp     qword ptr [rip + malloc@GOTPCREL]

https://rust.godbolt.org/z/5kAI8s

@glandium
Contributor Author

FWIW, removing the layout.align() <= layout.size() checks in Firefox saves 151KB(!) of code size... but the code size also grows by 364KB... because more functions now get inlined in their callers (so overall, it's a loss of 213KB).

@SimonSapin
Contributor

I think we need to separate how GlobalAlloc is used by Vec from how it’s implemented on std::alloc::System (in this case on Unix platforms).

code doing new_ptr = if ptr { libc::realloc(ptr, new_size) } else { libc::malloc(new_size) }, which, considering the API guarantees of libc::realloc, is a branch and a function call too much

I think this particular branch is in Vec (though it’s based on capacity == 0 rather than ptr.is_null(), so that Vec can contain a ptr::NonNull rather than *mut):

let res = match self.current_layout() {
    Some(layout) => {
        debug_assert!(new_layout.align() == layout.align());
        self.a.realloc(NonNull::from(self.ptr).cast(), layout, new_layout.size())
    }
    None => self.a.alloc(new_layout),
};

If we made Alloc::realloc behave as Alloc::alloc if passed a null pointer we could merge these two calls in raw_vec.rs into one, but we’d still need the branch on capacity == 0. (And I’m not sure how we’d handle this in GlobalAlloc and/or in impl Alloc for Global.)
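A sketch of that idea on top of std::alloc (a hypothetical wrapper written for illustration, not the actual raw_vec change being proposed):

```rust
use std::alloc::{alloc, realloc, Layout};

// Hypothetical: a realloc that degrades to alloc when there is no old
// allocation, letting the caller issue a single call instead of branching
// between alloc and realloc itself.
unsafe fn realloc_or_alloc(ptr: *mut u8, old: Option<Layout>, new: Layout) -> *mut u8 {
    match old {
        Some(old_layout) => realloc(ptr, old_layout, new.size()),
        None => alloc(new),
    }
}

fn main() {
    unsafe {
        let layout = Layout::array::<usize>(4).unwrap();
        // No previous allocation: takes the alloc path.
        let p = realloc_or_alloc(std::ptr::null_mut(), None, layout);
        assert!(!p.is_null());
        // Grow the allocation: takes the realloc path.
        let bigger = Layout::array::<usize>(8).unwrap();
        let p2 = realloc_or_alloc(p, Some(layout), bigger);
        assert!(!p2.is_null());
        std::alloc::dealloc(p2, bigger);
    }
}
```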


In <System as GlobalAlloc>::realloc we have another branch:

if layout.align() <= MIN_ALIGN && layout.align() <= new_size {
    libc::realloc(ptr as *mut libc::c_void, new_size) as *mut u8
} else {
    realloc_fallback(self, ptr, layout, new_size)
}

I think this one is necessary because:

  • layout.align() could be very large, but libc::realloc doesn’t take an alignment parameter so to use it we need to compare against MIN_ALIGN (which is our best guess for _Alignof (max_align_t) for the target platform)

  • libc APIs that do take an alignment parameter (memalign, posix_memalign) don’t take an old pointer to copy data from, so when using them we need to copy and free manually.

In #45955 we found out that in practice, implementations of malloc/realloc don’t always return a pointer aligned to _Alignof (max_align_t): if the requested size is smaller than that, they assume that the alignment does not need to be larger than the size. I added layout.align() <= new_size in 21d8992 as an attempt to fix this.

However, because the new_size given to realloc is not NonZero (despite the GlobalAlloc documentation specifying the value can never be 0), LLVM doesn't know that the condition layout.align() <= new_size is also always true, so it ends up creating code for both libc::realloc and realloc_fallback branches.

new_size > 0 does not imply layout.align() <= new_size, so I assume you mean something like this in the Vec case:

  1. new_size is new_capacity * size_of::<T>()
  2. layout.align() is align_of::<T>() with the same T
  3. size_of::<T>() is a multiple of align_of::<T>() for all Rust types T: Sized
  4. Therefore, new_size is a multiple of layout.align()
  5. A non-negative (unsigned) multiple can only be smaller than its divisor if it is zero
  6. NonZero excludes this case

I don’t know if the optimizer is aware of point 6 (as I understand it, NonZero is mostly special in terms of memory layout) or point 3, or if it is sophisticated enough for the algorithmic reasoning in point 5.
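Point 3, at least, is a language guarantee (the size of a Sized type is always a multiple of its alignment, so array elements can be laid out contiguously), which is easy to spot-check with a sketch like this:

```rust
use std::mem::{align_of, size_of};

// size_of::<T>() is always a multiple of align_of::<T>() for T: Sized,
// which is what makes step 4 in the list above hold for Vec<T>.
fn check<T>() {
    assert_eq!(size_of::<T>() % align_of::<T>(), 0);
}

#[repr(align(32))]
struct OverAligned(u8); // padded up to size 32, not 1

fn main() {
    check::<u8>();
    check::<usize>();
    check::<(u8, u64)>();
    check::<OverAligned>();
    assert_eq!(size_of::<OverAligned>(), 32);
}
```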

A condition in alloc that gets rid of the extra posix_memalign is layout.size() % layout.align() == 0

Does that fix #45955 ?

saves 151KB(!) of code size... but the code size also grows by 364KB... because more functions now get inlined in their callers

Would adding #[inline(never)] somewhere be a good way to reduce code size? Where?

@SimonSapin
Contributor

to remove the layout.align() <= layout.size() checks entirely. That works out fine because in practice, a type with alignment x has at least size x

But Layout has a from_size_align(usize, usize) constructor, not just new::<T>().

@SimonSapin
Contributor

I suppose a larger question is: is there any use case for an allocation request more aligned than its size? If not, should we forbid it? If so, how do we deal with existing stable APIs that do not enforce or document this constraint? (Introducing new APIs and deprecating existing ones is always an option, but should be a last resort.)

@nagisa
Member

nagisa commented Dec 18, 2018

I suppose a larger question is: is there any use case for an allocation request more aligned than its size?

Yes. A likely and common scenario here would be a request for a page-aligned allocation that is not necessarily spanning the whole page.

@SimonSapin
Contributor

Can you say more about how that would be used? Can allocator implementations typically do that and still use the rest of the page for other allocations?

@nagisa
Member

nagisa commented Dec 18, 2018

Can you say more about how that would be used?

It is common for low-level functions to have requirements for page-size alignment. Something that comes to mind is a stack (platforms usually have their own respective alignment requirements, but page-aligned allocations are safe to use on all known-and-relevant targets).

It is heckin' convenient to be able to write Layout::from_size_align(x, PAGE_SZ) and not have to think about x being less than PAGE_SZ.

Can allocator implementations typically do that and still use the rest of the page for other allocations?

Not sure if your usual implementation of a bin/buddy allocator (which is what jemalloc is) will bother handling this case, but I’m fairly sure that allocators based on e.g. a free list will gladly use the rest of the page for smaller allocations.
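The scenario described here is expressible with today's API. A sketch, assuming a 4096-byte page size (PAGE_SZ is a constant made up for this example):

```rust
use std::alloc::{alloc, dealloc, Layout};

const PAGE_SZ: usize = 4096; // assumed page size for this sketch

fn main() {
    // A page-aligned request whose size is smaller than its alignment:
    // exactly the "more aligned than its size" case under discussion.
    let layout = Layout::from_size_align(64, PAGE_SZ).unwrap();
    unsafe {
        let p = alloc(layout);
        assert!(!p.is_null());
        assert_eq!(p as usize % PAGE_SZ, 0); // System honors the alignment
        dealloc(p, layout);
    }
}
```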

@Centril Centril added I-slow Issue: Problems and improvements with respect to performance of generated code. I-heavy Issue: Problems and improvements with respect to binary size of generated code. labels Dec 20, 2018
@jonas-schievink jonas-schievink added A-allocators Area: Custom and system allocators T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. labels Apr 14, 2019
@Enselic Enselic added the C-bug Category: This is a bug. label Nov 19, 2023