Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Type instability (allocations) regression in Julia v1.10-alpha1 #50562

Closed
ranocha opened this issue Jul 15, 2023 · 16 comments · Fixed by #51009
Closed

Type instability (allocations) regression in Julia v1.10-alpha1 #50562

ranocha opened this issue Jul 15, 2023 · 16 comments · Fixed by #51009
Assignees
Labels
performance Must go faster regression Regression in behavior compared to a previous version

Comments

@ranocha
Copy link
Member

ranocha commented Jul 15, 2023

We have some code in Trixi.jl that checks

if :i_backward in secondary_indices

where, secondary_indices is a tuple of Symbols, see https://github.com/trixi-framework/Trixi.jl/blob/07981c3e50943a9bdb1a845f918a4e8831714338/src/solvers/dgsem_p4est/dg_2d.jl#L148

We observed a significantly increased amount of allocations when checking CI with Julia v1.10.0-alpha1 in trixi-framework/Trixi.jl#1562. Locally, I checked that the allocations are gone when modifying the code above to

if any(==(:i_backward), secondary_indices)
@ranocha ranocha added the regression Regression in behavior compared to a previous version label Jul 15, 2023
@aviatesk
Copy link
Member

Is secondary_indices compile-time constant?

@ranocha
Copy link
Member Author

ranocha commented Jul 15, 2023

No, it is only known at runtime

@aviatesk
Copy link
Member

Can you please give more details about the code? I can't confirm any allocation difference between the code snippets you shared above:

julia> global secondary_indices::Tuple{Symbol,Symbol,Symbol} = (:a, :b, :c)
(:a, :b, :c)

julia> f1() = :i_backward in secondary_indices
f1 (generic function with 1 method)

julia> f2() = any(==(:i_backward), secondary_indices)
f2 (generic function with 1 method)

julia> @benchmark f2()
[ Info: Loading BenchmarkTools...
BenchmarkTools.Trial: 10000 samples with 1000 evaluations.
 Range (min  max):  1.708 ns  11.125 ns  ┊ GC (min  max): 0.00%  0.00%
 Time  (median):     1.833 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   1.842 ns ±  0.213 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

                         █                      ▃             
  ▂▁▁▁▁▁▁▁▁▁▁▂▁▁▁▁▁▁▁▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▆▁▁▁▁▁▁▁▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▄ ▂
  1.71 ns        Histogram: frequency by time        1.92 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark f1()
BenchmarkTools.Trial: 10000 samples with 1000 evaluations.
 Range (min  max):  4.084 ns  14.917 ns  ┊ GC (min  max): 0.00%  0.00%
 Time  (median):     4.250 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   4.261 ns ±  0.324 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

                     ▂                  █         ▆           
  ▂▁▁▁▁▁▁▁▁▂▁▁▁▁▁▁▁▁▁█▁▁▁▁▁▁▁▁█▁▁▁▁▁▁▁▁▁█▁▁▁▁▁▁▁▁▁█▁▁▁▁▁▁▁▁▃ ▂
  4.08 ns        Histogram: frequency by time        4.33 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

@ranocha
Copy link
Member Author

ranocha commented Jul 16, 2023

An MWE seems to be

julia> v = Vector{NTuple{2, Symbol}}(undef, 100);

julia> for i in eachindex(v)
           values = (:i, :i_backward, :j, :j_backward)
           v[i] = (rand(values), rand(values))
       end

julia> function f1(v)
           res = 0
           for secondary_indices in v
               if :i_backward in secondary_indices
                   res += 1
               else
                   res += 2
               end
           end
           return res
       end
f1 (generic function with 1 method)

julia> function f2(v)
           res = 0
           for secondary_indices in v
               if any(==(:i_backward), secondary_indices)
                   res += 1
               else
                   res += 2
               end
           end
           return res
       end
f2 (generic function with 1 method)

julia> using BenchmarkTools

julia> f1(v)
162

julia> f2(v)
162

julia> @benchmark f1(v)
BenchmarkTools.Trial: 10000 samples with 152 evaluations.
 Range (min  max):  672.553 ns    6.856 μs  ┊ GC (min  max): 0.00%  83.65%
 Time  (median):     736.796 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   794.691 ns ± 429.613 ns  ┊ GC (mean ± σ):  5.35% ±  8.51%

  █▇▂▁▂                                                         ▂
  ████████▅▅▄▁▃▄▆█▃▃▄▁▁▃▁▁▁▁▃▁▁▁▁▁▁▁▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▅ █
  673 ns        Histogram: log(frequency) by time       4.64 μs <

 Memory estimate: 3.12 KiB, allocs estimate: 100.

julia> @benchmark f2(v)
BenchmarkTools.Trial: 10000 samples with 965 evaluations.
 Range (min  max):  81.077 ns  273.202 ns  ┊ GC (min  max): 0.00%  0.00%
 Time  (median):     81.771 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   83.106 ns ±   7.061 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▅█▄                                                        ▁ ▁
  ███▄▆▇▃▃▁▁▁▁▁▃▁▁▃▁▁▁▁▃▁▁▃▅█▇▆▄▇▇▇▆▆▄▃▁▃▃▃▃▃▁▁▁▁▁▁▁▁▃▄▃▄▇▇▅▇█ █
  81.1 ns       Histogram: log(frequency) by time       113 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

That's with Julia v1.10.0-alpha1. With the current stable release Julia v1.9.2, I get

julia> @benchmark f1(v)
BenchmarkTools.Trial: 10000 samples with 961 evaluations.
 Range (min  max):  78.634 ns  204.733 ns  ┊ GC (min  max): 0.00%  0.00%
 Time  (median):     86.562 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   87.574 ns ±   6.437 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▁          ▇█  ▁                                             ▁
  █▇▁▁▁▁▁▃▁▁▁██▇██▅▁▅▃▁▄▃▄▁▁▃▃▁▄▃▅█▇▅▇█▆▅▆▄▃▃▄▃▁▁▃▃▁▁▁▄▆▄▄▇▇▆█ █
  78.6 ns       Histogram: log(frequency) by time       118 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark f2(v)
BenchmarkTools.Trial: 10000 samples with 961 evaluations.
 Range (min  max):  80.238 ns  267.342 ns  ┊ GC (min  max): 0.00%  0.00%
 Time  (median):     86.915 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   88.485 ns ±   7.884 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

         ▂█ ▂                                  ▁               ▁
  ▇▁▃▁▁▁▁██▅█▅▄▁▁▁▃▃▁▁▃▁▁▄█▆▆▇▆▇▅▁▁▄▄▁▁▃▄▁▅▆▄▇▇█▇▆▅▄▃▄▁▁▃▄▃▄▄▄ █
  80.2 ns       Histogram: log(frequency) by time       129 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

@aviatesk
Copy link
Member

Yes, thanks. I could reproduce it with this MRE also:

julia> f1(rt) = Base.sym_in(:i_backward, rt[]);

julia> f2(rt) = @inline Base.sym_in(:i_backward, rt[]);

julia> rt = Ref((:b_back, :foofakgka, :i_backw));

julia> f1(rt); @allocated f1(rt)
32

julia> f2(rt); @allocated f2(rt)
0

So this is caused by the @noinline annotation added on Base.sym_in.

julia/base/tuple.jl

Lines 609 to 610 in d215d91

function sym_in(x::Symbol, @nospecialize itr::Tuple{Vararg{Symbol}})
@noinline

As the MRE above shows, we can eliminate the allocation if we remove it. Alternatively, we can achieve the same result if we remove the @nospecialize annotation.

The @noinline annotation was added in #48066, which I am not sure about its rationale. @Keno Could you tell me why we want to have @noinline on Base.sym_in? Do you think we can remove either of the annotations?

@aviatesk
Copy link
Member

FWIW, the sym_in method was added in #36596 by @vtjnash along with @nospecialize annotation on itr argument.

@Keno
Copy link
Member

Keno commented Jul 16, 2023

@Keno Could you tell me why we want to have @noinline on Base.sym_in? Do you think we can remove either of the annotations?

The problem is that inlining these functions destroys the effect assumptions, so they can't be constproped, even if irinterp later finds the constants. What object is being allocated here? Can we improve the codegen to avoid that happening?

@aviatesk
Copy link
Member

It appears that the allocation is triggered by the construction of a Vararg tuple, which is used for dispatching to Base.sym_in:

julia> @code_typed f1(rt)
CodeInfo(
1%1 = Base.getfield(rt, :x)::Tuple{Symbol, Symbol, Symbol}%2 = invoke Base.sym_in(:i_backward::Symbol, %1::Tuple{Vararg{Symbol}})::Bool
└──      return %2
) => Bool

julia> @code_llvm f1(rt)
;  @ REPL[1]:1 within `f1`
define i8 @julia_f1_1641({}* noundef nonnull align 8 dereferenceable(24) %"rt::RefValue") #0 {
top:
  %gcframe13 = alloca [5 x {}*], align 16
  %gcframe13.sub = getelementptr inbounds [5 x {}*], [5 x {}*]* %gcframe13, i64 0, i64 0
  %0 = bitcast [5 x {}*]* %gcframe13 to i8*
  call void @llvm.memset.p0i8.i64(i8* align 16 %0, i8 0, i64 40, i1 true)
  %1 = call {}*** inttoptr (i64 6855994896 to {}*** (i64)*)(i64 261) #9
; ┌ @ refvalue.jl:59 within `getindex`
; │┌ @ Base.jl:37 within `getproperty`
    %2 = bitcast [5 x {}*]* %gcframe13 to i64*
    store i64 12, i64* %2, align 16
    %3 = load {}**, {}*** %1, align 8
    %4 = getelementptr inbounds [5 x {}*], [5 x {}*]* %gcframe13, i64 0, i64 1
    %5 = bitcast {}** %4 to {}***
    store {}** %3, {}*** %5, align 8
    %6 = bitcast {}*** %1 to {}***
    store {}** %gcframe13.sub, {}*** %6, align 8
    %7 = bitcast {}* %"rt::RefValue" to {}**
    %.unpack = load {}*, {}** %7, align 8
    %.not = icmp eq {}* %.unpack, null
    br i1 %.not, label %fail, label %pass

fail:                                             ; preds = %top
    call void @ijl_throw({}* inttoptr (i64 5353759904 to {}*))
    unreachable

pass:                                             ; preds = %top
    %8 = bitcast {}* %"rt::RefValue" to [3 x {}*]*
    %.elt2 = getelementptr inbounds [3 x {}*], [3 x {}*]* %8, i64 0, i64 1
    %.unpack3 = load {}*, {}** %.elt2, align 8
    %.elt4 = getelementptr inbounds [3 x {}*], [3 x {}*]* %8, i64 0, i64 2
    %.unpack5 = load {}*, {}** %.elt4, align 8
    %9 = getelementptr inbounds [5 x {}*], [5 x {}*]* %gcframe13, i64 0, i64 4
    store {}* %.unpack, {}** %9, align 16
    %10 = getelementptr inbounds [5 x {}*], [5 x {}*]* %gcframe13, i64 0, i64 3
    store {}* %.unpack5, {}** %10, align 8
    %11 = getelementptr inbounds [5 x {}*], [5 x {}*]* %gcframe13, i64 0, i64 2
    store {}* %.unpack3, {}** %11, align 16
; └└
  %ptls_field14 = getelementptr inbounds {}**, {}*** %1, i64 2
  %12 = bitcast {}*** %ptls_field14 to i8**
  %ptls_load1516 = load i8*, i8** %12, align 8
  %box = call noalias nonnull dereferenceable(32) {}* @ijl_gc_pool_alloc(i8* %ptls_load1516, i32 1184, i32 32) #7
  %13 = bitcast {}* %box to i64*
  %14 = getelementptr inbounds i64, i64* %13, i64 -1
  store atomic i64 5280480960, i64* %14 unordered, align 8
  %15 = bitcast {}* %box to [3 x {}*]*
  %.repack = bitcast {}* %box to {}**
  store {}* %.unpack, {}** %.repack, align 8
  %.repack8 = getelementptr inbounds [3 x {}*], [3 x {}*]* %15, i64 0, i64 1
  store {}* %.unpack3, {}** %.repack8, align 8
  %.repack10 = getelementptr inbounds [3 x {}*], [3 x {}*]* %15, i64 0, i64 2
  store {}* %.unpack5, {}** %.repack10, align 8
  store {}* %box, {}** %11, align 16
  %16 = call i8 @j_sym_in_1643({}* inttoptr (i64 4336660832 to {}*), {}* nonnull readonly %box)
  %17 = load {}*, {}** %4, align 8
  %18 = bitcast {}*** %1 to {}**
  store {}* %17, {}** %18, align 8
  ret i8 %16
}

In particular, this seems to be the source of allocation %box = call noalias nonnull dereferenceable(32) {}* @ijl_gc_pool_alloc(i8* %ptls_load1516, i32 1184, i32 32) #7.

Alternatively we can remove the @nospecialize annotation, although it forces us to compile Base.sym_in multiple times for each tuples with different length.

@Keno
Copy link
Member

Keno commented Jul 17, 2023

We could potentially consider adjusting the specsig codegen ABI for Tuple{Vararg{T}} to separate pass the pointer and the length to avoid having to do this allocation.

@vtjnash
Copy link
Member

vtjnash commented Jul 17, 2023

That seems like what the ABI was supposed to be for this. Why is it getting a specsig for the Vararg, when non-specsig should have "done the right thing" instead.

@Keno
Copy link
Member

Keno commented Jul 17, 2023

We have special handling for a vararg signature, but not for tuples of unknown size.

@ranocha
Copy link
Member Author

ranocha commented Aug 18, 2023

Are there any plans to fix this in Julia v1.10.0 or do we need to introduce workarounds in our packages?

@oscardssmith oscardssmith added this to the 1.10 milestone Aug 18, 2023
@oscardssmith
Copy link
Member

A 10x regression seems like it should be on the milestone

@ranocha
Copy link
Member Author

ranocha commented Aug 18, 2023

Thanks for adding it to the milestone!

@JeffBezanson JeffBezanson added the performance Must go faster label Aug 18, 2023
@JeffBezanson
Copy link
Member

For 1.10 we can remove the noinline, but the correct fix is to add ABI support for passing a tuple like this without boxing (or boxing on the stack).

@JeffBezanson JeffBezanson self-assigned this Aug 22, 2023
@JeffBezanson JeffBezanson removed this from the 1.10 milestone Aug 29, 2023
KristofferC added a commit that referenced this issue Oct 2, 2023
Backported PRs:
- [x] #48625 <!-- add replace(io, str, patterns...) -->
- [x] #48387 <!-- Improve documentation of sort-related functions -->
- [x] #48363 <!-- Revise sort.md and docstrings in sort.jl -->
- [x] #48977 <!-- Update SparseArrays.jl stdlib for SuiteSparse 7 -->
- [x] #50719 <!-- fix `CyclePadding(::DataType)` -->
- [x] #50694 <!-- inference: permit recursive type traits -->
- [x] #50860 <!-- Add `Base.get_extension` to docs/API -->
- [x] #50594 <!-- Disallow non-index Integer types in isassigned -->
- [x] #50802 <!-- Makes IntrusiveLinkedListSynchronized mutable to avoid
allocation on poptask -->
- [x] #50858 <!-- Add a `threadpool` parameter to `Channel` constructor
-->
- [x] #50874 <!-- Restrict COFF to a single thread when symbol count is
high -->
- [x] #50822 <!-- Add default method for setmodifiers! -->
- [x] #50730 <!-- Fix integer overflow in `isapprox` -->
- [x] #50850 <!-- Remove weird Rational dispatch and add pi functions to
list -->
- [x] #50809 <!-- Limit type-printing in MethodError -->
- [x] #50915 <!-- Add note the `Task` about sticky bit -->
- [x] #50929 <!-- when widening tuple types in tmerge, only widen the
complex parts -->
- [x] #50928 <!-- Bump JuliaSyntax to 0.4.6 -->
- [x] #50959 <!-- Update libssh2 patches -->
- [x] #50823 <!-- Make ranges more robust with unsigned indexes. -->
- [x] #48542 <!-- Add docs on task-specific buffering using
multithreading -->
- [x] #50912 <!-- Separate foreign threads into a :foreign threadpool
-->
- [x] #51010 <!-- Add ORIGIN to SuiteSparse rpath on Linux/FreeBSD -->
- [x] #50753 <!-- cat: remove unused promote_eltype methods that confuse
inference -->
- [x] #51027 <!-- Implement realloc accounting correctly -->
- [x] #51019 <!-- fix a case of potentially use of undefined variable
when handling error in distributed message processing -->
- [x] #51039 <!-- Revert "Optimize findall(f, ::AbstractArray{Bool})
(#42202)" -->
- [x] #51036 <!-- add missing invoke edge for nospecialize targets -->
- [x] #51042 <!-- inference: fix return_type_tfunc modeling of concrete
functions -->
- [x] #51114 <!-- Workaround upstream FreeBSD issue #272992 -->
- [x] #50892 <!-- Add `JL_DLLIMPORT` to `small_typeof` declaration -->
- [x] #51154 <!-- broadcast: use recursion rather than ntuple to map
over a tuple -->
- [x] #51153 <!-- fix debug typo in "add missing invoke edge for
nospecialize targets (#51036)" -->
- [x] #51222 <!-- Check again if the tty is open inside the IO lock -->
- [x] #51236 <!-- Add lock around uv_unref during init -->
- [x] #51243 <!-- GMP: Gracefully handle more overflows. -->
- [x] #51254 <!-- Ryu: make sure adding zeros does not overwrite
trailing dot -->
- [x] #51175 <!-- shorten stale_age for cachefile lock -->
- [x] #51300 <!-- fix method definition error for bad vararg -->
- [x] #51307 <!-- fix force-throw ctrl-C on Windows -->
- [x] #51303 <!-- ensure revising structs is safe -->
- [x] #51393 
- [x] #51403 

Need manual backport:
- [x] #51009 <!-- fix #50562, regression in `in` of tuple of Symbols -->
- [x] #51053 <!-- Bump Statistics stdlib -->
- [x] #51013 <!-- fix #50709, type bound error due to free typevar in
sparam env -->
- [x] #51305 <!-- reduce test time for rounding and floatfuncs -->

Contains multiple commits, manual intervention needed:
- [ ] #50663 <!-- Fix Expr(:loopinfo) codegen -->
- [ ] #51035 <!-- refactor GC scanning code to reflect jl_binding_t are
now first class -->
- [ ] #51092 <!-- inference: fix bad effects for recursion -->
- [x] #51247 <!-- Check if malloc has succeeded before incrementing gc
stats -->
- [x] #51294 <!-- use LibGit2_jll for LibGit2 library -->

Non-merged PRs with backport label:
- [ ] #51132 <!-- Handle `AbstractQ` in concatenation -->
- [x] #51029 <!-- testdefs: make sure that if a test set changes the
active project, they change it back when they're done -->
- [ ] #50919 <!-- Code loading: do the "skipping mtime check for stdlib"
check regardless of the value of `ispath(f)` -->
- [ ] #50824 <!-- Add some aliasing warnings to docstrings for mutating
functions -->
- [x] #50385 <!-- Precompile pidlocks: add to NEWS and docs -->
- [ ] #49805 <!-- Limit TimeType subtraction to AbstractDateTime -->
vtjnash pushed a commit that referenced this issue Oct 4, 2023
This seems to be the right combination of annotations to fix both #50562
and an inference regression in PropertyDicts.jl on the 1.10 release
branch. When backported the `@noinline` should be restored for 1.10.
KristofferC pushed a commit that referenced this issue Oct 10, 2023
This seems to be the right combination of annotations to fix both #50562
and an inference regression in PropertyDicts.jl on the 1.10 release
branch. When backported the `@noinline` should be restored for 1.10.

(cherry picked from commit f7e8f92)
@aviatesk
Copy link
Member

This issue was fixed. Closing.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
performance Must go faster regression Regression in behavior compared to a previous version
Projects
None yet
6 participants