
Revert optimisations to findall(f, ::AbstractArray{Bool}) #49831

Draft · wants to merge 4 commits into base: master
Conversation

jakobnissen
Contributor

@jakobnissen jakobnissen commented May 16, 2023

In commit 4c4c94f, findall(f, A::AbstractArray{Bool}) was optimised using a technique where A was traversed twice: once to count the number of true elements, and once to fill in the resulting vector.
However, this could cause problems for arbitrary functions f: for slow f, the approach is ~2x slower, and for impure f, calling f twice per element could trigger side effects and strange issues (see issue #46425).

With this commit, the optimised version is dispatched to only when f is ! or identity.

Note that this re-introduces the performance regression from #42187 by reverting some of the optimisation in #42202.

Closes issue #46425
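
For context, the reverted two-pass technique can be sketched roughly as follows (a simplified illustration, not the exact Base code; the name findall_twopass is invented here):

```julia
# Simplified sketch of the two-pass optimisation being reverted.
# The first traversal counts matches so the output can be allocated
# at exactly the right size; the second traversal fills it in.
# Note that f is called twice per element, which is exactly the
# problem for impure f described above.
function findall_twopass(f, A::AbstractArray{Bool})
    n = count(f, A)                       # first pass: call f on every element
    out = Vector{eltype(keys(A))}(undef, n)
    i = 1
    for (idx, x) in pairs(A)
        if f(x)                           # second pass: f called again
            out[i] = idx
            i += 1
        end
    end
    return out
end
```

For example, findall_twopass(!, [true, false, false]) returns [2, 3].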

@adienes
Contributor

adienes commented May 16, 2023

Let A have length n and let the number of true elements be t. Am I correct that the comparison here is:

  • call f on 2n elements, create one vector of size t
  • call f on n elements, create one vector of size 0 and resize as needed

Which approach is faster is, I think, highly dependent on all of n, t, and the performance of f, so trying to guess which is best based only on the function type sounds a little too magical. Even with !, couldn't this be slower if n >> t?

And it leaves further optimization on the table when t is known exactly in advance. This suggests to me it might be a good idea to have, when t is known in advance:

findall!(idxs, f, A)

where length(idxs) == t and the output of findall(f, A) is written into idxs. Then findall would always choose the resizing approach, perhaps with a keyword like findall(; count_first=:auto) if the user wants the "magic" 2n approach above, which would simply allocate a vector and dispatch to findall!.
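
A minimal sketch of the findall! proposed above (a hypothetical API, not currently in Base):

```julia
# Hypothetical findall! as proposed above: idxs must already have
# length equal to the number of matches t; the matching indices are
# written into it in place and idxs is returned.
function findall!(idxs::AbstractVector, f, A::AbstractArray{Bool})
    i = firstindex(idxs)
    for (idx, x) in pairs(A)
        if f(x)
            idxs[i] = idx
            i += 1
        end
    end
    return idxs
end
```

With t known in advance, findall(f, A) on a vector would then reduce to findall!(Vector{Int}(undef, t), f, A).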

Also, the case of an impure predicate is probably a good candidate for a unit test.

base/array.jl Outdated
end

findall(f::Function, A::AbstractArray{Bool}) = _findall(f, A)
findall(f::Union{typeof(!), typeof(identity)}, A::AbstractArray{Bool}) = _findall(f, A)
Member


@aviatesk this seems like another case where ideally we should be dispatching on the Core.Compiler.infer_effects result for f(A) to pick the best choice of algorithm.

@jakobnissen jakobnissen marked this pull request as draft July 7, 2023 11:20
@StefanKarpinski StefanKarpinski added the triage This should be discussed on a triage call label Jul 18, 2023
@StefanKarpinski
Member

I marked this for triage commentary. The main takeaway is that everyone feels the existing optimization is pretty icky and would love to see some benchmarks on whether it's ever worth doing, especially since our vector-growing code has gotten a fair amount of work in the past couple of years and might now be good enough to make this no longer worthwhile. We could also add a sizehint keyword argument that lets you specify how big you think the result will be. The optimized version could then be expressed as findall(pred, v, sizehint=count(pred, v)), which is kind of verbose but at least explicit about what's being done. We could also maybe have sizehint=true mean to do this without having to spell it out.

@StefanKarpinski
Member

I do think it's compelling that we should not evaluate predicates with side effects multiple times; we should restrict the optimization to predicates without side effects. Now that we can infer effects, it seems better to just avoid multiple evaluation in cases where getindex or the predicate have side effects. TBD what effect set is allowable. Another idea that came up on the triage call is that it could be even more efficient to take a random sample of values in the array, use that to estimate a sizehint, and then do the pushing implementation.

@StefanKarpinski
Member

So my overall proposal is to add the sizehint keyword argument, which

  • defaults to false if getindex and the predicate aren't sufficiently pure
  • defaults to count(pred, v) if they are pure enough and v is small (for some cutoff)
  • defaults to a sample-based 90th percentile binomial estimate for large enough v

If the user wants to provide a size hint when getindex or the predicate aren't pure enough, then they can pass it manually and we could expose the estimator utility function.
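
The sample-based default could be sketched roughly like this (estimate_sizehint is a hypothetical helper, and the sample size and percentile constant are illustrative; the 90th percentile of the binomial is approximated with a normal approximation):

```julia
# Hypothetical sizehint estimator per the proposal above: sample k
# elements, estimate the match probability, then pad the hint up to
# roughly the 90th percentile (mean + ~1.28 standard deviations of
# the binomial distribution, via a normal approximation).
function estimate_sizehint(pred, v; k::Int = 64)
    n = length(v)
    n <= k && return count(pred, v)   # small input: just count exactly
    hits = count(pred, (v[rand(eachindex(v))] for _ in 1:k))
    p = hits / k
    return ceil(Int, n * p + 1.2816 * sqrt(n * p * (1 - p)))
end
```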

@LilithHafner
Member

In theory, a random sample based preallocation could be optimal, but I have yet to see that approach outperform the two-pass approach which preallocates. I propose adding the guarantee "this won't call f/getindex more than once per element", gating the preallocation optimization based on inferred effects, and keeping all those optimizations internal for now.

I have a hard time picturing a use case where it is appropriate for a user to dive deep enough into findall optimizations that it is warranted to provide a sizehint keyword so I'd like to keep the optimizations internal until someone asks to use them.

@oscardssmith
Member

It is also probably worth keeping in mind that high-performance use cases will likely want to use findfirst + findnext rather than findall anyway.
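
That findfirst/findnext pattern can be sketched as follows (the name visit_matches is invented here):

```julia
# Visit matching indices one at a time via findfirst/findnext, so no
# index vector is ever materialised.
function visit_matches(g, f, A::AbstractVector{Bool})
    i = findfirst(f, A)
    while i !== nothing
        g(i)                          # process one matching index
        i = findnext(f, A, i + 1)     # i + 1 is valid for linear (vector) indices
    end
end
```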

@jakobnissen
Contributor Author

jakobnissen commented Jul 22, 2023

TL;DR: We should undo all my "optimizations" and revert back to #37177

Implementations

Based on triage's comments and a discussion with @LilithHafner , I've now done some benchmarking on various implementations. It's a little tricky because the speed depends on several factors, including

  • The size of the input array
  • The dimensionality of the input array (since that changes the element type of the result)
  • The size of the output array (i.e. how often f(element) == true)
  • How long f takes to evaluate

I've tested 4 different implementations:

  1. manual_push is the naive solution that initializes a vector of indices and appends one index at a time, but with the added optimization that instead of actually calling push! (which is slow), it writes with @inbounds setindex! and doubles the size of the output array when necessary.
  2. The existing findall
  3. bitvect(f, A) = findall((f.(A))::BitArray). This is the old implementation from Make findall faster for AbstractArrays #37177
  4. twopass. This implementation first converts (f, A) to an equivalent function and array for which computing f(A[i]) is fast and pure. In my test implementation, I simply returned the inputs unchanged when they already had that property, and returned (identity, (f.(A))::BitArray) otherwise.

Note that if we take away the "ickiness" of the existing optimization, then the existing implementations are either 1. or 4. depending on dispatch.
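
Implementation 1 above can be sketched roughly as follows (a simplified illustration of manual_push, not the exact benchmarked code):

```julia
# Sketch of the manual_push strategy: instead of calling push!, double
# the output vector as needed, write with @inbounds setindex!, and
# trim the overallocation at the end.
function manual_push(f, A::AbstractArray{Bool})
    out = Vector{eltype(keys(A))}(undef, 4)
    n = 0
    for (idx, x) in pairs(A)
        if f(x)
            n += 1
            n > length(out) && resize!(out, 2 * length(out))
            @inbounds out[n] = idx
        end
    end
    return resize!(out, n)            # shrink to the actual match count
end
```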

Benchmarks

  • 1 is overall slowest in nearly every case, presumably from the many resizes and overallocation. We should not use that for A::AbstractArray. We do currently use a Base.push! equivalent for A::Any, but I think that is unavoidable. I would rather have Base.push! be made faster than do an equivalent of manual_push in that case.
  • 4 is fastest for short input arrays, and sometimes also competitive for medium-length arrays (~30k elements) when the result size is small. Presumably this is because it avoids allocating a large bitarray filled mostly with zeros. However, its advantage was only notable for small arrays (tens to hundreds of elements).
  • 3 is fastest for long arrays. One might think that it would be very slow for long arrays since it needs to allocate a huge bitarray filled with mostly zeros, but findall(::BitArray) is very efficient at just skipping these zeros.

Benchmarks:
Times given are relative to the existing findall (values below 1 mean faster). Each row gives the test number, input length, implementation, and relative time. Arguments for the 6 tests are:

[
    (!, A1),
    (identity, A2),
    (<(0.00001), A3),
    (<(0.99), A3),
    (>(0.5), A4),
    (isodd, A5)
]
Benchmark results:

1 length = 0 Push 0.171
1 length = 0 Twopass 0.143
1 length = 0 BitVect 1.107

1 length = 32 Push 0.416
1 length = 32 Twopass 0.301
1 length = 32 BitVect 1.811

1 length = 1024 Push 1.253
1 length = 1024 Twopass 0.888
1 length = 1024 BitVect 1.77

1 length = 32224 Push 1.541
1 length = 32224 Twopass 0.934
1 length = 32224 BitVect 0.939

1 length = 1000000 Push 1.571
1 length = 1000000 Twopass 0.938
1 length = 1000000 BitVect 0.876

2 length = 0 Push 1.124
2 length = 0 Twopass 1.15
2 length = 0 BitVect 9.468

2 length = 32 Push 1.699
2 length = 32 Twopass 0.811
2 length = 32 BitVect 5.291

2 length = 1024 Push 2.436
2 length = 1024 Twopass 0.781
2 length = 1024 BitVect 1.028

2 length = 32224 Push 1.853
2 length = 32224 Twopass 0.727
2 length = 32224 BitVect 0.787

2 length = 1000000 Push 1.81
2 length = 1000000 Twopass 0.664
2 length = 1000000 BitVect 0.672

3 length = 0 Push 0.426
3 length = 0 Twopass 0.429
3 length = 0 BitVect 3.426

3 length = 32 Push 0.567
3 length = 32 Twopass 0.275
3 length = 32 BitVect 2.371

3 length = 1024 Push 1.393
3 length = 1024 Twopass 0.098
3 length = 1024 BitVect 1.253

3 length = 32224 Push 7.043
3 length = 32224 Twopass 0.318
3 length = 32224 BitVect 1.04

3 length = 1000000 Push 6.217
3 length = 1000000 Twopass 3.021
3 length = 1000000 BitVect 1.004

4 length = 0 Push 0.385
4 length = 0 Twopass 0.411
4 length = 0 BitVect 3.331

4 length = 32 Push 1.478
4 length = 32 Twopass 0.435
4 length = 32 BitVect 2.218

4 length = 1024 Push 1.292
4 length = 1024 Twopass 0.627
4 length = 1024 BitVect 1.122

4 length = 32224 Push 2.051
4 length = 32224 Twopass 0.969
4 length = 32224 BitVect 1.008

4 length = 1000000 Push 3.17
4 length = 1000000 Twopass 1.121
4 length = 1000000 BitVect 1.01

5 length = 0 Push 0.406
5 length = 0 Twopass 0.394
5 length = 0 BitVect 3.208

5 length = 32 Push 0.514
5 length = 32 Twopass 0.421
5 length = 32 BitVect 2.041

5 length = 1024 Push 1.038
5 length = 1024 Twopass 0.591
5 length = 1024 BitVect 1.115

5 length = 32224 Push 2.422
5 length = 32224 Twopass 0.959
5 length = 32224 BitVect 1.018

5 length = 1000000 Push 2.252
5 length = 1000000 Twopass 1.091
5 length = 1000000 BitVect 0.986

6 length = 0 Push 0.427
6 length = 0 Twopass 3.695
6 length = 0 BitVect 3.228

6 length = 32 Push 0.474
6 length = 32 Twopass 2.32
6 length = 32 BitVect 2.09

6 length = 1024 Push 1.302
6 length = 1024 Twopass 1.142
6 length = 1024 BitVect 1.113

6 length = 32224 Push 1.854
6 length = 32224 Twopass 1.03
6 length = 32224 BitVect 0.992

6 length = 1000000 Push 3.328
6 length = 1000000 Twopass 1.024
6 length = 1000000 BitVect 1.018

Conclusion

From my benchmarks, the conclusion is that 3. is the best overall solution, i.e. we simply revert everything back to #37177. We will regress in the sense that #42187 will re-appear, but IMO this doesn't matter. It's fine for findall to be "slow" when we're talking about a few hundred extra nanoseconds for a function that allocates anyway. Truly high-performance code would not use findall in the first place, and it's most important to be fast for long arrays.

Q&A

What about an implementation just using push!?

Based on initial benchmarking (not shown in the benchmarking results), this version is strictly worse than implementation 1, and thus not worth considering.

What about using infer_effects and/or inlining heuristics to determine whether a two-pass solution is optimal?

Given that the two-pass solution is about on par with the BitArray solution even in the benchmark cases where I know for sure that f is fast and pure, it's not worth it, unless we really want to save half a microsecond when calling findall on small inputs. If twopass were way faster than the bitarray solution, it might have been worth investigating more closely how this could be done (#50624), but it isn't.

@jakobnissen jakobnissen changed the title Only use optimised findall for simple boolean functions Revert optimisations to findall(f, ::AbstractArray{Bool}) Jul 22, 2023
@jakobnissen jakobnissen removed the triage This should be discussed on a triage call label Jul 22, 2023
@JeffBezanson JeffBezanson marked this pull request as ready for review August 24, 2023 18:24
@JeffBezanson JeffBezanson marked this pull request as draft August 24, 2023 18:25
@jakobnissen jakobnissen force-pushed the findall branch 2 times, most recently from cae8886 to e29947c Compare April 18, 2024 12:32