
Revert optimisations to findall(f, ::AbstractArray{Bool}) #49831

Draft · wants to merge 4 commits into base: master
Conversation

jakobnissen
Contributor

@jakobnissen jakobnissen commented May 16, 2023

In commit 4c4c94f, findall(f, A::AbstractArray{Bool}) was optimised using a technique where A was traversed twice: once to count the number of true elements, and once to fill in the resulting vector.
However, this could cause problems for arbitrary functions f: for slow f, the approach is ~2x slower, and for impure f, calling f twice per element could trigger side effects and strange issues (see issue #46425).

With this commit, the optimised version is dispatched to only when f is ! or identity.

Note that this re-introduces the performance regression from #42187 by reverting some of the optimisation in #42202.

Closes issue #46425
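
For context, the reverted two-pass technique can be sketched roughly as follows (a simplified illustration, not the exact Base code; the name findall_twopass is invented here):

```julia
# Simplified sketch of the two-pass optimisation being reverted.
# The first traversal counts matches so the output can be allocated
# at exactly the right size; the second traversal fills it in.
# Note that f is called twice per element, which is exactly the
# problem for impure f described above.
function findall_twopass(f, A::AbstractArray{Bool})
    n = count(f, A)                       # first pass: call f on every element
    out = Vector{eltype(keys(A))}(undef, n)
    i = 1
    for (idx, x) in pairs(A)
        if f(x)                           # second pass: f called again
            out[i] = idx
            i += 1
        end
    end
    return out
end
```

For example, findall_twopass(!, [true, false, false]) returns [2, 3].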

@adienes
Contributor

adienes commented May 16, 2023

Let A have length n and let the number of true elements be t. Am I correct that the comparison here is:

  • call f on 2n elements, create one vector of size t
  • call f on n elements, create one vector of size 0 and resize as needed

Which approach is faster is, I think, highly dependent on all of n, t, and the performance of f, so trying to guess which is best based only on the function type sounds a little too magical. Even with !, couldn't this be slower if n >> t?

And it leaves further optimization on the table when t is known exactly in advance. This suggests to me it might be a good idea to have, when t is known in advance:

findall!(idxs, f, A)

where length(idxs) == t and the output of findall(f, A) is written into idxs. Then findall would always choose the resizing approach, perhaps with a keyword like findall(; count_first=:auto) if the user wants the "magic" 2n approach above, which would simply allocate a vector and dispatch to findall!.
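
A minimal sketch of the findall! proposed above (a hypothetical API, not currently in Base):

```julia
# Hypothetical findall! as proposed above: idxs must already have
# length equal to the number of matches t; the matching indices are
# written into it in place and idxs is returned.
function findall!(idxs::AbstractVector, f, A::AbstractArray{Bool})
    i = firstindex(idxs)
    for (idx, x) in pairs(A)
        if f(x)
            idxs[i] = idx
            i += 1
        end
    end
    return idxs
end
```

With t known in advance, findall(f, A) on a vector would then reduce to findall!(Vector{Int}(undef, t), f, A).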

Also, the case of an impure predicate is probably a good candidate for a unit test.

base/array.jl Outdated
end

findall(f::Function, A::AbstractArray{Bool}) = _findall(f, A)
findall(f::Union{typeof(!), typeof(identity)}, A::AbstractArray{Bool}) = _findall(f, A)
Member


@aviatesk this seems like another case where ideally we should be dispatching on the Core.Compiler.infer_effects result for f(A) to pick the best choice of algorithm.

@jakobnissen jakobnissen marked this pull request as draft July 7, 2023 11:20
@StefanKarpinski StefanKarpinski added the triage This should be discussed on a triage call label Jul 18, 2023
@StefanKarpinski
Member

I marked this for triage commentary. The main takeaway is that everyone feels the existing optimization is pretty icky and would love to see some benchmarks on whether it's ever worth doing, especially since our vector-growing code has gotten a fair amount of work in the past couple of years and might now be good enough to make this no longer worthwhile. We could also add a sizehint keyword argument that lets you specify how big you think the result will be. The optimized version could then be expressed as findall(pred, v, sizehint=count(pred, v)), which is kind of verbose but at least explicit about what's being done. We could also maybe have sizehint=true mean to do this without having to spell it out.

@StefanKarpinski
Member

I do think it's compelling that we should not evaluate predicates with side effects multiple times; we should restrict the optimization to predicates without side effects. Now that we can infer effects, it seems better to just avoid multiple evaluation in cases where getindex or the predicate have side effects. TBD what effect set is allowable. Another idea that came up on the triage call is that it could be even more efficient to take a random sample of values in the array, use that to estimate a sizehint, and then do the pushing implementation.

@StefanKarpinski
Member

So my overall proposal is to add the sizehint keyword argument, which

  • defaults to false if getindex and the predicate aren't sufficiently pure
  • defaults to count(pred, v) if they are pure enough and v is small (for some cutoff)
  • defaults to a sample-based 90th percentile binomial estimate for large enough v

If the user wants to provide a size hint when getindex or the predicate aren't pure enough, then they can pass it manually and we could expose the estimator utility function.
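
The sample-based default could be sketched roughly like this (estimate_sizehint is a hypothetical helper, and the sample size and percentile constant are illustrative; the 90th percentile of the binomial is approximated with a normal approximation):

```julia
# Hypothetical sizehint estimator per the proposal above: sample k
# elements, estimate the match probability, then pad the hint up to
# roughly the 90th percentile (mean + ~1.28 standard deviations of
# the binomial distribution, via a normal approximation).
function estimate_sizehint(pred, v; k::Int = 64)
    n = length(v)
    n <= k && return count(pred, v)   # small input: just count exactly
    hits = count(pred, (v[rand(eachindex(v))] for _ in 1:k))
    p = hits / k
    return ceil(Int, n * p + 1.2816 * sqrt(n * p * (1 - p)))
end
```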

@LilithHafner
Member

In theory, a random sample based preallocation could be optimal, but I have yet to see that approach outperform the two-pass approach which preallocates. I propose adding the guarantee "this won't call f/getindex more than once per element", gating the preallocation optimization based on inferred effects, and keeping all those optimizations internal for now.

I have a hard time picturing a use case where it is appropriate for a user to dive deep enough into findall optimizations that it is warranted to provide a sizehint keyword so I'd like to keep the optimizations internal until someone asks to use them.

@oscardssmith
Member

It is also probably worth keeping in mind that high-performance use cases will likely want to use findfirst + findnext rather than findall anyway.
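
That findfirst/findnext pattern can be sketched as follows (the name visit_matches is invented here):

```julia
# Visit matching indices one at a time via findfirst/findnext, so no
# index vector is ever materialised.
function visit_matches(g, f, A::AbstractVector{Bool})
    i = findfirst(f, A)
    while i !== nothing
        g(i)                          # process one matching index
        i = findnext(f, A, i + 1)     # i + 1 is valid for linear (vector) indices
    end
end
```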

@jakobnissen
Contributor Author

jakobnissen commented Jul 22, 2023

TL;DR: We should undo all my "optimizations" and revert back to #37177

Implementations

Based on triage's comments and a discussion with @LilithHafner , I've now done some benchmarking on various implementations. It's a little tricky because the speed depends on several factors, including

  • The size of the input array
  • The dimensionality of the input array (since that changes the element type of the result)
  • The size of the output array (i.e. how often f(element) == true)
  • How long f takes to evaluate

I've tested 4 different implementations:

  1. manual_push is the naive solution that initializes a vector of indices and appends one index at a time, but with the added optimization that instead of actually calling push! (which is slow), it writes with @inbounds setindex! and doubles the size of the output array when necessary.
  2. The existing findall
  3. bitvect(f, A) = findall((f.(A))::BitArray). This is the old implementation from Make findall faster for AbstractArrays #37177
  4. twopass. This implementation first converts (f, A) to an equivalent function and array for which computing f(A[i]) is fast and pure. In my test implementation, I simply returned the inputs unchanged when they already had that property, and returned (identity, (f.(A))::BitArray) otherwise.

Note that if we take away the "ickiness" of the existing optimization, then the existing implementations are either 1. or 4. depending on dispatch.
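
Implementation 1 above can be sketched roughly as follows (a simplified illustration of manual_push, not the exact benchmarked code):

```julia
# Sketch of the manual_push strategy: instead of calling push!, double
# the output vector as needed, write with @inbounds setindex!, and
# trim the overallocation at the end.
function manual_push(f, A::AbstractArray{Bool})
    out = Vector{eltype(keys(A))}(undef, 4)
    n = 0
    for (idx, x) in pairs(A)
        if f(x)
            n += 1
            n > length(out) && resize!(out, 2 * length(out))
            @inbounds out[n] = idx
        end
    end
    return resize!(out, n)            # shrink to the actual match count
end
```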

Benchmarks

  • 1 is overall slowest in nearly every case, presumably from the many resizes and overallocation. We should not use that for A::AbstractArray. We do currently use a Base.push! equivalent for A::Any, but I think that is unavoidable. I would rather have Base.push! be made faster than do an equivalent of manual_push in that case.
  • 4 is fastest for short input arrays, and sometimes also competitive for medium-length arrays (~30k elements) when the result size is small. Presumably this is because it avoids allocating a large bitarray filled mostly with zeros. However, its advantage was only notable for small arrays (tens to hundreds of elements).
  • 3 is fastest for long arrays. One might think that it would be very slow for long arrays since it needs to allocate a huge bitarray filled with mostly zeros, but findall(::BitArray) is very efficient at just skipping these zeros.

Benchmarks:
Times given are relative to the existing findall (values below 1 mean faster). Each row gives the test number, input length, implementation, and relative time. Arguments for the 6 tests are:

[
    (!, A1),
    (identity, A2),
    (<(0.00001), A3),
    (<(0.99), A3),
    (>(0.5), A4),
    (isodd, A5)
]
Benchmark results:

1 length = 0 Push 0.171
1 length = 0 Twopass 0.143
1 length = 0 BitVect 1.107

1 length = 32 Push 0.416
1 length = 32 Twopass 0.301
1 length = 32 BitVect 1.811

1 length = 1024 Push 1.253
1 length = 1024 Twopass 0.888
1 length = 1024 BitVect 1.77

1 length = 32224 Push 1.541
1 length = 32224 Twopass 0.934
1 length = 32224 BitVect 0.939

1 length = 1000000 Push 1.571
1 length = 1000000 Twopass 0.938
1 length = 1000000 BitVect 0.876

2 length = 0 Push 1.124
2 length = 0 Twopass 1.15
2 length = 0 BitVect 9.468

2 length = 32 Push 1.699
2 length = 32 Twopass 0.811
2 length = 32 BitVect 5.291

2 length = 1024 Push 2.436
2 length = 1024 Twopass 0.781
2 length = 1024 BitVect 1.028

2 length = 32224 Push 1.853
2 length = 32224 Twopass 0.727
2 length = 32224 BitVect 0.787

2 length = 1000000 Push 1.81
2 length = 1000000 Twopass 0.664
2 length = 1000000 BitVect 0.672

3 length = 0 Push 0.426
3 length = 0 Twopass 0.429
3 length = 0 BitVect 3.426

3 length = 32 Push 0.567
3 length = 32 Twopass 0.275
3 length = 32 BitVect 2.371

3 length = 1024 Push 1.393
3 length = 1024 Twopass 0.098
3 length = 1024 BitVect 1.253

3 length = 32224 Push 7.043
3 length = 32224 Twopass 0.318
3 length = 32224 BitVect 1.04

3 length = 1000000 Push 6.217
3 length = 1000000 Twopass 3.021
3 length = 1000000 BitVect 1.004

4 length = 0 Push 0.385
4 length = 0 Twopass 0.411
4 length = 0 BitVect 3.331

4 length = 32 Push 1.478
4 length = 32 Twopass 0.435
4 length = 32 BitVect 2.218

4 length = 1024 Push 1.292
4 length = 1024 Twopass 0.627
4 length = 1024 BitVect 1.122

4 length = 32224 Push 2.051
4 length = 32224 Twopass 0.969
4 length = 32224 BitVect 1.008

4 length = 1000000 Push 3.17
4 length = 1000000 Twopass 1.121
4 length = 1000000 BitVect 1.01

5 length = 0 Push 0.406
5 length = 0 Twopass 0.394
5 length = 0 BitVect 3.208

5 length = 32 Push 0.514
5 length = 32 Twopass 0.421
5 length = 32 BitVect 2.041

5 length = 1024 Push 1.038
5 length = 1024 Twopass 0.591
5 length = 1024 BitVect 1.115

5 length = 32224 Push 2.422
5 length = 32224 Twopass 0.959
5 length = 32224 BitVect 1.018

5 length = 1000000 Push 2.252
5 length = 1000000 Twopass 1.091
5 length = 1000000 BitVect 0.986

6 length = 0 Push 0.427
6 length = 0 Twopass 3.695
6 length = 0 BitVect 3.228

6 length = 32 Push 0.474
6 length = 32 Twopass 2.32
6 length = 32 BitVect 2.09

6 length = 1024 Push 1.302
6 length = 1024 Twopass 1.142
6 length = 1024 BitVect 1.113

6 length = 32224 Push 1.854
6 length = 32224 Twopass 1.03
6 length = 32224 BitVect 0.992

6 length = 1000000 Push 3.328
6 length = 1000000 Twopass 1.024
6 length = 1000000 BitVect 1.018

Conclusion

From my benchmarks, the conclusion is that 3. is the best overall solution, i.e. we simply revert everything back to #37177. We will regress in the sense that #42187 will re-appear, but IMO this doesn't matter. It's fine for findall to be "slow" when we're talking about a few hundred extra nanoseconds for a function that allocates anyway. Truly high-performance code would not use findall in the first place, and it's most important to be fast for long arrays.

Q&A

What about an implementation just using push!?

Based on initial benchmarking (not shown in the benchmarking results), this version is strictly worse than implementation 1, and thus not worth considering.

What about using infer_effects and/or inlining heuristics to determine whether a two-pass solution is optimal?

Given that the two-pass solution is about on par with the BitArray solution even in the benchmark cases where I know for sure that f is fast and pure, it's not worth it, unless we really want to save half a microsecond when calling findall on small inputs. If twopass were way faster than the bitarray solution, it might have been worth investigating more closely how this could be done (#50624), but it isn't.

@jakobnissen jakobnissen changed the title Only use optimised findall for simple boolean functions Revert optimisations to findall(f, ::AbstractArray{Bool}) Jul 22, 2023
@jakobnissen jakobnissen removed the triage This should be discussed on a triage call label Jul 22, 2023
@JeffBezanson JeffBezanson marked this pull request as ready for review August 24, 2023 18:24
@JeffBezanson JeffBezanson marked this pull request as draft August 24, 2023 18:25
@jakobnissen jakobnissen force-pushed the findall branch 2 times, most recently from cae8886 to e29947c Compare April 18, 2024 12:32