Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

<xutility>: vectorize std::count #2384

Closed
AlexGuteniev opened this issue Dec 9, 2021 · 1 comment · Fixed by #2434
Closed

<xutility>: vectorize std::count #2384

AlexGuteniev opened this issue Dec 9, 2021 · 1 comment · Fixed by #2434
Labels
fixed Something works now, yay! performance Must go faster

Comments

@AlexGuteniev
Copy link
Contributor

AlexGuteniev commented Dec 9, 2021

Relates to #2379

For contiguous ranges, simple types (1,2,4,8 byte integers, maybe also 4,8 bytes float in fast mode) the following vector algorithm is possible (assuming SSE2 and 8-bit type, but applicable to other sizes/vector sizes):

Spread the value to a vector register (_mm_set1 intrinsics)
Obtain matched bitmask (_mm_cmpeq_epi8 intrinsic)
Get mask as bits (_mm_movemask_epi8) , add them up (_popcnt)
Accumulate this result.
Probably hand-coded popcount will be inefficient, in this case can apply starting SSE4.2, for which we assume popcnt available.

@CaseyCarter CaseyCarter added the performance Must go faster label Dec 9, 2021
@CaseyCarter CaseyCarter changed the title <xutulity>: optimize std::count <xutulity>: vectorize std::count Dec 9, 2021
@StephanTLavavej StephanTLavavej changed the title <xutulity>: vectorize std::count <xutility>: vectorize std::count Dec 11, 2021
@AlexGuteniev
Copy link
Contributor Author

AlexGuteniev commented Dec 12, 2021

#include <algorithm>
#include <array>

char s[] = "the quick brown fox jumps over the lazy dog";

int foxes()
{
    return std::count(std::begin(s), std::end(s), 'o');
}

https://godbolt.org/z/6KeYT4YaG
clang uses my proposal in the generated code
gcc does something which I don't understand, but like less (edit: figured out, yes, it is stupid vectorization, though could beat unvectorized)
MSVC currently naively counts bytes, so library optimization is indeed helpful

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
fixed Something works now, yay! performance Must go faster
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants