For contiguous ranges of simple types (1-, 2-, 4-, or 8-byte integers, and possibly 4- or 8-byte floats in fast mode), the following vector algorithm is possible. The description assumes SSE2 and an 8-bit element type, but the same approach applies to other element sizes and vector widths:
1. Broadcast the value to every lane of a vector register (`_mm_set1` intrinsics).
2. Compare each chunk of the range against it to obtain a per-element match mask (`_mm_cmpeq_epi8` intrinsic).
3. Extract the mask as bits (`_mm_movemask_epi8`) and count the set bits (`popcnt`, e.g. `_mm_popcnt_u32`).
4. Accumulate this result.

A hand-coded popcount will probably be inefficient, so this may only be worth applying starting with SSE4.2, for which we can assume `popcnt` is available.
#include <algorithm>
#include <array>
char s[] = "the quick brown fox jumps over the lazy dog";
int foxes()
{
    return std::count(std::begin(s), std::end(s), 'o');
}
https://godbolt.org/z/6KeYT4YaG
clang generates code that follows my proposal
gcc does something that I did not understand at first and like less (edit: figured it out; yes, it is a poor vectorization, though it could still beat unvectorized code)
MSVC currently counts bytes naively, one at a time, so a library optimization is indeed helpful
Relates to #2379