Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

_Fill_memset_is_safe is a dynamic property and should work for multi-byte objects with identical bytes #3167

Open
mehrdadn opened this issue Oct 20, 2022 · 10 comments
Labels
performance Must go faster

Comments

@mehrdadn
Copy link

Hi, I just wanted to mention that _Fill_memset_is_safe "should" work for types other than char or bool. In particular, it should work for multi-byte objects whose byte representation has the same bytes repeated (with -1 being a typical case). It would be great if this could be implemented. Thanks!

STL/stl/inc/xutility

Lines 4313 to 4318 in 2f8342a

// _Fill_memset_is_safe determines if _FwdIt and _Ty are eligible for memset optimization in fill.
// Need to explicitly test for volatile because _Unwrap_enum_t discards qualifiers.
template <class _FwdIt, class _Ty, bool = _Iterator_is_contiguous<_FwdIt>>
_INLINE_VAR constexpr bool _Fill_memset_is_safe = conjunction_v<is_scalar<_Ty>,
_Is_character_or_byte_or_bool<_Unwrap_enum_t<remove_reference_t<_Iter_ref_t<_FwdIt>>>>,
negation<is_volatile<remove_reference_t<_Iter_ref_t<_FwdIt>>>>, is_assignable<_Iter_ref_t<_FwdIt>, const _Ty&>>;

@AlexGuteniev
Copy link
Contributor

Checking the value and branching could be significant pessimization.

Instead we can implement vector fill with 2, 4, and 8 byte types, like we already have vector find.

@mehrdadn
Copy link
Author

mehrdadn commented Oct 25, 2022

Sorry I'm confused, how could you avoid checking the value or branching in the 2/4/8 cases? Don't you have to ensure the value is (say) zero before you can call memset? What I was saying was that instead of checking if all the bytes are 0, you could check if all of its bytes are the same. Note the particular case that made me suggest this was integers initialized to -1, which zero-initialization wouldn't help with.

@frederick-vs-ja
Copy link
Contributor

Sorry I'm confused, how could you avoid checking the value or branching in the 2/4/8 cases? Don't you have to ensure the value is (say) zero before you can call memset?

In the 2/4/8-byte cases the intent is just not calling memset. Something like wmemset or c32memset (though this doesn't exist currently) would be called instead.

@mehrdadn
Copy link
Author

mehrdadn commented Oct 25, 2022

Oh I see. I suppose that might work? It's confusing for me why you'd do it that way though, it seems rather roundabout. Right now I see:

  • [1-byte] if constexpr (_Fill_memset_is_safe<...>)_Fill_memset(...)
  • [N-byte] if constexpr (_Fill_zero_memset_is_safe<...>) + if (_Is_all_bits_zero(...))_Fill_zero_memset(...)

The case I see missing is:

  • [N-byte] if constexpr (_Fill_memset_is_safe<...>)_Fill_memset(...)

So for N-byte objects we're already incurring a branch on _Is_all_bits_zero regardless. I would think instead of checking whether it's zero, you could check to see if all the bytes are identical. It seemed to me you could either:

  1. Remove the need for the _Fill_zero_memset_is_safe case and fold both of these into the _Fill_memset_is_safe case, since for 1-byte values there would be no branch anyway.
  2. Keep both of these and add an extra branch like else if (_Is_all_bytes_identical(...)) afterward (for arbitrary-sized objects), if you see some values to the specialized zero case.

Both of these would be much more general than the scalar case.

@CaseyCarter CaseyCarter added the performance Must go faster label Oct 25, 2022
@AlexGuteniev
Copy link
Contributor

I think we can go at least:

  • [1-byte] if constexpr (_Fill_memset_is_safe<...>)_Fill_memset(...) (existing)
  • [2-byte] if constexpr (_Fill_memset_is_safe<...>)_Fill_wmemset(...) (new)
    And possibly in addition
  • [4-byte] if constexpr (_Fill_memset_is_safe<...>)_Fill_c32memset(...) (new)
  • [8-byte] if constexpr (_Fill_memset_is_safe<...>)_Fill_c64memset(...) (new)

Though 2-byte looks more important to me, as it is about wchar_t

Any of these will not check values.


I did a similar thing here already we had find based on memchr, I expanded it to 2,4,8 byte types, I implemented my own functions for that, because:

  • There were no 4 and 8 byte version of memory find at all
  • wmemchr turned out to be very suboptimal
  • Even memchr is only SSE2 and does not use AVX2

See #2434

@StephanTLavavej
Copy link
Member

We talked about this at the weekly maintainer meeting. Vectorized algorithms to fill 1/2/4/8 bytes seem like a potentially interesting optimization. However, detecting fills of values that happen to be implementable with 1-byte memsets (e.g. filling a range of unsigned int with 0xFFFF'FFFFu) seems to be much less useful - such values are likely much rarer (compared to zero, which is an extremely common and special value), and spending extra branches to detect those cases is potentially harmful. In contrast, just vectorizing the whole fill wouldn't be value-sensitive in that way.

@AlexGuteniev
Copy link
Contributor

I've looked into the existing memset for not necessarily zero optimization.
Looks like it still misses some opportunities.

STL/stl/inc/xutility

Lines 4900 to 4903 in e6a12f7

template <class _FwdIt, class _Ty, bool = _Iterator_is_contiguous<_FwdIt>>
_INLINE_VAR constexpr bool _Fill_memset_is_safe = conjunction_v<is_scalar<_Ty>,
_Is_character_or_byte_or_bool<_Unwrap_enum_t<remove_reference_t<_Iter_ref_t<_FwdIt>>>>,
negation<is_volatile<remove_reference_t<_Iter_ref_t<_FwdIt>>>>, is_assignable<_Iter_ref_t<_FwdIt>, const _Ty&>>;

Guess it will not work for enum x : char { ... } or struct y { char a: 5; char b: 3; } whereas it could.
Can we check somehow for trivial copyability?

@AlexGuteniev
Copy link
Contributor

AlexGuteniev commented Apr 8, 2023

Guess it will not work for enum x : char { ... } or struct y { char a: 5; char b: 3; } whereas it could. Can we check somehow for trivial copyability?

I guess is_trivially_assignable_v is the way to go

AlexGuteniev added a commit to AlexGuteniev/STL that referenced this issue Apr 8, 2023
@AlexGuteniev
Copy link
Contributor

x86 / x64 implementation of c32memset / c64memset is trivial: it is __stosd / __stosq

@AlexGuteniev
Copy link
Contributor

x86 / x64 implementation of c32memset / c64memset is trivial: it is __stosd / __stosq

It is however possible that SSE or AVX impl could be faster

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
performance Must go faster
Projects
None yet
Development

No branches or pull requests

5 participants