Conditional Clinger's fast path #153
Conversation
I think I found an issue in ClangCL + Win32 where double(0) becomes -0.0 when fegetround() == FE_DOWNWARD. I suppose that it could be argued to be correct in some sense.
Wouldn't it be better to perform most or all of the old/new Clinger's fast path checks first, and only if they are all satisfied …
I am concerned that there might be an interesting micro-optimization that I am leaving on the table. Thanks for your comment. OK, so currently we do this...

if (detail::rounds_to_nearest()) {
  //
  // This is where we end up all of the time
  //
  if (binary_format<T>::min_exponent_fast_path() <= pns.exponent && pns.exponent <= binary_format<T>::max_exponent_fast_path() && pns.mantissa <= binary_format<T>::max_mantissa_fast_path() && !pns.too_many_digits) { ... }
} else {
  //
  // very uncommon case (you never get here in practice)
  //
  if (pns.exponent >= 0 && pns.exponent <= binary_format<T>::max_exponent_fast_path() && pns.mantissa <= binary_format<T>::max_mantissa_fast_path(pns.exponent) && !pns.too_many_digits) { ... }
}

Are you proposing we do...

if (pns.exponent >= 0 && pns.exponent <= binary_format<T>::max_exponent_fast_path() && pns.mantissa <= binary_format<T>::max_mantissa_fast_path(pns.exponent) && !pns.too_many_digits) { ... }
if (detail::rounds_to_nearest()) {
  if (binary_format<T>::min_exponent_fast_path() <= pns.exponent && pns.exponent <= binary_format<T>::max_exponent_fast_path() && pns.mantissa <= binary_format<T>::max_mantissa_fast_path() && !pns.too_many_digits) { ... }
}

It seems to me that in the case where these fast paths do not apply, we will end up with more overhead (e.g., we still need to compute …). One concern is that if you have too many paths, you may end up with more mispredicted branches. So you do not want to make the code much more branchy (that is, I don't want to add 'hard to predict' branches).
@jakubjelinek My current impression is that given good enough code generation, the …
Because && and || are short-circuiting, the general rule is that && or || subexpressions should be sorted from cheapest to most expensive, or from the one which will short-circuit most often. So, when you are micro-optimizing, which is what all this is about, if in real-world usage strings that can't use either form of Clinger's fast path are common, doing detail::rounds_to_nearest() as the first check (so unconditionally) might not be best. What I thought about is roughly (untested):

if (binary_format<T>::min_exponent_fast_path() <= pns.exponent && pns.exponent <= binary_format<T>::max_exponent_fast_path() && !pns.too_many_digits) {
// Unfortunately, the conventional Clinger's fast path is only possible
// when the system rounds to the nearest float.
if(detail::rounds_to_nearest()) {
// We have that fegetround() == FE_TONEAREST.
// Next is Clinger's fast path.
    if (pns.mantissa <= binary_format<T>::max_mantissa_fast_path()) {
value = T(pns.mantissa);
if (pns.exponent < 0) { value = value / binary_format<T>::exact_power_of_ten(-pns.exponent); }
else { value = value * binary_format<T>::exact_power_of_ten(pns.exponent); }
if (pns.negative) { value = -value; }
return answer;
}
} else {
// We do not have that fegetround() == FE_TONEAREST.
    // Next is a modified Clinger's fast path, inspired by Jakub Jelínek's proposal.
    if (pns.exponent >= 0 && pns.mantissa <= binary_format<T>::max_mantissa_fast_path(pns.exponent)) {
#if (defined(_WIN32) && defined(__clang__))
// ClangCL may map 0 to -0.0 when fegetround() == FE_DOWNWARD
if(pns.mantissa == 0) {
value = 0;
return answer;
}
#endif
value = T(pns.mantissa) * binary_format<T>::exact_power_of_ten(pns.exponent);
if (pns.negative) { value = -value; }
return answer;
}
  }
}

Can you benchmark that against your version?
It looks like it might be very slightly faster. Tested on an Apple M2 processor (LLVM 14). Current PR:
New proposal:
I will test on x64 next.
On an AMD Rome processor with GCC 9, the result is negative or, at least, we cannot conclude that the new approach is better. Current PR:
New approach:
On a Graviton 3 processor with GCC 11... This PR:
With the new proposal...
If the benchmark results are inconclusive, just pick whatever you think is more maintainable or more readable.
On a small Intel Ice Lake node... GCC 11... This PR:
With the proposal...
So for recent compilers and recent processors, the new proposal is beneficial. It might be slightly negative on other systems. I will adopt it.
@jakubjelinek I have adopted your approach. I will leave this PR open for a little bit. I am inviting more comments.
include/fast_float/parse_number.h
// There might be other ways to prevent compile-time optimizations (e.g., asm).
// The value does not need to be std::numeric_limits<float>::min(); any small
// value x such that 1 + x rounds to 1 would do.
This actually isn't true. The problem is with excess precision. E.g. on i?86 32-bit (or x86-64 with -mfpmath=387) floats are evaluated to the precision of long double (with -fexcess-precision=standard) or could be either way (otherwise).
With std::numeric_limits<float>::min() it will work in this case, because even i?86 long double has just a 64-bit mantissa and because std::numeric_limits<float>::min() is 2^-126, so both will still round to 1.0 under round-to-nearest.
But if the value wasn't that small, but say just 2^-60, it wouldn't be true anymore. Fortunately it would then return false and so result in the safer version despite perhaps being really FE_TONEAREST.
2^-126 is good even if some target hypothetically evaluates everything in IEEE quad (113 bit mantissa) and no hw AFAIK implements 256-bit floats right now.
Point taken. I have added a remark in a follow-up commit which states "after accounting for excess precision".
include/fast_float/parse_number.h
// fmin + 1.0f = 0x1.00001 (1.00001)
// 1.0f - fmin = 0x1 (1)
//
// FE_DOWNWARD or FE_TOWARDZERO:
// fmin + 1.0f = 0x1 (1)
// 1.0f - fmin = 0x0.999999 (0.999999)
On most CPUs, float is the IEEE single format, so in the fmin + 1.0f FE_UPWARD case it would be better to mention 0x1.000002 rather than 0x1.00001, because that is what one really gets in that format after that addition. The decimal representation of that is 1.00000011920928955078125, if you want to mention it.
And 0x0.999999 is certainly wrong; the right value is 0x1.fffffe hexadecimal, 0.999999940395355224609375 decimal.
Let us take this stuff out. It is too long.
Note that my latest commit adds an optimization.
I simplified the explanation in the latest commit.
This new code is slightly insane. :-) Lots and lots of complexity. Ah well.
Would it make sense to add a test to cover what happens when …
@biojppm My understanding of fast-math is that it does away with the standards: the rounding is no longer guaranteed to be exact, and we leave the bounds of the C++ standard. I could be wrong but, if so, I'd love a reference.
Indeed. E.g., if the wrong answer is given, then the result would likely be that the current rounding mode is ignored (correct me if I'm wrong here), and that may or may not be bad, depending on the user's requirements. But that is if the compiler introduces the "right" optimization. Will it? And how likely is it to do it? If we add a test covering … And if it is evident from the outset that there is indeed a certain tradeoff, and what it looks like, then we can skip the tests, but that's an even stronger argument for making users aware of it.
I have tried on godbolt …
I have added a fast-math test to our CI.
Merging. I will issue a new release.
It has been released. Thanks to all.
As remarked by @jakubjelinek, we cannot unconditionally use the conventional Clinger's fast path because it assumes that the rounding mode is set to 'nearest' (which is the universal default). Nothing stops a user from changing the rounding mode, but we need our code to produce the same result (rounding to nearest) irrespective of the system's rounding mode.
Checking that fegetround() == FE_TONEAREST is too expensive on some systems. A better solution was proposed by @mwalcott3; it involves doing an addition, a subtraction and a comparison to verify that the rounding mode is set to nearest. Because parsing a float might already involve hundreds of instructions, this check is relatively cheap. The resulting branch is also invariably well predicted.
This allows us to bring back the Clinger's fast path when the test is ok (which is always in practice). When it does not apply, we can still use another 'Clinger-like' fast path (based on a proposal by @jakubjelinek).
The result is increased performance when Clinger's fast path applies and the 'Clinger-like' fast path would not.
It can cause a slight performance regression in cases (such as 'canada.txt') where the fast path is not beneficial, because we now have a more expensive check, though this may depend on the compiler. There might be ways to micro-optimize this better than what the current PR proposes.
Ice Lake processor, GCC 11:
Current code:
This PR:
Using fegetround() == FE_TONEAREST to guard Clinger's (slow):

Apple M2 processor, LLVM 14:
Current code:
This PR:
cc @jakubjelinek
credit to @mwalcott3