Conditional Clinger's fast path #153

Merged
merged 15 commits into from
Nov 23, 2022

lemire
Member

@lemire lemire commented Nov 16, 2022

As remarked by @jakubjelinek, we cannot unconditionally use the conventional Clinger's fast path because it assumes that the rounding mode is set to 'nearest' (which is the universal default). Now, nothing stops a user from changing the rounding mode, but we need our code to produce the same result (rounding to nearest) irrespective of the system's rounding mode.

Checking that fegetround() == FE_TONEAREST is too expensive on some systems.

A better solution was proposed by @mwalcott3 and involves doing an addition, a subtraction and a comparison to verify that the rounding mode is set to nearest. Because parsing a float might already involve hundreds of instructions, this check is relatively cheap. The resulting branch is also invariably well predicted.
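
The addition/subtraction/comparison check described above can be sketched as follows. This is a minimal illustration and not necessarily the exact code in the PR: the name `rounds_to_nearest_sketch` is hypothetical, and the `volatile` qualifier is one assumed way of keeping the compiler from evaluating the arithmetic at compile time.

```cpp
#include <cassert>
#include <cfenv>   // std::fesetround, for experimenting with other modes
#include <limits>

// Hypothetical helper: returns true when the current rounding mode
// behaves like round-to-nearest for the two operations below.
inline bool rounds_to_nearest_sketch() {
  // volatile forces a run-time load, so the sums cannot be folded at
  // compile time under the compiler's own rounding assumptions.
  static volatile float fmin = std::numeric_limits<float>::min(); // 2^-126
  float x = fmin;
  // Round to nearest:        1 + x == 1 and 1 - x == 1, so the test is true.
  // Upward:                  1 + x  > 1 and 1 - x == 1, so the test is false.
  // Downward or toward zero: 1 + x == 1 and 1 - x  < 1, so the test is false.
  return (x + 1.0f == 1.0f - x);
}
```

Calling `std::fesetround(FE_UPWARD)` before the helper should make it return false on typical hardware, while the default environment should make it return true.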

This allows us to bring back Clinger's fast path when the test passes (which, in practice, is always). When it does not apply, we can still use another 'Clinger-like' fast path (based on a proposal by @jakubjelinek).
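
For readers unfamiliar with Clinger's fast path, here is a minimal sketch for `double`. It assumes the decimal input has already been parsed into an integer mantissa `w` and a decimal exponent `q`, so that the value is w times 10^q. The function name and the table are illustrative and are not fast_float's API. The bounds are the usual ones for binary64: a mantissa of at most 2^53 and |q| of at most 22, since the powers of ten up to 10^22 are exactly representable.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical sketch of the conventional Clinger fast path for double.
// Returns true and writes the result when the fast path applies.
inline bool clinger_fast_path_sketch(std::uint64_t w, int q, bool negative,
                                     double &out) {
  // Powers of ten from 10^0 to 10^22 are exactly representable as doubles.
  static const double pow10[] = {
      1e0,  1e1,  1e2,  1e3,  1e4,  1e5,  1e6,  1e7,  1e8,  1e9,  1e10, 1e11,
      1e12, 1e13, 1e14, 1e15, 1e16, 1e17, 1e18, 1e19, 1e20, 1e21, 1e22};
  const std::uint64_t max_mantissa = std::uint64_t(1) << 53; // 2^53
  if (q < -22 || q > 22 || w > max_mantissa) { return false; }
  // Both operands are exact, so the single multiplication or division is
  // rounded only once. That one rounding is "to nearest" only when the
  // floating-point environment is in round-to-nearest mode.
  double value = double(w);
  if (q < 0) { value = value / pow10[-q]; }
  else { value = value * pow10[q]; }
  out = negative ? -value : value;
  return true;
}
```

For instance, the input "1.25" would be parsed as w = 125 and q = -2, and the sketch computes 125.0 / 100.0.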

The result is increased performance when Clinger's fast path applies and the 'Clinger-like' fast path would not.

It can cause a slight performance regression in cases (such as 'canada.txt') where the fast path is not beneficial, because we now have a more expensive check, though this may depend on the compiler. There might be ways to micro-optimize this better than what the current PR proposes.

Ice Lake processor, GCC 11:

Current code:

-f data/canada.txt
fastfloat                               :   933.44 MB/s (+/- 1.4 %)    53.64 Mfloat/s  
-f data/canada_short.txt
fastfloat                               :   374.20 MB/s (+/- 1.2 %)    69.53 Mfloat/s  
-f data/mesh.txt
fastfloat                               :   676.27 MB/s (+/- 1.7 %)    92.13 Mfloat/s  
-m uniform
fastfloat                               :  1219.35 MB/s (+/- 2.3 %)    58.12 Mfloat/s  
-m uniform -c
fastfloat                               :   951.23 MB/s (+/- 2.4 %)    54.59 Mfloat/s  
-m simple_uniform32
fastfloat                               :  1225.69 MB/s (+/- 3.0 %)    58.42 Mfloat/s  
-m simple_uniform32 -c
fastfloat                               :   944.60 MB/s (+/- 2.1 %)    54.22 Mfloat/s  
-m simple_int32
fastfloat                               :   771.49 MB/s (+/- 3.1 %)    83.05 Mfloat/s 

This PR:

-f data/canada.txt
fastfloat                               :   953.92 MB/s (+/- 12.2 %)    54.82 Mfloat/s  
-f data/canada_short.txt
fastfloat                               :   584.63 MB/s (+/- 7.2 %)   108.62 Mfloat/s  
-f data/mesh.txt
fastfloat                               :   809.69 MB/s (+/- 3.6 %)   110.30 Mfloat/s  
-m uniform
fastfloat                               :  1204.92 MB/s (+/- 2.1 %)    57.43 Mfloat/s  
-m uniform -c
fastfloat                               :  1025.67 MB/s (+/- 2.2 %)    58.88 Mfloat/s  
-m simple_uniform32
fastfloat                               :  1222.26 MB/s (+/- 2.0 %)    58.26 Mfloat/s  
-m simple_uniform32 -c
fastfloat                               :  1019.27 MB/s (+/- 5.3 %)    58.49 Mfloat/s  
-m simple_int32
fastfloat                               :   772.84 MB/s (+/- 15.4 %)    83.20 Mfloat/s  

Using fegetround() == FE_TONEAREST to guard Clinger's (slow):

-f data/canada.txt
fastfloat                               :   778.68 MB/s (+/- 1.8 %)    44.75 Mfloat/s  
-f data/canada_short.txt
fastfloat                               :   482.55 MB/s (+/- 1.3 %)    89.66 Mfloat/s  
-f data/mesh.txt
fastfloat                               :   631.72 MB/s (+/- 2.0 %)    86.06 Mfloat/s  
-m uniform
fastfloat                               :   915.72 MB/s (+/- 5.3 %)    43.65 Mfloat/s  
-m uniform -c
fastfloat                               :   992.28 MB/s (+/- 3.0 %)    56.94 Mfloat/s  
-m simple_uniform32
fastfloat                               :   915.51 MB/s (+/- 1.1 %)    43.64 Mfloat/s  
-m simple_uniform32 -c
fastfloat                               :   992.24 MB/s (+/- 3.2 %)    56.95 Mfloat/s  
-m simple_int32
fastfloat                               :   693.57 MB/s (+/- 3.6 %)    74.67 Mfloat/s  

Apple M2 processor, LLVM 14:

current code:

-f data/canada.txt
fastfloat                               :  1387.19 MB/s (+/- 0.7 %)    79.72 Mfloat/s      15.90 i/B   290.11 i/f (+/- 0.0 %)      2.41 c/B    43.96 c/f (+/- 0.5 %)      6.60 i/c      3.50 GHz
-f data/canada_short.txt
fastfloat                               :   615.83 MB/s (+/- 1.1 %)   114.42 Mfloat/s      37.59 i/B   212.15 i/f (+/- 0.0 %)      5.28 c/B    29.79 c/f (+/- 0.3 %)      7.12 i/c      3.41 GHz
-f data/mesh.txt
fastfloat                               :   878.58 MB/s (+/- 1.9 %)   119.69 Mfloat/s      24.01 i/B   184.79 i/f (+/- 0.0 %)      3.69 c/B    28.40 c/f (+/- 1.8 %)      6.51 i/c      3.40 GHz
-m uniform
fastfloat                               :  1894.65 MB/s (+/- 0.7 %)    90.30 Mfloat/s      12.59 i/B   277.04 i/f (+/- 0.0 %)      1.72 c/B    37.74 c/f (+/- 0.2 %)      7.34 i/c      3.41 GHz
-m uniform -c
fastfloat                               :  1404.54 MB/s (+/- 0.6 %)    80.61 Mfloat/s      13.08 i/B   239.04 i/f (+/- 0.0 %)      2.31 c/B    42.28 c/f (+/- 0.2 %)      5.65 i/c      3.41 GHz
-m simple_uniform32
fastfloat                               :  1894.15 MB/s (+/- 0.4 %)    90.28 Mfloat/s      12.59 i/B   277.04 i/f (+/- 0.0 %)      1.72 c/B    37.75 c/f (+/- 0.2 %)      7.34 i/c      3.41 GHz
-m simple_uniform32 -c
fastfloat                               :  1414.14 MB/s (+/- 1.4 %)    81.16 Mfloat/s      13.08 i/B   239.01 i/f (+/- 0.0 %)      2.30 c/B    42.00 c/f (+/- 1.0 %)      5.69 i/c      3.41 GHz
-m simple_int32
fastfloat                               :   998.10 MB/s (+/- 0.8 %)   107.46 Mfloat/s      20.17 i/B   196.39 i/f (+/- 0.0 %)      3.26 c/B    31.72 c/f (+/- 0.2 %)      6.19 i/c      3.41 GHz

this PR:

-f data/canada.txt
fastfloat                               :  1312.35 MB/s (+/- 3.7 %)    75.42 Mfloat/s      16.71 i/B   304.92 i/f (+/- 0.0 %)      2.52 c/B    46.06 c/f (+/- 1.0 %)      6.62 i/c      3.47 GHz
-f data/canada_short.txt
fastfloat                               :   764.14 MB/s (+/- 3.6 %)   141.98 Mfloat/s      30.64 i/B   172.94 i/f (+/- 0.0 %)      4.34 c/B    24.48 c/f (+/- 0.4 %)      7.07 i/c      3.48 GHz
-f data/mesh.txt
fastfloat                               :   964.26 MB/s (+/- 1.3 %)   131.36 Mfloat/s      22.32 i/B   171.83 i/f (+/- 0.0 %)      3.34 c/B    25.72 c/f (+/- 1.8 %)      6.68 i/c      3.38 GHz
-m uniform
fastfloat                               :  1784.08 MB/s (+/- 0.8 %)    85.03 Mfloat/s      13.50 i/B   297.04 i/f (+/- 0.0 %)      1.82 c/B    40.08 c/f (+/- 0.1 %)      7.41 i/c      3.41 GHz
-m uniform -c
fastfloat                               :  1409.95 MB/s (+/- 1.2 %)    80.93 Mfloat/s      12.03 i/B   219.84 i/f (+/- 0.0 %)      2.31 c/B    42.11 c/f (+/- 0.9 %)      5.22 i/c      3.41 GHz
-m simple_uniform32
fastfloat                               :  1783.64 MB/s (+/- 0.4 %)    85.01 Mfloat/s      13.50 i/B   297.04 i/f (+/- 0.0 %)      1.82 c/B    40.09 c/f (+/- 0.2 %)      7.41 i/c      3.41 GHz
-m simple_uniform32 -c
fastfloat                               :  1409.90 MB/s (+/- 1.2 %)    80.91 Mfloat/s      12.03 i/B   219.90 i/f (+/- 0.0 %)      2.31 c/B    42.13 c/f (+/- 0.8 %)      5.22 i/c      3.41 GHz
-m simple_int32
fastfloat                               :   969.15 MB/s (+/- 0.8 %)   104.31 Mfloat/s      21.09 i/B   205.43 i/f (+/- 0.0 %)      3.35 c/B    32.67 c/f (+/- 0.2 %)      6.29 i/c      3.41 GHz

cc @jakubjelinek
credit to @mwalcott3

@lemire
Member Author

lemire commented Nov 16, 2022

I think I found an issue with ClangCL + Win32, where double(0) becomes -0.0 when fegetround() == FE_DOWNWARD.

I suppose that it could be argued to be correct in some sense.

@jakubjelinek

Wouldn't it be better to perform most or all of the old/new Clinger's fast path checks first, and only if they are all satisfied call the function that checks the rounding mode through floating-point operations? The limited fast path that works in all rounding modes ought to be a subset of the other.
Of course, only if it improves the benchmarks.

@lemire
Member Author

lemire commented Nov 16, 2022

@jakubjelinek

I am concerned that there might be interesting micro-optimizations that I am leaving on the table. Thanks for your comment.

Ok. So currently we do this...

if(detail::rounds_to_nearest()) {
      //
      // This is where we end up all of the time
      //
      if (binary_format<T>::min_exponent_fast_path() <= pns.exponent && pns.exponent <= binary_format<T>::max_exponent_fast_path() && pns.mantissa <=binary_format<T>::max_mantissa_fast_path() && !pns.too_many_digits) { ... }
} else {
      // very uncommon case (you never get here in practice)
      //
      if (pns.exponent >= 0 && pns.exponent <= binary_format<T>::max_exponent_fast_path() && pns.mantissa <=binary_format<T>::max_mantissa_fast_path(pns.exponent) && !pns.too_many_digits) {...}
}

Are you proposing we do...

  if (pns.exponent >= 0 && pns.exponent <= binary_format<T>::max_exponent_fast_path() && pns.mantissa <=binary_format<T>::max_mantissa_fast_path(pns.exponent) && !pns.too_many_digits) { ... }
  if(detail::rounds_to_nearest()) {
      if (binary_format<T>::min_exponent_fast_path() <= pns.exponent && pns.exponent <= binary_format<T>::max_exponent_fast_path() && pns.mantissa <=binary_format<T>::max_mantissa_fast_path() && !pns.too_many_digits) { ... }
  } 

It seems to me that in the case where these fast paths do not apply, we will end up with more overhead (e.g., we still need to compute detail::rounds_to_nearest()). In the case where the modified fast path works, we save the cost of the detail::rounds_to_nearest()... but that's the only bright side...

One concern is that if you have too many paths, you may end up with more mispredicted branches. So you do not want to make the code much more branchy (that is, I don't want to add 'hard to predict' branches).
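
The second branch discussed above, the 'Clinger-like' path, can work in any rounding mode because the product is exactly representable, so no rounding takes place at all. A minimal sketch for `double` follows. It uses a deliberately conservative bound (w times 10^q must not exceed 2^53), which is an assumption of this sketch and may differ from what `max_mantissa_fast_path(exponent)` computes in the library; all names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical sketch: a fast path whose result does not depend on the
// rounding mode, because the product is exactly representable.
inline bool exact_product_fast_path_sketch(std::uint64_t w, int q,
                                           bool negative, double &out) {
  // 10^0 .. 10^15 as integers; 10^15 is still below 2^53.
  static const std::uint64_t pow10[] = {
      1ULL,               10ULL,               100ULL,
      1000ULL,            10000ULL,            100000ULL,
      1000000ULL,         10000000ULL,         100000000ULL,
      1000000000ULL,      10000000000ULL,      100000000000ULL,
      1000000000000ULL,   10000000000000ULL,   100000000000000ULL,
      1000000000000000ULL};
  const std::uint64_t limit = std::uint64_t(1) << 53; // 2^53
  if (q < 0 || q > 15) { return false; }
  // Conservative bound (assumption of this sketch): require
  // w * 10^q <= 2^53, checked without overflow via integer division.
  if (w > limit / pow10[q]) { return false; }
  // w and 10^q are exact doubles and their product is an integer no
  // larger than 2^53, so the multiplication is exact in every mode.
  double value = double(w) * double(pow10[q]);
  out = negative ? -value : value;
  return true;
}
```

Because only non-negative exponents are accepted, no division is needed, which is what removes the dependence on the rounding mode.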

@lemire
Member Author

lemire commented Nov 16, 2022

@jakubjelinek My current impression is that given good enough code generation, the detail::rounds_to_nearest() call ought to be free. It should always return true, and it is only a handful of instructions, so only a few percent of our instruction count, and there is no data dependency.

@jakubjelinek

Because && and || are short-circuiting, the general rule is that the subexpressions of an && or || condition should be sorted from cheapest to most expensive, or starting with the one that will short-circuit most often.

So, when you are micro-optimizing, which is what all this is about, if real-world usage commonly involves strings that cannot use either form of Clinger's fast path, doing detail::rounds_to_nearest() as the first check (so unconditionally) might not be best.
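
The ordering rule can be illustrated with a small, self-contained sketch. The names are hypothetical and the "expensive" predicate merely counts its calls; the point is only that a cheap test placed first keeps the costly one from running when the cheap test already fails.

```cpp
#include <cassert>

// Hypothetical stand-in for a costly check such as a rounding-mode probe.
struct call_counter {
  int calls = 0;
  bool expensive_check() { ++calls; return true; }
};

// Expensive check first: it runs on every input.
inline bool expensive_first(call_counter &c, int exponent) {
  return c.expensive_check() && exponent >= 0 && exponent <= 22;
}

// Cheap range test first: && short-circuits, so the expensive check
// only runs for inputs that pass the range test.
inline bool cheap_first(call_counter &c, int exponent) {
  return exponent >= 0 && exponent <= 22 && c.expensive_check();
}
```

Both functions return the same answer for every input; they differ only in how often the expensive predicate is evaluated, which matters when most inputs fail the cheap test.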

What I thought about is roughly (untested):

  if (binary_format<T>::min_exponent_fast_path() <= pns.exponent && pns.exponent <= binary_format<T>::max_exponent_fast_path() && !pns.too_many_digits) {
    // Unfortunately, the conventional Clinger's fast path is only possible
    // when the system rounds to the nearest float.
    if(detail::rounds_to_nearest())  {
      // We have that fegetround() == FE_TONEAREST.
      // Next is Clinger's fast path.
      if (pns.mantissa <=binary_format<T>::max_mantissa_fast_path()) {
        value = T(pns.mantissa);
        if (pns.exponent < 0) { value = value / binary_format<T>::exact_power_of_ten(-pns.exponent); }
        else { value = value * binary_format<T>::exact_power_of_ten(pns.exponent); }
        if (pns.negative) { value = -value; }
        return answer;
      }
    } else {
      // We do not have that fegetround() == FE_TONEAREST.
      // Next is a modified Clinger's fast path, inspired by Jakub Jelínek's proposal
      if (pns.exponent >= 0 && pns.mantissa <=binary_format<T>::max_mantissa_fast_path(pns.exponent)) {
#if (defined(_WIN32) && defined(__clang__))
        // ClangCL may map 0 to -0.0 when fegetround() == FE_DOWNWARD
        if(pns.mantissa == 0) {
          value = 0;
          return answer;
        }
#endif
        value = T(pns.mantissa) * binary_format<T>::exact_power_of_ten(pns.exponent);
        if (pns.negative) { value = -value; }
        return answer;
      }
    }
  }

Can you benchmark that against your version?

@lemire
Member Author

lemire commented Nov 18, 2022

It looks like it might be very slightly faster.

Test on an Apple M2 processor (LLVM 14).

Current PR:

-f data/canada.txt
fastfloat                               :  1312.83 MB/s (+/- 0.6 %)    75.44 Mfloat/s      16.71 i/B   304.92 i/f (+/- 0.0 %)      2.55 c/B    46.44 c/f (+/- 0.2 %)      6.57 i/c      3.50 GHz 
-f data/canada_short.txt
fastfloat                               :   770.62 MB/s (+/- 0.4 %)   143.18 Mfloat/s      30.64 i/B   172.94 i/f (+/- 0.0 %)      4.34 c/B    24.48 c/f (+/- 0.2 %)      7.07 i/c      3.50 GHz 
-f data/mesh.txt
fastfloat                               :   969.71 MB/s (+/- 1.8 %)   132.10 Mfloat/s      22.32 i/B   171.83 i/f (+/- 0.0 %)      3.35 c/B    25.80 c/f (+/- 1.6 %)      6.66 i/c      3.41 GHz 
-m uniform
fastfloat                               :  1833.26 MB/s (+/- 4.1 %)    87.38 Mfloat/s      13.50 i/B   297.04 i/f (+/- 0.0 %)      1.82 c/B    40.08 c/f (+/- 0.3 %)      7.41 i/c      3.50 GHz 
-m uniform -c
fastfloat                               :  1393.32 MB/s (+/- 1.0 %)    79.97 Mfloat/s      12.04 i/B   219.92 i/f (+/- 0.0 %)      2.33 c/B    42.62 c/f (+/- 0.3 %)      5.16 i/c      3.41 GHz 
-m simple_uniform32
fastfloat                               :  1834.72 MB/s (+/- 3.8 %)    87.45 Mfloat/s      13.50 i/B   297.04 i/f (+/- 0.0 %)      1.82 c/B    40.07 c/f (+/- 0.3 %)      7.41 i/c      3.50 GHz 
-m simple_uniform32 -c
fastfloat                               :  1437.16 MB/s (+/- 3.1 %)    82.49 Mfloat/s      12.04 i/B   219.87 i/f (+/- 0.0 %)      2.31 c/B    42.14 c/f (+/- 1.0 %)      5.22 i/c      3.48 GHz 
-m simple_int32
fastfloat                               :   969.05 MB/s (+/- 1.2 %)   104.30 Mfloat/s      21.09 i/B   205.43 i/f (+/- 0.0 %)      3.35 c/B    32.68 c/f (+/- 0.2 %)      6.29 i/c      3.41 GHz 

New proposal:

-f data/canada.txt
fastfloat                               :  1370.11 MB/s (+/- 0.3 %)    78.74 Mfloat/s      16.45 i/B   300.18 i/f (+/- 0.0 %)      2.44 c/B    44.50 c/f (+/- 0.2 %)      6.75 i/c      3.50 GHz 
-f data/canada_short.txt
fastfloat                               :   770.58 MB/s (+/- 3.5 %)   143.17 Mfloat/s      30.29 i/B   170.94 i/f (+/- 0.0 %)      4.34 c/B    24.48 c/f (+/- 0.3 %)      6.98 i/c      3.50 GHz 
-f data/mesh.txt
fastfloat                               :   964.19 MB/s (+/- 2.2 %)   131.35 Mfloat/s      22.06 i/B   169.83 i/f (+/- 0.0 %)      3.33 c/B    25.67 c/f (+/- 2.0 %)      6.62 i/c      3.37 GHz 
-m uniform
fastfloat                               :  1829.66 MB/s (+/- 1.5 %)    87.21 Mfloat/s      13.27 i/B   292.04 i/f (+/- 0.0 %)      1.78 c/B    39.08 c/f (+/- 0.2 %)      7.47 i/c      3.41 GHz 
-m uniform -c
fastfloat                               :  1417.53 MB/s (+/- 0.8 %)    81.36 Mfloat/s      11.87 i/B   216.86 i/f (+/- 0.0 %)      2.29 c/B    41.89 c/f (+/- 0.3 %)      5.18 i/c      3.41 GHz 
-m simple_uniform32
fastfloat                               :  1829.92 MB/s (+/- 0.9 %)    87.22 Mfloat/s      13.27 i/B   292.03 i/f (+/- 0.0 %)      1.78 c/B    39.08 c/f (+/- 0.3 %)      7.47 i/c      3.41 GHz 
-m simple_uniform32 -c
fastfloat                               :  1410.67 MB/s (+/- 0.7 %)    80.94 Mfloat/s      11.88 i/B   217.17 i/f (+/- 0.0 %)      2.30 c/B    42.11 c/f (+/- 0.2 %)      5.16 i/c      3.41 GHz 
-m simple_int32
fastfloat                               :   973.45 MB/s (+/- 1.4 %)   104.79 Mfloat/s      20.88 i/B   203.41 i/f (+/- 0.0 %)      3.34 c/B    32.52 c/f (+/- 0.2 %)      6.25 i/c      3.41 GHz 

I will test on x64 next.

@lemire
Member Author

lemire commented Nov 18, 2022

On an AMD Rome processor with GCC 9, the result is negative or, at least, we cannot conclude that the new approach is better.

Current PR:

-f data/canada.txt
fastfloat                               :   767.07 MB/s (+/- 0.2 %)    44.08 Mfloat/s      17.27 i/B   315.08 i/f (+/- 0.0 %)      4.21 c/B    76.84 c/f (+/- 0.2 %)      4.10 i/c      3.39 GHz
-f data/canada_short.txt
fastfloat                               :   439.85 MB/s (+/- 0.4 %)    81.72 Mfloat/s      34.12 i/B   192.58 i/f (+/- 0.0 %)      7.34 c/B    41.44 c/f (+/- 0.4 %)      4.65 i/c      3.39 GHz
-f data/mesh.txt
fastfloat                               :   587.95 MB/s (+/- 0.5 %)    80.09 Mfloat/s      24.43 i/B   188.01 i/f (+/- 0.0 %)      5.50 c/B    42.34 c/f (+/- 0.4 %)      4.44 i/c      3.39 GHz
-m uniform
fastfloat                               :   982.28 MB/s (+/- 1.0 %)    46.82 Mfloat/s      14.05 i/B   309.06 i/f (+/- 0.0 %)      3.29 c/B    72.36 c/f (+/- 0.2 %)      4.27 i/c      3.39 GHz
-m uniform -c
fastfloat                               :   873.73 MB/s (+/- 1.0 %)    50.14 Mfloat/s      13.10 i/B   239.33 i/f (+/- 0.0 %)      3.70 c/B    67.61 c/f (+/- 0.9 %)      3.54 i/c      3.39 GHz
-m simple_uniform32
fastfloat                               :   985.84 MB/s (+/- 0.3 %)    46.99 Mfloat/s      14.05 i/B   309.06 i/f (+/- 0.0 %)      3.28 c/B    72.11 c/f (+/- 0.3 %)      4.29 i/c      3.39 GHz
-m simple_uniform32 -c
fastfloat                               :   871.36 MB/s (+/- 1.0 %)    50.01 Mfloat/s      13.10 i/B   239.34 i/f (+/- 0.0 %)      3.71 c/B    67.75 c/f (+/- 0.1 %)      3.53 i/c      3.39 GHz
-m simple_int32
fastfloat                               :   564.83 MB/s (+/- 0.9 %)    60.80 Mfloat/s      23.11 i/B   225.15 i/f (+/- 0.0 %)      5.72 c/B    55.76 c/f (+/- 0.8 %)      4.04 i/c      3.39 GHz

New approach

-f data/canada.txt
fastfloat                               :   745.03 MB/s (+/- 0.4 %)    42.81 Mfloat/s      17.45 i/B   318.46 i/f (+/- 0.0 %)      4.34 c/B    79.15 c/f (+/- 0.3 %)      4.02 i/c      3.39 GHz
-f data/canada_short.txt
fastfloat                               :   425.19 MB/s (+/- 0.8 %)    79.00 Mfloat/s      33.59 i/B   189.58 i/f (+/- 0.0 %)      7.60 c/B    42.86 c/f (+/- 0.8 %)      4.42 i/c      3.39 GHz
-f data/mesh.txt
fastfloat                               :   573.93 MB/s (+/- 0.5 %)    78.18 Mfloat/s      24.04 i/B   185.01 i/f (+/- 0.0 %)      5.63 c/B    43.35 c/f (+/- 0.4 %)      4.27 i/c      3.39 GHz
-m uniform
fastfloat                               :   972.97 MB/s (+/- 0.3 %)    46.37 Mfloat/s      14.23 i/B   313.06 i/f (+/- 0.0 %)      3.32 c/B    73.08 c/f (+/- 0.3 %)      4.28 i/c      3.39 GHz
-m uniform -c
fastfloat                               :   857.85 MB/s (+/- 0.6 %)    49.23 Mfloat/s      13.06 i/B   238.65 i/f (+/- 0.0 %)      3.77 c/B    68.82 c/f (+/- 0.6 %)      3.47 i/c      3.39 GHz
-m simple_uniform32
fastfloat                               :   973.14 MB/s (+/- 0.4 %)    46.38 Mfloat/s      14.23 i/B   313.06 i/f (+/- 0.0 %)      3.31 c/B    72.91 c/f (+/- 0.3 %)      4.29 i/c      3.38 GHz
-m simple_uniform32 -c
fastfloat                               :   856.81 MB/s (+/- 0.5 %)    49.17 Mfloat/s      13.07 i/B   238.83 i/f (+/- 0.0 %)      3.77 c/B    68.91 c/f (+/- 0.5 %)      3.47 i/c      3.39 GHz
-m simple_int32
fastfloat                               :   563.99 MB/s (+/- 1.1 %)    60.71 Mfloat/s      22.81 i/B   222.15 i/f (+/- 0.0 %)      5.73 c/B    55.85 c/f (+/- 1.0 %)      3.98 i/c      3.39 GHz

@lemire
Member Author

lemire commented Nov 18, 2022

On a Graviton 3 processor with GCC 11...

This PR:

-f data/canada.txt
fastfloat                               :   937.06 MB/s (+/- 0.8 %)    53.85 Mfloat/s      13.60 i/B   248.17 i/f (+/- 0.0 %)      2.64 c/B    48.21 c/f (+/- 0.3 %)      5.15 i/c      2.60 GHz 
-f data/canada_short.txt
fastfloat                               :   592.17 MB/s (+/- 0.9 %)   110.02 Mfloat/s      24.82 i/B   140.08 i/f (+/- 0.0 %)      4.19 c/B    23.63 c/f (+/- 0.2 %)      5.93 i/c      2.60 GHz 
-f data/mesh.txt
fastfloat                               :   858.53 MB/s (+/- 1.2 %)   116.95 Mfloat/s      17.58 i/B   135.32 i/f (+/- 0.0 %)      2.89 c/B    22.22 c/f (+/- 0.5 %)      6.09 i/c      2.60 GHz 
-m uniform
fastfloat                               :  1336.57 MB/s (+/- 2.9 %)    63.70 Mfloat/s      10.64 i/B   234.05 i/f (+/- 0.0 %)      1.86 c/B    40.81 c/f (+/- 0.2 %)      5.73 i/c      2.60 GHz 
-m uniform -c
fastfloat                               :  1159.58 MB/s (+/- 1.0 %)    66.55 Mfloat/s       9.15 i/B   167.23 i/f (+/- 0.0 %)      2.14 c/B    39.07 c/f (+/- 0.4 %)      4.28 i/c      2.60 GHz 
-m simple_uniform32
fastfloat                               :  1337.07 MB/s (+/- 0.9 %)    63.73 Mfloat/s      10.64 i/B   234.04 i/f (+/- 0.0 %)      1.85 c/B    40.80 c/f (+/- 0.1 %)      5.74 i/c      2.60 GHz 
-m simple_uniform32 -c
fastfloat                               :  1162.75 MB/s (+/- 1.4 %)    66.73 Mfloat/s       9.16 i/B   167.28 i/f (+/- 0.0 %)      2.13 c/B    38.96 c/f (+/- 0.5 %)      4.29 i/c      2.60 GHz 
-m simple_int32
fastfloat                               :   715.36 MB/s (+/- 1.2 %)    77.00 Mfloat/s      18.08 i/B   176.17 i/f (+/- 0.0 %)      3.47 c/B    33.77 c/f (+/- 0.4 %)      5.22 i/c      2.60 GHz 

With new proposal...

-f data/canada.txt
fastfloat                               :   944.94 MB/s (+/- 0.9 %)    54.30 Mfloat/s      13.59 i/B   247.99 i/f (+/- 0.0 %)      2.62 c/B    47.85 c/f (+/- 0.3 %)      5.18 i/c      2.60 GHz 
-f data/canada_short.txt
fastfloat                               :   602.92 MB/s (+/- 0.9 %)   112.02 Mfloat/s      24.47 i/B   138.08 i/f (+/- 0.0 %)      4.11 c/B    23.21 c/f (+/- 0.2 %)      5.95 i/c      2.60 GHz 
-f data/mesh.txt
fastfloat                               :   864.58 MB/s (+/- 1.2 %)   117.78 Mfloat/s      17.32 i/B   133.32 i/f (+/- 0.0 %)      2.87 c/B    22.06 c/f (+/- 0.4 %)      6.04 i/c      2.60 GHz 
-m uniform
fastfloat                               :  1341.72 MB/s (+/- 1.6 %)    63.95 Mfloat/s      10.64 i/B   234.05 i/f (+/- 0.0 %)      1.85 c/B    40.66 c/f (+/- 0.9 %)      5.76 i/c      2.60 GHz 
-m uniform -c
fastfloat                               :  1183.55 MB/s (+/- 1.2 %)    67.91 Mfloat/s       9.08 i/B   166.00 i/f (+/- 0.0 %)      2.09 c/B    38.21 c/f (+/- 0.8 %)      4.34 i/c      2.60 GHz 
-m simple_uniform32
fastfloat                               :  1346.73 MB/s (+/- 1.9 %)    64.19 Mfloat/s      10.64 i/B   234.05 i/f (+/- 0.0 %)      1.84 c/B    40.44 c/f (+/- 1.3 %)      5.79 i/c      2.60 GHz 
-m simple_uniform32 -c
fastfloat                               :  1182.90 MB/s (+/- 1.2 %)    67.89 Mfloat/s       9.08 i/B   165.86 i/f (+/- 0.0 %)      2.10 c/B    38.30 c/f (+/- 0.6 %)      4.33 i/c      2.60 GHz 
-m simple_int32
fastfloat                               :   723.10 MB/s (+/- 0.7 %)    77.84 Mfloat/s      17.88 i/B   174.15 i/f (+/- 0.0 %)      3.43 c/B    33.40 c/f (+/- 0.1 %)      5.21 i/c      2.60 GHz 

@jakubjelinek

If the benchmark results are inconclusive, just pick whatever you think is more maintainable or more readable.
Perhaps testing some contemporary Intel CPU would be useful too.

@lemire
Member Author

lemire commented Nov 18, 2022

On a small Intel Ice Lake node... GCC 11...

This PR:

-f data/canada.txt
fastfloat                               :   955.58 MB/s (+/- 1.0 %)    54.91 Mfloat/s  
-f data/canada_short.txt
fastfloat                               :   591.95 MB/s (+/- 1.0 %)   109.98 Mfloat/s  
-f data/mesh.txt
fastfloat                               :   837.77 MB/s (+/- 5.0 %)   114.13 Mfloat/s  
-m uniform
fastfloat                               :  1204.72 MB/s (+/- 0.8 %)    57.42 Mfloat/s  
-m uniform -c
fastfloat                               :  1023.39 MB/s (+/- 2.3 %)    58.72 Mfloat/s  
-m simple_uniform32
fastfloat                               :  1224.47 MB/s (+/- 1.2 %)    58.36 Mfloat/s  
-m simple_uniform32 -c
fastfloat                               :  1028.27 MB/s (+/- 2.2 %)    59.02 Mfloat/s  
-m simple_int32
fastfloat                               :   777.77 MB/s (+/- 2.2 %)    83.70 Mfloat/s  

With the proposal...

-f data/canada.txt
fastfloat                               :  1049.34 MB/s (+/- 2.3 %)    60.30 Mfloat/s  
-f data/canada_short.txt
fastfloat                               :   651.84 MB/s (+/- 1.5 %)   121.11 Mfloat/s  
-f data/mesh.txt
fastfloat                               :   851.29 MB/s (+/- 2.7 %)   115.97 Mfloat/s  
-m uniform
fastfloat                               :  1393.54 MB/s (+/- 4.4 %)    66.42 Mfloat/s  
-m uniform -c
fastfloat                               :  1136.02 MB/s (+/- 2.6 %)    65.20 Mfloat/s  
-m simple_uniform32
fastfloat                               :  1396.41 MB/s (+/- 4.6 %)    66.56 Mfloat/s  
-m simple_uniform32 -c
fastfloat                               :  1131.92 MB/s (+/- 2.1 %)    64.97 Mfloat/s  
-m simple_int32
fastfloat                               :   786.75 MB/s (+/- 2.0 %)    84.68 Mfloat/s

@lemire
Member Author

lemire commented Nov 18, 2022

So for recent compilers and recent processors, the new proposal is beneficial. It might be slightly negative on other systems. I will adopt it.

@lemire
Member Author

lemire commented Nov 18, 2022

@jakubjelinek I have adopted your approach.

I will leave this PR open for a little bit. I am inviting more comments.

Comment on lines 81 to 83
// There might be other ways to prevent compile-time optimizations (e.g., asm).
// The value does not need to be std::numeric_limits<float>::min(), any small
// value so that 1 + x should round to 1 would do.

This actually isn't true. The problem is with excess precision. E.g. on i?86 32-bit (or x86-64 with -mfpmath=387) floats are evaluated to the precision of long double (with -fexcess-precision=standard) or could be either way (otherwise).
With std::numeric_limits<float>::min() it will work in this case, because even i?86 long double has just a 64-bit mantissa and, because std::numeric_limits<float>::min() is 2^-126, both results will still round to 1.0 in round-to-nearest mode.
But if the value wasn't that small, but say just 2^-60, it wouldn't be true anymore. Fortunately it would then return false and so result in the safer version despite perhaps being really FE_TONEAREST.
2^-126 is good even if some target hypothetically evaluates everything in IEEE quad (113 bit mantissa) and no hw AFAIK implements 256-bit floats right now.
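
The excess-precision argument can be checked numerically. The sketch below (with a hypothetical helper name) asks whether adding 2^-k to 1 leaves it unchanged in a given floating-point type under round-to-nearest. Results are stored through `volatile` variables of type `T` so that any excess precision is discarded and the answer reflects the named type only.

```cpp
#include <cassert>
#include <cmath>
#include <limits>

// Hypothetical helper: true when 1 + 2^(-k), rounded to type T in
// round-to-nearest mode, is equal to 1.
template <typename T>
bool vanishes_next_to_one(int k) {
  volatile T x = std::ldexp(T(1), -k);  // 2^-k in type T
  volatile T sum = T(1) + x;            // stored, so rounded to T
  return sum == T(1);
}
```

With a p-bit mantissa the sum collapses to 1 exactly when k is at least p. On targets where `long double` has a 64-bit mantissa, `vanishes_next_to_one<long double>(60)` is therefore false, which is the 2^-60 situation described above, whereas 2^-126 vanishes for `float`, for `double`, and for any format with up to 113 mantissa bits.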

Member Author

Point taken. I have added a remark in a follow-up commit which states "after accounting for excess precision".

Comment on lines 91 to 99
// fmin + 1.0f = 0x1.00001 (1.00001)
// 1.0f - fmin = 0x1 (1)
//
// FE_DOWNWARD or FE_TOWARDZERO:
// fmin + 1.0f = 0x1 (1)
// 1.0f - fmin = 0x0.999999 (0.999999)
//
// fmin + 1.0f = 0x1 (1)
// 1.0f - fmin = 0x0.999999 (0.999999)

On most CPUs float is the IEEE single format, so in the fmin + 1.0f FE_UPWARD case it would be better to mention 0x1.000002 rather than 0x1.00001, because that is what one really gets in that format after the addition. The decimal representation of that is 1.00000011920928955078125, if you want to mention it. And 0x0.999999 is certainly wrong; the right value is 0x1.fffffep-1 in hexadecimal (that is, 0x0.ffffff) and 0.999999940395355224609375 in decimal.
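
The values quoted here can be verified with a short sketch: in IEEE single precision the neighbours of 1.0f are 1 + 2^-23 above and 1 - 2^-24 below. The code assumes that `float` is IEEE binary32, and the function names are illustrative.

```cpp
#include <cassert>
#include <cmath>
#include <limits>

// Next representable float above 1.0f; for IEEE binary32 this is 1 + 2^-23.
inline float next_above_one() { return std::nextafter(1.0f, 2.0f); }

// Next representable float below 1.0f; for IEEE binary32 this is 1 - 2^-24.
inline float next_below_one() { return std::nextafter(1.0f, 0.0f); }
```

These are the results one would expect from fmin + 1.0f under FE_UPWARD and from 1.0f - fmin under FE_DOWNWARD or FE_TOWARDZERO, respectively.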

Member Author

Let us take this stuff out. It is too long.

Note that my latest commit adds an optimization.

Member Author

I simplified the explanation in the latest commit.

@lemire
Member Author

lemire commented Nov 18, 2022

This new code is slightly insane. :-) Lots and lots of complexity. Ah well.

@biojppm
Contributor

biojppm commented Nov 20, 2022

Would it make sense to add a test to cover what happens when -ffast-math is enabled? The relaxed rounding mode may cause the behavior to change.

@lemire
Member Author

lemire commented Nov 20, 2022

@biojppm My understanding of fast-math is that it does away with the standards. The rounding is no longer guaranteed to be exact. And we leave the bounds of the C++ standards. I could be wrong but, if so, I’d love a reference.

@biojppm
Contributor

biojppm commented Nov 21, 2022

Indeed -ffast-math will be a soft kind of UB for IEEE behavior. But then what kind of impact will it have, and how frequently? Will the results be wrong, or slower to come by, or both?

E.g., if the wrong answer is given, then the result would likely be that the current rounding mode is ignored (correct me if I'm wrong here), and that may or may not be bad, depending on the user's requirements.

But that is if the compiler introduces the "right" optimization. Will it? And how likely is it to do it?

If we add a test covering -ffast-math and correctness, then at least we will know some of these answers. That is, if there is a tradeoff that involves correctness or speed when -ffast-math is enabled, the users should know about it, along with some notes in the README.

And if it is evident from the outset that there is indeed a certain tradeoff, and what it looks like, then we can skip the tests, but that's an even stronger argument for making the users aware of it.

@jakubjelinek

I have tried on godbolt

int
foo (float x)
{
  return 1.0f - x == 1.0f + x;
}

with -O3 -ffast-math and all of GCC, clang, MSVC and ICC perform subtraction, addition, comparison.
Yes, in theory, with -ffast-math and the assumption that the rounding mode is to nearest and (from the -ffast-math restrictions) that there are no NaNs or infinities and that the sign of zero is unimportant, a compiler could perhaps fold that to just return 1.0f + x == 1.0f; (or return 1.0f - x == 1.0f;) and avoid one operation, but no compiler I've tried does that right now.
That said, I'd say that with -ffast-math it is also fine if from_chars doesn't round to nearest in non-default rounding modes; with -ffast-math one says that some inaccuracy is acceptable if it can result in faster code.

@lemire
Member Author

lemire commented Nov 21, 2022

I have added a fast-math test to our CI.

@lemire
Member Author

lemire commented Nov 23, 2022

Merging. I will issue a new release.

@lemire lemire merged commit 8f092d2 into main Nov 23, 2022
@lemire
Member Author

lemire commented Nov 23, 2022

It has been released. Thanks to all.

@lemire lemire deleted the dlemire/renabling_clinger branch January 28, 2023 01:50