Conditional Clinger's fast path #153

lemire · 2022-11-16T17:25:07Z

As remarked by @jakubjelinek, we cannot unconditionally use the conventional Clinger's fast path because it assumes that the rounding mode is set to 'nearest' (which is the universal default). Now, nothing stops a user from changing the rounding mode, but we need our code to produce the same result (rounding to nearest) irrespective of the system's rounding mode.

Checking that fegetround() == FE_TONEAREST is too expensive on some systems.

A better solution was proposed by @mwalcott3 and involves doing an addition, a subtraction and a comparison to verify that the rounding mode is set to nearest. Because parsing a float might already involve hundreds of instructions, this check is relatively cheap. The resulting branch is also invariably well predicted.
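A minimal sketch of the addition/subtraction/comparison probe described above, assuming IEEE 754 `float` arithmetic. The name `rounds_to_nearest` matches the call that appears later in this thread; the body and the helper `probe_under_mode` are illustrative assumptions, not necessarily the code in the PR.

```cpp
#include <cassert>
#include <cfenv>
#include <limits>

// Probe the current rounding mode using only floating-point arithmetic.
// The volatile read keeps the compiler from folding the expression at
// compile time (where it would assume round-to-nearest).
inline bool rounds_to_nearest() noexcept {
  static volatile float min_normal = std::numeric_limits<float>::min(); // 2^-126
  float tiny = min_normal;
  // to nearest:             1 + tiny == 1 and 1 - tiny == 1, both sides match.
  // upward:                 1 + tiny  > 1 while 1 - tiny == 1.
  // downward / toward zero: 1 + tiny == 1 while 1 - tiny  < 1.
  return (tiny + 1.0f) == (1.0f - tiny);
}

// Hypothetical helper for experimenting: run the probe under `mode`,
// then restore the previous rounding mode.
inline bool probe_under_mode(int mode) {
  const int saved = std::fegetround();
  const int rc = std::fesetround(mode);
  assert(rc == 0 && "rounding mode not supported on this target");
  (void)rc;
  volatile bool result = rounds_to_nearest(); // force evaluation before restoring
  std::fesetround(saved);
  return result;
}
```

With this sketch, `probe_under_mode(FE_TONEAREST)` is expected to be true and the three directed modes false, which is the same answer as `fegetround() == FE_TONEAREST` without the library call.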

This allows us to bring back Clinger's fast path when the test passes (which is always the case in practice). When it does not pass, we can still use another 'Clinger-like' fast path (based on a proposal by @jakubjelinek).
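To make the two paths concrete, here is a rough sketch for `double`, under my own assumptions: the function names, the power-of-ten table and the exactness test (mantissa times 5^e must not exceed 2^53) are illustrative and are not taken from the library's source.

```cpp
#include <cassert>
#include <cstdint>

// Powers of ten that are exactly representable as double (10^0 .. 10^22).
static const double kPow10[] = {
    1e0,  1e1,  1e2,  1e3,  1e4,  1e5,  1e6,  1e7,  1e8,  1e9,  1e10, 1e11,
    1e12, 1e13, 1e14, 1e15, 1e16, 1e17, 1e18, 1e19, 1e20, 1e21, 1e22};

constexpr std::uint64_t kMaxMantissa = std::uint64_t(1) << 53;

// Classic Clinger fast path: one multiplication or division of two exact
// doubles. The single rounding is the desired one only under round-to-nearest.
inline bool clinger_fast_path(std::uint64_t mantissa, int exponent, double &out) {
  if (exponent < -22 || exponent > 22 || mantissa > kMaxMantissa) return false;
  const double value = double(mantissa);
  out = (exponent < 0) ? value / kPow10[-exponent] : value * kPow10[exponent];
  return true;
}

// Restricted variant: accept only inputs whose product is itself exactly
// representable, so no rounding occurs and the rounding mode is irrelevant.
inline bool exact_fast_path(std::uint64_t mantissa, int exponent, double &out) {
  if (exponent < 0 || exponent > 22) return false;
  std::uint64_t pow5 = 1;
  for (int i = 0; i < exponent; ++i) pow5 *= 5; // 5^22 fits in 64 bits
  // mantissa * 10^e = (mantissa * 5^e) * 2^e is exact when mantissa * 5^e <= 2^53.
  if (mantissa > kMaxMantissa / pow5) return false;
  out = double(mantissa) * kPow10[exponent];
  return true;
}
```

The restricted variant accepts a strict subset of the inputs the classic path accepts, which is why it can serve as the fallback when the rounding mode is not 'nearest'.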

The result is increased performance when Clinger's fast path applies and the 'Clinger-like' fast path would not.

It can cause a slight performance regression in cases (such as 'canada.txt') where the fast path is not beneficial, because we now have a more expensive check, though this may depend on the compiler. There might be ways to micro-optimize this better than what the current PR proposes.

Ice Lake processor, GCC 11:

Current code:

-f data/canada.txt
fastfloat                               :   933.44 MB/s (+/- 1.4 %)    53.64 Mfloat/s  
-f data/canada_short.txt
fastfloat                               :   374.20 MB/s (+/- 1.2 %)    69.53 Mfloat/s  
-f data/mesh.txt
fastfloat                               :   676.27 MB/s (+/- 1.7 %)    92.13 Mfloat/s  
-m uniform
fastfloat                               :  1219.35 MB/s (+/- 2.3 %)    58.12 Mfloat/s  
-m uniform -c
fastfloat                               :   951.23 MB/s (+/- 2.4 %)    54.59 Mfloat/s  
-m simple_uniform32
fastfloat                               :  1225.69 MB/s (+/- 3.0 %)    58.42 Mfloat/s  
-m simple_uniform32 -c
fastfloat                               :   944.60 MB/s (+/- 2.1 %)    54.22 Mfloat/s  
-m simple_int32
fastfloat                               :   771.49 MB/s (+/- 3.1 %)    83.05 Mfloat/s

This PR:

-f data/canada.txt
fastfloat                               :   953.92 MB/s (+/- 12.2 %)    54.82 Mfloat/s  
-f data/canada_short.txt
fastfloat                               :   584.63 MB/s (+/- 7.2 %)   108.62 Mfloat/s  
-f data/mesh.txt
fastfloat                               :   809.69 MB/s (+/- 3.6 %)   110.30 Mfloat/s  
-m uniform
fastfloat                               :  1204.92 MB/s (+/- 2.1 %)    57.43 Mfloat/s  
-m uniform -c
fastfloat                               :  1025.67 MB/s (+/- 2.2 %)    58.88 Mfloat/s  
-m simple_uniform32
fastfloat                               :  1222.26 MB/s (+/- 2.0 %)    58.26 Mfloat/s  
-m simple_uniform32 -c
fastfloat                               :  1019.27 MB/s (+/- 5.3 %)    58.49 Mfloat/s  
-m simple_int32
fastfloat                               :   772.84 MB/s (+/- 15.4 %)    83.20 Mfloat/s

Using fegetround() == FE_TONEAREST to guard Clinger's (slow):

-f data/canada.txt
fastfloat                               :   778.68 MB/s (+/- 1.8 %)    44.75 Mfloat/s  
-f data/canada_short.txt
fastfloat                               :   482.55 MB/s (+/- 1.3 %)    89.66 Mfloat/s  
-f data/mesh.txt
fastfloat                               :   631.72 MB/s (+/- 2.0 %)    86.06 Mfloat/s  
-m uniform
fastfloat                               :   915.72 MB/s (+/- 5.3 %)    43.65 Mfloat/s  
-m uniform -c
fastfloat                               :   992.28 MB/s (+/- 3.0 %)    56.94 Mfloat/s  
-m simple_uniform32
fastfloat                               :   915.51 MB/s (+/- 1.1 %)    43.64 Mfloat/s  
-m simple_uniform32 -c
fastfloat                               :   992.24 MB/s (+/- 3.2 %)    56.95 Mfloat/s  
-m simple_int32
fastfloat                               :   693.57 MB/s (+/- 3.6 %)    74.67 Mfloat/s

Apple M2 processor, LLVM 14:

current code:

-f data/canada.txt
fastfloat                               :  1387.19 MB/s (+/- 0.7 %)    79.72 Mfloat/s      15.90 i/B   290.11 i/f (+/- 0.0 %)      2.41 c/B    43.96 c/f (+/- 0.5 %)      6.60 i/c      3.50 GHz
-f data/canada_short.txt
fastfloat                               :   615.83 MB/s (+/- 1.1 %)   114.42 Mfloat/s      37.59 i/B   212.15 i/f (+/- 0.0 %)      5.28 c/B    29.79 c/f (+/- 0.3 %)      7.12 i/c      3.41 GHz
-f data/mesh.txt
fastfloat                               :   878.58 MB/s (+/- 1.9 %)   119.69 Mfloat/s      24.01 i/B   184.79 i/f (+/- 0.0 %)      3.69 c/B    28.40 c/f (+/- 1.8 %)      6.51 i/c      3.40 GHz
-m uniform
fastfloat                               :  1894.65 MB/s (+/- 0.7 %)    90.30 Mfloat/s      12.59 i/B   277.04 i/f (+/- 0.0 %)      1.72 c/B    37.74 c/f (+/- 0.2 %)      7.34 i/c      3.41 GHz
-m uniform -c
fastfloat                               :  1404.54 MB/s (+/- 0.6 %)    80.61 Mfloat/s      13.08 i/B   239.04 i/f (+/- 0.0 %)      2.31 c/B    42.28 c/f (+/- 0.2 %)      5.65 i/c      3.41 GHz
-m simple_uniform32
fastfloat                               :  1894.15 MB/s (+/- 0.4 %)    90.28 Mfloat/s      12.59 i/B   277.04 i/f (+/- 0.0 %)      1.72 c/B    37.75 c/f (+/- 0.2 %)      7.34 i/c      3.41 GHz
-m simple_uniform32 -c
fastfloat                               :  1414.14 MB/s (+/- 1.4 %)    81.16 Mfloat/s      13.08 i/B   239.01 i/f (+/- 0.0 %)      2.30 c/B    42.00 c/f (+/- 1.0 %)      5.69 i/c      3.41 GHz
-m simple_int32
fastfloat                               :   998.10 MB/s (+/- 0.8 %)   107.46 Mfloat/s      20.17 i/B   196.39 i/f (+/- 0.0 %)      3.26 c/B    31.72 c/f (+/- 0.2 %)      6.19 i/c      3.41 GHz

this PR:

-f data/canada.txt
fastfloat                               :  1312.35 MB/s (+/- 3.7 %)    75.42 Mfloat/s      16.71 i/B   304.92 i/f (+/- 0.0 %)      2.52 c/B    46.06 c/f (+/- 1.0 %)      6.62 i/c      3.47 GHz
-f data/canada_short.txt
fastfloat                               :   764.14 MB/s (+/- 3.6 %)   141.98 Mfloat/s      30.64 i/B   172.94 i/f (+/- 0.0 %)      4.34 c/B    24.48 c/f (+/- 0.4 %)      7.07 i/c      3.48 GHz
-f data/mesh.txt
fastfloat                               :   964.26 MB/s (+/- 1.3 %)   131.36 Mfloat/s      22.32 i/B   171.83 i/f (+/- 0.0 %)      3.34 c/B    25.72 c/f (+/- 1.8 %)      6.68 i/c      3.38 GHz
-m uniform
fastfloat                               :  1784.08 MB/s (+/- 0.8 %)    85.03 Mfloat/s      13.50 i/B   297.04 i/f (+/- 0.0 %)      1.82 c/B    40.08 c/f (+/- 0.1 %)      7.41 i/c      3.41 GHz
-m uniform -c
fastfloat                               :  1409.95 MB/s (+/- 1.2 %)    80.93 Mfloat/s      12.03 i/B   219.84 i/f (+/- 0.0 %)      2.31 c/B    42.11 c/f (+/- 0.9 %)      5.22 i/c      3.41 GHz
-m simple_uniform32
fastfloat                               :  1783.64 MB/s (+/- 0.4 %)    85.01 Mfloat/s      13.50 i/B   297.04 i/f (+/- 0.0 %)      1.82 c/B    40.09 c/f (+/- 0.2 %)      7.41 i/c      3.41 GHz
-m simple_uniform32 -c
fastfloat                               :  1409.90 MB/s (+/- 1.2 %)    80.91 Mfloat/s      12.03 i/B   219.90 i/f (+/- 0.0 %)      2.31 c/B    42.13 c/f (+/- 0.8 %)      5.22 i/c      3.41 GHz
-m simple_int32
fastfloat                               :   969.15 MB/s (+/- 0.8 %)   104.31 Mfloat/s      21.09 i/B   205.43 i/f (+/- 0.0 %)      3.35 c/B    32.67 c/f (+/- 0.2 %)      6.29 i/c      3.41 GHz

cc @jakubjelinek
credit to @mwalcott3

lemire · 2022-11-16T19:38:22Z

I think I found an issue in ClangCL + Win32 where double(0) becomes -0.0 when fegetround() == FE_DOWNWARD.

I suppose that it could be argued to be correct in some sense.
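The thread does not say why ClangCL produces -0.0 here. The sketch below only illustrates the IEEE 754 rule that could make a negative zero "correct in some sense": an exact zero difference is -0 when rounding toward negative infinity. Whether the ClangCL conversion actually goes through such a subtraction is an assumption on my part, and `zero_difference_is_negative` is a hypothetical helper.

```cpp
#include <cassert>
#include <cfenv>
#include <cmath>

// Compute x - x under the given rounding mode and report the sign bit of
// the (exactly zero) result.
inline bool zero_difference_is_negative(int mode, double x) {
  volatile double v = x; // volatile: keep the subtraction at run time
  const int saved = std::fegetround();
  std::fesetround(mode);
  volatile double diff = v - v; // volatile store: evaluate before restoring
  std::fesetround(saved);
  const double d = diff;
  return std::signbit(d);
}
```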

jakubjelinek · 2022-11-16T20:20:27Z

Wouldn't it be better to perform most or all of the old/new Clinger's fast path checks first, and only if they are all satisfied call the function that checks the rounding mode through floating-point ops? The limited fast path that works in all rounding modes ought to be a subset of the other.
Of course, only if it improves the benchmarks.

lemire · 2022-11-16T20:36:53Z

@jakubjelinek

I am concerned that there might be interesting micro-optimizations that I am leaving on the table. Thanks for your comment.

Ok. So currently we do this...

if(detail::rounds_to_nearest()) {
      //
      // This is where we end up all of the time
      //
      if (binary_format<T>::min_exponent_fast_path() <= pns.exponent && pns.exponent <= binary_format<T>::max_exponent_fast_path() && pns.mantissa <=binary_format<T>::max_mantissa_fast_path() && !pns.too_many_digits) { ... }
} else {
      // very uncommon case (you never get here in practice)
      //
      if (pns.exponent >= 0 && pns.exponent <= binary_format<T>::max_exponent_fast_path() && pns.mantissa <=binary_format<T>::max_mantissa_fast_path(pns.exponent) && !pns.too_many_digits) {...}
}

Are you proposing we do...

  if (pns.exponent >= 0 && pns.exponent <= binary_format<T>::max_exponent_fast_path() && pns.mantissa <=binary_format<T>::max_mantissa_fast_path(pns.exponent) && !pns.too_many_digits) { ... }
  if(detail::rounds_to_nearest()) {
      if (binary_format<T>::min_exponent_fast_path() <= pns.exponent && pns.exponent <= binary_format<T>::max_exponent_fast_path() && pns.mantissa <=binary_format<T>::max_mantissa_fast_path() && !pns.too_many_digits) { ... }
  }

It seems to me that in the case where these fast paths do not apply, we will end up with more overhead (e.g., we still need to compute detail::rounds_to_nearest()). In the case where the modified fast path works, we save the cost of the detail::rounds_to_nearest()... but that's the only bright side...

One concern is that if you have too many paths, you may end up with more mispredicted branches. So you do not want to make the code much more branchy (that is, I don't want to add 'hard to predict' branches).

lemire · 2022-11-16T20:44:50Z

@jakubjelinek My current impression is that given good enough code generation, the detail::rounds_to_nearest() call ought to be free. It should always return true, and it is only a handful of instructions, so only a few percent of our instruction count, and there is no data dependency.

jakubjelinek · 2022-11-18T10:38:15Z

Because && and || are short-circuiting, the general rule is that the subexpressions of an && or || condition should be sorted from cheapest to most expensive, or so that the one which short-circuits most often comes first.

So, when you are micro-optimizing, which is what all this is about, if in real-world usage strings that can't use either form of Clinger's fast path are common, doing detail::rounds_to_nearest() as the first check (so unconditionally) might not be best.
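The ordering rule can be illustrated with a small counter. This is a toy sketch: `Probe`, `cheap_first` and `expensive_first` are hypothetical names, and the "expensive" predicate merely stands in for a call such as detail::rounds_to_nearest().

```cpp
#include <cassert>

// Counts how often the "expensive" predicate actually runs.
struct Probe {
  int expensive_calls = 0;
  bool expensive() { ++expensive_calls; return true; }
};

// Cheap range test first: the expensive predicate only runs for inputs
// that already passed the range test.
inline bool cheap_first(Probe &p, int exponent) {
  return (exponent >= -22 && exponent <= 22) && p.expensive();
}

// Expensive predicate first: it runs for every input, even those the
// range test would have rejected.
inline bool expensive_first(Probe &p, int exponent) {
  return p.expensive() && (exponent >= -22 && exponent <= 22);
}
```

Both functions return the same answer for every input; only the number of times the expensive predicate runs differs, and that difference grows with the share of inputs that fail the cheap test.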

What I thought about is roughly (untested):

  if (binary_format<T>::min_exponent_fast_path() <= pns.exponent && pns.exponent <= binary_format<T>::max_exponent_fast_path() && !pns.too_many_digits) {
    // Unfortunately, the conventional Clinger's fast path is only possible
    // when the system rounds to the nearest float.
    if(detail::rounds_to_nearest())  {
      // We have that fegetround() == FE_TONEAREST.
      // Next is Clinger's fast path.
      if (pns.mantissa <=binary_format<T>::max_mantissa_fast_path()) {
        value = T(pns.mantissa);
        if (pns.exponent < 0) { value = value / binary_format<T>::exact_power_of_ten(-pns.exponent); }
        else { value = value * binary_format<T>::exact_power_of_ten(pns.exponent); }
        if (pns.negative) { value = -value; }
        return answer;
      }
    } else {
      // We do not have that fegetround() == FE_TONEAREST.
      // Next is a modified Clinger's fast path, inspired by Jakub Jelínek's proposal
      if (pns.exponent >= 0 && pns.mantissa <=binary_format<T>::max_mantissa_fast_path(pns.exponent)) {
#if (defined(_WIN32) && defined(__clang__))
        // ClangCL may map 0 to -0.0 when fegetround() == FE_DOWNWARD
        if(pns.mantissa == 0) {
          value = 0;
          return answer;
        }
#endif
        value = T(pns.mantissa) * binary_format<T>::exact_power_of_ten(pns.exponent);
        if (pns.negative) { value = -value; }
        return answer;
      }
    }
  }

Can you benchmark that against your version?

lemire · 2022-11-18T15:08:02Z

It looks like it might be very slightly faster.

Test on an Apple M2 processor (LLVM 14).

Current PR:

-f data/canada.txt
fastfloat                               :  1312.83 MB/s (+/- 0.6 %)    75.44 Mfloat/s      16.71 i/B   304.92 i/f (+/- 0.0 %)      2.55 c/B    46.44 c/f (+/- 0.2 %)      6.57 i/c      3.50 GHz 
-f data/canada_short.txt
fastfloat                               :   770.62 MB/s (+/- 0.4 %)   143.18 Mfloat/s      30.64 i/B   172.94 i/f (+/- 0.0 %)      4.34 c/B    24.48 c/f (+/- 0.2 %)      7.07 i/c      3.50 GHz 
-f data/mesh.txt
fastfloat                               :   969.71 MB/s (+/- 1.8 %)   132.10 Mfloat/s      22.32 i/B   171.83 i/f (+/- 0.0 %)      3.35 c/B    25.80 c/f (+/- 1.6 %)      6.66 i/c      3.41 GHz 
-m uniform
fastfloat                               :  1833.26 MB/s (+/- 4.1 %)    87.38 Mfloat/s      13.50 i/B   297.04 i/f (+/- 0.0 %)      1.82 c/B    40.08 c/f (+/- 0.3 %)      7.41 i/c      3.50 GHz 
-m uniform -c
fastfloat                               :  1393.32 MB/s (+/- 1.0 %)    79.97 Mfloat/s      12.04 i/B   219.92 i/f (+/- 0.0 %)      2.33 c/B    42.62 c/f (+/- 0.3 %)      5.16 i/c      3.41 GHz 
-m simple_uniform32
fastfloat                               :  1834.72 MB/s (+/- 3.8 %)    87.45 Mfloat/s      13.50 i/B   297.04 i/f (+/- 0.0 %)      1.82 c/B    40.07 c/f (+/- 0.3 %)      7.41 i/c      3.50 GHz 
-m simple_uniform32 -c
fastfloat                               :  1437.16 MB/s (+/- 3.1 %)    82.49 Mfloat/s      12.04 i/B   219.87 i/f (+/- 0.0 %)      2.31 c/B    42.14 c/f (+/- 1.0 %)      5.22 i/c      3.48 GHz 
-m simple_int32
fastfloat                               :   969.05 MB/s (+/- 1.2 %)   104.30 Mfloat/s      21.09 i/B   205.43 i/f (+/- 0.0 %)      3.35 c/B    32.68 c/f (+/- 0.2 %)      6.29 i/c      3.41 GHz

New proposal:

-f data/canada.txt
fastfloat                               :  1370.11 MB/s (+/- 0.3 %)    78.74 Mfloat/s      16.45 i/B   300.18 i/f (+/- 0.0 %)      2.44 c/B    44.50 c/f (+/- 0.2 %)      6.75 i/c      3.50 GHz 
-f data/canada_short.txt
fastfloat                               :   770.58 MB/s (+/- 3.5 %)   143.17 Mfloat/s      30.29 i/B   170.94 i/f (+/- 0.0 %)      4.34 c/B    24.48 c/f (+/- 0.3 %)      6.98 i/c      3.50 GHz 
-f data/mesh.txt
fastfloat                               :   964.19 MB/s (+/- 2.2 %)   131.35 Mfloat/s      22.06 i/B   169.83 i/f (+/- 0.0 %)      3.33 c/B    25.67 c/f (+/- 2.0 %)      6.62 i/c      3.37 GHz 
-m uniform
fastfloat                               :  1829.66 MB/s (+/- 1.5 %)    87.21 Mfloat/s      13.27 i/B   292.04 i/f (+/- 0.0 %)      1.78 c/B    39.08 c/f (+/- 0.2 %)      7.47 i/c      3.41 GHz 
-m uniform -c
fastfloat                               :  1417.53 MB/s (+/- 0.8 %)    81.36 Mfloat/s      11.87 i/B   216.86 i/f (+/- 0.0 %)      2.29 c/B    41.89 c/f (+/- 0.3 %)      5.18 i/c      3.41 GHz 
-m simple_uniform32
fastfloat                               :  1829.92 MB/s (+/- 0.9 %)    87.22 Mfloat/s      13.27 i/B   292.03 i/f (+/- 0.0 %)      1.78 c/B    39.08 c/f (+/- 0.3 %)      7.47 i/c      3.41 GHz 
-m simple_uniform32 -c
fastfloat                               :  1410.67 MB/s (+/- 0.7 %)    80.94 Mfloat/s      11.88 i/B   217.17 i/f (+/- 0.0 %)      2.30 c/B    42.11 c/f (+/- 0.2 %)      5.16 i/c      3.41 GHz 
-m simple_int32
fastfloat                               :   973.45 MB/s (+/- 1.4 %)   104.79 Mfloat/s      20.88 i/B   203.41 i/f (+/- 0.0 %)      3.34 c/B    32.52 c/f (+/- 0.2 %)      6.25 i/c      3.41 GHz

I will test on x64 next.

lemire · 2022-11-18T15:25:35Z

On an AMD Rome processor with GCC 9, the result is negative or, at least, we cannot conclude that the new approach is better.

Current PR:

-f data/canada.txt
fastfloat                               :   767.07 MB/s (+/- 0.2 %)    44.08 Mfloat/s      17.27 i/B   315.08 i/f (+/- 0.0 %)      4.21 c/B    76.84 c/f (+/- 0.2 %)      4.10 i/c      3.39 GHz
-f data/canada_short.txt
fastfloat                               :   439.85 MB/s (+/- 0.4 %)    81.72 Mfloat/s      34.12 i/B   192.58 i/f (+/- 0.0 %)      7.34 c/B    41.44 c/f (+/- 0.4 %)      4.65 i/c      3.39 GHz
-f data/mesh.txt
fastfloat                               :   587.95 MB/s (+/- 0.5 %)    80.09 Mfloat/s      24.43 i/B   188.01 i/f (+/- 0.0 %)      5.50 c/B    42.34 c/f (+/- 0.4 %)      4.44 i/c      3.39 GHz
-m uniform
fastfloat                               :   982.28 MB/s (+/- 1.0 %)    46.82 Mfloat/s      14.05 i/B   309.06 i/f (+/- 0.0 %)      3.29 c/B    72.36 c/f (+/- 0.2 %)      4.27 i/c      3.39 GHz
-m uniform -c
fastfloat                               :   873.73 MB/s (+/- 1.0 %)    50.14 Mfloat/s      13.10 i/B   239.33 i/f (+/- 0.0 %)      3.70 c/B    67.61 c/f (+/- 0.9 %)      3.54 i/c      3.39 GHz
-m simple_uniform32
fastfloat                               :   985.84 MB/s (+/- 0.3 %)    46.99 Mfloat/s      14.05 i/B   309.06 i/f (+/- 0.0 %)      3.28 c/B    72.11 c/f (+/- 0.3 %)      4.29 i/c      3.39 GHz
-m simple_uniform32 -c
fastfloat                               :   871.36 MB/s (+/- 1.0 %)    50.01 Mfloat/s      13.10 i/B   239.34 i/f (+/- 0.0 %)      3.71 c/B    67.75 c/f (+/- 0.1 %)      3.53 i/c      3.39 GHz
-m simple_int32
fastfloat                               :   564.83 MB/s (+/- 0.9 %)    60.80 Mfloat/s      23.11 i/B   225.15 i/f (+/- 0.0 %)      5.72 c/B    55.76 c/f (+/- 0.8 %)      4.04 i/c      3.39 GHz

New approach

-f data/canada.txt
fastfloat                               :   745.03 MB/s (+/- 0.4 %)    42.81 Mfloat/s      17.45 i/B   318.46 i/f (+/- 0.0 %)      4.34 c/B    79.15 c/f (+/- 0.3 %)      4.02 i/c      3.39 GHz
-f data/canada_short.txt
fastfloat                               :   425.19 MB/s (+/- 0.8 %)    79.00 Mfloat/s      33.59 i/B   189.58 i/f (+/- 0.0 %)      7.60 c/B    42.86 c/f (+/- 0.8 %)      4.42 i/c      3.39 GHz
-f data/mesh.txt
fastfloat                               :   573.93 MB/s (+/- 0.5 %)    78.18 Mfloat/s      24.04 i/B   185.01 i/f (+/- 0.0 %)      5.63 c/B    43.35 c/f (+/- 0.4 %)      4.27 i/c      3.39 GHz
-m uniform
fastfloat                               :   972.97 MB/s (+/- 0.3 %)    46.37 Mfloat/s      14.23 i/B   313.06 i/f (+/- 0.0 %)      3.32 c/B    73.08 c/f (+/- 0.3 %)      4.28 i/c      3.39 GHz
-m uniform -c
fastfloat                               :   857.85 MB/s (+/- 0.6 %)    49.23 Mfloat/s      13.06 i/B   238.65 i/f (+/- 0.0 %)      3.77 c/B    68.82 c/f (+/- 0.6 %)      3.47 i/c      3.39 GHz
-m simple_uniform32
fastfloat                               :   973.14 MB/s (+/- 0.4 %)    46.38 Mfloat/s      14.23 i/B   313.06 i/f (+/- 0.0 %)      3.31 c/B    72.91 c/f (+/- 0.3 %)      4.29 i/c      3.38 GHz
-m simple_uniform32 -c
fastfloat                               :   856.81 MB/s (+/- 0.5 %)    49.17 Mfloat/s      13.07 i/B   238.83 i/f (+/- 0.0 %)      3.77 c/B    68.91 c/f (+/- 0.5 %)      3.47 i/c      3.39 GHz
-m simple_int32
fastfloat                               :   563.99 MB/s (+/- 1.1 %)    60.71 Mfloat/s      22.81 i/B   222.15 i/f (+/- 0.0 %)      5.73 c/B    55.85 c/f (+/- 1.0 %)      3.98 i/c      3.39 GHz

lemire · 2022-11-18T15:51:41Z

On a Graviton 3 processor with GCC 11...

This PR:

-f data/canada.txt
fastfloat                               :   937.06 MB/s (+/- 0.8 %)    53.85 Mfloat/s      13.60 i/B   248.17 i/f (+/- 0.0 %)      2.64 c/B    48.21 c/f (+/- 0.3 %)      5.15 i/c      2.60 GHz 
-f data/canada_short.txt
fastfloat                               :   592.17 MB/s (+/- 0.9 %)   110.02 Mfloat/s      24.82 i/B   140.08 i/f (+/- 0.0 %)      4.19 c/B    23.63 c/f (+/- 0.2 %)      5.93 i/c      2.60 GHz 
-f data/mesh.txt
fastfloat                               :   858.53 MB/s (+/- 1.2 %)   116.95 Mfloat/s      17.58 i/B   135.32 i/f (+/- 0.0 %)      2.89 c/B    22.22 c/f (+/- 0.5 %)      6.09 i/c      2.60 GHz 
-m uniform
fastfloat                               :  1336.57 MB/s (+/- 2.9 %)    63.70 Mfloat/s      10.64 i/B   234.05 i/f (+/- 0.0 %)      1.86 c/B    40.81 c/f (+/- 0.2 %)      5.73 i/c      2.60 GHz 
-m uniform -c
fastfloat                               :  1159.58 MB/s (+/- 1.0 %)    66.55 Mfloat/s       9.15 i/B   167.23 i/f (+/- 0.0 %)      2.14 c/B    39.07 c/f (+/- 0.4 %)      4.28 i/c      2.60 GHz 
-m simple_uniform32
fastfloat                               :  1337.07 MB/s (+/- 0.9 %)    63.73 Mfloat/s      10.64 i/B   234.04 i/f (+/- 0.0 %)      1.85 c/B    40.80 c/f (+/- 0.1 %)      5.74 i/c      2.60 GHz 
-m simple_uniform32 -c
fastfloat                               :  1162.75 MB/s (+/- 1.4 %)    66.73 Mfloat/s       9.16 i/B   167.28 i/f (+/- 0.0 %)      2.13 c/B    38.96 c/f (+/- 0.5 %)      4.29 i/c      2.60 GHz 
-m simple_int32
fastfloat                               :   715.36 MB/s (+/- 1.2 %)    77.00 Mfloat/s      18.08 i/B   176.17 i/f (+/- 0.0 %)      3.47 c/B    33.77 c/f (+/- 0.4 %)      5.22 i/c      2.60 GHz

With new proposal...

-f data/canada.txt
fastfloat                               :   944.94 MB/s (+/- 0.9 %)    54.30 Mfloat/s      13.59 i/B   247.99 i/f (+/- 0.0 %)      2.62 c/B    47.85 c/f (+/- 0.3 %)      5.18 i/c      2.60 GHz 
-f data/canada_short.txt
fastfloat                               :   602.92 MB/s (+/- 0.9 %)   112.02 Mfloat/s      24.47 i/B   138.08 i/f (+/- 0.0 %)      4.11 c/B    23.21 c/f (+/- 0.2 %)      5.95 i/c      2.60 GHz 
-f data/mesh.txt
fastfloat                               :   864.58 MB/s (+/- 1.2 %)   117.78 Mfloat/s      17.32 i/B   133.32 i/f (+/- 0.0 %)      2.87 c/B    22.06 c/f (+/- 0.4 %)      6.04 i/c      2.60 GHz 
-m uniform
fastfloat                               :  1341.72 MB/s (+/- 1.6 %)    63.95 Mfloat/s      10.64 i/B   234.05 i/f (+/- 0.0 %)      1.85 c/B    40.66 c/f (+/- 0.9 %)      5.76 i/c      2.60 GHz 
-m uniform -c
fastfloat                               :  1183.55 MB/s (+/- 1.2 %)    67.91 Mfloat/s       9.08 i/B   166.00 i/f (+/- 0.0 %)      2.09 c/B    38.21 c/f (+/- 0.8 %)      4.34 i/c      2.60 GHz 
-m simple_uniform32
fastfloat                               :  1346.73 MB/s (+/- 1.9 %)    64.19 Mfloat/s      10.64 i/B   234.05 i/f (+/- 0.0 %)      1.84 c/B    40.44 c/f (+/- 1.3 %)      5.79 i/c      2.60 GHz 
-m simple_uniform32 -c
fastfloat                               :  1182.90 MB/s (+/- 1.2 %)    67.89 Mfloat/s       9.08 i/B   165.86 i/f (+/- 0.0 %)      2.10 c/B    38.30 c/f (+/- 0.6 %)      4.33 i/c      2.60 GHz 
-m simple_int32
fastfloat                               :   723.10 MB/s (+/- 0.7 %)    77.84 Mfloat/s      17.88 i/B   174.15 i/f (+/- 0.0 %)      3.43 c/B    33.40 c/f (+/- 0.1 %)      5.21 i/c      2.60 GHz

jakubjelinek · 2022-11-18T15:59:22Z

If the benchmark results are inconclusive, just pick whatever you think is more maintainable or more readable.
Perhaps testing some contemporary Intel CPU would be useful too.

lemire · 2022-11-18T16:19:10Z

On a small Intel Ice Lake node... GCC 11...

This PR:

-f data/canada.txt
fastfloat                               :   955.58 MB/s (+/- 1.0 %)    54.91 Mfloat/s  
-f data/canada_short.txt
fastfloat                               :   591.95 MB/s (+/- 1.0 %)   109.98 Mfloat/s  
-f data/mesh.txt
fastfloat                               :   837.77 MB/s (+/- 5.0 %)   114.13 Mfloat/s  
-m uniform
fastfloat                               :  1204.72 MB/s (+/- 0.8 %)    57.42 Mfloat/s  
-m uniform -c
fastfloat                               :  1023.39 MB/s (+/- 2.3 %)    58.72 Mfloat/s  
-m simple_uniform32
fastfloat                               :  1224.47 MB/s (+/- 1.2 %)    58.36 Mfloat/s  
-m simple_uniform32 -c
fastfloat                               :  1028.27 MB/s (+/- 2.2 %)    59.02 Mfloat/s  
-m simple_int32
fastfloat                               :   777.77 MB/s (+/- 2.2 %)    83.70 Mfloat/s

With the proposal...

-f data/canada.txt
fastfloat                               :  1049.34 MB/s (+/- 2.3 %)    60.30 Mfloat/s  
-f data/canada_short.txt
fastfloat                               :   651.84 MB/s (+/- 1.5 %)   121.11 Mfloat/s  
-f data/mesh.txt
fastfloat                               :   851.29 MB/s (+/- 2.7 %)   115.97 Mfloat/s  
-m uniform
fastfloat                               :  1393.54 MB/s (+/- 4.4 %)    66.42 Mfloat/s  
-m uniform -c
fastfloat                               :  1136.02 MB/s (+/- 2.6 %)    65.20 Mfloat/s  
-m simple_uniform32
fastfloat                               :  1396.41 MB/s (+/- 4.6 %)    66.56 Mfloat/s  
-m simple_uniform32 -c
fastfloat                               :  1131.92 MB/s (+/- 2.1 %)    64.97 Mfloat/s  
-m simple_int32
fastfloat                               :   786.75 MB/s (+/- 2.0 %)    84.68 Mfloat/s

lemire · 2022-11-18T16:19:54Z

So for recent compilers and recent processors, the new proposal is beneficial. It might be slightly negative on other systems. I will adopt it.

lemire · 2022-11-18T16:33:04Z

@jakubjelinek I have adopted your approach.

I will leave this PR open for a little bit. I am inviting more comments.

jakubjelinek · 2022-11-18T16:58:04Z

include/fast_float/parse_number.h

+  // There might be other ways to prevent compile-time optimizations (e.g., asm).
+  // The value does not need to be std::numeric_limits<float>::min(), any small
+  // value so that 1 + x should round to 1 would do.


This actually isn't true. The problem is with excess precision. E.g. on i?86 32-bit (or x86-64 with -mfpmath=387) floats are evaluated to the precision of long double (with -fexcess-precision=standard) or could be either way (otherwise).
With std::numeric_limits<float>::min() it will work in this case, because even i?86 long double has just a 64-bit mantissa and, because std::numeric_limits<float>::min() is 2^-126, both will still round to nearest to 1.0.
But if the value wasn't that small, but say just 2^-60, it wouldn't be true anymore. Fortunately it would then return false and so result in the safer version despite perhaps being really FE_TONEAREST.
2^-126 is good even if some target hypothetically evaluates everything in IEEE quad (113 bit mantissa) and no hw AFAIK implements 256-bit floats right now.

Point taken. I have added a remark in a follow-up commit which states "after accounting for excess precision".
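The excess-precision argument can be checked with a small experiment. The sketch below, with a hypothetical helper `absorbed_by_one`, asks whether 1 + 2^e is still 1 in a given format under round-to-nearest; instantiating it with `long double` models a target that evaluates `float` expressions in a wider format. The width of `long double` varies by platform, so the 2^-60 case is compared against the reported mantissa width rather than a fixed answer.

```cpp
#include <cassert>
#include <cmath>
#include <limits>

// Returns true when adding 2^exp2 to 1 leaves 1 unchanged in type T
// (under round-to-nearest), i.e. the addend is absorbed.
template <typename T>
bool absorbed_by_one(int exp2) {
  volatile T tiny = std::ldexp(T(1), exp2); // volatile: evaluate at run time
  volatile T sum = T(1) + tiny;             // volatile store rounds the sum to T
  return sum == T(1);
}
```

For example, `absorbed_by_one<long double>(-126)` should hold for any mantissa up to 113 bits, while `absorbed_by_one<long double>(-60)` fails as soon as `long double` has more than 60 mantissa bits.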

jakubjelinek · 2022-11-18T17:23:06Z

include/fast_float/parse_number.h

+  //  fmin + 1.0f = 0x1.00001 (1.00001)
+  //  1.0f - fmin = 0x1 (1)
+  //
+  // FE_DOWNWARD or  FE_TOWARDZERO:
+  //  fmin + 1.0f = 0x1 (1)
+  //  1.0f - fmin = 0x0.999999 (0.999999)
+  //
+  //  fmin + 1.0f = 0x1 (1)
+  //  1.0f - fmin = 0x0.999999 (0.999999)


float is on most CPUs IEEE single format, so in the fmin + 1.0f FE_UPWARD case it would be better
to mention:
0x1.000002
rather than
0x1.00001
because that is what one really gets in that format after that addition.
The decimal representation of that is
1.00000011920928955078125 if you want to mention it.
And 0x0.999999 is certainly wrong, the right value is
0x1.fffffe
hexadecimal and
0.999999940395355224609375
decimal.
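The corrected values can be reproduced with a short experiment, assuming IEEE single-precision `float` and a target where `fesetround` supports the directed modes. `one_plus_signed_min` is a hypothetical helper; printing its results with the `%a` format would typically show the hexadecimal forms quoted above (with an explicit exponent, e.g. 0x1.fffffep-1).

```cpp
#include <cassert>
#include <cfenv>
#include <cmath>
#include <limits>

// Evaluate 1.0f + sign * FLT_MIN under the given rounding mode.
// The volatile operand and result keep the arithmetic at run time and stop
// the compiler from moving it across the fesetround calls.
inline float one_plus_signed_min(int mode, float sign) {
  volatile float min_normal = std::numeric_limits<float>::min(); // 2^-126
  const int saved = std::fegetround();
  std::fesetround(mode);
  volatile float result = 1.0f + sign * min_normal;
  std::fesetround(saved);
  return result;
}
```

Under FE_UPWARD the sum lands on the next float above 1 (1 + 2^-23), and under FE_DOWNWARD the difference lands on the next float below 1 (1 - 2^-24), which are the two decimal values given in the comment.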

Let us take this stuff out. It is too long.

Note that my latest commit adds an optimization.

I simplified the explanation in the latest commit.

lemire · 2022-11-18T20:56:34Z

This new code is slightly insane. :-) Lots and lots of complexity. Ah well.

biojppm · 2022-11-20T19:22:48Z

Would it make sense to add a test to cover what happens when -ffast-math is enabled? The relaxed rounding mode may cause the behavior to change.

lemire · 2022-11-20T20:57:09Z

@biojppm My understanding of fast-math is that it does away with the standards. The rounding is no longer guaranteed to be exact. And we leave the bounds of the C++ standards. I could be wrong but, if so, I’d love a reference.

biojppm · 2022-11-21T14:11:15Z

Indeed -ffast-math will be a soft kind of UB for IEEE behavior. But then what kind of impact will it have, and how frequently? Will the results be wrong, or slower to come by, or both?

E.g., if the wrong answer is given, then the result would likely be that the current rounding mode is ignored (correct me if I'm wrong here), and that may or may not be bad, depending on the user's requirements.

But that is if the compiler introduces the "right" optimization. Will it? And how likely is it to do it?

If we add a test covering -ffast-math and correctness, then at least we will know some of these answers. That is, if there is a tradeoff that involves correctness or speed when -ffast-math is enabled, the users should know about it, also with some notes on the README.

And if it is evident from the outset that there is indeed a certain tradeoff, and what it looks like, then we can skip the tests, but that's even a stronger argument for making the users aware of it.

jakubjelinek · 2022-11-21T14:22:26Z

I have tried on godbolt
int
foo (float x)
{
return 1.0f - x == 1.0f + x;
}
with -O3 -ffast-math, and all of GCC, clang, MSVC and ICC perform the subtraction, the addition and the comparison.
Yes, in theory, with -ffast-math the compiler could assume that the rounding mode is to nearest and (from the -ffast-math restrictions) that there are no NaNs or infinities and that the sign of zero is unimportant, and so perhaps fold that to just
return 1.0f + x == 1.0f; (or return 1.0f - x == 1.0f;) and avoid one operation, but no compiler I've tried does that right now.
That said, I'd say with -ffast-math it is also fine if from_chars doesn't round to nearest in non-default rounding modes; with -ffast-math one says that some inaccuracy is acceptable if it can result in faster code.

lemire · 2022-11-21T14:57:45Z

I have added a fast-math test to our CI.

lemire · 2022-11-23T15:36:52Z

Merging. I will issue a new release.

lemire · 2022-11-23T15:57:23Z

It has been released. Thanks to all.

lemire added 4 commits November 16, 2022 16:21

We might reenable clinger.

6ceb29a

Cleaning.

2c8e738

More verbose error report.

9532176

Fix for Win32+ClangCL

d225059

32-bit clangcl appears to be ridiculous.

559b89d

More tweaking around clangcl

fd9d9ef

More tuning.

8f27b7e

lemire added 2 commits November 16, 2022 15:49

Make sure that macros have actual values when defined (makes debugging easier)

29b1a03

More tweaks.

bfc0478

Adopting proposal.

39ea41b

jakubjelinek reviewed Nov 18, 2022

lemire added 3 commits November 18, 2022 12:27

Added a remark.

3d0e448

Minor optimization.

8b7a55a

Simplifying the justification.

003a983

lemire added 2 commits November 21, 2022 09:53

Adding a fast-math test.

eec504a

Renaming the test.

968bd9d

lemire merged commit 8f092d2 into main Nov 23, 2022

lemire deleted the dlemire/renabling_clinger branch January 28, 2023 01:50
