Add `bf16`, `f64f64` and `f80` types #3456
This revision also contains comments addressed from reviewers in RFC rust-lang#3451.
Having unified rule for naming is a benefit. For example,
|
Since |
text/add-bf16-f64f64-and-f80-type.md
Outdated
# Drawbacks
[drawbacks]: #drawbacks
`bf16` is not an IEEE 754 standard type, so adding it as a primitive type may break existing consistency for builtin float types. The truncation after calculation on targets not supporting `bf16` natively also breaks how Rust treats precision loss in other cases.
Correctly rounding to `bf16` can be implemented relatively easily. `bf16` add/sub/mul/div/sqrt can then just convert to `f32`, do a single operation, and round to `bf16`, which will always give the correct `bf16` result.
Round to `bf16` code (not tested):
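fn f32_to_bf16(v: f32) -> bf16 {
    let b32 = v.to_bits();
    bf16::from_bits(if v.is_nan() {
        // Set the quiet bit so a payload held only in the low 16 bits
        // does not turn the NaN into an infinity.
        (b32 >> 16) as u16 | 0x0040
    } else if b32 & 0xFFFF == 0x8000 {
        // Exact tie: round to the even neighbour.
        let b16 = (b32 >> 16) as u16;
        b16.wrapping_add(b16 & 1)
    } else {
        // Not a tie: add half an ULP, then truncate.
        (b32.wrapping_add(0x8000) >> 16) as u16
    })
}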
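The routine above cannot be compiled today, because `bf16` is not an existing Rust type. A minimal sketch of the same round-to-nearest, ties-to-even logic on raw `u16` bit patterns might look like the following. The name `f32_to_bf16_bits` is hypothetical, and forcing the quiet bit for NaN inputs is an assumption made here so that a NaN never collapses into an infinity.

```rust
/// Round an `f32` to the nearest bf16 bit pattern, ties to even.
/// Hypothetical helper: it returns raw bits because `bf16` is not a real type yet.
fn f32_to_bf16_bits(v: f32) -> u16 {
    let b32 = v.to_bits();
    if v.is_nan() {
        // Keep the sign and the upper payload bits; set the quiet bit so the
        // result is still a NaN even if the payload was only in the low half.
        ((b32 >> 16) as u16) | 0x0040
    } else if b32 & 0xFFFF == 0x8000 {
        // Exactly halfway between two bf16 values: pick the even one.
        let b16 = (b32 >> 16) as u16;
        b16.wrapping_add(b16 & 1)
    } else {
        // Not a tie: adding half an ULP and truncating rounds to nearest.
        // A carry out of the mantissa correctly bumps the exponent,
        // so very large finite inputs round to infinity.
        (b32.wrapping_add(0x8000) >> 16) as u16
    }
}
```

For instance, `f32_to_bf16_bits(1.0)` keeps the high half of `0x3F80_0000`, and `f32::MAX` rounds up to the infinity pattern `0x7F80`.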
Round to `bf16` code staying entirely in SSE registers on x86_64 (also untested): https://rust.godbolt.org/z/85Ks9sPP6
The code you provided looks like direct truncation. I need to confirm whether the rounding behavior is fixed (toward zero? to nearest? toward infinity?) or depends on the system rounding mode.
Also, clang provides the option `-fbfloat16-excess-precision` to specify the 'merging' behavior of bfloat operations. For example, will the intermediate result of `a-b+c` be rounded? But I think that's not an issue for Rust; the value should be `none` (no merging will be performed).
> The code you provided looks like direct truncation.
It is round to nearest, ties to even. Truncation (round towards zero) would be only:
    bf16::from_bits((f32::to_bits(v) >> 16) as u16)
> I need to confirm if the rounding behavior is fixed (tozero? tonearest? toinfinity?) or depending on system rounding mode.
LLVM assumes the rounding mode is round to nearest, ties to even, unless you use the constrained fp intrinsics that rustc doesn't support (yet?).
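To make the distinction in this reply concrete, here is a small sketch that puts truncation next to round-to-nearest, ties-to-even, both on raw bit patterns. The function names are hypothetical, and the rounding helper uses the common "add a bias, then shift" trick rather than the explicit tie branch shown earlier in the thread; NaN inputs are deliberately not handled.

```rust
/// Truncate an `f32` to bf16 bits (round toward zero): keep the high 16 bits.
fn truncate_to_bf16_bits(v: f32) -> u16 {
    (v.to_bits() >> 16) as u16
}

/// Round an `f32` to bf16 bits, to nearest with ties to even.
/// Assumes `v` is not NaN; NaN handling is omitted to keep the sketch short.
fn round_to_bf16_bits(v: f32) -> u16 {
    let b32 = v.to_bits();
    // Lowest bit that survives the conversion; it decides which way ties go.
    let lsb = (b32 >> 16) & 1;
    // Bias of 0x7FFF (+1 when the kept bit is odd) sends exact ties to even.
    (b32.wrapping_add(0x7FFF + lsb) >> 16) as u16
}
```

The two functions agree whenever the discarded low half is below the halfway point and differ otherwise, for example on the input `f32::from_bits(0x3F80_C000)`.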
I think |
Where will bf16 be located in libcore? The PPC and x86 floats will go in the relevant arch modules, but since bfloat is not arch-specific, it seems relevant to ask. |
I would expect |
`f64f64` comes from double-double; since `f64` = double, that makes sense.
That's not how rust's naming convention works. Considering that this is a 128 bit float format with a slightly different exponent split than the usual |
it is actually two |
In this case |
|
one other option is we could copy the existing double-double crate and call it twofloat |
This comes with the issue that it does not follow the Rust style of `f16`, `f32`, `f64`, `f128`.
To echo something said by scottmcm on another thread: types representing 80-bit extended precision (f80) and double-double (f64f64) are specialized types that we want to make available but don't want to encourage common use of (they will forever live in core::arch), so they don't need to match Rust's primitive naming style. "ugly" names along the lines of __m128bh are fine. |
If "ugly" names is accepted, But if |
I'm not exactly sure why the double-underscore convention was adopted for the x86 types, but if we're going for consistency, f80 should be called __m80. I'm not sure if x86 has SIMD types involving 80-bit floats (I sure hope not) but if so we could also use similar naming here. |
Where does the |
the double underscore is likely because C reserves all identifiers starting with the |
BTW, `__m128bh` comes from https://www.intel.com/content/www/us/en/docs/cpp-compiler/developer-guide-reference/2021-8/intrinsics-for-avx-512-bf16-instructions.html, and it's for SIMD, |
Hard disagree. |
This isn't C, though.
Hmm, when poking around a few x86 references I found that people used I guess if we wanted to go with the prefix meaning the instruction set, we could go with |
but the |
Just randomly saw this, and this is some good timing, because I have However, I would strongly advise staying away from
|
|
I was under the impression that primitives were merely in the prelude, and their "primitive" nature simply came from the fact that they were associated with lang items. However, after looking at the prelude, this is not the case, and they are instead always present. I understand the desire to make them work with literal suffixes, but could this not be allowed without bringing the types in scope? Or maybe only with the types in scope? Perhaps this can be affected by an edition bump. The ideal way IMHO this would work is that you can always coerce a literal to the type, but in order to actually reference the type or use it via an explicit suffix, you'd have to import it. Perhaps the "explicit suffix" form might even be undesired and you would have to do it via some expression like |
text/add-bf16-f64f64-and-f80-type.md
Outdated
# Unresolved questions
[unresolved-questions]: #unresolved-questions
This proposal does not contain information for FFI with C's `_Float128` and `__float128` type. Because they are not so commonly used compared to `long double`, and they are even more complex than the situation of `c_longdouble` (for example, their semantics are different under C or C++ mode).
This proposal does not contain information for FFI with C's `_Float128` and `__float128` type. Because they are not so commonly used compared to `long double`, and they are even more complex than the situation of `c_longdouble` (for example, their semantics are different under C or C++ mode). | |
This proposal does not contain information for FFI with C's `_Float128` and `__float128` type, because they are not so commonly used compared to `long double`, and they are even more complex than the situation of `c_longdouble` (for example, their semantics are different under C and C++). |
I think this is not the reason; indeed, on some targets `c_longdouble` needs `_Float128`. The real reason is that we have a different RFC, #3453, for it. Do not say this, as it's misleading.
I think we need to say that, in conjunction with RFC 3453, we can define `c_longdouble` properly on all targets.
PowerPC has an option to control what `long double` means: `f64`, `f128` or `f64f64`. But since it is in transition to `f128` as of the time of writing, we can drop a little historical burden and set `f128` on little-endian 64-bit targets.
I'm not sure whether x86 has a similar option. If not, we can confidently introduce `c_longdouble`.
Co-authored-by: Jacob Lifshay <[email protected]>
Co-authored-by: teor <[email protected]>
Co-authored-by: konsumlamm <[email protected]>
However, besides the disadvantage of usage inconsistency between primitive types and types from crates, there are still issues around those bindings.
The availability of additional float types heavily depends on CPU/OS/ABI/features of different targets. Evolution of LLVM may also unlock the possibility of the types on new targets. Implementing them in the compiler handles the stuff at the best location.
This is an important point. Rust's AArch64 Neon (and prototype SVE) intrinsics currently lack `f16` and `bf16` vector support precisely because Rust cannot produce the representation that LLVM expects without the real scalar types; new-type wrappers around `u16` won't work here.
This proposal (and the related #3453) enables those gaps to be filled in, I think.
`bf16` is available on all targets. The operators and constants defined for `f32` are also available for `bf16`.
For `f64f64` and `f80`, their availability is limited to the following targets, but this may change over time:
- `f64f64` is supported on `powerpc-*` and `powerpc64(le)-*`, available in `core::arch::{powerpc, powerpc64}`
- `f80` is supported on `i[356]86-*` and `x86_64-*`, available in `core::arch::{x86, x86_64}`
How useful (and feasible) would it be to also make these types `target_feature`-dependent?
`bf16` is an odd one in a way, because hardware accelerations (where they exist) tend to just use `f32` and truncate anyway. Emulating that with `f32` hardware is likely to be cheap.
However, there are new, emerging formats that are unlikely to have that property. AArch64 has some 8-bit FP formats on the way, for example. Their incorporation into Rust would have complexities, but they're different enough from existing formats that we probably wouldn't want a polyfill for hardware that doesn't have them. Instead, I'd expect them to need a `target_feature` guard or similar (like Neon and SVE types).
Finally: something we observed during SVE prototyping (#3268) is that sometimes, we'd really like the target feature to be associated with the type, rather than the function. That's not quite the same as gating availability that way, but it's perhaps related.
One interesting functionality that
IEEE 754-2019 adds a section on augmented arithmetic operations, which includes addition, subtraction, and multiplication, but not division (for reasons I don't know and will not speculate on). It may be the case that future versions will grow a more general double-double library functionality for extra precision. |
Is a complete softfloat implementation strictly necessary? We could just forbid operations on |
I think that is the goal - everything here (except for bf16) would be in std::arch, only available wherever there is hardware support |
## `f80` type
`f80` represents the extended precision floating point type on x86 targets, with 1 sign bit, 15 bits of exponent and 63 bits of mantissa.
What is the size and alignment of `f80` on x86? gcc can change it using the `-m96bit-long-double` and `-m128bit-long-double` options, although only one is conformant with the ABI.
Do we also use `f80` for the 80-bit floating point format on m68k? It is nearly identical to the Intel format, although it supports normalized numbers with a biased exponent of 0 and is big endian.
Alignment would be set by the ABI.
oh, according to gcc, the ABI size is 96 bits on x86 and 128 bits on x86_64: https://gcc.gnu.org/onlinedocs/gcc/x86-Options.html#index-m96bit-long-double
Interesting, the alignment is also reduced to make use of that extra space
(From i386 abi table 2.1 at https://www.uclibc.org/docs/psABI-i386.pdf)
Better split bf16 out of this, I think the main reason |
At this point, I've personally come to expect that Rust types named |
Because these live in |
"f" in "f*" means "Float point number format". But, fair point, we expect, that "f[number]" is a part of IEEE 754. |
True - but what about the general case? If some other platform has a quirky " |
You can use |
or, you can just use a different name: like how |
The types listed above may be widely used in existing native code, but are not available on all targets. Their underlying representations are quite different from 16-bit and 128-bit binary floating format defined in IEEE 754.
In respective targets (namely PowerPC and x86), the target-specific extended types are referenced by `long double`, which makes `long double` ambiguous in the context of FFI. Thus defining `c_longdouble` should help interoperating with C code using the `long double` type.
This section needs some stronger motivation - `bf16` is not widely used (yet), and being widely used in C isn't strong enough motivation on its own for Rust to do anything. Ideas to add:
- `bf16` is popular in GPU work, and is supported as a storage format on multiple platforms (especially ARM)
- `f80` can be used for platform-specific performance improvements (over `f128`), like a SIMD type
- We will have something compatible with C's `long double` on every platform. Currently we only have `f64` and `f128`.
`bf16` consists of 1 sign bit, 8 bits of exponent, 7 bits of mantissa. Some ARM, AArch64, x86 and x86_64 targets support `bf16` operations natively. For other targets, they will be promoted into `f32` before computation and truncated back into `bf16`.
`bf16` will generate the `bfloat` type in LLVM IR.
Is there a place where bf16 ABI is defined, since it is a nonstandard float type? We need to make sure that GCC and LLVM are compatible here.
`core::ffi::c_longdouble` will always represent whatever `long double` does in C. Rust will defer to the compiler backend (LLVM) for what exactly this represents, but it will approximately be:
- `f80` extended precision on `x86` and `x86_64`
- `f64` double precision with MSVC
- `f128` quadruple precision on AArch64
- `f64f64` on PowerPC
> Rust will defer to the compiler backend (LLVM) for what exactly this represents
I think you mean to say that we will make it match Clang, since there is no way to query LLVM as to what a `long double` is (that logic lives in Clang, not the backend).
ARM is another notable platform where `long double` = `f64`
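As a rough illustration of the per-target mapping listed in the RFC text and extended in this comment, the sketch below selects a format with `cfg!`. Everything here is hypothetical: the enum, its variant names, and the function are invented for illustration, and the table only mirrors the approximate list from the thread. It ignores per-OS exceptions and C compiler flags such as `-mlong-double-128`, which can change the answer on a real system.

```rust
/// Formats that C's `long double` may map to (hypothetical names).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum LongDoubleFormat {
    F64,          // IEEE 754 binary64
    X87Extended,  // 80-bit x87 extended precision
    F128,         // IEEE 754 binary128
    DoubleDouble, // pair of f64 values, the legacy PowerPC format
    Unknown,      // target not covered by this sketch
}

/// Approximate `long double` format for the compilation target,
/// following the list in the thread; not an authoritative ABI table.
fn long_double_format() -> LongDoubleFormat {
    if cfg!(target_env = "msvc") {
        LongDoubleFormat::F64
    } else if cfg!(any(target_arch = "x86", target_arch = "x86_64")) {
        LongDoubleFormat::X87Extended
    } else if cfg!(target_arch = "aarch64") {
        LongDoubleFormat::F128
    } else if cfg!(target_arch = "arm") {
        LongDoubleFormat::F64
    } else if cfg!(any(target_arch = "powerpc", target_arch = "powerpc64")) {
        LongDoubleFormat::DoubleDouble
    } else {
        LongDoubleFormat::Unknown
    }
}
```

Because the answer is fixed at compile time, a real `c_longdouble` would have to commit to one choice per target, which is the limitation raised elsewhere in the review.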
# Drawbacks
[drawbacks]: #drawbacks
`bf16` is not an IEEE 754 standard type, so adding it as primitive type may break existing consistency for builtin float types. The truncation after calculations on targets not supporting `bf16` natively also breaks how Rust treats precision loss in other cases.
I don't get what this is saying - what consistency is broken that is specific to bf16? None of the float types specified here are fully specified in IEEE 754 (though f80 is compatible with its extended precision definition).
I think the idea is that on targets not supporting `bf16`, a technically-incorrect and lazy but fast and sometimes good enough approximation is commonly used: doing the operations as `f32` and then taking the high half of the `f32` result, which has incorrect rounding (that `f32` to `bf16` conversion truncates instead of rounding to nearest, ties to even like all other normal FP operations).
`bf16` is not an IEEE 754 standard type, so adding it as primitive type may break existing consistency for builtin float types. The truncation after calculations on targets not supporting `bf16` natively also breaks how Rust treats precision loss in other cases.
`c_longdouble` is not uniquely determined by architecture, OS and ABI. On the same target, C compiler options may change what representation `long double` uses.
I think you are referring to options like `-mlong-double-128` here. This doesn't strike me as a drawback. Instead, I would mention in the `c_longdouble` section that what exactly `long double` represents can be changed at compile time in C but Rust won't have this option.
And since third party tools also rely on Rust internal code, implementing additional float types in the compiler also helps the tools to recognize them.
# Prior art
This proposal does not contain information for FFI with C's `_Float128` and `__float128` type. [RFC #3453](https://github.com/rust-lang/rfcs/pull/3453) focuses on type conforming to IEEE 754 `binary128`.
Although statements like `X target supports A type` are used in the text above, some targets may only support some types when certain target features are enabled. Such features are assumed to be enabled, with precedents like `core::arch::x86_64::__m256d` (which is part of AVX).
Could you list what exactly these target features are in the reference-level explanation section? This RFC should propose whether we want to just disallow the types without relevant target features (probably acceptable) or try to polyfill them somehow (I hope not, unless somebody is extremely motivated).
# Unresolved questions
[unresolved-questions]: #unresolved-questions
This proposal does not contain information for FFI with C's `_Float128` and `__float128` type. [RFC #3453](https://github.com/rust-lang/rfcs/pull/3453) focuses on type conforming to IEEE 754 `binary128`.
This can probably be dropped since `f128` is in nightly now.
We need to taper the name bikeshed on this thread so other important topics don't get lost. I created a Zulip thread for naming discussion, please do continue talking about it there: https://rust-lang.zulipchat.com/#narrow/stream/213817-t-lang/topic/Additional.20float.20types.20RFC.20naming.20bikeshed Some of the bigger open questions as I understand it:
|
In most cases, the ABI of a C |
## `bf16` type
`bf16` consists of 1 sign bit, 8 bits of exponent, 7 bits of mantissa. Some ARM, AArch64, x86 and x86_64 targets support `bf16` operations natively. For other targets, they will be promoted into `f32` before computation and truncated back into `bf16`.
Is that equivalent to whatever the IEEE semantics of bf16 are, if such a thing exists (i.e., a hypothetical IEEE type with 8 bits of exponent, 7 bits of mantissa)?
Reading other comments below it seems like this is actually an incorrect emulation. So it will be the case, for the first time in Rust history, that primitive operations such as multiplying two elements of `bf16` have target-dependent behavior even when no NaNs are involved. That's a major downside of the RFC and needs to be discussed and justified more explicitly. The RFC should also state explicitly what is guaranteed to be true about `bf16` arithmetic on all targets -- that's needed e.g. for unsafe code authors to know what they can rely on in terms of soundness. Furthermore, the RFC needs to specify whether on targets that have native bf16 support, it is correct for the compiler to do compile-time optimizations using emulated f32 semantics (IOW, the RFC needs to say whether there are any guarantees that bf16 on such a target will actually behave like the native bf16 of the hardware.)
It should be possible to emulate correct rounding semantics, so this frankly seems like a bug in those softfp impls.
It does require temporarily switching to RTZ mode, and then you can truncate the result with RTN-ties-even.
Edit: Actually NVM, the above procedure still has an error from the correctly rounded result, of at most -2^17*ULP. You'd have to first check FE_INEXACT and adjust how you round accordingly.
Or maybe alternatively the answer for all these float types is -- the resulting bit patterns are entirely unspecified and not guaranteed to be portable in any way.
But even then we need to document how deterministic they are. Does passing the exact same inputs to an operation multiple times during a program execution always definitely produce the exact same outputs, on all targets and optimization levels? For regular floats, the answer turns out to be "only when there are no NaNs" -- that's what #3514 is all about. Sometimes, the same operation with the same inputs on the same target can produce different results depending on optimization levels and how obfuscated the surrounding code is. Even if we don't want to specify the bits that are produced by these operations, we need to specify whether results are consistent across all programs on a given target (define "target" -- is it per-triple or per-architecture), or only consistent across all operations in a single execution, or arbitrarily inconsistent?
iirc, you can do a single add/sub/mul/div/sqrt `bf16` operation by promoting to `f32`, doing the operation in `f32`, and then rounding back to `bf16` using the same rounding mode, not truncating. That is assuming, of course, that `bf16` actually meets the conditions, which are iirc something like having `bf16`'s mantissa bit count be less than half of `f32`'s mantissa bit count minus 1 or 2.
This is like how you can do that with `f32` and `f64`, which is how you can do `f32` operations in JavaScript by using `Math.fround`.
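A minimal sketch of that promote, operate once, round once pattern follows, using raw `u16` bit patterns as a stand-in for `bf16`. All function names are hypothetical. The condition the comment recalls is usually stated as the wide format needing at least 2p + 2 significand bits for a narrow format with p bits; assuming that statement applies, `bf16` (p = 8) and `f32` (24 bits) satisfy it, since 24 >= 18.

```rust
/// Widen bf16 bits to `f32`. This is exact: bf16 is the high half of an `f32`.
fn bf16_bits_to_f32(bits: u16) -> f32 {
    f32::from_bits((bits as u32) << 16)
}

/// Round an `f32` to bf16 bits, to nearest with ties to even.
fn f32_to_bf16_bits(v: f32) -> u16 {
    let b32 = v.to_bits();
    if v.is_nan() {
        // Force the quiet bit so the result remains a NaN.
        return ((b32 >> 16) as u16) | 0x0040;
    }
    let lsb = (b32 >> 16) & 1;
    (b32.wrapping_add(0x7FFF + lsb) >> 16) as u16
}

/// Emulated bf16 addition: widen both operands, do ONE `f32` operation,
/// then round back ONCE. Chaining several `f32` operations before rounding
/// would not be covered by the double-rounding argument.
fn bf16_add(a: u16, b: u16) -> u16 {
    f32_to_bf16_bits(bf16_bits_to_f32(a) + bf16_bits_to_f32(b))
}

/// Emulated bf16 multiplication, same pattern as `bf16_add`.
fn bf16_mul(a: u16, b: u16) -> u16 {
    f32_to_bf16_bits(bf16_bits_to_f32(a) * bf16_bits_to_f32(b))
}
```

Rounding after every single operation is the design point here; whether LLVM's own lowering gives the same guarantee is exactly the open question raised in the replies.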
That's true for the regular IEEE formats, yes -- it was proven here.
But I don't know if bf16 is enough like an IEEE format to make that theorem apply.
Also, does LLVM when it compiles bf16 to f32 guarantee to do the rounding back to `bf16` after each and every operation, never doing more than one operation "at once" in `f32` mode?
> That's true for the regular IEEE formats, yes -- it was proven here.
> But I don't know if bf16 is enough like an IEEE format to make that theorem apply.
It is. `bf16` is just `f16` with a few more exponent bits and a few fewer mantissa bits; everything else is the same.
`bf16` is just `f32` with the lower 16 mantissa bits dropped. As `f64` values that can be rounded to `f32` are effectively `f32` values with 29 extra mantissa bits, there would be no difference here.
> Also, does LLVM when it compiles bf16 to f32 guarantee to do the rounding back to `bf16` after each and every operation, never doing more than one operation "at once" in `f32` mode?
That I don't know, but I hope LLVM at least tries to be correct in non-fast-math mode.
> that I don't know, but I hope LLVM at least tries to be correct in non-fast-math mode
That should definitely be noted as something to figure out before stabilization.
`f64f64` is the legacy extended floating point format used on PowerPC targets. It consists of two `f64`s, with the former acting as a normal `f64` and the latter for an extended mantissa.
The following `From` traits are implemented in `core::arch::{powerpc, powerpc64}` for conversion between `f64f64` and other floating point types:
It sounds like this type does not have semantics that are equivalent to any IEEE float type. But we need some document to explain exactly what their semantics are, i.e. the exact bits you get out when doing arithmetic on values of this type. Does such a document exist?
That's a good question. For the most part, it would act like an f64x2 vector (that multiplication/division, etc. would cross), but when exactly bits will move between the two is a question that would need to be answered.
> For the most part, it would act like an f64x2 vector
Wait, that's not at all how `f64f64` arithmetic works; it instead works more like a big-integer. E.g. here's multiplying two double-double values in the `twofloat` crate: https://docs.rs/twofloat/0.7.0/src/twofloat/arithmetic.rs.html#145
Yeah that's more like what I expected. Is there an explanation somewhere of what the "meaning" of such an `(a, b)` pair is, i.e. what is its mathematical-valued semantics? Is it `a + b` (where this is mathematical inf-precision `+` on rational numbers)?
Is the behavior of all basic operations on that kind of representation exactly guaranteed, the same way IEEE exactly guarantees behavior for our regular floats?
Given the unusual semantics and the somewhat legacy nature of `f64f64`, would it be better to just provide a type with no methods/trait implementations (apart from `Copy`/`Clone`, similar to the arch-specific SIMD types), and leave a fully featured `f64f64` implementation to crates like `twofloat`? AFAIK PowerPC doesn't provide any hardware acceleration for `f64f64` specifically, so the only thing that couldn't be done outside the compiler/std would be supporting the `f64f64` C ABI, which external crates can then use in a `#[repr(transparent)]` struct.
Not knowing the details of the `fast_two_sum` function, that looks to be a binomial product, which is what I was referring to with "multiplication/division, etc., would cross" though I guess I wasn't quite clear on that.
> Given the unusual semantics and the somewhat legacy nature of `f64f64`, would it be better to just provide a type with no methods/trait implementations
Sounds good to me! Though I'd at least have `Copy`, `Clone`, `Default`, and `Debug`, where `Debug` could just be as if it was: `struct f64f64 { high: f64, low: f64 }`
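The bare type suggested in this comment could be sketched as a plain struct with derived traits. This is only an illustration: the name `F64F64` is hypothetical (chosen to satisfy Rust's naming lints for user-defined types), and an ordinary struct like this would not by itself match the C ABI of a PowerPC `long double`, which is the part that would need compiler support.

```rust
/// Hypothetical stand-in for the proposed `f64f64`: data only, no arithmetic.
/// Full double-double operations would be left to external crates.
#[derive(Copy, Clone, Default, Debug)]
struct F64F64 {
    /// Leading component, carrying most of the value.
    high: f64,
    /// Trailing component, extending the mantissa.
    low: f64,
}
```

With the derived `Debug`, `F64F64::default()` formats as `F64F64 { high: 0.0, low: 0.0 }`.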
> Not knowing the details of the `fast_two_sum` function, that looks to be a binomial product, which is what I was referring to with "multiplication/division, etc., would cross" though I guess I wasn't quite clear on that.
Ok, yeah. Addition and subtraction also don't act like a `f64x2`; about the only ops that act like `f64x2` are neg, abs, and copy.
> Not knowing the details of the `fast_two_sum` function, that looks to be a binomial product, which is what I was referring to with "multiplication/division, etc., would cross" though I guess I wasn't quite clear on that.
So -- what is the mathematical value of a pair `(a, b)` then, the rational number this represents?
> Yeah that's more like what I expected. Is there an explanation somewhere of what the "meaning" of such an `(a, b)` pair is, i.e. what is its mathematical-valued semantics? Is it `a + b` (where this is mathematical inf-precision `+` on rational numbers)?
Yes, it is `high + low`, where the number is the exact mathematical sum of two `f64`s.
> Is the behavior of all basic operations on that kind of representation exactly guaranteed, the same way IEEE exactly guarantees behavior for our regular floats?
I've heard that many special functions (like sin, cos, etc.) don't even always return canonical values (as in the result is represented differently than the exact same number would be by arithmetic ops), idk which ones.
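To illustrate the "exact sum of two `f64`s" reading of a double-double value, here is a sketch of the classic error-free addition (Knuth's TwoSum), which produces such a pair from two ordinary `f64` inputs. The function name is hypothetical, and this is not taken from the RFC or from any particular crate; it assumes round-to-nearest `f64` arithmetic without excess precision and no overflow.

```rust
/// Error-free addition: returns `(high, low)` where `high` is the rounded
/// `f64` sum and `low` is the rounding error, so that `high + low` equals
/// `a + b` exactly as a mathematical sum.
fn two_sum(a: f64, b: f64) -> (f64, f64) {
    let high = a + b;
    // Recover the part of `b` that actually made it into `high`.
    let b_virtual = high - a;
    // And the part of `a` that made it into `high`.
    let a_virtual = high - b_virtual;
    // Whatever was lost from each operand is the exact rounding error.
    let low = (a - a_virtual) + (b - b_virtual);
    (high, low)
}
```

For example, adding `1.0` and `2^-60` yields `high = 1.0` and `low = 2^-60`: the small term cannot fit into a single `f64` next to `1.0`, but the pair still represents the sum exactly.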
All proposed types do not have literal representation. Instead, they can be converted to or from IEEE 754 compliant types.
# Reference-level explanation
[reference-level-explanation]: #reference-level-explanation
For all of these types, what is their interaction with #3514 -- i.e., what exactly is guaranteed (or not) about their NaN values? Is there any other kind of non-determinism for any of them?
## `f80` type
`f80` represents the extended precision floating point type on x86 targets, with 1 sign bit, 15 bits of exponent and 63 bits of mantissa.
So this is following strict IEEE float semantics, just with different exponent/mantissa sizes than the existing types we have? That should be stated explicitly.
Ish. x87 has some weirdness in subnormal and nonfinite values, and it has an explicit integer bit, unlike the other interchange formats (which directly results in the aforementioned weirdness).
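The explicit integer bit mentioned here can be seen by splitting an 80-bit x87 value into its fields. The sketch below assumes the 80 meaningful bits are held in the low bits of a `u128`; the struct and function names are hypothetical and exist only for illustration.

```rust
/// Fields of an x87 80-bit extended-precision value (hypothetical helper type).
#[derive(Debug, PartialEq, Eq)]
struct X87Fields {
    sign: bool,
    /// 15-bit biased exponent (bias 16383).
    exponent: u16,
    /// Explicit leading significand bit; IEEE interchange formats leave it implicit.
    integer_bit: bool,
    /// Remaining 63 significand bits.
    fraction: u64,
}

/// Split the low 80 bits of `bits` into sign, exponent, integer bit and fraction.
fn decode_x87(bits: u128) -> X87Fields {
    let significand = bits as u64; // low 64 bits
    let sign_exp = (bits >> 64) as u16; // next 16 bits
    X87Fields {
        sign: sign_exp & 0x8000 != 0,
        exponent: sign_exp & 0x7FFF,
        integer_bit: significand >> 63 == 1,
        fraction: significand & 0x7FFF_FFFF_FFFF_FFFF,
    }
}
```

Because the integer bit is stored rather than implied, bit patterns with a nonzero exponent and a cleared integer bit can exist, which is one source of the oddities described in this comment.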
The previous RFC #3451 mixed proposals for the IEEE 754 compliant `f16`/`f128` with these non-standard types; this RFC is split off from it to focus on the target-related ones.