Concrete binary floating-point types, part 4
Signed zero, infinity, and NaN
Background:
IEEE 32-bit and 64-bit binary floating-point types have no more than one representation of each nonzero finite number, and each representation has only one encoding. (For historical reasons, the extended-precision 80-bit binary floating-point type does support a second, noncanonical encoding of some representations.)
Because the sign is represented independently of the significand, there are two representations of zero: positive zero and negative zero. They compare equal to one another but are not substitutable for all arithmetic operations. For example, the reciprocal of positive zero is positive infinity, but the reciprocal of negative zero is negative infinity. This behavior is specified by the IEEE 754 standard and is common to many languages.
When all the bits used to encode the exponent are nonzero, the encoded value is not finite. Instead, that value is positive infinity, negative infinity, or NaN (“not a number”). The rationale for supporting these values is beyond the scope of this article. Briefly, however:
The result of an operation that is too large in magnitude to be represented as a finite value overflows to positive or negative infinity. (The result of an operation that is too small in magnitude to be represented as a nonzero value underflows to zero.)
Dividing 0.0 by 0.0 is considered “invalid” and the result of that operation is NaN, as specified by the IEEE 754 standard. Other cases in which the result of an operation is NaN are carefully enumerated by the standard; the rationale for such behavior is well beyond the scope of this article.
However, for the present purposes, it’s helpful to know that NaN results can be propagated safely (not unlike “optional chaining” in Swift) through most subsequent floating-point calculations. It’s therefore generally unnecessary to check whether each intermediate result is NaN during a complicated calculation, provided that the ultimate result is of a floating-point type that can represent NaN. The behavior of NaN operands can differ under an alternative floating-point exception behavior; as previously mentioned, it’s not possible to change the floating-point exception behavior in Swift.
NaN values can be counterintuitive for users unfamiliar with floating-point arithmetic. The IEEE 754 standard requires that NaN compare not equal to any value, including itself. Many generic algorithms expect reflexivity of equality (that is, every value should be equal to itself), but this expectation does not hold in the case of NaNs.
Because most bits available to encode a floating-point value are not necessary to indicate that the value is NaN, the IEEE 754 standard specifies that unused bits can encode a payload, which can represent diagnostic information or can be put to some other use that is entirely left to the user’s discretion. The payload is ignored by all required IEEE 754 operations but is propagated when possible. NaNs with different payloads are nonetheless all considered to represent NaN.
Signaling NaNs are a reserved subset of NaNs that are intended to signal an “invalid” floating-point exception when used as input for operations; they are then “quieted” for propagation like other NaNs. As previously mentioned, floating-point exceptions are ignored in the default floating-point exception behavior, and it’s not possible to change the default behavior in Swift.
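The overflow, underflow, and invalid-operation cases described in this background can be observed directly. The following is a minimal sketch using Double; the constant names are chosen only for illustration.

```swift
// Overflow: the exact result is too large in magnitude to be finite.
let overflowed = Double.greatestFiniteMagnitude * 2
print(overflowed)        // inf

// Underflow: the exact result is too small in magnitude to be nonzero.
let underflowed = Double.leastNonzeroMagnitude / 4
print(underflowed)       // 0.0

// Invalid operation: dividing zero by zero produces NaN.
let zero = 0.0
let invalid = zero / zero
print(invalid.isNaN)     // true

// NaN propagates through later arithmetic, so one check at the end suffices.
let result = (invalid + 1) * 2 - 3
print(result.isNaN)      // true
```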
In Swift, zero and infinity (available as the static property infinity) behave as they do in other languages. The sign of zero or infinity can be changed using the prefix operator -.
Recall that integer literals do not support signed zero (in other words, -0 as Float evaluates to positive zero). Either use parentheses, as in -(0 as Float), or use a float literal, as in -0.0 as Float, to obtain the desired value.
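As a brief illustration of the points above, the sketch below compares the two zeros and then shows the three spellings of negative zero just discussed. The constant names are illustrative only.

```swift
let positiveZero = 0.0
let negativeZero = -0.0

// The two zeros compare equal...
print(positiveZero == negativeZero)   // true
// ...but they differ in sign and are not substitutable in division.
print(negativeZero.sign)              // minus
print(1 / positiveZero)               // inf
print(1 / negativeZero)               // -inf

// An integer literal cannot carry the sign of zero.
let fromIntegerLiteral = -0 as Float
let fromParentheses = -(0 as Float)
let fromFloatLiteral = -0.0 as Float
print(fromIntegerLiteral.sign)        // plus
print(fromParentheses.sign)           // minus
print(fromFloatLiteral.sign)          // minus
```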
In Swift, the static property nan is a quiet NaN of the corresponding binary floating-point type, and the static property signalingNaN is a signaling NaN of that type. The instance property isNaN evaluates to true for any NaN value, quiet or signaling; the instance property isSignalingNaN evaluates to true only for a signaling NaN.
Always use the expression x.isNaN (or, as is idiomatic in other languages, x != x) instead of using the expression x == .nan to test for the presence of NaN.
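The following sketch shows why the equality test fails and how the same pitfall surfaces in a generic algorithm such as contains(_:), which relies on ==. The array and its contents are made up for illustration.

```swift
let x = Double.nan

print(x.isNaN)       // true
print(x != x)        // true
print(x == .nan)     // false: NaN never compares equal, even to itself

// Generic algorithms that rely on == cannot find NaN either.
let values: [Double] = [1.0, .nan, 3.0]
print(values.contains(.nan))              // false
print(values.contains { $0.isNaN })       // true

// isNaN is true for both kinds of NaN; isSignalingNaN only for signaling NaN.
print(Double.signalingNaN.isNaN)          // true
print(Double.signalingNaN.isSignalingNaN) // true
print(Double.nan.isSignalingNaN)          // false
```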
It is possible to create a particular encoding of NaN by using the initializer init(nan:signaling:), where the first argument is the payload and the second is a Boolean value to indicate whether the result should be a signaling NaN. To change the sign of NaN, use the prefix operator -.
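A short sketch of the initializer in use follows. The payload value 0x2a is arbitrary and chosen only for illustration.

```swift
let quiet = Double(nan: 0x2a, signaling: false)
let signaling = Double(nan: 0x2a, signaling: true)

print(quiet.isNaN, quiet.isSignalingNaN)           // true false
print(signaling.isNaN, signaling.isSignalingNaN)   // true true

// The payload occupies the low bits of the significand field.
print(String(quiet.bitPattern, radix: 16))         // 7ff800000000002a

// The prefix operator - changes the sign of NaN.
let negated = -quiet
print(negated.isNaN, negated.sign)                 // true minus
```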
IEEE 754-2008 specifies that, for encodings of NaN, the most significant bit of the significand field is a flag that is nonzero if the NaN is quiet and zero if the NaN is signaling. All remaining bits of the significand field are part of the payload.
However, Swift also reserves the second-most significant bit of the significand field as a flag; that bit is nonzero when NaN is signaling. (The converse is not true: if the most significant bit of the significand field is nonzero, then NaN is quiet, as required by IEEE 754-2008, regardless of the value of the second-most significant bit.)
Therefore, the maximum bit width of the NaN payload is one less than that specified by IEEE 754-2008. For example, a runtime error results when attempting to use a 51-bit payload for a NaN of type Double:
Double(nan: 1 << 50, signaling: false)
// Fatal error: NaN payload is not encodable
However, if a Double value with such a payload is initialized by other means, Swift does correctly treat the value as a quiet NaN:
let x = Double(bitPattern: 0x7ffc000000000000)
(x.isNaN, x.isSignalingNaN) // (true, false)
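To make the bit layout concrete, the sketch below builds the widest payload that the initializer accepts for Double (50 bits) and inspects the two flag bits of the significand field. The names quietFlag and reservedFlag are illustrative and are not standard library API.

```swift
// Most significant bit of the 52-bit significand field: the IEEE 754-2008 quiet flag.
let quietFlag: UInt64 = 1 << 51
// Second-most significant bit: additionally reserved by Swift.
let reservedFlag: UInt64 = 1 << 50

// A 50-bit payload is the widest that init(nan:signaling:) accepts for Double.
let widestPayload: UInt64 = (1 << 50) - 1
let widest = Double(nan: widestPayload, signaling: false)
print(String(widest.bitPattern, radix: 16))                          // 7ffbffffffffffff
print(widest.significandBitPattern & quietFlag != 0)                 // true
print(widest.significandBitPattern & ~quietFlag == widestPayload)    // true

// A value with both flag bits set is still treated as a quiet NaN.
let both = Double(bitPattern: 0x7ffc000000000000)
print(both.significandBitPattern & reservedFlag != 0)                // true
print(both.isNaN, both.isSignalingNaN)                               // true false
```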
When an operation required by IEEE 754 returns NaN, its exact encoding is not specified by the standard. In Swift, the encoding can vary based on the underlying architecture, and results may not match those obtained in C/C++ on the same machine.
For example, in Swift on x86_64 macOS:
let x = -Double.nan
let y = x + 0
print(x.bitPattern == y.bitPattern)
// Prints "false"
By contrast, in C on x86_64 macOS:
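#include <stdint.h>  // uint64_t
#include <stdio.h>
#include <math.h>

// Overlay a double with its 64-bit encoding.
union f64 {
    uint64_t u64;
    double d;
};

int main(int argc, const char * argv[]) {
    union f64 x;
    x.d = -nan("");
    union f64 y;
    y.d = x.d + 0.;
    // Compare the raw encodings before and after the addition.
    printf("%s\n", x.u64 == y.u64 ? "true" : "false");
    return 0;
}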
// Prints "true"
Describing NaN
In Swift, any NaN value is described as “nan”:
Double.nan.description // "nan"
(-Double.nan).description // "nan"
Double.signalingNaN.description // "nan"
(-Double.signalingNaN).description // "nan"
Double(nan: 1, signaling: false)
.description // "nan"
The debug description can be used to obtain more information about the sign and payload of NaN values:
Double.nan.debugDescription // "nan"
(-Double.nan).debugDescription // "-nan"
Double.signalingNaN.debugDescription // "snan"
(-Double.signalingNaN).debugDescription // "-snan"
Double(nan: 1, signaling: false)
.debugDescription // "nan(0x1)"
These debug descriptions round-trip correctly when converted back to the floating-point type:
Double("snan")!.isSignalingNaN
// true
Double("nan(0x1)")!.isNaN
// true
String(Double("nan(0x1)")!.bitPattern, radix: 16)
// 7ff8000000000001
Minimum, maximum, and total order
The global functions Swift.min(_:_:) and Swift.max(_:_:) are available for comparing two values of any Comparable type. Because these functions rely on semantic guarantees violated by NaN, their behavior is unusual when one of the arguments is NaN:
Swift.min(Double.nan, 0) // NaN
Swift.min(0, Double.nan) // 0
The operations minNum and maxNum are defined in the IEEE 754-2008 standard and favor numbers over quiet NaN. (If any argument is a signaling NaN, however, the result is NaN; in a future version of Swift, the standard library will also favor numbers over signaling NaN.) These operations are available in Swift as the static methods T.minimum(_:_:) and T.maximum(_:_:), where T is a floating-point type:
Double.minimum(.nan, 0) // 0
Double.minimum(0, .nan) // 0
Note that -0.0 and 0.0 are considered substitutable for the purposes of these operations:
Double.minimum(-0.0, 0.0) // -0.0
Double.minimum(0.0, -0.0) // 0.0
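The difference matters most when reducing a whole collection. The following sketch contrasts Sequence.min(), which relies on <, with a reduction based on Double.minimum(_:_:). Seeding the reduction with infinity is an assumption made for this example; it means that an empty or all-NaN array yields infinity.

```swift
let values: [Double] = [.nan, 1, 3]
let reordered: [Double] = [1, .nan, 3]

// Sequence.min() relies on <, so the answer depends on where NaN appears.
print(values.min()!.isNaN)     // true
print(reordered.min()!)        // 1.0

// Double.minimum(_:_:) favors numbers over quiet NaN regardless of position.
let a = values.reduce(Double.infinity) { Double.minimum($0, $1) }
let b = reordered.reduce(Double.infinity) { Double.minimum($0, $1) }
print(a, b)                    // 1.0 1.0
```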
Analogous operations to compare the magnitudes of two floating-point values are known as minNumMag and maxNumMag in the IEEE 754-2008 standard and are available in Swift as T.minimumMagnitude(_:_:) and T.maximumMagnitude(_:_:), respectively.
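A brief sketch of the magnitude-based operations follows; the operands are arbitrary. Note that the result keeps the sign of the chosen operand.

```swift
// The operand with the smaller (or larger) magnitude is returned, sign intact.
let smaller = Double.minimumMagnitude(-1, 2)
let larger = Double.maximumMagnitude(-3, 2)
print(smaller, larger)     // -1.0 -3.0

// As with minimum and maximum, numbers are favored over quiet NaN.
let first = Double.minimumMagnitude(.nan, -2)
let second = Double.maximumMagnitude(-2, .nan)
print(first, second)       // -2.0 -2.0
```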
A total order for all possible representations in a floating-point type is defined in IEEE 754-2008. Recall that, using standard operators, NaN compares not equal to, not less than, and not greater than any value, and -0.0 and 0.0 compare equal to each other. The total ordering defined in IEEE 754-2008, however, places -0.0 below 0.0, positive NaN above positive infinity, and negative NaN below negative infinity. It further distinguishes encodings of NaN by their signaling bit and payload. Specifically:
- A quiet NaN is ordered above a signaling NaN if both are positive, and vice versa if both are negative.
- An encoding of NaN with larger payload (when interpreted as an integer) is ordered above an encoding of NaN with smaller payload if both are positive, and vice versa if both are negative.
This total order is available in Swift as the method isTotallyOrdered(belowOrEqualTo:). As clarified in the name, x.isTotallyOrdered(belowOrEqualTo: y) returns true if x orders below or equal to y in the total order prescribed in the IEEE 754-2008 standard.
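The sketch below checks a few of the orderings described above and then sorts an array by the total order. Because sorted(by:) expects a strict "is before" predicate, the closure negates the reversed comparison; that adaptation is a choice made for this example.

```swift
let negativeZero = -0.0
let positiveZero = 0.0

print(negativeZero.isTotallyOrdered(belowOrEqualTo: positiveZero))   // true
print(positiveZero.isTotallyOrdered(belowOrEqualTo: negativeZero))   // false
print(Double.infinity.isTotallyOrdered(belowOrEqualTo: .nan))        // true
print((-Double.nan).isTotallyOrdered(belowOrEqualTo: -.infinity))    // true

// x is strictly before y exactly when y is not below or equal to x.
let values: [Double] = [.nan, 1, -0.0, 0.0, -.infinity, -.nan, .infinity]
let ordered = values.sorted { !$1.isTotallyOrdered(belowOrEqualTo: $0) }
print(ordered.map { $0.debugDescription }.joined(separator: " "))
// -nan -inf -0.0 0.0 1.0 inf nan
```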
Floating-point remainder
Background:
IEEE 754 specifies a remainder operation that has behavior unlike that of the binary integer remainder operation.
In Swift, as in other languages, two similar remainder operations exist for floating-point types:
| | Nearest-to-zero remainder | Truncating remainder |
|---|---|---|
| Swift | x.remainder(dividingBy: y) | x.truncatingRemainder(dividingBy: y) |
| C | remainder(x, y) | fmod(x, y) |
| C# | Math.IEEERemainder(x, y) | x % y |
| Java | Math.IEEEremainder(x, y) | x % y |
| Kotlin | x.IEEErem(y) | x % y |
In early versions of Swift, the truncating remainder operation was spelled %. However, it was thought that users often used the operator incorrectly, so it was removed from floating-point types in the Swift Evolution proposal SE-0067: Enhanced floating-point protocols.
A simple example is sufficient to illustrate the difference between the two operations:
(-8).remainder(dividingBy: 5) // 2
(-8).truncatingRemainder(dividingBy: 5) // -3
The nearest-to-zero remainder of x divided by y is the exact result r such that x = y × q + r, where q is the integer nearest to x ÷ y (a tie between two integers is resolved in favor of the even one). (The actual computation is performed without intermediate rounding and q does not need to be representable as a value of any type.) Notice that the result could be positive or negative, regardless of the sign of the operands; the magnitude of the result is no more than half of the magnitude of the divisor (y).
The truncating remainder of x divided by y is the exact result s such that x = y × p + s, where p is the result of x ÷ y rounded toward zero to an integer value. (The actual computation is performed without intermediate rounding and p does not need to be representable as a value of any type.) Notice that the sign of the result is the same as that of the dividend (x); the magnitude of the result is always less than the magnitude of the divisor (y).
For both operations, the remainder of dividing an infinite value by any value, or of dividing any value by zero, is NaN.
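The definitions above can be checked numerically. In the following sketch, q and p are recovered by rounding x / y, which is adequate for these small operands but is not how the operations are defined in general (the real computation involves no intermediate rounding).

```swift
let x = -8.0
let y = 5.0

let r = x.remainder(dividingBy: y)             // 2.0
let s = x.truncatingRemainder(dividingBy: y)   // -3.0

// Recover the quotients for illustration only.
let q = (x / y).rounded(.toNearestOrEven)      // -2.0
let p = (x / y).rounded(.towardZero)           // -1.0
print(x == y * q + r, x == y * p + s)          // true true

// When x ÷ y lies exactly halfway between two integers, q is the even one.
let tieDown = (5.0).remainder(dividingBy: 2)   // q = 2
let tieUp = (7.0).remainder(dividingBy: 2)     // q = 4
print(tieDown, tieUp)                          // 1.0 -1.0

// An infinite dividend or a zero divisor yields NaN for both operations.
print(Double.infinity.remainder(dividingBy: 2).isNaN)   // true
print(x.truncatingRemainder(dividingBy: 0).isNaN)       // true
```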
Significand representation
Background:
Recall that the binary floating-point representation of a real number takes the form significand × 2^exponent. In the gap between zero and 2^emin, where emin is the minimum supported exponent of a binary floating-point type, a set of linearly spaced subnormal (or denormal) values are representable using a different binary representation than that of normal finite values.
The most significant place of the significand (representing its integral part) is always nonzero for normal values, and it is always zero for subnormal values (and for zero). In the IEEE 754 32-bit and 64-bit binary floating-point formats, this most significant place is implicit; for example, the significand of a 32-bit value is 24 bits, but only 23 bits are explicitly stored in memory. By contrast, in the extended-precision Float80 format, all 64 significand bits are explicitly stored in memory.
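The boundary between normal and subnormal values can be explored with a few standard library properties. This is a minimal sketch for Float, assuming a platform that supports subnormal values rather than flushing them to zero.

```swift
let smallestNormal = Float.leastNormalMagnitude       // 2^-126
let smallestSubnormal = Float.leastNonzeroMagnitude   // 2^-149

print(smallestNormal.isNormal)             // true
print((smallestNormal / 2).isSubnormal)    // true
print(smallestSubnormal.isSubnormal)       // true

// Subnormal values are linearly spaced: the next value up from the smallest
// subnormal is exactly twice that value.
print(smallestSubnormal.nextUp == 2 * smallestSubnormal)   // true
```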
Any Swift binary floating-point type T has several members related to the significand of a value x:
- T.significandBitCount
  The number of available fractional significand bits, or Int.max if there is no limit (i.e., if T is an arbitrary precision type).
  Whether implicit or explicit, the most significant place of the significand represents its integral part and is excluded from this reckoning. Therefore, Float.significandBitCount is 23 and Float80.significandBitCount is 63.
- x.significand
  A value that can be used to compute the magnitude of x by multiplication with exp2(T(x.exponent)).
  Note that the sign of significand is always positive for this purpose; therefore, you will need sign, exponent, and significand to recreate a decomposed floating-point value.
  If x is finite and nonzero, the result is always in the range 1..<2. (If x is subnormal, its significand bit pattern is shifted so that the leading nonzero bit is in the most significant place of the significand in order to produce the result.)
  If x is zero, infinite, or NaN, then the result is positive zero, positive infinity, or NaN, respectively.
- x.significandWidth
  The number of fractional significand bits required to represent the significand of x; x.significandWidth never exceeds T.significandBitCount.
  Whether implicit or explicit, the leading nonzero bit is excluded from this reckoning; this applies even when x is subnormal (when the leading nonzero bit is always explicitly stored in memory), since x.significand would be in the range 1..<2 and therefore a normal value. For example, T.leastNonzeroMagnitude has one nonzero significand bit, and T.leastNonzeroMagnitude.significandWidth is 0.
  Meanwhile, (0 as T) has zero nonzero significand bits, and (0 as T).significandWidth is -1. If x is infinite or NaN, x.significandWidth is also -1.
- x.significandBitPattern
  The fractional significand bits of x; you can recreate a floating-point value from its sign, exponentBitPattern, and significandBitPattern using the initializer init(sign:exponentBitPattern:significandBitPattern:).
  The significand bit pattern is obtained by bit masking the raw encoding of x; if the most significant place of the significand (representing the integral part) is explicitly stored (as it is in values of type Float80), that is excluded by the bit masking operation. However, if x is subnormal, the explicitly stored leading nonzero bit is not in the most significant place of the significand and is therefore not excluded; for example, T.leastNonzeroMagnitude.significandBitPattern is 1.
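The members listed above can be seen together in a small worked example. The sketch below uses the Float value 10, which is 1.25 × 2^3 (1.01 in binary, times 2^3); the constant names are illustrative only.

```swift
let x: Float = 10   // 1.25 × 2^3

print(Float.significandBitCount)    // 23
print(x.significand)                // 1.25
print(x.exponent)                   // 3
print(x.significandWidth)           // 2 (the fractional bits "01")
print(x.significandBitPattern)      // 2097152 (0.25 × 2^23)

// Two ways to recreate the value from its parts.
let recomposed = Float(sign: x.sign, exponent: x.exponent, significand: x.significand)
let reencoded = Float(sign: x.sign,
                      exponentBitPattern: x.exponentBitPattern,
                      significandBitPattern: x.significandBitPattern)
print(recomposed == x, reencoded == x)   // true true

// The smallest subnormal value has a single nonzero significand bit.
let tiny = Float.leastNonzeroMagnitude
print(tiny.significand)             // 1.0
print(tiny.significandWidth)        // 0
print(tiny.significandBitPattern)   // 1

// Zero and infinity are special cases.
print(Float(0).significandWidth)    // -1
print(Float.infinity.significand)   // inf
```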
It has been erroneously documented that T.infinity.significand is 1. Although that may have been the originally intended behavior, Swift’s implementations for concrete types have never behaved in that way.
9 March 2018–29 July 2019