Concrete binary floating-point types, part 2
Swift’s floating-point types are intended to model the real numbers, but (as with all such models) it is inexact. In fact, almost all real numbers cannot be represented exactly in binary floating-point format.
Even for beginners, some caveats can be important to keep in mind when working with binary floating-point values:
- Many modest decimal fractions, such as 0.1, cannot be represented exactly.
- Many integral values less than
greatestFiniteMagnitudecannot be represented exactly.
- Basic arithmetic operations are often inexact; for example,
a + b - b == adoes not always evaluate to
Though out of the scope of this article, some alternative choices for modeling real numbers are as follows:
Binary floating point
Pro: represents modest integers exactly, extremely fast hardware implementations, fixed memory size, and rounding errors are extremely uniform–they don’t vary much with the number being represented.
Con: almost no decimal fractions have exact representations.
Decimal floating point
Pro: represents modest integers and decimal fractions exactly, slower than binary but still faster than almost anything else, fixed memory size.
Con: at least an order of magnitude slower than binary floating point, and rounding error is significantly less scale-invariant.
Pro: represents all modestly-sized integers and fractions exactly, fixed memory size, four basic operations are exact until you hit the limits of representation.
Con: denominators quickly grow too quickly to be used for non-trivial computations (this is usually a deal-breaker).
Pro: closed under four basic operations, represents most numbers most people will use exactly.
Con: representations get extremely large extremely quickly, large memory footprint if you have more than a few numbers.
Computable real numbers
Pro: any number you can describe, you can work with.
Con: your numbers are now computer programs, and you arithmetic system is Turing-complete. Testing for equality is equivalent to solving the halting problem.
The Swift standard library provides binary floating-point types. Foundation offers a decimal floating-point type discussed later; alternative implementations that adhere to IEEE 754, such as those from IBM (ICU License) or Intel (BSD License), can be wrapped in Swift. Third-party libraries can provide a fixed-width rational type, and existing libraries with C interfaces can be wrapped to provide an arbitrary-precision rational type (or one can be implemented natively in Swift).
As would occur in any other language, repeated addition of a
strideamount to a binary floating-point value can cause accumulation of rounding error:
let stride = 0.1 var x = 1.0 x += stride x += stride print(x - 1.2) // 2.2204460492503131e-16 x += stride x += stride print(x - 1.4) // 4.4408920985006262e-16
In Swift 4.0,
stride(from: 1.0, to: 2.0, by: 0.1) avoids accumulation of
rounding error by instead computing the sequence of values as follows:
1.0 + 1.0 * 0.1,
1.0 + 2.0 * 0.1,
1.0 + 3.0 * 0.1…
Although this method of computing the sequence avoids an undesired artifact, users should nonetheless be aware that the result will not be equivalent to that obtained by repeated addition.
In Swift 4.0, the sequence
stride(from: -0.2, through: 1.0, by: 0.2) does not
include the value
Despite avoiding accumulated rounding error from repeated addition, the
-0.2 + 6.0 * 0.2 evaluates to
1.0000000000000002 due to
intermediate rounding error.
The IEEE 754 operation fusedMultiplyAdd allows the same computation to be
performed in one step with a single rounding, improving the accuracy of the
result. This operation is available in Swift as the instance method
-0.2 + 6.0 * 0.2 // 1.0000000000000002 (-0.2).addingProduct(6.0, 0.2) // 1
By altering the implementation of
stride(from:through:by:) in Swift 4.1 to
use the fused multiply-add operation, the sequence
stride(from: -0.2, through: 1.0, by: 0.2) now includes the value
The fusedMultiplyAdd operation was added to Swift as part of the Swift Evolution proposal SE-0067: Enhanced floating-point protocols. Intermediate rounding error in floating-point strides was eliminated in the Swift standard library in late 2017.
A caveat about fused multiply-add operations: Although eliminating
intermediate rounding can improve the accuracy of results, it’s not always the
case that an algorithm will benefit from its use. As William Kahan points
out, for a sufficiently large value
x, we can see a surprising
let x = 9007199254740991.0 (x * x - x * x).squareRoot() // 0, as expected (x * x).addingProduct(-x, x).squareRoot() // nan
The second result isn’t zero because
x * x cannot be represented exactly as a
value of type
Double. In fact,
(x * x).addingProduct(-x, x) actually
computes the amount by which
x * x is inexact. Since
x * x is rounded down,
the amount of inexactness is negative and its square root is not a number.
Unit in the last place
The binary floating-point representation of a real number takes the form significand × 2exponent. The ulp, or unit in the last place, of a finite floating-point value is the value of 1 in the least significant (i.e., last) place of the significand.
In general, the property
ulp is equivalent to the distance between a finite
floating-point value and the nearest representable value greater in magnitude.
greatestFiniteMagnitude.ulp is finite even though the nearest
representable value greater in magnitude than
In Swift, the
ulp of a non-finite value (whether infinite or NaN) is NaN. In
Java, by contrast,
Math.ulp(Double.POSITIVE_INFINITY) evaluates to positive
infinity, and the same result is obtained when using negative infinity as the
As previously mentioned, the Swift equivalent to the C constants known as
DBL_EPSILON is a static property named
name was chosen in order to prevent confusion surrounding the definition and
proper usage of the property. As the name suggests,
T.ulpOfOne is equivalent
(1 as T).ulp for any floating-point type
In Swift, each floating-point type has a static property
pi that provides the
value for π at its best possible precision, accurately rounded toward zero.
The purpose of rounding toward zero is to avoid angles computed in radians
from being rounded to a different quadrant. As a consequence,
Float.pi < Float(Double.pi) evaluates to
It so happens that rounding π to the nearest representable value of type
Doubleis equivalent to rounding π toward zero. However, rounding π to the nearest representable value of type
Floatis not equivalent to rounding π toward zero.
Subnormal values on 32-bit ARM
In the gap between zero and 2emin, where emin is the minimum supported exponent of a binary floating-point type, a set of linearly spaced subnormal (or denormal) values are representable using a different binary representation than that of normal finite values.
On 32-bit ARMv7, the vector floating-point (VFP) co-processor supports a flush-to-zero (FZ) mode for floating-point operations that is not compliant with IEEE 754. When the FZ bit is set, which it is by default, operations that would otherwise return a subnormal value instead return zero. Meanwhile, the NEON SIMD co-processor on ARMv7 always uses flush-to-zero mode regardless of the FZ bit.
For iOS platforms, it is possible to clear the ARMv7 FZ bit in C using inline assembler; however, doing so has a negative effect on performance. As Swift does not support inline assembler, it is not possible to disable flush-to-zero mode from Swift.
On 32-bit ARM, Swift floating-point types skip subnormal values. For
(0 as Double).nextUp evaluates to the least normal magnitude and
(0 as Double).ulp evaluates to zero.
Until Swift 4.2, floating-point values were converted to strings using an
algorithm implemented in C++ that was based on the C11 function
A finite value’s
debugDescription could be different from
each other because the precision of
description was less than that of
The previous algorithm didn’t guarantee round-trip accuracy for
description, which is to say that
Double(x.description) == x was not true
for all values of
x. In addition, the algorithm would routinely include
extraneous digits in
// Swift 4.1 debugPrint(1.1) // 1.1000000000000001
In the past decade, new algorithms have been described that produce optimal string representations of floating-point values with good performance. In 2015, Rust switched its implementation to a combination of the Grisu3 and Dragon4 algorithms.
Beginning in Swift 4.2, floating-point values are converted to strings using
a variation of the Grisu2 algorithm that incorporates changes
outlined in the paper describing the Errol3 algorithm. Swift’s new
algorithm, which is implemented in C, shows better performance in benchmarks
than either Errol4 (the successor to Errol3) or Grisu3 with fallback to Dragon4.
For values other than NaN,
debugDescription now give the
// Swift 4.2 debugPrint(1.1) // 1.1
Descriptions of NaN are discussed later.
27 February–8 March 2018
Updated 18 August 2018