Concrete binary floating-point types, part 2

Floating-point precision

Swift’s floating-point types are intended to model the real numbers, but (as with all such models) it is inexact. In fact, almost all real numbers cannot be represented exactly in binary floating-point format.

Background:

Even for beginners, some caveats can be important to keep in mind when working with binary floating-point values:

Many modest decimal fractions, such as 0.1, cannot be represented exactly.

Many integral values less than greatestFiniteMagnitude cannot be represented exactly.

Basic arithmetic operations are often inexact; for example, a + b - b == a does not always evaluate to true.

Though out of the scope of this article, some alternative choices for modeling real numbers are as follows:

Binary floating point
Pro: represents modest integers exactly, extremely fast hardware implementations, fixed memory size, and rounding errors are extremely uniform–they don’t vary much with the number being represented.
Con: almost no decimal fractions have exact representations.

Decimal floating point
Pro: represents modest integers and decimal fractions exactly, slower than binary but still faster than almost anything else, fixed memory size.
Con: at least an order of magnitude slower than binary floating point, and rounding error is significantly less scale-invariant.

Fixed-size rationals
Pro: represents all modestly-sized integers and fractions exactly, fixed memory size, four basic operations are exact until you hit the limits of representation.
Con: denominators quickly grow too quickly to be used for non-trivial computations (this is usually a deal-breaker).

Arbitrary-precision rationals
Pro: closed under four basic operations, represents most numbers most people will use exactly.
Con: representations get extremely large extremely quickly, large memory footprint if you have more than a few numbers.

Computable real numbers
Pro: any number you can describe, you can work with.
Con: your numbers are now computer programs, and you arithmetic system is Turing-complete. Testing for equality is equivalent to solving the halting problem.

— Stephen Canon, Sep. 12, 2016

The Swift standard library provides binary floating-point types. Foundation offers a decimal floating-point type discussed later; alternative implementations that adhere to IEEE 754, such as those from IBM (ICU License) or Intel (BSD License), can be wrapped in Swift. Third-party libraries can provide a fixed-width rational type, and existing libraries with C interfaces can be wrapped to provide an arbitrary-precision rational type (or one can be implemented natively in Swift).

Striding

Background:

As would occur in any other language, repeated addition of a stride amount to a binary floating-point value can cause accumulation of rounding error:
let stride = 0.1
var x = 1.0

x += stride
x += stride
print(x - 1.2)  // 2.2204460492503131e-16

x += stride
x += stride
print(x - 1.4)  // 4.4408920985006262e-16

In Swift 4.0, stride(from: 1.0, to: 2.0, by: 0.1) avoids accumulation of rounding error by instead computing the sequence of values as follows: 1.0, 1.0 + 1.0 * 0.1, 1.0 + 2.0 * 0.1, 1.0 + 3.0 * 0.1…

Although this method of computing the sequence avoids an undesired artifact, users should nonetheless be aware that the result will not be equivalent to that obtained by repeated addition.

Fused multiply-add

In Swift 4.0, the sequence stride(from: -0.2, through: 1.0, by: 0.2) does not include the value 1.0.

Despite avoiding accumulated rounding error from repeated addition, the expression -0.2 + 6.0 * 0.2 evaluates to 1.0000000000000002 due to intermediate rounding error.

The IEEE 754 operation fusedMultiplyAdd allows the same computation to be performed in one step with a single rounding, improving the accuracy of the result. This operation is available in Swift as the instance method addingProduct(_:_:).

For example:

-0.2 + 6.0 * 0.2               // 1.0000000000000002
(-0.2).addingProduct(6.0, 0.2) // 1

By altering the implementation of stride(from:through:by:) in Swift 4.1 to use the fused multiply-add operation, the sequence stride(from: -0.2, through: 1.0, by: 0.2) now includes the value 1.0.

The fusedMultiplyAdd operation was added to Swift as part of the Swift Evolution proposal SE-0067: Enhanced floating-point protocols. Intermediate rounding error in floating-point strides was eliminated in the Swift standard library in late 2017.

A caveat about fused multiply-add operations: Although eliminating intermediate rounding can improve the accuracy of results, it’s not always the case that an algorithm will benefit from its use. As William Kahan points out, for a sufficiently large value x, we can see a surprising result:

let x = 9007199254740991.0
(x * x - x * x).squareRoot()              // 0, as expected
(x * x).addingProduct(-x, x).squareRoot() // nan

The second result isn’t zero because x * x cannot be represented exactly as a value of type Double. In fact, (x * x).addingProduct(-x, x) actually computes the amount by which x * x is inexact. Since x * x is rounded down, the amount of inexactness is negative and its square root is not a number.

Unit in the last place

Background:

The binary floating-point representation of a real number takes the form significand × 2^exponent. The ulp, or unit in the last place, of a finite floating-point value is the value of 1 in the least significant (i.e., last) place of the significand.

In general, the property ulp is equivalent to the distance between a finite floating-point value and the nearest representable value greater in magnitude. However, greatestFiniteMagnitude.ulp is finite even though the nearest representable value greater in magnitude than greatestFiniteMagnitude is infinity.

In Swift, the ulp of a non-finite value (whether infinite or NaN) is NaN. In Java, by contrast, Math.ulp(Double.POSITIVE_INFINITY) evaluates to positive infinity, and the same result is obtained when using negative infinity as the argument.

As previously mentioned, the Swift equivalent to the C constants known as FLT_EPSILON and DBL_EPSILON is a static property named ulpOfOne. That name was chosen in order to prevent confusion surrounding the definition and proper usage of the property. As the name suggests, T.ulpOfOne is equivalent to (1 as T).ulp for any floating-point type T.

Approximating π

In Swift, each floating-point type has a static property pi that provides the value for π at its best possible precision, accurately rounded toward zero.

The purpose of rounding toward zero is to avoid angles computed in radians from being rounded to a different quadrant. As a consequence, Float.pi < Float(Double.pi) evaluates to true.

It so happens that rounding π to the nearest representable value of type Double is equivalent to rounding π toward zero. However, rounding π to the nearest representable value of type Float is not equivalent to rounding π toward zero.

Subnormal values on 32-bit ARM

Background:

In the gap between zero and 2^emin, where emin is the minimum supported exponent of a binary floating-point type, a set of linearly spaced subnormal (or denormal) values are representable using a different binary representation than that of normal finite values.

On 32-bit ARMv7, the vector floating-point (VFP) co-processor supports a flush-to-zero (FZ) mode for floating-point operations that is not compliant with IEEE 754. When the FZ bit is set, which it is by default, operations that would otherwise return a subnormal value instead return zero. Meanwhile, the NEON SIMD co-processor on ARMv7 always uses flush-to-zero mode regardless of the FZ bit.

For iOS platforms, it is possible to clear the ARMv7 FZ bit in C using inline assembler; however, doing so has a negative effect on performance. As Swift does not support inline assembler, it is not possible to disable flush-to-zero mode from Swift.

On 32-bit ARM, Swift floating-point types skip subnormal values. For example, (0 as Double).nextUp evaluates to the least normal magnitude and (0 as Double).ulp evaluates to zero.

String representation

Until Swift 4.2, floating-point values were converted to strings using an algorithm implemented in C++ that was based on the C11 function vsnprintf.

A finite value’s description and debugDescription could be different from each other because the precision of description was less than that of debugDescription.

Specifically, the precision of description was std::numeric_limits<T>::digits10 and the precision of debugDescription was std::numeric_limits<T>::max_digits10.

The previous algorithm didn’t guarantee round-trip accuracy for description, which is to say that Double(x.description) == x was not true for all values of x. In addition, the algorithm would routinely include extraneous digits in debugDescription:

// Swift 4.1
debugPrint(1.1) // 1.1000000000000001

In the past decade, new algorithms have been described that produce optimal string representations of floating-point values with good performance. In 2015, Rust switched its implementation to a combination of the Grisu3 and Dragon4 algorithms.

Beginning in Swift 4.2, floating-point values are converted to strings using a variation of the Grisu2 algorithm that incorporates changes outlined in the paper describing the Errol3 algorithm. Swift’s new algorithm, which is implemented in C, shows better performance in benchmarks than either Errol4 (the successor to Errol3) or Grisu3 with fallback to Dragon4. For values other than NaN, description and debugDescription now give the same result:

// Swift 4.2
debugPrint(1.1) // 1.1

Descriptions of NaN are discussed later.

Previous:
Concrete binary floating-point types, part 1

Next:
Concrete binary floating-point types, part 3

27 February–8 March 2018
Updated 18 August 2018