# Concrete binary floating-point types, part 2

## Floating-point precision

Swift’s floating-point types are intended to model the real numbers, but (as with all such models) it is inexact. In fact, almost all real numbers cannot be represented exactly in binary floating-point format.

Background:Even for beginners, some caveats can be important to keep in mind when working with binary floating-point values:

- Many modest decimal fractions, such as 0.1, cannot be represented exactly.
- Many integral values less than
`greatestFiniteMagnitude`

cannot be represented exactly.- Basic arithmetic operations are often inexact; for example,
`a + b - b == a`

does not always evaluate to`true`

.

Though out of the scope of this article, some alternative choices for modeling real numbers are as follows:

Binary floating point

Pro: represents modest integers exactly, extremely fast hardware implementations, fixed memory size, and rounding errors are extremely uniform–they don’t vary much with the number being represented.

Con: almost no decimal fractions have exact representations.

Decimal floating point

Pro: represents modest integers and decimal fractions exactly, slower than binary but still faster than almost anything else, fixed memory size.

Con: at least an order of magnitude slower than binary floating point, and rounding error is significantly less scale-invariant.

Fixed-size rationals

Pro: represents all modestly-sized integers and fractions exactly, fixed memory size, four basic operations are exact until you hit the limits of representation.

Con: denominators quickly grow too quickly to be used for non-trivial computations (this is usually a deal-breaker).

Arbitrary-precision rationals

Pro: closed under four basic operations, represents most numbers most people will use exactly.

Con: representations get extremely large extremely quickly, large memory footprint if you have more than a few numbers.

Computable real numbers

Pro: any number you can describe, you can work with.

Con: your numbers are now computer programs, and you arithmetic system is Turing-complete. Testing for equality is equivalent to solving the halting problem.

The Swift standard library provides binary floating-point types. Foundation offers a decimal floating-point type discussed later; alternative implementations that adhere to IEEE 754, such as those from IBM (ICU License) or Intel (BSD License), can be wrapped in Swift. Third-party libraries can provide a fixed-width rational type, and existing libraries with C interfaces can be wrapped to provide an arbitrary-precision rational type (or one can be implemented natively in Swift).

### Striding

Background:As would occur in any other language, repeated addition of a

`stride`

amount to a binary floating-point value can cause accumulation of rounding error:`let stride = 0.1 var x = 1.0 x += stride x += stride print(x - 1.2) // 2.2204460492503131e-16 x += stride x += stride print(x - 1.4) // 4.4408920985006262e-16`

In Swift 4.0, `stride(from: 1.0, to: 2.0, by: 0.1)`

avoids accumulation of
rounding error by instead computing the sequence of values as follows: `1.0`

,
`1.0 + 1.0 * 0.1`

, `1.0 + 2.0 * 0.1`

, `1.0 + 3.0 * 0.1`

…

Although this method of computing the sequence avoids an undesired artifact, users should nonetheless be aware that the result will not be equivalent to that obtained by repeated addition.

### Fused multiply-add

In Swift 4.0, the sequence `stride(from: -0.2, through: 1.0, by: 0.2)`

does not
include the value `1.0`

.

Despite avoiding accumulated rounding error from repeated addition, the
expression `-0.2 + 6.0 * 0.2`

evaluates to `1.0000000000000002`

due to
**intermediate rounding error**.

The IEEE 754 operation **fusedMultiplyAdd** allows the same computation to be
performed in one step with a single rounding, improving the accuracy of the
result. This operation is available in Swift as the instance method
`addingProduct(_:_:)`

.

For example:

```
-0.2 + 6.0 * 0.2 // 1.0000000000000002
(-0.2).addingProduct(6.0, 0.2) // 1
```

By altering the implementation of `stride(from:through:by:)`

in Swift 4.1 to
use the fused multiply-add operation, the sequence
`stride(from: -0.2, through: 1.0, by: 0.2)`

now includes the value `1.0`

.

The fusedMultiplyAdd operation was added to Swift as part of the Swift Evolution proposal SE-0067: Enhanced floating-point protocols. Intermediate rounding error in floating-point strides was eliminated in the Swift standard library in late 2017.

**A caveat about fused multiply-add operations:** Although eliminating
intermediate rounding can improve the accuracy of results, it’s not always the
case that an algorithm will benefit from its use. As William Kahan points
out, for a sufficiently large value `x`

, we can see a surprising
result:

```
let x = 9007199254740991.0
(x * x - x * x).squareRoot() // 0, as expected
(x * x).addingProduct(-x, x).squareRoot() // nan
```

The second result isn’t zero because `x * x`

cannot be represented exactly as a
value of type `Double`

. In fact, `(x * x).addingProduct(-x, x)`

actually
computes the amount by which `x * x`

is inexact. Since `x * x`

is rounded down,
the amount of inexactness is negative and its square root is not a number.

### Unit in the last place

Background:The binary floating-point representation of a real number takes the form

significand× 2^{exponent}. Theulp, or unit in the last place, of a finite floating-point value is the value of 1 in the least significant (i.e., last) place of the significand.

In general, the property `ulp`

is equivalent to the distance between a finite
floating-point value and the nearest representable value greater in magnitude.
However, `greatestFiniteMagnitude.ulp`

is finite even though the nearest
representable value greater in magnitude than `greatestFiniteMagnitude`

is
`infinity`

.

In Swift, the `ulp`

of a non-finite value (whether infinite or NaN) is NaN. In
Java, by contrast, `Math.ulp(Double.POSITIVE_INFINITY)`

evaluates to positive
infinity, and the same result is obtained when using negative infinity as the
argument.

As previously mentioned, the Swift equivalent to the C constants known as
`FLT_EPSILON`

and `DBL_EPSILON`

is a static property named `ulpOfOne`

. That
name was chosen in order to prevent confusion surrounding the definition and
proper usage of the property. As the name suggests, `T.ulpOfOne`

is equivalent
to `(1 as T).ulp`

for any floating-point type `T`

.

### Approximating π

In Swift, each floating-point type has a static property `pi`

that provides the
value for π at its best possible precision, accurately *rounded toward zero*.

The purpose of rounding toward zero is to avoid angles computed in radians
from being rounded to a different quadrant. As a consequence,
`Float.pi < Float(Double.pi)`

evaluates to `true`

.

It so happens that rounding π to the nearest representable value of type

`Double`

is equivalent to rounding π toward zero. However, rounding π to the nearest representable value of type`Float`

isnotequivalent to rounding π toward zero.

### Subnormal values on 32-bit ARM

Background:In the gap between zero and 2

^{emin}, whereeminis the minimum supported exponent of a binary floating-point type, a set of linearly spacedsubnormal(or denormal) values are representable using a different binary representation than that ofnormalfinite values.On 32-bit ARMv7, the vector floating-point (VFP) co-processor supports a

flush-to-zero (FZ) modefor floating-point operations that is not compliant with IEEE 754. When the FZ bit is set, which it is by default, operations that would otherwise return a subnormal value instead return zero. Meanwhile, the NEON SIMD co-processor on ARMv7 always uses flush-to-zero mode regardless of the FZ bit.

For iOS platforms, it is possible to clear the ARMv7 FZ bit in C using inline assembler; however, doing so has a negative effect on performance. As Swift does not support inline assembler, it is not possible to disable flush-to-zero mode from Swift.

**On 32-bit ARM, Swift floating-point types skip subnormal values.** For
example, `(0 as Double).nextUp`

evaluates to the least normal magnitude and
`(0 as Double).ulp`

evaluates to zero.

### String representation

Until Swift 4.2, floating-point values were converted to strings using an
algorithm implemented in C++ that was based on the C11 function
`vsnprintf`

.

A finite value’s `description`

and `debugDescription`

could be different from
each other because the precision of `description`

was less than that of
`debugDescription`

.

Specifically, the precision of

`description`

was`std::numeric_limits<T>::digits10`

and the precision of`debugDescription`

was`std::numeric_limits<T>::max_digits10`

.

The previous algorithm didn’t guarantee **round-trip accuracy** for
`description`

, which is to say that `Double(x.description) == x`

was not true
for all values of `x`

. In addition, the algorithm would routinely include
extraneous digits in `debugDescription`

:

```
// Swift 4.1
debugPrint(1.1) // 1.1000000000000001
```

In the past decade, new algorithms have been described that produce optimal string representations of floating-point values with good performance. In 2015, Rust switched its implementation to a combination of the Grisu3 and Dragon4 algorithms.

**Beginning in Swift 4.2, floating-point values are converted to strings using
a variation of the Grisu2 algorithm** that incorporates changes
outlined in the paper describing the Errol3 algorithm. Swift’s new
algorithm, which is implemented in C, shows better performance in benchmarks
than either Errol4 (the successor to Errol3) or Grisu3 with fallback to Dragon4.
For values other than NaN, `description`

and `debugDescription`

now give the
same result:

```
// Swift 4.2
debugPrint(1.1) // 1.1
```

Descriptions of NaN are discussed later.

Previous:

Concrete binary floating-point types, part 1

Next:

Concrete binary floating-point types, part 3

*27 February–8 March 2018*

*Updated 18 August 2018*