Round-off error and finite precision representations#
import numpy as np
from numpy import binary_repr, pi, e, sqrt
Round-off errors are an unavoidable consequence of representing numbers with a finite number of digits. This issue is particularly prominent with irrational numbers, but it also affects rational numbers that cannot be perfectly represented in a given base.
print(f"Pi: {pi}")
print(f"e: {e}")
print(f"1/3: {1/3}")
print(f"sqrt(2): {sqrt(2)}")
Pi: 3.141592653589793
e: 2.718281828459045
1/3: 0.3333333333333333
sqrt(2): 1.4142135623730951
The fundamental problem is that we can only store a finite number of digits for any given number. As a result, we must either round or truncate the number, leading to a small error. While this error may seem insignificant for a single calculation, it can accumulate over millions of computations, leading to significant discrepancies.
The average human is capable of around one mistake per second, but computers can make millions of mistakes a second!
The magnitude of round-off error is influenced by the precision (the number of digits used) and the base of the number system.
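For example, the same value stored in single precision (fewer bits) carries a larger round-off error than in double precision. A quick comparison using NumPy's float32 and float64 types:

# Store 1/3 with 32 bits and with 64 bits and compare the stored values
third_single = np.float32(1 / 3)
third_double = np.float64(1 / 3)
print(f"float32: {float(third_single):.20f}")
print(f"float64: {float(third_double):.20f}")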
Binary Representation#
Humans typically use a base-10 numbering system, also known as the decimal system. This is likely because we have ten fingers.
An exception to this is the Mayans, who used a base-20 (vigesimal) system.
In the decimal system, each digit’s position corresponds to a power of 10. For example, the number 1305 can be expressed as:
\(1305_{10} = 5 \times 10^0 + 0 \times 10^1 + 3 \times 10^2 + 1 \times 10^3\)
Computers, on the other hand, use a base-2 (binary) system, where each bit can be either 0 or 1. The same number, 1305, is written in binary as:
binary_repr(1305)
'10100011001'
We can verify this by converting the binary representation back to decimal:
\(10100011001_2 = 1 \times 2^0 + 0 \times 2^1 + 0 \times 2^2 + 1 \times 2^3 + 1 \times 2^4 + \)
\(\quad 0 \times 2^5 + 0 \times 2^6 + 0 \times 2^7 + 1 \times 2^8 + 0 \times 2^9 + 1 \times 2^{10}\)
\(= 1305_{10}\)
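We can also let Python perform this check for us, using the built-in int constructor with base 2:

# Parse the binary string back into a decimal integer (gives 1305)
print(int("10100011001", 2))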
Binary representation can also include fractional parts, similar to a decimal point. For example, the decimal number \(54.75_{10}\) can be written in binary.
def decimal_to_binary(number):
    """Converts a decimal number to its binary representation."""
    # Convert the integer part to binary
    integer_part = binary_repr(int(number))
    # Convert the fractional part to binary, one bit at a time
    fractional_part = number - int(number)
    binary_fractional_part = ""
    for i in range(20):
        fractional_part *= 2
        if fractional_part >= 1:
            binary_fractional_part += "1"
            fractional_part -= 1
        else:
            binary_fractional_part += "0"
        if fractional_part == 0:
            break
    # Combine the integer and fractional parts
    binary_representation = integer_part + "." + binary_fractional_part
    return binary_representation
print(f"54.75 in binary is: {decimal_to_binary(54.75)}")
54.75 in binary is: 110110.11
Check: \(110110.11_2 = (1 \times 2^5 + 1 \times 2^4 + 0 \times 2^3 + 1 \times 2^2 + 1 \times 2^1 + 0 \times 2^0) + \)
\(\quad (1 \times 2^{-1} + 1 \times 2^{-2}) = 54.75_{10}\)
Example: Convert 0.1 to Binary#
print(f"0.1 in binary is: {decimal_to_binary(0.1)}")
0.1 in binary is: 0.00011001100110011001
The binary representation of 0.1 is a repeating fraction, which means it cannot be perfectly represented with a finite number of bits.
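A familiar consequence is that arithmetic with such values does not come out exactly as expected:

# 0.1 and 0.2 are both rounded when stored, so their sum is not exactly 0.3
print(0.1 + 0.2)
print(0.1 + 0.2 == 0.3)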
Precision#
Computers store data in units called words. The number of bits in a word determines the precision. The IEEE standard for floating-point arithmetic defines the following precisions:
| Precision | Number of Bits |
|---|---|
| Single | 32 |
| Double | 64 |
| Quad | 128 |
For comparison, the binary representation of 1305 (10100011001) requires 11 bits. Modern computers typically use double precision for floating-point calculations, which is the default in Python 3.
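NumPy can report the machine parameters of each floating-point precision. For instance, np.finfo shows the word size and the approximate number of significant decimal digits:

# Machine parameters for single- and double-precision floats
for dtype in (np.float32, np.float64):
    info = np.finfo(dtype)
    print(f"{dtype.__name__}: {info.bits} bits, ~{info.precision} significant decimal digits")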
Integers#
Integers are a fundamental data type for whole numbers and do not suffer from round-off error. However, they have a limited range based on the number of bits used for their storage.
An integer stored with a given number of bits can take \(2^{\text{bits}}\) distinct values. For signed integers, this range is split between positive and negative values; because zero occupies one of the non-negative patterns, the negative range extends one further than the positive range.
The minimum and maximum values for a signed integer are:
\(min = -2^{\text{bits}-1}\)
\(max = 2^{\text{bits}-1} - 1\)
Modern computers use a method called Two’s Complement to represent signed integers, rather than a simple sign bit.
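We can see the two's-complement bit patterns directly by giving binary_repr a fixed width (8 bits here, chosen only for readability):

# Two's complement: 8-bit patterns for +5 and -5
print(binary_repr(5, width=8))   # 00000101
print(binary_repr(-5, width=8))  # 11111011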
Example: What is the largest integer a 64-bit (double-precision word) variable can store?#
print(f"min: {-2**63}")
print(f"max: {2**63-1}")
min: -9223372036854775808
max: 9223372036854775807
We can check with NumPy's built-in machine-limits inspector, np.iinfo:
print(np.iinfo(np.int64))
Machine parameters for int64
---------------------------------------------------------------
min = -9223372036854775808
max = 9223372036854775807
---------------------------------------------------------------
Example 2: Overflow error#
What happens when you try to store a number that is too large?
# works because 2**62 is within the range of a 64-bit integer
print(np.int64(2**62))
# This will cause an overflow error because 2**63 is too large
print(np.int64(2**63))
4611686018427387904
---------------------------------------------------------------------------
OverflowError Traceback (most recent call last)
Cell In[8], line 5
2 print(np.int64(2**62))
4 # This will cause an overflow error because 2**63 is too large
----> 5 print(np.int64(2**63))
OverflowError: Python int too large to convert to C long
Let’s try some more.
# This also works
print(np.int64(1000000000000000000))
# This will also cause an overflow error
print(np.int64(10000000000000000000))
1000000000000000000
---------------------------------------------------------------------------
OverflowError Traceback (most recent call last)
Cell In[9], line 5
2 print(np.int64(1000000000000000000))
4 # This will also cause an overflow error
----> 5 print(np.int64(10000000000000000000))
OverflowError: Python int too large to convert to C long
Floating-Point Numbers#
Writing out very large numbers like \(10,000,000,000,000,000,000\) is impractical. It is much more efficient to use scientific notation to represent the magnitude as an exponent:
\(10,000,000,000,000,000,000 = 10^{19}\)
Floating-Point Decimal Numbers (Engineering Notation)#
We can remove leading and placeholder trailing zeros by letting the decimal point float: the number is split into a fractional part (the mantissa, or significand) and an order of magnitude (the exponent).
Engineering Notation: \(mantissa \times 10^{exponent}\)
| Decimal | Engineering Notation | Mantissa | Exponent |
|---|---|---|---|
| \(265.73\) | \(2.6573 \times 10^2\) | 2.6573 | 2 |
| \(0.0001\) | \(1 \times 10^{-4}\) | 1 | -4 |
| \(-0.0034123\) | \(-3.4123 \times 10^{-3}\) | -3.4123 | -3 |
| \(1500^*\) | \(1.5 \times 10^3\) | 1.5 | 3 |
*Assuming the trailing zeros are not significant.
Note:
The mantissa is a fraction. If we require the decimal point to sit immediately after the first digit, we can drop the point and store the mantissa as an integer.
The exponent is the power of the number system’s base (in this case, 10).
def to_engineering_notation(number):
    """Converts a number to engineering notation."""
    if number == 0:
        return "0"
    exponent = 0
    # Scale the number into [1, 10) while tracking the power of ten
    while abs(number) < 1:
        number *= 10
        exponent -= 1
    while abs(number) >= 10:
        number /= 10
        exponent += 1
    return f"{number}E{exponent}"
print(to_engineering_notation(265.73))
print(to_engineering_notation(0.0001))
print(to_engineering_notation(-.0034123))
print(to_engineering_notation(1500))
print(to_engineering_notation(0))
2.6573E2
1.0E-4
-3.4123E-3
1.5E3
0
Floating-Point Binary Numbers#
The same floating-point concept can be applied to binary numbers using base 2:
\(mantissa \times 2^{exponent}\)
Example: Convert 54.75 into Floating-Point Binary#
\(54.75_{10} = 110110.11_2\)
To normalize this, we move the binary point so that it is after the first non-zero digit:
\(1.1011011_2 \times 2^5\)
The exponent is 5, which in binary is \(101_2\). So the full floating-point representation is:
\(1.1011011_2 \times 2^{101_2}\)
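Python can display this normalized form directly through float.hex(), which writes the mantissa in hexadecimal and the exponent (a power of 2) in decimal:

# 0x1.b6p+5: mantissa 1.b6 in hex = 1.1011011 in binary, exponent 5
print((54.75).hex())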
Precision in Floating-Point Numbers#
If the mantissa and exponent had infinite range, we could represent all numbers perfectly. However, we are limited by the number of bits (precision). The bits are divided into three parts according to the IEEE 754 standard:
| Precision | Total Bits | Sign | Exponent | Mantissa |
|---|---|---|---|---|
| Single | 32 | 1 | 8 | 23 |
| Double | 64 | 1 | 11 | 52 |
| Quad | 128 | 1 | 15 | 112 |
The ‘sign’ bit determines whether the number is positive or negative. The ‘exponent’ bits store the exponent, and the ‘mantissa’ bits store the fractional part of the number.
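To make this concrete, here is a minimal sketch (using the standard struct module, which is not imported above) that unpacks the 64 bits of a double-precision number into its sign, exponent, and mantissa fields:

import struct

def double_bits(x):
    """Return the sign, exponent, and mantissa bit fields of a 64-bit float."""
    # Pack the float into 8 big-endian bytes, then write them out as 64 bits
    bits = "".join(f"{byte:08b}" for byte in struct.pack(">d", x))
    return bits[0], bits[1:12], bits[12:]

sign, exponent, mantissa = double_bits(54.75)
print(f"sign:     {sign}")
print(f"exponent: {exponent}")  # 10000000100 = 1028 = 5 + 1023 (biased)
print(f"mantissa: {mantissa}")  # 1011011 followed by zeros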
Example: How is 0.1 actually stored?#
print(format(0.1, '.55f'))
0.1000000000000000055511151231257827021181583404541015625
As you can see, the stored value is not exactly 0.1. This is because 0.1 is (perhaps surprisingly) a repeating fraction in binary and cannot be perfectly represented in finite precision.
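The exact rational number that is actually stored can be recovered with the standard fractions module (not imported above):

from fractions import Fraction

# The exact binary fraction stored in place of 0.1; its denominator is 2**55
print(Fraction(0.1))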
Round-off error#
Round-off error is the difference between the true value and its finite-precision representation. It can be reduced by increasing the precision, at the cost of additional memory and computation. It is especially important to keep in mind when combining numbers of similar or of vastly different magnitudes.
Different magnitudes of numbers#
Finite precision means small numbers can be rounded away inadvertently when combined with large numbers. Therefore, we cannot always rely on the associative property of addition:
print(f"-1 + (1 + 1e-20) = {-1+(1+1e-20)}")
print(f"(-1 + 1) + 1e-20 = {(-1+1)+1e-20}")
-1 + (1 + 1e-20) = 0.0
(-1 + 1) + 1e-20 = 1e-20
Compilers and algebraic simplification steps may mitigate these errors. Nevertheless, this motivates the use of reference values in computations, e.g., working with the temperature change relative to room temperature instead of the absolute temperature.
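The threshold at which a small number vanishes next to 1 is the machine epsilon, which NumPy reports via np.finfo:

# Machine epsilon: the gap between 1 and the next representable double
eps = np.finfo(np.float64).eps
print(f"eps = {eps}")
print(f"1 + eps     == 1 : {1 + eps == 1}")
print(f"1 + eps / 2 == 1 : {1 + eps / 2 == 1}")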
Subtractive cancellation#
Subtractive cancellation happens when two nearly equal numbers are subtracted, which can drop the number of significant digits dramatically.
# Define two nearly equal numbers
a = np.float32(1.23456789)
b = np.float32(1.23456780)
# Perform subtraction
result = a - b
# Print the results with higher precision
print(f"a = {format(a, '.20f')}")
print(f"b = {format(b, '.20f')}")
print(f"a - b = {result}")
a = 1.23456788063049316406
b = 1.23456776142120361328
a - b = 1.1920928955078125e-07
These errors appear, depending on the algorithm, even in elementary cases:
E.g.: The quadratic formula
\( x = \frac{-b \pm \sqrt{b^2-4ac}}{2a}\)
If \(4ac \ll b^2\), then \(\sqrt{b^2-4ac} \approx |b|\), and one of the two roots effectively computes \(b-b\) in the numerator (see the sketch at the end of this section).
E.g.: The rate of change
\(\Delta T = T_{final} - T_{initial}\)
generates spurious results when \(T_{final} \approx T_{initial}\).
Although each individual error is small, algorithms tend to have many steps, and these errors can accumulate unless the algorithm is designed robustly; this robustness is what we mean by numerical stability.
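As a closing illustration of the quadratic-formula cancellation mentioned above, here is a minimal sketch comparing the textbook formula with a common rearrangement that avoids subtracting nearly equal numbers (the second root is recovered from the identity \(x_1 x_2 = c/a\)); the function names are only illustrative:

def roots_naive(a, b, c):
    """Quadratic roots straight from the textbook formula."""
    d = np.sqrt(b**2 - 4 * a * c)
    return (-b + d) / (2 * a), (-b - d) / (2 * a)

def roots_stable(a, b, c):
    """Compute the cancellation-free root first, then the other from c/a (assumes b != 0)."""
    d = np.sqrt(b**2 - 4 * a * c)
    x1 = (-b - np.sign(b) * d) / (2 * a)  # same signs: no cancellation
    x2 = c / (a * x1)                     # product of the roots is c/a
    return x1, x2

# Here 4ac << b^2, so the naive '+' root loses most of its significant digits
a, b, c = 1.0, 1e8, 1.0
print(f"naive:  {roots_naive(a, b, c)}")
print(f"stable: {roots_stable(a, b, c)}")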