Numerics - Introduction

Numbers are abstract concepts in mathematics, but calculation with numbers requires a representation and algorithms. The representation of a number is called a numeral. There are different representation systems for numerals, for example:

Numeral representation of 27

The Roman system was used in Europe during the Middle Ages. It took centuries before Europeans adopted the much more efficient Arabic, or decimal, system. The decimal system is called a radix, $r=10$, or base, $b=10$, numeral system. These are positional numeral systems. A base-$b$ system has $b$ unique symbols, or digits, $\{0,1,\ldots,b-1\}$. Any number is represented uniquely by a string of digits $d_{n-1}d_{n-2} \ldots d_0$, where the position of a digit determines its value, and the total value of the string is the sum of the values of its digits: \[ d_{n-1}b^{n-1}+d_{n-2}b^{n-2}+\ldots+d_0b^0=\sum_{i=0}^{n-1} d_ib^i=(d_{n-1},d_{n-2},\ldots,d_0)_b \] where $d_i$ is a digit and $b$ the base. So each numeral is represented by a string of symbols, and this string is associated with a counting algorithm; the result of the counting algorithm defines the number.

Number representation in computers

With the advent of electronic computers the use of base-2 (binary) numbers became popular, because binary digits, or bits, having only two possible values, 0 and 1, map naturally onto electronic signals (low and high voltage). Base-8 (octal) and base-16 (hexadecimal) numbers are used as shorthand notation for binary numbers.

Representation of a number requires space, and in a computer space is a scarce resource. If the maximum length of a string representing a number is $k$, also called the word length, then the largest number representable by a string of length $k$ in a base-$b$ numeral system is $b^k-1$. The length required to represent a number $M \geq 1$ in base $b$ is $k=\lfloor \log_b M \rfloor + 1$.

In any computer system not only the positive natural numbers must be represented, but also the negative integers and the real numbers with a fraction. For representation of negative numbers in a computer one could simply add one sign bit, 0=positive and 1=negative. However, the most popular representation of negative numbers is the so-called two's-complement representation. In a base-2 number system with word length $k$ the negative number $-x$ is encoded as $2^k-x$. See the encoding of positive and negative numbers for $k=4$ in the figure below. Note that negative numbers always have their most significant bit set to 1. The value of a two's-complement representation is: \[ (d_{n-1},d_{n-2},\ldots,d_0)_{\text{2's complement}}=-d_{n-1}2^{n-1} +\sum_{i=0}^{n-2} d_i2^i \]

2's complement

Real numbers with a fraction are represented in a computer as either fixed point or floating point numbers. A fixed point number is represented in a computer by an integral and a fractional part. The position of the base point (decimal or binary) is almost always implied. If a fixed point number has $k$ whole digits and $l$ fractional digits, its value follows from: \[ (d_{k-1},d_{k-2},\ldots,d_0,d_{-1},d_{-2},\ldots,d_{-l})_{b}=\sum_{i=-l}^{k-1} d_ib^i \] so the digits to the left of the base point have non-negative powers and the digits to the right negative powers.

Fixed point numbers are, however, not practical if the range of numbers to work with is very large. For that purpose floating point numbers have been introduced. A floating point number is represented as follows: \[ x=(-1)^s f \cdot B^e, \qquad 1/B \leq f < 1 \] where $s$ is the sign, $f$ is called the significand or mantissa, $B$ the base and $e$ the exponent. As a real number can be represented in different ways, the representation is normalized by minimizing the exponent and maximizing the significand. The advantage of the floating point representation is that the stored exponent can accommodate a large range of very large and very small numbers. The disadvantage is that within that range numbers are represented with limited precision.

To standardize the implementation of floating point numbers, the IEEE-754 standard, later also known as IEC 60559:1989, Binary floating-point arithmetic for microprocessor systems, was issued. Intel used (a draft of) this standard in the design of its 8087 numerical co-processor (FPU) and its later successors. The standard is now widely adopted by most computer manufacturers. A floating point number is represented in this standard as follows: \[ x=(-1)^s \frac{f}{2^{N-1}}\cdot 2^e \] where $s$ is the sign bit (0=positive, 1=negative), $f$ the significand or mantissa stored as an unsigned integer, $N$ the number of bits of the mantissa, and $e$ the exponent.

A normalized number has the most significant bit of the mantissa set to 1, and thus \[ 1 \leq \frac{f}{2^{N-1}} = 1.dd \ldots d < 2 \] This most significant bit is implicit and is also called a hidden bit. The length of the mantissa $N$ is reported including this hidden bit.

The exponent is stored as an unsigned integer with the value $e+\text{bias}$, where the bias is a number dependent on the size in bits of the exponent field. The biased exponents 0 and all 1's (i.e. $2\cdot\text{bias}+1$) have a special meaning.

The biased exponent 0 is used to represent the number zero as well as subnormal or denormalized numbers, that is, numbers with leading zeros in the significand. When both the biased exponent and the significand are zero, the resulting value is zero. Changing the sign of zero does not change its value, so there are two possible representations of zero: one with a positive sign and one with a negative sign.

If the biased exponent is all 1's and the significand is all 0's, the number represents infinity; the sign bit indicates whether we are dealing with positive or negative infinity. These values are returned for operations that either do not have a finite value (e.g. 1/0) or are too large to be represented by a normalized number. The sign of a division by zero depends on the signs of both the numerator and the denominator: dividing +1 by negative zero gives negative infinity, and dividing -1 by positive infinity gives negative zero. If the biased exponent is all 1's and the significand is nonzero, the value represents a Not-a-Number value, or NaN.

The minimum possible (subnormal) value: \[ |\text{min}|=2^{2-\text{bias}-N} \] The minimum normalized value: \[ |\text{min-normalized}|=2^{1-\text{bias}} \] The maximum possible value: \[ |\text{max}|=(2^N-1)\,2^{\text{bias} + 1 - N} \]

On Intel chips there are three types of floating-point numbers: single, double, and extended precision.

Intel floating point numbers

On the x87 FPU all calculations are carried out in 80-bit extended-precision floating-point registers.

Elementary functions

In this section we show the elementary functions upon which all calculations in computers are based. These functions are normally implemented in hardware for performance reasons; a compiler translates them into direct calls to the hardware implementations. For each function we give a software implementation for illustrative purposes.

Square root

The square root implementation is based on the Babylonian square root algorithm, adapted to a binary floating-point machine.

This function approximates the square root of a number $x > 0$ with the Babylonian algorithm.

It uses the following declarations from math.h:

double frexp(double x, int *exp);
Breaks the floating point number x into its binary significand (a floating point value between 0.5 (included) and 1.0 (excluded)) and an integral exponent of 2, such that:

x = significand * 2 ** exponent

The exponent is stored in the location pointed to by exp, and the significand is the value returned by the function. If x is zero, both parts (significand and exponent) are zero.

double ldexp(double x, int exp);
Returns the floating point value resulting from multiplying x (the significand) by 2 raised to the power of exp (the exponent): x * 2 ** exp

M_SQRT1_2
The constant 1/sqrt(2) = 0.707106781186547524401
double SQRT(double x) {
	int exp;
	double f, y;

	/* the square root is only defined for x > 0;
	   return x unchanged for zero and negative arguments */
	if (x <= 0)
		return x;

	/* split x into significand and exponent: x = f * 2^exp, 0.5 <= f < 1 */
	f = frexp(x, &exp);

	/* initial guess for the iteration, good to ~7.04 bits on [0.5,1) */
	y = 0.41731 + 0.59016 * f;

	/* Babylonian (Newton) iterations; the accuracy roughly doubles
	   each step: 7.04 -> 15.08 -> 31.16 -> 63.32 bits */
	y += f / y;              /* y now holds 2*y1 (y1 accurate to ~15 bits) */
	y = 0.25 * y + f / y;    /* y2, ~31 bits */
	y = 0.5 * (y + f / y);   /* y3, ~63 bits */

	/* sqrt(f * 2^exp) = sqrt(f) * 2^(exp/2); for an odd exponent absorb
	   the leftover factor sqrt(2) via sqrt(1/2) and round exp up to even */
	if ((exp % 2) != 0) {
		y *= M_SQRT1_2;
		exp++;
	}
	return ldexp(y, exp / 2);
}

Copyright 2012 Jacq Krol. All rights reserved. Created August 2012; last updated August 2012.