Fixmath 1.4
Fixmath is a library of fixed-point math operations and functions. It is designed to be portable, flexible and fast on platforms without floating-point support.
All fixed-point numbers are represented as 32-bit signed integers. Since the numbers are signed, we think of the most significant bit as the sign bit, even though it is not a true sign bit in two's complement representation. For a fixed-point number, we also have a binary point, separating the integer bits from the fraction bits. If we have n integer bits and m fraction bits, the format is

    s iii...iii . fff...fff
       (n bits)    (m bits)

where s is the sign bit, and we have the constraint n + m = 31. In this library the fixed-point formats are often written as Qn.m or Q.m. The binary point is implicit in the sense that it is not stored in the number itself. It is up to the user to keep track of the binary points in different computations.
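For example, with m = 16 fraction bits, the value 1.5 is represented as round(1.5 × 2^16) = 98304 = 0x18000, which reads as 1.5 when interpreted in the Q15.16 format.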
For clarity, there is a fixed-point type fixed_t that is always a 32-bit signed integer. Other integer types may also be used, but they will be converted to 32 bits before the operation. There is also a type for representing reciprocal values, fx_rdiv_t; see the section on reciprocal value division.
All functions are prefixed with fx_. Otherwise, the names follow the naming convention of the C99 math library as much as possible. This means that functions are named as their floating-point equivalents, with a suffix x for fixed-point.

A fixed-point number is generally specified as a representation xval, and the number of fraction bits (binary point), frac.
The error bound is measured in ulp (unit in the last place), which is the smallest representable value. The error for the conversions and the arithmetic operations is within 0.5 ulp. For the remaining math functions the error is higher. The error bounds for these functions are determined empirically using random tests.
Forcing all math functions to be correct to 0.5 ulp would result in a severe performance hit. In normal use cases it is generally acceptable to lose one or two bits of precision if this makes the operation several times faster. If higher precision is needed, choose a fixed-point format with more fraction bits.
There are no special numbers such as the NaN and Inf of the IEEE-754 floating-point standard. For any fixed-point format, all possible bit patterns in the 32-bit representation are valid numbers.
There is currently no support for saturating arithmetic. If the computed result is outside the representable range, it will wrap around.
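A minimal sketch of the wrap-around behavior in the Q.16 format, where the largest representable value is just below 2^15 = 32768:

```c
#include <fixmath/fixmath.h>

/* 30000 + 30000 = 60000 does not fit in the 15 integer bits of
   the Q.16 format, so the sum wraps around to a negative value. */
fixed_t
wrap_demo(void)
{
    fixed_t a = fx_dtox(30000.0, 16);
    return a + a;   /* wraps; the result is negative */
}
```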
Some fixed-point operations are independent of the number of fraction bits. This means that they are the same as in the special case of zero fraction bits, i.e. integers. Since the fixed-point numbers are represented as integers, the normal integer operations in C may be used in these cases. This applies to the following (see the sketch below):

- Zero: 0 is always 0 in all fixed-point formats.
- Comparisons: ==, !=, <, <=, > and >= can be used if both operands have the same binary point.
- Addition and subtraction: +, +=, - and -= can be used if both operands have the same binary point.
- Unary plus and negation: + and - always work.

No checks or error handling is performed by the conversions and the arithmetic operations. In most operations an error condition will just produce an erroneous result, but fx_divx() and fx_rdivx() may trigger divide-by-zero and NULL pointer dereference errors, respectively.
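A small sketch of these integer operations applied to fixed-point values with the same binary point:

```c
#include <fixmath/fixmath.h>

/* Plain C operators on Q.16 values: comparisons and additions
   are format-independent as long as both operands have the same
   binary point. */
fixed_t
step_toward(fixed_t x, fixed_t limit, fixed_t step)
{
    if (x < limit) {    /* comparison: same binary point */
        x += step;      /* addition:   same binary point */
    }
    return x;
}
```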
The math functions check the arguments and set the errno variable on errors, similar to what the corresponding functions in the C math library do.
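For example, an out-of-domain argument can be detected as follows. This is a sketch: the name fx_sqrtx() is assumed from the naming convention, and the EDOM error code is assumed to mirror sqrt() in the C math library.

```c
#include <errno.h>
#include <fixmath/fixmath.h>

/* Square root with an explicit error check. */
fixed_t
checked_sqrtx(fixed_t x, unsigned frac)
{
    fixed_t y;

    errno = 0;
    y = fx_sqrtx(x, frac);
    if (errno == EDOM) {
        /* negative argument; handle the domain error here */
    }
    return y;
}
```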
For completeness, fx_addx() and fx_subx() are also provided for addition and subtraction. They are equal to the integer operations + and -, respectively.
For optimal performance, the frac arguments should be constants in all conversions and arithmetic operations.
Using fx_ftox() and fx_dtox() with constant arguments will generate a fixed-point compile-time constant. No run-time floating-point operations will be performed.
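For example, the following macros (the names are illustrative) expand to compile-time Q.16 constants:

```c
#include <fixmath/fixmath.h>

/* Compile-time fixed-point constants in the Q.16 format;
   no run-time floating-point code is generated. */
#define FX_HALF fx_ftox(0.5f, 16)
#define FX_PI   fx_dtox(3.14159265358979, 16)
```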
Use the base-2 logarithm and exponential functions instead of the natural or base-10 versions. It is also more efficient to compute 2^(y·log2(x)) instead of using fx_powx(). This is because of the extra overhead imposed by the normalization and the checks that must be done in order to minimize the loss of precision in the intermediate results. A user who knows the data range and fractional format can cut corners here.
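A sketch of this shortcut for x > 0; the names fx_log2x(), fx_exp2x() and fx_mulx() are assumed from the C99-plus-x naming convention:

```c
#include <fixmath/fixmath.h>

/* x^y computed as 2^(y * log2(x)), skipping the range checks
   that fx_powx() performs. Valid only for x > 0. */
fixed_t
fast_powx(fixed_t x, fixed_t y, unsigned frac)
{
    fixed_t lg = fx_log2x(x, frac);
    return fx_exp2x(fx_mulx(y, lg, frac), frac);
}
```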
There is a benchmark application in the prof directory. It runs performance tests for all API functions and some floating-point equivalents. The result is of course platform-dependent, but for CPUs without floating-point support in hardware, the fixed-point arithmetic operations are around 10 times faster than the corresponding software-emulated floating-point operations. For many of the math functions, the speed-up is around 100 times.
All operations are implemented as functions. This makes it possible to take the address of the function for all operations. However, some functions may also be implemented as macros, and these may evaluate the frac argument twice. For this reason it is an error to call a function with a frac argument that has side effects, e.g. frac++. The conversion operations fx_ftox() and fx_dtox() may also evaluate the floating-point argument twice.
The library is written in ANSI C, but it requires the C99 stdint.h header. The correctness tests also require the C99 stdbool.h header and the Check library [1]. The benchmark application, which is not built by default, requires the POSIX sigaction() and setitimer() facilities.
The library contains the following functionality: conversions, arithmetic operations including reciprocal value division, and math functions.
To use the fixmath library, include the header <fixmath/fixmath.h> and link the application with -lfixmath, or use pkg-config to determine the compiler and linker flags.
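A minimal usage sketch; fx_xtof() is a name assumed from the naming convention for converting back to floating point:

```c
#include <stdio.h>
#include <fixmath/fixmath.h>

int
main(void)
{
    fixed_t x = fx_ftox(0.75f, 16);   /* 0.75 in Q.16 */
    fixed_t y = fx_ftox(0.25f, 16);   /* 0.25 in Q.16 */
    fixed_t z = fx_addx(x, y);        /* same binary point */

    printf("%f\n", fx_xtof(z, 16));   /* prints 1.000000 */
    return 0;
}
```

Assuming the pkg-config module is named fixmath, this can be built with cc demo.c $(pkg-config --cflags --libs fixmath).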
Assume that we want to compute the 2-norm of a data vector, i.e.

    ||x|| = sqrt(x_1^2 + x_2^2 + ... + x_N^2).

All data points are in the range [0, 1]. The floating-point implementation is shown below.
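A minimal sketch of the floating-point reference (the function name is illustrative):

```c
#include <math.h>

/* 2-norm of a float vector with elements in [0, 1]. */
float
norm2_flt(const float *data, int len)
{
    float sum = 0.0f;
    int   k;

    for (k = 0; k < len; k++) {
        sum += data[k] * data[k];
    }
    return sqrtf(sum);
}
```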
When converting the operation to fixed-point, we need to choose a fixed-point format, i.e. the number of integer and fraction bits. It is of course dependent on the range of the data set. In this example, we use 16 bits for the fraction part, i.e. Q.16.
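A corresponding Q.16 sketch; the names fx_mulx() and fx_sqrtx() are assumed from the naming convention:

```c
#include <fixmath/fixmath.h>

/* 2-norm of a Q.16 vector with elements in [0, 1]. */
fixed_t
norm2_q16(const fixed_t *data, int len)
{
    fixed_t sum = 0;
    int     k;

    for (k = 0; k < len; k++) {
        /* Q.16 * Q.16 -> Q.16 product, accumulated in Q.16. */
        sum += fx_mulx(data[k], data[k], 16);
    }
    return fx_sqrtx(sum, 16);
}
```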
In the previous example, the data array was given in fixed-point format. Now we want to compute the same result on the original data, which is stored as 8-bit unsigned integers. Since the input data is now in the range [0-255], we want to normalize it by dividing by 255.
The implementation below computes the sum of squares as integers. The normalization is performed by converting the numerator and the denominator to the fixed-point format before computing the normalizing division. Unfortunately, the converted operands are too large for the Q.16 fixed-point format, and will overflow.
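A sketch of this (broken) approach; the names fx_itox() and fx_sqrtx() are assumed from the naming convention:

```c
#include <stdint.h>
#include <fixmath/fixmath.h>

/* Naive version: overflows when converting to Q.16. */
fixed_t
norm2_u8_naive(const uint8_t *data, int len)
{
    uint32_t sum = 0;
    int      k;

    for (k = 0; k < len; k++) {
        sum += (uint32_t)data[k] * data[k];
    }

    /* Both conversions overflow the 15 integer bits of Q.16:
       255*255 = 65025, and sum is usually even larger. */
    return fx_sqrtx(fx_divx(fx_itox(sum, 16),
                            fx_itox(255*255, 16), 16), 16);
}
```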
We can solve the problem by performing the fixed-point division and conversion simultaneously. The number of fraction bits of the division result is f1 - f2 + frac, where f1 and f2 are the number of fraction bits of the first and second operands, respectively. In our case, both f1 and f2 are zero. If we use 16 as the frac argument, we will get 16 fraction bits in the result. The complete implementation is shown below.
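A sketch of the complete implementation, again assuming fx_sqrtx():

```c
#include <stdint.h>
#include <fixmath/fixmath.h>

/* 2-norm of 8-bit data normalized by 255. The squares are summed
   as plain integers; division and conversion happen in one step. */
fixed_t
norm2_u8(const uint8_t *data, int len)
{
    uint32_t sum = 0;
    int      k;

    for (k = 0; k < len; k++) {
        sum += (uint32_t)data[k] * data[k];
    }

    /* f1 = f2 = 0, so the result has 0 - 0 + 16 = 16 fraction
       bits. The denominator is 255^2 since the data was squared
       before the normalization. */
    return fx_sqrtx(fx_divx(sum, 255*255, 16), 16);
}
```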
We want to normalize the 8-bit data to the range [0-1] and store it in a Q.16 fixed-point format. To do this, we normalize and convert the data in one step, as shown in the next code example.
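A sketch of this one-step normalization:

```c
#include <stdint.h>
#include <fixmath/fixmath.h>

/* Normalize 8-bit data to [0, 1] in the Q.16 format. With
   f1 = f2 = 0 and frac = 16, the result has 16 fraction bits. */
void
normalize_u8(const uint8_t *data, fixed_t *out, int len)
{
    int k;

    for (k = 0; k < len; k++) {
        out[k] = fx_divx(data[k], 255, 16);
    }
}
```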
When benchmarking the above implementation, it turns out to be quite slow. This is caused by the per-data-point division in the inner loop. In a floating-point implementation one would multiply each data point by the reciprocal value 1/255 instead of dividing by 255. When using fixed-point arithmetic, the reciprocal value may very well be outside the representable range. For this reason there is a special data type for storing reciprocal values, fx_rdiv_t. The functions fx_invx() and fx_isqrtx() both accept an optional fx_rdiv_t argument that can be used for performing multiple divisions by the same value. In the code below we compute the reciprocal value outside the loop, and then use the fx_rdivx() operation, which is really a multiplication, instead of the fx_divx() division operation.
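A sketch of this reciprocal version; the exact fx_invx() and fx_rdivx() signatures are assumptions, so check fixmath.h for the real ones:

```c
#include <stdint.h>
#include <fixmath/fixmath.h>

/* Same normalization as above, but with the division replaced
   by a multiplication with a precomputed reciprocal. */
void
normalize_u8_rdiv(const uint8_t *data, fixed_t *out, int len)
{
    fx_rdiv_t rdiv;
    int       k;

    /* Compute 1/255 once, outside the loop. */
    fx_invx(255, 16, &rdiv);

    for (k = 0; k < len; k++) {
        out[k] = fx_rdivx(data[k], &rdiv);  /* multiply, not divide */
    }
}
```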
[1] Check, http://check.sourceforge.net