CPFloat: A C library for emulating low-precision arithmetic

Fasi, Massimiliano and Mikaitis, Mantas (2020) CPFloat: A C library for emulating low-precision arithmetic. [MIMS Preprint]


Abstract

Low-precision floating-point arithmetic can be simulated in software by executing each arithmetic operation in hardware and rounding the result to the desired number of significant bits. For IEEE-compliant formats, rounding requires only standard mathematical library functions, but handling subnormals, underflow, and overflow demands special attention, and numerical errors can cause mathematically correct formulae to behave incorrectly in finite-precision arithmetic. Moreover, the ensuing algorithms are not necessarily efficient, as the library functions these techniques build upon are typically designed to handle a broad range of cases and may not be optimized for the specific needs of rounding algorithms. CPFloat is a C library that offers efficient routines for rounding arrays of binary32 and binary64 numbers to lower precision. The software exploits the bit-level representation of the underlying formats and performs only low-level bit manipulation and integer arithmetic, without relying on costly library calls. In numerical experiments, the new techniques bring a considerable speedup (typically one order of magnitude or more) over existing alternatives in C, C++, and MATLAB. To the best of our knowledge, CPFloat is currently the most efficient and complete library for experimenting with custom low-precision floating-point arithmetic available in any language.
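
To make the idea of rounding by bit manipulation concrete, the following is a minimal sketch in C. It is not taken from CPFloat and does not use its API: the function name round_to_precision and its parameter p are hypothetical. It assumes round-to-nearest with ties to even, a target precision of 2 to 24 significant bits, and a target format that keeps the binary32 exponent range, so the subnormal, underflow, and overflow behaviour of a narrower format is deliberately not emulated.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper (not part of CPFloat): round a binary32 value to
 * p significant bits, counting the implicit leading bit, using
 * round-to-nearest with ties to even. Only integer operations are used.
 * Assumption: the exponent range stays that of binary32. */
static float round_to_precision(float x, int p)
{
    assert(p >= 2 && p <= 24);

    /* Reinterpret the float as its 32-bit pattern without aliasing issues. */
    uint32_t bits;
    memcpy(&bits, &x, sizeof bits);

    /* Leave infinities and NaNs untouched (exponent field all ones). */
    if ((bits & 0x7F800000u) == 0x7F800000u)
        return x;

    int drop = 24 - p;              /* trailing significand bits to discard */
    if (drop == 0)
        return x;                   /* already at binary32 precision */

    /* Adding (half ulp - 1) plus the last kept bit rounds ties to even;
     * a carry out of the significand correctly bumps the exponent. */
    uint32_t last_kept = (bits >> drop) & 1u;
    uint32_t bias = (((uint32_t)1 << (drop - 1)) - 1u) + last_kept;
    bits += bias;

    /* Clear the discarded bits. */
    bits &= ~((((uint32_t)1) << drop) - 1u);

    float y;
    memcpy(&y, &bits, sizeof y);
    return y;
}
```

Under these assumptions, round_to_precision(x, 8) yields the 8-bit precision of bfloat16 and round_to_precision(x, 11) the 11-bit precision of binary16. Applying the function element by element over an array gives a simple form of the array rounding described in the abstract.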

Item Type: MIMS Preprint
Subjects: MSC 2010, the AMS's Mathematics Subject Classification > 65 Numerical analysis
Depositing User: Mr Massimiliano Fasi
Date Deposited: 20 Oct 2020 10:59
Last Modified: 20 Oct 2020 10:59
URI: http://eprints.maths.manchester.ac.uk/id/eprint/2785
