Accurate and Efficient Floating Point Summation
- 1 January 2004
- journal article
- Published by Society for Industrial & Applied Mathematics (SIAM) in SIAM Journal on Scientific Computing
- Vol. 25 (4), 1214-1248
- https://doi.org/10.1137/s1064827502407627
Abstract
We present and analyze several simple algorithms for accurately computing the sum of n floating point numbers using a wider accumulator. Let f and F be the number of significant bits in the summands and the accumulator, respectively. Then assuming gradual underflow, no overflow, and round-to-nearest arithmetic, up to approximately 2^(F-f) numbers can be added accurately by simply summing the terms in decreasing order of exponents, yielding a sum correct to within about 1.5 units in the last place (ulps). We apply this result to the floating point formats in the IEEE floating point standard. For example, a dot product of single precision vectors of length at most 33 computed using double precision and sorting is guaranteed correct to nearly 1.5 ulps. If double-extended precision is used, the vector length can be as large as 65,537. We also investigate how the cost of sorting can be reduced or eliminated while retaining accuracy.
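The idea in the abstract can be illustrated with a minimal Python sketch. It is not the paper's algorithm as published, only an illustration under stated assumptions: the function names (`to_single`, `max_terms`, `sorted_sum`, `dot_single`) are hypothetical; the accumulator is a Python float (IEEE double, F = 53), since Python has no double-extended type; the summands of a single precision dot product are taken to have f = 48 significant bits (the exact product of two 24-bit significands); and the closed form 2^(F-f) + 1 is inferred from the figures 33 and 65,537 quoted above, whereas the abstract itself only says "approximately 2^(F-f)".

```python
import math
import struct


def to_single(x: float) -> float:
    """Round a double to the nearest IEEE single precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def max_terms(F: int, f: int) -> int:
    """Assumed form of the term-count bound, inferred from the abstract's
    examples (33 and 65,537); not a formula quoted from the paper."""
    return 2 ** (F - f) + 1


def sorted_sum(terms):
    """Add the terms in decreasing order of exponent in a double accumulator."""
    # math.frexp(t) returns (m, e) with t == m * 2**e and 0.5 <= |m| < 1,
    # so e orders the terms by binary exponent regardless of sign.
    ordered = sorted(terms, key=lambda t: math.frexp(t)[1], reverse=True)
    acc = 0.0
    for t in ordered:
        acc += t
    return acc


def dot_single(xs, ys):
    """Dot product of single precision vectors, accumulated in double."""
    # Each product of two 24-bit significands needs at most 48 bits, so it is
    # exact in double (53 bits) as long as no overflow or underflow occurs.
    products = [to_single(x) * to_single(y) for x, y in zip(xs, ys)]
    return to_single(sorted_sum(products))
```

With these assumptions, `max_terms(53, 48)` gives 33 and `max_terms(64, 48)` gives 65537, matching the vector lengths quoted in the abstract. Sorting by exponent matters when large terms cancel: `sorted_sum([1.0, 2.0**60, -2.0**60])` adds the two large terms first and returns 1.0, whereas adding the same terms left to right in double loses the 1.0.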