 Overall Objectives
Bilateral Contracts and Grants with Industry
Partnerships and Cooperations
Bibliography   PDF e-Pub

Section: New Results

Floating-Point arithmetic

Error bounds on complex floating-point multiplication with a fused-multiply add

The accuracy analysis of complex floating-point multiplication done by Brent, Percival, and Zimmermann [Math.  Comp., 76:1469–1481, 2007] is extended by Peter Kornerup (Odense Univ. Denmark), Claude-Pierre Jeannerod, Nicolas Louvet, and Jean-Michel Muller  to the case where a fused multiply-add (FMA) operation is available. Considering floating-point arithmetic with rounding to nearest and unit roundoff $u$, they show that the bound $\sqrt{5}\phantom{\rule{0.166667em}{0ex}}u$ on the normwise relative error $|\stackrel{^}{z}/z-1|$ of a complex product $z$ can be decreased further to $2u$ when using the FMA in the most naive way. Furthermore, they prove that the term $2u$ is asymptotically optimal not only for this naive FMA-based algorithm, but also for two other algorithms, which use the FMA operation as an efficient way of implementing rounding error compensation. Thus, although highly accurate in the componentwise sense, these two compensated algorithms bring no improvement to the normwise accuracy $2u$ already achieved using the FMA naively. Asymptotic optimality is established for each algorithm thanks to the explicit construction of floating-point inputs for which it is proven that the normwise relative error then generated satisfies $|\stackrel{^}{z}/z-1|\to 2u$ as $u\to 0$. All these results hold for IEEE floating-point arithmetic, with radix $\beta \ge 2$, precision $p\ge 2$, and rounding to nearest; it is only assumed that underflows and overflows do not occur and, when bounding errors from below, that ${\beta }^{p-1}\ge 12$.

Refined error analysis of the Cornea-Harrison-Tang method for $ab+cd$

In their book Scientific Computing on Itanium-based Systems, Cornea, Harrison, and Tang introduced an accurate algorithm for evaluating expressions of the form $ab+cd$ in binary floating-point arithmetic, assuming a fused-multiply add instruction is available. They showed that if $p$ is the precision of the floating-point format and if $u={2}^{-p}$, the relative error of the result is of order $u$. Jean-Michel Muller improved their proof to show that the relative error is bounded by $2u+7{u}^{2}+6{u}^{3}$. Furthermore, by building an example for which the relative error is asymptotically (as $p\to \infty$ or, equivalently, as $u\to 0$) equivalent to $2u$, he proved that this error bound is asymptotically optimal  . Claude-Pierre Jeannerod then showed in  that an error bound of the form $2u+2{u}^{2}+O\left({u}^{3}\right)$ in fact holds for any radix $\beta \ge 2$, with $u=\frac{1}{2}{\beta }^{1-p}$. He also showed that the possibility of removing the $O\left({u}^{2}\right)$ term from this bound depends on the radix parity and the tie-breaking strategy used for rounding: if $\beta$ is odd or rounding is to nearest even then the simpler bound $2u$ is obtained, while if $\beta$ is even and rounding is to nearest away, then there exist floating-point inputs $a,b,c,d$ that lead to a relative error larger than $2u+\frac{1}{\beta }{u}^{2}$.

Improved error bounds for numerical linear algebra

When computing matrix factorizations and solving linear systems in floating-point arithmetic, classical rounding error analyses provide backward error bounds whose leading terms have the form ${\gamma }_{n}=nu/\left(1-nu\right)$ for suitable values of $n$ and with $u$ the unit roundoff. With Siegfried M. Rump (Hamburg University of Technology), Claude-Pierre Jeannerod showed in  that for LU and Cholesky factorizations as well as for triangular system solving, ${\gamma }_{n}$ can be replaced by the $O\left({u}^{2}\right)$-free and unconditional constant $nu$. To get these new bounds the main ingredient is a general framework for bounding expressions of the form $|\rho -s|$, where $s$ is the exact sum of a floating-point number and $n-1$ real numbers, and where $\rho$ is a real number approximating the computed sum $\stackrel{^}{s}$.

On relative errors of floating-point operations

Rounding error analyses of numerical algorithms are most often carried out via repeated applications of the so-called standard models of floating-point arithmetic. Given a round-to-nearest function $\text{RN}$ and barring underflow and overflow, such models bound the relative errors ${E}_{1}\left(t\right)=|t-\text{RN}\left(t\right)|/|t|$ and ${E}_{2}\left(t\right)=|t-\text{RN}\left(t\right)|/|\text{RN}\left(t\right)|$ by the unit roundoff $u$. In  Claude-Pierre Jeannerod and Siegfried M. Rump (Hamburg University of Technology) investigated the possibility of refining these bounds, both in the case of an arbitrary real $t$ and in the case where $t$ is the exact result of an arithmetic operation on some floating-point numbers. They provided explicit and attainable bounds on ${E}_{1}\left(t\right)$, which are all less than or equal to $u/\left(1+u\right)$ and, therefore, smaller than $u$. For ${E}_{2}\left(t\right)$ the bound $u$ is attainable whenever $t=x±y$ or $t=xy$ or, in base $>2$, $t=x/y$ with $x,y$ two floating-point numbers. However, for division in base 2 as well as for square root, smaller bounds are derived, which are also shown to be attainable. This set of sharp bounds was then applied to the rounding error analysis of various numerical algorithms: in all cases, they obtained either much shorter proofs of the best-known error bounds for such algorithms, or improvements on these bounds themselves.

Correctly rounded sum of floating-point numbers in GNU MPFR

Vincent Lefèvre has designed a new algorithm to compute the correctly rounded sum of several floating-point numbers, each having its own precision and the output having its own precision, as in GNU MPFR. At the same time, the mpfr_sum function is being reimplemented (not finished yet). While the old algorithm was just an application of Ziv's method, thus with exponential time and memory complexity in the worst case such as the sum of a huge number and a tiny number, the new algorithm does the sum by blocks (reiterations being needed only in case of cancellations), taking such holes between numbers into account.