Section: New Results
Hardware Arithmetic Operators
Participants : Nicolas Brisebarre, Mioara Joldes, Florent de Dinechin, Jean-Michel Muller, Bogdan Pasca, Guillaume Revy.
Complex Division, Table-Based Complex Reciprocal Approximation
A complex division algorithm introduced in 2003 by Ercegovac and Muller requires a prescaling step by a constant factor. The authors had mentioned techniques for obtaining this prescaling factor; these served to justify the feasibility of the algorithm but were inadequate for obtaining efficient implementations. Pouya Dormiani, Milos Ercegovac (Univ. of California at Los Angeles) and Jean-Michel Muller formulated table-based solutions [26] for obtaining the prescaling factor, a low-precision reciprocal approximation of a complex value, using techniques adapted from univariate function approximation. The main contribution of this work is the extension of generalized multipartite table methods to a function of two variables. The multipartite tables derived were up to 67% more memory-efficient than their single-table counterparts. They designed a radix-4 complex division unit that uses their technique [25].
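To make the notion of a table-based prescaling factor concrete, here is a minimal sketch of the single-table baseline only, not of the multipartite method of [26]. The input range (real part in [0.5, 1), imaginary part in [0, 1)), the index width k, and the function names are assumptions chosen for illustration.

```python
def build_reciprocal_table(k):
    """Single table of complex reciprocals, indexed by the top k fraction
    bits of the real and imaginary parts (hypothetical layout).
    Assumes real part in [0.5, 1) and imaginary part in [0, 1)."""
    scale = 1 << k
    table = {}
    for ia in range(scale // 2, scale):      # cells covering [0.5, 1)
        for ib in range(scale):              # cells covering [0, 1)
            # Store the reciprocal of the cell midpoint.
            mid = complex((ia + 0.5) / scale, (ib + 0.5) / scale)
            table[(ia, ib)] = 1 / mid
    return table


def prescale_factor(table, d, k):
    """Look up the low-precision reciprocal approximation of d."""
    scale = 1 << k
    return table[(int(d.real * scale), int(d.imag * scale))]
```

With this layout the table holds 2^(2k-1) entries, and the prescaled divisor d * prescale_factor(...) lies within about 1.42 * 2^(-k) of 1. The exponential growth in k is what makes more compact schemes such as multipartite tables attractive.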
Hardware complex polynomial evaluation
Milos Ercegovac (Univ. of California at Los Angeles) and Jean-Michel Muller proposed [14] an efficient hardware-oriented method for evaluating complex polynomials. The method is based on iteratively solving a system of linear equations. The solutions are obtained digit-by-digit on simple and highly regular hardware. The operations performed are defined over the reals. They described a complex-to-real transform, a complex polynomial evaluation algorithm, the convergence conditions, and a corresponding design and implementation. The main features of the method are: a latency of about m cycles for m-bit precision; a cycle time independent of the precision; a design consisting of identical modules; and digit-serial connections between the modules. The number of modules, each roughly corresponding to a serial-parallel multiplier without a carry-propagate adder, is 2(n + 1) for evaluating an n-th degree complex polynomial. The design allows straightforward tradeoffs between latency and cost: a factor-k decrease in cost leads to a factor-k increase in latency. The proposed method is attractive for programmable platforms because of its regular and repetitive structure of simple hardware operators.
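The idea of recasting a complex polynomial evaluation as a real linear system can be sketched as follows. This is an illustration under assumptions, not the algorithm of [14]: the system below is simply the Horner recurrence y_i = p_i + z y_(i+1) split into real and imaginary parts, and it is solved by ordinary back-substitution in floating point rather than digit-by-digit. The function name is hypothetical.

```python
def eval_complex_poly_as_real_system(coeffs, z):
    """Evaluate sum(coeffs[i] * z**i) using real arithmetic only.

    Each complex unknown y_i contributes two real unknowns, so a degree-n
    polynomial gives 2(n + 1) real equations (assumed formulation):
        yr_i - a*yr_{i+1} + b*yi_{i+1} = Re(p_i)
        yi_i - a*yi_{i+1} - b*yr_{i+1} = Im(p_i)
    with z = a + ib and y_{n+1} = 0. The value of the polynomial is y_0.
    """
    z = complex(z)
    a, b = z.real, z.imag
    n = len(coeffs) - 1
    yr = [0.0] * (n + 2)
    yi = [0.0] * (n + 2)
    # Back-substitution from the highest-degree equation down to y_0.
    for i in range(n, -1, -1):
        p = complex(coeffs[i])
        yr[i] = p.real + a * yr[i + 1] - b * yi[i + 1]
        yi[i] = p.imag + a * yi[i + 1] + b * yr[i + 1]
    num_real_unknowns = 2 * (n + 1)
    return complex(yr[0], yi[0]), num_real_unknowns
```

The count of real unknowns returned by this sketch matches the 2(n + 1) modules quoted above, which suggests one module per real unknown; that correspondence is an inference from the text, not a statement taken from [14].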
Large multipliers and squarers with fewer DSP blocks
Large integer multipliers and squarers are pervasively used to build floating-point operators. Bogdan Pasca and Florent de Dinechin proposed three methods to build them using fewer of the small multipliers available in the DSP blocks of current FPGAs [24]. A careful implementation of the classical Karatsuba-Ofman approach has a low overhead and may reduce the embedded multiplier count from 4 to 3, from 9 to 6, or from 16 to 10. Building dedicated squarers yields the same savings without any overhead. For the recent Xilinx devices whose embedded multipliers are rectangular, these two approaches are inefficient, but a third approach, based on non-standard tiling, is proposed. The multipliers built this way are smaller and faster than those offered by vendor tools. In all cases, the proposed architectures also try to make the best use of the accumulation hardware present in the DSP blocks.
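The multiplier counts quoted above (4 to 3, 9 to 6, 16 to 10) follow from the identity x_i y_j + x_j y_i = x_i y_i + x_j y_j - (x_i - x_j)(y_i - y_j), which replaces two cross products by one. The sketch below checks this on Python integers; it is not the architecture of [24], and the 17-bit chunk width and the function names are assumptions.

```python
def split_chunks(x, k, w):
    """Split a non-negative integer into k chunks of w bits, least significant first."""
    mask = (1 << w) - 1
    return [(x >> (i * w)) & mask for i in range(k)]


def chunked_karatsuba(x, y, k, w=17):
    """Multiply two k*w-bit integers with k(k+1)/2 small products instead of k*k."""
    xs, ys = split_chunks(x, k, w), split_chunks(y, k, w)
    diag = [xs[i] * ys[i] for i in range(k)]          # k small multiplications
    mults = k
    result = sum(d << (2 * i * w) for i, d in enumerate(diag))
    for i in range(k):
        for j in range(i + 1, k):
            # One multiplication per pair; the differences are signed and
            # need one more bit than the chunks themselves.
            cross = diag[i] + diag[j] - (xs[i] - xs[j]) * (ys[i] - ys[j])
            mults += 1
            result += cross << ((i + j) * w)
    return result, mults


def chunked_square(x, k, w=17):
    """Square a k*w-bit integer: each cross product x_i*x_j is computed once and doubled."""
    xs = split_chunks(x, k, w)
    result, mults = 0, 0
    for i in range(k):
        result += (xs[i] * xs[i]) << (2 * i * w)
        mults += 1
        for j in range(i + 1, k):
            result += (xs[i] * xs[j]) << ((i + j) * w + 1)  # shift by 1 doubles the term
            mults += 1
    return result, mults
```

For k = 2, 3, 4 both functions use 3, 6 and 10 small multiplications. The Karatsuba variant pays for the saving with extra additions and subtractions, whereas the squarer needs only a one-bit shift, which is consistent with the "without any overhead" remark above.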
Multiplier-based square roots on FPGAs
Florent de Dinechin, Bogdan Pasca, Mioara Joldes, and Guillaume Revy also studied how these embedded multipliers can be used to implement the square root operation [34]. Compared to state-of-the-art digit-recurrence approaches, an original polynomial approach, shown to be more efficient than classical quadratic iterations in this context, leads to a very short-latency and low-power architecture for single precision. For double precision, it appears that the amount of glue logic in these multiplier-based approaches is comparable to the cost of a complete digit-recurrence approach, so only the advantage of shorter latency remains.
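A polynomial, multiplier-based square root can be sketched as a table of low-degree polynomials, one per segment of the mantissa range, evaluated with two multiplications in Horner form. The sketch below is only an illustration: it uses Taylor coefficients at segment midpoints and double-precision floats, whereas a hardware design would use carefully chosen fixed-point coefficients. The segment count 2^a, the restriction to inputs in [1, 2), and the function names are all assumptions, not details taken from [34].

```python
import math


def build_sqrt_table(a):
    """For each of the 2**a segments of [1, 2), store the midpoint c and the
    degree-2 Taylor coefficients of sqrt at c (a stand-in for a better fit)."""
    table = []
    for s in range(1 << a):
        c = 1.0 + (s + 0.5) / (1 << a)
        r = math.sqrt(c)
        table.append((c, r, 1.0 / (2.0 * r), -1.0 / (8.0 * c * r)))
    return table


def sqrt_poly(x, table, a):
    """Approximate sqrt(x) for x in [1, 2) with one table lookup and two multiplications."""
    s = int((x - 1.0) * (1 << a))     # segment index from the leading bits of x
    c, c0, c1, c2 = table[s]
    d = x - c                         # |d| <= 2**-(a+1)
    return c0 + d * (c1 + d * c2)     # Horner evaluation
```

With a = 8 the truncation error of this sketch stays below 2^(-30) on [1, 2), comfortably under single-precision accuracy. Handling the exponent, and in particular odd exponents, is left out.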
Design of an arithmetic-oriented operator generator framework
Florent de Dinechin, Bogdan Pasca, and Cristian Klein, then an intern, showed how the arithmetic context can be exploited to build generators of highly efficient arithmetic operators [23]. The salient features of the FloPoCo open-source architecture generator framework are: an easy learning curve from VHDL, the ability to embed arbitrary synthesizable VHDL code, portability to mainstream FPGA targets from Xilinx and Altera, automatic management of complex pipelines with support for frequency-directed pipelining, and support for automatic test-bench generation.