## Section: New Results

Keywords : arithmetic operators, function evaluation, division, integrated circuit, ASIC, FPGA, low-power consumption, circuit generator.

### Hardware Arithmetic Operators

Participants : J.-L. Beuchat, J. Detrey, F. de Dinechin, R. Michard, J.-M. Muller, A. Tisserand, N. Veyrat-Charvillon, G. Villard.

#### Evaluation of Functions

J. Detrey and F. de Dinechin have worked on general methods for the hardware evaluation of fixed-point elementary functions. The Higher-Order Table-Based Method generalizes previous first-order methods [3] to arbitrary polynomials. It yields implementations that are both smaller and faster, thanks to the use of small multipliers and powering units [35].

The fixed-point function generators described above have been used to build the first floating-point elementary function library for FPGAs: J. Detrey and F. de Dinechin have studied the floating-point logarithm [33], then the exponential [34]. Although basic floating-point operators on FPGAs are usually much slower than their processor counterparts, both elementary functions exhibit a 10x speedup over the processor, thanks to specific algorithms. This work will be published in the Special Issue on FPGA-based Reconfigurable Computing of the Journal of Microprocessors and Microsystems [5].

R. Michard, A. Tisserand and N. Veyrat-Charvillon have proposed a new method for the approximation of functions without multipliers. This method uses degree-2 or degree-3 polynomial approximations whose coefficients have at most 3 non-zero bits, together with low-precision estimates of the powers of x. The first implementation on FPGAs leads to very small operators by replacing the costly multipliers with a small number of additions [42] (best paper award of the ASAP 2005 Conference).
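The idea of replacing multipliers by a few additions can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the fixed-point format, the encoding of a coefficient as (sign, shift) pairs and the way the square of x is estimated from a truncated operand are all assumptions made for illustration.

```python
def shift_add_mul(x, terms):
    """Multiply integer x by sum(sign * 2**shift) with shifts and adds only.

    terms: at most three (sign, shift) pairs (hypothetical encoding);
    a negative shift is a right shift, i.e. a fractional power of two.
    """
    if len(terms) > 3:
        raise ValueError("coefficient must have at most 3 non-zero bits")
    acc = 0
    for sign, shift in terms:
        part = x << shift if shift >= 0 else x >> -shift
        acc += sign * part
    return acc


def eval_poly2(x, c0, c1_terms, c2_terms, frac_bits, sq_bits):
    """Evaluate c0 + c1*x + c2*x^2 on fixed-point x with frac_bits fraction bits.

    x^2 is estimated from the sq_bits leading fraction bits of x only; the
    product below stands in for what would be a small squaring table.
    """
    drop = frac_bits - sq_bits
    xt = x >> drop                                   # truncated operand
    x2 = ((xt * xt) << (2 * drop)) >> frac_bits      # low-precision x^2
    return c0 + shift_add_mul(x, c1_terms) + shift_add_mul(x2, c2_terms)


# 1 + x + 0.5*x^2 at x = 0.5 with 8 fraction bits: 1.625 * 256 = 416
print(eval_poly2(128, 256, [(1, 0)], [(1, -1)], 8, 4))
```

Each coefficient product costs at most two additions here, which is where the area saving over a general multiplier would come from.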

R. Michard, A. Tisserand and N. Veyrat-Charvillon have worked on an efficient FPGA implementation of a shift-and-add algorithm for polynomial and rational approximation of functions. These operators are high-radix iterations of the E-method proposed by M. Ercegovac. The results show high performance, obtained by combining the simple architecture of shift-and-add algorithms with the generic nature of polynomial and rational approximations [41].

M. Ercegovac, J.-M. Muller and A. Tisserand have worked on the hardware approximation of the reciprocal and the reciprocal square root. The proposed method is based on a degree-1 polynomial approximation with specific coefficients and a table [36]. These approximations can be used as starting values to speed up software iterations for division and square root.
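How a table-driven degree-1 approximation can seed a software iteration is sketched below in Python. This is only an illustration under stated assumptions: the coefficients come from a plain first-order Taylor expansion at each interval midpoint, not from the specific coefficients of [36], and the names `build_table`, `reciprocal_seed` and `newton_refine` are hypothetical.

```python
def build_table(k):
    """Table of (c0, c1) such that 1/x ~ c0 - c1*x on each of 2**k slices of [1, 2)."""
    n = 1 << k
    table = []
    for j in range(n):
        m = 1.0 + (j + 0.5) / n          # midpoint of slice j
        # Taylor expansion: 1/m - (x - m)/m^2 = 2/m - x/m^2
        table.append((2.0 / m, 1.0 / (m * m)))
    return table


def reciprocal_seed(x, table, k):
    """Degree-1 approximation of 1/x for x in [1, 2); the index is the leading bits of x."""
    j = int((x - 1.0) * (1 << k))
    c0, c1 = table[j]
    return c0 - c1 * x


def newton_refine(x, y, steps):
    """Newton-Raphson iteration y <- y * (2 - x*y); each step roughly squares the relative error."""
    for _ in range(steps):
        y = y * (2.0 - x * y)
    return y


table = build_table(4)
seed = reciprocal_seed(1.7, table, 4)
print(seed, newton_refine(1.7, seed, 2), 1 / 1.7)
```

With a 16-entry table the seed is already accurate to about three decimal digits, so two refinement steps are enough to get close to double precision in this sketch.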

#### Division

R. Michard, A. Tisserand and N. Veyrat-Charvillon have developed a software tool for the generation of division circuits [40]. This tool, called `divgen`, allows the comparison of various parameters (sizes, radix, algorithm type, optimizations...) for architecture exploration. This work has been done within a collaboration between Inria and CEA-Léti. The software is released in version 0.11.

#### Low-Power Arithmetic Operators

R. Michard, A. Tisserand and N. Veyrat-Charvillon have carried out a statistical study of the activity due to the selection function in the E-method, the polynomial approximation algorithm proposed by M. Ercegovac. When a redundant representation is used, the latitude in the choice of the result digits in the selection function makes it possible to reduce the electrical activity in some cases. Power consumption benefits can be expected [43].

#### Hash Functions

R. Glabb, a PhD student at the ATIPS laboratory at the University of Calgary, and N. Veyrat-Charvillon studied the SHA-2 family of hash functions. They implemented more efficient stand-alone versions of the separate operators, and devised a multi-mode operator able to compute all algorithms in a single architecture with a very high level of hardware sharing between the modes. This collaboration took place in September at LIP and in October at ATIPS.

#### Code-Based Digital Signature

An algorithm producing cryptographic digital signatures less than 100 bits long, with a security level matching current standards, has recently been proposed by Courtois, Finiasz, and Sendrier [55]. This scheme is based on error-correcting codes and consists of generating a large number of instances of a decoding problem until one of them is solved (about 9! = 362880 attempts are needed). A careful software implementation requires more than one minute on a 2 GHz Pentium 4 for signing.

In 2004, J.-L. Beuchat, N. Sendrier, A. Tisserand, and G. Villard proposed a first hardware implementation which signs a document in 0.86 seconds on an XCV300E-7 FPGA, hence making this scheme practical [54]. However, N. Sendrier modified the first step of the algorithm to prevent a possible flaw. This step involved a multiplication by a matrix stored in the memory blocks of the FPGA. Since the new version no longer requires this matrix, intermediate results can now be stored in the memory blocks and more configurable logic is available to implement computing units. Therefore, we decided to study a new architecture from scratch. Our second architecture reduces the signature time (place-and-route results), while improving the security. We plan to design a prototype on a ZestSC1 FPGA USB board (see http://www.orangetreetech.com for details) and to publish our results.

#### Iterative Modular Multiplication

Modular multiplication is often implemented in a parallel-serial fashion: the $n$-bit operand $Y$ is stored in a register and $X$ is processed digit by digit. At each step, a partial product $2^{i} x_{i} Y$ is formed and added (modulo $M$) to the previous intermediate result. The well-known Montgomery algorithm [61] leads to LSDF (Least Significant Digit First) algorithms. MSDF (Most Significant Digit First) schemes are based on Horner's rule.
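As an illustration of the LSDF approach, here is a minimal radix-2 Python sketch of Montgomery's multiplication, scanning X from its least significant bit. The function name and the restriction to radix 2 are assumptions made for this example; note that the result carries the usual Montgomery factor 2^(-n).

```python
def montgomery_lsdf(x, y, m, n):
    """Return x * y * 2**(-n) mod m, processing x least significant bit first.

    Assumes m is odd, 0 <= x < 2**n and 0 <= y < m < 2**n.
    """
    if m % 2 == 0:
        raise ValueError("Montgomery's algorithm needs an odd modulus")
    r = 0
    for i in range(n):
        r += ((x >> i) & 1) * y      # add the partial product x_i * Y
        if r & 1:                    # make r even by adding the (odd) modulus
            r += m
        r >>= 1                      # exact division by 2
    # r stays below 2*m throughout, so one final subtraction suffices
    return r - m if r >= m else r


res = montgomery_lsdf(100, 200, 239, 8)
print(res, (res << 8) % 239 == (100 * 200) % 239)
```

The loop never compares the intermediate result with the modulus; the only test is on the least significant bit, which is what makes this family attractive in hardware.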

##### Survey and Practical Aspects

J.-L. Beuchat, J.-M. Muller, M. Neve (UCL Crypto Group), and E. Peeters (UCL Crypto Group) studied and implemented most of the iterative schemes published in the open literature. They plan to carry out a fair comparison of these algorithms on FPGA and to write a survey on this topic.

##### High-Radix Carry-Save Algorithm Based on Horner's Rule

MSDF algorithms are based on the following iteration:

$$Q[i] = \left(2\,Q[i+1] + x_{i} Y\right) \bmod M,$$

where $Q[r] = 0$ and $Q[0] = XY \bmod M$. Several improvements of this algorithm have been proposed. The basic idea consists in computing a number congruent with $Q[i]$ modulo $M$, which requires less hardware than a modulo $M$ addition.
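The MSDF scheme can be sketched in Python as follows, assuming radix 2 and an r-bit operand X. The function name `msdf_modmul` is hypothetical, and the reduction by conditional subtraction is only the plain baseline, not one of the improved variants mentioned above.

```python
def msdf_modmul(x, y, m, r):
    """Compute x * y mod m by Horner's rule, most significant bit of x first.

    Assumes 0 <= x < 2**r and 0 <= y < m.
    """
    if not (0 <= x < (1 << r) and 0 <= y < m):
        raise ValueError("operands out of range")
    q = 0                                    # Q[r] = 0
    for i in reversed(range(r)):
        q = 2 * q + ((x >> i) & 1) * y       # 2*Q[i+1] + x_i*Y, below 3*m
        # full modulo-M reduction: at most two subtractions are needed
        if q >= m:
            q -= m
        if q >= m:
            q -= m
    return q                                 # Q[0] = X*Y mod M


print(msdf_modmul(100, 200, 239, 8) == (100 * 200) % 239)
```

The two comparisons against the modulus in every step are the costly part; keeping only a value congruent to the intermediate result, as described above, is a way to avoid them.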

Public-key cryptography often involves modular multiplication of large operands (from 160 to 2048 bits). Several researchers have proposed iterative algorithms whose internal data are carry-save numbers. Unfortunately, this number system is not well suited to today's Field Programmable Gate Arrays (FPGAs), which embed dedicated carry logic.

J.-L. Beuchat, J.-M. Muller, R. Beguenane (Université du Québec à Chicoutimi), and S. Simard (Université du Québec à Chicoutimi) proposed to perform modular multiplication in a high-radix carry-save number system, where the *sum bit* of the well-known carry-save representation is replaced by a *sum word* [24]. The originality of this approach is to analyze the modulus in order to select the most efficient high-radix carry-save representation. Place-and-route results show that this approach reduces the area by up to 50% and does not increase the critical path compared to previously published algorithms based on Horner's rule.
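A minimal Python sketch may help to picture a high-radix carry-save number. It is an illustrative model, not the representation selected in [24]: the word width `k`, the list-based encoding and the function names are assumptions. Each block holds a k-bit sum word plus one pending carry bit, so carries travel only inside a block (where an FPGA carry chain is fast) and never across blocks within one addition.

```python
def hrcs_value(sums, carries, k):
    """Integer value of a high-radix carry-save number (block j has weight 2**(k*j))."""
    return sum((s + c) << (k * j) for j, (s, c) in enumerate(zip(sums, carries)))


def hrcs_add(sums, carries, b, k):
    """Add a conventional integer b; carries move by one block only."""
    mask = (1 << k) - 1
    n = len(sums)
    new_sums, new_carries = [], [0]          # nothing enters block 0
    for j in range(n):
        # independent k-bit addition in each block
        t = sums[j] + carries[j] + ((b >> (k * j)) & mask)
        new_sums.append(t & mask)            # k-bit sum word
        new_carries.append(t >> k)           # single carry bit for block j+1
    top = new_carries.pop()                  # carry out of the last block
    if top or b >> (k * n):
        raise OverflowError("not enough blocks for this sketch")
    return new_sums, new_carries


s, c = hrcs_add([0, 0, 0], [0, 0, 0], 0x0FF, 4)
s, c = hrcs_add(s, c, 0x001, 4)
print(s, c, hrcs_value(s, c, 4))
```

After the second addition the carry produced by block 0 is still pending in block 1, yet the represented value is already correct.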

#### RN-Codings

A property of the original Booth recoding is that the first non-zero digit following a 1 is necessarily -1, and vice versa. This makes it possible to prove that truncating the Booth recoding of a number X is equivalent to rounding X to the nearest. P. Kornerup and J.-M. Muller investigated the positional number systems of a given radix that share this rounding property and called them RN-codings [38] (where "RN" stands for "Round to Nearest"). J.-L. Beuchat and J.-M. Muller studied addition, multiplication, and squaring algorithms for radix-2 RN-codings (i.e. Booth recodings) [26], [25].
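The rounding property can be checked experimentally with a short Python sketch. It assumes non-negative n-bit integers and uses the classical radix-2 Booth digits d_i = x_(i-1) - x_i; the function names are chosen for this example only.

```python
def booth_recode(x, n):
    """Radix-2 Booth recoding of 0 <= x < 2**n, least significant digit first.

    Digit i is x_(i-1) - x_i, with x_(-1) = 0 and x_n = 0, so digits are in {-1, 0, 1}.
    """
    digits = []
    prev = 0
    for i in range(n + 1):
        bit = (x >> i) & 1
        digits.append(prev - bit)
        prev = bit
    return digits


def truncate(digits, j):
    """Value of the recoding once the j least significant digits are dropped."""
    return sum(d * (1 << i) for i, d in enumerate(digits) if i >= j)


# Exhaustive check on 8-bit inputs: truncation always lands on a nearest
# multiple of 2**j, i.e. within half a unit of the kept precision.
for x in range(256):
    d = booth_recode(x, 8)
    assert truncate(d, 0) == x
    for j in range(1, 8):
        t = truncate(d, j)
        assert t % (1 << j) == 0 and abs(x - t) <= 1 << (j - 1)
print("truncation rounds to nearest for all 8-bit inputs")
```

The check passes because the dropped digits always sum to a value of magnitude at most half the weight of the last kept digit, which is exactly what the alternation of 1 and -1 guarantees.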

#### Publication of Previous Works

The work done in 2001–2003 by F. de Dinechin and A. Tisserand on the multipartite table method has been published in *IEEE Transactions on Computers* [3].

The work done in 2002–2004 by N. Boullis and A. Tisserand on the generation of optimized circuits for the problem of multiplication by constants has been published in *IEEE Transactions on Computers* [13].