## Section: New Results

### Algorithm Architecture Interaction

Participants : Steven Derrien, Romuald Rocher, Daniel Ménard, François Charot, Christophe Wolinski, Olivier Sentieys, Patrice Quinton.

#### Computation Accuracy Optimization

Participants : Daniel Ménard, Karthick Parashar, Olivier Sentieys, Romuald Rocher, Hai-Nam Nguyen.

##### Dynamic Precision Scaling

The traditional approach to designing a fixed-point system is based on the worst-case principle. For example, for a digital communication receiver, the maximal performance and the maximal input dynamic range are retained, and the most constraining transmission channel is considered. Nevertheless, the noise and signal levels vary over time. Moreover, the data rate depends on the service (video, image, speech) used by the terminal, and the required performance (bit error rate) is linked to the service. These elements show that the fixed-point specification depends on external factors (noise level, input signal dynamic range, quality of service) and can be adapted over time to reduce the average power consumption.

An approach in which the fixed-point specification is adapted dynamically according to the SNR (Signal-to-Noise Ratio) at the receiver input has been proposed in a concept called *Dynamic Precision Scaling (DPS)*. To adapt the fixed-point specification over time, the architecture integrates flexible operators as presented in Section 6.1.1.

This year, our work on Dynamic Precision Scaling (DPS) has continued. This technique adapts the fixed-point specification to reduce the power consumption. The interest of our approach has been shown on a WCDMA receiver example [57]. A new approach has been proposed to estimate the data dynamic range more accurately [56]. The properties of the application are taken into account to reduce the pessimism of classical analytical approaches such as interval arithmetic. The accuracy constraint used in the fixed-point optimization problem is determined from the required application performance. For the bit error rate, an analytical expression of the accuracy constraint as a function of the bit error rate has been proposed. Our work is now focused on the methodology for finding the optimized fixed-point specification. The aim is to find the optimization algorithm that minimizes the implementation cost under an accuracy constraint.
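
As an illustration of the classical interval-arithmetic estimate mentioned above, the following minimal sketch propagates an input interval through an FIR filter and derives the number of integer bits needed to avoid overflow. The function names and the example filter are hypothetical and are not taken from [56]. The bound assumes that every sample can reach its extreme value simultaneously, which is the source of the pessimism that application-aware methods try to reduce.

```python
import math


def fir_output_range(coeffs, x_min, x_max):
    """Interval-arithmetic bound on y[n] = sum(h[k] * x[n-k]) for x in [x_min, x_max]."""
    lo = hi = 0.0
    for h in coeffs:
        # Each product h * x spans the interval between h * x_min and h * x_max.
        a, b = h * x_min, h * x_max
        lo += min(a, b)
        hi += max(a, b)
    return lo, hi


def integer_bits(lo, hi):
    """Two's-complement integer bits (sign bit included) so that [lo, hi] cannot overflow."""
    peak = max(abs(lo), abs(hi))
    if peak == 0.0:
        return 1
    # m bits cover [-2**(m-1), 2**(m-1)); require 2**(m-1) > peak.
    return math.floor(math.log2(peak)) + 2


if __name__ == "__main__":
    h = [0.25, 0.5, 0.25]  # hypothetical low-pass filter
    lo, hi = fir_output_range(h, -1.0, 1.0)
    print("output range:", (lo, hi), "integer bits:", integer_bits(lo, hi))
```

The bound depends only on the coefficient magnitudes and the input interval, so it is cheap to compute but ignores any correlation between samples.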

##### Fixed-Point Accuracy Evaluation

Analytical techniques have been proposed to accelerate the performance evaluation step, which is the most time-consuming step during optimization. Because not all types of operators can be handled analytically, and because signal processing algorithms are increasingly diverse and complex, a mixed evaluation approach is needed in which both simulation and analytical techniques are used for performance evaluation of the whole system. We also proposed this year to use a spectral density estimate for noise power calculation, relying on an approximate filter to accelerate performance evaluation. We applied this approach to the SSFE (Selective Spanning for Fast Enumeration) algorithm in collaboration with Imec (Interuniversitair Micro-Electronika Centrum), Belgium.
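
The difference between the two evaluation styles can be sketched on a single rounding operator. The example below compares the standard analytical model of rounding noise (uniform error, power q²/12 for a quantization step q) with a simulation-based estimate. It is a generic textbook illustration under the assumption of a uniformly distributed input, not the mixed approach or the spectral method described above; all names are hypothetical.

```python
import random


def analytical_noise_power(frac_bits):
    """Uniform-noise model of rounding: power = q**2 / 12 with q = 2**-frac_bits."""
    q = 2.0 ** (-frac_bits)
    return q * q / 12.0


def simulated_noise_power(frac_bits, n_samples=20_000, seed=0):
    """Estimate the rounding noise power by quantizing random samples in [-1, 1]."""
    rng = random.Random(seed)
    q = 2.0 ** (-frac_bits)
    acc = 0.0
    for _ in range(n_samples):
        x = rng.uniform(-1.0, 1.0)
        xq = round(x / q) * q  # round to the nearest multiple of q
        acc += (xq - x) ** 2
    return acc / n_samples


if __name__ == "__main__":
    for bits in (4, 8, 12):
        print(bits, analytical_noise_power(bits), simulated_noise_power(bits))
```

The analytical value is obtained in constant time, whereas the simulated value needs many samples to converge, which is why analytical models are attractive inside an optimization loop.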

The presence of decision operators has proved to be a serious impediment for a fully analytical noise power estimation technique. We develop a generalized decision operator which can potentially capture the behavior of all possible types of decision operators and provides a fully analytical technique to handle them while performing quantization noise power estimation.

#### Arithmetic Implementation on GPUs

Participant : Arnaud Tisserand.

##### Arithmetic Library for Cryptography on GPUs

In [44], we present first implementation results for a modular arithmetic library for cryptography on GPUs. Our library, written in C++ for CUDA, provides modular arithmetic, finite field arithmetic and some ECC support. Efficient algorithms and implementations are required for a ± b mod p and a × b mod p, where a, b and p are multiple-precision integers and p is prime. These operations are required in finite field arithmetic over GF(p) and in elliptic curve cryptography (ECC), where operand sizes are about 200–600 bits. Graphics processing units (GPUs) are used in high-performance computing systems thanks to their massively multithreaded architectures. However, due to their specific architecture and programming style, porting libraries to GPUs is not simple, even with high-level tools such as CUDA. This work is part of a software library called PACE [5], which aims at providing a very large set of mathematical objects, functions and algorithms to facilitate the writing of arithmetic applications.
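
To make the operations concrete, here is a minimal sketch of multiple-precision modular addition, subtraction and multiplication on 32-bit limbs. It is written in Python for readability and does not reproduce the library from [44]: the limb layout, the function names and the choice of the prime 2^255 − 19 are assumptions made for the example, and the double-and-add multiplication is chosen for clarity rather than speed (a GPU library would more likely use a dedicated reduction scheme).

```python
LIMB_BITS = 32
LIMB_MASK = (1 << LIMB_BITS) - 1


def to_limbs(x, n):
    """Split a non-negative integer into n little-endian 32-bit limbs."""
    return [(x >> (LIMB_BITS * i)) & LIMB_MASK for i in range(n)]


def from_limbs(limbs):
    """Rebuild the integer from little-endian limbs."""
    return sum(limb << (LIMB_BITS * i) for i, limb in enumerate(limbs))


def add_limbs(a, b):
    """Limb-wise addition with carry propagation; returns (limbs, carry_out)."""
    out, carry = [], 0
    for x, y in zip(a, b):
        s = x + y + carry
        out.append(s & LIMB_MASK)
        carry = s >> LIMB_BITS
    return out, carry


def sub_limbs(a, b):
    """Limb-wise subtraction with borrow propagation; returns (limbs, borrow_out)."""
    out, borrow = [], 0
    for x, y in zip(a, b):
        d = x - y - borrow
        out.append(d & LIMB_MASK)
        borrow = 1 if d < 0 else 0
    return out, borrow


def mod_add(a, b, p):
    """(a + b) mod p for operands already reduced modulo p."""
    s, carry = add_limbs(a, b)
    d, borrow = sub_limbs(s, p)
    # Keep s - p when the true sum is >= p (overflowed, or no borrow on subtraction).
    return d if (carry or not borrow) else s


def mod_sub(a, b, p):
    """(a - b) mod p for operands already reduced modulo p."""
    d, borrow = sub_limbs(a, b)
    if borrow:
        d, _ = add_limbs(d, p)  # wrap back into [0, p)
    return d


def mod_mul(a, b, p):
    """(a * b) mod p by binary double-and-add, scanning b from its top bit."""
    result = [0] * len(p)
    for i in reversed(range(len(b) * LIMB_BITS)):
        result = mod_add(result, result, p)
        if (b[i // LIMB_BITS] >> (i % LIMB_BITS)) & 1:
            result = mod_add(result, a, p)
    return result


if __name__ == "__main__":
    p_int = 2**255 - 19  # example prime, 8 limbs
    a_int, b_int = pow(3, 150, p_int), pow(5, 100, p_int)
    a, b, p = (to_limbs(v, 8) for v in (a_int, b_int, p_int))
    print(hex(from_limbs(mod_mul(a, b, p))))
```

Each limb operation is independent apart from the carry or borrow chain, which is the part that makes an efficient mapping onto many GPU threads non-trivial.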

##### Power Consumption of GPUs

In [35], we investigate how and where power is consumed within a GPU board by analyzing the relations between the measured power consumption, the execution time and the type of units that are stressed to perform a defined operation. In this paper, we consider Nvidia GPUs used for GPGPU (General-Purpose computing on GPUs) in a CUDA environment. During the analysis, functional blocks are identified, and their power is characterized using physical measurements. The considered blocks correspond to units that are usually stressed while executing common kernels on the GPU: register file, memory hierarchy and functional units. In addition to the power estimation, our analysis gives us some information on the organization of the memory hierarchy, the behavior of functional units and some undocumented features.

#### Multi-Antenna Systems

Participants : Olivier Berder, Pascal Scalart, Quoc-Tuong Ngo.

When the transmitter can obtain some Channel State Information (CSI) from the receiver, antenna power allocation strategies can be implemented through the joint optimization of a linear precoder (at the transmitter) and a decoder (at the receiver).

A new exact solution to the maximization of the minimum Euclidean distance between received symbols has been proposed for two 16-QAM modulated symbols [54]. This precoder significantly increases this minimum distance compared to diagonal precoders, which leads to a significant BER improvement. This new strategy selects the best precoding matrix among eight different expressions, depending on the value of the channel angle. In order to decrease the complexity, other sets of precoders have been proposed; the performance of the simplest one, composed of only two different precoders, remains very close to the optimum in terms of BER.
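
The criterion being optimized can be sketched numerically. The code below evaluates the minimum Euclidean distance of the received constellation for a given 2×2 precoder and two 16-QAM symbols. It assumes a normalized virtual channel of the form diag(cos γ, sin γ), where γ is the channel angle, which is the usual form in this family of precoders but is an assumption here; the eight precoder expressions of [54] are not reproduced, power normalization is left out, and all names are hypothetical.

```python
import math

QAM16_LEVELS = (-3, -1, 1, 3)
QAM16 = [complex(i, q) for i in QAM16_LEVELS for q in QAM16_LEVELS]

# All differences between two 16-QAM points, then all non-zero pairs of differences.
_SYMBOL_DIFFS = {a - b for a in QAM16 for b in QAM16}
DIFF_VECTORS = [(e1, e2) for e1 in _SYMBOL_DIFFS for e2 in _SYMBOL_DIFFS
                if e1 != 0 or e2 != 0]


def min_distance(gamma, precoder):
    """Minimum of ||Hv * F * e|| over all non-zero difference vectors e.

    gamma    -- channel angle of the assumed virtual channel diag(cos, sin)
    precoder -- 2x2 nested list of (complex) entries F[row][col]
    """
    hv = (math.cos(gamma), math.sin(gamma))
    best = math.inf
    for e1, e2 in DIFF_VECTORS:
        y1 = hv[0] * (precoder[0][0] * e1 + precoder[0][1] * e2)
        y2 = hv[1] * (precoder[1][0] * e1 + precoder[1][1] * e2)
        best = min(best, math.hypot(abs(y1), abs(y2)))
    return best


if __name__ == "__main__":
    identity = [[1, 0], [0, 1]]  # no precoding, used as a baseline
    for gamma in (math.pi / 8, math.pi / 4):
        print(gamma, min_distance(gamma, identity))
```

A precoder search would evaluate this function for each candidate matrix and keep the one with the largest minimum distance for the current channel angle.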

An efficient sub-optimal MIMO linear precoder based on the maximization of minimum distance has been proposed for three virtual subchannels [53] . A new virtual MIMO channel representation with two channel angles allows the parameterization of the linear precoder and the optimization of the distance between signal points at the received constellation. As these precoders need a Singular Value Decomposition (SVD) of the propagation channel, an optimized architecture of SVD was proposed for an FPGA implementation [71] .

#### Parallel Reconfigurable Architectures for LDPC Decoding

Participants : Florent Berthelot, François Charot, Charles Wagner, Christophe Wolinski.

LDPC codes are a class of error-correcting codes introduced by Gallager together with an iterative probability-based decoding algorithm. Their performance, combined with their relatively simple decoding algorithm, makes these codes very attractive for the next generations of satellite and radio digital transmission systems. LDPC codes were chosen in the DVB-S2, 802.11n, 802.16e, 802.3an and CCSDS standards. The major problem is the huge design space, composed of many interrelated parameters, which forces drastic design trade-offs. Another important issue is the need for flexible hardware solutions that are able to support all the variants of a given standard.

Previously we have defined a generic architecture template that is composed of several processing modules and a set of interconnection buses for inter-module communications. Each module includes two processing units (called *bitnode* and *checknode* processing units), and a set of memory banks. The number of modules, the number of interconnection buses, the size and the number of memory banks are standard dependent.
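
The role of the two processing units can be illustrated with the min-sum message updates, a common simplification of probability-based LDPC decoding. The source does not state which update rule the architecture implements, so min-sum is an assumption here, and the function names are hypothetical. Messages are log-likelihood ratios (positive values favour bit 0).

```python
def check_node_update(incoming):
    """Min-sum checknode: for each edge, combine the messages of all *other* edges.

    The outgoing sign is the product of the other signs, and the outgoing
    magnitude is the smallest of the other magnitudes. Needs at least 2 edges.
    """
    out = []
    for i in range(len(incoming)):
        others = incoming[:i] + incoming[i + 1:]
        sign = 1.0
        for v in others:
            if v < 0:
                sign = -sign
        out.append(sign * min(abs(v) for v in others))
    return out


def bit_node_update(channel_llr, incoming):
    """Bitnode: for each edge, channel LLR plus the messages of all other edges."""
    total = channel_llr + sum(incoming)
    return [total - v for v in incoming]


def hard_decision(channel_llr, incoming):
    """Decoded bit from the full a-posteriori LLR (0 if non-negative, else 1)."""
    return 0 if channel_llr + sum(incoming) >= 0 else 1


if __name__ == "__main__":
    print(check_node_update([2.0, -3.0, 4.0]))
    print(bit_node_update(1.0, [0.5, -2.0]))
    print(hard_decision(1.0, [0.5, -2.0]))
```

In a partly parallel decoder, several such units run concurrently and exchange these messages through memory banks, which is why memory organization weighs heavily on the architecture.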

This year, we have proposed a generic architecture for a CCSDS LDPC decoder. This architecture exploits the regularity and the parallelism of the code, and achieves genericity through an optimized storage of the data. Two FPGA implementations have been proposed: the first one is low-cost oriented and the second one targets a high-speed decoder [39]. Moreover, in the context of the RPS2 project, we have designed a parallel architecture suited to the decoding of LDPC codes for the DVB-S2 digital video broadcasting standard. Due to the huge codeword length used in the DVB-S2 standard, only partly parallel architectures are feasible. The designed architecture exploits the periodic nature of DVB-S2 LDPC codes.

#### Algorithm Optimization for Low Energy in Wireless Applications

Participants : Olivier Berder, Tuan-Duc Nguyen, Vinh Tran, Olivier Sentieys.

In wireless distributed networks, where multiple antennas cannot be integrated in one node, Cooperative Multi-Input Multi-Output (C-MIMO) techniques help to exploit the space-time diversity gain in order to increase performance or to reduce the transmission energy consumption. In [14], strategies using cooperative MIMO techniques were proposed for Wireless Sensor Networks (WSNs), where energy consumption is the most important design criterion. The performance and energy consumption advantages of the cooperative MIMO technique were investigated in comparison with SISO (Single-Input Single-Output), multi-hop SISO and cooperative relay techniques, and an optimal selection of the number of transmit and receive antennas in terms of energy consumption was also proposed as a function of the transmission distance.

Since the wireless nodes are physically separated in cooperative MIMO systems, the imperfect time synchronization between the clocks of cooperating nodes leads to an unsynchronized MIMO transmission. The performance degradation caused by this transmission synchronization error and by the additional noise of cooperative reception is evaluated by simulations. Two new cooperative reception techniques based on the relay principle and a new efficient space-time combination technique were proposed to increase the energy efficiency of cooperative MIMO systems. Finally, performance and energy consumption comparisons between cooperative MIMO and relay techniques are performed, and an association strategy is also proposed to exploit the advantages of the two cooperative techniques simultaneously.

Two cooperative MIMO relay models, one simple and one full, are proposed by associating space-time codes and cooperative relaying. In both models, a two-antenna source transmits space-time codes to two relays and to the destination at the same time. The relay nodes use a new Amplify and Forward (AF) protocol and a new Decode and Forward (DF) protocol based on the Alamouti space-time code to forward the signals to the destination. Simulations show that these two models achieve higher performance than the Alamouti scheme.
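
For reference, the baseline Alamouti scheme on which these protocols build can be sketched as follows for two transmit antennas and one receive antenna. The sketch assumes a flat channel that stays constant over the two symbol periods and perfect channel knowledge at the receiver; the new AF and DF relay protocols themselves are not reproduced, and the function names are hypothetical.

```python
def alamouti_encode(s1, s2):
    """Return the symbols sent by (antenna 1, antenna 2) in two consecutive periods."""
    return [(s1, s2), (-s2.conjugate(), s1.conjugate())]


def channel(tx, h1, h2, noise=(0j, 0j)):
    """Received samples r1, r2 for channel gains h1, h2 (constant over both periods)."""
    return [h1 * a + h2 * b + n for (a, b), n in zip(tx, noise)]


def alamouti_combine(r1, r2, h1, h2):
    """Linear combining that separates s1 and s2, normalized by |h1|^2 + |h2|^2."""
    gain = abs(h1) ** 2 + abs(h2) ** 2
    s1_hat = (h1.conjugate() * r1 + h2 * r2.conjugate()) / gain
    s2_hat = (h2.conjugate() * r1 - h1 * r2.conjugate()) / gain
    return s1_hat, s2_hat


if __name__ == "__main__":
    s1, s2 = 1 + 1j, -1 + 1j          # two QPSK-like symbols
    h1, h2 = 0.8 - 0.3j, -0.2 + 0.9j  # arbitrary example channel gains
    r1, r2 = channel(alamouti_encode(s1, s2), h1, h2)
    print(alamouti_combine(r1, r2, h1, h2))
```

Without noise, the combiner returns the transmitted symbols exactly (up to rounding), and each symbol benefits from the combined gain of both channel paths, which is the diversity effect exploited by the cooperative schemes.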

#### Wireless Communications for Automotive Systems

Participants : Olivier Berder, Tuan-Duc Nguyen, Olivier Sentieys, Jérome Astier, Arnaud Carer, Thomas Anger.

The CAPTIV (Cooperative strAtegies for low Power wireless Transmissions between Infrastructures and Vehicles) project aims at using new radio communication technologies to enhance driver safety. In a cooperative network composed of vehicles and road signs equipped with autonomous radio transmitters, the communications can be optimized at different levels. It was shown that space-time codes allow a dramatic decrease in the energy consumption of communications between crossroads. In order both to develop the CAPTIV application program and to evaluate driver behaviour when faced with this new kind of information, a specific driving simulator was designed, based on the ECA-FAROS platform. A real prototype has already been evaluated and proves the feasibility of the CAPTIV application; it will soon be optimized using signal processing techniques. Although the main goal remains driving assistance, many applications could be implemented on this platform, which will be able to deliver any kind of information (weather, parking, tourist information, advertisements, etc.) [29], [76].

For wireless communications between infrastructure and vehicles, cooperative strategies were defined in order to choose the most energy efficient techniques between cooperative MIMO and relay in the CAPTIV context [55] . The performance of an association of both techniques in terms of Bit Error Rate and energy efficiency was also evaluated and analyzed in [21] .

#### True Random Number Generators

Participants : Renaud Santoro, Olivier Sentieys, Arnaud Tisserand, Philippe Quémerais, Arnaud Carer, Thomas Anger.

The objective of a random number generator (RNG) is to produce random binary numbers which are statistically independent, uniformly distributed and unpredictable. RNGs are necessary in many applications and the number of embedded hardware architectures requiring RNGs is continuously increasing. Generally, a hybrid RNG comprising a True Random Number Generator (TRNG) and a Pseudo Random Number Generator (PRNG) is used. PRNGs are based on deterministic algorithms. They are periodic, and must be initialized by a TRNG. TRNGs are based on a physical noise source (e.g. thermal noise or the jitter of free-running oscillators) and depend strongly on their implementation quality. Most of the TRNGs implemented in FPGA or ASIC use the phase jitter produced by a free-running oscillator or a Phase-Locked Loop (PLL) [97]. In practice, jitter can be influenced by noise external to the FPGA (power supply noise, temperature) and by chip activity. This dependence is a weakness that can be exploited by exposing the TRNG to hostile environmental conditions [117].

In cryptography, security is usually based on the randomness quality of a key generated by an RNG. Some PRNGs are recognized to produce high-quality random numbers [103]. However, their quality depends on the randomness of the TRNG seed. PRNG randomness evaluation is usually performed by using a battery of statistical tests. Several such batteries are reported in the literature, including the Diehard [106] and NIST [115] batteries. They are all implemented in high-level software. When a PRNG is evaluated, designers store a huge bit stream in memory and then submit it to software tests. If the bit stream successfully passes a certain number of statistical tests, the PRNG is said to be sufficiently random. TRNG validation is more complicated, as TRNG behavior depends on the construction, on the external environment and, above all, on a physical noise source which can differ in practice from an ideal noise source. However, [100] has described a methodology to evaluate physical generators. The procedure is based on the TRNG construction and is the technical reference of AIS 31 [83]. TRNG weaknesses and external attacks must be detected in real time so that the TRNG output can be inhibited [117]; a solution is to monitor the TRNG at switch-on and during operation by using statistical tests [100], [117].

##### Evaluation of TRNGs under Various Experimental Conditions

Attacking the TRNG is an effective way to weaken a cryptosystem, leading for instance to lower-security keys or bad padding values. New TRNGs have recently been proposed in the literature; however, selecting a robust and efficient TRNG remains a difficult problem. To the best of our knowledge, no real and objective comparison of several TRNGs appears in the literature. During this year, we have investigated the randomness behavior, the area and the power consumption of recent TRNGs implemented in FPGA circuits [65]. The randomness of the generator output has been evaluated by using hardware-accelerated statistical tests [66].

This year, the possibility of implementing the AIS 31 statistical tests in hardware has been studied. The tests have then been implemented on ASIC and FPGA targets. The hardware cost shows that the design can be used in low-cost embedded cryptography circuits. Moreover, the data rate achieved by the hardware tests allows the TRNG to be monitored in real time.

##### Ochre: a Circuit for *On-Chip Randomness Extraction*

This year has seen the design and fabrication of an integrated circuit prototype (Ochre) including our architecture proposal for a hybrid RNG [15]. The chip is composed of a TRNG based on several free-running ring oscillators, a cellular-automata-based PRNG and some hardware statistical tests, including those of FIPS 140-2. The tests monitor the TRNG quality in real time to validate the PRNG seed randomness, as proposed in [15], [66]. Ochre has been fabricated in a 130 nm CMOS technology from STMicroelectronics and is able to reach 800 Mbit/s for 0.3 mm^{2} and 5 mW at 200 MHz (see Fig. 6). The circuit has been successfully tested after fabrication.
A second version of this chip is currently being designed and will be fabricated in Spring 2010. Ochre V2 will include more stringent statistical tests and also a high-quality PRNG. This chip is intended for the quality evaluation of several TRNGs in a VLSI technology.
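
As an indication of what real-time monitoring involves, the sketch below models a streaming monobit test in software: a counter accumulates ones over a block and raises an alarm when the count leaves the accepted interval. The block size and thresholds are assumed to be the FIPS 140-2 monobit values (20,000 bits, 9,725 < ones < 10,275) and should be checked against the standard; the class is hypothetical and does not describe the Ochre hardware.

```python
FIPS_BLOCK_BITS = 20_000                     # assumed FIPS 140-2 block length
MONOBIT_LOW, MONOBIT_HIGH = 9_725, 10_275    # assumed exclusive bounds on the ones count


class MonobitMonitor:
    """Counts ones bit by bit and updates an alarm flag at the end of each block."""

    def __init__(self):
        self.count = 0      # bits seen in the current block
        self.ones = 0       # ones seen in the current block
        self.alarm = False  # result of the last completed block

    def push(self, bit):
        """Consume one bit (0 or 1) and return the current alarm state."""
        self.ones += bit
        self.count += 1
        if self.count == FIPS_BLOCK_BITS:
            self.alarm = not (MONOBIT_LOW < self.ones < MONOBIT_HIGH)
            self.count = 0
            self.ones = 0
        return self.alarm


if __name__ == "__main__":
    monitor = MonobitMonitor()
    stuck_source = [1] * FIPS_BLOCK_BITS  # models a TRNG stuck at 1
    for b in stuck_source:
        alarm = monitor.push(b)
    print("alarm after stuck-at-1 block:", alarm)
```

Only a counter and two comparators are needed per block, which suggests why this kind of test is cheap to embed next to the generator.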

##### Arithmetic Operators for Evaluation of TRNG Randomness Quality

In [67], we propose arithmetic operators for the on-the-fly evaluation of TRNG randomness quality. We use Maurer's universal test, which is included in the NIST and AIS 31 test suites. This test requires a large number of arithmetic units and memory banks, so optimization is important for embedded implementations. One of the main tasks is the evaluation of the harmonic series for large index values. We use the DeTemple-Wang mathematical approximation and polynomial approximations that are well suited to high-performance hardware implementations. Based on several recent results in computer arithmetic, one can generate highly optimized polynomial approximations on a single interval. However, the degree of the polynomial needed for an accurate approximation on the whole interval is too high. In this paper, we present a method for splitting the interval into well-chosen sub-intervals, with one low-degree polynomial per sub-interval. We detail the design and the optimization of the approximation operator. We also present its implementation on FPGAs and the results obtained for the on-the-fly evaluation of TRNGs.
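
The benefit of a closed-form approximation for large indices can be sketched numerically. The code below compares a direct summation of the harmonic series with the half-integer approximation H_n ≈ ln(n + 1/2) + γ + 1/(24(n + 1/2)²) − 7/(960(n + 1/2)⁴), which is the commonly cited form of the DeTemple-Wang expansion; whether [67] uses exactly these terms is an assumption, and the function names are hypothetical.

```python
import math

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant


def harmonic_direct(n):
    """H_n by direct summation: n divisions and additions."""
    return math.fsum(1.0 / k for k in range(1, n + 1))


def harmonic_half_integer(n):
    """H_n from the half-integer expansion: one logarithm and a few multiplications."""
    m = n + 0.5
    return (math.log(m) + EULER_GAMMA
            + 1.0 / (24.0 * m ** 2)
            - 7.0 / (960.0 * m ** 4))


if __name__ == "__main__":
    for n in (10, 1_000, 100_000):
        exact = harmonic_direct(n)
        approx = harmonic_half_integer(n)
        print(n, exact, approx, abs(exact - approx))
```

The cost of the approximation does not grow with n, and its error shrinks as n increases; the remaining difficulty in hardware is the logarithm, which is where piecewise polynomial approximations come in.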

#### Flexible Hardware Accelerators for Biocomputing Applications

Participants : Steven Derrien, Naeem Abbas, Patrice Quinton.

It is nowadays acknowledged that FPGA-based hardware acceleration of compute-intensive bioinformatics applications is a viable alternative to cluster-based (or grid-based) approaches. One issue with this technology is that it remains difficult to use and to maintain (one is designing a circuit rather than programming a machine). Even though several C-to-hardware compilation flows exist (Mitrion-C, C2H, Gaut, Impulse-C, etc.), they do not offer good enough performance to justify the use of reconfigurable technology.
Most successful hardware implementations of biocomputing algorithms were therefore designed by hand at the RTL level, targeting a specific reconfigurable system (if not a specific FPGA technology). Maintaining, upgrading and porting such implementations to new systems is therefore a very tedious task, and always comes at the price of a significant performance loss (indeed, a complete rewrite is often required).
The use of retargetable IP core generators, capable of producing an optimized RTL hardware description from a high-level system description of the accelerator, could foster the use of FPGAs and reconfigurable technology for this type of application. Although such generators exist for signal processing kernels (FFT, DCT, etc.) and specialized arithmetic functions (mostly floating point), no contribution has been made in the field of biocomputing.
This research work, which is part of the ANR BioWiic project, aims at providing a framework for the semi-automatic generation of flexible IP cores, by widening the scope of typical design constraints so as to integrate communication and data reuse optimizations between the host and the hardware accelerator. This research work builds upon the Cairn research group's expertise in automatic parallelization for application-specific hardware accelerators. Considered target applications include HMMer, ClustalW and BLAST.