Section: New Results
Dynamically and Heterogeneous Reconfigurable Platforms
New Reconfigurable Architectures
Flexible Arithmetic Operator Design
Participants : Emmanuel Casseau, Daniel Ménard, François Charot, Christophe Wolinski, Shafqat Khan.
Our aim is to propose new arithmetic operators that are flexible in terms of data size. Targeted applications are typically multimedia processing. To optimize fixed-point implementations, architectures must offer operators that support different data word-lengths. Operator efficiency can be increased using a subword parallelism (SWP) scheme: a single SWP instruction performs the same operation on multiple sets of subwords in parallel using SWP operators. In existing SWP-capable processors, the available subword sizes are usually 8, 16 and 32 bits. These sizes are chosen because SWP operators are simpler to design when every subword size is a multiple of the smallest one. However, in multimedia applications the input data (pixels) are 8, 10, 12 and sometimes 16 bits wide. These data sizes do not match the subword sizes of existing processors, which leads to under-utilization of processor resources. Operators that support multimedia-oriented subword sizes (8, 10, 12 and 16 bits) are therefore required. Multimedia operations are built on basic operators (add, absolute value, multiply), but more complex operations are also required to increase both speed and efficiency. For instance, a dedicated operation is required in the calculation of the SAD, and another for the multiplication-accumulation used in the DCT algorithm. To avoid the overheads of reconfiguration, such as the complexity of the interconnection network and the reconfiguration time, we designed a pipelined multimedia operator that provides reconfigurability inside the operator through a configurable datapath [46]. The operator can be configured to perform most multimedia operations on different data sizes without any reconfiguration time penalty. This operator will be used as a computing unit inside a reconfigurable processor tailored for multimedia applications [52].
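As an illustration of the subword parallelism principle described above, the following is a minimal software sketch, not the operator of [46]. It models a packed word holding several subwords of a configurable width (for example 10-bit pixels) and applies an addition and a sum of absolute differences lane by lane. The function names (`pack`, `unpack`, `swp_add`, `swp_sad`) and the lane layout are assumptions made for this example only.

```python
def pack(values, subword_bits):
    """Pack a list of unsigned values into one integer, lane 0 in the low bits."""
    mask = (1 << subword_bits) - 1
    word = 0
    for i, v in enumerate(values):
        word |= (v & mask) << (i * subword_bits)
    return word


def unpack(word, subword_bits, lanes):
    """Split a packed word back into its unsigned subwords."""
    mask = (1 << subword_bits) - 1
    return [(word >> (i * subword_bits)) & mask for i in range(lanes)]


def swp_add(a, b, subword_bits, lanes):
    """Lane-wise addition: each lane wraps around and no carry crosses a lane."""
    mask = (1 << subword_bits) - 1
    result = 0
    for i in range(lanes):
        shift = i * subword_bits
        lane_sum = ((a >> shift) & mask) + ((b >> shift) & mask)
        result |= (lane_sum & mask) << shift  # drop the carry out of the lane
    return result


def swp_sad(a, b, subword_bits, lanes):
    """Sum of absolute differences over all lanes of two packed words."""
    xs = unpack(a, subword_bits, lanes)
    ys = unpack(b, subword_bits, lanes)
    return sum(abs(x - y) for x, y in zip(xs, ys))


# Four 10-bit pixels per packed word (hypothetical data).
pa = pack([1000, 512, 3, 1023], 10)
pb = pack([23, 511, 4, 1], 10)
print(unpack(swp_add(pa, pb, 10, 4), 10, 4))
print(swp_sad(pa, pb, 10, 4))
```

The same functions accept 8, 12 or 16 as `subword_bits`, which mirrors the idea of one operator serving several multimedia data sizes. A hardware operator would process all lanes concurrently rather than in a loop.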
Adaptive and Multi-mode Devices
Participants : Emmanuel Casseau, Antoine Floch, Erwan Raffin, Daniel Ménard, Shafqat Khan, François Charot, Christophe Wolinski, Olivier Sentieys.
In a mobile society, more and more devices need to adapt continuously to changing environments; that is, devices must be flexible enough to implement different algorithms at different times. Such mode switches require more than software changes: the application-specific hardware components must adapt as well. To address this requirement, we investigate two approaches. The first is the design of a reconfigurable processor able to adapt its computing structure to a dedicated domain, namely video and image processing applications. The processor is built around a pipeline of coarse-grain reconfigurable operators offering a good trade-off between performance and power consumption. In contrast to previous reconfigurable processors, flexibility is obtained not through a flexible interconnection network but through configurable domain-dedicated units [52]. This work is done in the context of the ROMA ANR project. We particularly investigate reconfigurable operator design and the compilation framework. The second approach is the synthesis of multi-mode architectures, which incur no reconfiguration time penalty. Such architectures implement all the operators required by the pre-defined set of computations to be performed. To optimize area, these operators are shared among the algorithms, and control logic steers the data to the operators depending on the algorithm being executed at a given time. Synthesis can be constrained for performance or area. Targeted domains are typically channel encoding, cryptography and multimedia [43]. This work is done in collaboration with IMS Lab. (B. Le Gal).
Reconfigurable Architecture Description Language
Participants : Julien Lallet, Sébastien Pillement, Olivier Sentieys, François Charot.
Our research aims at defining a platform model for the specification of dynamically reconfigurable architectures and their associated methods. The main objective is to have a unified and formal specification of the platform that can be efficiently exploited in retargetable compilation flows and in automated back-end generators for simulation and synthesis. The model is defined to cover different classes of architectures, from FPGAs to networks of processors, including coarse-grained reconfigurable data-paths.
This method makes it easy to develop a new dynamically reconfigurable architecture from computing resources and generic interconnection schemes, to explore its performance, and to validate the architecture by simulation at different levels of abstraction. The architecture is defined through a high-level architecture description based on the MAML language developed at the University of Erlangen-Nuremberg. The first part of the work defined structures that interconnect different kinds of computing resources (configurable logic blocks, reconfigurable functional units or processors) and produce the reconfiguration resources required for a homogeneous reconfiguration process. Different architecture paradigms (FPGA, reconfigurable datapaths such as DART, or regular parallel processor architectures such as WPPA) can thus be modeled quickly. The second part of this work consisted in generating the configuration controller after analyzing the MAML specifications of the architecture and of the reconfiguration resources produced. This work led to the development of the Mozaic framework. This tool is able to generate a reconfigurable platform and to explore important parameters (reconfiguration cost and time, flexibility and size of the interconnect, number of resources). The proposed reconfiguration paradigm for computing and interconnect resources has been optimized for a very fast reconfiguration process, which is essential to meet the timing constraints of today's applications. A wireless receiver has been implemented on various architectures generated by our tool, demonstrating the efficiency of our methodology applied to reconfigurable systems [19], [48], [73].
Arithmetic Operators and Number Representations
Participants : Arnaud Tisserand, Stanislaw Piestrak.
Arithmetic Operators for Cryptography
Redundant number systems were introduced to speed up some computations. In a redundant number system, some numbers have several distinct representations. This property is used in some number systems to allow constant-time addition (the addition time does not depend on the number of digits). Redundant number systems have long been used in cryptography, where values are frequently recoded into a redundant representation. For instance, Non-Adjacent Forms (NAF and w-NAF) are used in modular exponentiation in RSA and in scalar multiplication in ECC, where redundancy lowers the number of some operations. In [82] and [31], we present some investigations on the links between redundant number systems and reconfigurable arithmetic units with countermeasures against some side-channel attacks. Using a redundant number system makes it possible to change the way some computations are performed, and hence their effects on side-channel analysis and attacks. The frequency (internal iteration level, field operation level, curve operation level...) and the location (digit level, number level, curve point level...) of the reconfigurations strongly affect the characteristics of the units. We present first results on reconfigurable arithmetic units for cryptography.
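To make the NAF recoding mentioned above concrete, here is a minimal sketch of the standard binary NAF algorithm. It is a textbook illustration and not the reconfigurable unit studied in [82] and [31]. The function names `naf` and `naf_value` are chosen for this example only.

```python
def naf(k):
    """Return the non-adjacent form of k >= 0, least significant digit first.

    Digits are in {-1, 0, 1} and no two consecutive digits are non-zero.
    """
    if k < 0:
        raise ValueError("k must be non-negative")
    digits = []
    while k > 0:
        if k & 1:
            d = 2 - (k % 4)  # +1 when k = 1 (mod 4), -1 when k = 3 (mod 4)
            k -= d           # k becomes a multiple of 4, so the next digit is 0
        else:
            d = 0
        digits.append(d)
        k >>= 1
    return digits


def naf_value(digits):
    """Evaluate a signed-digit string (least significant digit first)."""
    return sum(d * (1 << i) for i, d in enumerate(digits))


print(naf(7))              # 7 = 8 - 1
print(naf_value(naf(7)))
```

The binary expansion of 7 has three non-zero bits, while its NAF has two non-zero digits. In a double-and-add scalar multiplication each non-zero digit costs one point addition or subtraction, which is how the recoding lowers the operation count.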
Dedicated Arithmetic Operators
In [68], we study the design of dedicated function approximation operators based on a mix of two recent techniques: the low-degree polynomial approximation proposed in [11] and the estimated arithmetic operators proposed in [113]. The method of [11] produces function approximation operators dedicated to hardware implementation. The generated operators use low-degree polynomial approximations whose coefficients are selected for both accuracy and implementation purposes. Estimated arithmetic [113] deals with arithmetic circuits that produce approximate results: some internal signals, such as carries, are not computed. It trades accuracy for speed, silicon area and power. Adders and multipliers have been investigated using estimated arithmetic. In [68], we study various trade-offs between the degree of the polynomial, the selection of its coefficients, the data-path size and the accuracy of the estimated arithmetic operators. The resulting operators are small and fast, and they provide a small average error, although a few large errors may occur for some very infrequent arguments. Compared to previous results, typical improvements are about 20–60% in delay and 15–30% in area.
Number Representation for Digital Signal Processing (DSP)
Two's complement number systems impose a fundamental limitation on the power and performance of arithmetic circuits, due to the need for cross-datapath carry propagation. The Residue Number System (RNS) removes this limitation by decomposing a number into parts and performing arithmetic operations on them in parallel. In [34], we proposed to extend the instruction set architecture with separate instructions for RNS computations. The basic RNS components were designed in RTL Verilog and synthesized using the 0.18 µm OSU standard cell library with the Cadence Encounter® RTL Compiler. An application mapping problem on the proposed RNS extension, which includes both instruction selection and instruction scheduling, was formulated and solved. Our experiments demonstrate not only a simultaneous improvement of up to 30% in performance and a 57% reduction in functional unit power consumption, but also that most of these benefits can be exploited with automatically generated code. The compiler technique introduced in this work could be improved further by refining the profit model so that it models instruction execution more accurately.
The limitations of RNS include the difficult implementation of non-modular operations such as magnitude comparison, sign detection, and division. To alleviate these drawbacks, the diagonal function was proposed by Dimauro et al. However, in [22], we have shown that every implementation involving the diagonal function proposed to date results in excessive hardware overhead and delay, which makes it impractical from the application point of view, so that it cannot compete with more traditional approaches.
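The RNS principle discussed above (independent, carry-free arithmetic on residues) can be sketched as follows. This is a minimal software illustration, not the instruction set extension of [34]. The moduli set (7, 8, 9) and the function names are assumptions made for this example, and the conversion back to binary uses the Chinese Remainder Theorem.

```python
from math import prod

MODULI = (7, 8, 9)  # pairwise coprime, chosen for illustration; range is 7*8*9 = 504


def to_rns(x, moduli=MODULI):
    """Decompose x into one residue per modulus."""
    return tuple(x % m for m in moduli)


def rns_add(a, b, moduli=MODULI):
    """Channel-wise addition: no carry travels between channels."""
    return tuple((x + y) % m for x, y, m in zip(a, b, moduli))


def rns_mul(a, b, moduli=MODULI):
    """Channel-wise multiplication."""
    return tuple((x * y) % m for x, y, m in zip(a, b, moduli))


def from_rns(residues, moduli=MODULI):
    """Recover the integer in [0, M) with the Chinese Remainder Theorem."""
    big_m = prod(moduli)
    total = 0
    for r, m in zip(residues, moduli):
        m_i = big_m // m
        total += r * m_i * pow(m_i, -1, m)  # modular inverse (Python 3.8+)
    return total % big_m


print(to_rns(123))
print(from_rns(rns_add(to_rns(123), to_rns(200))))
print(from_rns(rns_mul(to_rns(20), to_rns(25))))
```

Addition and multiplication need only small independent channels, whereas comparing two values or detecting a sign requires information from all channels at once, typically through a conversion such as `from_rns`. This is the difficulty with non-modular operations noted above.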
Management of Dynamically Reconfigurable Systems
Participants : Antoine Eiche, Daniel Chillet, Sébastien Pillement, Ludovic Devaux, Olivier Sentieys.
To support the dynamic behavior of new embedded applications, heterogeneous execution resources are often included in modern SoC or MPSoC (Multi-Processor System-on-Chip) systems. The management of these resources is classically supported by an operating system (OS) that includes several specific services. One newly required service concerns task scheduling and placement within the reconfigurable resources. The classical temporal scheduling problem is extended with a spatial dimension in order to manage the area physically available in the reconfigurable resource. The second impacted service is task communication management. Because tasks are placed on-line, the interconnection requirements are difficult to predict. A flexible and dynamic interconnect medium must therefore be defined.
Models for Dynamically Reconfigurable Systems
Participants : Daniel Chillet, Sébastien Pillement.
During the high-level design of the complete system, the designer must be able to choose between different architecture, application and operating system solutions. To support this exploration phase, the OverSoC project has proposed to develop a global methodology. In this context, we developed a first model of a dynamically reconfigurable architecture (DRA). Built using the SystemC language, the model is modular and enables fast evaluation of specific OS services for DRA management [64]. Based on this model, we implemented several services, such as a simple task placement, and evaluated several design parameters to qualify solutions [79]. The model is effective and has been integrated into the OverSoC methodology.
Scheduling based on Artificial Neural Networks
Participants : Antoine Eiche, Daniel Chillet, Sébastien Pillement, Olivier Sentieys.
This year, we continued our work on scheduling with Artificial Neural Networks (ANN) and compared classical scheduling algorithms (e.g. PFair) with our ANN structure composed of inhibitor neurons. We have demonstrated that our model can manage heterogeneous multiprocessor architectures, whereas classical scheduling solutions are only applicable to homogeneous multiprocessors [28], [16]. Our scheduling was extended with task placement on heterogeneous reconfigurable execution resources by defining a spatio-temporal scheduling composed of two steps. The first step is the time scheduling under a resource placement constraint. The second step is the task placement with a real model of the possible instances of each application task. These two steps are solved by two different neural networks which can be evaluated in parallel [72]. A hardware structure of our neural network has been developed for the temporal scheduler and shows that the neural network is very efficient in hardware, making it a very good candidate for a hardware implementation of this service.
Flexible Communication Infrastructure
Participants : Daniel Chillet, Sébastien Pillement, Ludovic Devaux.
For task communication within reconfigurable resources, we defined a specific interconnection architecture adapted to the dynamically and partially reconfigurable resources included in modern SoCs. We defined a first hierarchical interconnect infrastructure and specified an RTL VHDL model of this solution. Furthermore, to evaluate our architectural proposal, we built a demonstrator platform which allows us to illustrate the reconfiguration concept of the communication network. This led to the DRAFT network, based on the fat-tree topology and specifically designed to support the communication constraints imposed by dynamic reconfiguration [41]. DRAGOON, an automatic generator of DRAFT simulation and synthesis models, was also designed to evaluate various versions of the network. Thanks to DRAGOON, DRAFT has been successfully compared with the most popular Network-on-Chip (NoC) topologies, such as mesh and regular fat-tree [18].
Fault-Tolerant Reconfigurable Systems
Participants : Stanislaw Piestrak, Sébastien Pillement, Manh Pham, Olivier Sentieys.
The use of reconfigurable hardware in critical applications such as transportation and transaction systems is increasing rapidly. Undetected errors, caused for example by radiation, may result in fatal silent data corruption and unreproducible system crashes. Since it is virtually impossible to build devices which are free from faults, it is essential to embed some form of fault tolerance in such devices, enabling them to work correctly even in the presence of faults. Over the past decade, a lot of research has been done to develop fault-tolerant reconfigurable systems at various granularity levels, although most of it has dealt with the lowest level, such as that offered by FPGAs.
In [45], we have considered the possibility of implementing low-cost hardware techniques to tolerate temporary faults in the data-paths of coarse-grained reconfigurable architectures. Our goal was to incur less hardware overhead than the commonly used duplication or triplication methods. The proposed technique relies on concurrent error detection using a residue code modulo 3 and on re-execution of the last operation once an error is detected. Simulation results obtained for the DART architecture developed at IRISA, with all of its data-paths protected using the residue code, confirmed the hardware savings of the proposed approach over duplication.
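The detection-and-retry principle described above can be sketched in software as follows. This is a minimal illustration under stated assumptions, not the DART implementation of [45]: the names `checked_add` and `TransientFaultAlu` are hypothetical, the fault model is a single bit flip on the first execution only, and the check exploits the fact that the residue modulo 3 of a sum equals the sum of the residues modulo 3.

```python
class TransientFaultAlu:
    """Adder model that flips one result bit on its first call only (assumed fault model)."""

    def __init__(self, faulty_bit):
        self.faulty_bit = faulty_bit
        self.calls = 0

    def __call__(self, x, y):
        self.calls += 1
        result = x + y
        if self.calls == 1:
            result ^= 1 << self.faulty_bit  # inject a temporary fault
        return result


def checked_add(a, b, alu, max_retries=1):
    """Add a and b on `alu`, check the mod-3 residue, re-execute on mismatch."""
    expected = (a % 3 + b % 3) % 3  # computed by the small residue channel
    for _ in range(max_retries + 1):
        result = alu(a, b)
        if result % 3 == expected:
            return result
    raise RuntimeError("error persists after re-execution")


alu = TransientFaultAlu(faulty_bit=4)
print(checked_add(100, 23, alu))
print(alu.calls)
```

A single bit flip changes the result by plus or minus a power of two, and no power of two is divisible by 3, so every single-bit error in the result is detected. The residue channel is only two bits wide, which is where the savings over duplicating the full data-path come from.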
To cope with the high sensitivity of electronic devices to failures and soft errors, we also proposed a multiprocessor system on a dynamically reconfigurable architecture for the design of fault-tolerant systems. We first proposed and designed a flexible communication model which ensures reliable communication. This work, carried out in the CIFAER project, makes it possible to switch from one communication protocol to a secondary one, by reconfiguring the zone reserved for the communication protocol, in order to mitigate communication errors. Some possibilities for integrating this dynamic platform into a standardized automotive software infrastructure have also been introduced [62].
In order to exploit the computational power and the flexibility of reconfigurable architectures while guaranteeing the correct functionality of the entire system, we proposed a fully dynamic MPSoC topology. In this system, all the processors can be dynamically reconfigured, moved or replaced, hence providing fault-tolerance and self-repair capabilities [61]. An in-depth exploration of a standard design flow has been carried out to facilitate the design of this architecture using commercially available FPGAs.
Power Efficient Architectures
Coding Techniques Improving Reliability and Power Consumption for On-Chip Buses
Participants : Olivier Sentieys, Sébastien Pillement, Stanislaw Piestrak.
Interconnects are now considered one of the bottlenecks in the design of systems-on-chip (SoC), since they introduce delay and power consumption. To deal with this issue, data coding for interconnect power and timing optimization has been introduced. In today's SoCs these techniques are no longer efficient, because of the complexity of the encoding/decoding circuitry (the codec) or because the published experiments were unrealistic. Based on realistic observations on interconnect delay and power estimation [2], [36], the spatial switching technique [37], [38], [17] has been proposed and patented. It reduces delay and power consumption (including the extra power consumed by the codecs) for on-chip buses. The principle of the technique is to detect all cross-transitions on adjacent wires and to decide whether or not the adjacent wires are exchanged. Results show the efficiency of spatial switching for different technologies and bus lengths. The power consumption reduction can reach up to 15% for a 5-mm bus, and more for longer buses and for future CMOS technologies.
Several coding techniques have been suggested to reduce both noise and wire power consumption in on-chip interconnections, such as bus-invert coding, low-weight coding, and reduction of the voltage swing of the signal on the wire. Unfortunately, the latter reduces the noise margin, which might result in an increased error rate. Recently, the Berger-invert code has been suggested to protect communication channels against all asymmetric errors and to decrease power consumption. We have not only shown some inaccuracies in the proposed approach [23], but also suggested a modified encoding scheme and a new codec design [24]. Implementation results have shown that our approach leads to significant hardware savings and results in a reduced error rate and power consumption.
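As background for the coding techniques listed above, the following is a minimal sketch of classical bus-invert coding, which is one of the baseline schemes mentioned and not the Berger-invert codec of [23], [24]. The function names, the initial bus state of zero and the choice to ignore transitions on the invert line itself are simplifying assumptions made for this example.

```python
def bus_invert_encode(words, width):
    """Encode words as (bus_value, invert_flag) pairs.

    A word is sent inverted when more than half of the data wires would toggle.
    """
    mask = (1 << width) - 1
    previous = 0  # assumed initial state of the data wires
    encoded = []
    for word in words:
        toggles = bin((previous ^ word) & mask).count("1")
        if 2 * toggles > width:
            sent, invert = ~word & mask, 1
        else:
            sent, invert = word & mask, 0
        encoded.append((sent, invert))
        previous = sent
    return encoded


def bus_invert_decode(encoded, width):
    """Recover the original words from (bus_value, invert_flag) pairs."""
    mask = (1 << width) - 1
    return [(~sent & mask) if invert else sent for sent, invert in encoded]


data = [0x00, 0xFF, 0xFE, 0x01]
coded = bus_invert_encode(data, 8)
print(coded)
print(bus_invert_decode(coded, 8) == data)
```

With this scheme at most half of the data wires toggle on any transfer, at the cost of one extra wire and the codec logic. That codec overhead is the kind of cost the text above argues must be included in realistic power estimates.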
Ultra Low-Power Architecture for Control-Oriented Applications in Wireless Sensor Nodes
Participants : Steven Derrien, Adeel Pasha, Olivier Sentieys.
This research work aims at developing ultra low-power SoCs for wireless sensor nodes, as an alternative to existing approaches based on low-power micro-controllers such as the Texas Instruments MSP430. The proposed approach tries to reduce power consumption by combining hardware specialization and power gating techniques. In particular, we use the fact that typical WSN applications are generally modeled as a set of small to medium grain tasks that are implemented on a low-power micro-controller using lightweight thread-like OS constructs.
Rather than implementing these tasks in software, we propose to map each of them to its own specialized hardware structure, which we call a hardware task. Such a hardware task consists of a minimalistic (and customized) data-path controlled by a finite state machine (FSM). By customizing each of these hardware implementations to its corresponding task, we expect to significantly reduce the dynamic power dissipated by the whole system. In addition, to circumvent the increase in static power caused by the possibly numerous hardware tasks implemented in the chip, we propose to combine our approach with power gating, so as to supply power to a hardware task only when it needs to be executed. The first results that we obtained are very promising and have led to two publications [60], [59].
The work done in 2009 mainly consisted in providing a system-level design flow for specifying this new type of architecture. In particular, we now have a fully automated design flow that can produce a VHDL model of a micro-task starting from a specification in C, using the Gecos compiler infrastructure. We have also started working on a Domain Specific Language (DSL) for specifying the system-level architecture, and hope to have a fully operational flow by the beginning of 2010.
SoC Modeling and Prototyping on FPGA-based Systems
Participants : François Charot, Kevin Martin, Laurent Perraudeau, Charles Wagner.
Cairn participates in the SoCLib ANR project (see Section 7.8 for more information), whose goal is to build an open platform for the modeling and simulation of multiprocessor systems-on-chip (MP-SoC). As part of our participation in this project, we have developed simulation models of the Altera NIOSII processor and of the Altera interconnect (Avalon bus). These models and their associated wrappers now allow NIOSII-based multiprocessor systems to be modeled. (The NiosII processor core is a configurable processor core proposed by Altera. It is available in three families: economy, standard and fast. A SoCLib model of the fast version was developed in 2008.)
MutekH is a portable operating system developed at the LIP6 laboratory. MutekH is a set of libraries built on top of the Hexo exo-kernel. This exo-kernel defines the Hardware Abstraction Layer, providing both portability and support for heterogeneity. This year, as part of our participation in the SoCLib project, we have ported Hexo to NIOSII processor-based MPSoC architectures modeled with SoCLib.
In order to validate these different components, a multithreaded version of an H264 video decoding application has been ported to a SoCLib platform composed of several NIOSII processors communicating through the Avalon interconnect structure.