Section: New Results
Dynamically and Heterogeneous Reconfigurable Platforms
New Reconfigurable Architectures
Flexible Arithmetic Operator Design
Our aim is to propose new arithmetic operators that are flexible in terms of accuracy. To optimize fixed-point implementations, architectures must offer operators that support different data word-lengths. Operator efficiency can be increased using a subword parallelism (SWP) scheme: a single SWP instruction performs the same operation on multiple sets of subwords in parallel using SWP operators. In existing SWP-capable processors, the available subword sizes are usually 8, 16 or 32 bits. These sizes are chosen because SWP operators are simpler to design when every subword size is a multiple of the smallest one. In multimedia applications, however, the input data (pixels) are 8, 10, 12 and sometimes 16 bits wide. These data sizes do not match the subword sizes of existing processors, which leads to under-utilization of processor resources. We designed SWP versions of some basic operators (add, absolute value, multiply and multiply-accumulate (MAC)) that support multimedia-oriented subword sizes (8, 10, 12 and 16 bits). These basic operators will subsequently be used to implement more complex multimedia operators according to user requirements.
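The behaviour expected from an SWP adder can be illustrated with a small software model. The sketch below is only a functional illustration, not the hardware design described above: the 40-bit word width, the wrap-around (non-saturating) arithmetic and the helper names pack, unpack and swp_add are all assumptions made for the example.

```python
def pack(values, subword_bits):
    """Pack a list of unsigned subwords into one integer, lane 0 in the low bits."""
    mask = (1 << subword_bits) - 1
    word = 0
    for lane, value in enumerate(values):
        word |= (value & mask) << (lane * subword_bits)
    return word


def unpack(word, subword_bits, lanes):
    """Split a packed word back into a list of unsigned subwords."""
    mask = (1 << subword_bits) - 1
    return [(word >> (lane * subword_bits)) & mask for lane in range(lanes)]


def swp_add(a, b, subword_bits, word_bits=40):
    """Lane-wise addition of two packed words (hypothetical 40-bit word).

    The carry out of each lane is dropped, so a lane never disturbs
    its neighbour. Bits above the last complete lane are left unused.
    """
    lanes = word_bits // subword_bits
    mask = (1 << subword_bits) - 1
    result = 0
    for lane in range(lanes):
        shift = lane * subword_bits
        lane_sum = ((a >> shift) & mask) + ((b >> shift) & mask)
        result |= (lane_sum & mask) << shift  # wrap around inside the lane
    return result


# Four 10-bit pixels fit in the assumed 40-bit word.
x = pack([1000, 5, 1023, 512], 10)
y = pack([100, 7, 1, 512], 10)
print(unpack(swp_add(x, y, 10), 10, 4))
```

With 10-bit subwords, four pixels are processed per operation in this model, whereas a processor limited to 8, 16 or 32-bit subwords would have to store each 10-bit pixel in a 16-bit lane.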
Adaptive and Multi-mode Devices
In a mobile society, more and more devices need to adapt continuously to changing environments; that is, they must be flexible enough to implement different algorithms at different times. Such mode switches require not only software changes but also adaptation of the application-specific hardware components. To address this requirement, we investigate two approaches. The first is the design of a reconfigurable processor able to adapt its computing structure to a dedicated domain: video and image processing applications. The processor is built around a pipeline of coarse-grain reconfigurable operators exhibiting a good trade-off between performance and power consumption. In contrast to previous reconfigurable processors, flexibility is obtained not through a flexible interconnect network but through configurable domain-dedicated units. This work is done in the context of the ROMA ANR project. We particularly investigate reconfigurable operator design and the compilation framework. The second approach is multi-mode architecture design, which does not incur any reconfiguration time penalty. Such architectures implement all the operators required by a pre-defined set of computations. To optimize area, these operators are shared between the algorithms, and control logic steers the data to the operators depending on the algorithm being executed at a given time. The area overhead depends on how well the algorithms match: algorithms or performance constraints that differ too much do not lead to architectures with efficiently shared operators. Targeted domains are typically channel encoding, cryptography and multimedia. This work is done in collaboration with IMS Lab. (B. Le Gal).
Memory Hierarchy in Specialized SoC
For several years, memory area in SoC architectures has increased strongly. Today, circuit designers define SoCs with an ever-increasing number of memory banks to store large amounts of data. These banks are organized into a multi-level embedded memory hierarchy to ensure high performance. However, because memories have low activity and contain a large number of transistors, memory power consumption, and especially static power consumption, represents a major part of the overall SoC power.
In this context, we have defined a reconfigurable memory hierarchy model suited to specialized SoCs. The organization is based on a multi-banked architecture that ensures high-performance data accesses. Each processing element can be directly connected to one or several memory banks. These links can be local, through a multibus network, or global, through a complete crossbar module. They can be reconfigured, and the hierarchy can be tuned according to application needs. Each memory bank has its own address generator. These generators can produce all regular address sequences; each can be compared to a very small and simple processor core, which also makes it possible to produce irregular sequences by executing address generation programs.
To optimize power consumption, the Dynamic Voltage Scaling (DVS) technique has been included in the control of the memory architecture. The memories can be placed into low-power modes according to data access constraints to save energy. The low-power modes are managed by a global controller, which ensures that the global application constraints are met.
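As an illustration of what a regular address sequence can look like, the sketch below models a nested-loop (affine) address pattern in software. It is an assumption about one common kind of regular sequence, not a description of the actual generator; the function name regular_addresses and its parameters are hypothetical.

```python
from itertools import product


def regular_addresses(base, counts, strides):
    """Yield base + sum(index_k * stride_k) for nested loop indices.

    counts and strides are given outermost loop first. This models an
    affine access pattern only; irregular sequences would require an
    address generation program instead.
    """
    for indices in product(*(range(count) for count in counts)):
        yield base + sum(i * s for i, s in zip(indices, strides))


# Row-by-row scan of a 2x3 block stored with a row pitch of 8 words.
print(list(regular_addresses(0, [2, 3], [8, 1])))

# Column-by-column scan of a 2x3 array stored contiguously by rows.
print(list(regular_addresses(0, [3, 2], [1, 3])))
```

Changing only the counts and strides switches between access patterns, which is the kind of parameterization a per-bank generator could expose.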
Reconfigurable Architecture Description Language
Our research aims at defining a platform model for dynamically reconfigurable architectures, together with associated methods. The main objective is a unified and formal specification of the platform that can be efficiently exploited in retargetable compilation flows and in automated back-end generators for simulation and synthesis. The model is defined to cover different classes of architectures, from FPGAs to networks of processors, including coarse-grained reconfigurable datapaths.
This method makes it easy to develop a new dynamically reconfigurable architecture based on computing resources and generic interconnection schemes, to explore its performance and to validate it by simulation at different levels of abstraction. The architecture is defined using a high-level architecture description based on the MAML language developed at the University of Erlangen-Nuremberg. The first part of the work made it possible to interconnect different kinds of computing resources (configurable logic blocks, reconfigurable functional units or processors) and to produce the reconfiguration resources required for a homogeneous reconfiguration process. Different architecture paradigms (FPGAs, reconfigurable datapaths such as DART, or regular parallel processor architectures such as WPPA) can thus be modeled quickly. The second part of the work consisted in generating the configuration controller after analyzing the MAML specifications of the architecture and of the reconfiguration resources produced. This work led to the development of the Mozaic framework. The tool is able to generate a reconfigurable platform and to explore some important parameters (reconfiguration cost and time, flexibility and size of the interconnect, number of resources). The proposed reconfiguration paradigm for computing and interconnect resources has been optimized for a very fast reconfiguration process, which is essential to meet the timing constraints of today's applications. The implementation of a wireless receiver has been tested on various architectures generated by our tool and has shown the efficiency of our methodology applied to reconfigurable systems.
Optimization of Reconfigurable Architecture Interconnection Organization
Participant: Christophe Wolinski.
We worked on the static optimization of area and reconfiguration time for the communication networks of regular 2D reconfigurable processor array architectures. To solve these problems (a) jointly and (b) not for a single algorithm but for a whole set of algorithms, a unique constraint programming approach has been applied. We first introduced an abstract model for minimizing the number of multiplexers. This model is limited and covers only mono-casting data transfers. In a next step, we proposed a new optimized formulation that supports multi-casting data transfers. Moreover, we defined new cost functions that make it possible to minimize other communication network parameters, such as area and parallel and sequential reconfiguration time. The correctness of our approach was illustrated by applying our methodology to a concrete architecture, namely the weakly programmable processor array (WPPA) developed at the University of Erlangen-Nuremberg. This architecture belongs to a class of computer architectures that consist of an array of processing elements with reconfigurable interconnections and limited programming possibilities.
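To make the multiplexer-count objective concrete, the sketch below evaluates such a cost for a fixed mapping of several algorithms onto the array. It is not the constraint programming model itself, which searches over mappings; it only shows one plausible way to count multiplexers in the mono-casting case. The function name multiplexer_cost and the port naming are hypothetical.

```python
from collections import defaultdict


def multiplexer_cost(configurations):
    """Count multiplexers needed to support every configuration.

    configurations is a list of dicts, one per algorithm, mapping a
    destination port to the single source that drives it (mono-casting).
    A destination needs a multiplexer only if, over all algorithms, it is
    driven by more than one distinct source.
    Returns (number_of_multiplexers, total_multiplexer_inputs).
    """
    sources = defaultdict(set)
    for config in configurations:
        for destination, source in config.items():
            sources[destination].add(source)
    fan_in = [len(s) for s in sources.values() if len(s) > 1]
    return len(fan_in), sum(fan_in)


# Two algorithms mapped onto the same three processing elements.
algo_a = {"pe1.in0": "pe0.out", "pe2.in0": "pe1.out"}
algo_b = {"pe1.in0": "pe0.out", "pe2.in0": "pe0.out"}
print(multiplexer_cost([algo_a, algo_b]))
```

Here only pe2.in0 sees two different sources, so a single two-input multiplexer suffices. An optimizer would try to choose mappings that keep this count, and the derived area, as low as possible over the whole set of algorithms.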
Dynamically Reconfigurable Systems Management
To ensure efficient execution of applications on SoC architectures, designers include heterogeneous execution resources in the same chip (e.g. processors, reconfigurable architectures, dedicated blocks). The management of the overall platform (including hardware support and tasks) is thus handled by an operating system (OS). With the introduction of flexible/reconfigurable resources in a SoC, some OS services have to be adapted. Two services in particular are strongly impacted by the presence of reconfiguration in the system. The first is task scheduling and allocation, which has to take the availability of reconfigurable resources into account and allocate tasks to these resources. The classical temporal scheduling problem is thus extended with a spatial dimension in order to manage the physical area available in the reconfigurable resource. The second impacted service is task communication management. On-line task placement makes the required interconnection support difficult to predict, so a flexible and dynamic interconnect medium must be defined.
In order to evaluate the impact of reconfigurable architectures on OS services, we first defined a UML model of the complete environment in the context of the OverSoC project. In this project, we proposed the model of the reconfigurable part of the system. This work led to a new collaboration with the Triskell team from IRISA, which aims at defining a meta-model of reconfigurable hardware in order to take advantage of the higher level of abstraction.
Concerning the scheduling service, we defined a first Artificial Neural Network (ANN) to ensure the spatial and temporal placement of tasks within a heterogeneous multi-processor SoC. This year, we extended this first ANN proposal to take reconfigurability into account. We thus defined a new structure, called Reconfigurable ANN (RANN), which substantially reduces the number of neurons. This model can handle any number of tasks that can be instantiated on the resources. A mathematical formulation of the RANN was proposed, and a simulation tool was developed. A correct schedule is obtained with a small number of iterations and a reduced set of neurons. To complete this study, we prototyped a hardware implementation of the neural network. Our results show that this implementation is very efficient and that the RANN is a very good candidate for a hardware implementation of this service.
Concerning the interconnection, we are currently working on a specific interconnection architecture. We have proposed structures that are well suited to state-of-the-art dynamically reconfigurable chips. We defined a first hierarchical interconnect infrastructure and built a VHDL implementation of it. Furthermore, to evaluate our architectural proposal, we have defined a demonstrator platform that allows us to illustrate the reconfiguration of this particular functionality.
Power Efficient Architectures
Coding Technique Improving Delay and Power Consumption for On-Chip Buses
Interconnects are now considered the bottleneck in system-on-chip (SoC) design, since they introduce delay and power consumption. To deal with this issue, data coding techniques for interconnect power and timing optimization have been introduced. In today's SoCs, these techniques are no longer efficient, because of their codec complexity or because they were evaluated under unrealistic experimental conditions. Based on realistic observations on interconnect delay and power estimation, the spatial switching technique is proposed and has been patented. It reduces delay and power consumption (including the extra power consumed by the codecs) for on-chip buses. The concept of the technique is to detect all cross-transitions on adjacent wires and to decide whether or not the adjacent wires are exchanged. Results show the efficiency of spatial switching for different technologies and bus lengths. The power consumption reduction can reach 15% for a 5-mm bus, and more for longer buses and for future CMOS technologies.
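The idea of detecting cross-transitions and exchanging adjacent wires can be illustrated with a small behavioural model. The sketch below is not the patented technique: the greedy pairing rule, the per-pair swap flags (which a real codec would have to signal to the receiver in some way) and all function names are assumptions made for the example. A cross-transition is taken here to mean two adjacent wires toggling in opposite directions.

```python
def is_cross(prev, data, i):
    """True if wires i and i+1 would toggle in opposite directions."""
    return (prev[i] != data[i] and prev[i + 1] != data[i + 1]
            and data[i] != data[i + 1])


def cross_transitions(prev, data):
    """Indices i where the adjacent pair (i, i+1) has a cross-transition."""
    return [i for i in range(len(data) - 1) if is_cross(prev, data, i)]


def encode_with_swaps(prev, data):
    """Greedily exchange non-overlapping adjacent pairs that would cross.

    prev holds the current wire values, data the next word to send.
    Returns (wire_values, swap_flags); swap_flags[i] refers to pair (i, i+1).
    """
    wires = list(data)
    swaps = [False] * (len(data) - 1)
    i = 0
    while i < len(data) - 1:
        if is_cross(prev, data, i):
            wires[i], wires[i + 1] = data[i + 1], data[i]
            swaps[i] = True
            i += 2  # a wire takes part in at most one exchange
        else:
            i += 1
    return wires, swaps


def decode_with_swaps(wires, swaps):
    """Undo the exchanges to recover the original data word."""
    data = list(wires)
    for i, swapped in enumerate(swaps):
        if swapped:
            data[i], data[i + 1] = data[i + 1], data[i]
    return data


prev = [0, 1, 1, 0]
data = [1, 0, 1, 1]
wires, swaps = encode_with_swaps(prev, data)
print(cross_transitions(prev, data), wires, swaps)
```

In this example, exchanging the first two wires removes both of their transitions, so the bus toggles one wire instead of three. Opposite transitions on neighbouring wires are generally the most costly case for coupling delay and energy, which is why they are the ones targeted.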
Ultra Low-Power Architecture for Control-Oriented Applications in Wireless Sensor Nodes
This research work aims at developing ultra low-power SoCs for wireless sensor nodes, as an alternative to existing approaches based on low-power microcontrollers such as the Texas Instruments MSP430. The proposed approach tries to reduce power consumption by combining hardware specialization and power gating techniques. In particular, we exploit the fact that typical WSN applications are generally modelled as a set of small to medium grain tasks that are implemented on a low-power microcontroller using lightweight thread-like OS constructs.
Rather than implementing these tasks in software, we propose to map each of them to its own specialized hardware structure, which we call a hardware task. Such a hardware task consists of a minimalistic (and customized) datapath controlled by a finite state machine (FSM). By customizing each hardware implementation to its corresponding task, we expect to significantly reduce the dynamic power dissipated by the whole system. In addition, to counter the increase in static power caused by the possibly numerous hardware tasks implemented in the chip, we propose to combine our approach with power gating, so as to supply power to a hardware task only when it needs to be executed. Encouraging preliminary results have been obtained, and the generation of these hardware task structures directly from C specifications (using the Gecos framework) is now under way.
SoC Modeling and Prototyping on FPGA-based Systems
CAIRN participates in the SoCLib ANR project (see Section 7.5 for more information), whose goal is to build an open platform for the modeling and simulation of multiprocessor systems-on-chip (MP-SoC). This year, as part of our participation in this project, we proposed and developed a simulation model of the Altera interconnect (Avalon bus). This model and its associated wrappers now allow NIOS-based multiprocessor systems to be modelled. (The NiosII processor core is a configurable processor core from Altera, available in three versions: economy, standard and fast. A SoCLib model of the fast version was developed last year.) In order to validate these components, a multithreaded version of a motion-JPEG application has been ported to a SoCLib platform composed of several NIOS processors communicating through the Avalon interconnect structure.
We have also developed a model of the TMS320C62 DSP from Texas Instruments. This model is an instruction-set simulator of the TMS320C62 processor. It has been validated within SoCLib simulation platforms at the CABA simulation level.