Team R2D2

Reconfigurable and Retargetable Digital Devices

Rennes
# Table of contents

1. **Team**  
2. **Overall Objectives**  
   2.1. Introduction  
   2.2. Directions  
      2.2.1. New architectures and technologies  
      2.2.2. Modeling, synthesis and compilation targeting reconfigurable platforms  
      2.2.3. Study of applications  
3. **Scientific Foundations**  
   3.1. Panorama  
   3.2. New architectures and technologies  
      3.2.1. New reconfigurable architectures  
      3.2.2. Network on Chip design  
      3.2.3. Wireless sensor networks  
      3.2.4. Multiple-Valued Logic (MVL) architectures and circuits  
   3.3. Modeling, synthesis and compilation for reconfigurable platforms  
      3.3.1. Dedicated hardware accelerator synthesis  
      3.3.2. Processor modeling and flexible compilation  
      3.3.3. Floating-point to fixed-point conversion  
4. **Application Domains**  
   4.1. Panorama  
   4.2. Mobile telecommunications  
5. **Software**  
   5.1. Panorama  
   5.2. PolyLib  
   5.3. MMAlpha  
   5.4. BSS, BOOST  
6. **New Results**  
   6.1. New architectures and technologies  
      6.1.1. New organization of reconfigurable structures  
         6.1.1.1. DART reconfigurable architecture  
         6.1.1.2. Memory hierarchy in specialized SoC  
         6.1.1.3. RDISK: Reconfigurable DISK  
      6.1.2. NoC design using advanced mobile telecommunication techniques  
      6.1.3. Wireless sensor networks  
   6.2. Modeling, synthesis and compilation for reconfigurable platforms  
      6.2.1. Synthesis and compilation techniques  
         6.2.1.1. Automatic synthesis of optimized reconfigurable systems  
         6.2.1.2. Specialized microcontroller synthesis on FPGA  
         6.2.1.3. Derivation of efficient hardware/software interfaces of regular arrays  
         6.2.1.4. ARMOR architecture description language  
      6.2.2. Floating-point to fixed-point transformation  
         6.2.2.1. Floating-point to fixed-point conversion methodology  
         6.2.2.2. Fixed-point accuracy evaluation for non-linear systems  
      6.2.3. Specialized SoC architecture modeling  
         6.2.3.1. System modeling for dynamically reconfigurable architectures  
         6.2.3.2. Modeling data-flow architectures using Alpha  
         6.2.3.3. Co-design with data-flow and polyhedral models  
         6.2.3.4. SoC modeling and prototyping on FPGA-based systems  
   6.3. Study of applications  
      6.3.1. Radio-communication systems  
         6.3.1.1. Mobile communication systems prototyping (3G, MIMO)  
      6.3.2. Content-based image retrieval hardware acceleration  
      6.3.3. Intrusion detection system in hardware  
      6.3.4. Noise reduction in speech processing  
      6.3.5. Intelligent transport system (ITS)  
7. **Contracts and Grants with Industry**  
   7.1. OSGAR (2003-2005)  
8. **Other Grants and Activities**  
   8.1. National initiatives  
      8.1.1. ReMiX: Reconfigurable Memory for Indexing Huge Amount of Data  
   8.2. International bilateral relations  
      8.2.1. Europe  
      8.2.2. Africa  
      8.2.3. North America  
   8.3. Visiting scientists  
9. **Dissemination**  
   9.1. Activities in the scientific community  
   9.2. Teaching and responsibilities  
10. **Bibliography**  

1. Team

The team R2D2 is located on two sites: Rennes and Lannion.

Head of team
François Charot [CR INRIA]
Olivier Sentieys [Professor, University of Rennes 1, Enssat]

Administrative assistant
Danielle Graviou [since 09/01/05, Enssat]
Orlane Kuligowski [until 09/01/05, Enssat]
Lydie Mabil [TR INRIA]

Staff member (INRIA)
Daniel Chillet [Associate professor, Enssat, on Inria secondment]

Faculty members
Imène Benkermi [Lecturer until 08/31/05, Enssat]
Olivier Berder [Associate professor, Enssat]
Didier Demigny [Professor, IUT Lannion]
Steven Derrien [Associate professor, Ifsic]
Hélène Dubois [Associate professor, Enssat]
Michel Guitton [Associate professor, Enssat]
Nicolas Hervé [Lecturer until 08/31/05, Enssat]
Ekué Kini-Boh [Lecturer, Enssat]
Ludovic L’Hours [Lecturer since 10/01/05, Irisa-Rennes]
Daniel Menard [Associate professor, Enssat]
Laurent Perraudau [Associate professor, Ifsic]
Sébastien Pillement [Associate professor, IUT Lannion]
Patrice Quinton [Professor, Ifsic, Director of the Brittany branch of the ENS de Cachan]
Pascal Scalart [Professor, Enssat]
Christophe Wolinski [Professor, Ifsic]

Staff member (CNRS)
Charles Wagner [IR CNRS ASCII]

Technical staff, INRIA
Gilles Georges [Symbiose]

Ph.D. students
Georges Adouko [INRIA grant since 05/11/05, Irisa-Rennes]
Faten Benabdallah [Tunisian grant, Irisa-Lannion]
Mickaël Cartron [Brittany Region grant, Irisa-Lannion]
Anne-Marie Chana [SARIMA grant since 04/15/05, co-supervision with Yaoundé I University - Cameroon, Irisa-Rennes]
Stéphane Chevobbe [CEA grant until 09/30/05]
Nicolas Hervé [University grant since 09/01/05, Irisa-Lannion]
Thomas Guihal [Brittany Region - CEA grant since 01/01/05, Irisa-Lannion]
Erwan Grasse [CEA - university grant since 10/01/05, Irisa-Lannion]
Julien Lallet [University grant since 10/01/05, Irisa-Lannion]
Ludovic L’Hours [MENRT grant until 09/30/04, Irisa-Rennes]
Auguste Noumsi [SARIMA grant, co-supervision with Yaoundé I University - Cameroon, Irisa-Rennes]
Madeleine Nyamsi [INRIA grant until 05/10/05, Irisa-Rennes]
Jean-Marc Philippe [Brittany Region - University grant until 09/30/05, Irisa-Lannion]
Romuald Rocher [MENRT grant, Irisa-Lannion]
2. Overall Objectives

2.1. Introduction

The problems tackled by the team R2D2 relate to the design of specialized systems on reconfigurable platforms. A hardware platform is a structure of Integrated Circuits (IC) containing a set of programmable components (general-purpose or specific processor cores), memories and, generally, specialized components. Such a platform can be seen as an integrated architecture scheme, common to numerous algorithms belonging to a given application domain. This notion is the answer of embedded-system designers to the increasing difficulty of implementing their applications [63]. One can therefore expect that, in the future, most of the ICs needed to design a complex system will be derived from an existing platform. This design approach is an alternative to the IP-based (Intellectual Property) design approach, in which the system is built by assembling separately designed components. A reconfigurable platform includes a set of reconfigurable components (blocks of reconfigurable logic, reconfigurable data-paths, flexible communication networks). In terms of area and power consumption, reconfigurable resources enable a far more efficient use of silicon than programmable processors or specialized components.

Future platforms will be highly parallel, heterogeneous, programmable and reconfigurable. Parallelism is the only way to reach the performance level required by future applications. Heterogeneity results from the observation that an efficient design is often composed of several subsystems with well-differentiated computation requirements. Programmability avoids freezing the functionalities. Finally, reconfigurability combines the speed of specialized solutions with the flexibility of traditional programmable components.

Our scientific objectives aim to exploit various methods (very high-level synthesis, behavioral synthesis, flexible compilation, floating-point to fixed-point conversion, etc.), each contributing, with its own specificities, to the design of part of a specialized system. The models and underlying techniques allow the use of estimators, which guide implementation choices with precise knowledge of the system's performance, complexity and power consumption.

2.2. Directions

Research undertaken within the team R2D2 aims at facilitating the design of reconfigurable hardware systems by proposing architecture models and associated design methodologies that favor a good match between application algorithms and the architectures supporting their implementation. The team's work follows three main directions.

2.2.1. New architectures and technologies

Our studies, motivated by the constraints of high-performance, flexibility, and low-power consumption, focus on the following topics:

- the study of new organizations of reconfigurable structures offering the speed of specialized solutions and the flexibility of traditional programmable components, for application areas like mobile telecommunications;
- the application of advanced mobile telecommunication techniques to the design of Network-on-Chip (NoC);
- the study of architectures for low-power sensor networks;
- the study of Multiple-Valued Logic (MVL) circuits and architectures.

2.2.2. Modeling, synthesis and compilation targeting reconfigurable platforms

Implementing an application on a reconfigurable platform requires a large set of techniques which contribute, through successive refinements, to the choice of how the various parts of the application are implemented on the platform components. Our studies focus on the following aspects: SoC modeling, synthesis of dedicated hardware accelerators, processor modeling and flexible compilation, and floating-point to fixed-point conversion.

2.2.3. Study of applications

Our privileged field of applications is that of third- and fourth-generation mobile telecommunications. Other application domains are also considered: cryptography and traffic filtering in high-speed networks, image indexing and speech processing. The work concerns the prototyping of applications on reconfigurable and programmable platforms.

3. Scientific Foundations

3.1. Panorama

R2D2 research activities build on work from two scientific communities whose expertise is complementary for the design of hardware systems: the first deals with methods and tools for specialized architecture design, and the second with signal processing and dedicated circuit architectures. We start with an outline of the evolution of specialized architectures, then present the foundations of our research.

3.2. New architectures and technologies

**Keywords:** Network-on-chip, SoC, grain of calculation, low-power consumption, multiple-valued logic, reconfigurable architecture, sensor network.

By the end of the decade, IC technology should allow chips with a billion transistors to be fabricated, compared with a few tens of millions today, as illustrated by the document published by the SIA (Semiconductor Industry Association). The hardware systems of future equipment will be miniaturized (one now usually speaks of System-on-Chip, SoC) and will combine highly heterogeneous architectures including dedicated hardware accelerators.

Even though electronic CAD tools and associated design methodologies have progressed considerably in recent years, designing new ICs is not easier today. On the contrary, the distance between the capacities offered by the technology and the potential of current design tools (the famous technology gap) has never been as large. A fairly fundamental change in the way circuits are designed is under way.

This evolution of the technology has an impact on IC architectures. Over the years, a migration can be observed: from ASICs to SoCs and, in the near future, to reconfigurable programmable platforms.

- ASICs were prevalent between 1980 and 1995, and are now used only as particular blocks in more complex heterogeneous systems.
- The first SoCs were designed around 1995. Thanks to the increasing density of chips, a complex SoC usually integrates one or more processor cores (general-purpose processor or digital signal processor), memory blocks (RAM, ROM, flash memory, EPROM, etc.), as well as the various interfaces needed for the system to operate correctly. SoCs combine hardware and software components. Their design relies on synthesis, place and route tools, and libraries of reusable components.

In the near future, SoCs will evolve into platforms, that is, integrated architecture structures common to a set of algorithms or applications belonging to the same field. Design tools and methodologies must therefore make it possible to design a specialized architecture starting from this base architecture [71]. Platforms will satisfy the needs of a broader spectrum of applications, at the price of a reduced variety of designed circuits.

Combining flexibility with high performance and energy efficiency is a critical issue for embedded applications, particularly mobile ones. These three constraints are taken into account in our architecture studies.

3.2.1. New reconfigurable architectures

Recent years have seen the emergence of new reconfigurable architectures [60], which offer an alternative to the traditional performance/flexibility trade-off that dictates the choice between purely hardware (ASIC) and purely software (programmable processor) solutions. For application domains like mobile telecommunications, three main constraints have to be combined: high performance, low power consumption and flexibility. The grain of computation and the reconfiguration schemes are open research topics.

For example, the Pleiades [68] project is an architectural platform supporting several grains of computation (logic operations are handled as efficiently as arithmetic operations), designed to consume minimal energy whatever the required performance level. However, this platform cannot meet the full set of constraints discussed above, because its reconfiguration is static, which restricts it to certain application fields; speech coding served as the application for the study.

Many other reconfigurable architectures are based on FPGA-type circuits, and most of them, such as GARP [61], NAPA [69] and Chimaera [70], integrate a traditional programmable processor in charge of sequencing the processing on the reconfigurable block. Other architectures such as Piperench [58] or RaPiD [51] can be reconfigured at a higher level, respectively the operator and the functional level. The grain of computation is indeed an interesting and significant research subject. Most FPGA circuits are fine-grained, since they can be reconfigured at the bit level, in contrast with the way programmable processors handle words (32-bit words for many of them). When the application does not require bit-level reconfiguration, coarse-grained structures must be built from the elementary blocks of the reconfigurable structure, which results in a circuit overhead. To limit this overhead, new coarse-grained reconfigurable architectures have been proposed, whose elementary blocks correspond to arithmetic logic units, multipliers, memories, etc. In addition to Piperench and RaPiD already mentioned, one can cite the Matrix [54] architecture at MIT and MorphoSys [66] at the University of California at Irvine; among commercial realizations, the array of reconfigurable arithmetic logic units of Elixent 2 and the XPP processors of PACT3.
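To make the grain-of-computation argument concrete, the following Python sketch builds a 32-bit addition out of 1-bit full-adder cells, as a bit-level fabric has to, whereas a coarse-grained ALU performs it as a single operation. It is an illustration only, not a model of any of the architectures cited, and the function names are hypothetical.

```python
# Illustrative sketch: a word-level addition assembled from 1-bit cells, the
# way a fine-grained (bit-level) fabric must build a coarse-grained operator.

def full_adder(a: int, b: int, cin: int) -> tuple[int, int]:
    """One bit-level cell: returns (sum, carry_out)."""
    s = a ^ b ^ cin
    cout = (a & b) | (a & cin) | (b & cin)
    return s, cout


def ripple_carry_add(x: int, y: int, width: int) -> tuple[int, int]:
    """Add two unsigned `width`-bit words with a chain of 1-bit cells.

    Returns (result modulo 2**width, number of bit-level cells used).
    """
    carry = 0
    result = 0
    cells = 0
    for i in range(width):
        s, carry = full_adder((x >> i) & 1, (y >> i) & 1, carry)
        result |= s << i
        cells += 1
    return result, cells


if __name__ == "__main__":
    value, cells = ripple_carry_add(1_000_000, 2_345_678, 32)
    print(value, cells)  # prints "3345678 32"; a coarse-grained ALU needs one operation
```

The cell count grows linearly with the word width, and on an FPGA each cell also consumes routing resources, which is one source of the overhead discussed above.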

3.2.2. Network on Chip design

The rapid growth of device densities on silicon has made it possible to build complete systems (SoC) from validated IP blocks. The increasing number of blocks needed to integrate all the functions required by a complex application exposes the limitations of the current solution, which consists of a shared interconnection resource (a bus). Among these limitations are the increasing noise sensitivity and the poor scalability of the interconnection scheme. To control precisely the electrical and scalability parameters [52] of the interconnect, on-chip communications have to be structured. A new paradigm is emerging to address the interconnect issue [50]. The Network on Chip (NoC) concept uses well-defined network layers to build the interconnection scheme. It separates the communication process into three layers, each providing services to the others (for example error detection or correction, routing, or packetizing). A NoC is dedicated to the reliable and efficient routing of information grouped into packets (carrying redundancy information, routing information, etc.).
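As an illustration of the routing service a NoC provides, here is a minimal Python sketch of dimension-ordered (XY) routing on a 2-D mesh, a classic deterministic NoC routing scheme. The mesh topology, the `Packet` fields and the port names are assumptions made for the example; they do not describe the team's NoC.

```python
# Minimal sketch of dimension-ordered (XY) routing on a 2-D mesh NoC.
# Coordinates, packet fields and port names are hypothetical.
from dataclasses import dataclass


@dataclass
class Packet:
    src: tuple[int, int]   # (x, y) of the source router
    dst: tuple[int, int]   # (x, y) of the destination router
    payload: list[int]     # data words carried by the packet


def next_hop(current: tuple[int, int], dst: tuple[int, int]) -> str:
    """Output port chosen by a router: correct X first, then Y."""
    cx, cy = current
    dx, dy = dst
    if dx > cx:
        return "EAST"
    if dx < cx:
        return "WEST"
    if dy > cy:
        return "NORTH"
    if dy < cy:
        return "SOUTH"
    return "LOCAL"  # arrived: deliver to the attached IP block


def route(packet: Packet) -> list[tuple[int, int]]:
    """Return the routers traversed from source to destination."""
    moves = {"EAST": (1, 0), "WEST": (-1, 0), "NORTH": (0, 1), "SOUTH": (0, -1)}
    path = [packet.src]
    pos = packet.src
    while (port := next_hop(pos, packet.dst)) != "LOCAL":
        step = moves[port]
        pos = (pos[0] + step[0], pos[1] + step[1])
        path.append(pos)
    return path


if __name__ == "__main__":
    print(route(Packet(src=(0, 0), dst=(2, 1), payload=[42])))
```

XY routing is deadlock-free on a mesh because packets never turn from the Y dimension back to the X dimension, which makes it a common baseline.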

2 http://www.elixent.com/
3 http://www.pactcorp.com/
As the voltage swing on wires is expected to decrease in the next few years, the reliability of the physical layer will decrease too. The challenge is to provide a reliable, efficient and low-power link that meets the requirements of future SoCs.
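The redundancy information mentioned above can take many forms. As one example, the sketch below implements the classic Hamming(7,4) code, which corrects any single bit error on a link at the cost of three parity bits per four data bits. It is a textbook code chosen for illustration, not the coding scheme studied by the team.

```python
# Illustrative sketch: Hamming(7,4) single-error-correcting code, one classic
# way for a NoC link layer to add redundancy to transmitted data.

def hamming74_encode(data: list[int]) -> list[int]:
    """Encode 4 data bits into a 7-bit codeword (positions 1..7)."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4   # parity over positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4   # parity over positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4   # parity over positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]


def hamming74_decode(code: list[int]) -> tuple[list[int], int]:
    """Correct up to one flipped bit; return (data bits, syndrome).

    The syndrome is the XOR of the 1-based positions holding a 1: it is 0
    for a valid codeword and otherwise names the erroneous position.
    """
    bits = list(code)
    syndrome = 0
    for pos, bit in enumerate(bits, start=1):
        if bit:
            syndrome ^= pos
    if syndrome:
        bits[syndrome - 1] ^= 1   # flip the faulty bit back
    return [bits[2], bits[4], bits[5], bits[6]], syndrome


if __name__ == "__main__":
    word = hamming74_encode([1, 0, 1, 1])
    word[5] ^= 1                       # simulate a transmission error at position 6
    print(hamming74_decode(word))      # prints ([1, 0, 1, 1], 6)
```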

3.2.3. Wireless sensor networks

Wireless sensor networks are groups of sensors interconnected through wireless links. Their aim is to collect information from their environment and relay it through the network. For several years, research in telecommunications, wireless networks and signal processing has focused on this topic, which raises new challenges in wireless communication [75]. First, the autonomy (lifetime) of a sensor network must be very long, since the sensors can be embedded in concrete, in the soil or even in the body of living beings, where replacing batteries is difficult or impossible. Energy-scavenging techniques can be used for that purpose. Second, these networks have to be ad hoc, so that they can self-organize and cope with local sensor failures, for example when some sensors run out of power. Another important particularity is that the data rate needed by the applications should be quite low, since data does not have to be sent continuously, but only when changes occur. Many applications have been proposed in various domains, e.g. agriculture, buildings, bridges, transport, military applications such as enemy monitoring, chemical and bacteriological monitoring, and emergency response after earthquakes. Many wireless systems already exist and are commercially successful. Their specifications have generally been developed to maximize spectral efficiency. In sensor networks, however, energy is more critical than the available spectrum, so power efficiency rather than spectral efficiency should be maximized. A communication system can be described functionally by dividing the processing into layers; the OSI (Open Systems Interconnection) model describes seven of them. The problem is that a communication system cannot be designed efficiently one layer at a time, because the layers are coupled to each other.
Optimizing each layer separately is not enough: designing a power-efficient system requires taking this coupling into account through cross-layer optimizations [57]. For that reason, it is better to consider only a few layers. We worked with a protocol stack split into only two layers. The higher-level part covers the roles of the OSI application, presentation, session, transport and network layers. The lower-level part covers the roles of the OSI data-link and physical layers. The lower-level part considers a transmission between two neighboring nodes and has to optimize the communication from this point of view. The higher-level part considers a transmission between generally distant applications, assuming that the lower-level communications used are energy-efficient. This split has already been used in [73], and can be justified by the fact that networking issues are coupled only within the higher-level part, while channel-management issues are coupled only within the lower-level part.
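The trade-off between spectral efficiency and power efficiency can be quantified with the Shannon capacity of an additive white Gaussian noise (AWGN) channel: transmitting at a spectral efficiency of η bit/s/Hz requires at least E_b/N_0 = (2^η - 1)/η. The short Python sketch below evaluates this bound; it is a theoretical limit for an idealized channel, not a model of any particular sensor radio, and the function name is hypothetical.

```python
import math


def min_ebn0_db(eta: float) -> float:
    """Minimum Eb/N0 (in dB) for reliable transmission at spectral efficiency
    eta (bit/s/Hz) over an AWGN channel, from the Shannon capacity bound."""
    if eta <= 0:
        raise ValueError("spectral efficiency must be positive")
    ebn0 = (2.0 ** eta - 1.0) / eta
    return 10.0 * math.log10(ebn0)


if __name__ == "__main__":
    for eta in (0.01, 0.5, 1.0, 2.0, 4.0, 8.0):
        print(f"eta = {eta:5.2f} bit/s/Hz -> Eb/N0 >= {min_ebn0_db(eta):6.2f} dB")
    # Limit when eta tends to 0: ln 2, i.e. about -1.59 dB.
    print(f"limit: {10 * math.log10(math.log(2)):.2f} dB")
```

Lowering the spectral efficiency lowers the minimum energy per bit, down to ln 2 (about -1.59 dB) as η tends to 0; this is one reason why energy-constrained sensor links can afford to trade spectrum for energy.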

3.2.4. Multiple-Valued Logic (MVL) architectures and circuits

Nowadays, digital systems are exclusively based on a binary representation of numbers and computations. It has been shown that using a higher number of logic states can reduce the number of interconnection wires and the memory area [56]. It can also make arithmetic processing more efficient.
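The wire-count argument is simple arithmetic: carrying N distinct values requires ⌈log_r N⌉ signals of radix r. The Python helper below (a hypothetical name, using integer arithmetic only to avoid floating-point rounding) computes this quantity. It ignores the extra cost of multi-level drivers, receivers and reduced noise margins, which real MVL circuits must pay.

```python
def wires_needed(n_values: int, radix: int) -> int:
    """Smallest number of radix-`radix` signals able to carry `n_values`
    distinct values, i.e. ceil(log_radix(n_values)), with integers only."""
    if n_values < 1 or radix < 2:
        raise ValueError("need n_values >= 1 and radix >= 2")
    wires, capacity = 0, 1
    while capacity < n_values:
        capacity *= radix
        wires += 1
    return wires


if __name__ == "__main__":
    for r in (2, 3, 4):
        print(r, wires_needed(256, r))   # 256 values: 8, 6 and 4 wires
```

For a byte-sized quantity (256 values), this gives 8 binary wires, 6 ternary wires or 4 quaternary wires.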

IC performance is limited by complex wiring (a large share of chip resources is devoted to interconnection), large propagation delays and high power consumption. Using Multiple-Valued Logic (MVL) techniques, the number of interconnections and the power consumption caused by high switching activity on each circuit node can be reduced. The SUpplementary Symmetrical LOgic Circuit structure (SUSLOC) is a promising new approach for implementing MVL functions in voltage mode. It combines low energy consumption with a speed equivalent to that of binary CMOS structures.

3.3. Modeling, synthesis and compilation for reconfigurable platforms

**Keywords:** ASIP, IC, architecture description language, data coding, design methodology, fixed-point arithmetic, flexible compilation, high-level synthesis, parallel architecture, precision, retargetable compilation, specialized processor.
3.3.1. Dedicated hardware accelerator synthesis

Although IC architectures are evolving toward increasingly programmable and reconfigurable solutions, future silicon systems will continue to integrate specialized hardware components, whose design relies on synthesis techniques.

Today, circuit synthesis starts from high-level specifications. Specifying programs that perform regular computations as recurrence equations enables powerful static analyses and program transformations for deriving regular architectures [4].

Our research is based on the polyhedral model, which is well suited to expressing the computational parts of applications and allows systems of recurrence equations to be expressed and manipulated.
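As a small illustration of a system of recurrence equations over a polyhedral domain, the matrix-vector product y = Ax can be written as a uniform recurrence on the index domain {(i, j) | 1 ≤ i ≤ m, 1 ≤ j ≤ n}. The Python sketch below is a hypothetical example (not Alpha syntax) that simply evaluates the equations in lexicographic order; a synthesis tool would instead choose a schedule and an allocation that respect the dependence from s(i, j-1) to s(i, j).

```python
def matvec_by_recurrence(a: list[list[float]], x: list[float]) -> list[float]:
    """Evaluate the recurrence system on {(i, j) | 1 <= i <= m, 1 <= j <= n}:
        s(i, 0) = 0
        s(i, j) = s(i, j - 1) + a(i, j) * x(j)     for 1 <= j <= n
        y(i)    = s(i, n)
    """
    m, n = len(a), len(x)
    s: dict[tuple[int, int], float] = {}
    for i in range(1, m + 1):
        s[(i, 0)] = 0.0                      # boundary condition
        for j in range(1, n + 1):            # lexicographic order respects s(i, j-1) -> s(i, j)
            s[(i, j)] = s[(i, j - 1)] + a[i - 1][j - 1] * x[j - 1]
    return [s[(i, n)] for i in range(1, m + 1)]


if __name__ == "__main__":
    print(matvec_by_recurrence([[1.0, 2.0], [3.0, 4.0]], [5.0, 6.0]))  # [17.0, 39.0]
```

The equations only state which value depends on which; any evaluation order that respects the dependences gives the same result, and this freedom is what parallelizing transformations exploit.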

There are many prototype academic environments for the automatic synthesis of specialized architectures from high-level specifications: for example Diastol, Presage, Hifi, Cathedral, Sade, PEI and MMAlpha. Tools performing high-level synthesis from the C language are now on the market: tools based on SystemC, such as CoCentric SystemC Compiler® from Synopsys and A|RT Builder from Adelante Technologies/Frontier Design, and tools based on C and its extensions, such as the DK1 Design Suite® from Celoxica.

Few tools rely on true parallelization, but many research projects explore this approach: Flex and Raw at MIT, Piperench at Carnegie-Mellon, Garp at Berkeley, Pico at HP Labs Palo Alto, and Compaan in Leiden.

Alpha [6] and MMAlpha, developed in the project-team Cosi, evolved from Diastol and today constitute a practical environment for manipulating recurrence equations and for the high-level synthesis of dedicated hardware accelerators. This work is done in close cooperation with the CompSys team (LIP, ENS Lyon).

3.3.2. Processor modeling and flexible compilation

Hardware description languages like VHDL or Verilog are widely used to model and simulate processors, but mainly for hardware design. SoC design requires methodologies and tools for exploring the architecture design space. This exploration relies on architecture description languages (ADL) suited to specifying SoC architecture models. Very early in the design process, they serve both to validate SoC architectures and to automatically generate the software development tools needed for the software and hardware design of the architecture.

Most existing architecture description languages target the specification of processor architectures and favor either synthesis, compiler generation or simulator generation, but very seldom all three. None of the existing languages is really geared toward architectural exploration.

Among architecture description languages mainly geared toward processor hardware synthesis, one can cite Mimola, developed at the University of Dortmund and used to describe target machines in the MSSQ and Record [65] compilers. Mimola is very close to hardware description languages like VHDL or Verilog. A Mimola description can be used for synthesis, simulation and, after extraction of the instruction set, code generation.

Among architecture description languages mainly geared toward compilation, one can cite nML, designed at the University of Berlin; ISDL, proposed at MIT; MDES, developed at the University of Illinois; and Expression, developed at the University of California at Irvine.

Among architecture description languages mainly geared toward simulation, one can cite LISA [67], developed at the University of Aachen. LISA allows the generation of cycle-accurate simulators for DSP processors and can model both structure and behavior.

---

4 http://www.systemc.org
5 http://www.synopsys.com/products/cocentric_studio/
6 http://www.celoxica.com/products/tools/dk.asp
7 http://flex-compiler.lcs.mit.edu
8 http://cag.lcs.mit.edu/raw
9 http://www.ece.cmu.edu/research/piperench/
10 http://brass.cs.berkeley.edu/garp.html
11 http://www.liacs.nl/~cserc/compaan/index.html
Existing architecture description languages can be classified according to their modeling level: behavioral or structural. A language like Mimola is structural, languages like nML and ISDL are behavioral, and LISA, Expression and MDES mix the two levels.

There is no standard for architecture description languages. The ARMOR language, developed in the project-team Cosi, is a practical approach for modeling complex architectures. It is suited to architectural exploration and to the automatic generation of software development tools (compiler, simulator, processor design tools, etc.).
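To illustrate the idea of deriving several tools from a single architecture description, the following Python sketch describes a toy three-instruction processor once (opcode plus semantics) and generates both an assembler and an instruction-set simulator from that description. The instruction set, the 32-bit encoding and all function names are hypothetical; this is not ARMOR syntax or its tool chain.

```python
from typing import Callable

# Semantics of one instruction: (registers, rd, rs, imm) -> new value of rd.
Semantics = Callable[[list[int], int, int, int], int]
Description = dict[str, tuple[int, Semantics]]

# Hypothetical architecture description: the single source of truth.
ISA: Description = {
    "LI":  (0x01, lambda r, d, s, imm: imm),          # rd <- imm
    "ADD": (0x02, lambda r, d, s, imm: r[d] + r[s]),  # rd <- rd + rs
    "MUL": (0x03, lambda r, d, s, imm: r[d] * r[s]),  # rd <- rd * rs
}


def generate_assembler(isa: Description) -> Callable[[str], int]:
    """Derive an assembler; word layout: opcode(8) | rd(4) | rs(4) | imm(16, unsigned)."""
    opcodes = {name: opc for name, (opc, _) in isa.items()}

    def assemble(line: str) -> int:
        mnemonic, *ops = line.replace(",", " ").split()
        rd = int(ops[0].lstrip("r"))
        rs, imm = 0, 0
        if len(ops) > 1:
            if ops[1].startswith("r"):
                rs = int(ops[1].lstrip("r"))
            else:
                imm = int(ops[1])
        return (opcodes[mnemonic] << 24) | (rd << 20) | (rs << 16) | (imm & 0xFFFF)

    return assemble


def generate_simulator(isa: Description, n_regs: int = 4) -> Callable[[list[int]], list[int]]:
    """Derive an instruction-set simulator from the same description."""
    semantics = {opc: sem for opc, sem in isa.values()}

    def run(program: list[int]) -> list[int]:
        regs = [0] * n_regs
        for word in program:
            opc = word >> 24
            rd, rs, imm = (word >> 20) & 0xF, (word >> 16) & 0xF, word & 0xFFFF
            regs[rd] = semantics[opc](regs, rd, rs, imm)
        return regs

    return run


if __name__ == "__main__":
    asm, sim = generate_assembler(ISA), generate_simulator(ISA)
    program = [asm(line) for line in ("LI r1, 6", "LI r2, 7", "MUL r1, r2")]
    print(sim(program))  # [0, 42, 7, 0]
```

Adding an instruction, for example a `SUB` entry, to `ISA` is enough for both generated tools to support it, which is the property that makes such descriptions useful for architectural exploration.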

3.3.3. Floating-point to fixed-point conversion

The efficient implementation of an algorithm on a specialized processor, such as a DSP (Digital Signal Processor) or an ASIP (Application Specific Instruction-set Processor), or on a hardware structure such as an ASIC or an FPGA (Field Programmable Gate Array), requires fixed-point arithmetic for reasons of cost, power consumption or silicon area, whereas algorithms are usually specified in floating-point arithmetic. Carried out manually, this conversion is tedious and error-prone. Indeed, some experiments [59] showed that the time devoted to this conversion step is significant: manual conversion can account for up to 30% of the total time needed to implement the algorithm. In addition, time-to-market constraints require high-level development tools that automate certain tasks.

Existing methodologies for automatic fixed-point data coding [64], [74] transform a floating-point data representation into a fixed-point representation without taking the architecture of the target processor into account. However, analyzing the influence of the architecture on computation accuracy and on the various phases of code generation shows that the architecture features must be taken into account, and that the data coding and code generation processes must be coupled, to obtain a high-quality implementation in terms of computation accuracy and execution time.
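Data coding comes down to choosing, for each variable, an integer part large enough for its dynamic range and a fractional part given by the remaining word length. The Python sketch below shows this for a single variable. It assumes the range `max_abs` is already known (from simulation or analysis) and uses round-to-nearest with saturation; it does not model target-architecture effects such as accumulator width, which the text argues must also be taken into account. The function names are hypothetical.

```python
import math


def integer_bits(max_abs: float) -> int:
    """Integer word-length, sign bit included, such that max_abs < 2**(iwl - 1)."""
    if max_abs <= 0:
        return 1
    _, exponent = math.frexp(max_abs)   # max_abs = m * 2**exponent, 0.5 <= m < 1
    return exponent + 1


def to_fixed(x: float, word_length: int, frac_bits: int) -> int:
    """Quantize x to a signed two's-complement integer with `frac_bits`
    fractional bits: round to nearest, then saturate to the word length."""
    q = math.floor(x * (1 << frac_bits) + 0.5)
    lo, hi = -(1 << (word_length - 1)), (1 << (word_length - 1)) - 1
    return max(lo, min(hi, q))


def from_fixed(q: int, frac_bits: int) -> float:
    """Value represented by the fixed-point integer q."""
    return q / (1 << frac_bits)


if __name__ == "__main__":
    iwl = integer_bits(0.9)     # signal known to stay within [-0.9, 0.9]
    fwl = 16 - iwl              # remaining bits of a 16-bit word are fractional
    q = to_fixed(0.7071, 16, fwl)
    print(iwl, fwl, q, from_fixed(q, fwl))
```

With `max_abs = 0.9` and a 16-bit word, `integer_bits` returns 1, leaving 15 fractional bits (the common Q15 format).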

Data coding optimization must be carried out under an accuracy constraint, so the signal-to-quantization-noise ratio (SQNR) of the application must be determined. SQNR determination methods [62] are generally based on simulation. Within data coding optimization, however, these methods are used in an iterative process, which leads to long optimization times. Analytical techniques offer new perspectives for accuracy evaluation.
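Simulation-based SQNR estimation can be sketched as follows: quantize a test signal, measure the error power, and compare it with the signal power. The Python sketch below also evaluates the classical additive-noise model of rounding (uniform error of power q²/12, with quantization step q = 2^-f), the kind of closed-form expression analytical methods build on. The random test signal and the function names are assumptions for the example.

```python
import math
import random


def quantize(x: float, frac_bits: int) -> float:
    """Round x to the nearest multiple of 2**-frac_bits (no saturation)."""
    step = 2.0 ** -frac_bits
    return math.floor(x / step + 0.5) * step


def sqnr_db_by_simulation(signal: list[float], frac_bits: int) -> float:
    """SQNR estimated by simulation: compare the signal with its quantized copy."""
    p_signal = sum(s * s for s in signal) / len(signal)
    p_noise = sum((s - quantize(s, frac_bits)) ** 2 for s in signal) / len(signal)
    return 10.0 * math.log10(p_signal / p_noise)


def noise_power_analytical(frac_bits: int) -> float:
    """Additive-noise model of rounding: uniform error, power q**2 / 12."""
    q = 2.0 ** -frac_bits
    return q * q / 12.0


if __name__ == "__main__":
    rng = random.Random(0)                      # fixed seed: reproducible run
    signal = [rng.uniform(-0.9, 0.9) for _ in range(10_000)]
    for bits in (8, 12, 16):
        print(bits, round(sqnr_db_by_simulation(signal, bits), 2))
```

Each additional fractional bit improves the SQNR by about 6 dB. Every call to `sqnr_db_by_simulation` re-runs the whole simulation, which is why an optimization loop over many candidate word lengths becomes expensive.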

4. Application Domains

4.1. Panorama

The privileged field of applications is that of third- and fourth-generation mobile telecommunications.

Depending on our collaborations, other application domains are also considered: image indexing, traffic filtering in high-speed networks, and speech processing.

4.2. Mobile telecommunications

The future generations of telecommunications constitute a privileged field of applications for IC designers because of the diversity of the constraints to satisfy. In addition to the very high performance level (more than 12 billion operations per second) resulting from the combination of multimedia capabilities and access techniques such as WCDMA that these systems (known as 3G) will have to support, they must also support all the algorithms integrated into the standards of present generations (GSM, DECT, IS-95) and their evolutions.

From the point of view of hardware architectures, next-generation systems will successively have to deal with very different applications. Indeed, the tasks in a third-generation communication chain handle data of varying size depending on how far the task is from the transmitter or the receiver: application tasks handle coarse-grained data such as images, whereas the tasks giving access to the transmission operate on bit-level data. Because of the breadth of the application spectrum integrated into future telecommunication standards, the processing applied to these data will also be very diverse, which will result in very different calculation patterns. Even if each of these constraints can be handled on its own, the problem is much more delicate when they are combined, and time-to-market constraints call for development tools that are both portable and efficient. For energy-aware products (below 500 mW peak), this problem cannot be solved with current architectural solutions alone.

5. Software

5.1. Panorama

**Keywords:** library, polyhedral computation.

Research undertaken by R2D2 falls within the context of software and hardware tools for the design of hardware systems. To promote the techniques studied, several software prototypes are developed (PolyLib, MMAlpha, BSS, ARMOR/CALIFE). Three of these distributed software packages are presented here: PolyLib, an open-source library for computations on polyhedra; MMAlpha, for high-level synthesis; and BSS, a platform for circuit design.

5.2. PolyLib

**Keywords:** ASIC, CAD, architecture synthesis, data parallelism, functional programming, polyhedral computation.

**Participants:** Patrice Quinton [contact], Tanguy Risset [CompSys, INRIA Rhône-Alpes].

The PolyLib library, developed in C, is an open-source library for computations on convex polyhedra. It was initially developed by Hervé Le Verge and Doran Wilde at INRIA Rennes. It is now maintained and developed jointly with LIP (ENS Lyon) and ICPS (University of Strasbourg). Handling the domains used in recurrence equations, or the index spaces described by nested loops, justifies the use of such a library. The library is currently used (independently of MMAlpha) by several research organizations in England, the United States, the Netherlands and France.
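PolyLib represents a convex polyhedron by a set of affine constraints. The plain-Python sketch below does not use the PolyLib C API: it encodes each constraint as a row (a_1, ..., a_d, b) meaning a·z + b ≥ 0, in the spirit of the inequality rows of PolyLib constraint matrices, and enumerates by brute force, inside a bounding box, the integer points of the index space of a triangular loop nest. The function names are hypothetical; PolyLib itself operates on such domains symbolically (intersection, image, counting) rather than by enumeration.

```python
from itertools import product

Constraint = tuple[int, ...]   # (a_1, ..., a_d, b) meaning a . z + b >= 0


def satisfies(constraints: list[Constraint], point: tuple[int, ...]) -> bool:
    """True if the integer point meets every affine inequality."""
    return all(sum(a * z for a, z in zip(c[:-1], point)) + c[-1] >= 0
               for c in constraints)


def integer_points(constraints: list[Constraint],
                   box: list[tuple[int, int]]) -> list[tuple[int, ...]]:
    """Integer points of the polyhedron, found by brute force inside the
    bounding box [(lo_1, hi_1), ..., (lo_d, hi_d)], in lexicographic order."""
    ranges = [range(lo, hi + 1) for lo, hi in box]
    return [p for p in product(*ranges) if satisfies(constraints, p)]


def triangle(n: int) -> list[Constraint]:
    """Index space of: for i in 0..n-1: for j in 0..i  (variables z = (i, j))."""
    return [
        (1, 0, 0),        # i >= 0
        (-1, 0, n - 1),   # n - 1 - i >= 0
        (0, 1, 0),        # j >= 0
        (1, -1, 0),       # i - j >= 0
    ]


if __name__ == "__main__":
    print(len(integer_points(triangle(5), [(0, 4), (0, 4)])))   # 15 = 5 * 6 / 2
```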

For more information, refer to [http://www.irisa.fr/polylib](http://www.irisa.fr/polylib) or contact Patrice Quinton.

5.3. MMAlpha

**Keywords:** ASIC, CAD, architecture synthesis, data parallelism, functional programming.

**Participants:** Patrice Quinton [contact], Tanguy Risset [CompSys, INRIA Rhône-Alpes].

MMAlpha is a software environment that implements transformations of programs written in the Alpha language. The Alpha language was proposed by Christophe Mauras during his thesis in 1989. MMAlpha is implemented in the Mathematica language (hence the name MMAlpha) and is built on the PolyLib library.

Alpha program transformations are implemented by combining the Mathematica language and the PolyLib library. The principle is to derive an architecture, a sequential code or a parallel code from an algorithmic specification of a problem. These transformations are semi-automatic: the user indicates the actions to perform, but the transformation itself is carried out by MMAlpha. Automatic transformations are also available and give satisfactory results in some cases.

The design methodology is inherited from systolic array synthesis. This field is studied from a theoretical point of view, and the results of this research are implemented and tested in MMAlpha. The software makes it possible to test various existing synthesis strategies, to study different parallelization options, and to generate an architectural description of a circuit in the AlpHard format (a subset of the Alpha language). MMAlpha interfaces with logic synthesis tools through a translation to VHDL.
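Systolic synthesis typically looks for a linear schedule t(z) = λ·z of a uniform recurrence such that every dependence vector d satisfies λ·d ≥ 1, and for a processor allocation that, together with the schedule, maps distinct computations to distinct (time, processor) pairs. The sketch below checks these two conditions for the classic convolution y_i = Σ_j w_j x_(i−j); it is a generic textbook illustration, not MMAlpha's implementation, and the dependence vectors and mappings are assumptions of the example.

```python
# Dependence vectors of the uniform recurrence for y_i = sum_j w_j * x_(i-j):
#   Y(i, j) <- Y(i, j-1),  W(i, j) <- W(i-1, j),  X(i, j) <- X(i-1, j-1)
DEPS = [(0, 1), (1, 0), (1, 1)]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def valid_schedule(lam, deps=DEPS):
    """A linear schedule t(z) = lam . z respects causality if lam . d >= 1."""
    return all(dot(lam, d) >= 1 for d in deps)

def injective_space_time(lam, alloc):
    """(time, processor) = (lam . z, alloc . z) never maps two points to the
    same pair iff the 2x2 matrix [lam; alloc] is nonsingular (2-D domain)."""
    det = lam[0] * alloc[1] - lam[1] * alloc[0]
    return det != 0

print(valid_schedule((1, 1)))                # True: lam . d = 1, 1, 2
print(valid_schedule((1, 0)))                # False: lam . (0, 1) = 0
print(injective_space_time((1, 1), (0, 1)))  # True: processor index p = j
```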

The software has been the implementation support for many theses carried out at IRISA, and it is used by several research teams in collaboration with R2D2. It is one of the few tools that make it possible to describe an algorithm and its hardware implementation in the same language, and to derive this implementation through proven transformations.

For more information, see http://www.irisa.fr/R2D2/ALPHA/ or contact Patrice Quinton.

5.4. BSS, BOOST

Keywords: architecture synthesis, circuit design, low-power consumption, placement.

Participants: Daniel Chillet [contact], Sébastien Pillement, Olivier Sentieys.

The BSS (Breizh Synthesis System) software platform for circuit design provides a set of tools for capturing application descriptions (in VHDL or C), compilation, simulation, and architecture synthesis.

The platform is currently composed of the following modules.

• A set of programs (C and VHDL compilers, selection, scheduling, code generation) for circuit synthesis.

• Graphical interfaces, PUDesigner and GFDesigner, for visualizing and manipulating data-flow graphs and architectures.

• An architectural-level power estimation tool, PowerCheck, which operates on the architectures generated by synthesis. It also takes as input a parameter file characterizing the circuit technology and the physical capacitances of the chip. The input signal can be specified in two ways: either by its probabilities according to a model (white noise, DBT), or as a file of vectors from which the probabilistic characteristics are extracted. As output, PowerCheck provides a report giving the average power dissipated by each part of the control and processing units, as well as the cycle-by-cycle power dissipated by the various modules.

• An architectural-level area and interconnect-delay estimation tool, Jfloorplanner. Its input is a netlist generated by BSS, which contains all the information about the components and their interconnections. The tool estimates the final floorplan area, the lengths of the interconnections, and the interconnection delays associated with these lengths. A display of the estimated floorplan is available and can be used to quickly carry out the place-and-route step with standard CAD tools.
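PowerCheck's vector-file option can be illustrated with the standard dynamic-power relation: each toggle of a node of capacitance C dissipates ½·C·Vdd², so the average power is the sum over nodes of ½·(toggle rate)·C·Vdd²·f. The sketch below extracts a per-bit toggle rate from a vector trace and applies this relation. It assumes one lumped capacitance per bit and illustrative parameter values, and is not PowerCheck's actual model.

```python
def toggle_rates(vectors: list[str]) -> list[float]:
    """Fraction of consecutive vector pairs in which each bit changes value."""
    width = len(vectors[0])
    pairs = list(zip(vectors, vectors[1:]))
    return [
        sum(a[k] != b[k] for a, b in pairs) / len(pairs)
        for k in range(width)
    ]

def dynamic_power(vectors, cap_per_bit, vdd, freq):
    """Average dynamic power (W): each toggle dissipates 0.5 * C * Vdd^2."""
    rates = toggle_rates(vectors)
    return sum(0.5 * r * cap_per_bit * vdd**2 * freq for r in rates)

vectors = ["0000", "1010", "1111", "0101"]  # hypothetical input trace
print(toggle_rates(vectors))
print(dynamic_power(vectors, cap_per_bit=10e-15, vdd=1.2, freq=200e6))
```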

BOOST (Breizh Object Oriented Synthesis Tools) is an evolution of the BSS platform whose main objective is to facilitate the integration of new modules in the synthesis flow.

A global XML application defines the list of modules and their installation location. For each module, a dedicated XML application defines how the module must be described to be included in the Boost platform. Several simple synthesis steps have been included in Boost. The platform was used as a demonstrator for the OSGAR project during the RNTL days in October 2004 in Rennes. Boost is developed in Java and can be installed on Solaris, Windows or Linux platforms.

For more information, contact Daniel Chillet.
6. New Results

6.1. New architectures and technologies

Keywords: CDMA, MVL, Network-on-Chip, NoC, SoC, System-on-Chip, grain of calculation, low-power consumption, multiple-valued logic, reconfigurable architecture, sensor network.

6.1.1. New organization of reconfigurable structures

6.1.1.1. DART reconfigurable architecture

Participants: Sébastien Pillement, Julien Lallet, Olivier Sentieys.

The definition of the DART architecture was the subject of the Ph.D. thesis of Raphael David in 2003 [3]. To validate the theoretical aspects and simulated performance of this new computing paradigm with a silicon prototype, a collaboration has started with the LIST laboratory of CEA. The aim of this joint research project is to integrate a DART cluster implementing channel estimation for the 802.11a wireless LAN standard. The algorithmic complexity of channel estimation is 1784 MOPS and 356 MDPS (millions of divisions per second). A register-transfer level VHDL model of DART has been designed; it is compatible with the cycle-true, bit-true SystemC simulator. Synthesis of a DART cluster including six reconfigurable data-paths and two dedicated dividers in a 130 nm CMOS technology from STMicroelectronics yields a 200 MHz clock frequency (i.e. 4800 32-bit MOPS plus 400 MDPS) for an area of less than 10 mm².
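The figures above can be cross-checked with a few lines of arithmetic. At 200 MHz, 4800 MOPS spread over six data-paths corresponds to 4 operations per data-path per cycle, and, assuming each divider delivers one division per cycle, the two dividers give 400 MDPS. The channel-estimation load then occupies roughly 37% of the operation capacity and 89% of the division capacity. The per-data-path figure is derived here, not quoted from the text.

```python
clock_mhz = 200
datapaths, dividers = 6, 2
peak_mops, required_mops = 4800, 1784
required_mdps = 356

ops_per_dp_per_cycle = peak_mops / (datapaths * clock_mhz)  # 4.0
peak_mdps = dividers * clock_mhz  # 400, assuming one division per divider per cycle
print(ops_per_dp_per_cycle, peak_mdps)
print(f"MOPS load: {required_mops / peak_mops:.0%}")  # 37%
print(f"MDPS load: {required_mdps / peak_mdps:.0%}")  # 89%
```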

A new study on the definition of a platform-based reconfigurable architecture has started. The main goal is to define a generic architecture (based on the DART paradigm) supporting different application domains, in order to alleviate DART's main drawback: its specialization for the telecommunications domain.

6.1.1.2. Memory hierarchy in specialized SoC

Participants: Daniel Chillet, Olivier Sentieys, Erwan Grâce.

Our research aims at defining a global memory hierarchy model suited to SoCs, together with a methodology that allows the designer to explore the design space. The main objective is to limit the energy consumption of the circuit.

SoC architectures already provide large on-chip memories, with several memory banks and hierarchy levels. In these systems, the main problem is exploring the memory organization according to the application's needs, and in particular the energy consumption of this part of the circuit. Several aspects can be addressed in this context, such as caches, scratch-pad memories, and multi-bank memories. We focus our research on design methodologies for optimal memory hierarchies [32], [18]. A first model has been defined for dedicated SoCs and for large reconfigurable architectures such as FPGAs. To address energy consumption, we propose to extend the reconfigurable concept to the memory part. Indeed, in future technologies memory consumption will be the most critical problem, and we have shown that a virtual memory hierarchy can be defined that yields a significant reduction in consumption [19]. This work is carried out in collaboration with the École Nationale Polytechnique d’Alger (Ph.D. of L. Abdelouel) and with the CEA.

6.1.1.3. RDISK: Reconfigurable DISK

Participants: Steven Derrien, Ludovic L’Hours.

The Reconfigurable DISK (RDISK) is a joint project between the Symbiose and R2D2 teams, funded by the French Ministry of Research over the last few years.

Its goal is to develop a specialized architecture following the smart-disk concept: reconfigurable computing resources are attached close to the disk to filter data on the fly and thus speed up the scanning of large databases. The target application field is genomic data extraction, and a 48-disk prototype is currently in use.

R2D2 is involved in the design, implementation and validation of the RDISK programmable System-on-Chip, which is based on a Xilinx FPGA. The goal was to provide a SoC with a small footprint in terms of resource usage. Among other tasks, this project included the design of a SoC bus arbiter, a high-performance SDRAM controller, and an ATA/IDE hard drive controller. A significant amount of work has also gone into the design of a lightweight operating system layer, whose purpose is to handle the dynamic reconfiguration capability of RDISK and to provide simple communication primitives between the host and the boards. All these contributions have been successfully tested on the system and now serve as a framework for all other RDISK project participants.
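The text does not detail the arbitration policy of the RDISK SoC bus. As an illustration of what a bus arbiter decides each cycle, the sketch below models a round-robin arbiter, a common choice for small-footprint buses: the search for the next grant starts just after the last granted master, so no requesting master starves. The class and its interface are hypothetical.

```python
from typing import Optional

class RoundRobinArbiter:
    """Cycle-level model of a round-robin bus arbiter (hypothetical)."""

    def __init__(self, n_masters: int):
        self.n = n_masters
        self.last = n_masters - 1  # so that master 0 has priority after reset

    def grant(self, requests: list[bool]) -> Optional[int]:
        """Return the index of the granted master, or None if nobody requests."""
        for offset in range(1, self.n + 1):
            candidate = (self.last + offset) % self.n
            if requests[candidate]:
                self.last = candidate
                return candidate
        return None

arb = RoundRobinArbiter(3)
all_req = [True, True, True]
print([arb.grant(all_req) for _ in range(4)])  # [0, 1, 2, 0]
```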

6.1.2. NoC design using advanced mobile telecommunication techniques

Participants: Jean-Marc Philippe, Sébastien Pillement, Olivier Sentieys.

The increasing need for low-power, high-speed on-chip interconnect schemes led us to investigate new signaling concepts. Among them is PAM (Pulse Amplitude Modulation), which encodes multiple voltage levels on a single wire.

We have designed four ternary encoders and one ternary decoder to implement a low-power, low-area asynchronous signaling system. The foundry process was modified to meet the threshold-voltage requirements of our transistors. We have also designed a quaternary link using custom transistors to overcome some of the interconnect problems. The quaternary link is composed of a binary-to-quaternary encoder and a quaternary-to-binary decoder: the encoder converts two binary signals into one quaternary signal, and the decoder converts the quaternary signal back into two binary signals.
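Functionally, the quaternary link maps each pair of bits to one of four voltage levels and back. The behavioral sketch below assumes four equally spaced levels between 0 and Vdd, a natural binary bit-to-level mapping, and decoding thresholds placed midway between levels. These choices are illustrative: the actual circuits rely on custom transistor thresholds rather than explicit comparisons.

```python
VDD = 1.0
LEVELS = [k * VDD / 3 for k in range(4)]  # 0, Vdd/3, 2Vdd/3, Vdd (assumed)

def encode(b1: int, b0: int) -> float:
    """Binary-to-quaternary: two bits select one of four voltage levels."""
    return LEVELS[2 * b1 + b0]

def decode(v: float) -> tuple[int, int]:
    """Quaternary-to-binary: compare against thresholds midway between levels."""
    thresholds = [(LEVELS[k] + LEVELS[k + 1]) / 2 for k in range(3)]
    symbol = sum(v > t for t in thresholds)
    return symbol >> 1, symbol & 1

# A small amount of noise on the wire is tolerated (margin = Vdd/6).
for bits in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    assert decode(encode(*bits) + 0.1) == bits
```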

SPICE simulations of the circuits show a large reduction in energy consumption for global interconnects, due to the reduced voltage swing of some transitions. For a 10 mm wire, the energy reduction is about 56% for the asynchronous ternary system compared to a classical binary dual-rail scheme. The reduction is about 50% for the quaternary system, which consumes less energy than a full-swing binary system even for a 1 mm wire. Another advantage of this approach is the very low transistor count of the complete systems: for example, one of the quaternary links requires 22 transistors (10 for the encoder and 12 for the decoder). These links also halve the number of wires needed to transmit the information. This allows the spacing between wires to be increased, reducing crosstalk noise, and also contributes to reducing the interconnect area.

We have proposed an analytical energy consumption model for ternary and quaternary links. This model predicts the dynamic energy consumption of a complete link as a function of the wire length, electrical and technological parameters, and the statistical distribution of the binary inputs. We have also derived an error model for multiple-valued links.
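A much simplified version of such a model can be written down directly. Assume each level is driven from its own ideal supply rail, so that a rising transition from V_a to V_b on a wire of capacitance C draws C·V_b·(V_b − V_a) from the supply, and a falling transition draws no energy from the supplies. Averaging over the transition probabilities then gives the expected energy per symbol. The sketch compares two full-swing binary wires with one quaternary wire carrying the same two bits, for independent, uniformly distributed inputs. It is an idealized, assumption-laden model (no encoder/decoder, leakage or coupling energy), not the published model, so it is not expected to reproduce the measured figures.

```python
from fractions import Fraction
from itertools import product

def expected_energy(levels, probs=None):
    """Expected supply energy per symbol, in units of C * Vdd^2, for one wire
    whose successive symbols are independent with the given probabilities."""
    n = len(levels)
    probs = probs or [Fraction(1, n)] * n
    energy = Fraction(0)
    for a, b in product(range(n), repeat=2):
        va, vb = levels[a], levels[b]
        if vb > va:  # assumption: only rising transitions draw supply energy
            energy += probs[a] * probs[b] * vb * (vb - va)
    return energy

binary = 2 * expected_energy([Fraction(0), Fraction(1)])          # two wires
quaternary = expected_energy([Fraction(k, 3) for k in range(4)])  # one wire
print(binary, quaternary, float(quaternary / binary))
```

Since C is proportional to the wire length in this model, both energies scale linearly with length; the input statistics enter through `probs`.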

The second topic of our research is crosstalk reduction. We used static coding schemes to improve the propagation delay of a bus by removing the worst-case patterns. Our schemes improve the propagation delay by 50%, independently of the bus width. Another advantage of our schemes is that they unify the transition directions of the signals: the resulting crosstalk noise becomes larger, but always acts in the same direction. Because this global noise is thus under control, the switching level of the receiver can be shifted to improve the noise margin.
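Worst-case crosstalk patterns are usually characterized by the delay of a victim wire, which in a common first-order model is proportional to 1 + λ·Σ|Δ_victim − Δ_neighbor|, where Δ ∈ {−1, 0, +1} is the transition of each wire and λ the ratio of coupling to ground capacitance. A wire switching opposite to both neighbors therefore sees 1 + 4λ, while a wire switching together with its neighbors sees 1. The sketch below computes this factor for a bus transition; it illustrates the classification that static crosstalk-avoidance codes exploit, not the specific codes developed in this work.

```python
def transitions(old: str, new: str) -> list[int]:
    """Per-wire transition: +1 rising, -1 falling, 0 stable."""
    return [int(n) - int(o) for o, n in zip(old, new)]

def worst_delay_factor(old: str, new: str, lam: float) -> float:
    """Largest first-order delay factor 1 + lam * sum |d_i - d_neighbor|
    over the wires that actually switch (0.0 if no wire switches)."""
    d = transitions(old, new)
    worst = 0.0
    for i, di in enumerate(d):
        if di == 0:
            continue
        coupling = sum(abs(di - d[j]) for j in (i - 1, i + 1) if 0 <= j < len(d))
        worst = max(worst, 1 + lam * coupling)
    return worst

print(worst_delay_factor("010", "101", lam=1.0))  # 5.0: middle wire sees 1 + 4*lam
print(worst_delay_factor("000", "111", lam=1.0))  # 1.0: all wires switch together
```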

The last contribution deals with unified coding. We have developed an efficient coding scheme that uses simple techniques to address both crosstalk and noise issues. This technique can improve the propagation delay of the bus and the error tolerance by detecting transmission errors. The error tolerance can also be exploited to reduce the supply voltage of the link, and thus to dramatically reduce its power consumption.

This research is described in full detail in the Ph.D. thesis of Jean-Marc Philippe [14].

6.1.3. Wireless sensor networks

Participants: Mickaël Cartron, Olivier Sentieys.

The aim of our research is to optimize the energy efficiency of a wireless sensor network at the architectural and algorithmic levels. We modeled the behavior of a low-level communication system, from the physical level to the packet retransmission system. The target processing platform is a dedicated architecture (ASIC), because of its lower power consumption compared to microprocessors or FPGAs. We modeled the bit-error-rate performance of the communication system and its power consumption as a function of several parameters (noise power, distance, packet size, amplification level). From separate analytical expressions of performance and power, we can deduce the power under a performance constraint, which can be expressed as the energy consumed per successfully transmitted useful bit. Using this expression, we have identified an optimal operating point as a function of the input parameters. Recent results show that using this configuration for the architecture and the communication-system parameters saves up to 75% of the power compared to the worst-case technique.
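The trade-off behind such an optimal operating point can be seen on a simple textbook model. A packet of L_h header bits and L_p payload bits, sent with energy e_bit per bit, is received correctly with probability (1 − BER)^(L_h+L_p), assuming independent bit errors and retransmission until success. The expected energy per useful bit is then e_bit·(L_h+L_p) / ((1 − BER)^(L_h+L_p)·L_p). Longer packets amortize the header but fail more often, so an optimal payload size exists. The parameter values below are illustrative assumptions, not values from the study.

```python
def energy_per_useful_bit(payload, header, ber, e_bit):
    """Expected energy per correctly delivered payload bit, with retransmission
    until success and independent bit errors."""
    total = payload + header
    p_success = (1.0 - ber) ** total
    return e_bit * total / (p_success * payload)

def best_payload(header, ber, e_bit, max_payload=4096):
    """Exhaustive search of the payload size (in bits) minimizing the energy."""
    return min(range(1, max_payload + 1),
               key=lambda p: energy_per_useful_bit(p, header, ber, e_bit))

header, ber, e_bit = 64, 1e-3, 50e-9  # hypothetical: 64-bit header, 50 nJ/bit
p_opt = best_payload(header, ber, e_bit)
print(p_opt, energy_per_useful_bit(p_opt, header, ber, e_bit))
```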

In addition to this work, we have worked on a sensor network prototype. The aim of this study is to build a wireless ad-hoc sensor network with specific features. The design is very modular, to ease future evolutions. The prototype is based on an existing industrial platform, because our goal is mainly to study high-level constraints at the application, network, or link level. The required memory is minimal. Network mobility is nil or very low, except for special nodes that may be mobile. The geographic position of nodes is central information, and the identifier of a node is its geographic location. Each node has only limited knowledge of the whole network: it knows the positions of the nodes in its close neighborhood and the position of the base station.
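Because a node's identifier is its position, and each node knows its neighbors' positions and the base station's position, a natural forwarding rule is greedy geographic routing: hand the packet to the neighbor closest to the base station, provided it is closer than the current node. The text does not state which routing rule the prototype uses, so the sketch below is a hypothetical illustration of how such local knowledge suffices.

```python
import math
from typing import Optional

Position = tuple[float, float]  # a node is identified by its (x, y) position

def next_hop(current: Position, neighbors: list[Position],
             base: Position) -> Optional[Position]:
    """Greedy geographic forwarding using only local knowledge."""
    best = min(neighbors, key=lambda n: math.dist(n, base), default=None)
    if best is None or math.dist(best, base) >= math.dist(current, base):
        return None  # local minimum: greedy forwarding fails here
    return best

def route(src: Position, topology: dict[Position, list[Position]],
          base: Position) -> list[Position]:
    """Follow greedy hops until the base station (or a dead end) is reached."""
    path = [src]
    while path[-1] != base:
        hop = next_hop(path[-1], topology[path[-1]], base)
        if hop is None:
            break
        path.append(hop)
    return path

base = (0, 0)
topology = {
    (3, 0): [(2, 1), (4, 0)],
    (2, 1): [(3, 0), (1, 0)],
    (1, 0): [(2, 1), (0, 0)],
}
print(route((3, 0), topology, base))  # [(3, 0), (2, 1), (1, 0), (0, 0)]
```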

Our aim is to define a system made only of strictly necessary hardware, exactly fitted to a targeted application. The challenge of this project is therefore first to define precisely the desired application and its needs (memory, latency, etc.), and then to deduce the architecture that best fits these constraints.

A complete monitoring system has been implemented on autonomous, versatile wireless platforms designed by the Aphycare company, based on a Texas Instruments MSP430 microcontroller, a Chipcon CC1020 communication component, and a programmable RF front end with an amplifier and a band-pass filter. We have adapted the sensor network prototype to two types of applications (scenarios), and identified 5 transmission modes which we assume cover all sensor network communication schemes. This prototype is a powerful tool for studying application and network aspects and for obtaining valuable data to evaluate high-level application parameters. It can help determine the transmission rate or the data volume actually needed for a specific application. Such data is crucial for efficient cross-layer design, and more particularly for optimizing the lower layers of the network protocol.

6.2. Modeling, synthesis and compilation for reconfigurable platforms

Keywords: architecture modeling, ASIP design, architecture synthesis, communication, fixed-point arithmetic, flexible compilation, reconfigurable system, scheduler, synthesis, system on-chip.

6.2.1. Synthesis and compilation techniques

6.2.1.1. Automatic synthesis of optimized reconfigurable systems

Participant: Christophe Wolinski.

This year we continued investigating the problem of optimized Fabric synthesis. In this context, we proposed a new method for generating optimized architectures of hardware processes, with a view to their later implementation on a "System on a Programmable Chip" (SoPC). The hardware processes are the application-tailored "cells" of the Processor-Coupled Polymorphous Fabric implemented on the reconfigurable SoPC platform. To obtain optimized, high-performance pipelined architectures, each process implementing a repetitive conditional behavior with possible inter-iteration dependencies was scheduled under hardware resource constraints. The scheduling problem was formulated and solved using constraint programming. This approach yielded optimal solutions, in terms of execution time and number of registers, for a number of real cases.
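For a pipelined loop scheduled under resource constraints, two standard lower bounds on the initiation interval (II) frame the search: ResMII, the maximum over resource types of ⌈operations / units⌉, and RecMII, the maximum over dependence cycles of ⌈total latency / total iteration distance⌉. The sketch below computes both. These are the classic modulo-scheduling bounds, shown as background rather than the constraint-programming formulation used in this work, and the example loop is hypothetical.

```python
from math import ceil

def res_mii(op_counts: dict[str, int], units: dict[str, int]) -> int:
    """Resource-constrained lower bound on the initiation interval."""
    return max(ceil(op_counts[r] / units[r]) for r in op_counts)

def rec_mii(cycles: list[tuple[int, int]]) -> int:
    """Recurrence bound: each cycle is (sum of latencies, sum of distances)."""
    return max((ceil(lat / dist) for lat, dist in cycles), default=1)

# Hypothetical loop body: 6 additions, 3 multiplications, 2 memory accesses,
# on 2 adders, 1 multiplier and 1 memory port; one recurrence of latency 4
# carried over 1 iteration (an inter-iteration dependency).
ops = {"add": 6, "mul": 3, "mem": 2}
fus = {"add": 2, "mul": 1, "mem": 1}
mii = max(res_mii(ops, fus), rec_mii([(4, 1)]))
print(mii)  # 4: the recurrence, not the resources, limits the pipeline
```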

We have applied the proposed method to many different applications, including part of the "CORDIC" application. We implemented the final design on a reconfigurable platform, which demonstrated the feasibility of our approach. Optimal schedules were obtained for many of the tested applications.
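For readers unfamiliar with CORDIC, the sketch below shows its rotation mode, which computes cos θ and sin θ using only shifts, additions and a small table of arctangents. The values of (x, y, z) carried from one iteration to the next are the kind of inter-iteration dependency mentioned above. The code uses floating point for readability, and the text does not specify which part of CORDIC was used as a benchmark.

```python
import math

def cordic_cos_sin(theta: float, iterations: int = 24) -> tuple[float, float]:
    """CORDIC in rotation mode, valid for |theta| <= ~1.74 rad."""
    angles = [math.atan(2.0 ** -i) for i in range(iterations)]
    gain = 1.0
    for i in range(iterations):
        gain *= 1.0 / math.sqrt(1.0 + 2.0 ** (-2 * i))
    x, y, z = gain, 0.0, theta  # pre-scale by the CORDIC gain
    for i in range(iterations):
        d = 1.0 if z >= 0 else -1.0
        # In hardware, multiplying by 2**-i is a right shift by i bits.
        x, y = x - d * y * 2.0 ** -i, y + d * x * 2.0 ** -i
        z -= d * angles[i]
    return x, y

c, s = cordic_cos_sin(0.5)
print(c, math.cos(0.5), s, math.sin(0.5))
```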

As a result of our research, the automatic "FAbric cell Synthesis Tool" (FAST) was developed.

---

12 Aphycare is a spin-off of the R2D2 team, http://www.aphycare.com/indexgb.html
13 This work is an extension of research on automatic synthesis of optimized reconfigurable systems undertaken at Los Alamos National Laboratory, USA.

6.2.1.2. Specialized microcontroller synthesis on FPGA

**Participant:** Ludovic L’Hours.

This research aims at developing techniques to synthesize specialized microcontrollers on FPGA. The targeted applications are described in a high-level language such as C and are dominated by control (peripheral drivers, packet processing, etc.); one example is the operating system embedded in the RDisk machine [22]. The main goal is to obtain small circuits with reasonable performance. Traditional architecture synthesis approaches generally aim at maximizing performance by analyzing and parallelizing the data flow; they are not suited to applications with complex control flow, whose intrinsically sequential nature naturally calls for software compilation techniques.

We designed a microcontroller synthesis technique based on the extraction of a specialized instruction set from a given application. The design of this instruction set is mainly guided by application profiling information, but also by various estimators such as pattern complexity or the number of bus accesses. The microcontroller is then derived from this instruction set using VHDL templates: the targeted architecture is currently a RISC processor, but other kinds of architectures, such as VLIW, could be considered. Compared to fixed-instruction-set microprocessors, we managed to halve both the code size and the processor size for the same range of performance [41].
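The profiling-guided part of such an instruction-set extraction can be illustrated by counting how often each small operator pattern occurs in the expression trees of an application, weighted by execution counts; the most frequent patterns are candidates for specialized instructions. The sketch below only counts parent-child operator pairs (two-node tree patterns). The tree representation, the weights and the ranking rule are hypothetical simplifications, and estimators such as pattern complexity or bus accesses are ignored.

```python
from collections import Counter

# An expression tree node: (operator, child, child, ...); leaves are strings.
Tree = tuple

def pairs(tree) -> list[tuple[str, str]]:
    """All (parent operator, child operator) edges of an expression tree."""
    if not isinstance(tree, tuple):
        return []
    op, *children = tree
    found = []
    for child in children:
        if isinstance(child, tuple):
            found.append((op, child[0]))
        found.extend(pairs(child))
    return found

def rank_patterns(profile: list[tuple[Tree, int]]):
    """Weight each pattern by how many times its statement executes."""
    counts = Counter()
    for tree, exec_count in profile:
        for pattern in pairs(tree):
            counts[pattern] += exec_count
    return counts.most_common()

profile = [
    (("store", "p", ("add", ("load", "q"), "1")), 1000),  # *p = *q + 1
    (("add", ("shl", "a", "2"), "b"), 400),               # (a << 2) + b
    (("add", ("load", "r"), "c"), 300),                   # *r + c
]
print(rank_patterns(profile)[0])  # (('add', 'load'), 1300)
```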

Work is currently under way 1) to extend the extraction algorithm to DAGs instead of trees, in order to investigate more complex instruction patterns; and 2) to port the VHDL templates to SystemC, so that the generated processors can be integrated into the SocLib platform (http://soclib.lip6.fr).

6.2.1.3. Derivation of efficient hardware/software interfaces of regular arrays

**Participants:** Steven Derrien, Alain Darte [CompSys INRIA Rhône-Alpes], Tanguy Risset [CompSys INRIA Rhône-Alpes].

Whenever a designer integrates a dedicated hardware accelerator (intellectual property or IP) in a SoC, he/she must implement an input/output protocol, composed of software and hardware parts. The software part is usually called the *driver*, the hardware part the *interface*.

While the high-level synthesis research community has focused on deriving efficient dedicated hardware accelerators from high-level specifications, the problem of automatically generating an efficient interface between these accelerators and the rest of the SoC has received little attention.

However, as most designers can tell, such an interface is often the most tedious and error-prone part of a design, and it often strongly influences the actual performance benefit of hardware acceleration. The problem is even more acute for stream-processing applications: massive parallelism is available but can be ruined by inefficient handling of data-stream communications.

We have approached this problem in the context of parallel processor array architectures (similar to those derived with the MMAlpha environment). In particular, we proposed to formulate the problem as a classical resource-constrained problem and, thanks to recent optimization techniques [53], we were able to define conditions for obtaining a conflict-free input/output schedule for multi-dimensional processor arrays (e.g., 2D grids) [34].

Since the schedule is static, it allows further optimizations, such as grouping successive data into packets to operate in burst mode. A comparison (targeting FPGA technology) between our static schedule and run-time congestion resolution showed significant savings in hardware area, while preserving the design clock period.
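Because the I/O schedule is known at compile time, consecutive accesses can be grouped offline. The sketch below takes a static list of addresses in schedule order and merges runs of consecutive addresses into bursts of bounded length. The burst limit and the address layout are assumptions for the example, not parameters of the work described.

```python
def group_bursts(addresses: list[int], max_burst: int) -> list[tuple[int, int]]:
    """Merge a statically scheduled address sequence into (start, length)
    bursts of consecutive addresses, each at most max_burst long."""
    bursts: list[tuple[int, int]] = []
    for addr in addresses:
        if bursts:
            start, length = bursts[-1]
            if addr == start + length and length < max_burst:
                bursts[-1] = (start, length + 1)
                continue
        bursts.append((addr, 1))
    return bursts

# Hypothetical output schedule at the boundary of a processor array.
schedule = [100, 101, 102, 103, 104, 200, 201, 105]
print(group_bursts(schedule, max_burst=4))
# [(100, 4), (104, 1), (200, 2), (105, 1)]
```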

We are currently working on an extension of our hardware interface model that would take advantage of this static I/O schedule to allow data prefetching and buffered write techniques, combined with a custom scratch-pad memory.

6.2.1.4. ARMOR architecture description language

**Participant:** François Charot.
Our research aims at developing methods to model programmable processors through their instruction sets, and tools to derive software development environments from these processor models. A processor description in ARMOR is a grammar in which each derivation is a possible behavior of the instruction set. ARMOR thus describes the behavior of the instruction set, including its semantics, timing information, resource usage, and the possibilities of instruction-level parallelism.
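The idea that each derivation of a grammar is a possible instruction behavior can be pictured with a toy grammar and an enumerator of its derivations. The grammar below is plain Python data, not ARMOR syntax, and the instruction set is hypothetical; ARMOR descriptions additionally carry semantics, timing and resource usage.

```python
from itertools import product

# Non-terminals map to alternatives; each alternative is a sequence of symbols.
GRAMMAR = {
    "instr": [["alu_op", " ", "reg", ", ", "src"], ["nop"]],
    "alu_op": [["add"], ["sub"]],
    "src": [["reg"], ["imm"]],
    "reg": [["r0"], ["r1"]],
    "imm": [["#k"]],
}

def derivations(symbol: str) -> list[str]:
    """Every terminal string derivable from symbol (grammar is non-recursive)."""
    if symbol not in GRAMMAR:
        return [symbol]  # terminal
    results = []
    for alternative in GRAMMAR[symbol]:
        for parts in product(*(derivations(s) for s in alternative)):
            results.append("".join(parts))
    return results

variants = derivations("instr")
print(len(variants))  # 13 = 2 ops * 2 dest regs * 3 sources + nop
print(variants[:3])
```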

The future objective is to extend this work to the modelling of reconfigurable and specialized SoC architectures, with the goal of exploiting such models in retargetable compilation flows.

6.2.2. Floating-point to fixed point transformation

6.2.2.1. Floating-point to fixed-point conversion methodology

Participants: Daniel Menard, Nicolas Hervé, Daniel Chillet, Romuald Rocher, Olivier Sentieys.

An analytical methodology to automatically convert a floating-point application into an optimized fixed-point implementation has been defined. Previous work targeted DSP processors; the methodology now also targets ASIC and FPGA architectures [38].

The user specifies a timing constraint and an accuracy constraint for the application (the latter expressed as the minimum output signal-to-quantization-noise ratio). The methodology then converts the application to fixed point. For a DSP, it generates code optimized for the processor. For an FPGA or an ASIC, it determines (based on a component library) the number of operators needed for each operator type, distributes the operations among the operators, and minimizes the operator word-lengths so that the global cost (either area or power consumption) is minimal while the constraints are met.
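The accuracy side of this optimization can be illustrated by its simplest instance: find the smallest number of fractional bits such that rounding a signal to that precision keeps the signal-to-quantization-noise ratio above the user's constraint. The sketch below measures the SQNR by simulation on a sample signal, whereas the methodology described here evaluates accuracy analytically; the test signal and the 60 dB target are illustrative assumptions.

```python
import math

def quantize(x: float, frac_bits: int) -> float:
    """Round to the nearest multiple of 2**-frac_bits (ties to even)."""
    step = 2.0 ** -frac_bits
    return round(x / step) * step

def sqnr_db(signal: list[float], frac_bits: int) -> float:
    p_signal = sum(x * x for x in signal)
    p_noise = sum((x - quantize(x, frac_bits)) ** 2 for x in signal)
    return math.inf if p_noise == 0 else 10 * math.log10(p_signal / p_noise)

def min_frac_bits(signal: list[float], target_db: float, max_bits: int = 32) -> int:
    """Smallest fractional word-length meeting the SQNR constraint."""
    for bits in range(max_bits + 1):
        if sqnr_db(signal, bits) >= target_db:
            return bits
    raise ValueError("constraint not reachable within max_bits")

signal = [0.9 * math.sin(0.01 * n) for n in range(1000)]  # test signal
bits = min_frac_bits(signal, target_db=60.0)
print(bits, sqnr_db(signal, bits))
```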

This automatic analytical methodology is limited to systems without feedback signals. For systems with feedback, we are studying an approach based on quality-configurable IPs. IPs such as LMS and D-LMS have been developed. The objective is to integrate these quality-configurable IPs as subsystems in the global optimization methodology.

6.2.2.2. Fixed-point accuracy evaluation for non-linear systems

Participants: Daniel Menard, Romuald Rocher, Pascal Scalart, Olivier Sentieys.

An important part of the floating-point to fixed-point conversion process is evaluating the accuracy of the fixed-point specification. The goal is to extend our previous work to obtain an analytical accuracy evaluation method for all kinds of systems, with particular attention to adaptive systems.

A general model has been developed for adaptive filters based on the gradient algorithm. The expression of the output quantization noise power has been derived for the different variants of the LMS algorithm (classical LMS, NLMS, Leaky LMS) [46] and for the APA (affine projection algorithm) [45]. The APA algorithm was introduced in 1984, but no fixed-point study of it had been published. This algorithm converges faster than the LMS variants, and its convergence time is independent of the statistical parameters of the input signal.

Our work focuses on the output quantization noise power of non-linear recursive systems. Two models have been proposed. The first model, which does not take into account the correlation between signals, yields a simple analytical model. The second model, which is more complex and based on unrolling the recurrence, gives more accurate results; it also has the advantage of being more general and can be extended to all types of systems. A study including more complex operators, such as sign operators, is under way. Our models are valid for the different quantization laws (truncation and rounding). Model quality has been evaluated by comparing our estimates with simulation results.
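For reference, analytical models of this kind build on the classical assumption that the error of each quantization to a step $q = 2^{-b}$ of a continuous-amplitude signal is a uniformly distributed white noise. Its moments for the two quantization laws mentioned above are recalled below, with truncation taken as two's-complement truncation toward $-\infty$. This is standard background added for context, not a result of this work.

$$
\begin{aligned}
\text{rounding:} &\quad e \in [-q/2,\, q/2], \quad \mu_e = 0, \quad \sigma_e^2 = \frac{q^2}{12}, \quad \mathbb{E}[e^2] = \frac{q^2}{12},\\
\text{truncation:} &\quad e \in (-q,\, 0], \quad \mu_e = -\frac{q}{2}, \quad \sigma_e^2 = \frac{q^2}{12}, \quad \mathbb{E}[e^2] = \mu_e^2 + \sigma_e^2 = \frac{q^2}{3}.
\end{aligned}
$$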

6.2.3. Specialized SoC architecture modeling

6.2.3.1. System modelling for dynamically reconfigurable architectures

Participants: Imène Benkermi, Didier Demigny, Daniel Chillet, Sébastien Pillement, Olivier Sentieys.

SoC platforms including dynamically reconfigurable units aim at supporting complex multimedia applications using a real-time operating system. They consist of different execution modules, i.e. general-purpose processor(s) and specialized/reconfigurable accelerators, including the DART dynamically reconfigurable architecture. Their heterogeneity led us to specify them at three levels of description: software, middleware and hardware. Each level corresponds to a task configuration on the platform. Hence, the operating system must provide specific services imposed by reconfigurability, namely task communication and migration between the three levels of description.

The software model describes the different tasks of the application supported by the architecture platform. A UML-based model has been used. In addition to task characteristics, real-time constraints and links between tasks, this model makes it possible to specify the different forms a task can take depending on the target on which it will eventually execute. This work has been carried out in collaboration with the ETIS (Cergy-Pontoise) and LESTER (Lorient) laboratories.

After surveying the most important real-time scheduling algorithms for SoCs, especially those focusing on power consumption, we have proposed an approximate on-line scheduling algorithm based on Artificial Neural Networks (ANN). This scheduler distributes the task set over the different computing units while meeting the tasks' real-time constraints and taking into account the units' heterogeneity, i.e. one task may have different execution times depending on the unit it executes on. Simulations of a multimedia application on a platform prototype based on these specification models are being conducted. Extending this algorithm to take power consumption into account will be the next challenge.

6.2.3.2. Modeling data-flow architectures using Alpha

Participants: Madeleine Nyamsi, Patrice Quinton, François Charot, Charles Wagner.

Our research aims at developing methods and tools to synthesize parallel architectures for data-intensive applications expressed using the Alpha applicative language. These methods are implemented in the MMA software.

The Alpha language allows systems to be modeled using structured descriptions: some components can be represented separately and later instantiated as elementary blocks in a larger application. In many applications these blocks have different clock rates; this is the case, for example, in the WCDMA (Wideband Code Division Multiple Access) air interface. We have been able to represent multi-rate systems in Alpha by adding special components that model up- and down-samplers, and we have extended the structured scheduler of MMA to find the rates of all elementary blocks as well as the detailed schedule of each block. This research is described in full detail in the Ph.D. thesis of Madeleine Nyamsi [13].
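Finding the rates of all blocks in a multi-rate description amounts to propagating rate ratios through the up- and down-samplers: a block after an up-sampler by U runs U times faster than the block before it, and a block after a down-sampler by D runs D times slower. The sketch below solves these balance relations with exact fractions on a block graph and reports an inconsistency when two paths disagree. It follows synchronous-dataflow-style reasoning and is not MMAlpha's structured scheduler; the block names and factors are hypothetical.

```python
from collections import deque
from fractions import Fraction

def block_rates(edges, reference):
    """edges: (src, dst, factor) meaning rate(dst) = factor * rate(src).
    Returns rates relative to the reference block, or raises if inconsistent."""
    graph = {}
    for src, dst, f in edges:
        graph.setdefault(src, []).append((dst, Fraction(f)))
        graph.setdefault(dst, []).append((src, 1 / Fraction(f)))
    rates = {reference: Fraction(1)}
    queue = deque([reference])
    while queue:
        block = queue.popleft()
        for nxt, f in graph[block]:
            r = rates[block] * f
            if nxt not in rates:
                rates[nxt] = r
                queue.append(nxt)
            elif rates[nxt] != r:
                raise ValueError(f"inconsistent rates at {nxt}")
    return rates

# Hypothetical chain: symbols are up-sampled by 4, then down-sampled by 4.
edges = [("source", "spreader", 4), ("spreader", "despreader", Fraction(1, 4)),
         ("despreader", "sink", 1)]
print(block_rates(edges, "source"))
```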

6.2.3.3. Co-design with data-flow and polyhedral models

Participants: Anne-Marie Chana, Patrice Quinton.

This work is carried out in cooperation with the Espresso project-team. The context is the design of integrated circuits for multimedia applications using data-flow and polyhedral models jointly. The objective is to take advantage of both models to optimize systems containing both control aspects and intensive computations. The study relies on two modeling platforms: Polychrony with Signal, and MMA with Alpha.

6.2.3.4. SoC modeling and prototyping on FPGA-based systems

Participant: François Charot.

R2D2 participates in the SocLib initiative (http://soclib.lip6.fr), whose goal is to build an open platform for modeling and simulating multiprocessor systems-on-chip. The core of the platform is a library of simulation models for virtual components (IP cores), with a guaranteed path to silicon.

Thanks to the increasing capacity of FPGA components, it is now possible to integrate a significant number of processors in a single programmable FPGA. Although any processor core can be synthesized in such a component, FPGA manufacturers also design their own processor architectures. For example, Altera proposes the Nios processor core, available in three variants (economy, standard, fast). It is a configurable processor core (pipeline depth, cache size, custom instructions, etc.).

As part of our participation, we study the feasibility of prototyping SocLib platforms using such FPGA components. To this end, we have developed a SocLib model of Altera's Nios processor core. The goal of this work is to establish a link between a SocLib simulation platform and its prototyping on an FPGA system.

6.3. Study of applications

**Keywords:** WCDMA, biomedical, image indexing, intrusion detection in hardware, mobile telecommunication, speech processing.

Applications stemming from third-generation radio-communication systems are good candidates for the study of hardware systems mixing programmable parts executing software code and specialized modules dedicated to the acceleration of time consuming parts of applications.

Data filtering, cryptography and traffic filtering in high-speed networks, and speech processing are also under consideration.

6.3.1. Radio-communication systems

6.3.1.1. Mobile communication systems prototyping (3G, MIMO)

**Participants:** François Charot, Olivier Berder, Michel Guittion, Daniel Ménard, Madeleine Nyamsi, Patrice Quinton, Taofik Saidi, Pascal Scalart, Olivier Sentieys, Charles Wagner.

Our experiments rely on the SignalMaster prototyping platform (see note 14), which allows applications described in Simulink to be executed on a special-purpose board including a DSP processor and an FPGA chip (SignalMaster platform from Lyrtech Inc.). Different implementations of the WCDMA transmitter-receiver have been realized. This research is a preliminary step in the study of fast estimation techniques for SoC design and is described in full detail in the Ph.D. thesis of Madeleine Nyamsi [13].

In wireless communications, using more than one antenna at both the transmitter and the receiver increases the spectral efficiency of data transmission. The high complexity of the MIMO (Multiple Input Multiple Output) technique calls for dedicated real-time, high-performance architectures. A 2x2 real-time MIMO prototype based on the WCDMA (Wideband Code Division Multiple Access) standard has been designed. It is compliant with the High-Speed Uplink Packet Access (HSUPA) technology, defined as a possible extension of the UMTS Terrestrial Radio Access Network (UTRAN) uplink. The system is designed on a rapid prototyping platform from Lyrtech Inc., the SignalMaster platform, which is based on FPGA and DSP circuits. This work is carried out in collaboration with Lyrtech Inc. and with the LRTS laboratory of Université Laval in Québec, Canada.

6.3.2. Content-based image retrieval hardware acceleration

**Participants:** Steven Derrien, Auguste Noumsi, Patrice Quinton, Laurent Amsaleg [Texmex].

Content Based Image Retrieval (CBIR) is a technique that allows one to retrieve images of a data base which are (at least) partly similar to a given reference image. CBIR is drawing increasing interest due to its potential application to problems such as image copyright enforcement. Indeed, the large use of Internet resulted in a huge increase of Web available multimedia content, especially images. Checking copyright is therefore a concern for image owners which must be able to identify undue use of images. This identification process relies upon precise and fast image comparison algorithms as Internet is a rapidly changing support and such algorithms need to be run on a daily basis.

Although accurate search techniques based on local image descriptors exist, they suffer from very long execution time (retrieving an image among a 30,000 image data base requires about 1,500 seconds on a standard workstation). To make these techniques attractive, we have studied the possibility to accelerate CBIR through the use of a specific hardware design architecture, the target machine being the RDISK cluster \[22\].

In particular we have proposed some modifications to the original algorithm, so that it would better fit a hardware implementation. Among other issues, we have been looking at converting the original floating point based algorithm into a fixed point implementation and proposed to substitute the initial $L_2$ distance metric by

\[http://www.lyrtech.com/DSP-development/dsp_fpga/signalmaster.php\]
the simpler $L_1$. These transformations were not straight-forward since their impact on the quality of results of the search had to be clearly quantified.

We have therefore validated our approach on a $3 \cdot 10^4$-image database, and we have shown that, with an adequate 8-bit fixed-point encoding, we obtain a search accuracy similar to that obtained with floating point. While very useful for a hardware implementation, this result is also of interest to the image processing community, since it reduces the size of descriptor databases by a factor of 4.

Using these results, we have designed a hardware architecture targeting FPGA technology. Its implementation on the RDISK platform is part of our ongoing work.
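The two transformations discussed above (8-bit fixed-point descriptors and the $L_1$ metric) can be illustrated with a minimal Python sketch. It is not the team's design: the descriptor dimension, the value range $[-1, 1)$, the scale factor 127 and all function names are hypothetical. It shows that one signed byte per component replaces a 4-byte float (hence the factor-4 size reduction, assuming 32-bit floats in the original database) and that the $L_1$ distance needs only subtractions, absolute values and additions, with no multiplier or square root, which suits an FPGA datapath.

```python
import math
import random

# Hypothetical settings: descriptor components assumed to lie in [-1, 1)
# and mapped to signed 8-bit integers with a scale factor of 127.
SCALE = 127
DIM = 24  # hypothetical descriptor dimension


def quantize(desc):
    """Float descriptor -> signed 8-bit fixed point (1 byte instead of 4 per component)."""
    return [max(-128, min(127, round(x * SCALE))) for x in desc]


def l2(a, b):
    """Euclidean distance, as in the original floating-point algorithm."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def l1(a, b):
    """Manhattan distance: subtractions, absolute values and additions only."""
    return sum(abs(x - y) for x, y in zip(a, b))


def nearest(query, database, dist):
    """Index of the database entry closest to the query for the given metric."""
    return min(range(len(database)), key=lambda i: dist(query, database[i]))


if __name__ == "__main__":
    random.seed(0)
    db = [[random.uniform(-1, 1) for _ in range(DIM)] for _ in range(200)]
    # Query: entry 42 with a small distortion, standing in for a modified copy.
    query = [x + random.uniform(-0.02, 0.02) for x in db[42]]
    db_q = [quantize(d) for d in db]
    print(nearest(query, db, l2), nearest(quantize(query), db_q, l1))
```

On such random data, a slightly perturbed copy of a database entry is retrieved as the nearest neighbour both with floating-point $L_2$ and with 8-bit $L_1$; this toy check does not replace the quantitative validation on real image descriptors described above.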

### 6.3.3. Intrusion detection system in hardware

**Participants:** Georges Adouko, François Charot.

The dynamic nature of security systems, whose anti-intrusion mechanisms (filtering at the packet, connection and application levels) evolve according to the modes and levels of protection, is, to our knowledge, a challenge beyond the reach of classical technologies based on general-purpose or network processors. The security requirements of high-speed networks (from 10 to 40 Gbit/s) impose the implementation of the filtering rules in appropriate hardware structures. The challenge is both to handle a large variety of complex processing tasks and to guarantee quality of service. Today, only dedicated solutions can overcome the bottleneck related to implementation complexity, at the price of an obvious lack of flexibility and no capacity to evolve.

The aim of our research is the design of specialized hardware systems for high-speed filtering of network traffic. Although the work mainly concerns the study of efficient and predictable filtering techniques and their implementation on FPGA programmable components, our approach rests on a system view of the intrusion detection system and envisions specialized systems combining software and hardware modules.

### 6.3.4. Noise reduction in speech processing

**Participant:** Pascal Scalart.

The problem of enhancing speech degraded by additive noise, when only a single observation is available, is still an active field of research. Noise reduction is useful in many applications, such as voice communication and automatic speech recognition, where efficient noise reduction techniques are required. Today, efficient noise reduction techniques are mainly based on the estimation of a short-time spectral gain, which is a function of the *a priori* Signal-to-Noise Ratio (SNR) and/or the *a posteriori* SNR. State-of-the-art noise reduction systems estimate the *a priori* SNR using the decision-directed (DD) approach proposed by Ephraïm and Malah [55]. However, it has been shown that the *a priori* SNR estimated with the DD approach follows the shape of the *a posteriori* SNR with a one-frame delay. Consequently, since the spectral gain depends on the *a priori* SNR, it does not match the current frame, and the performance of the noise suppression system is degraded.
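The decision-directed estimator and the frame delay it introduces can be sketched for a single frequency bin. This is an illustration, not the team's implementation: it assumes a known, stationary noise PSD $\gamma_n$, the commonly used smoothing factor $\beta = 0.98$ and a Wiener gain $G = \xi/(1+\xi)$ (Ephraïm and Malah originally used the MMSE-STSA gain); the function name is hypothetical. The estimator implemented is $\xi_{DD}(p) = \beta\,|\hat S(p-1)|^2/\gamma_n + (1-\beta)\max(\mathrm{SNR}_{post}(p)-1, 0)$, with $\mathrm{SNR}_{post}(p) = |X(p)|^2/\gamma_n$.

```python
def dd_a_priori_snr(noisy_power, noise_psd, beta=0.98):
    """Decision-directed a priori SNR for one frequency bin over successive frames.

    noisy_power[p] is |X(p,k)|^2 and noise_psd is gamma_n(k), assumed known
    and stationary. Returns (a posteriori SNR, DD a priori SNR, Wiener gain)
    as lists indexed by frame.
    """
    prev_clean_power = 0.0  # |S_hat(p-1,k)|^2, taken as zero before the first frame
    snr_post, xi_dd, gains = [], [], []
    for x2 in noisy_power:
        post = x2 / noise_psd
        # Weighted mix of the previous-frame estimate and the current
        # (half-wave rectified) instantaneous SNR: the source of the delay.
        xi = beta * prev_clean_power / noise_psd + (1.0 - beta) * max(post - 1.0, 0.0)
        g = xi / (1.0 + xi)  # Wiener gain (assumed gain function)
        prev_clean_power = (g ** 2) * x2  # |S_hat(p,k)|^2 = G^2 |X(p,k)|^2
        snr_post.append(post)
        xi_dd.append(xi)
        gains.append(g)
    return snr_post, xi_dd, gains


if __name__ == "__main__":
    # Speech onset at frame 5: the a posteriori SNR jumps from 1 to 101.
    post, xi, gain = dd_a_priori_snr([1.0] * 5 + [101.0] * 5, noise_psd=1.0)
    print(post[5], xi[5], xi[6])
```

With this input, the a posteriori SNR reaches 101 at frame 5, while the DD estimate is only 2.0 at that frame and about 46 at frame 6: the one-frame delay described above.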

To refine the estimation of the *a priori* SNR, we have recently proposed a new method, called Two-Step Noise Reduction (TSNR), which removes the drawbacks of the DD approach while keeping its main advantage, *i.e.* a highly reduced musical noise level. Its major advantage lies in the suppression of the frame-delay bias, which cancels the annoying reverberation effect characteristic of the DD approach. Moreover, when classical short-time suppression techniques (including TSNR) are used, some harmonics are considered as noise-only components and are consequently suppressed by the noise reduction process. This limitation is due to the noise power spectrum estimation, which is a very difficult task for single-channel noise reduction techniques. To overcome it, we have proposed a second method, called Harmonic Regeneration Noise Reduction (HRNR), that takes into account the harmonic characteristics of the input signal [44]. When the SNR is low, classical techniques (including TSNR) suffer from harmonic distortions, mainly due to estimation errors introduced by the noise power spectral density (PSD) estimator. To solve this problem, a non-linearity is used to efficiently regenerate the degraded harmonics of the distorted signal. The resulting artificial signal helps to refine the *a priori* SNR, which is then used to compute a spectral gain that preserves speech harmonics and hence avoids distortions. The role of the non-linearity and the principle of harmonic regeneration are detailed and analyzed in [25]. Experimental results, in terms of both objective and subjective measures, confirm the significant performance improvement brought by the HRNR technique.
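The second TSNR step and the harmonic regeneration principle can be sketched per frequency bin. The following assumptions come from the published descriptions of TSNR and HRNR rather than from this report: the second step re-estimates the *a priori* SNR from the current frame as $\xi_{TSNR}(p) = |G_{DD}(p)\,X(p)|^2/\gamma_n$, the non-linearity is half-wave rectification $\max(s, 0)$, and the HRNR *a priori* SNR mixes the TSNR estimate $|\hat S|^2$ and the regenerated spectrum $|S_{harmo}|^2$ with a weight $\rho$ assumed equal to the TSNR gain. Function names are hypothetical, and a naive DFT replaces an FFT library to keep the example self-contained.

```python
import cmath
import math


def wiener(xi):
    """Wiener spectral gain as a function of the a priori SNR."""
    return xi / (1.0 + xi)


def tsnr_gain(g_dd, noisy_power, noise_psd):
    """TSNR second step for one bin: a priori SNR re-estimated from the
    current-frame DD output, |G_DD X|^2 / gamma_n, then a new Wiener gain."""
    xi_tsnr = (g_dd ** 2) * noisy_power / noise_psd
    return wiener(xi_tsnr)


def half_wave_rectify(signal):
    """Non-linearity assumed for harmonic regeneration: NL(s) = max(s, 0)."""
    return [max(s, 0.0) for s in signal]


def dft_magnitude(signal, k):
    """|DFT bin k| computed directly (O(N) per bin, fine for a toy example)."""
    n = len(signal)
    return abs(sum(s * cmath.exp(-2j * math.pi * k * t / n)
                   for t, s in enumerate(signal)))


def hrnr_a_priori_snr(rho, clean_power, harmo_power, noise_psd):
    """HRNR a priori SNR: weighted mix of |S_hat|^2 and |S_harmo|^2 (rho assumed = G_TSNR)."""
    return (rho * clean_power + (1.0 - rho) * harmo_power) / noise_psd


if __name__ == "__main__":
    # Onset frame: the DD gain is still 2/3 while |X|^2 = 101 and gamma_n = 1.
    print(tsnr_gain(2.0 / 3.0, 101.0, 1.0))
    n = 64
    tone = [math.cos(2 * math.pi * 4 * t / n) for t in range(n)]  # tone at bin 4
    rect = half_wave_rectify(tone)
    print(dft_magnitude(tone, 8), dft_magnitude(rect, 8))  # bin 8 = 2nd harmonic
```

For an onset frame where the DD gain is still 2/3, `tsnr_gain` returns about 0.98, so the gain follows the current frame. Rectifying a pure tone at DFT bin 4 produces a clearly non-zero component at bin 8, its second harmonic, which is the effect exploited to restore harmonics lost during the first estimation.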

6.3.5. Intelligent transport system (ITS)

**Participants:** Olivier Berder, Daniel Ménard, Olivier Sentieys.

Transportation systems play a critical role in virtually all facets of modern life, and significant challenges remain to further improve the efficiency and safety of current systems. The Brittany Region Council and the Côtes d'Armor Department Council are currently investing in this research area and recently created a Scientific Interest Group on Intelligent Transportation Systems (ITS), headquartered at ENSSAT, Lannion. Our research team actively participates in this new activity, especially in projects concerning the deployment of new energy-efficient architectures for ITS.

With low-cost transmission systems embedded in road signs, drivers will be able to receive information on traffic conditions and access new services. To minimize deployment costs, multi-antenna techniques can be used to significantly reduce energy consumption. Indeed, each crossroad comprises several road signs and can be considered as a communication node, able to reduce the energy needed to communicate with neighboring nodes through the cooperation of its different elements. This study will lead to the realization of a transmission prototype between several crossroads and vehicles. This software/hardware platform, including DSPs and FPGAs, will pay particular attention to consumption constraints, since the system performance will be adapted in real time to the energy available in each road sign.

7. Contracts and Grants with Industry

7.1. OSGAR (2003-2005)

**Participants:** Daniel Chillet, Nicolas Hervé, Daniel Ménard, Sébastien Pillement, Olivier Sentieys.

OSGAR is an RNTL project gathering the following partners: CEA-list, TNI-Valiosys, the University of Western Brittany, and R2D2.

This project aims at studying and developing high-level synthesis tools able to carry out an automatic migration from C code, semi-formal specifications or object code towards one or more reconfigurable circuits. The team's work addresses the following points.

- The adaptation of the tools of the BSS circuit design software platform to reconfigurable architectures, in order to automatically take into account the data coding and the size of the operators. This requires a more precise definition of the exchange format between the tools provided by the project partners. An XML application has been defined and new interfaces have been added to each tool. This was demonstrated during the RNTL days (4-5 October, Rennes).

- The modeling of reconfigurable architectures from the point of view of the developed software tools. The objective is to integrate power consumption aspects into the models, in order to provide estimates of the power dissipated during prototyping.

- The validation by the implementation of two applications (image processing, WCDMA) on the various architectures considered in the project.

7.2. Fastnet

**Participants:** Georges Adouko, François Charot.

The Fastnet project was contracted in March 2005. It is funded by the Brittany Region and involves ENST Bretagne. It tackles the problem of high-rate filtering using architectures based on reconfigurable components, which allow specific algorithms to be implemented at the hardware level and thus exhibit a high degree of parallelism.

8. Other Grants and Activities

8.1. National initiatives

The R2D2 team participates in the activities of two multi-laboratory teams of the CNRS RTP (Pluridisciplinary Thematic Network) SoC: Pomard and SocLib. R2D2 members take part in the specific action *Low Power Design (AS106).*

The R2D2 team participates in the activities of:

- GdR-PRC ISIS (*Information Signal ImageS*), working group GT7 *Algorithms Architectures Adequation*.
- GdR-PRC ARP (*Architectures Réseaux et Parallélisme*), working group *Specialized architectures.*

8.1.1. ReMiX: Reconfigurable Memory for Indexing Huge Amount of Data

**Participants:** Gilles Georges, Steven Derrien, François Charot.

Indexing is a well-known technique that accelerates searches within large volumes of data, such as those handled by applications in genomics or in content-based image or text retrieval.

The ReMiX project proposes the design of a dedicated and very large index memory (several hundred gigabytes, distributed among a cluster of nodes), large enough to store huge indexes entirely and avoid the use of any disk.

In addition, the index memory uses reconfigurable hardware resources to tailor the memory management, at the hardware level, to the specific properties of the indexing schemes. It also offers the opportunity to implement algorithms that exhibit potential parallelism.

A hardware platform based on Flash memory technology is being developed by the R2D2 team. The platform consists of several computing nodes connected through a high-performance network interface. Each node is based on a Xilinx FPGA processing element coupled to 64 Gbytes of Flash memory. This approach combines the benefits of hard-drive storage (non-volatility, density) with those of memory (bandwidth, access time) to efficiently support large indexed databases.

This three-year project (October 2003 - September 2006), coordinated by the Symbiose project, is funded by the French ministry (ACI Data Mass program). The R2D2 team is strongly involved in the design of the hardware platform.

8.2. International bilateral relations

8.2.1. Europe

R2D2 cooperates with the University of Leiden in the Netherlands (Ed Deprettere) on parallel architecture synthesis.

R2D2 cooperates with UCL at Louvain-La-Neuve on the topic of ternary technology integrated circuits. A prototype circuit has been developed with the SOI technology of the micro-electronics laboratory (DICE of UCL).
R2D2 cooperates with Lund University (Sweden) on the application of constraint programming to the reconfigurable data-path synthesis flow.
R2D2 cooperates with the University of Girona in Spain (Computer Vision and Robotics Group of the Institute for Informatics and Applications) on parallel architectures for vision algorithms applied to underwater robots.

8.2.2. Africa
R2D2 cooperates with ENIT in Tunis on the topic of mobile telecommunication architectures.
R2D2 cooperates with the university of Antananarivo and the Polytechnic Superior School of Antananarivo in Madagascar, for the training of faculty members.

8.2.3. North America
R2D2 maintains relations with the computer science department of Colorado State University in Fort Collins on the development of MMAlpha.
R2D2 cooperates with the LSSI laboratory of the University of Trois-Rivières in Québec on the design of filter architectures.
R2D2 cooperates with Los Alamos National Laboratory (USA) on optimized implementations of reconfigurable architectures.
R2D2 cooperates with the University of California, Riverside, on the synthesis of optimized image processing applications.
R2D2 cooperates with the LRTS laboratory of Laval University in Québec on architectures for MIMO systems.

8.3. Visiting scientists

- Viorela Ila (University of Girona, Spain) from 02/03/05 for 1 week.
- S. Roy (University of Laval, Québec) from 11/12/2005 for 2 weeks, H. Bertrand and L. Dupond (University of Laval, Québec) from 10/10/2005 for 4 weeks.

9. Dissemination

9.1. Activities in the scientific community

O. Sentieys is a steering committee member of the SOC-SIP Expert Group of the STIC department of the CNRS. He is the chair of the French chapter of the IEEE Circuits and Systems (CAS) society. He has been a member of the French National University Council in signal processing and electronics (Conseil National des Universités en 61ème section) since 2000. He was a member of the technical committees of the following conferences: DDECS, ISQED, DCIS, VTC, SYMPA, GRETSI, JFCAA.

P. Quinton is a member of the steering committee of the System Architecture MOdelling and Simulation (SAMOS) workshop.

P. Scalart is the head of the electronics engineering department at Enssat.

9.2. Teaching and responsibilities

O. Berder teaches a course on processor architectures and signal processing at Enssat.
D. Chillet teaches a course on advanced processor architectures in the Master STIR.
H. Dubois is the associate academic director at Enssat.
M. Guitton is in charge of communication at Enssat.
L. Perraudeau is responsible for a course on object-oriented languages in the DESS Isa (Computer science and its applications) of the University of Rennes 1, teaches integrated circuit design in DIIC (second year), and teaches in the Licence d'informatique and in Deug Sciences, mention SM and STPI.
P. Quinton is responsible for the parallel algorithms course (Alpa module) in the Master in computer science of the University of Rennes 1, and teaches in Deug Sciences, mention SM and STPI, and in DIIC (second and third year). P. Quinton is deputy director of the Ecole Normale Supérieure de Cachan, in charge of the Brittany branch of this school.
O. Sentieys is responsible for a signal and architecture module of the Master STI of the University of Rennes 1 and for the DRT in electronics at Enssat. He teaches at Enssat and gives courses on methodologies for integrated system design in the Master STI and on low-power digital CMOS circuits at Enst de Bretagne.
C. Wolinski is responsible for the Computer Organization and Architecture branch in DIIC. He is responsible for the following courses: CSE "Design of Embedded Systems" (DIIC), SIA "Signal, Image, Architectures" (DIIC), XAA "Advanced Architectures" (ENSC).

Graduate student interns: Sébastien Crase, Erwan Raffin, Delphine Reeb (Ifsic, France), Kevin Martin (ETGL, France), Julien Lallet, Erwan Grâce, Fabrice Santoro, Ludovick Lepaulou, Fabrice Barriere, Nicolas Mechouk, Javier Longares, Lalit Garg.

10. Bibliography

Major publications by the team in recent years


**Doctoral dissertations and Habilitation theses**


**Articles in refereed journals and book chapters**


### Publications in Conferences and Workshops


Bibliography in notes


