Team Cairn


Section: New Results

Compilation and Synthesis for Reconfigurable Platform

Participants : Steven Derrien, Emmanuel Casseau, Daniel Ménard, François Charot, Christophe Wolinski, Olivier Sentieys, Patrice Quinton.

Optimized Synthesis of Processor Extensions in the DURASE System

Participants : Christophe Wolinski, François Charot, Erwan Raffin, Kevin Martin, Antoine Floch.

This year, within the DURASE system, we have focused on optimizing the synthesis of processor extensions. We have developed an original method, based on constraint programming, that globally minimizes the number of FPGA logic elements used to implement the processor extensions.

The abstract generic architecture model of a processor extension is depicted in Figure 4. It is composed of a processor interface (Figure 4 shows the NIOSII processor interface), a set of processing units U, a set of registers R, and two sets of multiplexers, MAS and MBS. The number of registers is parametrized, and each register is identified by a unique number rid. The number and types of processing units are also parametrized.

Figure 4. Generic architecture model.

The processing units can be heterogeneous, i.e., each unit can execute a specific set of complex operations. In general, a unit can contain a data-path that is run-time reconfigurable at the functional level. Each processing unit is identified by a unique number uid and can have several input and output ports (P), each identified by a unique number pid. Only two operands can be sent by the processor to an extension during one cycle, using buses MAS and MBS. This limit is set by the processor interface and applies, in our case, to the NIOSII processor. Likewise, only one result can be sent back by an extension to the processor during one cycle; this assumption is also specific to the NIOSII processor. Data transferred from the processor to the extension, and data passed directly (without processor intervention) between processing units, can be stored for further processing in the extension's internal register file.
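The parametrized model described above can be sketched as a small data structure. This is only an illustrative sketch: the class names, field names and the `fits_interface` helper are assumptions introduced here, not part of the DURASE tool. Only the identifiers rid, uid and pid and the two-operands-in, one-result-out limit of the NIOSII interface come from the text.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Port:
    pid: int          # unique port number within a unit
    direction: str    # "in" or "out" (hypothetical encoding)


@dataclass
class ProcessingUnit:
    uid: int                   # unique unit number
    operations: frozenset      # complex operations this unit can execute
    ports: list = field(default_factory=list)


@dataclass
class Extension:
    n_registers: int           # parametrized number of registers
    units: list                # parametrized, possibly heterogeneous units
    max_operands_in: int = 2   # NIOSII interface: two operands per cycle
    max_results_out: int = 1   # NIOSII interface: one result per cycle

    def __post_init__(self):
        uids = [u.uid for u in self.units]
        if len(uids) != len(set(uids)):
            raise ValueError("processing unit ids (uid) must be unique")

    @property
    def register_ids(self):
        # Each register is identified by a unique number rid.
        return list(range(self.n_registers))

    def fits_interface(self, n_operands, n_results):
        # True if one cycle's processor/extension transfers respect the interface.
        return (n_operands <= self.max_operands_in
                and n_results <= self.max_results_out)


mac = ProcessingUnit(
    uid=0,
    operations=frozenset({"mul_add"}),
    ports=[Port(0, "in"), Port(1, "in"), Port(2, "out")],
)
ext = Extension(n_registers=4, units=[mac])
print(ext.register_ids, ext.fits_interface(2, 1), ext.fits_interface(3, 1))
```

Keeping the interface limits as fields rather than constants reflects that they are, according to the text, specific to the NIOSII processor and could differ for another host.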

In this project, we first defined a constraint programming model of the generic architecture presented in Figure 4. We then built a corresponding tool, which minimizes the number of logic elements by taking the registers and multiplexers into account simultaneously. The synthesis results confirm the efficiency of our approach: on average, a 30% improvement in the number of FPGA logic elements needed to implement the processor extensions was observed (full details in the Ph.D. thesis of Kevin Martin [16]).
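To illustrate why registers and multiplexers are worth optimizing together, here is a toy sketch. It is not the constraint programming model of the tool: it uses plain exhaustive search, and the cost weights, the lifetime representation and the function names are all assumptions made for the example. Each value has a producer and a lifetime; values with overlapping lifetimes cannot share a register, and a register written by several distinct producers needs a multiplexer with one input per producer.

```python
from itertools import product


def overlaps(a, b):
    # Half-open lifetimes [start, end) overlap if each starts before the other ends.
    return a[0] < b[1] and b[0] < a[1]


def best_binding(values, n_registers, reg_cost=1, mux_input_cost=1):
    """values: name -> (producer, (start, end)). Returns (cost, binding) or None.

    Exhaustive search over all value-to-register bindings (toy sizes only).
    """
    names = sorted(values)
    best = None
    for assignment in product(range(n_registers), repeat=len(names)):
        binding = dict(zip(names, assignment))
        # Values whose lifetimes overlap must not share a register.
        if any(binding[a] == binding[b]
               and overlaps(values[a][1], values[b][1])
               for i, a in enumerate(names) for b in names[i + 1:]):
            continue
        producers = {}
        for name in names:
            producers.setdefault(binding[name], set()).add(values[name][0])
        # A register fed by k > 1 distinct producers needs a k-input multiplexer.
        mux_inputs = sum(len(p) for p in producers.values() if len(p) > 1)
        cost = reg_cost * len(producers) + mux_input_cost * mux_inputs
        if best is None or cost < best[0]:
            best = (cost, binding)
    return best


# Two values with disjoint lifetimes but different producers (hypothetical data).
vals = {"a": ("u0", (0, 1)), "b": ("u1", (1, 2))}
print(best_binding(vals, n_registers=2))                    # registers and muxes
print(best_binding(vals, n_registers=2, mux_input_cost=0))  # registers only
```

With the default weights, sharing one register would require a two-input multiplexer, so the search prefers two registers; when multiplexers are ignored, it merges both values into one register. A constraint solver explores the same kind of trade-off without enumerating every binding.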

Run-time Reconfigurable Architecture Modeling

Participants : Christophe Wolinski, François Charot, Emmanuel Casseau, Daniel Ménard, Antoine Floch, Erwan Raffin, Steven Derrien.

Roma project

We have continued to work on modeling the run-time partially reconfigurable ROMA processor in order to optimize the execution time of the application. The ROMA processor is composed of a set of coarse-grain reconfigurable operators, data memories, configuration memories, an operator network, a data network, a control network and a centralized controller. The centralized controller manages the configuration steps and the execution steps. The ROMA processor has three different interfaces: one data interface connected to the operator network, one control interface and one debug interface connected to the main controller. The reconfigurable operators are connected together via a dedicated network (called the operator-operator network) and to the data memories via another network (called the data memory-operator network). The local memories have their own programmable address generators. Figure 5 shows the block diagram of the ROMA processor. The main controller (Global CTRL) executes a C program defining synchronizations between the configuration and execution sequences.

Figure 5. Architecture of the ROMA processor: the control structure includes a Global CTRL and dedicated controllers for each module of the reconfigurable datapath. The reconfigurable datapath is composed of data memory banks, two interconnection networks and a set of coarse-grain reconfigurable operators.

To support this kind of architecture, a new extension of the DURASE system was developed (Figure 6).

Figure 6. DURASE global design flow overview.

As shown in Figure 6, the inputs to our system are an application program written in C and an abstract generic parallel run-time reconfigurable architecture model. The outputs are the C program and the configuration information (binary files) needed to manage the run-time reconfigurable ROMA architecture.

The newly developed system is part of the DURASE system (see Figure 6). It implements our new method, based on constraint programming (CP), which makes it possible to model complex run-time reconfigurable architectures together with their application programs. The model can then be used to perform scheduling, binding and routing while optimizing the application's execution time. Our system also contains a target-dependent back-end compiler (in our case, one supporting the ROMA architecture).

We have carried out extensive experiments to evaluate the quality of our newly developed system. All experiments were run on a 2 GHz Intel Core Duo under the Windows XP operating system. In our experiments, the ROMA abstract model was instantiated with 8 memories and 4 operators. All operators support the same types of computations, and the delay of a computation is the same regardless of its resource assignment. The following latencies were assumed: WRlat = RDlat = ope_opelat = 1. We also assumed that all data is stored in memories before processing starts. In 78% of the cases, our system provides optimal results, confirming the high quality of our scheduling, binding and routing system [69].
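As a rough illustration of the scheduling side of this experimental setup, the sketch below schedules a small dependency graph on 4 identical operators. It is a greedy list scheduler, not the CP-based method evaluated above, and it ignores binding, routing and memory accesses. It assumes that each computation takes one cycle and that ope_opelat is the operator-to-operator transfer latency; both are interpretations made for the example, and the function name and graph are hypothetical.

```python
def list_schedule(deps, n_operators=4, ope_ope_lat=1):
    """deps: node -> list of predecessor nodes. Returns node -> start cycle.

    Assumes every computation takes exactly one cycle on any operator.
    """
    start = {}
    remaining = set(deps)
    cycle = 0
    while remaining:
        # In an acyclic graph, the gap between two scheduling events is at
        # most 1 + ope_ope_lat cycles, so exceeding this bound means a cycle
        # or a predecessor that is not a node of the graph.
        if cycle > len(deps) * (1 + ope_ope_lat):
            raise ValueError("cyclic or incomplete dependency graph")
        ready = sorted(
            n for n in remaining
            if all(p in start and start[p] + 1 + ope_ope_lat <= cycle
                   for p in deps[n])
        )
        # At most n_operators computations can start in the same cycle.
        for n in ready[:n_operators]:
            start[n] = cycle
            remaining.discard(n)
        cycle += 1
    return start


graph = {"a": [], "b": [], "c": ["a", "b"], "d": ["c"]}
print(list_schedule(graph))                 # 4 operators
print(list_schedule(graph, n_operators=1))  # a single operator
```

A greedy scheduler like this gives no optimality guarantee, which is one reason an exact CP formulation is attractive for problems of moderate size.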

RecMotifs project

In the context of the RecMotifs project, we have continued to work on a specific design flow integrating STMicroelectronics' compiler and our development platform. We have also defined a new CP (Constraint Programming) model [75], [46] of the scheduler that is well adapted to a parallel architecture. Our generic simplified architecture is composed of functionally reconfigurable cells implementing a set of computational patterns (selected by our system). The cells are connected directly to the processor data-path. Each cell also contains registers for local and intermediate data. The cells can communicate through a crossbar switch. The number of registers and the structure of the interconnections are application dependent. Cells can also have a local memory to store coefficients and data needed for processing. In this case, the memory has two ports, one connected to the cell and the other connected directly to the processor. Address generation can be handled by the memory address generator.

In the context of this project, the DURASE flow was modified. The main contribution is the new parallel architecture composed of an ASIP processor and a functionally reconfigurable cell fabric. The new design method for pattern selection also uses a new model of graph covering for this architecture when scheduling instructions for parallel execution. Moreover, we model detailed architectural constraints. The presented method substantially extends the DURASE system, which can now be applied to generic parallel architectures.

We have carried out experiments to evaluate the speed-up that can be obtained using the NiosII processor (running at 200 MHz on an Altera Stratix2 FPGA) extended with a functionally reconfigurable cell fabric. The patterns were generated under the assumption that a pattern has at most four inputs and at most two outputs. Results obtained for selected applications from the MediaBench, MiBench and Cryptographic Library benchmark sets have been presented in [46]. These applications are written in C and compiled using our design flow for the ALTERA NIOSII target processor.
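The input/output limit on patterns can be checked on a data-flow graph as sketched below. The graph encoding (a list of producer-to-consumer edges), the function names and the example graph are assumptions made for illustration; only the limits of four inputs and two outputs come from the text.

```python
def pattern_io(edges, pattern, primary_outputs=()):
    """Count the inputs and outputs of a candidate pattern.

    edges: iterable of (producer, consumer) node pairs of the data-flow graph.
    pattern: set of nodes grouped into one computational pattern.
    primary_outputs: nodes whose value leaves the graph (hypothetical field).
    """
    pattern = set(pattern)
    # Inputs: distinct outside nodes whose value is consumed inside the pattern.
    inputs = {src for src, dst in edges if dst in pattern and src not in pattern}
    # Outputs: pattern nodes whose value is consumed outside the pattern.
    outputs = {src for src, dst in edges if src in pattern and dst not in pattern}
    outputs |= pattern & set(primary_outputs)
    return len(inputs), len(outputs)


def is_valid_pattern(edges, pattern, max_in=4, max_out=2, primary_outputs=()):
    n_in, n_out = pattern_io(edges, pattern, primary_outputs)
    return n_in <= max_in and n_out <= max_out


# y = (x0 * x1) + (x2 * x3), with nodes m1, m2 (multiplies) and s (add).
dfg = [("x0", "m1"), ("x1", "m1"), ("x2", "m2"), ("x3", "m2"),
       ("m1", "s"), ("m2", "s"), ("s", "y")]
print(pattern_io(dfg, {"m1", "m2", "s"}))        # inputs, outputs
print(is_valid_pattern(dfg, {"m1", "m2", "s"}))
```

A pattern generator would typically apply such a check to prune candidates early, before pattern selection and graph covering.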

Floating-Point to Fixed-Point Conversion

Participants : Daniel Ménard, Karthick Parashar, Olivier Sentieys, Romuald Rocher, Hai-Nam Nguyen.

In [57], a hierarchical approach was proposed to perform word-length optimization of a complete system made up of several subsystems. At the system level, the fixed-point behavior of each subsystem is modeled by a single noise source located at the subsystem output. The aim is to find the noise power level of each noise source so as to minimize the implementation cost while maintaining the overall performance. This year, experiments have been carried out on a MIMO-OFDM receiver to demonstrate the efficiency of our approach.
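The system-level problem described above can be written compactly as follows. The notation is an assumption introduced here, not taken from [57]: N is the number of subsystems, P_i the power of the noise source at the output of subsystem i, C_i(P_i) the implementation cost of subsystem i at that noise level, and lambda the overall system performance with lambda_min its required minimum. Writing the total cost as a sum of subsystem costs is also an assumption.

```latex
\min_{P_1, \dots, P_N} \; \sum_{i=1}^{N} C_i(P_i)
\quad \text{subject to} \quad
\lambda(P_1, \dots, P_N) \ge \lambda_{\min}
```

Once the noise budgets P_i are fixed, each subsystem can be optimized on its own, with P_i acting as its local accuracy constraint.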

For the fixed-point conversion process, different optimization algorithms have been tested. An improvement of the word-length optimization techniques based on a genetic algorithm has been proposed: the quality of the solution is improved without increasing the optimization time. The execution time of this kind of algorithm is quite long, but it directly yields the Pareto curve of the cost as a function of the accuracy constraint. For example, this curve is used in our hierarchical approach. The use of the GRASP algorithm for word-length optimization has also been proposed. Compared to the genetic algorithms, this approach reduces the optimization time for a given accuracy constraint and improves the solution quality.
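The Pareto curve mentioned above can be extracted from a set of candidate solutions as in the following sketch. It assumes each candidate is summarized by a pair (noise power, cost), both to be minimized; the function name and the sample values are made up for illustration and are not results from the experiments.

```python
def pareto_front(points):
    """Return the non-dominated (noise_power, cost) pairs, sorted by noise power.

    A point is dominated if another point is no worse on both criteria
    and strictly better on at least one.
    """
    front = []
    best_cost = float("inf")
    # Sorting by noise power (then cost) means a point is non-dominated
    # exactly when its cost is strictly lower than every cost seen so far.
    for noise, cost in sorted(set(points)):
        if cost < best_cost:
            front.append((noise, cost))
            best_cost = cost
    return front


# Hypothetical candidates: (output noise power, implementation cost).
candidates = [(1e-6, 120), (1e-5, 100), (1e-5, 110),
              (1e-4, 100), (1e-3, 60), (1e-4, 90)]
print(pareto_front(candidates))
```

Reading the front at a given accuracy constraint then gives the cheapest known implementation that meets it, which is how such a curve could feed a system-level noise budget search.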

