Section: New Results
Model-driven engineering for simulation, compilation, verification, evaluation and synthesis
Participants : Yassine Aydi, Rabie Ben Atitallah, Abou El Hassan Benyamina, Pierre Boulet, Jean-Luc Dekeyser, Cédric Dumoulin, Anne Etien, Abdoulaye Gamatié, Calin Glitia, Frédéric Guyomarc'h, Antoine Honoré, Sébastien Le Beux, Philippe Marquet, Samy Meftali, Smaïl Niar, Éric Piel, Imran Rafiq Quadri, Julien Taillard, Huafeng Yu.
A RTL metamodel
We propose to describe hardware accelerators dedicated to intensive signal processing at the RTL (Register Transfer Level), i.e. independently of any hardware description language (VHDL or Verilog, for instance). For this purpose, we developed an RTL metamodel that enables the description of hardware accelerators. This metamodel relies on a factorized expression of the parallelism contained in hardware accelerators.
The RTL metamodel also enables the description of FPGAs according to different views. One view is dedicated to the description of the resources contained in an FPGA (storage, computing, etc.); another focuses on the FPGA topology (cell organisation) and the last one defines the FPGA configuration zones.
Some concepts of the RTL metamodel are dedicated to the mapping of a hardware accelerator onto FPGA. These provide implementation characteristics of a hardware accelerator for a given FPGA. Such information will allow a fine topological placement of the hardware accelerator onto the FPGA.
We are currently working on automatic design space exploration for hardware accelerators. Based on the FPGA implementation characteristics of a hardware accelerator, a Gaspard2 application is modified in order to generate (through model-to-model transformation) another hardware accelerator. The aim of the design space exploration is to improve the fit between a hardware accelerator and a specific FPGA.
A synchronous equational metamodel
A synchronous metamodel, based on last year's results on the synchronous modeling of Gaspard2 concepts, has been proposed.
The synchronous data-flow languages, such as Lustre, Signal and Lucid Synchrone, share several common features, which enables their code generation from a single synchronous metamodel. This metamodel aims not only at expressing models in the three synchronous data-flow languages, but also at bridging the gap between Gaspard and the synchronous technology, which offers a wide family of tools allowing designers to formally validate their models.
Transformation chains for code generation
As illustrated in figure 1, the design approach adopted in the Gaspard2 framework relies on model transformations for code generation towards different languages. These languages are intended for various purposes. All the implemented transformation chains use MoMoTE and MoCodE. The next paragraphs present each transformation chain.
Towards synchronous dataflow languages
The synchronous transformation chain enables code generation towards synchronous dataflow languages, particularly Lustre and Signal. These transformations start from the high-level specifications of the application. The implemented chain consists of about five thousand lines of code. Another code generator, for the synchronous language Lucid Synchrone, is still under development.
Towards VHDL
A model-to-model transformation turns a Gaspard application into a model conforming to the RTL metamodel. A specific focus has been set on data dependencies in order to compile both space and time dependencies. The hardware accelerator model conforming to the RTL metamodel is then transformed into VHDL code by code generation. The generated code can easily be simulated and synthesised onto FPGA with standard commercial tools (Altera's Quartus, for instance). Several experiments validated this approach. A constraint file is also generated in order to place the hardware accelerator design onto the FPGA.
From deployed models to models conforming to the Loop metamodel
In order to produce the simplified model of a SoC expressed in the Loop metamodel from a deployed model, two successive model transformations have been defined. Following MDE recommendations, an intermediary metamodel, the Polyhedron metamodel, is used between these transformations. Each transformation is a set of transformation rules, each of them working on a very small set of elements. These implementations use MoMoTE.
The first transformation, from a deployed model to the Polyhedron metamodel, is composed of 59 rules. These rules mainly allow us to express the repetitions as polyhedra, to separate the application tree according to the association specification, to map the data arrays onto the memories, and to simplify the deployment specifications. The second transformation, from the Polyhedron metamodel to the Loop metamodel, converts the mapping expressed by polyhedra into loop expressions: each polyhedron is transformed into a loop nest.
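As an illustration of this second transformation, its simplest case (a box-shaped polyhedron, i.e. one pair of bounds per dimension) can be sketched as follows; the function name and data representation are illustrative, not the actual Gaspard2 implementation:

```python
def polyhedron_to_loops(bounds, body):
    """Emit a textual loop nest for a box-shaped polyhedron.

    bounds: list of (lower, upper) pairs, one per dimension;
    body:   the statement placed inside the innermost loop.
    """
    pad = "  "
    lines = [pad * depth + f"for i{depth} in {lo}..{hi}:"
             for depth, (lo, hi) in enumerate(bounds)]
    lines.append(pad * len(bounds) + body)
    return "\n".join(lines)

# A 4x2 repetition space becomes two nested loops around the body.
print(polyhedron_to_loops([(0, 3), (0, 1)], "compute(i0, i1)"))
```

The general case handles arbitrary polyhedra (affine inequalities between indices), but the principle is the same: one loop per polyhedron dimension.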
The two following transformation chains start from the Loop metamodel, which acts as an intermediary between the very high-level concepts represented in the deployed model and the concepts manipulated in the generated code.
OpenMP/Fortran code is generated starting from the Loop metamodel. The generation is made in two steps. The first step generates a model in the OpenMP metamodel from the Loop metamodel. It consists in scheduling the tasks in order to obtain a valid program, determining which variables have to be declared, and generating synchronisation barriers when needed. Then, from the OpenMP metamodel, OpenMP/Fortran code is generated. A code generator for OpenMP/C is still under development.
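The scheduling step can be illustrated by a level-based topological sort: tasks whose dependencies are all satisfied form one parallel level, and a synchronisation barrier is placed between consecutive levels. This is only a sketch of the principle, not the actual transformation code:

```python
def schedule_with_barriers(tasks, deps):
    """Group tasks into levels of independent tasks; a synchronisation
    barrier is required between two consecutive levels.

    tasks: iterable of task names;
    deps:  dict mapping a task to the set of tasks it depends on.
    """
    remaining, done, levels = set(tasks), set(), []
    while remaining:
        ready = {t for t in remaining if deps.get(t, set()) <= done}
        if not ready:
            raise ValueError("cyclic task dependency")
        levels.append(sorted(ready))  # tasks in one level run in parallel
        done |= ready
        remaining -= ready
    return levels

# 'c' consumes the outputs of 'a' and 'b': one barrier is needed.
print(schedule_with_barriers(["a", "b", "c"], {"c": {"a", "b"}}))
# → [['a', 'b'], ['c']]
```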
Starting from the Loop metamodel, a transformation generating SystemC code was implemented. It relies on templates through the MoCodE tool and generates both the simulation of the hardware components and the application components. Each hardware component is transformed into a SystemC module with its ports linked. For each processor, the part of the application to be executed on that processor is generated as a set of activities dynamically scheduled and synchronised, following the execution model defined for Gaspard2 applications on MPSoCs. Additionally, the framework needed to automatically compile all the simulation code is also generated (as a Makefile). The abstraction level is TLM-PA, which allows the user to see the execution of the program in terms of pattern usage (patterns being the main data element of a Gaspard2 program) instead of byte-level reads and writes. This also speeds up the simulation.
Performance evaluation of MPSoC in SystemC
In our previous work, we developed an MPSoC platform described at the Programmer View Timed (PVT) level. This platform includes various component models: processors, caches, an interconnection network, RAM, a DMA controller and a DCT hardware accelerator. At this level, a timing model is defined and plugged into the architectural simulator to approximate the execution time.
Nowadays, early power estimation is increasingly important in MPSoC architectures for a reliable Design Space Exploration (DSE). During this year, we have developed an MPSoC power modeling framework at the PVT level that allows optimal architectural alternatives, exhibiting a good performance/power trade-off, to be found early in the design flow. Using a hybrid power modeling methodology, we developed several power models derived from both physical measurements and analytical expressions. Plugging these power models into the PVT architectural simulator makes it easy to estimate an application's performance and power consumption with a high simulation speedup. Experimental results show that our MPSoC environment yields a simulation speedup factor of up to 18 with a negligible performance and power estimation error margin.
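The principle of such a hybrid model can be sketched as follows: activity counters collected by the simulator are weighted by per-event energy costs, and a static part is integrated over the simulated time. The event names and energy values below are purely illustrative, not our measured figures:

```python
# Illustrative per-event energy costs in nanojoules (hypothetical values).
EVENT_ENERGY_NJ = {
    "cpu_instr": 0.8,
    "cache_access": 0.4,
    "cache_miss": 6.0,
    "noc_flit": 1.2,
}

def estimate_energy(counters, static_mw, sim_time_ms):
    """Total energy in nJ: dynamic part from activity counters plus
    static power integrated over the simulated time."""
    dynamic_nj = sum(counters.get(event, 0) * energy
                     for event, energy in EVENT_ENERGY_NJ.items())
    static_nj = static_mw * sim_time_ms * 1000.0  # 1 mW over 1 ms = 1000 nJ
    return dynamic_nj + static_nj

# 1000 instructions at 0.8 nJ each, plus 2 mW of static power over 1 ms.
print(estimate_energy({"cpu_instr": 1000}, static_mw=2.0, sim_time_ms=1.0))
# → 2800.0
```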
New NoC for Intensive Signal Processing applications needs
The intercommunication requirements of future massively parallel SoCs cannot be satisfied by a single shared bus or a hierarchical bus, due to their poor scalability with the number of processors and the bandwidth shared between all the components.
To overcome this problem, the Network on Chip (NoC) has been proposed by academia and industry as a solution to the on-chip communication challenge for the next generation of multiprocessor systems on chip, denoted MPSoC architectures.
Historically, Multistage Interconnection Networks (MIN) have been used in multiprocessor systems. We propose to use a MIN as a Network on Chip to connect processors to memory modules in an MPSoC.
Many variations of MINs have been introduced. A MIN is defined by its topology, switching strategy, routing algorithm, scheduling mechanism and fault tolerance. A reconfigurable MIN can take advantage of regular communication patterns to minimize contention and to improve the global bandwidth of the system. Simulation in SystemC at different levels of abstraction is a first exploration step before synthesis on FPGA.
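For instance, in an Omega-like MIN built from 2x2 switches, the classical destination-tag routing algorithm lets each stage decide locally from one bit of the destination address. A minimal sketch of this scheme (illustrative, not our SystemC model):

```python
def destination_tag_route(dst, stages):
    """Output port (0 = upper, 1 = lower) chosen by each 2x2 switch on
    the path to 'dst' in a 2**stages-port Omega-like network; stage k
    inspects bit k of the destination, most significant bit first."""
    return [(dst >> (stages - 1 - k)) & 1 for k in range(stages)]

# Routing to destination 5 (binary 101) through a 3-stage network:
print(destination_tag_route(5, 3))
# → [1, 0, 1]
```

No global routing table is needed: this locality is what makes MINs scale well with the number of ports.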
In this context, we are also developing new coherency maintenance protocols especially dedicated to MPSoC architectures equipped with a high-performance NoC. In this study, we aim to demonstrate that coherency protocols dedicated to shared-memory MPSoCs equipped with a NoC (such as a MIN or a crossbar) are more efficient than existing protocols. The protocols under design take into account the behaviour of the application and automatically adapt the manner in which read/write operations are performed.
Implementing partial dynamic reconfiguration in FPGAs
The domain of reconfigurable architectures is very vast and has led to a spectrum of Reconfigurable Architectures (RAs). These RAs can generally be classified as either coarse-grain or fine-grain. Coarse-grain RAs are mainly suited to specific, customized data-path applications, with the advantages of increased performance and lower communication delays; their disadvantage is a lack of flexibility to adapt to general applications. Fine-grain RAs (mainly SRAM-based commercial FPGAs) work at the bit level and offer greater flexibility in terms of adaptability to applications, at the cost of increased reconfiguration time and communication delays, and reduced performance.
We focus on these fine-grain RAs, i.e. FPGAs, particularly Xilinx's Virtex series, which are capable of Partial Dynamic Runtime Reconfiguration (PDR). The objective of the thesis is to extend the work already carried out in our Gaspard2 chain and to introduce the notion of dynamic reconfiguration in Gaspard2. For this purpose, we plan to introduce certain aspects of the UML profile for MARTE (Modeling and Analysis of Real-Time and Embedded Systems) into the Gaspard2 environment.
We have explored the architecture necessary to carry out PDR. Physical properties (such as power consumption, area layout and reconfiguration time) are among the PDR-related notions to be included in our work. A notion of QoS will also be implemented, associated with a controller responsible for carrying out the reconfiguration. This mainly helps change the context of the application depending on the user's needs.
The notion of control (for the synthesis of VHDL) is already present in the existing environment: the static control implemented in the FPGA can choose between different configurations. The problem is precisely that it is static in nature. For our needs, we have to introduce a notion of dynamic control in the Gaspard2 environment, which will be integrated with the notion of QoS.
Compilation technique for data-parallelism
The Array-OL transformation toolbox, now implemented and integrated in Gaspard2, was tested on real applications. These tests gave us important feedback on the transformations. The major problem is caused by the presence of more than one intermediary array between two tasks intended for fusion. The solution previously proposed and implemented involved duplicating the first task, which introduces many re-computations. A new partial solution was proposed and developed for the case where the intermediary arrays are in fact the same array consumed multiple times.
Another direction of research is the introduction of inter-repetition dependencies and their impact on the Array-OL transformations. A first result was the need to extend the concept of inter-repetition dependence by allowing multiple default links characterized by tilers (the construction of the tilers guarantees that exactly one default link can be chosen for a given repetition instance). An additional stage will be added to the Array-OL transformations: one that transforms the inter-repetition dependencies according to the distribution of repetitions before and after the transformation.
Due to the complex way pattern consumption/production is expressed in Array-OL with the help of tilers, writing those tilers is not a trivial task, especially for users not familiar with the semantics of Array-OL. To ease this task, we have implemented a tiler editor that can be plugged in wherever needed and helps by identifying some probable choices for a tiler (separate dimensions for repetition and pattern, patterns parallel to the axes, sliding windows). The next step is to extend this editor with a visual interface better suited to more complicated constructions.
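To give an intuition of what a tiler expresses, a one-dimensional simplification can be sketched: a tiler combines an origin, a paving and a fitting; the repetition index selects a reference element through the paving, and the fitting enumerates the elements of the pattern from that reference, modulo the array size. This sketch is a deliberate simplification of the multi-dimensional Array-OL semantics:

```python
def tiler_elements_1d(origin, paving, fitting, rep, pattern_size, array_size):
    """Indices of the array elements accessed by repetition 'rep':
    (origin + paving * rep + fitting * i) mod array_size,
    for each point i of the pattern."""
    ref = origin + paving * rep
    return [(ref + fitting * i) % array_size for i in range(pattern_size)]

# A sliding window of size 3 over an array of 8 elements; the last
# repetitions wrap around thanks to the modulo.
print(tiler_elements_1d(0, 1, 1, 6, 3, 8))
# → [6, 7, 0]
```

In the actual language, the origin is a vector while the paving and fitting are matrices applied to multi-dimensional repetition and pattern indices, which is precisely what makes tilers hard to write by hand.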
The work on the GILR optimization heuristics has continued with the visit of Abou El Hassan Benyamina from the University of Oran, Algeria. We have started investigating a hierarchical heuristic based on a genetic algorithm. Preliminary results have been published.