Section: New Results
HP-SoC simulation, verification and synthesis
Participants : Adolf Abdallah, Rabie Ben Atitallah, Mouna Baklouti, Hajer Chtioui, Rosilde Corvino, Pierre Boulet, Jean-Luc Dekeyser, Abdoulaye Gamatié, Laure Gonnord, Thomas Legrand, Philippe Marquet, Samy Meftali, Smaïl Niar, Imran Rafiq Quadri.
Partial and Dynamic Reconfiguration (PDR) implementations
The current Gaspard model transformation chain towards Register Transfer Level (RTL) can generate two key aspects of a partially and dynamically reconfigurable system: the dynamically reconfigurable region itself, and the code of the reconfiguration manager that carries out the switch between the different configurations of this dynamic region. For this purpose, the MARTE metamodel has been extended with concepts from UML state machines and collaborations, which allow mode-automata semantics to be expressed at high abstraction levels. Integrating these concepts into the extended MARTE metamodel also supports the corresponding model-to-model transformations.
Moreover, the high-level application model has several building blocks: the elementary components, each associated with several available intellectual properties (IPs). The deployment level has also been extended with the notion of “configurations”  , which are unique global implementations of the application functionality, each configuration comprising a different combination of IPs for the elementary components. Using the deployment level together with the introduced control semantics, a designer can change the configuration of an application, leading to different trade-offs in terms of consumed FPGA resources, reconfiguration times, etc. Two model-to-model transformations are incorporated in our flow: first, the UML2MARTE transformation, with integrated state machine and configuration concepts, produces an intermediate MARTE model; this model is then converted into an RTL model by the MARTE2RTL transformation. The application model is thus converted into several implementations of a dynamically reconfigurable hardware accelerator, along with the source code for the configuration switch.
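The configuration-switch idea can be pictured as a mode automaton whose states are global configurations, each binding the elementary components to one of their deployed IPs. The sketch below (in Python, for brevity) is purely illustrative: the class and configuration names are hypothetical and do not reflect Gaspard's generated code or API.

```python
# Hypothetical sketch of a reconfiguration manager as a mode automaton:
# states are configurations, events trigger switches between them.

class Configuration:
    def __init__(self, name, ip_bindings, reconfig_time_ms):
        self.name = name
        self.ip_bindings = ip_bindings            # elementary component -> chosen IP
        self.reconfig_time_ms = reconfig_time_ms  # cost of loading this configuration

class ReconfigurationManager:
    """Mode automaton: current state is the active configuration."""
    def __init__(self, configurations, transitions, initial):
        self.configs = {c.name: c for c in configurations}
        self.transitions = transitions            # (state, event) -> next state
        self.current = initial

    def on_event(self, event):
        key = (self.current, event)
        if key in self.transitions:
            self.current = self.transitions[key]
        return self.configs[self.current]

# Two configurations of the same functionality, differing in one IP choice.
fast = Configuration("fast", {"correlator": "parallel_ip"}, reconfig_time_ms=4.2)
small = Configuration("small", {"correlator": "serial_ip"}, reconfig_time_ms=1.1)

mgr = ReconfigurationManager(
    [fast, small],
    {("fast", "low_power"): "small", ("small", "high_throughput"): "fast"},
    initial="fast")

cfg = mgr.on_event("low_power")
print(cfg.name, cfg.ip_bindings["correlator"])  # small serial_ip
```

Each transition of such an automaton corresponds, at RTL, to loading a different implementation of the dynamic region.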
Finally, the design flow has been validated through the construction of a dynamically reconfigurable delay estimation correlation module  that is part of a complex anti-collision radar detection system, in collaboration with IEMN Valenciennes  . The simulation results of the different configurations match an initial MATLAB reference, which validates them. Additionally, changing the IPs associated with a key elementary component of the module resulted in different reconfiguration times, demonstrating the interest of the methodology.
The different chains available within Gaspard produce simulation code. Among them, the SystemC code generation allows simulations at the TLM-PA level. Regarding the architecture design, the process acts as a connector between existing SystemC modules. These modules correspond to basic components such as memories, processors and caches. They are gathered in the Gaspardlib, to be included or linked at the code compilation step. On the one hand, both application and architecture IPs have been modeled using UML, so that the available components can easily be dragged and dropped into the user's model. On the other hand, we aimed at providing the most flexible design for the SystemC architecture. Each SystemC module is composed of an interface and of behavioral sections; the interfaces have been updated to the new OSCI standard, TLM2 (http://www.systemc.org/apps/group_public/workgroup.php?wg_abbrev=tlmwg ). This update provides high interoperability of our SystemC components with any other SystemC-TLM architecture. Consequently, additional SystemC modules have been integrated to extend the Gaspardlib. They come from other free simulation environments: ReSP (http://www.resp-sim.org ), SocLib (https://www.soclib.fr/trac/dev/wiki ) and Unisim.
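The interoperability gain comes from TLM2's blocking-transport convention: an initiator hands a generic payload to any target's transport method and gets back a response status plus a timing annotation. The following sketch mimics that idea in Python; it is not SystemC/TLM2 code, and the module names and the fixed 10 ns latency are illustrative assumptions.

```python
# Language-agnostic sketch of the TLM2 blocking-transport style:
# initiator -> generic payload -> target's b_transport -> response + delay.

class GenericPayload:
    def __init__(self, command, address, data=None):
        self.command = command    # "read" or "write"
        self.address = address
        self.data = data
        self.response = None

class MemoryTarget:
    """Target module: serves read/write payloads from a backing store."""
    def __init__(self, size):
        self.store = [0] * size

    def b_transport(self, payload, delay_ns):
        if payload.command == "write":
            self.store[payload.address] = payload.data
        else:
            payload.data = self.store[payload.address]
        payload.response = "ok"
        return delay_ns + 10      # annotate a fixed (illustrative) 10 ns latency

class ProcessorInitiator:
    """Initiator module: can be bound to any target exposing b_transport."""
    def __init__(self, target):
        self.target = target

    def write(self, addr, value):
        return self.target.b_transport(GenericPayload("write", addr, value), 0)

    def read(self, addr):
        p = GenericPayload("read", addr)
        self.target.b_transport(p, 0)
        return p.data

cpu = ProcessorInitiator(MemoryTarget(size=64))
cpu.write(3, 42)
print(cpu.read(3))  # 42
```

Because both sides agree only on the transport interface and payload shape, a module from one library can be bound to a module from another, which is what makes mixing Gaspardlib, ReSP, SocLib and Unisim components feasible.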
Similarly, several developments have been undertaken to provide ISS-based (Instruction Set Simulator) components for low-level simulations. A set of ISSs has been extracted from the environments cited above and reorganized to be usable within Gaspard. A standalone Eclipse environment has been created to easily handle the whole Gaspardlib; it helps to build advanced architectures relying on SystemC. This environment is a preliminary development step, as it allows one to realize ISS simulations that may later be generated automatically by the Gaspard chains.
Clock-based analysis of embedded system behavior
Starting from the simulation clock properties of an embedded system (as described previously), we can now analyze the system behavior. On the one hand, we verify whether the functional clock constraints specified by the designer in the application specification are met during the system execution on the considered physical resources. When these constraints are not met, the simulation clock traces can be used to reason about the execution and find solutions that satisfy the constraints. For instance, this may amount to decreasing the speed of processors that compute data too fast, or to increasing the speed of processors that compute data too slowly. Any such modification of the processor speeds must still respect the functional constraints imposed by the designer; in the simulation clock traces, it appears as new physical clock properties derived from the suitable processor frequencies. Another solution may consist in delaying the first activation of a faster processor until an adequate time to begin the execution. Such an activation delay can be seen as playing a role similar to lowering the voltage/frequency.
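A minimal sketch of this trace analysis, under assumed conventions: a clock trace is the list of a processor's activation instants, the (hypothetical) functional constraint is "the consumer's i-th activation never precedes the producer's i-th activation", and a violation by a too-fast consumer is repaired by delaying its first activation.

```python
# Illustrative clock-trace analysis: check a precedence constraint between a
# producer and a consumer processor, and compute the minimal activation delay
# for the faster one when the constraint is violated.

def constraint_met(producer_trace, consumer_trace):
    """True if the consumer's i-th tick is never before the producer's i-th."""
    return all(c >= p for p, c in zip(producer_trace, consumer_trace))

def minimal_activation_delay(producer_trace, consumer_trace):
    """Smallest shift of the consumer's activations restoring the constraint."""
    return max((p - c for p, c in zip(producer_trace, consumer_trace)), default=0)

# Producer activates every 10 time units, the consumer every 4: too fast.
producer = [0, 10, 20, 30]
consumer = [0, 4, 8, 12]

print(constraint_met(producer, consumer))           # False
delay = minimal_activation_delay(producer, consumer)
print(delay)                                        # 18
shifted = [c + delay for c in consumer]
print(constraint_met(producer, shifted))            # True
```

The same traces could instead drive a frequency change: stretching the consumer's period to match the producer's removes the violation without any start-up delay.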
On the other hand, we can reduce the power dissipation of the system, which is the combination of dynamic and static power. The dynamic dissipated power, which represents 80-85% of the total power dissipation, can be lowered by reducing the processor frequency, which in turn allows a reduction of the supply voltage. The static power dissipation can be reduced by cutting off the power supply of a processor when it is not used for a long period. Both aspects can be addressed by analyzing the simulation clock traces.
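The leverage of frequency scaling on dynamic power can be illustrated with the standard switching-power model P_dyn ≈ a·C·V²·f: if lowering the frequency by a factor s also allows the voltage to scale by s, dissipation drops roughly as s³. The constants below are arbitrary, for illustration only.

```python
# Back-of-the-envelope dynamic power model: P_dyn = activity * C * V^2 * f.

def dynamic_power(activity, capacitance, voltage, frequency):
    return activity * capacitance * voltage**2 * frequency

# Illustrative operating point: 200 MHz at 1.2 V.
base = dynamic_power(activity=0.5, capacitance=1e-9, voltage=1.2, frequency=200e6)

# Halve the frequency and, with it, the supply voltage:
scaled = dynamic_power(0.5, 1e-9, 1.2 / 2, 200e6 / 2)

print(round(base / scaled))  # 8: a 2x frequency cut gives ~8x less dynamic power
```

This cubic effect is why slowing down an over-fast processor, as suggested by the clock-trace analysis, pays off so strongly on the dynamic share of the dissipation.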
IP based configurable massively parallel processing SoC
We propose a methodology and a tool chain to design and build IP-based configurable massively parallel architectures. The defined architecture is named mppSoC, for massively parallel processing System on Chip. It is a SIMD architecture composed of a number of processing elements (the PEs) working in perfect synchronization. A small amount of local and private memory is attached to each PE. Every PE is potentially connected to its neighbors via a regular network. Furthermore, each PE is connected to an entry of the mpNoC, a massively parallel Network on Chip that potentially connects each PE to any other, performing efficient irregular communications. The whole system is controlled by an Array Controller Unit (ACU). Our objective is then to produce FPGA implementations of the mppSoC architecture through this methodology.
The whole mppSoC architecture, with its various components, is implemented following an IP-based design methodology. An implementation on an FPGA, the ALTERA StratixII 2s180, is proposed as a proof of feasibility. The architecture consists of general IPs (processor IPs, memory IPs, etc.) and of specific IPs supplied with the mppSoC system (control IPs, etc.). Specific IPs are used as a glue to build the architecture. General IPs present a defined interface, which must be respected by designers who want to produce their own IPs; for this kind of IP we provide a library to alleviate their design. The resulting architecture is configurable and parametric. To construct an mppSoC system, we assemble IPs to generate an FPGA configuration, and the designer has to make different choices. They first determine the components of the architecture, for example whether it contains an irregular communication network with a defined interconnection router  or a neighborhood network, or both. Since the architecture is parametric, they also choose architectural parameters such as the number of PEs, the memory size and the topology of the neighborhood network  if it exists. After fixing the architecture, the designer then chooses the basic IPs to be used, such as the processor IP, the interconnection network IP, etc. In this way, the user can select the mppSoC configuration that best satisfies their needs. To evaluate the proposed design methodology, we have implemented architectures of different sizes with various configurations. We have also tested some examples of data-parallel applications such as FIR filtering, reduction, matrix multiplication, image rotation and 2D convolution. Through simulation results, we can choose the most appropriate mppSoC configuration with respect to the relevant performance metrics: execution time, FPGA resources and energy consumption.
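The designer's choices above can be summarized as a configuration record plus lockstep SIMD execution under the ACU. The Python sketch below is a behavioral toy model, not the VHDL IP assembly: the class names, parameters and the ring topology are illustrative assumptions.

```python
# Toy behavioral model of an mppSoC configuration: number of PEs, per-PE
# memory size and neighborhood topology are parameters; every PE executes
# the broadcast operation in lockstep.

class MppSoCConfig:
    def __init__(self, num_pes, pe_memory_words, topology):
        self.num_pes = num_pes
        self.pe_memory_words = pe_memory_words
        self.topology = topology              # e.g. a "ring" neighborhood network

class SimdArray:
    """The ACU broadcasts one operation; all PEs apply it in perfect sync."""
    def __init__(self, config):
        self.config = config
        self.mem = [[0] * config.pe_memory_words for _ in range(config.num_pes)]

    def broadcast(self, op, addr):
        for pe in range(self.config.num_pes):
            self.mem[pe][addr] = op(pe, self.mem[pe][addr])

    def shift_from_left_neighbour(self, addr):
        """One regular-network communication step on a ring topology."""
        values = [self.mem[pe][addr] for pe in range(self.config.num_pes)]
        for pe in range(self.config.num_pes):
            self.mem[pe][addr] = values[pe - 1]   # wraps around on a ring

array = SimdArray(MppSoCConfig(num_pes=4, pe_memory_words=8, topology="ring"))
array.broadcast(lambda pe, old: pe * 10, addr=0)  # PE i stores 10*i
array.shift_from_left_neighbour(addr=0)
print([array.mem[pe][0] for pe in range(4)])      # [30, 0, 10, 20]
```

Sweeping `num_pes`, `pe_memory_words` and the topology in such a model mirrors the configuration exploration done on the FPGA implementations.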
As a result, we have proposed an IP-based methodology for the construction of mppSoC systems, helping the designer to choose the best configuration for a given application. It is a first step towards mppSoC architecture exploration.
Ongoing work aims at integrating mppSoC into a real application, such as a video processing framework. Future work will aim at improving the proposed IP-assembling methodology to construct mppSoC systems. Our ultimate goal is to provide a complete tool that generates an mppSoC configuration, in order to help the designer in a semi-automatic architecture exploration for a given application.
Caches in MPSoCs
In Multi-Processor System-on-Chip (MPSoC) architectures using shared memory, caches have an important impact on performance and energy consumption.
When the executed application exhibits a high degree of reference locality, caches may reduce the number of shared-memory accesses and data transfers on the interconnection network. Hence, execution time and energy consumption can be greatly optimized. However, caches in MPSoC architectures raise the data coherency problem. In this context, most of the existing solutions are based either on data-invalidation or on data-update protocols, which do not take changes in the application behavior into account. This work presents a new hybrid cache-coherency protocol that is able to dynamically adapt its functioning mode according to the application needs.
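One way such a hybrid protocol can adapt, sketched here as a hedged guess at the mechanism rather than the published design: the controller counts, per cache line, how often remote reads follow a local write. Actively shared lines switch to update mode (push new values to sharers); rarely re-read lines stay in invalidate mode. Thresholds and names are illustrative only.

```python
# Illustrative per-line adaptive coherence controller: choose between
# invalidate and update modes from the observed sharing behavior.

INVALIDATE, UPDATE = "invalidate", "update"

class HybridCoherenceLine:
    def __init__(self, threshold=2):
        self.mode = INVALIDATE
        self.reads_after_write = 0   # remote reads seen since the last write
        self.threshold = threshold

    def on_local_write(self):
        # Adapt the mode to the sharing observed since the previous write.
        if self.reads_after_write >= self.threshold:
            self.mode = UPDATE       # sharers keep re-reading: push updates
        elif self.reads_after_write == 0:
            self.mode = INVALIDATE   # nobody re-read: invalidating is cheaper
        self.reads_after_write = 0
        return self.mode             # coherence action broadcast on the NoC

    def on_remote_read(self):
        self.reads_after_write += 1

line = HybridCoherenceLine()
print(line.on_local_write())   # invalidate (no sharing observed yet)
line.on_remote_read(); line.on_remote_read()
print(line.on_local_write())   # update (the line is actively shared)
```

Update mode saves the misses that invalidation would cause on hot shared lines, while invalidate mode avoids useless update traffic on the NoC for private data, which is the trade-off the hybrid protocol exploits.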
An original architecture that facilitates the implementation of this protocol in Network-on-Chip based MPSoC architectures has been proposed in  . The performance of the proposed protocol, in terms of speed-up factor and energy reduction gain, has been evaluated using a Cycle Accurate Bit Accurate (CABA) simulation platform. Experimental results show that, in comparison with other existing solutions, this protocol achieves significant reductions in execution time and energy consumption.