Section: Scientific Foundations
Keywords : Rapid prototyping, system level CAD, hardware/software codesign, partitioning, multiprocessor, parallel, distributed, specific integrated circuit, realtime, embedded, graph, partial order, synchronous languages, RTL, optimization, offline, online, realtime scheduling, realtime operating system, executive, fault tolerance.
Mapping onto Embedded platforms
Participants : Liliana Cucu, Dumitru Potop, Yves Sorel.
The AAA methodology
The AAA methodology (Algorithm-Architecture Adequation) makes it possible to specify ``application algorithms'' (functionalities) and redundant ``multicomponent architectures'' (composed of processors and specific integrated circuits, all interconnected) with graph models. Consequently, all the possible implementations of a given algorithm onto a given architecture are described in terms of graph transformations. An implementation consists in distributing and scheduling a given algorithm onto a given architecture. Adequation amounts to choosing one implementation among all the possible ones, such that the realtime and embedding constraints are satisfied and the hardware redundancy is fully used. Furthermore, from the adequation results, our graph models make it possible to generate automatically, as an ultimate graph transformation, two types of code: dedicated distributed realtime executives or configurations of standard distributed realtime executives (RTLinux, OSEK, etc.) for processors, and netlists (structural VHDL) for specific integrated circuits. Finally, fault tolerance is of great concern because the applications we are dealing with are often critical, that is to say, they may lead to catastrophic consequences when they fail. The AAA methodology provides a mathematical framework for rapid prototyping and hardware/software codesign taking into account fault tolerance.
From the optimization point of view, realtime systems are, first of all, ``reactive systems'' which must react to each input event of the infinite sequence of events they consume, such that ``cadence'' and ``latency'' constraints are satisfied. The latency corresponds to the delay between an input event consumed by the system and an output event produced by the system in reaction to this input event. The cadence corresponds to the delay between two successive input events, i.e. a period. The term event is used in a broad sense: it may refer to a periodic or to an aperiodic discrete (sampled) signal. When hard (critical) realtime is considered, offline approaches are preferred due to their predictability and better performance, and when online approaches are unavoidable, mainly to take into account aperiodic events, we intend to minimize the decisions taken during the realtime execution. When soft realtime is considered, offline and online approaches are mixed. The application domains we are involved in, e.g. automotive and avionics, lead us to consider scheduling problems for systems of tasks with precedence, latency and periodicity constraints. We seek optimal results in the monoprocessor case, where distribution is not considered, and suboptimal results through heuristics in the multiprocessor case, because the problems become NP-hard once distribution is considered. In addition to these timing constraints, embedded systems must satisfy technological constraints, such as power consumption, weight, volume, memory, etc., which in general leads to minimizing hardware resources. In the most general case, architectures are distributed and composed of several programmable components (processors) and several specific integrated circuits (ASIC, Application Specific Integrated Circuit, or FPGA, Field Programmable Gate Array), all interconnected with possibly different types of communication media.
We call such heterogeneous architectures ``multicomponent'' [39] .
The complexity, not only of the algorithms that must be implemented but also of the hardware architectures, together with the multiple constraints, calls for methodologies when the development cycle time must be minimized, from the high-level specification through the successive prototypes which ultimately become a commercial product. In order to avoid gaps between the different steps of the development cycle, our AAA methodology is based on a global mathematical framework which makes it possible to specify the application algorithms as well as the hardware architecture with graph models, and the implementation of algorithms onto architectures in terms of graph transformations. This approach has the benefit, on the one hand, of ensuring traceability and consistency between the different steps of the development cycle and, on the other hand, of enabling formal verifications and optimizations which reduce realtime tests, as well as automatic code generation (realtime executives for processors and netlists for specific integrated circuits). All these benefits contribute to shortening the development cycle. In effect, the AAA methodology provides a framework for hardware/software codesign where safe design is achieved by construction, and where automatic fault tolerance is obtained simply by specifying the components that the user accepts to fail.
To summarize, we are interested in the optimization of distributed realtime embedded systems according to four research topics:

models for specifying, with graphs and partial orders, application algorithm, hardware architecture, and optimized implementation,

implementation optimization:

automatic code generation for processor (dedicated or standard RTOS configuration) and for specific integrated circuit (netlist),

fault tolerance.
Besides this research, we provide a tool implementing the AAA methodology: a system-level CAD software called SynDEx ( http://www.syndex.org ). This software, coupled with a high-level specification language such as one of the Synchronous Languages or Scicos, leads to a seamless environment for rapid prototyping and hardware/software codesign, drastically reducing the development cycle duration while providing safe design.
We next describe in greater detail the three main modeling aspects involved in the approach.
Algorithm model
Our algorithm model is an extension of the well-known dataflow model from Dennis [64] . It is a directed acyclic hypergraph (DAG) [63] that we call ``conditioned factorized data dependence graph'' [33] , whose vertices are ``operations'' and whose hyperedges are directed ``data or control dependences'' between operations. Hyperedges are necessary in order to model data diffusion, since a standard edge only relates a pair of operations. The data dependences define a partial order on the execution of the operations [72] , called ``potential operation-parallelism''. Each operation may in turn be described as a graph, allowing a hierarchical specification of an algorithm. Therefore, a graph of operations is also an operation. Operations which are the leaves of the hierarchy are said to be ``atomic'', in the sense that it is not possible to distribute any of them over more than one computation resource. The basic dataflow model was extended in three directions: firstly, infinite (resp. finite) repetitions, in order to take into account the reactive aspect of realtime systems (resp. ``potential data-parallelism'', similar to a loop or iteration in imperative languages); secondly, ``state'', when data dependences are necessary between repetitions, which introduces cycles that must be broken by specific vertices called ``delays'' (similar to z^{-n} in automatic control); thirdly, ``conditioning'' of an operation by a control dependence, similar to a conditional control structure in imperative languages. Delays combined with conditionings make it possible to specify the FSMs (Finite State Machines) necessary for describing ``mode changes'', e.g. some control law is performed when the motor is in the state ``idle'' whereas another one is performed when it is in the state ``permanent''. Repetition and conditioning are both based on hierarchy. Indeed, a repeated or ``factorized graph of operations'' is a hierarchical vertex specified with a ``repetition factor'' (factorization makes it possible to display only one repetition).
Similarly, a ``conditioned graph of operations'' is a hierarchical vertex containing several alternative operations, such that for each infinite repetition only one of them is executed, depending on the value carried by the ``conditioning input'' of this hierarchical vertex. Moreover, the proposed model has the synchronous language semantics [66] , i.e. physical time is not taken into account. This means that an operation is assumed to produce its output events and consume its input events simultaneously, and that all the input events are simultaneously present. Thus, by transitivity of the execution partial order associated with the algorithm graph, the outputs of the algorithm are obtained simultaneously with its inputs. Each input or output carries an infinite sequence of events taking values, which is called a ``signal''. Here, the notion of event is general, i.e. signals may be periodic as well as aperiodic. The union of all the signals defines a ``logical time'', in which the physical time elapsing between events is not considered.
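As an illustration only, the partial order induced by data dependences can be sketched in Python. The class `AlgorithmGraph`, its methods and the operation names are hypothetical and are not part of SynDEx; hierarchy, repetition, delays and conditioning are deliberately left out. A hyperedge is represented as one producer diffusing a datum to several consumers, and operations placed in the same level are not ordered with respect to each other, which is one way to visualize potential operation-parallelism.

```python
class AlgorithmGraph:
    """Hypothetical container for a flat algorithm hypergraph."""

    def __init__(self):
        self.operations = set()
        self.hyperedges = []  # list of (producer, frozenset of consumers)

    def add_dependence(self, producer, consumers):
        """Add one hyperedge: a single output diffused to several consumers."""
        consumers = frozenset(consumers)
        self.operations.add(producer)
        self.operations.update(consumers)
        self.hyperedges.append((producer, consumers))

    def predecessors(self):
        """Map each operation to the set of operations it depends on."""
        preds = {op: set() for op in self.operations}
        for producer, consumers in self.hyperedges:
            for consumer in consumers:
                preds[consumer].add(producer)
        return preds

    def parallel_levels(self):
        """Group operations into levels of mutually unordered operations."""
        preds = self.predecessors()
        done, levels = set(), []
        remaining = set(self.operations)
        while remaining:
            ready = sorted(op for op in remaining if preds[op] <= done)
            if not ready:
                # A dependence cycle is only legal if broken by a delay vertex.
                raise ValueError("cycle found: a delay vertex is missing")
            levels.append(ready)
            done.update(ready)
            remaining.difference_update(ready)
        return levels


graph = AlgorithmGraph()
graph.add_dependence("sensor", ["filter", "log"])  # data diffusion
graph.add_dependence("filter", ["control"])
graph.add_dependence("control", ["actuator"])
levels = graph.parallel_levels()
```

In this small example, `filter` and `log` land in the same level because neither depends on the other, so they could be distributed onto two different operators.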
Architecture model
The typical coarse-grain architecture models such as the PRAM (Parallel Random Access Machine) and the DRAM (Distributed Random Access Machine) [73] are not detailed enough for the optimizations we intend to perform. On the other hand, the very fine-grain RTL-like (Register Transfer Level) [71] models are too detailed. Thus, our model of multicomponent architecture is also a directed graph [25] , whose vertices are of four types: ``operator'' (computation resource or sequencer of operations), ``communicator'' (communication resource or sequencer of communications, e.g. DMA), memory resource of type RAM (random access) or SAM (sequential access), and ``bus/mux/demux/(arbiter)'' (choice resource, or selection of data from or to a memory), possibly with an arbiter (arbitration of memory accesses when the memory is shared by several operators); its edges are directed connections. For example, a typical processor is a graph composed of an operator, interconnected with memories (program and data) and communicators through bus/mux/demux/(arbiter). A ``communication medium'' is a linear graph composed of memories, communicators and bus/mux/demux/arbiters corresponding to a ``route'', i.e. a path in the architecture graph. As with the algorithm model, the architecture model is hierarchical, but specific rules must be satisfied, e.g. a hierarchical memory vertex may be specified with bus/mux/demux and memories (e.g. several banks), but not with an operator. Although this model seems very basic, it is the result of several studies aimed at finding the appropriate granularity, which makes it possible, on the one hand, to provide accurate optimization results and, on the other hand, to obtain these results quickly during the rapid prototyping phase. Data communications can be precisely modeled through shared memory or through message passing, possibly using routes. Furthermore, complex interactions between operators and communicators can be taken into account through bus/mux/demux/arbiter, e.g. when communications with DMA require the sequencer of a processor.
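A route in such an architecture graph can be illustrated with a short Python sketch. Everything here is an assumption made for the example: the function `find_route`, the vertex names, the tiny two-processor topology, and the simplification that connections are traversed in both directions. The search is a plain breadth-first search that refuses to cross a third operator, so the path it returns only contains buses, memories and communicators between the two end operators.

```python
from collections import deque


def find_route(kinds, edges, src, dst):
    """Shortest path from operator src to operator dst crossing no other operator.

    kinds maps each vertex name to its type; edges is a list of connections.
    """
    adjacency = {vertex: [] for vertex in kinds}
    for a, b in edges:
        adjacency[a].append(b)
        adjacency[b].append(a)  # simplification: connections used both ways
    queue = deque([[src]])
    seen = {src}
    while queue:
        path = queue.popleft()
        last = path[-1]
        if last == dst:
            return path
        for nxt in adjacency[last]:
            if nxt in seen:
                continue
            if kinds[nxt] == "operator" and nxt != dst:
                continue  # a route must not go through another operator
            seen.add(nxt)
            queue.append(path + [nxt])
    return None


# Hypothetical architecture: two processors linked by a sequential access memory.
kinds = {
    "op1": "operator", "bus1": "bus", "ram1": "ram", "com1": "communicator",
    "link": "sam",
    "op2": "operator", "bus2": "bus", "ram2": "ram", "com2": "communicator",
}
edges = [
    ("op1", "bus1"), ("bus1", "ram1"), ("bus1", "com1"),
    ("com1", "link"), ("link", "com2"),
    ("com2", "bus2"), ("bus2", "ram2"), ("bus2", "op2"),
]
route = find_route(kinds, edges, "op1", "op2")
```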
Our model of integrated circuit architecture is the typical RTL model. It is a directed graph whose vertices are of two types: combinatorial circuit executing an instruction, and register storing data used by instructions, and whose edges are data transfers between a combinatorial circuit and a register, and reciprocally.
In order to unify the multicomponent and integrated circuit models, we extend the RTL model into a new one called ``macroRTL''. Thus, an operator executes ``macroinstructions'', i.e. operations, which consume and produce data in ``macroregisters''. This model makes it possible to encapsulate specific details related to the instruction set, such as cache, pipeline and other nondeterministic features of processors that are difficult to take into account.
Implementation mapping
An implementation of a given algorithm onto a given multicomponent architecture corresponds to a distribution and a scheduling, not only of the algorithm operations onto the architecture operators, but also of the data transfers between operations [27] onto the communication media.
The distribution consists in assigning each operation of the algorithm graph to an operator of the architecture graph. This leads to a partition of the set of operations into as many subgraphs as there are operators. Then, for each operation, two vertices called ``alloc'', for allocating program (resp. data) memory, are added, and each of them is allocated to a program (resp. data) RAM connected to the corresponding operator. Moreover, each ``interoperator'' data transfer between two operations distributed onto two different operators is distributed onto a route connecting these two operators. In order to actually perform this data transfer distribution, according to the elements composing the route, as many ``communication operations'' as there are communicators, as many ``identity'' vertices as there are bus/mux/demux, and as many ``alloc'' vertices for allocating data to communicate as there are RAM and SAM are created and inserted. Finally, communication operations, identity vertices and alloc vertices are distributed onto the corresponding vertices of the architecture graph. All the alloc vertices, those for allocating data and program memories as well as those for allocating data to communicate, make it possible to determine the amount of memory necessary for each processor of the architecture.
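The vertex insertion rule for an interoperator data transfer can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `transfer_vertices`, the type labels and the example route are hypothetical, and the route is given without its two end operators.

```python
def transfer_vertices(route):
    """Return the vertices inserted for one data transfer along a route.

    route is a list of (name, kind) pairs for the architecture vertices
    located between the two operators.
    """
    inserted = []
    for name, kind in route:
        if kind == "communicator":
            inserted.append(("communication", name))  # one per communicator
        elif kind == "bus":
            inserted.append(("identity", name))       # one per bus/mux/demux
        elif kind in ("ram", "sam"):
            inserted.append(("alloc", name))          # one per memory crossed
    return inserted


# Hypothetical route between two operators.
example_route = [
    ("bus1", "bus"), ("com1", "communicator"), ("link", "sam"),
    ("com2", "communicator"), ("bus2", "bus"),
]
inserted_vertices = transfer_vertices(example_route)
```

Each inserted vertex is paired with the architecture vertex onto which it is distributed, which is the information later needed to size the memories.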
The scheduling consists in transforming the partial order of the subgraph of operations distributed onto an operator into a total order. This ``linearization of the partial order'' is necessary because an operator is a sequential machine which executes the operations one after another. Similarly, it also consists in transforming the partial order of the subgraph of communication operations distributed onto a communicator into a total order. In effect, both schedulings amount to adding edges, called ``precedence dependences'' rather than data dependences, to the initial algorithm graph. To summarize, an implementation corresponds to the transformation of the algorithm graph (addition of new vertices and edges to the initial ones) according to the architecture graph.
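The linearization step can be illustrated with the following Python sketch, which relies on the standard `graphlib` module. The function `linearize`, the operation names and the distribution are assumptions made for the example, and the total order is simply taken from one topological order rather than from an optimized schedule. The function returns the sequence of operations per operator together with the precedence dependences that had to be added.

```python
from graphlib import TopologicalSorter


def linearize(dependences, distribution):
    """Turn the partial order into one total order per operator.

    dependences maps each operation to the set of its predecessors;
    distribution maps each operation to its operator.
    """
    order = list(TopologicalSorter(dependences).static_order())
    per_operator = {}
    for op in order:
        per_operator.setdefault(distribution[op], []).append(op)
    precedences = []
    for sequence in per_operator.values():
        for first, second in zip(sequence, sequence[1:]):
            # Add an edge only where no data dependence already orders the pair.
            if first not in dependences.get(second, set()):
                precedences.append((first, second))
    return per_operator, precedences


# Hypothetical diamond-shaped algorithm: A feeds B and C, which both feed D.
dependences = {"A": set(), "B": {"A"}, "C": {"A"}, "D": {"B", "C"}}
distribution = {"A": "P1", "B": "P1", "C": "P1", "D": "P2"}
sequences, added_precedences = linearize(dependences, distribution)
```

Here B and C are unordered in the algorithm graph but share operator P1, so exactly one precedence dependence is added between them; which of the two comes first depends on the topological order that was picked.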
Finally, the set of all the possible implementations of a given algorithm onto a given architecture may be modeled, in intension, as the composition of three binary relations, namely the ``routing'', the ``distribution'', and the ``scheduling'' [41] . Each relation is a mapping between two pairs of graphs (algorithm graph, architecture graph). It may also be seen as an external composition law, where an architecture graph operates on an algorithm graph in order to give, as a result, a new algorithm graph, which is the initial algorithm graph distributed and scheduled according to the architecture graph. Therefore, the ``implementation graph'' is of type algorithm, and may in turn be composed with another architecture graph, allowing complex combinations.
The set of all the possible implementations of a given algorithm onto a specific integrated circuit is different, because we need a transformation of the algorithm graph into an architecture graph which is directly the implementation graph. This graph is composed of two parts: the datapath, obtained by translating each operation into a corresponding logic function, and the control path, obtained by translating each control structure into a ``control unit'', which is a finite state machine made of counters, multiplexers, demultiplexers and memories, managing repetitions and conditionings [21] .
Optimization
We must choose among all the possible implementations a particular one for which the constraints are satisfied and possibly some criteria are optimized.
In the case of a multiprocessor architecture, the problem of finding the best distribution and scheduling of the algorithm onto the architecture is known to be NP-hard [68] . This amounts to considering, in addition to the precedence constraints specified through the algorithm graph model, one latency constraint between the first operation(s) (without predecessor) and the last operation(s) (without successor), equal to a unique periodicity constraint (cadence) for all the operations. We propose several heuristics based on the characterization of the operations (resp. communication operations) relative to the operators (resp. communicators). For example, the total execution time of the algorithm (makespan) on the distributed architecture may be minimized with a cost function taking into account the schedule flexibility of operations, and also the increase of the critical path when two operations are distributed onto two different operators, inducing a communication, possibly through concurrent routes [27] . The characterization amounts to relating the logical time described by the interleaving of events to the physical time. We mainly develop ``greedy heuristics'' because they are very fast [69] , and thus well suited to rapid prototyping of realistic industrial applications. In this type of application, the algorithm graph may have up to ten thousand vertices and the architecture graph may have several tens of vertices. However, we extend these greedy heuristics to iterative versions [40] which are much slower, due to backtracking, but give better results when it is time to produce the final commercial product. For the same reason, we also develop local neighborhood heuristics such as simulated annealing, as well as genetic algorithms, all based on the same type of cost function.
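To make the idea of a greedy distribution and scheduling heuristic concrete, here is a deliberately simplified Python sketch. It is not the cost function described above: instead of schedule flexibility and critical path increase, it uses a plain earliest-finish-time rule, and it charges a fixed `comm_time` whenever a predecessor sits on another operator. The function name, the durations table (the characterization of each operation on each operator) and the example values are all assumptions.

```python
from graphlib import TopologicalSorter


def greedy_schedule(durations, preds, operators, comm_time):
    """Place each operation on the operator where it finishes earliest.

    durations[op][operator] is the execution duration of op on that operator;
    preds[op] is the set of predecessors of op.
    """
    operator_free = {p: 0 for p in operators}
    placement = {}  # op -> (operator, start, end)
    for op in TopologicalSorter(preds).static_order():
        best = None
        for p in operators:
            ready = 0
            for q in preds.get(op, ()):
                q_operator, _, q_end = placement[q]
                # A transfer is only needed when the producer is elsewhere.
                ready = max(ready, q_end + (0 if q_operator == p else comm_time))
            start = max(ready, operator_free[p])
            end = start + durations[op][p]
            if best is None or end < best[2]:
                best = (p, start, end)
        placement[op] = best
        operator_free[best[0]] = best[2]
    makespan = max(end for _, _, end in placement.values())
    return placement, makespan


# Hypothetical characterization on two identical operators.
operators = ["P1", "P2"]
preds = {"A": set(), "B": {"A"}, "C": {"A"}, "D": {"B", "C"}}
durations = {
    "A": {"P1": 2, "P2": 2}, "B": {"P1": 3, "P2": 3},
    "C": {"P1": 3, "P2": 3}, "D": {"P1": 1, "P2": 1},
}
placement, makespan = greedy_schedule(durations, preds, operators, comm_time=1)
```

Each decision is taken once and never revisited, which is what makes such heuristics fast; an iterative version would revisit earlier placements at the cost of a longer run time.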
New applications in the automotive, avionics, or telecommunication domains lead us to consider more complex constraints. In such applications, it is not sufficient to consider the execution duration of the algorithm graph. We also need to consider periodicity constraints for the operations, possibly different ones, and several latency constraints, possibly imposed on any pair of operations. Presently, there are only partial and simple results for such situations in the multiprocessor case, and only a few results in the monoprocessor case. We therefore began a few years ago to investigate this research area by interpreting, in our algorithm graph model, the typical scheduling model given by Liu and Layland [70] for the monoprocessor case. This leads us to redefine the notion of periodicity through infinite and finite repetitions of a graph of operations (i.e. the algorithm), thus generalizing the SDF (Synchronous DataFlow) model [67] proposed in the software environment Ptolemy. For the sake of simplicity, and because this is consistent with the application domains we are interested in, we presently only consider realtime systems that are nonpreemptive and in which ``strict periodicity'' constraints are imposed on operations, meaning that an operation starts as soon as its period occurs. In this case, we give a schedulability condition for graphs of operations with precedence and periodicity constraints in the nonpreemptive case [20] .
We also formally defined the notion of latency [19] , which is more powerful, for the applications we are interested in, than the usual notion of ``deadline'', since the latter does not make it possible to impose directly a timing constraint on a pair of operations connected by at least one path, as is necessary for ``end-to-end constraints''. In order to study schedulability conditions for multiple latency constraints, we defined three relations between pairs of paths, such that for each pair a latency constraint is imposed on its extremities. Using these relations, called II , Z and X , we give a schedulability condition for graphs of operations with precedence and latency constraints in the nonpreemptive case. Then, by combining both previous results, we give a schedulability condition for graphs of operations with precedence, periodicity and latency constraints in the nonpreemptive case, using an important result which gives a relation between periodicity and latency. We also give a scheduling algorithm that is optimal in the sense that, if a schedule exists, the algorithm will find it.
Thanks to these results obtained in the monoprocessor (one operator) case, we study our problem of distribution and scheduling in the multiprocessor case (several operators) with more complex constraints than we did previously, i.e. than precedence and one latency constraint equal to a unique periodicity constraint. We proved that this problem is NP-hard for systems with precedence and periodicity constraints, and we proposed a heuristic which takes into account the communication times. We proved that operations whose periods are coprime cannot be scheduled on the same operator. We proved that this problem is NP-hard for systems with precedence and latency constraints, and we proposed a heuristic which takes into account the communication times. This heuristic uses the schedulability results obtained in the case of one operator concerning the three relations II , Z and X between pairs of operations on which latency constraints are defined. These latter results prove that the best way of scheduling operations is to avoid scheduling, between the first and the last operation of a latency constraint, operations which do not belong to this latency constraint. Finally, we proved that this problem is NP-hard for systems with precedence, periodicity and latency constraints, and we proposed a heuristic which takes into account the communication times. We proved that operations belonging to the same latency constraint must have the same period. A direct consequence is that the operations belonging to the same pair, or to pairs which are in relation II , Z or X , must have the same period. So, the heuristic may use the main ideas of the heuristic for the case of precedence and latency constraints and of the heuristic for the case of precedence and periodicity constraints. The performance of these three heuristics was compared to that of exact algorithms.
The numerical results show that the heuristics are definitely faster than the exact algorithms in all cases where the heuristics find a solution.
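The effect of strict periodicity on two nonpreemptive operations sharing one operator can be explored by brute force over one hyperperiod. The Python sketch below is an illustration under stated assumptions: time is discretized into integer units, the first operation starts at time 0, durations are at least one unit, and the function name is hypothetical. It searches for a start offset of the second operation such that the two operations never overlap, and returns `None` when no such offset exists.

```python
from math import gcd


def strictly_periodic_offset(period_a, duration_a, period_b, duration_b):
    """Find an offset for b (a starts at 0) avoiding any overlap, or None."""
    hyperperiod = period_a * period_b // gcd(period_a, period_b)
    # Time units occupied by a during one hyperperiod (steady state).
    busy_a = set()
    for start in range(0, hyperperiod, period_a):
        busy_a.update((start + t) % hyperperiod for t in range(duration_a))
    for offset in range(period_b):
        collision = False
        for start in range(offset, offset + hyperperiod, period_b):
            if any((start + t) % hyperperiod in busy_a
                   for t in range(duration_b)):
                collision = True
                break
        if not collision:
            return offset
    return None


shared_divisor = strictly_periodic_offset(4, 1, 6, 1)  # periods share factor 2
coprime = strictly_periodic_offset(3, 1, 4, 1)         # periods are coprime
too_long = strictly_periodic_offset(4, 1, 6, 2)        # durations too large
```

In this discretized setting, two operations with coprime periods always end up colliding somewhere in the hyperperiod, whatever the offset, whereas periods with a large enough common divisor leave room for both.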
The aforementioned scheduling problems only take into account periodic operations. Aperiodic operations issued from aperiodic events, usually related to control, must be handled online. Presently, we take them into account offline by integrating the controlflow into our dataflow model, which is well suited to distribution, and by maximizing the control effects. We study the relations between controlflow and dataflow in order to better exploit their respective advantages. Finally, we study the possibility of mixing offline approaches for periodic operations and online approaches for aperiodic operations.
In the case of an integrated circuit, the potential parallelism of the algorithm corresponds exactly to the actual parallelism of the circuit. However, this may lead to exceeding the required surface of an ASIC or the number of CLBs (Configurable Logic Blocks) of an FPGA, in which case some operations must be sequentially repeated several times in order to reuse them, thereby reducing the potential parallelism to an actual parallelism with fewer logic functions. But reducing the surface has a price in terms of time, and also, to a lesser extent, in terms of surface, due to the sequentialization itself (instead of parallelism) performed by the finite state machines (control units) necessary to implement the repetitions and the conditionings. We therefore seek a compromise between surface and performance. Because these problems are again NP-hard, we propose greedy and iterative heuristics in order to solve them [21] .
Finally, we plan to work on the unification of multiprocessor heuristics and integrated circuit heuristics in order to propose ``automatic hardware/software partitioning'' for codesign, instead of the usual manual one. The most difficult issue concerns the integration in the cost functions of the notion of ``flexibility'' which is crucial for the choice of software versus hardware. However, this optimization criterion is difficult to quantify because it mainly relies on user's expertise.
Automatic code generation
As soon as an implementation is chosen among all the possible ones, it is straightforward to automatically generate executable code through an ultimate graph transformation leading to a distributed realtime executive for the processors, and to a structural hardware description, e.g. synthesizable VHDL, for the specific integrated circuits.
For a multicomponent architecture, each operator (resp. each communicator) has to execute the sequence of operations (resp. communication operations) described in the implementation graph. Thus, this graph is translated into an ``executive graph'' [28] where new vertices and edges are added in order to manage the infinite and finite loops, the conditionings, and the interoperator data dependences corresponding to ``read'' and ``write'' when the communication medium is a RAM, or to ``send'' and ``receive'' when the communication medium is a SAM. Specific vertices, called ``pre'' and ``suc'', which manage semaphores, are added to each read, write, send and receive vertex in order to synchronize the execution of operations and of communication operations when they must share, in mutual exclusion, the same sequencer as well as the same data. These synchronizations ensure that the realtime execution will satisfy the partial order specified in the initial algorithm. Executive generation is proved to be deadlock free [25] , maintaining the properties, in terms of event ordering, shown thanks to the synchronous language semantics. This executive graph is directly transformed into a macrocode [26] which is independent of the processor. This macrocode is macroprocessed with libraries of ``executive kernels'', which depend on the processors and on the communication media, in order to produce as many source codes as there are processors. Each library is written in the language best adapted to the processors and the media, e.g. assembler or a high-level language like C. Finally, each produced source code is compiled in order to obtain distributed executable code satisfying the realtime constraints.
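The role of semaphore-based synchronization between a computation sequence and a communication sequence can be sketched with Python threads. This is only an analogy, not generated executive code: the two threads stand for an operator and a communicator, the function `run_executive` and the buffer are hypothetical, and reading ``pre'' as a semaphore wait and ``suc'' as a semaphore signal is an assumption of this sketch. The loop is finite so that the example terminates.

```python
import threading


def run_executive(iterations):
    """Run an operator and a communicator sharing one buffer."""
    buffer = [None]
    empty = threading.Semaphore(1)  # the buffer may be overwritten
    full = threading.Semaphore(0)   # the buffer holds a fresh value
    sent = []

    def operator():
        for i in range(iterations):
            value = i * i      # stands for an operation of the algorithm
            empty.acquire()    # "pre": wait until the previous value was sent
            buffer[0] = value
            full.release()     # "suc": tell the communicator data is ready

    def communicator():
        for _ in range(iterations):
            full.acquire()           # "pre": wait for data to send
            sent.append(buffer[0])   # stands for a send on the medium
            empty.release()          # "suc": the buffer can be reused

    threads = [threading.Thread(target=operator),
               threading.Thread(target=communicator)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return sent


sent_values = run_executive(5)
```

Whatever the interleaving chosen by the thread scheduler, the two semaphores force the values to be sent in the order in which they were produced, which mirrors how the synchronizations preserve the partial order of the algorithm.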
For an integrated circuit, because we associate with each operation and with each control unit an element of a synthesizable VHDL library, the executable code generation relies on the typical synthesis tools of integrated circuit CAD vendors like Synopsys or Cadence.
Fault tolerance
For the applications we are dealing with, failing to satisfy the realtime constraints may have catastrophic consequences, for example in terms of loss of human life or pollution. To handle faults that occur despite the formal verifications which allow safe design by construction, we propose to specify the level of fault the user accepts by adding redundant processors and communication media. We then extended our optimization heuristics in order to generate automatically the redundant operations and data dependences necessary to make these faults transparent. Presently, we only take into account ``fail silent'' faults. They are detected using ``watchdogs'', the duration of which depends on the durations of the operations and data transfers. We first obtained results in the case of processor faults only, i.e. when the communication media are assumed to be error free. Then we studied media faults in addition to processor faults.
We propose three kinds of heuristics to tolerate both kinds of faults. The first one tolerates a fixed number of arbitrary processor and link (point-to-point communication medium) faults. It is based on the software redundancy of operations. The second one tolerates a fixed number of arbitrary processor and bus (multipoint communication medium) faults. It is based on the active software redundancy of the operations and the passive redundancy of the communications, with the fragmentation into several packets of the data transferred on the buses. The third one tolerates a fixed number of arbitrary processor and communication medium (point-to-point or multipoint) faults. It is based on a quite different approach. This heuristic generates as many distributions and schedulings as there are different architecture configurations corresponding to the possible faults. Then, all the distributions and schedulings are merged together to finally obtain a resulting distribution and scheduling which tolerates all the faults.
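The principle of software redundancy against processor faults can be checked on a small example. The Python sketch below only covers fail-silent processor faults and ignores media faults, timing and watchdogs; the function `tolerates` and the replica table are hypothetical. It enumerates every set of `k` failed processors and verifies that each operation still has at least one replica on a surviving processor.

```python
from itertools import combinations


def tolerates(replicas, processors, k):
    """Check that every operation survives any k processor failures.

    replicas maps each operation to the processors hosting one of its copies.
    """
    for failed in combinations(processors, k):
        failed = set(failed)
        for hosts in replicas.values():
            if set(hosts) <= failed:
                return False  # every copy of this operation was lost
    return True


# Hypothetical distribution with two replicas per operation.
processors = ["P1", "P2", "P3"]
replicas = {"A": ["P1", "P2"], "B": ["P2", "P3"]}
one_fault_ok = tolerates(replicas, processors, 1)
two_faults_ok = tolerates(replicas, processors, 2)
```

With two replicas on distinct processors, any single processor fault is masked, while two simultaneous faults can remove both copies of an operation.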
Finally, we propose a heuristic for generating reliable distributions and schedulings. The software redundancy is used to maximize the reliability of the distribution and scheduling taking into account two criteria: the minimization of the latency (execution duration of the distributed and scheduled algorithm onto the architecture) and the maximization of the reliability of the processors and the communication media.
As soon as the redundant hardware is fully exploited, ``degraded modes'' are necessary. They are specified at the level of the algorithm graph by combining delays and conditionings.