## Section: New Results

### Language-Based Fault Tolerance

Participants : Pascal Fradet, Alain Girault, Yoann Geoffroy, Gregor Goessler, Jean-Bernard Stefani, Martin Vassor, Athena Abdi.

#### Fault Ascription in Concurrent Systems

The failure of one component may entail a cascade of failures in other components; several components may also fail independently. In such cases, elucidating the exact scenario that led to the failure is a complex and tedious task that requires significant expertise.

The notion of causality *(did an event $e$ cause an event $e'$?)*
has been studied in many disciplines, including philosophy, logic,
statistics, and law. The definitions of causality studied in these
disciplines usually amount to variants of the counterfactual test:
“$e$ is a cause of $e'$ if both $e$ and $e'$ have occurred, and in a
world that is as close as possible to the actual world but where $e$
does not occur, $e'$ does not occur either”. In computer science,
almost all definitions of logical causality — including the landmark
definition of [63] and its derivatives — rely
on a causal model that may not be known, for instance in presence of
black-box components. For such systems, we have been developing a
framework for blaming that helps us establish the causal relationship
between component failures and system failures, given an observed
system execution trace. The analysis is based on a formalization of
counterfactual reasoning [7].
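As a toy illustration of the counterfactual test over observed traces, consider the following minimal sketch. The `replay` function, the event names, and the list-based trace representation are all illustrative assumptions; this is not the formalization of [7].

```python
# Minimal sketch of the counterfactual test over observed traces.
# `replay`, the event names, and the list-based trace representation
# are all illustrative; this is not the framework of [7].

def is_cause(e, e_prime, inputs, replay):
    """e is a cause of e_prime if both occurred in the actual trace,
    and e_prime no longer occurs in the closest counterfactual world,
    i.e., when the system is replayed without e."""
    trace = replay(inputs)
    if e not in trace or e_prime not in trace:
        return False  # the counterfactual test requires both to occur
    counterfactual = replay([x for x in inputs if x != e])
    return e_prime not in counterfactual

# Toy system: a system-level "failure" occurs whenever a component
# "fault" is among the inputs.
replay = lambda ins: ins + (["failure"] if "fault" in ins else [])

print(is_cause("fault", "failure", ["fault"], replay))           # True
print(is_cause("noise", "failure", ["noise", "fault"], replay))  # False
```

The second call returns `False` because, in the world where `noise` is removed, the failure still occurs: `noise` fails the counterfactual test.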

In his PhD thesis, Yoann Geoffroy proposed a generalization of our fault ascription technique to systems composed of black-box and white-box components; for the latter, a faithful behavioral model is given but no specification. The approach leverages results from game theory and discrete controller synthesis to define several notions of causality.

We are currently working on an instantiation of our general semantic
framework for fault ascription in [60] to
acyclic models of computation, in order to compare our approach with
the standard definition of *actual causality* proposed by Halpern
and Pearl.

#### Tradeoff exploration between energy consumption and execution time

We have continued our work on multi-criteria scheduling, in two
directions. First, in the context of dynamic applications that are
launched and terminated on an embedded homogeneous multi-core chip,
under execution time and energy consumption constraints, we have
proposed a two-layer adaptive scheduling method. In the first layer,
each application (represented as a DAG of tasks) is scheduled
statically on subsets of cores: 2 cores, 3 cores, 4 cores, and so
on. For each subset size (2, 3, 4, ...), there may be one topology
or several. For instance, for 2 or 3 cores
there is only one topology (a “line”), while for 4 cores there are
three distinct topologies (“line”, “square”, and
“T shape”). Moreover, for each topology, we generate statically
several schedules, each one subject to a different total energy
consumption constraint, and consequently with a different Worst-Case
Reaction Time (WCRT). Coping with the energy consumption constraints
is achieved thanks to Dynamic Voltage and Frequency Scaling (DVFS). In
the second layer, we use these pre-generated static schedules to
dynamically reconfigure the applications running on the multi-core chip
each time a new application is launched or an existing one is
stopped. The goal of the second layer is to perform a dynamic global
optimization of the configuration, such that each running application
meets a pre-defined quality-of-service constraint (translated into an
upper bound on its WCRT) and such that the total energy consumption is
minimized. To this end, we *(i)* allocate a sufficient number of
cores to each active application, *(ii)* allocate the unassigned
cores to the applications yielding the largest gain in energy, and
*(iii)* choose for each application the best topology for its
subset of cores (*i.e.*, better than the default “line”
topology). This is joint work with Ismail Assayad (U. Casablanca, Morocco), who
visited the team in September 2015.
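The second-layer selection can be sketched as a small combinatorial optimization. Everything below is a hypothetical toy instance (the schedule tables, bounds, and the brute-force search are illustrative assumptions, not the actual reconfiguration algorithm):

```python
from itertools import product

# Hypothetical pre-generated static schedules, one list per application:
# each entry is (cores, topology, energy, WCRT). Values are illustrative.
apps = {
    "A": [(2, "line", 5.0, 40), (4, "square", 8.0, 20), (4, "line", 7.0, 25)],
    "B": [(2, "line", 4.0, 30), (3, "line", 6.0, 18)],
}
wcrt_bound = {"A": 30, "B": 30}   # quality-of-service constraints
total_cores = 8                   # size of the multi-core chip

def reconfigure(apps, wcrt_bound, total_cores):
    """Pick one pre-computed schedule per running application so that
    every WCRT bound holds, the cores fit on the chip, and the total
    energy is minimal (brute force over all combinations)."""
    best = None
    for choice in product(*apps.values()):
        if sum(s[0] for s in choice) > total_cores:
            continue  # not enough cores on the chip
        if any(s[3] > wcrt_bound[a] for a, s in zip(apps, choice)):
            continue  # a QoS constraint is violated
        energy = sum(s[2] for s in choice)
        if best is None or energy < best[0]:
            best = (energy, dict(zip(apps, choice)))
    return best

energy, config = reconfigure(apps, wcrt_bound, total_cores)
print(energy, config)
# 11.0 {'A': (4, 'line', 7.0, 25), 'B': (2, 'line', 4.0, 30)}
```

Note how the cheapest schedule for A (the 2-core “line”) is rejected because its WCRT of 40 violates A's bound of 30.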

Second, in the context of a static application (again represented as a
DAG of tasks) running on a homogeneous multi-core chip, we have
worked on static scheduling minimizing the WCRT of the application
under the multiple constraints that the reliability, the power
consumption, and the temperature remain below some given thresholds.
There are multiple difficulties: *(i)* the reliability is not an
invariant measure w.r.t. time, which makes it impossible to use
backtrack-free scheduling algorithms such as list
scheduling [33]; to overcome this, we adopt instead the
Global System Failure Rate (GSFR) as a measure of the system's
reliability, which is invariant with time [57];
*(ii)* keeping the power consumption under a given threshold
requires lowering the voltage and frequency, but this has a negative
impact on both the WCRT and the GSFR; keeping the GSFR below a
given threshold requires replicating the tasks on multiple cores, but
this has a negative impact on the WCRT, the power consumption,
and the temperature; *(iii)* keeping the temperature below a
given threshold is even more difficult because the temperature
continues to increase even after the activity stops, so each
scheduling decision must be assessed not based on the current state of
the chip (*i.e.*, the temperature of each core) but on the state of the
chip at the end of the candidate task, and cooling slacks must be
inserted. We have proposed a multi-criteria scheduling heuristic to
address these challenges. It produces a static schedule for a given
application graph on a given architecture, such that
the GSFR, power, and temperature thresholds are satisfied, and such
that the execution time is minimized. We then combine our heuristic
with a variant of the $\epsilon $-constraint
method [62] in order to produce, for a given application
graph and a given architecture description, its entire Pareto front in
the 4D space (exec. time, GSFR, power, temp.). This is joint work
with Athena Abdi and Hamid Zarandi (Amirkabir U., Iran), who visited the team
in 2016.
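The epsilon-constraint idea can be sketched on a 2D slice of the objective space: minimize the execution time while the power stays below a sweeping threshold. The candidate schedules, the `best_time_under` oracle, and the sweep values below are illustrative assumptions; the actual method explores the full 4D space (exec. time, GSFR, power, temperature).

```python
# Sketch of the epsilon-constraint method on a 2D slice (WCRT vs. power).
# Schedules and sweep values are illustrative, not real benchmark data.

def best_time_under(power_cap, schedules):
    """Minimal WCRT among the schedules whose power fits under the cap
    (stands in for one run of the single-objective scheduling heuristic)."""
    feasible = [wcrt for (wcrt, power) in schedules if power <= power_cap]
    return min(feasible) if feasible else None

def pareto_front(schedules, caps):
    """Sweep the epsilon threshold and keep only strict improvements."""
    front = []
    for cap in sorted(caps):
        wcrt = best_time_under(cap, schedules)
        if wcrt is not None and (not front or wcrt < front[-1][1]):
            front.append((cap, wcrt))
    return front

schedules = [(10, 9.0), (12, 7.0), (15, 5.0), (20, 4.0)]  # (WCRT, power)
print(pareto_front(schedules, caps=[4.0, 5.0, 7.0, 9.0]))
# [(4.0, 20), (5.0, 15), (7.0, 12), (9.0, 10)]
```

Each point of the returned front trades a higher power budget for a lower execution time, which is exactly the Pareto-optimality structure the 4D exploration produces.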

#### Automatic transformations for fault-tolerant circuits

In the past years, we have studied the implementation of specific
fault tolerance techniques in real-time embedded systems using program
transformation [1].
We are now investigating the use of automatic transformations to
ensure fault-tolerance properties in digital circuits. To this aim, we
consider program transformations for hardware description languages
(HDL).
We consider both single-event upsets (SEU) and single-event transients
(SET) and fault models of the form *“at most 1 SEU or SET
within $n$ clock cycles”*.

We have expressed several variants of triple modular redundancy (TMR)
as program transformations. We have proposed a verification-based
approach to minimize the number of voters in
TMR [25]. Our technique guarantees that the
resulting circuit *(i)* is fault tolerant to the soft-errors
defined by the fault model and *(ii)* is functionally equivalent
to the initial one. Our approach operates at the logic level and takes
into account the input and output interface specifications of the
circuit. Its implementation makes use of graph traversal algorithms,
fixed-point iterations, and BDDs. Experimental results on the ITC’99
benchmark suite indicate that our method significantly decreases the
number of inserted voters, which entails a hardware reduction of up to
$55\%$ and a clock frequency increase of up to $35\%$ compared to full
TMR. As our experiments show, replacing the SEU fault model with the
stricter SET fault model has only a minor impact on the number of
removed voters. On the other hand, BDD-based modeling of SET effects
is a more complex task than modeling an SEU as a bit-flip; we propose
solutions for this task and explain the nature of the encountered
problems. Finally, we address scalability issues arising from formal
verification with approximations and assess their efficiency and
precision.
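The basic TMR transformation can be sketched at the logic level as follows. This is a deliberately minimal illustration (a stateless block and a simulated bit-flip); the technique of [25] operates on sequential circuits and minimizes the inserted voters.

```python
# Minimal logic-level sketch of TMR: triplicate a combinational block
# and vote on its outputs. Illustrative only; the technique in [25]
# works on sequential circuits and minimizes the number of voters.

def majority(a, b, c):
    """2-out-of-3 majority voter on single bits."""
    return (a & b) | (a & c) | (b & c)

def tmr(block, inputs, flip=None):
    """Run three copies of `block`; `flip` optionally injects an SEU
    (a single bit-flip) into the output of one copy."""
    outs = [block(*inputs) for _ in range(3)]
    if flip is not None:
        outs[flip] ^= 1
    return majority(*outs)

xor_block = lambda x, y: x ^ y

# A single upset in any one copy is masked by the voter:
print(all(tmr(xor_block, (1, 0), flip=i) == 1 for i in range(3)))  # True
```

The fault model *“at most 1 SEU within $n$ cycles”* is what makes the 2-out-of-3 vote sufficient: two of the three copies are always upset-free when the vote is taken.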

#### Concurrent flexible reversibility

Reversible concurrent models of computation natively provide what appears to be very fine-grained checkpoint and recovery capabilities. We have made this intuition precise by formally comparing a distributed algorithm for checkpointing and recovery based on causal information with the distributed backtracking algorithm that lies at the heart of our reversible higher-order pi-calculus. We have shown that (a variant of) the reversible higher-order calculus with explicit rollback can faithfully encode a distributed causal checkpoint and recovery algorithm. The converse also holds, but under precise conditions that restrict the ability to roll back a computation to an identified checkpoint. This work has not yet been published.
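The key invariant of causal rollback can be sketched in a few lines: undoing an event must also undo every event that causally depends on it, and nothing else. The event names and the explicit dependency map below are illustrative assumptions; in the reversible calculus this causal information is tracked natively by the semantics.

```python
# Sketch of causal rollback: undoing an event also undoes its entire
# causal future, mirroring checkpoint/recovery based on causal
# information. The explicit dependency map is an illustrative stand-in
# for what the reversible calculus tracks natively.

def rollback(target, log, deps):
    """log: events in execution order; deps: event -> set of its causes.
    Returns the log after rolling back `target` and its causal future."""
    doomed = {target}
    changed = True
    while changed:  # propagate along causal dependencies to a fixpoint
        changed = False
        for e in log:
            if e not in doomed and deps.get(e, set()) & doomed:
                doomed.add(e)
                changed = True
    return [e for e in log if e not in doomed]

log = ["a", "b", "c", "d"]
deps = {"b": {"a"}, "c": {"b"}, "d": set()}  # d is causally independent
print(rollback("a", log, deps))  # ['d']
```

Rolling back `a` transitively undoes `b` and `c`, while the causally independent `d` is untouched, which is exactly the fine-grained behavior that distinguishes causal recovery from a global restart.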