## Section: New Results

### Language Based Fault-Tolerance

Participants : Pascal Fradet, Alain Girault, Gregor Goessler, Jean-Bernard Stefani, Martin Vassor.

#### Fault Ascription in Concurrent Systems

The failure of one component may entail a cascade of failures in other components; several components may also fail independently. In such cases, elucidating the exact scenario that led to the failure is a complex and tedious task that requires significant expertise.

The notion of causality *(did an event $e$ cause an event $e'$?)*
has been studied in many disciplines, including philosophy, logic,
statistics, and law. The definitions of causality studied in these
disciplines usually amount to variants of the counterfactual test
“$e$ is a cause of $e'$ if both $e$ and $e'$ have occurred, and in a
world that is as close as possible to the actual world but where $e$
does not occur, $e'$ does not occur either”. In computer science,
almost all definitions of logical causality — including the landmark
definition of [54] and its derivatives — rely
on a causal model. However, such a model may not be known, for
instance in the presence of black-box components. For such systems, we
have been developing a framework for blaming that helps us establish
the causal relationship between component failures and system
failures, given an observed system execution trace. The analysis is
based on a formalization of counterfactual
reasoning [6].
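As a rough illustration of the kind of counterfactual test underlying such a blaming analysis, the following Python sketch is ours and not the framework of [6]; all names (`replay`, `correct_behaviour`, the trace representation) are hypothetical. Given an observed trace and a predicate characterizing the system-level failure, it checks whether the faults of one component pass a naive counterfactual test.

```python
# Hypothetical sketch of a counterfactual blaming test on an observed trace.
# The names and data representation are illustrative only.

def is_necessary_cause(trace, component, system_failed, replay, correct_behaviour):
    """Naive counterfactual test: are the faults of `component` necessary
    for the observed system-level failure?

    trace             -- observed list of (component, event) pairs
    component         -- the component whose faults we want to blame
    system_failed     -- predicate on traces: did the system-level failure occur?
    replay            -- function rebuilding a consistent trace from a modified event log
    correct_behaviour -- function giving the event the component should have produced
    """
    if not system_failed(trace):
        return False  # nothing to explain: the effect did not occur

    # Counterfactual world: same log, but the suspect component follows its
    # specification instead of its observed (possibly faulty) events.
    counterfactual_log = [
        (c, correct_behaviour(c, e) if c == component else e)
        for (c, e) in trace
    ]
    counterfactual_trace = replay(counterfactual_log)

    # The component's faults are (naively) a cause if the system-level
    # failure disappears once they are corrected.
    return not system_failed(counterfactual_trace)
```

The actual framework of [6] is considerably more refined, in particular in how the counterfactual trace is reconstructed; the sketch only conveys the shape of the test.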

In [16] we have discussed several shortcomings
of existing approaches to counterfactual causality from the computer
science perspective, and sketched lines of work to try and overcome
these issues. In particular, research on counterfactual causality
analysis has been marked, since its early days, by a succession of
definitions of causality that are informally (in)validated against
human intuition on mostly simple examples, see
*e.g.*, [54], [53]. We call
this approach TEGAR, *textbook example guided analysis
refinement*. As pointed out in [48], it
suffers from the small number and incompleteness of the
examples available in the literature, and from the instability of the intuitive
judgments against which the definitions are validated. We have argued
that we need a formalization of counterfactual causality based on *first principles*, in the sense that causality definitions should
not be driven by individual examples but constructed from a set of
precisely specified requirements. Examples of such requirements are
robustness of causation under equivalence of models, and well-defined
behavior under abstraction and refinement. To the best of our
knowledge, none of the existing causality analysis techniques provides
sufficient guarantees in this regard.
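To make the flavour of such requirements concrete, one possible reading of the first one, in our notation rather than a published definition, is the following: writing $\mathit{Causes}_{M}(e')$ for the set of causes of an effect $e'$ identified in a model $M$,

$$ M_1 \simeq M_2 \;\Longrightarrow\; \mathit{Causes}_{M_1}(e') = \mathit{Causes}_{M_2}(e') $$

for whatever behavioural equivalence $\simeq$ is appropriate for the class of models at hand; analogous conditions can relate the causes computed in a model and in its abstractions or refinements.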

We are currently working on a revised version of our general semantic
framework for fault ascription in [50] that
satisfies a set of formally stated requirements, and on its
instantiation to acyclic models of computation, in order to compare
our approach with the standard definition of *actual causality*
proposed by Halpern and Pearl.
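For intuition only, here is a minimal sketch, in our own notation, of the basic counterfactual (“but-for”) test on an acyclic structural model, which is the starting point of Halpern and Pearl's definition; their full definition of actual causality adds contingency clauses that are omitted here.

```python
# Minimal sketch (our notation) of the "but-for" counterfactual test on an
# acyclic structural model; the contingency clauses of Halpern and Pearl's
# full definition of actual causality are omitted.

def evaluate(equations, exogenous, intervention=None):
    """Compute a valuation of an acyclic model.

    equations    -- dict: endogenous variable -> function of the current valuation,
                    listed in dependency (topological) order
    exogenous    -- dict: exogenous variable -> observed value
    intervention -- optional dict forcing some variables to fixed values
    """
    intervention = intervention or {}
    values = dict(exogenous)
    values.update(intervention)  # interventions override observed values
    for var, fn in equations.items():
        if var not in intervention:
            values[var] = fn(values)
    return values

def but_for_cause(equations, exogenous, cause, effect):
    """X = x is a but-for cause of the effect if the effect holds in the
    actual world and no longer holds when X is forced to another value."""
    var, val = cause
    actual = evaluate(equations, exogenous)
    if actual[var] != val or not effect(actual):
        return False
    counterfactual = evaluate(equations, exogenous, intervention={var: not val})
    return not effect(counterfactual)

# Toy example: the system fails if either of two independent faults occurs.
eqs = {"system_failure": lambda v: v["fault_A"] or v["fault_B"]}
obs = {"fault_A": True, "fault_B": False}
print(but_for_cause(eqs, obs, ("fault_A", True), lambda v: v["system_failure"]))  # True
```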

#### Tradeoff exploration between energy consumption and execution time

We have continued our work on multi-criteria scheduling, in two
directions. First, in the context of dynamic applications that are
launched and terminated on an embedded homogeneous multi-core chip,
under execution time and energy consumption constraints, we have
proposed a two-layer adaptive scheduling
method [14]. In the first layer, each
application (represented as a DAG of tasks) is scheduled statically on
subsets of cores: 2 cores, 3 cores, 4 cores, and so on. For each
subset size (2, 3, 4, ...), there may be one or several
topologies. For instance, for 2 or 3 cores there is only one
topology (a “line”), while for 4 cores there are three distinct
topologies (“line”, “square”, and “T shape”). Moreover, for each
topology, we generate statically several schedules, each one subject
to a different total energy consumption constraint, and consequently
with a different Worst-Case Reaction Time (WCRT). Coping with the
energy consumption constraints is achieved thanks to Dynamic Voltage
and Frequency Scaling (DVFS). In the second layer, we use these
pre-generated static schedules to reconfigure dynamically the
applications running on the multi-core each time a new application is
launched or an existing one is stopped. The goal of the second layer
is to perform a dynamic global optimization of the configuration, such
that each running application meets a pre-defined quality-of-service
constraint (translated into an upper bound on its WCRT) and such that
the total energy consumption is minimized. For this, we *(i)*
allocate a sufficient number of cores to each active application,
*(ii)* allocate the unassigned cores to the applications yielding
the largest gain in energy, and *(iii)* choose for each
application the best topology for its subset of cores (*i.e.*, better than
the default “line” topology). This is joint work with Ismail Assayad (U. Casablanca, Morocco), who visits the team regularly.
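The second layer can be read as a greedy optimization over the pre-computed schedules. The Python sketch below is a simplified illustration under our own assumptions and is not the published algorithm of [14]; in particular, it assumes a hypothetical table `schedules[app][n][topo]` giving, for each pre-generated static schedule, its WCRT and energy, and that every application has at least one feasible core count.

```python
# Simplified sketch of a second-layer reconfiguration step (our assumptions,
# not the exact algorithm of [14]).
#
# schedules[app][n][topo] = (wcrt, energy) for the pre-generated static
# schedule of `app` on n cores with topology `topo`.

def best_option(schedules, app, n, wcrt_bound):
    """Lowest-energy pre-computed schedule of `app` on n cores that meets
    its WCRT bound, as (energy, topology, wcrt), or None."""
    feasible = [
        (energy, topo, wcrt)
        for topo, (wcrt, energy) in schedules[app].get(n, {}).items()
        if wcrt <= wcrt_bound
    ]
    return min(feasible) if feasible else None

def reconfigure(schedules, apps, wcrt_bounds, total_cores):
    """(i) give each application the smallest core count meeting its WCRT bound,
    (ii) hand out remaining cores to the application with the largest energy gain,
    (iii) pick the best topology for each resulting core count."""
    alloc = {}
    for app in apps:
        # Assumes at least one feasible core count per application.
        alloc[app] = min(n for n in schedules[app]
                         if best_option(schedules, app, n, wcrt_bounds[app]) is not None)
    free = total_cores - sum(alloc.values())
    while free > 0:
        def gain(app):
            cur = best_option(schedules, app, alloc[app], wcrt_bounds[app])
            nxt = best_option(schedules, app, alloc[app] + 1, wcrt_bounds[app])
            return (cur[0] - nxt[0]) if (cur and nxt) else 0.0
        app = max(apps, key=gain)
        if gain(app) <= 0:
            break          # no application benefits from an extra core
        alloc[app] += 1
        free -= 1
    return {app: (alloc[app], best_option(schedules, app, alloc[app], wcrt_bounds[app]))
            for app in apps}
```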

Second, we have proposed a first-of-its-kind multi-criteria
scheduling heuristic for a DAG of tasks onto a homogeneous
multi-core chip, optimizing the execution time, the reliability, the
power consumption, and the temperature. Specifically, we have worked
on the static scheduling minimizing the execution time of the
application under the multiple constraints that the reliability, the
power consumption, and the temperature remain below some given
thresholds. There are multiple difficulties: *(i)* the
reliability is not an invariant measure w.r.t. time, which makes it
impossible to use backtrack-free scheduling algorithms such as list
scheduling [28]; to overcome this, we adopt instead the
Global System Failure Rate (GSFR) as a measure of the system's
reliability, which is invariant with time [46];
*(ii)* keeping the power consumption under a given threshold
requires lowering the voltage and frequency, but this has a negative
impact on both the execution time and the GSFR; keeping the GSFR
below a given threshold requires replicating the tasks on multiple
cores, but this has a negative impact on the execution time, the
power consumption, and the temperature; *(iii)* keeping
the temperature below a given threshold is even more difficult because
the temperature continues to increase even after the activity stops,
so each scheduling decision must be assessed not based on the current
state of the chip (*i.e.*, the temperature of each core) but on the state
of the chip at the end of the candidate task, and cooling slacks must
be inserted. We have proposed a multi-criteria scheduling heuristic
to address these challenges. It produces a static schedule of the
given application graph on the given architecture, such
that the GSFR, power, and temperature thresholds are satisfied, and
such that the execution time is minimized. We then combine our
heuristic with a variant of the $\epsilon$-constraint
method [52] in order to produce, for a given application
graph and a given architecture description, its entire Pareto front in
the 4D space (execution time, GSFR, power, temperature). This is joint work
with Athena Abdi and Hamid Zarandi from Amirkabir U., Iran, who visited the team
in 2016.
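To give an idea of how the $\epsilon$-constraint method drives the heuristic, the sketch below is our simplification: it assumes a hypothetical `schedule(graph, arch, gsfr_max, power_max, temp_max)` function that returns the minimal execution time achievable under the given thresholds (or `None` if infeasible), sweeps the three constrained criteria over grids of thresholds, and keeps the non-dominated points.

```python
import itertools

# Hypothetical epsilon-constraint sweep around the scheduling heuristic.
# `schedule` is assumed to minimize the execution time under the given GSFR,
# power, and temperature thresholds, returning None when infeasible.

def pareto_front(schedule, graph, arch, gsfr_grid, power_grid, temp_grid):
    points = []
    for gsfr_max, power_max, temp_max in itertools.product(gsfr_grid, power_grid, temp_grid):
        exec_time = schedule(graph, arch, gsfr_max, power_max, temp_max)
        if exec_time is not None:
            points.append((exec_time, gsfr_max, power_max, temp_max))
    # Keep only the non-dominated points in the 4D space
    # (execution time, GSFR, power, temperature), all to be minimized.
    return [p for p in points
            if not any(all(q[i] <= p[i] for i in range(4)) and q != p for q in points)]
```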

#### Concurrent flexible reversibility

Reversible concurrent models of computation natively provide what appear to be very fine-grained checkpoint and recovery capabilities. We have made this intuition precise by formally comparing a distributed algorithm for checkpointing and recovery based on causal information with the distributed backtracking algorithm that lies at the heart of our reversible higher-order pi-calculus. We have shown that (a variant of) the reversible higher-order calculus with explicit rollback can faithfully encode a distributed causal checkpoint and recovery algorithm. The reverse also holds, but only under precise conditions, which restrict the ability to roll back a computation to an identified checkpoint. This work has not yet been published.
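For intuition, the toy Python sketch below is our own illustration of the basic mechanism, not the pi-calculus encoding itself: every executed step is logged together with an undo action, and rolling back to a named checkpoint replays the undos in reverse order of execution (a sequential approximation of the reverse causal order used in the concurrent setting).

```python
# Toy sketch of checkpoint/rollback by logging undo actions (our illustration,
# not the reversible higher-order pi-calculus encoding).

class ReversibleLog:
    def __init__(self):
        self._log = []  # stack of (kind, tag, undo) entries, newest last

    def checkpoint(self, name):
        self._log.append(("checkpoint", name, lambda: None))

    def perform(self, action, undo):
        action()                                  # execute the forward step ...
        self._log.append(("step", None, undo))    # ... and remember how to undo it

    def rollback_to(self, name):
        # Undo the logged steps, newest first, until the named checkpoint.
        while self._log:
            kind, tag, undo = self._log.pop()
            if kind == "checkpoint" and tag == name:
                return
            undo()
        raise ValueError(f"unknown checkpoint: {name}")

# Usage: roll the state back to checkpoint "c0".
state = {"x": 0}
log = ReversibleLog()
log.checkpoint("c0")
log.perform(lambda: state.update(x=1), lambda: state.update(x=0))
log.rollback_to("c0")
assert state["x"] == 0
```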