## Section: New Results

### Performance Evaluation

**Participants:** Yann Busnel, Yves Mocquard, Bruno Sericola, Gerardo Rubino

**Correlation estimation between distributed massive streams.**
The real-time analysis of massive data streams is of utmost importance
in data-intensive applications that need to detect, as fast and as
efficiently as possible (in terms of computation and memory space),
any correlation between their inputs or any deviation from some
expected nominal behavior. The IoT infrastructure can be used for
monitoring events or changes in structural conditions that can
compromise safety and increase risk. It is thus a recurrent and
crucial issue to determine whether huge data streams, received at
monitored devices, are correlated or not, as this may reveal the presence
of attacks. In [14] we propose a metric,
called *Codeviation*, that makes it possible to evaluate the correlation
between distributed massive streams. This metric is inspired by
classical material in statistics and probability theory, and as such
captures how observed quantities change together, and in
which proportion. We then propose to estimate the codeviation in the
data stream model. In this model, functions are estimated on a huge
sequence of data items, in an online fashion, and with a very small
amount of memory with respect to both the size of the input stream and
the domain from which data items are drawn. We then generalize
our approach by presenting a new metric, the *Sketch-$\star$
metric*, which allows us to define a distance between updatable
summaries of large data streams. An important feature of the
*Sketch-$\star$ metric* is that, given a measure on the entire
initial data streams, the *Sketch-$\star$ metric* preserves the
axioms of the latter measure on the sketch. We also conducted
extensive experiments on both synthetic traces and real data sets,
which allowed us to validate the robustness and accuracy of our metrics.
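
By way of illustration, the following minimal sketch computes a covariance-style codeviation over two streams in a single pass with constant memory. The function name and the centralized, exact formulation are ours and only convey the intuition; the estimator of [14] operates on compact sketch summaries of distributed streams.

```python
# Illustrative only: a single-pass, constant-memory, covariance-style
# "codeviation" between two streams. The actual estimator of [14] works on
# sketch summaries of distributed streams rather than exact sums.

def codeviation(stream_x, stream_y):
    n = 0
    sum_x = sum_y = sum_xy = 0.0
    for x, y in zip(stream_x, stream_y):
        n += 1
        sum_x += x
        sum_y += y
        sum_xy += x * y
    if n == 0:
        return 0.0
    # Empirical analogue of E[XY] - E[X] E[Y]
    return sum_xy / n - (sum_x / n) * (sum_y / n)

# Two streams that move together yield a positive codeviation.
print(codeviation([1, 2, 3, 4, 5], [2, 4, 5, 8, 10]))
```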

**Stream processing systems.** Stream processing systems are
today gaining momentum as tools to perform analytics on continuous
data streams. Their ability to produce analysis results with
sub-second latencies, coupled with their scalability, makes them the
preferred choice for many big data companies.

A stream processing application is commonly modeled as a directed acyclic graph in which the nodes are data operators, interconnected by streams of tuples carrying the data to be analyzed (the directed edges, or arcs). Scalability is usually attained at the deployment phase, where each data operator can be parallelized using multiple instances, each of which handles a subset of the tuples conveyed by the operator's ingoing stream. Balancing the load among the instances of a parallel operator is important, as it yields better resource utilization and thus larger throughput and reduced tuple processing latencies.

*Shuffle grouping* is a technique used by stream processing
frameworks to share input load among parallel instances of stateless
operators. With shuffle grouping each tuple of a stream can be
assigned to any available operator instance, independently from any
previous assignment. A common approach to implement shuffle grouping
is to adopt a Round-Robin policy, a simple solution that fares well as
long as the tuple execution time is almost the same for all the
tuples. However, such an assumption rarely holds in real cases where
execution time strongly depends on tuple content. As a consequence,
parallel stateless operators within stream processing applications may
experience unpredictable unbalance that, in the end, causes
undesirable increase in tuple completion times.
In [61] we propose Online Shuffle Grouping
(OSG), a novel approach to shuffle grouping aimed at reducing the
overall tuple completion time. OSG estimates the execution time of
each tuple, enabling a proactive and online scheduling of input load
to the target operator instances. Sketches are used to efficiently
store the otherwise large amount of information required to schedule
incoming load. We provide a probabilistic analysis and illustrate,
through both simulations and a running prototype, its impact on stream
processing applications.
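
As a rough illustration of the idea, the sketch below combines a Count-Min-style structure (to estimate per-key execution times in constant memory) with a least-loaded assignment of incoming tuples. All names and parameters are ours; this is not the OSG implementation of [61].

```python
import hashlib

class ExecTimeSketch:
    """Count-Min-style structure keeping (count, total execution time) per
    hashed tuple key, so a per-key mean execution time can be estimated in
    constant memory. Purely illustrative; not the data structure of [61]."""

    def __init__(self, width=1024, depth=4):
        self.width, self.depth = width, depth
        self.counts = [[0] * width for _ in range(depth)]
        self.times = [[0.0] * width for _ in range(depth)]

    def _cells(self, key):
        for row in range(self.depth):
            digest = hashlib.blake2b(f"{row}:{key}".encode(), digest_size=8).digest()
            yield row, int.from_bytes(digest, "big") % self.width

    def update(self, key, exec_time):
        for row, col in self._cells(key):
            self.counts[row][col] += 1
            self.times[row][col] += exec_time

    def estimate(self, key, default=1.0):
        means = [self.times[r][c] / self.counts[r][c]
                 for r, c in self._cells(key) if self.counts[r][c] > 0]
        return min(means) if means else default


def shuffle_group(tuple_keys, n_instances, sketch):
    """Proactive shuffle grouping: route each tuple to the instance with the
    smallest estimated pending work, using the sketched execution times."""
    pending = [0.0] * n_instances
    assignment = []
    for key in tuple_keys:
        target = min(range(n_instances), key=pending.__getitem__)
        pending[target] += sketch.estimate(key)
        assignment.append(target)
    return assignment


sketch = ExecTimeSketch()
sketch.update("heavy", 10.0)
sketch.update("light", 1.0)
print(shuffle_group(["heavy", "light", "light", "heavy"], 2, sketch))
```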

*Grand Challenge.* Since 2011, the ACM International Conference
on Distributed Event-based Systems (DEBS) has run the Grand Challenge
series to increase the focus on these systems as well as to provide
common benchmarks to evaluate and compare them. The ACM DEBS 2017
Grand Challenge focused on (soft) real-time anomaly detection in
manufacturing equipment. To handle continuous monitoring, each machine
is fitted with a vast array of sensors, either digital or analog.
These sensors provide periodic measurements, which are sent to a
monitoring base station. The base station thus receives a large collection
of observations. Analyzing this very-high-rate, and potentially massive,
stream of events in an efficient and accurate way is the core of the
Grand Challenge. Indeed, the analysis of such a massive amount of sensor
readings requires an on-line analytics pipeline that deals with linked
data, clustering, as well as Markov model training
and querying. The FlinkMan system [62] proposes
a solution to the 2017 Grand Challenge, making use of a publicly
available streaming engine and thus offering a generic solution that
is not specifically tailored to this or any other challenge. We offer an
efficient solution that maximally utilizes the available cores, balances
the load among them, and avoids, to the extent possible, tasks that are
only indirectly related to the analysis, such as garbage collection.
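
To give a flavor of such a pipeline, the toy sketch below chains a nearest-centroid clustering of sensor readings with a transition-count Markov model over the cluster labels, so that rarely observed transitions can be flagged. It is a simplified illustration under our own assumptions, not the FlinkMan solution of [62].

```python
from collections import defaultdict

def nearest_centroid(value, centroids):
    """Assign a reading to the closest centroid (stand-in for the clustering step)."""
    return min(range(len(centroids)), key=lambda i: abs(value - centroids[i]))

def train_transitions(readings, centroids):
    """Count transitions between cluster labels to train a Markov model."""
    counts = defaultdict(lambda: defaultdict(int))
    labels = [nearest_centroid(v, centroids) for v in readings]
    for a, b in zip(labels, labels[1:]):
        counts[a][b] += 1
    return counts

def transition_probability(counts, a, b):
    total = sum(counts[a].values())
    return counts[a][b] / total if total else 0.0

# A transition between clusters that was never observed during training
# has probability 0 and may be flagged as anomalous.
centroids = [0.0, 5.0, 10.0]
readings = [0.1, 0.2, 5.1, 4.9, 0.3, 5.2, 9.8]
model = train_transitions(readings[:-1], centroids)
p = transition_probability(model, nearest_centroid(5.2, centroids),
                           nearest_centroid(9.8, centroids))
print("transition probability:", p)
```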

*Health big data processing.* Sharing and efficiently exploiting
Health Big Data (HBD) requires tackling great challenges: data protection
and governance, which must take into account legal, ethical and deontological
aspects so as to enable a trusted, transparent and win-win relationship
between researchers, citizens and data providers; lack of
interoperability, since data are compartmentalized and thus syntactically
and semantically heterogeneous; and variable data quality, with a great
impact on data management and statistical analysis. The objective of
the INSHARE project [41] is to explore,
through an experimental proof of concept, how recent technologies
could overcome such issues. It aims at demonstrating the feasibility
and the added value of an IT platform based on a clinical data warehouse
(CDW), dedicated to collaborative HBD sharing for medical research.

The consortium includes 6 data providers: 2 academic hospitals, the SNIIRAM (the French national reimbursement database) and 3 national or regional registries. The platform is designed following a three-step approach: (1) to analyze use cases, needs and requirements, (2) to define data sharing governance and secure access to the platform, (3) to define the platform specifications. Three use cases (healthcare trajectory analysis, epidemiological registry enrichment, signal detection), corresponding to five studies and using eleven data sources, were analyzed to design the platform. The governance was derived from the SCANNER model and adapted to data sharing. As a result, the platform architecture integrates the following tools and services: data repository and hosting, semantic integration services, data processing, aggregate computing, data quality and integrity monitoring, ID linking, multi-source query builder, visualization and data export services, data governance, study management service and security, including data watermarking.

**Throughput prediction in cellular networks.** Downlink data
rates can vary significantly in cellular networks, with a potentially
non-negligible effect on the user experience. Content providers
address this problem by using different representations (*e.g.*,
picture resolution, video resolution and rate) of the same content and
by switching among these based on measurements collected during the
connection. If it were possible to know the achievable data rate
before the connection establishment, content providers could choose
the most appropriate representation from the very beginning. We have
conducted a measurement campaign involving 60 users connected to a
production network in France, to determine whether it is possible to
predict the achievable data rate using measurements collected, before
establishing the connection to the content provider, on the operator’s
network and on the mobile node. We show that it is indeed possible to
exploit these measurements to predict the achievable data rate with
reasonable accuracy [53].
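
As a purely hypothetical sketch of how such a predictor could be trained, the example below fits a random-forest regressor on made-up pre-connection features. The feature set, the data, and the model choice are ours and are not taken from the study in [53].

```python
# Hypothetical sketch: predict the downlink data rate from measurements
# available before the connection is established. Features and data are
# invented for illustration only.
from sklearn.ensemble import RandomForestRegressor

# Each row: pre-connection measurements (signal strength in dBm, signal
# quality in dB, a cell-load indicator); target: achieved rate in Mb/s.
X_train = [[-85, -10, 0.3], [-95, -14, 0.7], [-75, -8, 0.2], [-100, -16, 0.9]]
y_train = [24.0, 6.5, 38.0, 2.0]

model = RandomForestRegressor(n_estimators=50, random_state=0)
model.fit(X_train, y_train)

# Predict the achievable rate for a new user before the content connection,
# so the provider can pick the initial representation (e.g., video bitrate).
print(model.predict([[-90, -12, 0.5]]))
```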

**Population protocol model.** We consider in
[50] a large system populated by $n$ anonymous
nodes that communicate through asynchronous and pairwise
interactions. The aim of these interactions is, for each node, to
converge toward a global property of the system that depends on the
initial state of the nodes. We focus on both the counting and
proportion problems. We show that for any $\delta \in (0,1)$, the
number of interactions needed per node to converge is
$O(\ln(n/\delta))$ with probability at least $1-\delta$. We also prove
that each node can determine, with high probability, the
proportion of nodes that initially started in a given state, without
knowing the number of nodes in the system. This work provides a
precise analysis of the convergence bounds, and shows that using the
4-norm is very effective for deriving useful bounds.
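
For intuition, here is a small simulation of an averaging-style pairwise interaction protocol for the proportion problem. The averaging rule and all parameters are a simplification of ours; the exact protocol analyzed in [50] may differ.

```python
import random

def simulate_proportion(n=1000, p=0.3, interactions_per_node=50, seed=1):
    """Toy averaging protocol: each node starts at 1 (state A) or 0 (state B);
    at each asynchronous interaction a random pair replaces both values by
    their average. Every node's value converges to the global proportion of
    nodes that started in state A, without any node knowing n."""
    rng = random.Random(seed)
    values = [1.0 if rng.random() < p else 0.0 for _ in range(n)]
    true_proportion = sum(values) / n
    for _ in range(interactions_per_node * n):
        i, j = rng.randrange(n), rng.randrange(n)
        if i != j:
            avg = (values[i] + values[j]) / 2
            values[i] = values[j] = avg
    return true_proportion, values[0]  # any single node's estimate

print(simulate_proportion())
```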

The context of [71] is the well-studied
dissemination of information in large-scale distributed networks
through pairwise interactions. This problem, originally called
*rumor mongering* and later *rumor spreading*, has mainly
been investigated in the synchronous model, which relies on the
assumption that all the nodes of the network act in synchrony, that
is, at each round of the protocol, each node is allowed to contact a
random neighbor. In this paper, we drop this assumption on the
grounds that it is not realistic in large-scale systems. We thus
consider the asynchronous variant, where, at random times, nodes
successively interact in pairs, exchanging their information about the
rumor. In a previous paper, we performed a study of the total number
of interactions needed for all the nodes of the network to discover
the rumor. While most of the existing results involve huge constants
that do not allow us to compare different protocols, we provided a
thorough analysis of the distribution of this total number of
interactions together with its asymptotic behavior. In this paper we
extend this discrete-time analysis by solving a conjecture proposed
previously and we consider the continuous-time case, where a Poisson
process is associated with each node to determine the instants at which
interactions occur. The rumor spreading time is thus more realistic
since it is the time needed for all the nodes of the network to
discover the rumor. Once again, since most of the existing results
involve huge constants, we provide a tight bound on, and an equivalent of,
the complementary distribution of the rumor spreading time. We also give
the exact asymptotic behavior of the complementary distribution of the
rumor spreading time around its expected value when the number of
nodes tends to infinity.
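
The continuous-time setting can be illustrated by the following toy simulation, in which the superposition of the per-node Poisson clocks is sampled directly. It is only a sketch under our own simplifying assumptions (push-pull exchange, uniform choice of the contacted node), not the analytical model of [71].

```python
import random

def rumor_spreading_time(n=1000, rate=1.0, seed=2):
    """Continuous-time toy simulation: each node carries a Poisson clock of the
    given rate; when its clock rings it contacts a uniformly random node and
    the rumor is exchanged in both directions. Returns the time at which all
    n nodes know the rumor."""
    rng = random.Random(seed)
    informed = [False] * n
    informed[0] = True
    informed_count, t = 1, 0.0
    while informed_count < n:
        # Superposition of n Poisson clocks: next interaction after Exp(n * rate).
        t += rng.expovariate(n * rate)
        i, j = rng.randrange(n), rng.randrange(n)
        if i != j and informed[i] != informed[j]:
            informed[i] = informed[j] = True
            informed_count += 1
    return t

print(rumor_spreading_time())
```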

**Transient analysis.**
Last, in two keynotes ([35] and
[34]), we described part of our previous
analytical results concerning the transient behavior of
well-structured Markov processes, mainly on performance models
(queueing systems), and we presented recent new results that extend
those initial findings. The heart of the novelty lies in an extension
of the concept of duality proposed by Anderson
in [73], which we call the pseudo-dual. The dual of a
stochastic process needs strong monotonicity conditions to exist. Our
proposed pseudo-dual always exists and is defined directly on a linear
system of differential equations with constant coefficients, which can
be, in particular, the system of Chapman-Kolmogorov equations
corresponding to a Markov process, but not necessarily. This makes it
possible, for instance, to prove the validity of closed-form expressions
for the transient distribution of a Markov process in cases where the dual
does not exist. The keynote [35] was presented
to an audience oriented toward differential equations and dynamical
systems; [34] had a more modeling-oriented flavor.
A paper is under preparation with the technical details.
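
For reference, the pseudo-dual is defined directly on a linear constant-coefficient system such as the Chapman-Kolmogorov (forward) equations of a Markov process with infinitesimal generator $Q$. The display below simply recalls that classical system; it does not describe the pseudo-dual construction itself.

```latex
% Forward Chapman-Kolmogorov equations of a Markov process with generator Q:
% a linear system of differential equations with constant coefficients.
\[
  P'(t) = P(t)\,Q, \qquad P(0) = I,
  \qquad \text{where } P_{ij}(t) = \Pr\{X_t = j \mid X_0 = i\}.
\]
```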