## Section: New Results

### Semi and non-parametric methods

#### Robust estimation for extremes

Participants : Clément Albert, Stéphane Girard.

**Joint work with:** M. Stehlik (Johannes Kepler Universitat Linz, Austria and Universidad de Valparaiso, Chile)
and A. Dutfoy (EDF R&D).

In the PhD thesis of Clément Albert (funded by EDF), we study the sensitivity of extreme-value methods to small changes in the data and we investigate their extrapolation ability [36], [37]. To reduce this sensitivity, robust methods are needed and we proposed a novel method of heavy tails estimation based on a transformed score (the t-score). Based on a new score moment method, we derive the t-Hill estimator, which estimates the extreme value index of a distribution function with regularly varying tail. t-Hill estimator is distribution sensitive, thus it differs in e.g. Pareto and log-gamma case. Here, we study both forms of the estimator, i.e. t-Hill and t-lgHill. For both estimators we prove weak consistency in moving average settings as well as the asymptotic normality of t-lgHill estimator in the i.i.d. setting. In cases of contamination with heavier tails than the tail of original sample, t-Hill outperforms several robust tail estimators, especially in small sample situations. A simulation study emphasizes the fact that the level of contamination is playing a crucial role. We illustrate the developed methodology on a small sample data set of stake measurements from Guanaco glacier in Chile. This methodology is adapted to bounded distribution tails in [26] with an application to extreme snow loads in Slovakia.

#### Conditional extremal events

Participant : Stéphane Girard.

**Joint work with:** L. Gardes (Univ. Strasbourg) and J. Elmethni (Univ. Paris 5)

The goal of the PhD theses of Alexandre Lekina and Jonathan El Methni was to contribute to
the development of theoretical and algorithmic models to tackle
conditional extreme value analysis, *ie* the situation where
some covariate information $X$ is recorded simultaneously with a
quantity of interest $Y$. In such a case, the tail heaviness of
$Y$ depends on $X$, and thus the tail index as well as the extreme
quantiles are also functions of the covariate. We combine
nonparametric smoothing techniques [67] with
extreme-value methods in order to obtain efficient estimators of
the conditional tail index and conditional extreme quantiles [61].

#### Estimation of extreme risk measures

Participant : Stéphane Girard.

**Joint work with:** A. Daouia (Univ. Toulouse), L. Gardes
(Univ. Strasbourg), J. Elmethni (Univ. Paris 5) and G. Stupfler (Univ. Nottingham, UK).

One of the most popular risk measures is the Value-at-Risk (VaR) introduced in the 1990's. In statistical terms, the VaR at level $\alpha \in (0,1)$ corresponds to the upper $\alpha $-quantile of the loss distribution. The Value-at-Risk however suffers from several weaknesses. First, it provides us only with a pointwise information: VaR($\alpha $) does not take into consideration what the loss will be beyond this quantile. Second, random loss variables with light-tailed distributions or heavy-tailed distributions may have the same Value-at-Risk. Finally, Value-at-Risk is not a coherent risk measure since it is not subadditive in general. A first coherent alternative risk measure is the Conditional Tail Expectation (CTE), also known as Tail-Value-at-Risk, Tail Conditional Expectation or Expected Shortfall in case of a continuous loss distribution. The CTE is defined as the expected loss given that the loss lies above the upper $\alpha $-quantile of the loss distribution. This risk measure thus takes into account the whole information contained in the upper tail of the distribution. In [61], we investigate the extreme properties of a new risk measure (called the Conditional Tail Moment) which encompasses various risk measures, such as the CTE, as particular cases. We study the situation where some covariate information is available under some general conditions on the distribution tail. We thus has to deal with conditional extremes (see paragraph 7.2.2).

A second possible coherent alternative risk measure is based on expectiles [18]. Compared to quantiles, the family of expectiles is based on squared rather than absolute error loss minimization. The flexibility and virtues of these least squares analogues of quantiles are now well established in actuarial science, econometrics and statistical finance. Both quantiles and expectiles were embedded in the more general class of M-quantiles [19] as the minimizers of a generic asymmetric convex loss function. It has been proved very recently that the only M-quantiles that are coherent risk measures are the expectiles.

#### Level sets estimation

Participant : Stéphane Girard.

**Joint work with:** G. Stupfler (Univ. Nottingham, UK).

The boundary bounding the set of points is viewed as the larger level set of the points distribution. This is then an extreme quantile curve estimation problem. We proposed estimators based on projection as well as on kernel regression methods applied on the extreme values set, for particular set of points [10]. We also investigate the asymptotic properties of existing estimators when used in extreme situations. For instance, we have established that the so-called geometric quantiles have very counter-intuitive properties in such situations [21] and thus should not be used to detect outliers.

#### Approximate Bayesian inference

Participant : Julyan Arbel.

**Joint work with:**
Igor Prünster, Stefano Favaro.

Approximate Bayesian inference was tackled from two perspectives.

First, from a computational viewpoint, we have proposed an algorithm which allows for controlling the approximation error in Bayesian nonparametric posterior sampling. In [14], we show that completely random measures (CRM) represent the key building block of a wide variety of popular stochastic models and play a pivotal role in modern Bayesian Nonparametrics. The popular Ferguson & Klass representation ofCRMs as a random series with decreasing jumps can immediately be turned into an algorithm for sampling realizations of CRMs or more elaborate models involving transformedCRMs. However, concrete implementation requires to truncate the random series at some threshold resulting in an approximation error. The goal of this work is to quantify the quality of the approximation by a moment-matching criterion, which consists in evaluating a measure of discrepancy between actual moments and moments based on the simulation output. Seen as a function of the truncation level, the methodology can be used to determine the truncation level needed to reach a certain level of precision. The resulting moment-matching Ferguson & Klass algorithm is then implemented and illustrated on several popular Bayesian nonparametric models.

In [57], we focus on the truncation error of a superposed gamma process in a decreasing order representation. As in [14], we utilize the constructive representation due to Ferguson and Klass which provides the jumps of the series in decreasing order. This feature is of primary interest when it comes to sampling since it minimizes the truncation error for a fixed truncation level of the series. We quantify the quality of the approximation in two ways. First, we derive a bound in probability for the truncation error. Second, we study a moment-matching criterion which consists in evaluating a measure of discrepancy between actual moments of the CRM and moments based on the simulation output. This work focuses on a general class of CRMs, namely the superposed gamma process, which suitably transformed have already been successfully implemented in Bayesian Nonparametrics. To this end, we show that the moments of this class of processes can be obtained analytically.

Second, we have proposed an approximation of Gibbs-type random probability measures at the level of the predictive probabilities. Gibbs-type random probability measures are arguably the most “natural" generalization of the Dirichlet process. Among them the two parameter Poisson–Dirichlet process certainly stands out for the mathematical tractability and interpretability of its predictive probability, which made it the natural candidate in numerous applications, e.g., machine learning theory, probabilistic models for linguistic applications, Bayesian nonparametric statistics, excursion theory, measure-valued diffusions in population genetics, combinatorics and statistical physics. Given a sample of size $n$, in this work we show that the predictive probabilities of any Gibbs-type prior admit a large $n$ approximation, with an error term vanishing as $o(1/n)$, which maintains the same desirable features as the predictive probabilities of the two parameter Poisson–Dirichlet prior.

#### Bayesian nonparametric posterior asymptotic behavior

Participant : Julyan Arbel.

**Joint work with:**
Olivier Marchal, Stefano Favaro, Bernardo Nipoti, Yee Whye Teh.

In [24], we obtain the optimal proxy variance for the sub-Gaussianity of Beta distribution, thus proving upper bounds recently conjectured by Elder (2016). We provide different proof techniques for the symmetrical (around its mean) case and the non-symmetrical case. The technique in the latter case relies on studying the ordinary differential equation satisfied by the Beta moment-generating function known as the confluent hypergeometric function. As a consequence, we derive the optimal proxy variance for the Dirichlet distribution, which is apparently a novel result. We also provide a new proof of the optimal proxy variance for the Bernoulli distribution, and discuss in this context the proxy variance relation to log-Sobolev inequalities and transport inequalities.

The article [13] deals with a *Bayesian nonparametric inference for discovery probabilities: credible intervals and large sample asymptotics*. Given a sample of size $n$ from a population of individual belonging to different species with unknown proportions, a popular problem of practical interest consists in making inference on the probability ${D}_{n}\left(l\right)$ that the $(n+1)$-th draw coincides with a species with frequency $l$ in the sample, for any $l=0,1,...,n$. We explore in this work a Bayesian nonparametric viewpoint for inference of ${D}_{n}\left(l\right)$. Specifically, under the general framework of Gibbs-type priors we show how to derive credible intervals for the Bayesian nonparametric estimator of ${D}_{n}\left(l\right)$, and we investigate the large $n$ asymptotic behavior of such an estimator. We also compare this estimator to the classical Good–Turing estimator.

#### A Bayesian nonparametric approach to ecological risk assessment

Participant : Julyan Arbel.

**Joint work with** Guillaume Kon Kam King, Igor Prünster.

We revisit a classical method for ecological risk assessment, the Species Sensitivity Distribution (SSD) approach, in a Bayesian nonparametric framework. SSD is a mandatory diagnostic required by environmental regulatory bodies from the European Union, the United States, Australia, China etc. Yet, it is subject to much scientific criticism, notably concerning a historically debated parametric assumption for modelling species variability. Tackling the problem using nonparametric mixture models, it is possible to shed this parametric assumption and build a statistically sounder basis for SSD. We use Normalized Random Measures with Independent Increments (NRMI) as the mixing measure because they offer a greater flexibility than the Dirichlet process. Indeed, NRMI can induce a prior on the number of components in the mixture model that is less informative than the Dirichlet process. This feature is consistent with the fact that SSD practitioners do not usually have a strong prior belief on the number of components. In this work, we illustrate the advantage of the nonparametric SSD over the classical normal SSD and a kernel density estimate SSD on several real datasets[59].

#### Machine learning methods for the inversion of hyperspectral images

Participant : Stéphane Girard.

**Joint work with:** S. Douté (IPAG, Grenoble), M. Fauvel (INRA, Toulouse) and L. Gardes (Univ. Strasbourg).

We address the physical analysis of planetary hyperspectral images by massive inversion [58]. A direct radiative transfer model that relates a given combination of atmospheric or surface parameters to a spectrum is used to build a training set of synthetic observables. The inversion is based on the statistical estimation of the functional relationship between parameters and spectra. To deal with high dimensionality (image cubes typically present hundreds of bands), a two step method is proposed, namely K-GRSIR. It consists of a dimension reduction step followed by a regression with a non-linear least-squares algorithm. The dimension reduction is performed with the Gaussian Regularized Sliced Inverse Regression algorithm, which finds the most relevant directions in the space of synthetic spectra for the regression. The method is compared to several algorithms: a regularized version of k-nearest neighbors, partial least-squares, linear and non-linear support vector machines. Experimental results on simulated data sets have shown that non-linear support vector machines is the most accurate method followed by K-GRSIR. However, when dealing with real data sets, K-GRSIR gives the most interpretable results and is easier to train.

#### Multi sensor fusion for acoustic surveillance and monitoring

Participants : Florence Forbes, Jean-Michel Bécu.

**Joint work with:** Pascal Vouagner and Christophe Thirard from ACOEM company.

In the context of the DGA-rapid WIFUZ project, we addressed the issue of determining the localization of shots from multiple measurements coming from multiple sensors. The WIFUZ project is a collaborative work between various partners: DGA, ACOEM and HIKOB companies and Inria. This project is at the intersection of data fusion, statistics, machine learning and acoustic signal processing. The general context is the surveillance and monitoring of a zone acoustic state from data acquired at a continuous rate by a set of sensors that are potentially mobile and of different nature. The overall objective is to develop a prototype for surveillance and monitoring that is able to combine multi sensor data coming from acoustic sensors (microphones and antennas) and optical sensors (infrared cameras) and to distribute the processing to multiple algorithmic blocs. As an illustration, the mistis contribution is to develop technical and scientific solutions as part of a collaborative protection approach, ideally used to guide the best coordinated response between the different vehicles of a military convoy. Indeed, in the case of an attack on a convoy, identifying the threatened vehicles and the origin of the threat is necessary to organize the best response from all members on the convoy. Thus it will be possible to react to the first contact (emergency detection) to provide the best answer for threatened vehicles (escape, lure) and for those not threatened (suppression fire, riposte fire). We developed statistical tools that make it possible to analyze this information (characterization of the threat) using fusion of acoustic and image data from a set of sensors located on various vehicles. We used Bayesian inversion and simulation techniques to recover multiple sources mimicking collaborative interaction between several vehicles.

#### Extraction and data analysis toward "industry of the future"

Participants : Florence Forbes, Hongliang Lu, Fatima Fofana, Gildas Mazo, Jaime Eduardo Arias Almeida.

**Joint work with:** J. F. Cuccaro and J. C Trochet from Vi-Technology company.

Industry as we know it today will soon disappear. In the future, the machines which constitute the manufacturing process will communicate automatically as to optimize its performance as whole. Transmitted information essentially will be of statistical nature. In the context of VISION 4.0 project with Vi-Technology, the role of mistis is to identify what statistical methods might be useful for the printed circuits boards assembly industry. The topic of F. Fofana's internship was to extract and analyze data from two inspection machines of a industrial process making electronic cards. After a first extraction step in the SQL database, the goal was to enlighten the statistical links between these machines. Preliminary experiments and results on the Solder Paste Inspection (SPI) step, at the beginning of the line, helped identifying potentially relevant variables and measurements (eg related to stencil offsets) to identify future defects and discriminate between them. More generally, we have access to two databases at both ends (SPI and Component Inspection) of the assembly process. The goal is to improve our understanding of interactions in the assembly process, find out correlations between defects and physical measures, generate proactive alarms so as to detect departures from normality.