The context of our work is the analysis of structured stochastic models with statistical tools. The idea underlying the concept of structure is that stochastic systems that exhibit great complexity can be accounted for by combining simple local assumptions in a coherent way. This provides a key to modelling, computation, inference and interpretation. This approach appears to be useful in a number of high impact applications including signal and image processing, neuroscience, genomics and sensor networks, while the needs from these domains can in turn generate interesting theoretical developments. However, this powerful and flexible approach can still be restricted by necessary simplifying assumptions and several generic sources of complexity in data.

Data often exhibit complex dependence structures, arising for example from repeated measurements on individual items, or from natural grouping of individual observations due to the method of sampling, spatial or temporal association, family relationship, and so on. Other sources of complexity are connected with the measurement process, such as having multiple measuring instruments or simulations generating high dimensional and heterogeneous data, or data that are dropped out or missing. Such complications in data-generating processes raise a number of challenges. Our goal is to contribute to statistical modelling by offering theoretical concepts and computational tools to properly handle some of these issues that are frequent in modern data. In doing so, we aim to develop innovative techniques for applications with high scientific, societal and economic impact, in particular through image processing and spatial data analysis in environment, biology and medicine.

The methods we focus on involve mixture models, Markov models, and more generally hidden structure models identified by stochastic algorithms on one hand, and semi and non-parametric methods on the other hand.

Hidden structure models are useful for taking into account heterogeneity in data. They concern many areas of statistics (finite mixture analysis, hidden Markov models, graphical models, random effect models, ...). Due to their missing data structure, they induce specific difficulties for both estimating the model parameters and assessing performance. The team focuses on research regarding both aspects. We design specific algorithms for estimating the parameters of missing structure models and we propose and study specific criteria for choosing the most relevant missing structure models in several contexts.

Semi and non-parametric methods are relevant and useful when no
appropriate parametric model exists for the data under study
either because of data complexity, or because information is
missing.
When observations are curves, they enable us to model the
data without a discretization step. These
techniques are also of great use for *dimension reduction* purposes. They enable dimension reduction of the
functional or multivariate data with no assumptions on the
observations distribution. Semi-parametric methods refer to
methods that include both parametric and non-parametric aspects.
Examples include the Sliced Inverse Regression (SIR) method which combines non-parametric regression techniques
with parametric dimension reduction aspects. This is also the case
in *extreme value analysis*, which is based
on the modelling of distribution tails
by both a functional part and a real parameter.

**Key-words:**
mixture of distributions, EM algorithm, missing data, conditional independence,
statistical pattern recognition, clustering,
unsupervised and partially supervised learning.

In a first approach, we consider statistical parametric models,

These models are interesting in that they may point out hidden
variables responsible for most of the observed variability and such
that the observed variables are *conditionally* independent.
Their estimation is often difficult due to the missing data. The
Expectation-Maximization (EM) algorithm is a general and now
standard approach to maximization of the likelihood in missing
data problems. It provides parameter estimation but also values
for missing data.
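As an illustration of the EM principle described above, the following is a minimal sketch for a one-dimensional Gaussian mixture, in which the hidden variable is the component label of each observation. The function name, the quantile-based initialisation and the fixed number of iterations are assumptions made for this example; it is not the team's implementation.

```python
import numpy as np

def em_gaussian_mixture_1d(x, n_components=2, n_iter=100):
    """Fit a 1-D Gaussian mixture by EM (illustrative sketch)."""
    x = np.asarray(x, dtype=float)
    n = x.size
    # Assumed initialisation: evenly spaced quantiles as means, pooled variance.
    means = np.quantile(x, (np.arange(n_components) + 0.5) / n_components)
    variances = np.full(n_components, x.var())
    props = np.full(n_components, 1.0 / n_components)
    for _ in range(n_iter):
        # E-step: posterior probabilities of the missing labels.
        log_post = (np.log(props)
                    - 0.5 * np.log(2.0 * np.pi * variances)
                    - 0.5 * (x[:, None] - means) ** 2 / variances)
        log_post -= log_post.max(axis=1, keepdims=True)
        resp = np.exp(log_post)
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: closed-form updates of proportions, means and variances.
        nk = resp.sum(axis=0) + 1e-12
        props = nk / nk.sum()
        means = (resp * x[:, None]).sum(axis=0) / nk
        variances = (resp * (x[:, None] - means) ** 2).sum(axis=0) / nk
        variances = np.maximum(variances, 1e-8)  # guard against collapse
    return props, means, variances, resp
```

The returned `resp` array holds the posterior label probabilities, which is the sense in which EM also provides values for the missing data.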

Mixture models correspond to independent

**Key-words:**
graphical models, Markov properties, hidden Markov models, clustering, missing data, mixture of distributions, EM algorithm, image analysis, Bayesian
inference.

Graphical modelling provides a diagrammatic representation of the dependency structure of a joint probability distribution, in the form of a network or graph depicting the local relations among variables. The graph can have directed or undirected links or edges between the nodes, which represent the individual variables. Associated with the graph are various Markov properties that specify how the graph encodes conditional independence assumptions.

It is the conditional independence assumptions that give graphical models their fundamental modular structure, enabling computation of globally interesting quantities from local specifications. In this way graphical models form an essential basis for our methodologies based on structures.

The graphs can be either
directed, e.g. Bayesian Networks, or undirected, e.g. Markov Random Fields.
The specificity of Markovian models is that the dependencies
between the nodes are limited to the nearest neighbor nodes. The
neighborhood definition can vary and be adapted to the problem of
interest. When parts of the variables (nodes) are not observed or missing,
we refer to these models as Hidden Markov Models (HMM).
Hidden Markov chains or hidden Markov fields correspond to cases where the

Hidden Markov models are very useful in modelling spatial dependencies, but these dependencies and the possible existence of hidden variables are also responsible for a typically large amount of computation. It follows that the statistical analysis may not be straightforward. Typical issues are related to the neighborhood structure to be chosen when not dictated by the context and the possible high dimensionality of the observations. This also requires a good understanding of the role of each parameter and methods to tune them depending on the goal in mind. Estimation typically corresponds to an energy minimization problem which is NP-hard and is usually solved approximately. We focus on a certain type of methods based on variational approximations and propose effective algorithms which show good performance in practice and for which we also study theoretical properties. We also propose some tools for model selection. Finally, we investigate ways to extend the standard Hidden Markov Field model to increase its modelling power.
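To make the idea of a variational approximation concrete, here is a minimal mean-field sketch for a hidden Potts field on a 2-D grid. Several choices are assumptions made for the example and are not taken from the text: a 4-nearest-neighbour system, Gaussian emissions with known class means and a common standard deviation, a fixed interaction parameter `beta`, and synchronous updates of all sites. The function name is hypothetical.

```python
import numpy as np

def mean_field_potts(y, means, sigma, beta, n_iter=20):
    """Mean-field approximation of the label posteriors (illustrative sketch)."""
    means = np.asarray(means, dtype=float)
    # Gaussian log-likelihood of each pixel under each class, shape (H, W, K).
    log_lik = -0.5 * ((y[..., None] - means) / sigma) ** 2
    # Start from the posterior obtained without spatial interaction.
    q = np.exp(log_lik - log_lik.max(axis=-1, keepdims=True))
    q /= q.sum(axis=-1, keepdims=True)
    for _ in range(n_iter):
        # Sum of the neighbours' current class probabilities (zero padding).
        padded = np.pad(q, ((1, 1), (1, 1), (0, 0)))
        neigh = (padded[:-2, 1:-1] + padded[2:, 1:-1]
                 + padded[1:-1, :-2] + padded[1:-1, 2:])
        # Fixed-point update: q_i(k) proportional to
        # exp(log p(y_i | k) + beta * sum over neighbours j of q_j(k)).
        logits = log_lik + beta * neigh
        logits -= logits.max(axis=-1, keepdims=True)
        q = np.exp(logits)
        q /= q.sum(axis=-1, keepdims=True)
    return q
```

Each site only needs its neighbours' current probabilities, which replaces the intractable joint computation by local updates. Synchronous updates are the simplest to write but may oscillate for large `beta`; sequential site-by-site updates are a common alternative.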

**Key-words:** dimension reduction, extreme value analysis, functional estimation.

We also consider methods which do not assume a parametric model.
The approaches are non-parametric in the sense that they do not
require the assumption of a prior model on the unknown quantities.
This property is important since, for image applications for
instance, it is very difficult to introduce sufficiently general
parametric models because of the wide variety of image contents.
Projection methods are then a way to decompose the unknown
quantity on a set of functions (*e.g.* wavelets). Kernel
methods which rely on smoothing the data using a set of kernels
(usually probability distributions) are other examples.
Relationships exist between these methods and learning techniques
using Support Vector Machines (SVM), as appears in the context
of *level-sets estimation* (see section ). Such
non-parametric methods have become the cornerstone when dealing
with functional data. This is the case, for
instance, when observations are curves. They enable us to model the
data without a discretization step. More generally, these
techniques are of great use for *dimension reduction* purposes
(section ). They enable reduction of the dimension of the
functional or multivariate data without assumptions on the
observations distribution. Semi-parametric methods refer to
methods that include both parametric and non-parametric aspects.
Examples include the Sliced Inverse Regression (SIR) method
which combines non-parametric regression techniques
with parametric dimension reduction aspects. This is also the case
in *extreme value analysis*, which is based
on the modelling of distribution tails (see section ).
It differs from traditional statistics which focuses on the central
part of distributions, *i.e.* on the most probable events.
Extreme value theory shows that distribution tails can be
modelled by both a functional part and a real parameter, the
extreme value index.
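As a hedged illustration of how a tail can be summarised by a real parameter and used for extrapolation, the sketch below implements two classical textbook estimators for heavy-tailed data: the Hill estimator of a positive extreme-value index and a Weissman-type extrapolated quantile. These are assumptions chosen for the example, not necessarily the estimators studied by the team; the function names and the choice of `k` are hypothetical, and the data are assumed positive.

```python
import numpy as np

def hill_estimator(sample, k):
    """Hill estimator of a positive extreme-value index from the k largest values."""
    x = np.sort(np.asarray(sample, dtype=float))
    if not 0 < k < x.size:
        raise ValueError("k must satisfy 0 < k < n")
    top = x[-k:]            # the k largest order statistics
    threshold = x[-k - 1]   # the (k+1)-th largest value, used as threshold
    return np.mean(np.log(top) - np.log(threshold))

def weissman_quantile(sample, k, p):
    """Extrapolated quantile with exceedance probability p (may be below 1/n)."""
    x = np.sort(np.asarray(sample, dtype=float))
    n = x.size
    gamma = hill_estimator(x, k)
    # Extrapolate from the threshold using the estimated tail index.
    return x[-k - 1] * (k / (n * p)) ** gamma
```

For instance, with `n` observations one may call `weissman_quantile(sample, k, p)` with `p` smaller than `1 / n`, which targets a value beyond the sample maximum. The result is sensitive to `k`, whose selection is a problem in its own right.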

Extreme value theory is a branch of statistics dealing with the extreme
deviations from the bulk of probability distributions.
More specifically, it focuses on the limiting distributions for the
minimum or the maximum of a large collection of random observations
from the same arbitrary distribution.
Let *i.e.*

To estimate such quantiles therefore requires dedicated
methods to
extrapolate information beyond the observed values of

where both the extreme-value index *i.e.* such that

for all

More generally, the problems that we address are part of risk management theory. For instance, in reliability, the distributions of interest are included in a semi-parametric family whose tails decrease exponentially fast. These so-called Weibull-tail distributions are defined by their survival distribution function:

Gaussian, gamma, exponential and Weibull distributions, among others,
are included in this family. An important part of our work consists
in establishing links between models () and ()
in order to propose new estimation methods.
We also consider the case where the observations are recorded together with covariate information. In this case, the extreme-value index and the

Level sets estimation is a
recurrent problem in statistics which is linked to outlier
detection. In biology, one is interested in estimating reference
curves, that is to say curves which bound

Our work on high dimensional data requires that we face the curse of dimensionality. Indeed, the modelling of high dimensional data requires complex models and thus the estimation of a high number of parameters compared to the sample size. In this framework, dimension reduction methods aim at replacing the original variables by a small number of linear combinations with as small a loss of information as possible. Principal Component Analysis (PCA) is the most widely used method to reduce dimension in data. However, standard linear PCA can be quite inefficient on image data where even simple image distortions can lead to highly non-linear data. Two directions are investigated. First, non-linear PCAs can be proposed, leading to semi-parametric dimension reduction methods. Another field of investigation is to take into account the application goal in the dimension reduction step. One of our approaches is therefore to develop new Gaussian models of high dimensional data for parametric inference. Such models can then be used in a mixture or Markov framework for classification purposes. Another approach consists in combining dimension reduction, regularization techniques, and regression techniques to improve the Sliced Inverse Regression method.
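For reference, a minimal sketch of the basic (unregularized) Sliced Inverse Regression procedure is given below. It shows the standard textbook version, not the improved variants mentioned above. The function name, the number of slices and the equal-size slicing of the sorted response are assumptions made for the example.

```python
import numpy as np

def sliced_inverse_regression(X, y, n_slices=10, n_directions=1):
    """Basic SIR: estimate directions b such that y depends on X through X @ b."""
    n, p = X.shape
    mean = X.mean(axis=0)
    cov = np.cov(X, rowvar=False)
    # Whiten the predictors (assumes a well-conditioned covariance matrix).
    evals, evecs = np.linalg.eigh(cov)
    inv_sqrt = evecs @ np.diag(evals ** -0.5) @ evecs.T
    Z = (X - mean) @ inv_sqrt
    # Slice the observations according to the sorted response.
    order = np.argsort(y)
    M = np.zeros((p, p))
    for idx in np.array_split(order, n_slices):
        m = Z[idx].mean(axis=0)            # slice mean of whitened predictors
        M += (len(idx) / n) * np.outer(m, m)
    # Leading eigenvectors of the between-slice covariance matrix.
    _, vecs = np.linalg.eigh(M)
    top = vecs[:, ::-1][:, :n_directions]
    directions = inv_sqrt @ top            # back to the original scale
    return directions / np.linalg.norm(directions, axis=0)
```

The inversion of the covariance matrix is the step that breaks down when the dimension is large compared to the sample size, which is what motivates combining SIR with regularization.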

As regards applications, several areas of image analysis can be covered using the tools developed in the team. More specifically, in collaboration with team Perception, we address various issues in computer vision involving Bayesian modelling and probabilistic clustering techniques. Other applications in medical imaging are natural. We work more specifically on MRI data, in collaboration with the Grenoble Institute of Neuroscience (GIN) and the NeuroSpin center of CEA Saclay. We also consider other statistical 2D fields coming from other domains such as remote sensing, in collaboration with Laboratoire de Planétologie de Grenoble. We worked on hyperspectral images. In the context of the "pôle de compétitivité" project I-VP, we worked on images of PC boards.

A second domain of applications concerns biology and medicine. We consider the use of missing data models in epidemiology. We also investigated statistical tools for the analysis of bacterial genomes beyond gene detection. Applications in neurosciences are also considered. Finally, in the context of the ANR VMC project Medup, we studied the uncertainties on the forecasting and climate projection for Mediterranean high-impact weather events.

**Joint work with:** Senan Doyle (start-up creator) and Michel Dojat from Grenoble Institute of Neuroscience and Benoit Scherrer from Harvard Medical School, Boston, MA, USA.

From brain MR images, neuroradiologists are able to delineate
tissues such as grey matter and structures such as Thalamus and
damaged regions. This delineation is a common task for an expert
but unsupervised segmentation is difficult due to a number of
artefacts. The LOCUS software (http://

The LOCUS software has been developed in the context of a collaboration between Mistis, a computer science team (Magma, LIG) and a neuroscience methodological team (the Neuroimaging team from Grenoble Institute of Neuroscience, INSERM). Over the period 2006-2008, this collaboration resulted in the PhD thesis of B. Scherrer (advised by C. Garbay and M. Dojat) and in a number of publications. In particular, B. Scherrer received a "Young Investigator Award" at the 2008 MICCAI conference.

The originality of this work comes from the successful combination of the teams' respective strengths, i.e. expertise in distributed computing, in neuroimaging data processing and in statistical methods.

**Joint work with:** Senan Doyle (start-up creator) and Michel Dojat.

The Locus software was extended to address the delineation of lesions in pathological brains. Its extension
P-LOCUS (http://

it is fully automatic: no external user interaction and no training data required

the possibility to combine information from several images (MR sequences)

a statistical Bayesian framework for robustness to image artefacts and a priori knowledge incorporation

a voxel-based clustering technique that uses Markov random fields (MRF) incorporating information about neighboring voxels for spatial consistency and robustness to imperfect image features (noise).

the possibility to select and incorporate relevant a priori knowledge via different atlases, e.g. tissue and vascular territory atlases

fully integrated preprocessing steps and lesion ROI identification

The P-LOCUS software was presented at various conferences and used for the BRATS Challenge on tumor segmentation organized as a satellite challenge of the MICCAI conference in Nagoya, Japan. A paper published in IEEE Transactions on Medical Imaging reports the challenge results. Results are also shown in . The software was registered at APP in 2013 and is now undergoing industrial development for the creation of a start-up (Pixyl) expected in January 2015.

**Joint work with:** Philippe Ciuciu and Solveig Badillo from Parietal Team Inria and CEA NeuroSpin, Lotfi Chaari and Laurent Risser from INP Toulouse.

As part of fMRI data analysis, the PyHRF package (http://

**Joint work with:** Charles Bouveyron (Univ. Paris 5) and Stéphane Dépréaux (LJK).

Mistis is involved in the development of several R packages available on the CRAN archive. They are dedicated to the construction of copulas and to the classification and clustering of data.

**PBC** (product of bivariate copulas). http://

**FDG** (one-Factor copulas with Durante Generators). http://

**HDclassif** (classification and clustering methods for high dimensional data). http://

**robustDA** (robust mixture discriminant analysis). http://

**MSST** (Mixtures of multiple scaled Student distributions). The package is not yet available on CRAN but should be in early 2015. It implements more efficiently the models and inference procedures described in and will be used on large data sets of brain MRI in the context of Alexis Arnaud's PhD thesis. This is joint work with S. Dépréaux who helped with writing subroutines in C++.

The work on the P-LOCUS software has been exploited in order to create a start-up in January 2015. The project, called Pixyl, has been accepted by the GATE1 incubator and has been awarded a BPI emergence prize. It is led by Senan Doyle (future CEO). The other co-founders are Michel Dojat (INSERM, GIN), Florence Forbes (Inria, Mistis) and IT-Translation.

**Joint work with:**
Emma Holian (National University of Ireland, Galway)

In studies where subjects contribute more than one observation, such as in longitudinal studies, linear mixed models have become one of the most used techniques to take into account the correlation between these observations. By introducing random effects, mixed models allow the within-subject correlation and the variability of the response among the different subjects to be taken into account. However, such models are based on a normality assumption for the random effects and reflect the prior belief of homogeneity among all the subjects. To relax this strong assumption, Verbeke and Lesaffre (1996) proposed the extension of the classical linear mixed model by allowing the random effects to be sampled from a finite mixture of normal distributions with common covariance matrix. This extension naturally arises from the prior belief of the presence of unobserved heterogeneity in the random effects population. The model is therefore called the heterogeneity linear mixed model. Note that this model does more than relax the assumption about the random effects distribution: each component of the mixture can be considered as a cluster containing a proportion of the total population. Thus, this model is also suitable for classification purposes.

Concerning parameter estimation in the heterogeneity model, the use of the EM-algorithm, which takes into account the incomplete structure of the data, has been considered in the literature. Unfortunately, the M-step in the estimation process is not available in analytic form and a numerical maximisation procedure such as Newton-Raphson is needed. Because deriving such a procedure is a non-trivial task, Komarek et al. (2002) proposed an approximate optimization. But this procedure proved to be very slow and limited to small samples due to requiring manipulation of very large matrices and prohibitive computation.

To overcome this problem, we have proposed an alternative approach which consists of directly fitting an equivalent mixture of linear mixed models. Contrary to the heterogeneity model, the M-step of the EM-algorithm is analytically tractable in this case. Then, from the obtained parameter estimates, we can easily obtain the parameter estimates in the heterogeneity model.
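The link between the two formulations can be written out as follows. The notation is an assumption made for this illustration: $y_i$ is the response vector of subject $i$, $X_i$ and $Z_i$ are its fixed and random effects design matrices, $b_i$ its random effects, and $n_i$ its number of observations.

```latex
% Heterogeneity linear mixed model: random effects drawn from a normal
% mixture with common covariance matrix D (notation assumed).
y_i = X_i \beta + Z_i b_i + \varepsilon_i, \qquad
b_i \sim \sum_{k=1}^{K} \pi_k \, \mathcal{N}(\mu_k, D), \qquad
\varepsilon_i \sim \mathcal{N}(0, \sigma^2 I_{n_i})

% Integrating out b_i gives a mixture of linear mixed models for y_i.
y_i \sim \sum_{k=1}^{K} \pi_k \,
\mathcal{N}\!\left( X_i \beta + Z_i \mu_k,\;
Z_i D Z_i^{\top} + \sigma^2 I_{n_i} \right)
```

In the second form, each component is an ordinary linear mixed model with a component-specific mean term $Z_i \mu_k$. The parameters $(\pi_k, \mu_k, D, \beta, \sigma^2)$ of the first form can then be read off from a fit of the second.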

**Joint work with:** C. Bouveyron (Univ. Paris 5), M. Fauvel (ENSAT Toulouse)
and J. Chanussot (Gipsa-lab and Grenoble-INP)

In the PhD work of Charles Bouveyron (co-advised by Cordelia Schmid from the Inria LEAR team), we propose new Gaussian models of high dimensional data for classification purposes. We assume that the data live in several groups located in subspaces of lower dimensions. Two different strategies arise:

the introduction in the model of a dimension reduction constraint for each group

the use of parsimonious models obtained by requiring different groups to share the same values of some parameters

This modelling yields a new supervised classification method called High Dimensional Discriminant Analysis (HDDA). Some versions of this method have been tested on the supervised classification of objects in images. This approach has been adapted to the unsupervised classification framework, and the related method is named High Dimensional Data Clustering (HDDC). Our recent work consists in adding a kernel in the previous methods to deal with nonlinear data classification and heterogeneous data. We also investigate the use of kernels derived from similarity measures on binary data. The targeted application is the analysis of verbal autopsy data (PhD thesis of N. Sylla). Indeed, health monitoring and evaluation make more and more use of data on causes of death from verbal autopsies in countries which do not keep records of civil status or which have incomplete records. The verbal autopsy method makes it possible to identify the probable cause of death. Verbal autopsy has become the main source of information on causes of death in these populations.

**Joint work with:** Darren Wraith from QUT, Brisbane Australia.

Clustering concerns the assignment of each of

**Joint work with:** Emmanuel Barbier and Benjamin Lemasson from Grenoble Institute of Neuroscience.

**Joint work with:** Israel Gebru, Xavier Alameda-Pineda and Radu Horaud from the
Inria Perception team.

Data clustering has received a lot of attention and many methods, algorithms and software packages are currently available. Among these techniques, parametric finite-mixture models play a central role due to their interesting mathematical properties and to the existence of maximum-likelihood estimators based on expectation-maximization (EM). In this work we propose a new mixture model that associates a weight with each observed data point. We introduce a Gaussian mixture with weighted data and we derive two EM algorithms: the first one assigns a fixed weight to each observed datum, while the second one treats the weights as hidden variables drawn from gamma distributions. We provide a general-purpose scheme for weight initialization and we thoroughly validate the proposed algorithms by comparing them with several parametric and non-parametric clustering techniques. We demonstrate the utility of our method for clustering heterogeneous data, namely data gathered with different sensorial modalities, e.g., audio and vision. See also an application in .
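The first of the two algorithms, with a fixed weight per datum, can be sketched in one dimension as follows. The sketch assumes that the weight `w[i]` scales the precision of observation `i`, so that `x[i]` has variance `variances[k] / w[i]` under component `k`. This parameterisation, the function name and the quantile initialisation are assumptions made for the example and should be checked against the cited work.

```python
import numpy as np

def weighted_data_gmm_em(x, w, n_components=2, n_iter=100):
    """EM for a 1-D Gaussian mixture with one fixed weight per datum (sketch)."""
    x = np.asarray(x, dtype=float)
    w = np.asarray(w, dtype=float)
    n = x.size
    means = np.quantile(x, (np.arange(n_components) + 0.5) / n_components)
    variances = np.full(n_components, x.var())
    props = np.full(n_components, 1.0 / n_components)
    for _ in range(n_iter):
        # E-step: each datum sees component variances divided by its weight.
        var_ik = variances[None, :] / w[:, None]
        log_post = (np.log(props)
                    - 0.5 * np.log(2.0 * np.pi * var_ik)
                    - 0.5 * (x[:, None] - means) ** 2 / var_ik)
        log_post -= log_post.max(axis=1, keepdims=True)
        resp = np.exp(log_post)
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: weights enter the mean and the variance numerator only.
        nk = resp.sum(axis=0)
        rw = resp * w[:, None]
        props = nk / n
        means = (rw * x[:, None]).sum(axis=0) / rw.sum(axis=0)
        variances = (rw * (x[:, None] - means) ** 2).sum(axis=0) / nk
    return props, means, variances, resp
```

Under this assumption, a datum with a small weight is treated as very noisy and has little influence on the estimated means and variances, which is one way to down-weight unreliable observations.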

**Joint work with:** Philippe Ciuciu from Team Parietal and
Neurospin, CEA in Saclay.

ASL fMRI data provide a quantitative measure of blood perfusion, which can be correlated to neuronal activation. In contrast to the BOLD measure, it is a direct measure of cerebral blood flow. However, ASL data have a lower SNR and resolution, so that the recovery of the perfusion response of interest suffers from the contamination by a stronger BOLD component in the ASL signal. In this work, we consider a model of both BOLD and perfusion components within the ASL signal. A physiological link between these two components is analyzed and used for a more accurate estimation of the perfusion response function, in particular in the usual ASL low SNR conditions.

**Joint work with:** Philippe Ciuciu from Team Parietal and
Neurospin, CEA in Saclay.

Physiological models have been proposed to describe the processes that underlie the link between neural and hemodynamic activity in the brain. Among these, the Balloon model describes the changes in blood flow, blood volume and oxygen concentration when a hemodynamic response follows neural activation. Next, a *BOLD signal model* links these variables to the measured BOLD signal. Taken together, these equations allow the precise modeling of the coupling between the cerebral blood flow (CBF) and hemodynamic response (HRF). However, several competing versions of the BOLD signal model have been described in the past.
In this work, we compare different physiological models linking CBF to HRF, as well as different BOLD signal models, in terms of least squares error and log-likelihood, and we assess the impact of this setting in the context of Arterial Spin Labelling (ASL) functional Magnetic Resonance Imaging (fMRI) data analysis.

**Joint work with:** Philippe Ciuciu from Team Parietal and
Neurospin, CEA in Saclay.

In this work, the goal is to analyse ASL data by accounting jointly for both the BOLD and perfusion components in the signal. Using the model proposed in , we design a variational EM approach to estimate the model parameters as a faster alternative to the MCMC approach used in and .

**Joint work with:** Jan Warnking from Grenoble Institute of Neuroscience.

Ongoing work is focused on the optimization of nonlinear models for fMRI data analysis, especially for the blood-oxygen-level-dependent (BOLD) MR modality. The current optimization procedure consists of a Bayesian inversion of the nonlinear model using a Gauss-Newton/Expectation-Maximization algorithm. Such an optimization procedure is time-consuming and achieves sub-optimal results. Therefore, the current research work is mainly focused on improving these results by experimenting with global search optimization methods, such as metaheuristics (MHs). Secondly, MHs can also be of great help in the development of minimization algorithms for solving problems with orthogonality constraints (as in polynomial optimization, combinatorial optimization, eigenvalue problems, sparse PCA, matrix rank minimization, etc.). Thus, another main research line is concerned with the application of MHs to this problem and, if necessary, the design and implementation of new evolutionary operators that preserve orthogonality. Finally, we are also trying to create advanced statistical models for coupling Arterial Spin Labeling (ASL) and BOLD MR modalities to study brain function.

**Joint work with:** Lotfi Chaari, Mohanad Albughdadi, Jean-Yves Tourneret from IRIT-ENSEEIHT in Toulouse and Philippe Ciuciu from Neurospin, CEA in Saclay.

Brain parcellation into a number of hemodynamically homogeneous regions (parcels) is a challenging issue in fMRI analyses. This task has recently been integrated into the joint detection-estimation (JDE) framework, resulting in the so-called joint parcellation-detection-estimation (JPDE) model. JPDE automatically estimates the parcels from the fMRI data but requires the desired number of parcels to be fixed. This is potentially critical in that the chosen number of parcels may influence detection-estimation performance. In this paper, we propose a model selection procedure to automatically fix the number of parcels from the data. The selection procedure relies on the calculation of the free energy corresponding to each competing model, within the variational expectation maximization framework. Experiments on synthetic and real fMRI data demonstrate the ability of the proposed procedure to select an adequate number of parcels. We also investigated the use of Latent Dirichlet Processes.

**Joint work with:** Alexis Roche from Siemens Advanced Clinical Imaging Technology, Department of Radiology, CHUV, Signal Processing Laboratory (LTS5), EPFL, Lausanne, Switzerland.

Image-guided diagnosis of brain disease calls for accurate morphometry algorithms, e.g., in order to detect focal atrophy patterns relating to early-stage progression of particular forms of dementia. To date, widely used brain morphometry packages rest upon discrete Markov random field (MRF) image segmentation models that ignore, or do not fully account for, partial voluming, leading to potentially inaccurate estimation of tissue volumes. Although several partial volume (PV) estimation methods have been proposed in the literature since the early 1990s, none of them seems to be in common use. We propose a fast algorithm to estimate brain tissue concentrations from conventional T1-weighted images based on a Bayesian maximum a posteriori formulation that extends the "mixel" model developed in the 1990s. A key observation is the necessity to incorporate additional prior constraints into the "mixel" model for the estimation of plausible concentration maps. Experiments on the ADNI standardized dataset show that global and local brain atrophy measures from the proposed algorithm yield higher diagnostic value than those from several widely used soft tissue labeling methods.

**Joint work with:** Emmanuel Barbier and Benjamin Lemasson from Grenoble Institute of Neuroscience.

Advanced statistical clustering approaches are promising tools to better exploit the wealth of MRI information, especially on large cohorts and multi-center studies. In neuro-oncology, the use of multiparametric MRI may better characterize brain tumor heterogeneity. To fully exploit multiparametric MRI (e.g. for tumor classification), appropriate analysis methods are yet to be developed. Such approaches offer improved data quality control by allowing automatic outlier detection, and improved analysis by identifying discriminative tumor signatures with measurable predictive power. In this work, we show on small animal data that advanced statistical learning approaches can help 1) in organizing existing data by detecting and excluding outliers and 2) in building a dictionary of tumor fingerprints from a clustering analysis of their microvascular features. Future work should include the integration in a joint statistical model of both automatic ROI delineation and clustering for whole brain data analysis, with a better use of anatomical information. This work has been submitted to the ISMRM 2015 conference and accepted at the SFMRMB 2015 conference.

**This is joint work with:** Eric Coissac and Pierre Taberlet from LECA
(Laboratoire d'Ecologie Alpine) and Alain Viari from Inria team Bamboo.

The study of species co-occurrence patterns has always been central to community ecology. The rise of high-throughput molecular methods and their use in ecology now gives easier access to new data in unprecedented quantity. We address the question of identifying genuine species interactions in the light of these novel data. The statistical analysis has to be tailored to the data specifics: the large amount of available data as well as biases inherent to the data extraction methods. The latter can cause spurious interactions while the former complicates any statistical modelling approach. In addition, the resolution of the data provided is rarely at the species level. In this work, we conduct a thorough correlation analysis between MOTUs (molecular operational taxonomic units) on different spatial scales to investigate global as well as local spatial patterns. Although this type of analysis is per se exploratory, we suggest it here in order to separate true species interactions from random patterns and to identify species subgroups for further detailed modelling. A random-matrix approach allows us to derive objective cut-off values for genuine correlations. We compare the results with those derived by the application of a model-based, sparse regression approach. Our study shows that despite their seemingly less precise nature when it comes to species identification, these data enable us to reveal mechanisms that structure an ecological community. In the light of the now easier access to molecular data, this points the way to a novel set of efficient methods for community analysis.
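One common way to obtain such a cut-off from random-matrix theory is to compare the eigenvalues of the empirical correlation matrix with the upper edge of the Marchenko-Pastur law, which describes the spectrum expected from uncorrelated variables. The text does not say which random-matrix result is used, so this choice is an assumption made for illustration. The function names are hypothetical, and the data matrix is assumed to have samples in rows and MOTUs in columns, with no constant column.

```python
import numpy as np

def marchenko_pastur_upper_edge(n_samples, n_variables):
    """Largest eigenvalue expected asymptotically for pure-noise correlations."""
    ratio = n_variables / n_samples
    return (1.0 + np.sqrt(ratio)) ** 2

def significant_correlation_modes(data):
    """Keep the eigenmodes of the correlation matrix that exceed the noise edge."""
    n, p = data.shape
    z = (data - data.mean(axis=0)) / data.std(axis=0)   # standardise columns
    corr = (z.T @ z) / n                                # empirical correlations
    eigvals, eigvecs = np.linalg.eigh(corr)             # ascending eigenvalues
    keep = eigvals > marchenko_pastur_upper_edge(n, p)
    return eigvals[keep], eigvecs[:, keep]
```

Eigenvectors associated with the retained eigenvalues point to groups of variables whose correlations are unlikely to arise by chance. In finite samples the largest noise eigenvalue can slightly exceed the asymptotic edge, so modes just above the threshold should be interpreted with care.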

**Joint work with:**
Pierre Fernique (Montpellier 2 University, CIRAD
and Inria Virtual Plants) and Yann Guédon
(CIRAD and Inria Virtual Plants)

Multivariate count data are defined as the numbers of items in different
states obtained by sampling within a population whose individuals
own items in various numbers and states. The analysis of multivariate count data
is a recurrent and crucial issue in numerous modelling problems,
particularly in the fields of biology and ecology (where the data can
represent, for example, children counts associated with multitype
branching processes), sociology and econometrics. Denoting by

Our context of application was characterised by zero-inflated, often
right skewed marginal distributions. Thus, Gaussian and Poisson
distributions were not *a priori* appropriate. Moreover, the
multivariate histograms typically had many cells, most of which
were empty. Consequently, nonparametric estimation was not efficient.

We developed an approach based on probabilistic graphical models (Koller & Friedman, 2009) to identify and exploit properties of conditional independence between numbers of children in different states, so as to simplify the specification of their joint distribution. The considered models are based on chain graphs. Model selection procedures are necessary to infer the graph and specify parsimonious distributions. The graph building stage was based on exploring the space of possible chain graph models, which required defining a notion of neighbourhood of these graphs. A parametric distribution was associated with each graph. It was obtained by combining families of univariate and multivariate distributions or regression models. These families were chosen by model selection procedures among different parametric families. To relax the strong constraints regarding dependencies induced by using parametric distributions, mixtures of graphical models were also considered.

Further extensions will be considered, in particular:

Hidden Markov tree models (see ) where the hidden state process is a multitype branching process with graphical generation distributions.

Gaussian chain graph models, where the chain components can be identified using lasso methods.

**Joint work with:**
Pierre Fernique (Montpellier 2 University and CIRAD) and Yann Guédon
(CIRAD), Inria Virtual Plants.

Algorithmic issues in hidden Markov tree models were considered by Durand *et al.* (2004).
This family of models
was used to represent local dependencies and heterogeneity within tree-structured data. It relied on a tree-structured hidden state process,
where the children states were assumed independent given their parent state. The latter assumption has been relaxed in an extension
of these models and new algorithmic solutions for model inference have been proposed in Pierre Fernique's PhD.
An application to the study of the cell lineage in biological tissues responsible for the plant growth has been considered. In this setting,
the number of children is small (between 0 and 2) and a saturated model has been considered to model transitions between parent
and configurations of children states. Extensions will be proposed, based on the parametric discrete multivariate distributions developed
in Section .

**Joint work with:**
Pierre Fernique (Montpellier 2 University and CIRAD) and Yann Guédon
(CIRAD), Inria Virtual Plants.

As an alternative to the hidden Markov tree models discussed in Section , subtrees with similar attributes can be identified using multiple change-point models. These approaches are well developed in the context of sequence analysis, but their extension to tree-structured data is not straightforward. Their advantage over hidden Markov tree models is that they relax the strong constraints regarding dependencies induced by parametric distributions and local parent-children dependencies. Heuristic approaches for change-point detection in trees were proposed and applied to the analysis of patchiness patterns (consisting of canopies made of clumps of either vegetative or flowering botanical units) in mango trees.
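For readers unfamiliar with change-point models, the following minimal sketch shows the sequence case only: a single change point is chosen so as to minimise the within-segment sum of squares. It is not the heuristic used for trees in this work, and the function name and data are hypothetical.

```python
def best_single_change_point(values):
    """Return (index, cost) of the split values[:index] / values[index:]
    that minimises the total within-segment sum of squared deviations."""
    n = len(values)
    if n < 2:
        raise ValueError("at least two observations are required")

    def sum_of_squares(segment):
        mean = sum(segment) / len(segment)
        return sum((v - mean) ** 2 for v in segment)

    best_index, best_cost = None, float("inf")
    for k in range(1, n):  # both segments must be non-empty
        cost = sum_of_squares(values[:k]) + sum_of_squares(values[k:])
        if cost < best_cost:
            best_index, best_cost = k, cost
    return best_index, best_cost


# Hypothetical attribute measured along a sequence of botanical units.
index, cost = best_single_change_point([0, 0, 0, 0, 5, 5, 5])
```

In a tree, the same criterion would have to be evaluated over candidate subtrees rather than over positions in a sequence, which is where the extension becomes non-trivial.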

**Joint work with:**
Anne Guérin-Dugué (GIPSA-lab)
and Benoit Lemaire (Laboratoire de Psychologie et Neurocognition)

In recent years, GIPSA-lab has developed computational models of information search in web-like materials,
using data from both eye-tracking and electroencephalograms (EEGs). These data were obtained from experiments
in which subjects had to perform a kind of press review. In such tasks, the reading process and decision making
are closely related. Statistical analysis of such data aims at deciphering underlying dependency structures
in these processes. Hidden Markov models (HMMs) have been used on eye movement series to infer phases
in the reading process that can be interpreted as steps in the cognitive processes leading to decision.
In HMMs, each phase is associated with a state of the Markov chain. The states are observed indirectly
through eye-movements. Our approach was inspired by Simola *et al.* (2008),
but we used hidden semi-Markov models for better characterization of phase length distributions.
The estimated HMM highlighted contrasted reading strategies (i.e., state transitions), with both
individual and document-related variability.

However, the characteristics of eye movements within each phase tended to be poorly discriminated. As a result, high uncertainty in the phase changes arose, and it could be difficult to relate phases to known patterns in EEGs.
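To illustrate how phases can be inferred from an observed series, here is a minimal sketch of Viterbi decoding for a plain discrete HMM. It is a simplification: the work described above uses hidden semi-Markov models, which also model phase durations. The two states, the two observation symbols and all probabilities are hypothetical.

```python
import math


def viterbi(observations, initial, transition, emission):
    """Most probable hidden state sequence of a discrete HMM (log domain)."""
    if not observations:
        raise ValueError("observations must be non-empty")
    n_states = len(initial)

    def log(p):
        return math.log(p) if p > 0 else float("-inf")

    # Best log-probability of any path ending in each state at time 0.
    score = [log(initial[s]) + log(emission[s][observations[0]])
             for s in range(n_states)]
    backpointers = []
    for obs in observations[1:]:
        new_score, pointers = [], []
        for s in range(n_states):
            best_prev = max(range(n_states),
                            key=lambda r: score[r] + log(transition[r][s]))
            pointers.append(best_prev)
            new_score.append(score[best_prev] + log(transition[best_prev][s])
                             + log(emission[s][obs]))
        score = new_score
        backpointers.append(pointers)

    # Backtrack from the best final state.
    state = max(range(n_states), key=lambda s: score[s])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


# Hypothetical setting: state 0 mostly emits symbol 0, state 1 mostly symbol 1.
phases = viterbi(
    observations=[0, 0, 0, 1, 1, 1],
    initial=[0.5, 0.5],
    transition=[[0.9, 0.1], [0.1, 0.9]],
    emission=[[0.8, 0.2], [0.2, 0.8]],
)
```

When the emission distributions of two phases overlap strongly, several state sequences have similar probabilities, which is one way to see the uncertainty in phase changes mentioned above.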

As a perspective, we aim at developing an integrated model coupling EEG and eye movements within one single HMM for better identification of the phases. Here, the coupling should incorporate some delay between the transitions in both (EEG and eye-movement) chains, since EEG patterns associated with cognitive processes occur with some delay relative to eye-movement phases. Moreover, EEGs and scanpaths were recorded with different time resolutions, so that a resampling scheme must be added to the model in order to synchronize both processes. Probabilistic graphical models (see Section ) will be inferred from the channel correlations to represent interactions between brain zones. The variability of these graphs is partly explained by individual differences in text exploration, which will have to be quantified.

**Joint work with:** Antoine Deleforge, Sileye Ba and Radu Horaud from the
Inria Perception team.

Hyper-spectral data can be analyzed to recover physical properties at large planetary scales. This involves solving inverse problems, which can be addressed within machine learning, with the advantage that, once a relationship between physical parameters and spectra has been established in a data-driven fashion, the learned relationship can be used to estimate physical parameters for new hyper-spectral observations. Within this framework, we propose a spatially-constrained and partially-latent regression method which maps high-dimensional inputs (hyper-spectral images) onto low-dimensional responses (physical parameters). The proposed regression model comprises two key features. Firstly, it combines a Gaussian mixture of locally-linear mappings (GLLiM) with a partially-latent response model described in . While the former makes high-dimensional regression tractable, the latter makes it possible to deal with physical parameters that cannot be observed or, more generally, with data contaminated by experimental artifacts that cannot be explained with noise models. Secondly, spatial constraints are introduced in the model through a Markov random field (MRF) prior which provides a spatial structure to the Gaussian-mixture hidden variables. Experiments conducted on a database of remotely sensed observations of the planet Mars collected by the Mars Express orbiter demonstrate the effectiveness of the proposed model. A preliminary version of the work can be found in .

**Joint work with:** L. Gardes (Univ. Strasbourg), A. Daouia
(Univ. Toulouse I and Univ. Catholique de Louvain), J. Elmethni (Univ. Paris 5) and S. Louhichi (Univ. Grenoble 1)

The goal of the PhD thesis of Alexandre Lekina was to contribute to
the development of theoretical and algorithmic models to tackle
conditional extreme value analysis, *i.e.* the situation where
some covariate information

Conditional extremes are studied in climatology where one is interested in how climate change over years might affect extreme temperatures or rainfalls. In this case, the covariate is univariate (time). Bivariate examples include the study of extreme rainfalls as a function of the geographical location. The application part of the study is joint work with the LTHE (Laboratoire d'étude des Transferts en Hydrologie et Environnement) located in Grenoble.

**Joint work with:** E. Deme (Univ. Gaston-Berger, Sénégal), J. Elmethni (Univ. Paris 5), L. Gardes and A. Guillou (Univ. Strasbourg)

One of the most popular risk measures is the Value-at-Risk (VaR) introduced in the 1990s.
In statistical terms,
the VaR at level *i.e.* when

**Joint work with:** C. Amblard (TimB in TIMC laboratory, Univ. Grenoble I), L. Gardes (Univ. Strasbourg) and L. Menneteau (Univ. Montpellier II)

Copulas are a useful tool to model multivariate distributions. At first, we developed an extension of some particular copulas. This led to a new class of bivariate copulas defined on matrices, and some analogies have been shown between matrix and copula properties.

However, while there exist various families of bivariate copulas, much less has been done when the dimension is higher. To this end, an interesting class of copulas based on products of transformed copulas has been proposed in the literature. The use of this class for practical high dimensional problems remains challenging. Constraints on the parameters and the product form render inference, and in particular the likelihood computation, difficult. We proposed a new class of high dimensional copulas based on a product of transformed bivariate copulas. No constraints on the parameters restrict the applicability of the proposed class, which is well suited for applications in high dimension. Furthermore, the analytic forms of the copulas within this class make it possible to associate a natural graphical structure, which helps to visualize the dependencies and to compute the likelihood efficiently even in high dimension. The extreme properties of the copulas are also derived and an R package has been developed.

As an alternative, we also proposed a new class of copulas constructed by introducing a latent factor. Conditional independence with respect to this factor and the use of a nonparametric class of bivariate copulas lead to interesting properties like explicitness, flexibility and parsimony. In particular, various tail behaviours are exhibited, making possible the modeling of various extreme situations. A pairwise moment-based inference procedure has also been proposed and the asymptotic normality of the corresponding estimator has been established.

In collaboration with L. Gardes, we investigate the estimation of the tail copula, which is widely used to describe the amount of extremal dependence of a multivariate distribution. In some situations such as risk management, the dependence structure can be linked with some covariate. The tail copula thus depends on this covariate and is referred to as the conditional tail copula. The aim of our work is to propose a nonparametric estimator of the conditional tail copula and to establish its asymptotic normality.

**Joint work with:** A. Guillou and L. Gardes (Univ. Strasbourg), A. Nazin (Univ. Moscou), G. Stupfler (Univ. Aix-Marseille)
and A. Daouia (Univ. Toulouse I and Univ. Catholique de Louvain)

The boundary bounding the set of points is viewed as the largest level set of the point distribution. This is then an extreme quantile curve estimation problem. We proposed estimators based on projection as well as on kernel regression methods applied to the extreme values set, for particular sets of points. We also investigate the asymptotic properties of existing estimators when used in extreme situations. For instance, we have established in collaboration with G. Stupfler that the so-called geometric quantiles have very counter-intuitive properties in such situations, and thus should not be used to detect outliers. These results have been submitted for publication.

In collaboration with A. Daouia, we investigate the application of such methods in econometrics: a new characterization of partial boundaries of a free disposal multivariate support is introduced by making use of large quantiles of a simple transformation of the underlying multivariate distribution. Pointwise empirical and smoothed estimators of the full and partial support curves are built as extreme sample and smoothed quantiles. Extreme-value theory then holds automatically for the empirical frontiers, and we show that some fundamental properties of extreme order statistics carry over to Nadaraya's estimates of upper quantile-based frontiers.

In collaboration with A. Nazin, we define new estimators of the frontier
function based on linear programming methods. The frontier is defined as the solution
of a linear optimization problem under inequality constraints. The estimator
is shown to be strongly consistent with respect to the

In collaboration with G. Stupfler and A. Guillou, new estimators of the boundary are introduced. The regression is performed on the whole set of points, the selection of the “highest” points being automatically performed by the introduction of high order moments.

**Joint work with:** S. Douté from Laboratoire de
Planétologie de Grenoble, J. Chanussot (Gipsa-lab and Grenoble-INP) and J. Saracco (Univ. Bordeaux).

Visible and near-infrared imaging spectroscopy is one of the key techniques
to detect, map and characterize mineral and volatile (e.g. water-ice)
species existing at the surface of planets. Indeed, the chemical composition,
granularity, texture, physical state, etc. of the materials
determine the existence and morphology of the absorption bands.
The resulting spectra therefore contain very useful information.
Current imaging spectrometers provide data organized as three
dimensional hyperspectral images: two spatial dimensions and one
spectral dimension. Our goal is to estimate the functional
relationship

In his PhD thesis work, Alessandro Chiancone studies the extension of the SIR method to different sub-populations. The idea is to assume that the dimension reduction subspace may not be the same for different clusters of the data. He also published a paper on a previous work in the field of hierarchical segmentation of images.

A contract was signed with the company Hemera, including the internships of Anne Charlier and Lisa Qianru. Hemera, located in Grenoble, designs, produces and sells online liquid and gas analyzers. The aim of Hemera is to measure, in any gaseous or liquid environment, with minimal environmental impact and in a selective way, all compounds currently regarded as pollutants, whether for our health or for an industrial process. Hemera's analyzers measure gas concentrations using optical techniques. The goal of the collaboration was to investigate the use of statistical methods to improve both the identification of the gases present and the estimation of their respective concentrations from the analysis of spectra representing a mixture of the different gases. A preliminary study based on the Lasso technique was implemented and tested, with promising first conclusions.
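As a rough illustration of how the Lasso can select and quantify components, the sketch below assumes that a measured spectrum is approximately a linear combination of known reference spectra (one column per candidate gas) and minimises `0.5 * ||y - X b||^2 + penalty * ||b||_1` by coordinate descent. The linear mixing assumption, the function names and the toy data are hypothetical and are not taken from the study.

```python
def soft_threshold(value, threshold):
    """Shrink value towards zero by threshold (the Lasso proximal step)."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def lasso_coordinate_descent(design, response, penalty, n_sweeps=100):
    """Minimise 0.5 * ||response - design @ coef||^2 + penalty * ||coef||_1.

    design is a list of rows (spectral channels); each column is the
    reference spectrum of one candidate gas.
    """
    n_rows, n_cols = len(design), len(design[0])
    coef = [0.0] * n_cols
    for _ in range(n_sweeps):
        for j in range(n_cols):
            rho, norm_sq = 0.0, 0.0
            for i in range(n_rows):
                # Fit of channel i using all gases except gas j.
                fitted_without_j = sum(design[i][k] * coef[k]
                                       for k in range(n_cols) if k != j)
                rho += design[i][j] * (response[i] - fitted_without_j)
                norm_sq += design[i][j] ** 2
            coef[j] = soft_threshold(rho, penalty) / norm_sq if norm_sq > 0 else 0.0
    return coef


# Toy example with three non-overlapping reference spectra.
references = [[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0],
              [0.0, 0.0, 1.0]]
measured = [2.0, 0.05, 0.0]
concentrations = lasso_coordinate_descent(references, measured, penalty=0.1)
```

In this toy case the small signal in the second channel is set exactly to zero, which is the behaviour that makes the Lasso attractive for deciding which gases are present before estimating their concentrations.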

**PERSYVACT project.** mistis is involved in a 2-year exploratory project, funded (20 keuros for the whole project) by the PERSYVAL labex (https://**15 keuros** from the labex for the PhD of A. Chiancone co-advised with
J. Chanussot from GIPSA-Lab.

**Grenoble Pole Cognition (2013-14).** We received in 2012, 2013 and 2014 **2.5 keuros** from the Grenoble Pole Cognition, http://

mistis is involved in three regional initiatives: PEPS (funded by CNRS and the PRES of Grenoble), AGIR (funded by Université Grenoble 1 and Grenoble-INP) and the MOTU project (funded by UPMF). The first two projects focus on the modelling of the extreme risk and its application in social science. The partners include the LTHE (Laboratoire d'étude des Transferts en Hydrologie et Environnement) and the 3S-R lab (Sols, Solides, Structures - Risques). The third project focuses on the use of statistical techniques for transportation data analysis and involves the GAEL laboratory (Grenoble Applied Economics Laboratory).

mistis participates in the weekly statistical seminar of Grenoble. Jean-Baptiste Durand is in charge of the organization and several lecturers have been invited in this context.

S. Girard has been head of the probability and statistics department of the LJK since September 2012.

The context of our research is also the collaboration
between mistis and a number of international partners such
as the Statistics Department of University of Washington in
Seattle, the Russian Academy of Science in Moscow, the National University of Ireland in Galway, and more recent partners like IDIAP involved in the HUMAVIPS project, Université Gaston Berger in Senegal and University of Melbourne in Australia.
We will also work at turning other current European contacts, *e.g.* at EPFL (A. Roche at University Hospital Lausanne and Siemens Healthcare), into more formal partnerships and eventually explore the possibility of an H2020 project in the *Personalizing Health and Care* axis.

The main international collaborations that we are currently trying to develop are with:

Fabrizio Durante, Free University of Bozen-Bolzano, Italy.

Emma Holian and John Hinde from National University of Ireland, Galway, Ireland.

K. Qin and D. Wraith from RMIT in Melbourne, Australia and Queensland University of Technology in Brisbane, Australia.

E. Deme and S. Sylla from Saint Louis university and IRD in Saint Louis, Senegal.

Alexandre Nazin and Russian Academy of Science in Moscow, Russia.

Alexis Roche and University Hospital Lausanne/Siemens Healthcare, Advanced Clinical Imaging Technology group, Lausanne, Switzerland.

Seydou Nourou Sylla (Université Gaston Berger, Sénégal) has been hosted by the mistis team for four months.

Darren Wraith (Queensland University of Technology in Brisbane, Australia) has been hosted by the mistis team for 2 weeks.

Alexis Arnaud (Master, from Feb 2014 until June 2013)

Subject: Mixtures of generalized Student multivariate distributions: application to tumor characterisation from multiparametric MRI.

Institution: University Montpellier 2

Anne Charlier (2nd year)

Subject: Estimation of gas concentrations in a gas mixture from spectrophotometric measures.

Institution: PHELMA, Grenoble-INP

Lisa Qian-ru (Master)

Subject: Inverse regression to identify and quantify pollutants from UV spectroscopy measures.

Institution: Univ. PMF, Hemera, Meylan

Seydou-Nourou Sylla (PhD, from September 2014 to December 2014)

Subject: Classification for medical data

Institution: Université Gaston Berger (Sénégal)

F. Forbes co-organized the workshop *Statistical Challenges in Neuroscience* in Warwick, UK in Sept. 2014, http://

F. Forbes co-organized the workshop on *Probabilistic graphical models and structured data on graphs* in Grenoble, in July 2014.

Stéphane Girard co-organized the workshop "Extreme Value Theory, Spatial and Temporal Aspects", Besançon,
https://

Stéphane Girard co-organized the “Rencontres d'Astrostatistique”, Grenoble,
http://

"Extremes and Copulas", Grenoble,
http://

Stéphane Girard organized the workshop
“*Copulas and extremes*”, Grenoble,
http://

Marie José Martinez, Jean Baptiste Durand, Florence Forbes in collaboration with Iragael Joly (Grenoble Applied Economics Laboratory) organized the workshop "Statistics, Activities and Transportation" in Grenoble
http://

F. Forbes is a member of the committee for the 2nd SFRMBM (Société Francaise de Résonance Magnétique en Biologie et Médecine) conference in Grenoble in 2015, http://

Stéphane Girard organized an invited session "Regression extremes" at the *7th international conference ERCIM*, Pisa, Italy, December 2014.

Florence Forbes has been Associate Editor of the journal *Frontiers in ICT: Computer Image Analysis* since its creation in Sept. 2014. Computer Image Analysis is a new specialty section in the community-run open-access
journal Frontiers in ICT. This section is led by Specialty Chief Editors Drs
Christian Barillot and Patrick Bouthemy.

Stéphane Girard has been Associate Editor of the *Statistics and Computing* journal since 2012.
He has also been a member of the Advisory Board of the *Dependence Modelling* journal since December 2014.

In 2014, Florence Forbes has been a reviewer for the *NIPS* and *ICASSP* conferences and for the *Statistics and Computing* journal.

In 2014, Stéphane Girard has been a reviewer for *Annals of Statistics, Journal of Statistical Software, Metrika, Lecture Notes in Statistics, RevStat, ESAIM Probability & Statistics, Journal de la Société Française de Statistique.*

F. Forbes and J.-B. Durand are part of an INRA
(French National Institute for Agricultural Research)
Network (AIGM, http://

F. Forbes and S. Girard were elected as members of the bureau of the “Analyse d'images, quantification, et statistique” group in the Société Française de Statistique (SFdS).

F. Forbes and M-J. Martinez are members of the ERCIM working group on Mixture models.

Licence (IUT): Marie-José Martinez , *Statistics*, 192 ETD, L1 to L3 levels, université Grenoble 2, France.

Master: Jean-Baptiste Durand, *Statistics and probability*, 192 ETD, M1 and M2 levels, Ensimag Grenoble INP, France.

Licence (IUT) : Gildas Mazo, Mathematics and C language, 128h, L1 level, université Grenoble 1, France.

Master: Farida Enikeeva, *Statistics*, 96 ETD, M1 level, Ensimag Grenoble INP, France.

Master : Stéphane Girard, *Statistique Inférentielle Avancée*, 45 ETD, M1 level,
Ensimag Grenoble-INP, France and *Introduction à la statistique des valeurs extrêmes*, 12 ETD, M2 level, université Grenoble 2, France.

Master : Florence Forbes, Mixture models and EM algorithm, 12h, M2 level, UFR IM2A, université Grenoble 1, France.

M.-J. Martinez is a faculty member at Univ. Pierre Mendès France, Grenoble II.

J.-B. Durand is a faculty member at Ensimag, Grenoble INP.

F. Enikeeva was on a half-time ATER position at Ensimag, Grenoble INP.

PhD : Pierre Fernique, *"A statistical modeling framework
for analyzing tree-indexed data"*, Montpellier 2
University. 10 Dec. 2014, Y. Guédon, J.-B. Durand.

PhD : Gildas Mazo, *"Construction et estimation de copules en grande dimension"*,
Université Grenoble 1, 17 Nov. 2014, S. Girard, F. Forbes.

Stéphane Girard has been involved in the following PhD committees:

Blandine Fillon "Développement d'un outil statistique pour évaluer les charges maximales subies par l'isolation d'une cuve de méthanier au cours de sa période d'exploitation", Univ. Poitiers, December 2014.

Tom Rohmer "Deux tests de détection de rupture dans la copule d'observations multivariées", Univ. Pau et des Pays de l'Adour, October 2014.

Anthony Zullo "Functional analysis of high dimensional remote sensing images: application to the characterization of semi-natural objects in landscape ecology", Univ. Toulouse, July 2014.

Florence Forbes has been involved in the PhD committees of:

Haithem Boussaid, "Efficient Inference and learning in Graphical models for multi-organ shape segmentation", Ecole Centrale Paris, January 8, 2015 (President).

Zacharie Irace, "Modelisation statistique et segmentation d'images TEP. Application à l'hétérogénéité et au suivi de tumeurs", INP Toulouse, Oct 8, 2014 (Reviewer).

Vincent Brault, "Estimation et selection de modèle pour le modèle des blocs latents", Paris-Sud University, Sept 9, 2014 (Reviewer).

Florence Forbes has been reviewer for the HDR committee of:

Stéphane Chrétien, "Contribution à l'analyse et à l'amélioration de certaines méthodes pour l'inférence statistique par vraisemblance pénalisée", from University of Besançon, in Dec. 2014.

From Sept. 2009 to Sept. 2014, F. Forbes was head of the committee in charge of examining post-doctoral candidates at Inria Grenoble Rhône-Alpes ("Comité des Emplois Scientifiques").

Florence Forbes is a member of the INRA committee (CSS MBIA) in charge of evaluating INRA researchers once a year in the MBIA dept of INRA.

Florence Forbes was a member of the committee for research scientist candidate (CR) selection at Inria Lille and at Inria Grenoble in 2014.