## Section: New Results

### Regression and machine learning

Participants: E. Albuisson, T. Bastogne, S. Ferrigno, A. Gégout-Petit, F. Greciet, P. Guyot, C. Karmann, J.-M. Monnez, N. Sahki, S. Mézières, B. Lalloué

Through a collaboration with the pharmaceutical company Transgene (Strasbourg, France), we have developed a method for selecting covariates. The problem posed by Transgene was to establish patient profiles on the basis of their response to a treatment developed by the company. We proposed a new methodology for selecting and ranking covariates associated with a variable of interest in the context of high-dimensional, dependent data with few observations. The methodology successively combines clustering of the covariates, decorrelation of the covariates using latent factor analysis, selection by aggregation of adapted methods, and finally ranking. A simulation study shows the benefit of decorrelating within the different clusters of covariates. We have applied our method to Transgene data, for instance to transcriptomic data from 37 patients with advanced non-small-cell lung cancer who received chemotherapy, in order to select the transcriptomic covariates that explain the survival outcome of the treatment. The method has also been applied in another collaboration, with biologists from the CRAN laboratory (Nancy, France), to transcriptomic data from 79 breast tumor samples, in order to define patient profiles for a new metastatic biomarker and its associated gene network. This method contributes to the development of personalized medicine. We have published the method and the two applications in [27].

In order to detect changes in the health state of lung-transplanted patients, we have begun to work on the detection of breaks in multivariate physiological signals. We consider the score-based CUSUM statistic and evaluate the detection performance of several thresholds on simulated data. Two thresholds come from the literature, Wald's constant threshold and Margavio's instantaneous threshold, and three are our own contributions, built by a simulation-based procedure: the first is constant, the second is instantaneous, and the third is a dynamic version of the second. The performance of each threshold is evaluated in several scenarios, according to the detection objective and the actual change in the data. The simulation results show that Margavio's threshold is the best at minimizing the detection delay while maintaining the given false alarm rate. On real data, however, we suggest using the dynamic instantaneous threshold, because it is the easiest to build in practice. This work is the subject of the communication [11] and the submitted paper [35].
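As a minimal illustration of the CUSUM principle, the following Python sketch handles the simplest textbook case: a univariate Gaussian signal whose mean shifts from `mu0` to `mu1`, monitored with a constant threshold. It is not the score-based multivariate statistic or any of the five thresholds studied in [11] and [35]; the function name, the parameters and the threshold value are assumptions chosen for the example.

```python
def cusum_detect(xs, mu0, mu1, sigma, threshold):
    """One-sided CUSUM for a mean shift mu0 -> mu1 in Gaussian data.

    Returns the index of the first observation at which the statistic
    exceeds `threshold`, or None if no alarm is raised.
    """
    s = 0.0
    for t, x in enumerate(xs):
        # Log-likelihood-ratio increment of N(mu1, sigma^2) against N(mu0, sigma^2).
        increment = (mu1 - mu0) / sigma ** 2 * (x - (mu0 + mu1) / 2.0)
        # The statistic is reset to zero whenever it would become negative.
        s = max(0.0, s + increment)
        if s > threshold:
            return t
    return None


# Noise-free toy signal (hypothetical): the mean jumps from 0 to 1 at index 50.
signal = [0.0] * 50 + [1.0] * 50
print(cusum_detect(signal, mu0=0.0, mu1=1.0, sigma=1.0, threshold=3.0))  # 56
```

In this toy run the alarm is raised at index 56, that is, six observations after the change. A higher threshold lowers the false alarm rate at the cost of a longer detection delay, which is the trade-off the thresholds above are compared on.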

We consider the problem of variable selection in regression models. In particular, we are interested in selecting the explanatory covariates linked with the response variable, and we want to determine which covariates are relevant, that is, which covariates are involved in the model. In this framework, we deal with L1-penalised regression models. To handle the choice of the penalty parameter used to perform variable selection, we develop a new method based on knockoffs. This revisited knockoffs method is general and suitable for a wide range of regressions with various types of response variables. It also works when the number of observations is smaller than the number of covariates, and it gives an order of importance of the covariates. Finally, we provide many experimental results to corroborate our method and compare it with other variable selection methods. This work is the subject of the communication [17], the submitted paper [30] and a chapter of the PhD thesis [1].
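For reference, an L1-penalised estimator can be written generically as below. The notation is an assumption made here for illustration and is not taken from [30]: $\ell_n$ denotes the log-likelihood of the chosen regression model, $\beta \in \mathbb{R}^p$ the vector of coefficients and $\lambda \ge 0$ the penalty parameter.

$$\hat{\beta}(\lambda) \in \arg\min_{\beta \in \mathbb{R}^p} \left\{ -\ell_n(\beta) + \lambda \sum_{j=1}^{p} |\beta_j| \right\}$$

Larger values of $\lambda$ set more coefficients exactly to zero, so the choice of $\lambda$ determines which covariates are selected.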

In order to model the crack propagation rate, a continuous physical phenomenon that presents several regimes, we proposed a piecewise polynomial regression model under continuity and/or differentiability assumptions, as well as a statistical inference method to estimate the transition times and the parameters of each regime. We proposed several algorithms and studied their efficiency. The most efficient algorithm relies on dynamic programming. This work is the subject of the communication [14] and the PhD thesis of Florine Greciet.
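To give an idea of how dynamic programming locates transition points, here is a simplified Python sketch, under stated assumptions, that fits a piecewise linear model with a fixed number of segments by minimizing the total residual sum of squares. Unlike the model described above, it imposes no continuity or differentiability constraint at the transitions, and it is not the algorithm of [14]; the function names and the toy data are hypothetical.

```python
def segment_cost(x, y, i, j):
    """Residual sum of squares of a least-squares line fitted to points i..j-1."""
    xs, ys = x[i:j], y[i:j]
    n = j - i
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((a - mx) ** 2 for a in xs)
    sxy = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    slope = sxy / sxx if sxx > 0 else 0.0
    return sum((b - my - slope * (a - mx)) ** 2 for a, b in zip(xs, ys))


def best_breakpoints(x, y, n_segments, min_size=2):
    """Return (breakpoints, total RSS) of the best split into n_segments pieces.

    A breakpoint is the index of the first point of a new segment.
    """
    n = len(x)
    inf = float("inf")
    # cost[k][j]: minimal total RSS when the first j points form k segments.
    cost = [[inf] * (n + 1) for _ in range(n_segments + 1)]
    back = [[0] * (n + 1) for _ in range(n_segments + 1)]
    cost[0][0] = 0.0
    for k in range(1, n_segments + 1):
        for j in range(k * min_size, n + 1):
            for i in range((k - 1) * min_size, j - min_size + 1):
                if cost[k - 1][i] == inf:
                    continue
                c = cost[k - 1][i] + segment_cost(x, y, i, j)
                if c < cost[k][j]:
                    cost[k][j] = c
                    back[k][j] = i
    # Backtrack from the last segment to recover the breakpoints.
    breaks = []
    j = n
    for k in range(n_segments, 1, -1):
        j = back[k][j]
        breaks.append(j)
    return sorted(breaks), cost[n_segments][n]


# Toy data (hypothetical): two linear regimes with a jump at index 10.
x = [float(t) for t in range(20)]
y = [float(t) for t in range(10)] + [50.0 - 2.0 * (t - 10) for t in range(10, 20)]
breaks, rss = best_breakpoints(x, y, n_segments=2)
print(breaks)  # [10]
```

This naive version recomputes each segment cost from scratch, so it is only meant for small samples. Precomputing cumulative sums would make each cost evaluation constant-time.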

Let us consider a regression model $Y = m(X) + \sigma(X)\epsilon$ to explain $Y$ from $X$, where $m(\cdot)$ is the regression function, $\sigma^{2}(\cdot)$ the variance function and $\epsilon$ the random error term. Methods to assess how well a model fits a set of observations fall under the banner of goodness-of-fit tests. Many tests have been developed to assess the different assumptions of this kind of model. Most of them are “directional”, in that they mainly detect departures from one given assumption of the model. Others are “global”, in that they assess whether a model fits a data set on all its assumptions. We focus on the task of choosing the structural part $m(\cdot)$, which gets the most attention because it contains easily interpretable information about the relationship between $X$ and $Y$. To validate the form of the regression function, we consider three nonparametric tests based on a generalization of the Cramér-von Mises statistic. The first two are directional tests, while the third is a global test. To perform these tests, we have used wild bootstrap methods, and we have also proposed a method to choose the bandwidth parameter used in the nonparametric estimations. We have then developed the cvmgof R package, an easy-to-use tool for a wide range of users. A paper in progress describes the use of the package and illustrates it with simulations comparing the three implemented tests.

In epidemiology, we are working with INSERM clinicians and biostatisticians to study fetal development in the last two trimesters of pregnancy. Reference or standard curves are required in this kind of biomedical problem. Values that lie outside the limits of these reference curves may indicate the presence of a disorder. The data come from the French EDEN mother-child cohort (INSERM), a study investigating the prenatal and early postnatal determinants of child health and development. A total of 2002 pregnant women were recruited before 24 weeks of amenorrhoea in two maternity clinics from middle-sized French cities (Nancy and Poitiers). From May 2003 to September 2006, 1899 newborns were then included. The main outcomes of interest are fetal (via ultrasound) and postnatal growth, adiposity development, respiratory health, atopy, behaviour, and bone, cognitive and motor development. We are studying fetal weight as a function of gestational age in the second and third trimesters of pregnancy. Classical empirical and parametric methods, such as polynomial regression, are first used to construct these curves. Polynomial regression is one of the most common parametric approaches for modelling growth data, especially during the prenatal period. However, some of these methods require strong assumptions. We therefore propose to work with the semi-parametric LMS method, which modifies the response variable (fetal weight) with a Box-Cox transformation. Nonparametric methods such as Nadaraya-Watson kernel estimation or local polynomial estimation are also proposed to construct these curves. This work is the subject of the communication [28], and a paper is in progress. In addition, we want to develop a test, based on Z-scores, to detect any slope breaks in the fetal development curves (work in progress).
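As an illustration of the nonparametric approach mentioned above, the Nadaraya-Watson estimator of a curve at a point `x0` is a weighted average of the observed responses, with weights that decrease with the distance between each observed `x` and `x0`. The Python sketch below assumes a Gaussian kernel and a user-supplied bandwidth; the function name and the toy values are hypothetical and unrelated to the EDEN data.

```python
import math


def nadaraya_watson(x0, xs, ys, bandwidth):
    """Nadaraya-Watson estimate of E[Y | X = x0] with a Gaussian kernel.

    Raises ZeroDivisionError if x0 is so far from every observed x that
    all kernel weights underflow to zero.
    """
    weights = [math.exp(-0.5 * ((x0 - x) / bandwidth) ** 2) for x in xs]
    total = sum(weights)
    return sum(w * y for w, y in zip(weights, ys)) / total


# Toy example: halfway between two observations, both receive the same weight.
print(nadaraya_watson(0.5, [0.0, 1.0], [0.0, 1.0], bandwidth=1.0))  # 0.5
```

Evaluating the function on a grid of `x0` values gives a smooth curve. The bandwidth controls the trade-off between smoothness and fidelity to the data.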

Many articles have been devoted to the problem of recursively estimating the eigenvectors, corresponding to eigenvalues in decreasing order, of the expectation of a random matrix, using an i.i.d. sample of it. The present study makes the following contributions: the convergence of processes to normed eigenvectors is proved under two sets of more general assumptions, in which the observed random matrices are no longer assumed to be i.i.d.; moreover, the scope of these processes is widened. The application to online principal component analysis of a data stream is treated, assuming that the data are realizations of a random vector Z whose expectation is unknown and is estimated online, as is, possibly, the metric used when it depends on unknown characteristics of Z. Two types of processes are studied: those of the first type use a mini-batch of data at each step, while those of the second type use all the data observed up to the current step without storing them, thus taking into account all the information contained in previous data. The conducted experiments have shown that processes of the second type are faster than those of the first type. This work is the subject of the submitted paper [32] and the communication [21].
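To make the idea of a recursive eigenvector estimate with online centering concrete, the following Python sketch applies a normalized Oja-type update to a data stream, using one observation per step and a running estimate of the mean. It is a classical textbook scheme given under stated assumptions (identity metric, step size 1/n, first component only), not one of the processes studied in [32]; all names and the simulated stream are hypothetical.

```python
import math
import random


def online_first_component(stream, dim):
    """Estimate the mean and the first principal direction of a data stream.

    Each observation is used once and then discarded.
    """
    mean = [0.0] * dim
    w = [1.0 / math.sqrt(dim)] * dim  # unit-norm starting direction
    for n, z in enumerate(stream, start=1):
        # Online update of the mean, then centering of the new observation.
        mean = [m + (zi - m) / n for m, zi in zip(mean, z)]
        zc = [zi - m for zi, m in zip(z, mean)]
        # Oja-type step: move w towards (zc zc^T) w, then renormalize.
        proj = sum(a * b for a, b in zip(zc, w))
        step = 1.0 / n
        w = [wi + step * proj * zi for wi, zi in zip(w, zc)]
        norm = math.sqrt(sum(wi * wi for wi in w))
        w = [wi / norm for wi in w]
    return mean, w


def simulate(n, seed=0):
    """Synthetic 2-D stream: mean (5, -2), most of the variance along the first axis."""
    rng = random.Random(seed)
    for _ in range(n):
        yield (5.0 + 3.0 * rng.gauss(0, 1), -2.0 + 0.5 * rng.gauss(0, 1))


mean, w = online_first_component(simulate(5000), dim=2)
print(mean, w)
```

On this synthetic stream the estimated direction should align, up to sign, with the first coordinate axis, which carries most of the variance.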

This study addresses the problem of constrained binary logistic regression, particularly in the case of a data stream, using a stochastic approximation process. To avoid the numerical explosion that can be encountered, we propose to use a process with online standardized data instead of raw data. This type of process can also be used when the explanatory variables have to be standardized, for example in the case of a shrinkage method such as LASSO. We define an averaged constrained stochastic gradient process with online standardized data and study its almost sure convergence. Moreover, we propose to use a piecewise constant step-size, so that the step-size does not decrease too quickly and slow down the convergence. Processes of this type are compared to classical processes on real and simulated datasets. The results of the conducted experiments confirm the validity of the choices made. This will be used in an ongoing application to the online updating of a score in heart failure patients. This work is the subject of the submitted paper [31] and the communications [20], [19].
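The three ingredients named above (online standardization, a piecewise constant step-size, and averaging of the iterates) can be sketched in Python as follows. This is a simplified illustration under stated assumptions: it handles unconstrained logistic regression only, and the step-size schedule, default values, function names and simulated stream are hypothetical rather than taken from [31].

```python
import math
import random


def sigmoid(u):
    """Numerically stable logistic function."""
    if u >= 0:
        return 1.0 / (1.0 + math.exp(-u))
    e = math.exp(u)
    return e / (1.0 + e)


def averaged_sgd_logistic(stream, dim, base_step=0.5, block=100, power=0.75):
    """Averaged stochastic gradient for logistic regression on standardized data.

    Returns the averaged coefficients (intercept last) together with the
    running means and standard deviations used for standardization.
    """
    mean = [0.0] * dim
    m2 = [0.0] * dim   # running sums of squared deviations (Welford update)
    std = [1.0] * dim
    theta = [0.0] * (dim + 1)
    theta_bar = [0.0] * (dim + 1)
    for n, (x, y) in enumerate(stream, start=1):
        # Update the running mean and standard deviation of each covariate.
        for j in range(dim):
            delta = x[j] - mean[j]
            mean[j] += delta / n
            m2[j] += delta * (x[j] - mean[j])
            std[j] = math.sqrt(m2[j] / n) if m2[j] > 0 else 1.0
        # Standardize the current observation and append the intercept term.
        z = [(x[j] - mean[j]) / std[j] for j in range(dim)] + [1.0]
        # The step-size is constant within each block of `block` observations.
        step = base_step / (1 + (n - 1) // block) ** power
        p = sigmoid(sum(t * v for t, v in zip(theta, z)))
        theta = [t - step * (p - y) * v for t, v in zip(theta, z)]
        # Running average of the iterates.
        theta_bar = [b + (t - b) / n for b, t in zip(theta_bar, theta)]
    return theta_bar, mean, std


def simulate(n, seed=0):
    """Synthetic stream whose two covariates live on very different scales."""
    rng = random.Random(seed)
    for _ in range(n):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        x = (1000.0 + 200.0 * z1, 0.01 * z2)
        y = 1 if rng.random() < sigmoid(2.0 * z1 - 1.0 * z2) else 0
        yield x, y


theta_bar, mean, std = averaged_sgd_logistic(simulate(5000), dim=2)
print(theta_bar)
```

The coefficients are expressed on the standardized scale. With raw covariates of the magnitudes simulated here, the same step-size would produce very large gradient steps on the first covariate, which is the kind of numerical explosion that online standardization is meant to avoid.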