## Section: New Results

### New results Biodiversity

The activity of Pleiade in computational biodiversity has mainly consisted in strengthening a cooperation with actors in High Performance Computing, namely the Inria team HiePACS, on method development for metabarcoding. Metabarcoding relies on supervised or unsupervised statistical learning to build taxonomic inventories from so-called environmental samples, i.e. sets of short reads of the same marker for a whole community or guild. Most of the tools used for this purpose still rely on classical methods from Multivariate Data Analysis. These methods are well known, and they often operate behind the scenes in current Machine Learning developments (kernel PCA, Support Vector Machines, etc.). Most of them, if not all, are based on the Singular Value Decomposition (SVD) of a matrix. If $p$ features are observed on $n$ items, the size of the matrix is $n\times p$. The complexity of such algorithms is in $O\left({p}^{3}\right)$.

The recent development of NGS data has multiplied the size of data sets by a factor of ${10}^{2}$ to ${10}^{3}$. Because the cost is cubic, this multiplies the required computation time by a factor of ${10}^{6}$ to ${10}^{9}$. Absorbing such an increase is beyond the resources currently offered by parallelization alone, so a different approach has been selected. It has been known for some years that concentration of measure phenomena (a sort of extension of the law of large numbers) lead to a blessing of dimensionality: some randomized methods can be used as heuristics to perform certain matrix computations both efficiently and accurately. This is the case for the SVD. Therefore, a cooperation has been set up between HiePACS and Pleiade through Pierre Blanchard (a former HiePACS PhD student who held a post-doc position for 7 months in Pleiade) to implement those methods in the framework of metabarcoding.
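
As an illustration of the randomized approach mentioned above, here is a minimal sketch of an SVD by random projection, written in Python with NumPy. It is not the code developed in the cooperation: the function name `randomized_svd` and the parameters `oversample` and `n_power_iter` are assumptions chosen for this example. The idea is to project the $n\times p$ matrix onto a small random subspace of dimension $k \ll p$, so that the dominant cost becomes a few products with the matrix (roughly $O(npk)$) followed by an exact SVD of a small matrix.

```python
import numpy as np


def randomized_svd(a, rank, oversample=10, n_power_iter=2, seed=0):
    """Approximate the leading `rank` singular triplets of `a` by random projection.

    Illustrative sketch only; names and defaults are assumptions of this example.
    """
    rng = np.random.default_rng(seed)
    n, p = a.shape
    k = min(rank + oversample, n, p)

    # Gaussian test matrix: the image of `a` on it captures the dominant column space.
    omega = rng.standard_normal((p, k))
    y = a @ omega

    # A few power iterations sharpen the separation between singular values;
    # re-orthonormalising at each step keeps the computation numerically stable.
    for _ in range(n_power_iter):
        q, _ = np.linalg.qr(y)
        z, _ = np.linalg.qr(a.T @ q)
        y = a @ z

    # Orthonormal basis of the sampled subspace.
    q, _ = np.linalg.qr(y)

    # Exact SVD of the small (k x p) projected matrix, then lift back to size n.
    b = q.T @ a
    u_small, s, vt = np.linalg.svd(b, full_matrices=False)
    u = q @ u_small
    return u[:, :rank], s[:rank], vt[:rank, :]


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    # Synthetic matrix of exact rank 8, used only to compare with the exact SVD.
    a = rng.standard_normal((500, 8)) @ rng.standard_normal((8, 200))
    u, s, vt = randomized_svd(a, rank=8)
    exact = np.linalg.svd(a, compute_uv=False)[:8]
    print("max relative error on singular values:", np.max(np.abs(s - exact) / exact))
```

The accuracy of such a sketch depends on how fast the singular values decay; the oversampling and the power iterations are the usual knobs for trading cost against accuracy.
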
Earlier work in Pleiade (with a DARI project, 2014-2016) produced many high-dimensional pairwise distance matrices of DNA environmental samples (amplicon-based metabarcoding). Classical Multidimensional Scaling (MDS) of some of those matrices has been implemented in C++, using dedicated libraries for so-called random projection or column selection (the `fmr` library). This has made it possible to build a point cloud for an environmental sample of $1.2\times {10}^{5}$ reads, to inspect its "shape" visually through projections on the first axes, and to build a low-dimensional approximation of it. The outcome is twofold: $\left(i\right)$ a point cloud attached to an environmental sample, for further ecological studies, and $\left(ii\right)$ the delivery of a scientific library in High Performance Computing for randomized matrix computations. These research lines will be carried on in 2018, and the cooperation will be extended to the GRICAD mésocentre in Grenoble for HPC and C++ code development.
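
For readers unfamiliar with classical MDS, the following Python/NumPy sketch shows how a point cloud is obtained from a pairwise distance matrix. It is an illustration only: the work described above is implemented in C++ with the `fmr` library and randomized solvers, whose interface is not reproduced here, and the function name `classical_mds` is an assumption of this example. The sketch uses an exact symmetric eigendecomposition, which is only practical for small matrices; for a sample of the size quoted above this step is precisely what a randomized method would replace.

```python
import numpy as np


def classical_mds(dist, n_axes=2):
    """Classical (Torgerson) MDS: coordinates from a pairwise distance matrix.

    Returns the coordinates on the first `n_axes` axes and the matching eigenvalues.
    """
    d2 = np.asarray(dist, dtype=float) ** 2

    # Double centering turns squared distances into a Gram (inner product) matrix:
    # B = -1/2 * J * D^2 * J, with J = I - (1/n) * ones * ones^T.
    row_mean = d2.mean(axis=1, keepdims=True)
    col_mean = d2.mean(axis=0, keepdims=True)
    gram = -0.5 * (d2 - row_mean - col_mean + d2.mean())

    # Symmetric eigendecomposition; keep the largest eigenvalues first.
    eigval, eigvec = np.linalg.eigh(gram)
    order = np.argsort(eigval)[::-1][:n_axes]

    # Negative eigenvalues (non-Euclidean distances) are clipped to zero.
    lam = np.clip(eigval[order], 0.0, None)
    coords = eigvec[:, order] * np.sqrt(lam)
    return coords, eigval[order]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Synthetic 2-D points standing in for reads; real input would be a
    # matrix of pairwise distances between sequences.
    pts = rng.standard_normal((100, 2))
    dist = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
    coords, eigval = classical_mds(dist, n_axes=2)
    print(coords.shape, eigval)
```

The columns of `coords` are the projections on the first axes; plotting the first two or three gives the kind of visual inspection of the "shape" of a sample described above.
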

Pleiade has continued to develop statistical learning methods, both supervised and unsupervised, for metabarcoding. A cooperation with IMBE in Marseille has made it possible to combine MDS, as developed above, with graph-based methods (building the connected components of a graph obtained by thresholding pairwise distance matrices), and to test these methods for unsupervised statistical learning (OTU building) on data sets from an ongoing PhD in Marseille Bay. Cooperation with Institut Pasteur in Cayenne has led to a joint publication [12], a proof of concept of an inventory by metagenomics of the viromes of bats in French Guiana, with two objectives: $\left(i\right)$ detect as early as possible strains that could potentially be transmitted to humans and $\left(ii\right)$ develop a viral ecology by studying further how environmental factors and the nature of the host drive the virome composition.
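
The graph-based step mentioned above (connected components of a thresholded distance graph) can be sketched as follows in plain Python. This is a minimal illustration, not the code used with IMBE: the function name `otu_components` and the toy distances and threshold are assumptions of this example. Two reads are linked when their distance is at most the threshold, and each connected component is taken as one candidate OTU, which amounts to single-linkage grouping.

```python
def otu_components(dist, threshold):
    """Group items into connected components of the graph whose edges link
    pairs at distance <= threshold. Returns sorted lists of item indices."""
    n = len(dist)
    parent = list(range(n))  # union-find forest, one tree per component

    def find(i):
        # Follow parents up to the root, shortening the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Only the upper triangle is scanned: the distance matrix is symmetric.
    for i in range(n):
        for j in range(i + 1, n):
            if dist[i][j] <= threshold:
                ri, rj = find(i), find(j)
                if ri != rj:
                    parent[max(ri, rj)] = min(ri, rj)

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


if __name__ == "__main__":
    # Toy distances between four reads (hypothetical values).
    dist = [
        [0, 1, 9, 9],
        [1, 0, 9, 9],
        [9, 9, 0, 2],
        [9, 9, 2, 0],
    ]
    print(otu_components(dist, threshold=3))
```

The double loop is quadratic in the number of reads, which is acceptable for a sketch; at the scale of a full environmental sample, a sparse neighbour list would be needed instead of a dense matrix.
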

Meanwhile, Pleiade has continued its cooperation with SLU in Uppsala, especially on metabarcoding of diatom communities in rivers and lakes in Sweden (co-direction of a PhD student based at SLU in Uppsala), and on first steps in the biogeography of diatoms in Fennoscandia (cooperation with a post-doc at SLU).