## Section: New Results

### Algorithmic aspects of topological and geometric data analysis

#### DTM-based filtrations

Participants : Frédéric Chazal, Marc Glisse, Raphaël Tinarrage.

In collaboration with H. Anai, Y. Ike, H. Inakoshi and Y. Umeda of Fujitsu.

Despite strong stability properties, the persistent homology of filtrations classically used in Topological Data Analysis, such as the Čech or Vietoris-Rips filtrations, is very sensitive to the presence of outliers in the data from which it is computed. In this paper [33], we introduce and study a new family of filtrations, the DTM-filtrations, built on top of point clouds in Euclidean space, which are more robust to noise and outliers. The approach adopted in this work relies on the notion of distance-to-measure functions and extends previous work on the approximation of such functions.
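As a point of reference, the empirical distance-to-measure is simple to compute directly. The sketch below is our own minimal illustration (the function name and the choice of a mass parameter `m` are ours, not code from [33]): it evaluates the DTM of a point cloud at query points, and shows why a single outlier barely perturbs it, since the DTM averages over the $k = \lceil mn \rceil$ nearest sample points rather than relying on the single nearest one.

```python
import numpy as np

def dtm(points, queries, m=0.1):
    """Empirical distance-to-measure with mass parameter m: for each query,
    the root mean squared distance to its k = ceil(m * n) nearest sample
    points. A lone outlier shifts one nearest-neighbor distance but barely
    moves this k-point average, hence the robustness."""
    n = len(points)
    k = max(1, int(np.ceil(m * n)))
    # pairwise squared distances, shape (n_queries, n_points)
    d2 = ((queries[:, None, :] - points[None, :, :]) ** 2).sum(axis=-1)
    d2.sort(axis=1)
    return np.sqrt(d2[:, :k].mean(axis=1))
```

Informally, sublevel-set filtrations of such a function (suitably approximated by unions of balls) are the starting point of the DTM-filtrations.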

#### Persistent Homology with Dimensionality Reduction: $k$-Distance vs Gaussian Kernels

Participants : Shreya Arya, Jean-Daniel Boissonnat, Kunal Dutta.

We investigate the effectiveness of dimensionality reduction for computing persistent homology for both the $k$-distance and the kernel distance [34]. For the $k$-distance, we show that the standard Johnson-Lindenstrauss reduction preserves the $k$-distance, which preserves the persistent homology up to a ${(1-\epsilon )}^{-1}$ factor with target dimension $O(k\log n/{\epsilon}^{2})$. We also prove a concentration inequality for sums of dependent chi-squared random variables, which, under some conditions, allows the persistent homology to be preserved in $O(\log n/{\epsilon}^{2})$ dimensions. This answers an open question of Sheehy. For Gaussian kernels, we show that the standard Johnson-Lindenstrauss reduction preserves the persistent homology up to a $4{(1-\epsilon )}^{-1}$ factor.
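To make the statement concrete, here is a small numerical sketch (ours, not code from [34]) of the two ingredients: the $k$-distance of a point to a point cloud, and a dense Gaussian Johnson-Lindenstrauss projection. Under the theorem, the $k$-distance computed after projection agrees with the original up to a small multiplicative factor.

```python
import numpy as np

def k_distance(points, x, k):
    """Root mean of the k smallest squared distances from x to the cloud."""
    d2 = np.sort(((points - x) ** 2).sum(axis=1))[:k]
    return np.sqrt(d2.mean())

def jl_project(points, target_dim, rng):
    """Dense Gaussian Johnson-Lindenstrauss map from R^d to R^target_dim."""
    d = points.shape[1]
    G = rng.standard_normal((d, target_dim)) / np.sqrt(target_dim)
    return points @ G
```

In practice one would check that `k_distance` before and after `jl_project` differ only by a factor close to 1 for a suitable target dimension.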

#### Computing Persistent Homology of Flag Complexes via Strong Collapses

Participants : Jean-Daniel Boissonnat, Siddharth Pritam.

In collaboration with Divyansh Pareek (Indian Institute of Technology Bombay, India)

We introduce a fast and memory-efficient approach to compute the persistent homology (PH) of a sequence of simplicial complexes. The basic idea is to simplify the complexes of the input sequence by using strong collapses, as introduced by J. Barmak and E. Minian [DCG (2012)], and to compute the PH of an induced sequence of reduced simplicial complexes that has the same PH as the initial one. Our approach has several salient features that distinguish it from previous work. It is not limited to filtrations (i.e. sequences of nested simplicial subcomplexes) but works for other types of sequences like towers and zigzags. To strong collapse a simplicial complex, we only need to store the maximal simplices of the complex, not the full set of all its simplices, which saves a lot of space and time. Moreover, the complexes in the sequence can be strong collapsed independently and in parallel. Finally, we can trade precision against time by choosing the number of simplicial complexes of the sequence we strong collapse. As a result, and as demonstrated by numerous experiments on publicly available data sets, our approach is extremely fast and memory efficient in practice [27].
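A minimal illustration of the core reduction step (our own sketch; the implementation behind [27] is far more refined): a vertex $v$ is dominated when some other vertex belongs to every maximal simplex containing $v$, and dominated vertices can be deleted without changing the homotopy type. Note that only the maximal simplices are ever stored.

```python
def strong_collapse(maximal_simplices):
    """Strong collapse: iteratively delete dominated vertices. A vertex v is
    dominated if some other vertex v' belongs to every maximal simplex
    containing v. Only the list of maximal simplices is stored."""
    simplices = {frozenset(s) for s in maximal_simplices}
    changed = True
    while changed:
        changed = False
        for v in set().union(*simplices):
            star = [s for s in simplices if v in s]
            dominators = frozenset.intersection(*star) - {v}
            if dominators:
                # delete v, then restore maximality of the simplex list
                reduced = {s - {v} for s in simplices}
                simplices = {s for s in reduced
                             if not any(s < t for t in reduced)}
                changed = True
                break
    return [set(s) for s in simplices]
```

For instance, a cone (which is contractible) collapses to a single vertex, while a cycle, which carries nontrivial homology, is left untouched.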

#### Strong Collapse for Persistence

Participants : Jean-Daniel Boissonnat, Siddharth Pritam.

In this paper [37], we build on the initial success of the strong collapse approach and show that further decisive progress can be obtained if one restricts the family of simplicial complexes to flag complexes. Flag complexes are fully characterized by their graph (or 1-skeleton), the other faces being obtained by computing the cliques of the graph. Hence, a flag complex can be represented by its graph, which is a very compact representation. Flag complexes are very popular and, in particular, Vietoris-Rips complexes are by far the most widely used simplicial complexes in Topological Data Analysis. It has been shown in our previous work that the persistent homology of Vietoris-Rips filtrations can be computed very efficiently using strong collapses. However, most of the time was devoted to computing the maximal cliques of the complex prior to their strong collapse. Here, we observe that the reduced complex obtained by strong collapsing a flag complex is itself a flag complex. Moreover, this reduced complex can be computed using only the 1-skeleton (or graph) of the complex, not the set of its maximal cliques. Finally, we show how to compute the equivalent filtration of the sequence of reduced flag simplicial complexes, again using only 1-skeletons. On the theory side, we show that strong collapses of flag complexes can be computed in time $O\left({v}^{2}{k}^{2}\right)$ where $v$ is the number of vertices of the complex and $k$ the maximal degree of its graph. The algorithm described in this paper has been implemented and the code will soon be released in the Gudhi library. Numerous experiments show that our method outperforms previous methods, e.g. Ripser.
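For flag complexes, domination can be read off the graph alone: $v$ is dominated by $v'$ exactly when the closed neighborhood $N[v]$ is contained in $N[v']$. A toy sketch of this 1-skeleton-only collapse (ours, not the Gudhi implementation):

```python
def flag_strong_collapse(edges):
    """Strong collapse of a flag complex using only its 1-skeleton:
    a vertex v is dominated by a neighbor w iff the closed neighborhood
    N[v] is contained in N[w]; dominated vertices are deleted until none
    remain. Returns the edge set of the reduced graph (no self-loops
    assumed in the input)."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    changed = True
    while changed:
        changed = False
        for v in list(adj):
            Nv = adj[v] | {v}
            if any(Nv <= adj[w] | {w} for w in adj[v]):
                for w in adj.pop(v):   # delete the dominated vertex v
                    adj[w].discard(v)
                changed = True
                break
    return {frozenset((v, w)) for v in adj for w in adj[v]}
```

The flag complex of a triangle is a filled (contractible) triangle, so it collapses away entirely, whereas a 4-cycle survives intact.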

#### Triangulating submanifolds: An elementary and quantified version of Whitney's method

Participants : Jean-Daniel Boissonnat, Siargey Kachanovich, Mathijs Wintraecken.

We quantify Whitney's construction to prove the existence of a triangulation for any ${C}^{2}$ manifold, so that we get an algorithm with explicit bounds. We also give a new elementary proof, which is completely geometric [36].

#### Randomized incremental construction of Delaunay triangulations of nice point sets

Participants : Jean-Daniel Boissonnat, Kunal Dutta, Marc Glisse.

In collaboration with Olivier Devillers (Inria, CNRS, Loria, Université de Lorraine).

*Randomized incremental construction* (RIC) is one of the most
important paradigms for building geometric data structures.
Clarkson and Shor developed a general theory that led to
numerous algorithms that are both simple and efficient in
theory and in practice.

Randomized incremental constructions are most of the time space and time optimal in the worst case, as exemplified by the construction of convex hulls, Delaunay triangulations, and arrangements of line segments.

However, the worst-case scenario occurs rarely in practice and we would like to understand how RIC behaves when the input is nice, in the sense that the associated output is significantly smaller than in the worst case. For example, it is known that the Delaunay triangulations of nicely distributed points in ${\mathbb{R}}^{d}$ or on polyhedral surfaces in ${\mathbb{R}}^{3}$ have linear complexity, as opposed to a worst-case complexity of $\Theta \left({n}^{\lfloor d/2\rfloor}\right)$ in the first case and quadratic in the second. The standard analysis does not provide accurate bounds on the complexity of such cases and we aim at establishing such bounds in this paper [35]. More precisely, we show that, in the two cases above and variants of them, the complexity of the usual RIC is $O(n\log n)$, which is optimal. In other words, without any modification, RIC nicely adapts to good cases of practical value.

Along the way, we prove a probabilistic lemma for sampling without replacement, which may be of independent interest.
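As a self-contained illustration of the RIC paradigm itself (not of the Delaunay construction analyzed in [35]), the following sketch implements a Welzl-style randomized incremental computation of the smallest enclosing circle of planar points: points are inserted in random order, and a point found outside the current circle must lie on the boundary of the final one. It assumes general position (no three boundary points collinear), and all names are ours.

```python
import random

def circle_two(p, q):
    """Circle with segment pq as diameter, stored as (cx, cy, r^2)."""
    cx, cy = (p[0] + q[0]) / 2, (p[1] + q[1]) / 2
    r2 = ((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2) / 4
    return (cx, cy, r2)

def circle_three(p, q, r):
    """Circumcircle of three non-collinear points, stored as (cx, cy, r^2)."""
    ax, ay = p; bx, by = q; cx, cy = r
    d = 2 * (ax * (by - cy) + bx * (cy - ay) + cx * (ay - by))
    ux = ((ax*ax + ay*ay) * (by - cy) + (bx*bx + by*by) * (cy - ay)
          + (cx*cx + cy*cy) * (ay - by)) / d
    uy = ((ax*ax + ay*ay) * (cx - bx) + (bx*bx + by*by) * (ax - cx)
          + (cx*cx + cy*cy) * (bx - ax)) / d
    return (ux, uy, (ax - ux) ** 2 + (ay - uy) ** 2)

def inside(c, p, eps=1e-9):
    return (p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2 <= c[2] + eps

def min_enclosing_circle(points):
    """Randomized incremental construction: when a newly inserted point lies
    outside the current circle, it must be on the boundary of the answer, so
    we rebuild with it pinned (and recurse one level deeper for each pin)."""
    pts = points[:]
    random.shuffle(pts)
    c = None
    for i, p in enumerate(pts):
        if c is None or not inside(c, p):
            c = (p[0], p[1], 0.0)
            for j, q in enumerate(pts[:i]):
                if not inside(c, q):
                    c = circle_two(p, q)
                    for s in pts[:j]:
                        if not inside(c, s):
                            c = circle_three(p, q, s)
    return c
```

The backward-analysis argument behind its expected linear running time is the same style of reasoning that the Clarkson-Shor theory systematizes.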

#### Approximate Polytope Membership Queries

Participant : Guilherme Da Fonseca.

In collaboration with Sunil Arya (Hong Kong University of Science and Technology) and David Mount (University of Maryland).

In the polytope membership problem, a convex polytope $K$ in ${\mathbb{R}}^{d}$ is given, and the objective is to preprocess $K$ into a data structure so that, given any query point $q\in {\mathbb{R}}^{d}$, it is possible to determine efficiently whether $q\in K$. We consider this problem in an approximate setting. Given an approximation parameter $\epsilon$, the query can be answered either way if the distance from $q$ to $K$'s boundary is at most $\epsilon$ times $K$'s diameter. We assume that the dimension $d$ is fixed, and $K$ is presented as the intersection of $n$ halfspaces. Previous solutions to approximate polytope membership were based on straightforward applications of classic polytope approximation techniques by Dudley (1974) and Bentley et al. (1982). The former is optimal in the worst case with respect to space, and the latter is optimal with respect to query time. We present four main results. First, we show how to combine the two above techniques to obtain a simple space-time trade-off. Second, we present an algorithm that dramatically improves this trade-off. In particular, for any constant $\alpha \ge 4$, this data structure achieves query time roughly $O(1/{\epsilon}^{\left(d-1\right)/\alpha})$ and space roughly $O(1/{\epsilon}^{\left(d-1\right)\left(1-\Omega \left(\log \alpha \right)\right)/\alpha})$. We do not know whether this space bound is tight, but our third result shows that there is a convex body such that our algorithm requires space at least $\Omega (1/{\epsilon}^{\left(d-1\right)\left(1-O\left(\sqrt{\alpha}\right)\right)/\alpha})$. Our fourth result shows that it is possible to reduce approximate Euclidean nearest neighbor searching to approximate polytope membership queries. Combined with the above results, this provides significant improvements to the best known space-time trade-offs for approximate nearest neighbor searching in ${\mathbb{R}}^{d}$. 
For example, we show that it is possible to achieve a query time of roughly $O(\log n+1/{\epsilon}^{d/4})$ with space roughly $O(n/{\epsilon}^{d/4})$, thus reducing by half the exponent in the space bound [11].
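The query semantics are easy to state in code. The following naive $O(n)$ test (our sketch; the point of [11] is precisely to avoid scanning all $n$ halfspaces at query time) makes explicit what an $\epsilon$-approximate answer is allowed to do.

```python
import numpy as np

def approx_membership(halfspaces, diam, q, eps):
    """Naive O(n) epsilon-approximate membership test for K = {x : A x <= b}.
    Accept if every constraint is violated by at most eps * diam (measured
    as a Euclidean distance, hence the normalization by row norms); reject
    if some constraint is violated by more. Inside the eps * diam fuzzy
    zone around the boundary, either answer is permitted."""
    A, b = halfspaces
    norms = np.linalg.norm(A, axis=1)
    slack = (A @ q - b) / norms        # signed distance beyond each halfspace
    if np.all(slack <= 0):
        return True                     # definitely inside
    if np.any(slack > eps * diam):
        return False                    # definitely far outside
    return True                         # in the fuzzy zone: we choose to accept
```

A data structure for this problem only needs to agree with the exact test outside the fuzzy zone, which is what allows sublinear query time.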

#### Approximate Convex Intersection Detection with Applications to Width and Minkowski Sums

Participant : Guilherme Da Fonseca.

In collaboration with Sunil Arya (Hong Kong University of Science and Technology) and David Mount (University of Maryland).

Approximation problems involving a single convex body in $d$-dimensional space have received a great deal of attention in the computational geometry community. In contrast, works involving multiple convex bodies are generally limited to dimensions $d\le 3$ and/or do not consider approximation. In this paper, we consider approximations to two natural problems involving multiple convex bodies: detecting whether two polytopes intersect and computing their Minkowski sum. Given an approximation parameter $\epsilon >0$, we show how to independently preprocess two polytopes $A$, $B$ into data structures of size $O(1/{\epsilon}^{\left(d-1\right)/2})$ such that we can answer in polylogarithmic time whether $A$ and $B$ intersect approximately. More generally, we can answer this for the images of $A$ and $B$ under affine transformations. Next, we show how to $\epsilon$-approximate the Minkowski sum of two given polytopes defined as the intersection of $n$ halfspaces in $O(n\log (1/\epsilon)+1/{\epsilon}^{(d-1)/2+\alpha})$ time, for any constant $\alpha >0$. Finally, we present a surprising impact of these results on a well-studied problem that considers a single convex body. We show how to $\epsilon$-approximate the width of a set of $n$ points in $O(n\log (1/\epsilon)+1/{\epsilon}^{\left(d-1\right)/2+\alpha})$ time, for any constant $\alpha >0$, a major improvement over the previous bound of roughly $O(n+1/{\epsilon}^{d-1})$ time [22].
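To make the notion of width concrete, here is a brute-force baseline (ours, unrelated to the algorithm of [22]) that estimates the width of a point set by minimizing the directional extent over random unit directions. Since every sampled extent is an upper bound on the true width, the estimate converges from above as more directions are drawn.

```python
import numpy as np

def approx_width(points, n_dirs=2000, seed=0):
    """Estimate the width of a point set: the minimum over unit directions u
    of the extent max<u,p> - min<u,p>. Random direction sampling is a
    brute-force baseline, not the paper's near-linear-time algorithm, but it
    makes the quantity being approximated concrete."""
    rng = np.random.default_rng(seed)
    d = points.shape[1]
    U = rng.standard_normal((n_dirs, d))
    U /= np.linalg.norm(U, axis=1, keepdims=True)   # unit directions
    proj = points @ U.T                             # (n_points, n_dirs)
    return (proj.max(axis=0) - proj.min(axis=0)).min()
```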

#### Approximating the Spectrum of a Graph

Participant : David Cohen-Steiner.

In collaboration with Weihao Kong (Stanford University), Christian Sohler (TU Dortmund) and Gregory Valiant (Stanford University).

The spectrum of a network or graph $G=(V,E)$ with adjacency matrix $A$ consists of the eigenvalues of the normalized Laplacian $L=I-{D}^{-1/2}A{D}^{-1/2}$. This set of eigenvalues encapsulates many aspects of the structure of the graph, including the extent to which the graph possesses community structure at multiple scales. We study the problem of approximating the spectrum, $\lambda =({\lambda}_{1},\cdots ,{\lambda}_{\left|V\right|})$, of $G$ in the regime where the graph is too large to explicitly calculate the spectrum. We present a sublinear time algorithm that, given the ability to query a random node in the graph and select a random neighbor of a given node, computes a succinct representation of an approximation $\tilde{\lambda}=({\tilde{\lambda}}_{1},\cdots ,{\tilde{\lambda}}_{\left|V\right|})$ such that $\|\tilde{\lambda}-\lambda {\|}_{1}\le \epsilon \left|V\right|$. Our algorithm has query complexity and running time $\mathrm{exp}(O(1/\epsilon))$, which is independent of the size of the graph, $\left|V\right|$. We demonstrate the practical viability of our algorithm on synthetically generated graphs, and on 15 different real-world graphs from the Stanford Large Network Dataset Collection, including social networks, academic collaboration graphs, and road networks. For the smallest of these graphs, we are able to validate the accuracy of our algorithm by explicitly calculating the true spectrum; for the larger graphs, such a calculation is computationally prohibitive. The spectra of these real-world networks reveal insights into the structural similarities and differences between them, illustrating the potential value of our algorithm for efficiently approximating the spectrum of large networks [29].
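A readable (but not sublinear) way to approach the spectrum is through its moments $\mathrm{tr}(L^{k})/|V|$, which the algorithm of [29] estimates from random walks. The sketch below instead estimates them with explicit matrix-vector products and Hutchinson trace probes, purely as an illustration of the quantities involved; it is our own baseline, not the paper's algorithm.

```python
import numpy as np

def laplacian_moments(adj, k_max, n_probes=500, seed=0):
    """Estimate the spectral moments tr(L^k)/|V|, k = 1..k_max, of the
    normalized Laplacian L = I - D^{-1/2} A D^{-1/2}, using Hutchinson
    trace probes: E[z^T M z] = tr(M) for random +/-1 vectors z."""
    deg = adj.sum(axis=1)
    dinv = 1.0 / np.sqrt(deg)
    L = np.eye(len(adj)) - dinv[:, None] * adj * dinv[None, :]
    n = len(adj)
    rng = np.random.default_rng(seed)
    z = rng.choice([-1.0, 1.0], size=(n, n_probes))
    moments, v = [], z
    for _ in range(k_max):
        v = L @ v                                  # v = L^k z, column by column
        moments.append((z * v).sum() / (n_probes * n))
    return moments
```

A distribution matching these moments is then a proxy for the spectrum itself.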

#### Spectral Properties of Radial Kernels and Clustering in High Dimensions

Participants : David Cohen-Steiner, Alba Chiara de Vitis.

In this paper [40], we study the spectrum and the eigenvectors of radial kernels for mixtures of distributions in ${\mathbb{R}}^{n}$. Our approach focuses on high dimensions and relies solely on the concentration properties of the components in the mixture. We give several results describing the structure of kernel matrices for a sample drawn from such a mixture. Based on these results, we analyze the ability of kernel PCA to cluster high dimensional mixtures. In particular, we exhibit a specific kernel leading to a simple spectral algorithm for clustering mixtures with possibly common means but different covariance matrices. This algorithm succeeds if the angle between any two covariance matrices in the mixture (seen as vectors in ${\mathbb{R}}^{{n}^{2}}$) is larger than $\Omega \left({n}^{-1/6}{\log}^{5/3}n\right)$. In particular, the required angular separation tends to 0 as the dimension tends to infinity. To the best of our knowledge, this is the first polynomial time algorithm for clustering such mixtures beyond the Gaussian case.
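As a toy illustration of kernel-based spectral clustering (ours; the paper's algorithm uses a specifically designed kernel and handles mixtures that differ only in covariance), the following sketch clusters a two-component mixture by the sign of the top eigenvector of the centered Gaussian kernel matrix.

```python
import numpy as np

def kernel_cluster_2way(X, sigma):
    """Two-cluster kernel PCA sketch: build a Gaussian (radial) kernel
    matrix, double-center it, and split the points by the sign of the top
    eigenvector. Works when the kernel matrix is close to block-constant
    on the two components."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    K = np.exp(-d2 / (2 * sigma ** 2))
    n = len(X)
    H = np.eye(n) - np.ones((n, n)) / n    # centering matrix
    Kc = H @ K @ H
    w, V = np.linalg.eigh(Kc)              # eigenvalues in ascending order
    return V[:, -1] > 0                    # sign split on the top eigenvector
```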

#### Exact computation of the matching distance on 2-parameter persistence modules

Participant : Steve Oudot.

In collaboration with Michael Kerber (T.U. Graz) and Michael Lesnick (SUNY).

The matching distance is a pseudometric on multi-parameter persistence modules, defined in terms of the weighted bottleneck distance on the restriction of the modules to affine lines. It is known that this distance is stable in a reasonable sense, and can be efficiently approximated, which makes it a promising tool for practical applications. In [44] we show that in the 2-parameter setting, the matching distance can be computed exactly in polynomial time. Our approach subdivides the space of affine lines into regions, via a line arrangement. In each region, the matching distance restricts to a simple analytic function, whose maximum is easily computed. As a byproduct, our analysis establishes that the matching distance is a rational number, if the bigrades of the input modules are rational.

#### A Comparison Framework for Interleaved Persistence Modules

Participant : Miroslav Kramár.

In collaboration with Rachel Levanger (UPenn), Shaun Harker and Konstantin Mischaikow (Rutgers).

In [43], we present a generalization of the induced matching theorem of [1] and use it to prove a generalization of the algebraic stability theorem for $\mathbb{R}$-indexed pointwise finite-dimensional persistence modules. Via numerous examples, we show how the generalized algebraic stability theorem enables the computation of rigorous error bounds in the space of persistence diagrams that go beyond the typical formulation in terms of bottleneck (or log bottleneck) distance.

#### Discrete Morse Theory for Computing Zigzag Persistence

Participant : Clément Maria.

In collaboration with Hannah Schreiber (Graz University of Technology, Austria)

We introduce a framework to simplify zigzag filtrations of general complexes using discrete Morse theory, in order to accelerate the computation of zigzag persistence. Zigzag persistence is a powerful algebraic generalization of persistent homology. However, its computation is much slower in practice, and the usual optimization techniques cannot be used to compute it. Our approach is different in that it preprocesses the filtration before computation. Using discrete Morse theory, we get a much smaller zigzag filtration with the same persistence. The new filtration contains general complexes. We introduce new update procedures to modify on the fly the algebraic data (the zigzag persistence matrix) under the new combinatorial changes induced by the Morse reduction. Our approach is significantly faster in practice [45].