Section: New Results

Distributed optimization for machine learning

37. Optimal Convergence Rates for Convex Distributed Optimization in Networks [17] This work provides a theoretical analysis of distributed optimization of convex functions using a network of computing units. We investigate this problem under two communication schemes (centralized and decentralized) and four classical regularity assumptions: Lipschitz continuity, strong convexity, smoothness, and a combination of strong convexity and smoothness. Under the decentralized communication scheme, we provide matching upper and lower complexity bounds, along with algorithms achieving this rate up to logarithmic constants. For non-smooth objective functions, while the dominant term of the error is in $O\left(1/\sqrt{t}\right)$, the structure of the communication network only impacts a second-order term in $O\left(1/t\right)$, where $t$ is time. In other words, the error due to limits in communication resources decreases at a fast rate even in the case of non-strongly convex objective functions. This convergence rate is achieved by the novel multi-step primal-dual (MSPD) algorithm. Under the centralized communication scheme, we show that the naive distribution of standard optimization algorithms is optimal for smooth objective functions. For non-smooth functions, we provide a simple yet efficient algorithm called distributed randomized smoothing (DRS), based on a local smoothing of the objective function. We then show that DRS is within a ${d}^{1/4}$ multiplicative factor of the optimal convergence rate, where $d$ is the underlying dimension.
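
To make the idea of smoothing a non-smooth objective concrete, the following is a minimal sketch of a randomized-smoothing gradient estimate. It is not the DRS algorithm of [17]. It assumes Gaussian perturbations, that is, a smoothed function $f_\gamma(x) = \mathbb{E}\left[f(x + \gamma Z)\right]$ with $Z$ standard normal, and uses $f(x) = \sum_k |x_k|$ as a stand-in objective. The names `l1_subgradient` and `smoothed_gradient`, the parameter `gamma` and the sample count are hypothetical choices made for illustration.

```python
import math
import random


def l1_subgradient(x):
    """A subgradient of the non-smooth function f(x) = sum_k |x_k|."""
    return [float((v > 0) - (v < 0)) for v in x]


def smoothed_gradient(subgradient, x, gamma, num_samples, rng):
    """Monte Carlo estimate of the gradient of
    f_gamma(x) = E[f(x + gamma * Z)], with Z ~ N(0, I) (assumed smoothing).
    """
    total = [0.0] * len(x)
    for _ in range(num_samples):
        # Evaluate a subgradient at a randomly perturbed point.
        perturbed = [v + gamma * rng.gauss(0.0, 1.0) for v in x]
        for k, g in enumerate(subgradient(perturbed)):
            total[k] += g
    # Averaging the sampled subgradients estimates the smoothed gradient.
    return [t / num_samples for t in total]


rng = random.Random(0)
estimate = smoothed_gradient(
    l1_subgradient, [0.5, 0.0], gamma=0.5, num_samples=4000, rng=rng
)
# For f(x) = |x|, the smoothed derivative has the closed form
# erf(x / (gamma * sqrt(2))), which serves as a sanity check.
exact = [math.erf(0.5 / (0.5 * math.sqrt(2.0))), 0.0]
```

In a centralized setting, one could imagine each computing unit drawing its own perturbations and a master node averaging the results. That arrangement is an assumption of this illustration rather than a description of DRS.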

38. Accelerated Decentralized Optimization with Local Updates for Smooth and Strongly Convex Objectives [31] In this paper, we study the problem of minimizing a sum of smooth and strongly convex functions split over the nodes of a network in a decentralized fashion. We propose ESDACD, a decentralized accelerated algorithm that only requires local synchrony. Its rate depends on the condition number $\kappa$ of the local functions as well as on the network topology and delays. Under mild assumptions on the topology of the graph, ESDACD takes time $O\left(\left({\tau }_{\max}+{\Delta }_{\max}\right)\sqrt{\kappa /\gamma }\,\ln\left({\epsilon}^{-1}\right)\right)$ to reach a precision $\epsilon$, where $\gamma$ is the spectral gap of the graph, ${\tau }_{\max}$ the maximum communication delay and ${\Delta }_{\max}$ the maximum computation time. It therefore matches the rate of SSDA, which is optimal when ${\tau }_{\max}=\Omega \left({\Delta }_{\max}\right)$. Applying ESDACD to quadratic local functions leads to an accelerated randomized gossip algorithm of rate $O\left(\sqrt{{\theta }_{\mathrm{gossip}}/n}\right)$, where ${\theta }_{\mathrm{gossip}}$ is the rate of the standard randomized gossip. To the best of our knowledge, it is the first asynchronous gossip algorithm with a provably improved rate of convergence of the second moment of the error. We illustrate these results with experiments in idealized settings.
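
For reference, the standard (non-accelerated) randomized gossip that serves as the baseline above can be sketched as follows. This is the classical pairwise-averaging scheme, not ESDACD or its accelerated gossip variant. The ring topology, the number of steps and the names `randomized_gossip` and `squared_error` are assumptions chosen for illustration.

```python
import random


def randomized_gossip(values, edges, num_steps, seed=0):
    """Standard randomized gossip: at each step, one edge chosen uniformly
    at random is activated and its two endpoints average their values."""
    rng = random.Random(seed)
    x = list(values)
    for _ in range(num_steps):
        i, j = rng.choice(edges)
        avg = 0.5 * (x[i] + x[j])
        x[i] = x[j] = avg  # pairwise averaging preserves the global sum
    return x


def squared_error(x):
    """Squared distance of the node values to their network-wide average."""
    mean = sum(x) / len(x)
    return sum((v - mean) ** 2 for v in x)


# Hypothetical example: a ring of 8 nodes holding the values 0, 1, ..., 7.
n = 8
ring_edges = [(i, (i + 1) % n) for i in range(n)]
initial = [float(i) for i in range(n)]
final = randomized_gossip(initial, ring_edges, num_steps=2000)
```

Each activation only involves two neighbouring nodes, so no global synchronization is needed. The quantity tracked by `squared_error` corresponds to the second moment of the error discussed above.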

39. An Accelerated Decentralized Stochastic Proximal Algorithm for Finite Sums [49] Modern large-scale finite-sum optimization relies on two key aspects: distribution and stochastic updates. For smooth and strongly convex problems, existing decentralized algorithms are slower than modern accelerated variance-reduced stochastic algorithms when run on a single machine, and are therefore not efficient. Centralized algorithms are fast, but their scaling is limited by global aggregation steps that result in communication bottlenecks. In this work, we propose an efficient Accelerated, Decentralized stochastic algorithm for Finite Sums named ADFS, which uses local stochastic proximal updates and randomized pairwise communications between nodes. On $n$ machines, ADFS learns from $nm$ samples in the same time it takes optimal algorithms to learn from $m$ samples on one machine. This scaling holds until a critical network size is reached, which depends on communication delays, on the number of samples $m$, and on the network topology. We provide a theoretical analysis based on a novel augmented graph approach combined with a precise evaluation of synchronization times and an extension of the accelerated proximal coordinate gradient algorithm to arbitrary sampling. We illustrate the improvement of ADFS over state-of-the-art decentralized approaches with experiments.