Faster Convergence of Stochastic Accelerated Gradient Descent under Interpolation

We prove new convergence rates for a generalized version of stochastic Nesterov acceleration under interpolation conditions. Unlike previous analyses, our approach accelerates any stochastic gradient method which makes sufficient progress in expectation. The proof, which proceeds using the estimating sequences framework, applies to both convex and strongly convex functions and is easily specialized to accelerated … Read more

Greedy Newton: Newton’s Method with Exact Line Search

A defining characteristic of Newton’s method is local superlinear convergence within a neighbourhood of a strict local minimum. However, outside this neighborhood Newton’s method can converge slowly or even diverge. A common approach to dealing with non-convergence is using a step size that is set by an Armijo backtracking line search. With suitable initialization the … Read more

Searching for Optimal Per-Coordinate Step-sizes with Multidimensional Backtracking

The backtracking line-search is an effective technique to automatically tune the step-size in smooth optimization. It guarantees similar performance to using the theoretically optimal step-size. Many approaches have been developed to instead tune per-coordinate step-sizes, also known as diagonal preconditioners, but none of the existing methods are provably competitive with the optimal per-coordinate stepsizes. We … Read more

Are we there yet? Manifold identification of gradient-related proximal methods

In machine learning, models that generalize better often generate outputs that lie on a low-dimensional manifold. Recently, several works have separately shown finite-time manifold identification by some proximal methods. In this work we provide a unified view by giving a simple condition under which any proximal method using a constant step size can achieve finite-iteration … Read more

Fast and Faster Convergence of SGD for Over-Parameterized Models and an Accelerated Perceptron

Modern machine learning focuses on highly expressive models that are able to fit or interpolate the data completely, resulting in zero training loss. For such models, we show that the stochastic gradients of common loss functions satisfy a strong growth condition. Under this condition, we prove that constant step-size stochastic gradient descent (SGD) with Nesterov … Read more

Let’s Make Block Coordinate Descent Go Fast: Faster Greedy Rules, Message-Passing, Active-Set Complexity, and Superlinear Convergence

Block coordinate descent (BCD) methods are widely-used for large-scale numerical optimization because of their cheap iteration costs, low memory requirements, amenability to parallelization, and ability to exploit problem structure. Three main algorithmic choices influence the performance of BCD methods: the block partitioning strategy, the block selection rule, and the block update rule. In this paper … Read more

”Active-set complexity” of proximal gradient: How long does it take to find the sparsity pattern?

Proximal gradient methods have been found to be highly effective for solving minimization problems with non-negative constraints or L1-regularization. Under suitable nondegeneracy conditions, it is known that these algorithms identify the optimal sparsity pattern for these types of problems in a finite number of iterations. However, it is not known how many iterations this may … Read more

Linear Convergence of Gradient and Proximal-Gradient Methods Under the Polyak-Lojasiewicz Condition

In 1963, Polyak proposed a simple condition that is sufficient to show a global linear convergence rate for gradient descent. This condition is a special case of the Lojasiewicz inequality proposed in the same year, and it does not require strong convexity (or even convexity). In this work, we show that this much-older Polyak-Lojasiewicz (PL) … Read more

A Stochastic Gradient Method with an Exponential Convergence Rate for Strongly-Convex Optimization with Finite Training Sets

We propose a new stochastic gradient method for optimizing the sum of a finite set of smooth functions, where the sum is strongly convex. While standard stochastic gradient methods converge at sublinear rates for this problem, the proposed method incorporates a memory of previous gradient values in order to achieve a linear convergence rate. Numerical … Read more

Group sparsity via linear-time projection

We present an efficient spectral projected-gradient algorithm for optimization subject to a group one-norm constraint. Our approach is based on a novel linear-time algorithm for Euclidean projection onto the one- and group one-norm constraints. Numerical experiments on large data sets suggest that the proposed method is substantially more efficient and scalable than existing methods. CitationTechnical … Read more