Greedy Newton: Newton’s Method with Exact Line Search

A defining characteristic of Newton’s method is local superlinear convergence within a neighborhood of a strict local minimum. However, outside this neighborhood Newton’s method can converge slowly or even diverge. A common approach to dealing with non-convergence is to use a step size set by an Armijo backtracking line search. With suitable initialization the …
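
For concreteness, here is a minimal Python sketch of a Newton iteration whose step size is chosen by an exact line search along the Newton direction. The quadratic test problem and the use of scipy’s one-dimensional minimizer are illustrative choices, not details taken from the paper.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def newton_exact_line_search(f, grad, hess, x0, tol=1e-8, max_iter=100):
    """Newton's method where each step size is chosen by an exact
    1-D line search along the Newton direction (illustrative sketch)."""
    x = x0.astype(float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:
            break
        d = -np.linalg.solve(hess(x), g)           # Newton direction
        # "Greedy" step: minimize f exactly along the Newton direction
        alpha = minimize_scalar(lambda t: f(x + t * d)).x
        x = x + alpha * d
    return x

# Example: a strictly convex quadratic, for which a single step suffices
A = np.array([[3.0, 0.5], [0.5, 1.0]])
b = np.array([1.0, -2.0])
f = lambda x: 0.5 * x @ A @ x - b @ x
grad = lambda x: A @ x - b
hess = lambda x: A
print(newton_exact_line_search(f, grad, hess, np.zeros(2)))
```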

Searching for Optimal Per-Coordinate Step-sizes with Multidimensional Backtracking

The backtracking line-search is an effective technique to automatically tune the step-size in smooth optimization. It guarantees performance similar to that of the theoretically optimal step-size. Many approaches have been developed to instead tune per-coordinate step-sizes, also known as diagonal preconditioners, but none of the existing methods are provably competitive with the optimal per-coordinate step-sizes. We …
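
As a point of reference, the standard scalar Armijo backtracking line-search that the abstract builds on can be sketched as follows; the constants and the gradient-descent direction are conventional defaults, and this is not the paper’s multidimensional per-coordinate procedure.

```python
import numpy as np

def armijo_backtracking(f, grad, x, alpha0=1.0, c=1e-4, beta=0.5):
    """Standard Armijo backtracking for a single scalar step-size
    (the baseline that per-coordinate step-sizes generalize)."""
    g = grad(x)
    d = -g                      # gradient-descent direction
    alpha = alpha0
    fx = f(x)
    # Shrink alpha until the sufficient-decrease (Armijo) condition holds
    while f(x + alpha * d) > fx + c * alpha * (g @ d):
        alpha *= beta
    return alpha
```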

Are we there yet? Manifold identification of gradient-related proximal methods

In machine learning, models that generalize better often generate outputs that lie on a low-dimensional manifold. Recently, several works have separately shown finite-time manifold identification by some proximal methods. In this work we provide a unified view by giving a simple condition under which any proximal method using a constant step size can achieve finite-iteration …
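
For readers unfamiliar with this method class, the generic proximal-gradient iteration referred to above can be written as follows; the notation is standard rather than taken from the paper.

```latex
% Generic proximal-gradient iteration for \min_x f(x) + g(x),
% with f smooth and g possibly nonsmooth (standard notation).
\[
  \operatorname{prox}_{\alpha g}(z) \;=\; \arg\min_{x}\; \tfrac{1}{2\alpha}\|x - z\|^2 + g(x),
  \qquad
  x^{k+1} \;=\; \operatorname{prox}_{\alpha g}\!\bigl(x^k - \alpha \nabla f(x^k)\bigr).
\]
```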

Fast and Faster Convergence of SGD for Over-Parameterized Models and an Accelerated Perceptron

Modern machine learning focuses on highly expressive models that are able to fit or interpolate the data completely, resulting in zero training loss. For such models, we show that the stochastic gradients of common loss functions satisfy a strong growth condition. Under this condition, we prove that constant step-size stochastic gradient descent (SGD) with Nesterov …
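
One common statement of a strong growth condition for a finite-sum objective f(x) = (1/n) Σ_i f_i(x) is shown below; the exact constant and form used in the paper may differ.

```latex
% Strong growth condition: the individual gradients vanish wherever
% the full gradient does (interpolation), up to a constant rho.
\[
  \frac{1}{n}\sum_{i=1}^{n} \|\nabla f_i(x)\|^2 \;\le\; \rho\,\|\nabla f(x)\|^2
  \quad \text{for all } x .
\]
```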

Let’s Make Block Coordinate Descent Go Fast: Faster Greedy Rules, Message-Passing, Active-Set Complexity, and Superlinear Convergence

Block coordinate descent (BCD) methods are widely used for large-scale numerical optimization because of their cheap iteration costs, low memory requirements, amenability to parallelization, and ability to exploit problem structure. Three main algorithmic choices influence the performance of BCD methods: the block partitioning strategy, the block selection rule, and the block update rule. In this paper …
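
To illustrate one of the algorithmic choices mentioned above, here is a minimal sketch of single-coordinate descent with a greedy (Gauss-Southwell) selection rule; the per-coordinate Lipschitz constants and the simple update are simplifying assumptions, not the block variants studied in the paper.

```python
import numpy as np

def greedy_coordinate_descent(grad, x, lipschitz, n_iters=100):
    """Coordinate descent with the greedy Gauss-Southwell rule:
    at each iteration, update the coordinate with the largest
    absolute partial derivative (a simple instance of greedy
    block-selection; `lipschitz[j]` is an assumed per-coordinate
    Lipschitz constant)."""
    x = x.copy()
    for _ in range(n_iters):
        g = grad(x)
        j = np.argmax(np.abs(g))        # Gauss-Southwell selection
        x[j] -= g[j] / lipschitz[j]     # gradient step of size 1/L_j along coordinate j
    return x
```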

“Active-set complexity” of proximal gradient: How long does it take to find the sparsity pattern?

Proximal gradient methods have been found to be highly effective for solving minimization problems with non-negative constraints or L1-regularization. Under suitable nondegeneracy conditions, it is known that these algorithms identify the optimal sparsity pattern for these types of problems in a finite number of iterations. However, it is not known how many iterations this may …
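
A small sketch of the phenomenon in question: proximal gradient (ISTA) applied to an L1-regularized least-squares problem, recording the last iteration at which the sparsity pattern changes. The problem instance and function names are illustrative assumptions.

```python
import numpy as np

def ista_support_identification(A, b, lam, alpha, n_iters=500):
    """Proximal gradient (ISTA) for min_x 0.5*||Ax - b||^2 + lam*||x||_1,
    recording the last iteration at which the set of nonzero coordinates
    changed (after that point the sparsity pattern is 'identified').
    Illustrative sketch only."""
    x = np.zeros(A.shape[1])
    support = set()
    identified_at = None
    for k in range(n_iters):
        g = A.T @ (A @ x - b)                                       # gradient of smooth part
        z = x - alpha * g
        x = np.sign(z) * np.maximum(np.abs(z) - alpha * lam, 0.0)   # soft-thresholding prox
        new_support = set(np.flatnonzero(x))
        if new_support != support:
            support, identified_at = new_support, k
    return x, identified_at
```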

Linear Convergence of Gradient and Proximal-Gradient Methods Under the Polyak-Lojasiewicz Condition

In 1963, Polyak proposed a simple condition that is sufficient to show a global linear convergence rate for gradient descent. This condition is a special case of the Lojasiewicz inequality proposed in the same year, and it does not require strong convexity (or even convexity). In this work, we show that this much-older Polyak-Lojasiewicz (PL) …
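
For reference, the PL inequality and the resulting linear rate for gradient descent on an L-smooth function can be stated as follows (standard notation, step size 1/L).

```latex
% Polyak-Lojasiewicz (PL) inequality for f with minimum value f^*:
\[
  \tfrac{1}{2}\|\nabla f(x)\|^2 \;\ge\; \mu\,\bigl(f(x) - f^*\bigr)
  \quad \text{for all } x, \;\; \mu > 0 .
\]
% For L-smooth f, gradient descent with step size 1/L then satisfies
\[
  f(x^{k}) - f^* \;\le\; \Bigl(1 - \tfrac{\mu}{L}\Bigr)^{k}\bigl(f(x^{0}) - f^*\bigr).
\]
```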

A Stochastic Gradient Method with an Exponential Convergence Rate for Strongly-Convex Optimization with Finite Training Sets

We propose a new stochastic gradient method for optimizing the sum of a finite set of smooth functions, where the sum is strongly convex. While standard stochastic gradient methods converge at sublinear rates for this problem, the proposed method incorporates a memory of previous gradient values in order to achieve a linear convergence rate. Numerical …
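
A minimal sketch of the “memory of previous gradient values” idea, in the style of a stochastic average gradient update; the uniform sampling, constant step size, and function names are illustrative assumptions rather than the paper’s exact method or recommended parameters.

```python
import numpy as np

def sag_style(grad_i, n, dim, step, n_iters=1000, rng=None):
    """Stochastic gradient method with a memory of per-example gradients:
    each iteration refreshes one stored gradient and steps along the
    average of all stored gradients (illustrative sketch).
    `grad_i(x, i)` returns the gradient of the i-th function at x."""
    rng = rng or np.random.default_rng(0)
    x = np.zeros(dim)
    memory = np.zeros((n, dim))      # stored gradient for each example
    avg = np.zeros(dim)              # running average of stored gradients
    for _ in range(n_iters):
        i = rng.integers(n)
        g = grad_i(x, i)
        avg += (g - memory[i]) / n   # update the average in O(dim) time
        memory[i] = g
        x -= step * avg
    return x
```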

Group sparsity via linear-time projection

We present an efficient spectral projected-gradient algorithm for optimization subject to a group one-norm constraint. Our approach is based on a novel linear-time algorithm for Euclidean projection onto the one- and group one-norm constraints. Numerical experiments on large data sets suggest that the proposed method is substantially more efficient and scalable than existing methods.
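
To illustrate the kind of projection involved, below is the common sort-based O(n log n) algorithm for Euclidean projection onto the one-norm ball; it is not the linear-time algorithm developed in the paper, and it does not handle the group one-norm case.

```python
import numpy as np

def project_l1_ball(v, radius=1.0):
    """Euclidean projection of v onto {x : ||x||_1 <= radius}.
    Standard sort-based O(n log n) algorithm (not the paper's
    linear-time method); shown only to illustrate the projection."""
    u = np.abs(v)
    if u.sum() <= radius:
        return v.copy()
    s = np.sort(u)[::-1]                        # sorted magnitudes, descending
    cumsum = np.cumsum(s)
    rho = np.nonzero(s * np.arange(1, len(v) + 1) > (cumsum - radius))[0][-1]
    theta = (cumsum[rho] - radius) / (rho + 1)  # soft-threshold level
    return np.sign(v) * np.maximum(u - theta, 0.0)
```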