On Averaging and Extrapolation for Gradient Descent

This work considers the effect of averaging, and more generally extrapolation, of the iterates of gradient descent in smooth convex optimization. After running the method, rather than reporting the final iterate, one can report either a convex combination of the iterates (averaging) or a generic combination of the iterates (extrapolation). For several common stepsize sequences, … Read more
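As a toy illustration of the distinction (a minimal sketch, not the schemes analyzed in the paper): after running gradient descent on a smooth convex quadratic, one can report the last iterate, the uniform average of the iterates (a convex combination), or a generic weighted combination whose weights may be negative (an extrapolation). The weights below are purely illustrative.

```python
# Minimal sketch: last iterate vs. averaging vs. a generic (extrapolated) combination.
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((20, 5))
b = rng.standard_normal(20)
f = lambda x: 0.5 * np.linalg.norm(A @ x - b) ** 2
grad = lambda x: A.T @ (A @ x - b)
L = np.linalg.norm(A, 2) ** 2          # smoothness constant of f

x = np.zeros(5)
iterates = [x.copy()]
for _ in range(50):
    x = x - (1.0 / L) * grad(x)        # plain gradient descent
    iterates.append(x.copy())
X = np.array(iterates)                  # shape (T+1, n)

last = X[-1]
average = X.mean(axis=0)                # convex combination: averaging
# Illustrative extrapolation weights (hypothetical): sum to 1 but need not be nonnegative.
w = np.linspace(-1.0, 3.0, len(X))
w /= w.sum()
extrapolated = w @ X

print(f(last), f(average), f(extrapolated))
```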

Accelerated Gradient Descent via Long Steps

Recently, Grimmer [1] showed that, for smooth convex optimization, periodically utilizing longer steps improves gradient descent’s state-of-the-art O(1/T) convergence guarantee by constant factors, and conjectured that an accelerated rate strictly faster than O(1/T) might be possible. Here we prove such a big-O gain, establishing gradient descent’s first accelerated convergence rate in this setting. Namely, we … Read more

Provably Faster Gradient Descent via Long Steps

This work establishes provably faster convergence rates for gradient descent in smooth convex optimization via a computer-assisted analysis technique. Our theory allows nonconstant stepsize policies with frequent long steps that may violate descent, by analyzing the overall effect of many iterations at once rather than the one-iteration inductions typical of most first-order method analyses. We … Read more
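For orientation only, here is a sketch of gradient descent with a nonconstant stepsize policy that takes a periodic long step exceeding the classical 2/L descent threshold, so an individual long step may increase the objective. The specific numbers are an assumption for illustration and are not the certified stepsize pattern from the paper.

```python
# Illustrative only: a nonconstant stepsize policy with a periodic long step.
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((30, 8))
b = rng.standard_normal(30)
f = lambda x: 0.5 * np.linalg.norm(A @ x - b) ** 2
grad = lambda x: A.T @ (A @ x - b)
L = np.linalg.norm(A, 2) ** 2           # smoothness constant of f

x = np.zeros(8)
for k in range(64):
    # every 8th step uses 3/L (beyond the 2/L descent threshold), otherwise 1.5/L
    h = 3.0 / L if (k + 1) % 8 == 0 else 1.5 / L
    x = x - h * grad(x)                 # f need not decrease on the long steps
print(f(x))
```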

Preconditioned Gradient Descent for Overparameterized Nonconvex Burer–Monteiro Factorization with Global Optimality Certification

We consider using gradient descent to minimize the nonconvex function $f(X)=\phi(XX^{T})$ over an $n\times r$ factor matrix $X$, in which $\phi$ is an underlying smooth convex cost function defined over $n\times n$ matrices. While only a second-order stationary point $X$ can be provably found in reasonable time, if $X$ is additionally \emph{rank deficient}, then its … Read more
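A minimal sketch of the setup, using plain (unpreconditioned) gradient descent and the illustrative choice $\phi(M)=\tfrac12\|M-A\|_F^2$, for which $\nabla f(X) = 2(XX^{T}-A)X$; the stepsize and this particular $\phi$ are assumptions, and the paper's preconditioned update is not reproduced here.

```python
# Sketch: plain gradient descent on f(X) = phi(X X^T) with phi(M) = 0.5*||M - A||_F^2.
import numpy as np

rng = np.random.default_rng(2)
n, r_true, r = 30, 3, 6                 # overparameterized factor: r > rank(A)
B = rng.standard_normal((n, r_true))
A = B @ B.T                             # ground-truth PSD matrix of rank 3

def grad_f(X):
    # grad of phi at M = X X^T is (X X^T - A), which is symmetric here,
    # so grad f(X) = 2 * (X X^T - A) X
    return 2.0 * (X @ X.T - A) @ X

X = 0.1 * rng.standard_normal((n, r))   # small random initialization
eta = 1e-3                              # small constant stepsize (assumed)
for _ in range(2000):
    X = X - eta * grad_f(X)

print(np.linalg.norm(X @ X.T - A, "fro"))
```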

Survey Descent: A Multipoint Generalization of Gradient Descent for Nonsmooth Optimization

For strongly convex objectives that are smooth, the classical theory of gradient descent ensures linear convergence relative to the number of gradient evaluations. An analogous nonsmooth theory is challenging. Even when the objective is smooth at every iterate, the corresponding local models are unstable and the number of cutting planes invoked by traditional remedies is … Read more

Adaptive Gradient Descent without Descent

We present a strikingly simple proof that two rules are sufficient to automate gradient descent: 1) don’t increase the stepsize too fast and 2) don’t overstep the local curvature. No need for functional values, no line search, no information about the function except for the gradients. By following these rules, you get a method adaptive … Read more
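A sketch of an adaptive stepsize in the spirit of those two rules, using only consecutive iterates and gradients; the specific formulas below are an assumption for illustration, not necessarily the paper's exact rule.

```python
# Sketch: gradient descent whose stepsize (1) is not allowed to grow too fast and
# (2) is capped by a local inverse-curvature estimate from two consecutive gradients.
import numpy as np

def adaptive_gd(grad, x0, iters=200, lam0=1e-6):
    x_prev, g_prev = x0, grad(x0)
    x = x_prev - lam0 * g_prev                      # tiny bootstrap step
    lam_prev, theta = lam0, np.inf
    for _ in range(iters):
        g = grad(x)
        denom = 2.0 * np.linalg.norm(g - g_prev)
        curv_cap = np.linalg.norm(x - x_prev) / denom if denom > 0 else np.inf
        growth_cap = np.sqrt(1.0 + theta) * lam_prev    # rule 1: limited growth
        lam = min(growth_cap, curv_cap)                 # rule 2: respect local curvature
        theta = lam / lam_prev
        x_prev, g_prev, lam_prev = x, g, lam
        x = x - lam * g
    return x

# usage on a toy smooth convex problem
rng = np.random.default_rng(3)
A = rng.standard_normal((40, 10))
b = rng.standard_normal(40)
sol = adaptive_gd(lambda x: A.T @ (A @ x - b), np.zeros(10))
print(np.linalg.norm(A.T @ (A @ sol - b)))          # gradient norm at the output
```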

Complexity of gradient descent for multiobjective optimization

A number of first-order methods have been proposed for smooth multiobjective optimization, for which some form of convergence to first-order criticality has been proved. Such convergence is global in the sense of being independent of the starting point. In this paper we analyze the rate of convergence of gradient descent for smooth unconstrained multiobjective … Read more
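For context, one standard way to run gradient descent on several smooth objectives at once is to step along the minimum-norm convex combination of the gradients (the multiobjective steepest descent direction). The sketch below specializes this to two objectives with an assumed fixed stepsize and does not reproduce the paper's analysis.

```python
# Sketch: common descent direction for two smooth objectives via the
# minimum-norm convex combination of their gradients.
import numpy as np

def common_descent_direction(g1, g2):
    """Return -argmin over l in [0,1] of ||l*g1 + (1-l)*g2||, a shared descent direction."""
    diff = g1 - g2
    denom = diff @ diff
    lam = 0.5 if denom == 0 else np.clip(-(diff @ g2) / denom, 0.0, 1.0)
    return -(lam * g1 + (1.0 - lam) * g2)

# toy bi-objective problem: f1(x) = ||x - a||^2, f2(x) = ||x - c||^2
a, c = np.array([1.0, 0.0]), np.array([0.0, 1.0])
x = np.array([2.0, 2.0])
for _ in range(100):
    d = common_descent_direction(2 * (x - a), 2 * (x - c))
    if np.linalg.norm(d) < 1e-8:        # (approximately) Pareto-critical point reached
        break
    x = x + 0.1 * d                      # fixed stepsize (assumed)
print(x)                                 # a point on the segment between a and c
```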

New analysis of linear convergence of gradient-type methods via unifying error bound conditions

The subject of linear convergence of gradient-type methods on non-strongly convex optimization has been widely studied by introducing several notions as sufficient conditions. Influential examples include the error bound property, the restricted strongly convex property, the quadratic growth property, and the Kurdyka-{\L}ojasiewicz property. In this paper, we first define a group of error bound conditions … Read more
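For orientation, in the smooth unconstrained setting (with minimum value $f^*$ attained on a solution set $X^*$, parameters $\tau,\mu>0$, and possibly only on a sublevel set) these conditions are commonly written as below; the paper's unified error bound conditions may be stated more generally.

```latex
% Standard forms of three of the conditions named above (orientation only).
\begin{align*}
\text{Error bound:} \quad & \operatorname{dist}(x, X^*) \le \tau\,\|\nabla f(x)\|, \\
\text{Quadratic growth:} \quad & f(x) - f^* \ge \tfrac{\mu}{2}\,\operatorname{dist}(x, X^*)^2, \\
\text{Polyak--{\L}ojasiewicz (PL):} \quad & \tfrac{1}{2}\|\nabla f(x)\|^2 \ge \mu\,\bigl(f(x) - f^*\bigr).
\end{align*}
```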

Linear Convergence of Gradient and Proximal-Gradient Methods Under the Polyak-Lojasiewicz Condition

In 1963, Polyak proposed a simple condition that is sufficient to show a global linear convergence rate for gradient descent. This condition is a special case of the Lojasiewicz inequality proposed in the same year, and it does not require strong convexity (or even convexity). In this work, we show that this much-older Polyak-Lojasiewicz (PL) … Read more
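The standard one-step argument behind such a linear rate, for an $L$-smooth $f$ satisfying the PL inequality $\tfrac12\|\nabla f(x)\|^2 \ge \mu\,(f(x)-f^*)$ and gradient descent $x_{k+1} = x_k - \tfrac{1}{L}\nabla f(x_k)$, is:

```latex
\begin{align*}
f(x_{k+1}) &\le f(x_k) - \tfrac{1}{2L}\|\nabla f(x_k)\|^2
  && \text{(descent lemma with stepsize } 1/L\text{)}\\
&\le f(x_k) - \tfrac{\mu}{L}\bigl(f(x_k) - f^*\bigr)
  && \text{(PL inequality)}\\
\Longrightarrow\quad f(x_{k+1}) - f^* &\le \Bigl(1 - \tfrac{\mu}{L}\Bigr)\bigl(f(x_k) - f^*\bigr).
\end{align*}
```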