A unified convergence theory for adaptive first-order methods in the nonconvex case, including AdaNorm, full and diagonal AdaGrad, Shampoo and Muon

A unified framework is proposed for first-order algorithms in nonconvex unconstrained optimization that use adaptively preconditioned gradients; it includes popular methods such as full and diagonal AdaGrad and AdaNorm, as well as adaptive variants of Shampoo and Muon. The framework also allows combining heterogeneous geometries across different groups of variables while preserving a unified convergence … Read more
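To make the shared template concrete, here is a minimal Python sketch of two members of this family: a diagonal-AdaGrad-style step and a scalar, AdaNorm-style step. The function names and constants are illustrative assumptions and do not reproduce the paper's framework or its convergence guarantees.

```python
import numpy as np

def adagrad_diag_step(x, grad, accum, lr=1e-2, eps=1e-8):
    """One diagonal-AdaGrad-style preconditioned step (illustrative only).

    accum holds the running sum of squared gradients; the preconditioner
    is its elementwise inverse square root.
    """
    accum += grad ** 2
    x = x - lr * grad / (np.sqrt(accum) + eps)
    return x, accum

def adanorm_step(x, grad, norm_accum, lr=1e-2, eps=1e-8):
    """A norm-based (AdaNorm-style) variant: a single scalar preconditioner
    built from the accumulated squared gradient norms."""
    norm_accum += np.linalg.norm(grad) ** 2
    x = x - lr * grad / (np.sqrt(norm_accum) + eps)
    return x, norm_accum
```

Both steps instantiate the same update template x_{k+1} = x_k - lr * P_k^{-1} grad_k; only the choice of preconditioner P_k (diagonal versus scalar) differs.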

New Results on the Polyak Stepsize: Tight Convergence Analysis and Universal Function Classes

In this paper, we revisit a classical adaptive stepsize strategy for gradient descent: the Polyak stepsize (PolyakGD), originally proposed in Polyak (1969). We study the convergence behavior of PolyakGD from two perspectives: tight worst-case analysis and universality across function classes. As our first main result, we establish the tightness of the known convergence rates of … Read more
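For reference, the classical Polyak stepsize sets t_k = (f(x_k) - f*) / ||grad f(x_k)||^2, which requires knowing the optimal value f*. Below is a minimal sketch of PolyakGD under that assumption, using a gradient oracle; it illustrates the method only and does not reproduce the paper's analysis.

```python
import numpy as np

def polyak_gd(f, grad_f, x0, f_star, iters=100):
    """Gradient descent with the classical Polyak stepsize
    t_k = (f(x_k) - f*) / ||grad f(x_k)||^2 (requires knowing f*)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(iters):
        g = grad_f(x)
        gnorm2 = np.dot(g, g)
        if gnorm2 == 0.0:          # stationary point reached
            break
        t = (f(x) - f_star) / gnorm2
        x = x - t * g
    return x

# Example: minimize the quadratic f(x) = 0.5 ||x||^2 with known f* = 0.
# x_hat = polyak_gd(lambda x: 0.5 * x @ x, lambda x: x, np.ones(5), f_star=0.0)
```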

Gradient Methods with Online Scaling Part I. Theoretical Foundations

This paper establishes the theoretical foundations of online scaled gradient methods (OSGM), a framework that uses online learning to adapt stepsizes and provably accelerates first-order methods. OSGM quantifies the effectiveness of a stepsize by a feedback function derived from a convergence measure and uses the feedback to adjust the stepsize through an online learning … Read more
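As a toy illustration of the idea (not the paper's OSGM algorithm), the sketch below treats a scalar stepsize as an online decision and adjusts it with a hypergradient of a simple feedback signal, the next-iterate objective value. All names, the feedback choice, and the parameters are assumptions for illustration.

```python
import numpy as np

def osgm_like(f, grad_f, x0, p0=1e-3, meta_lr=1e-4, iters=200):
    """Toy sketch in the spirit of OSGM: the scalar stepsize p is updated by
    an online (hyper)gradient step on the feedback f(x - p * grad).
    Illustrative only; the paper's feedback functions and guarantees differ."""
    x = np.asarray(x0, dtype=float)
    p = p0
    for _ in range(iters):
        g = grad_f(x)
        x_next = x - p * g
        # d/dp f(x - p g) = -grad_f(x - p g) . g   (feedback gradient in p)
        hypergrad = -np.dot(grad_f(x_next), g)
        p = max(p - meta_lr * hypergrad, 1e-12)    # online stepsize update
        x = x_next
    return x, p
```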

prunAdag: an adaptive pruning-aware gradient method

A pruning-aware adaptive gradient method is proposed that classifies the variables into two sets before updating them with different strategies. This technique extends the “relevant/irrelevant” approach of Ding (2019) and Zimmer et al. (2022) and allows a posteriori sparsification of the solution of model parameter fitting problems. The new method is proved to be convergent … Read more
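The sketch below illustrates one plausible reading of such a relevant/irrelevant split, assuming a magnitude-based classification and a diagonal-AdaGrad-style step. It is not the paper's prunAdag update rule; the classification criterion, shrinkage factor, and other parameters are hypothetical.

```python
import numpy as np

def prunadag_like_step(x, grad, accum, lr=1e-2, eps=1e-8,
                       rel_frac=0.5, shrink=0.9):
    """Illustrative pruning-aware split update (assumed, not prunAdag itself):
    the largest-magnitude variables are treated as "relevant" and take a
    diagonal-AdaGrad-style step; the remaining "irrelevant" variables are
    additionally shrunk toward zero to ease a posteriori sparsification."""
    accum += grad ** 2
    step = lr * grad / (np.sqrt(accum) + eps)
    k = max(1, int(rel_frac * x.size))
    relevant = np.zeros(x.shape, dtype=bool)
    relevant[np.argsort(np.abs(x))[-k:]] = True
    x = np.where(relevant, x - step, shrink * (x - step))
    return x, accum
```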