Accelerating Stochastic Sequential Quadratic Programming for Equality Constrained Optimization using Predictive Variance Reduction

In this paper, we propose a stochastic variance reduction method for solving equality constrained optimization problems. Specifically, we develop a method based on the sequential quadratic programming paradigm that utilizes gradient approximations via predictive variance reduction techniques. Under reasonable assumptions, we prove that a measure of first-order stationarity evaluated at the iterates generated by our …

Training Structured Neural Networks Through Manifold Identification and Variance Reduction

This paper proposes an algorithm, RMDA, for training neural networks (NNs) with a regularization term for promoting desired structures. RMDA does not incur computation additional to proximal SGD with momentum, and achieves variance reduction without requiring the objective function to be of the finite-sum form. Through the tool of manifold identification from nonlinear optimization, we …

Stochastic Variance-Reduced Prox-Linear Algorithms for Nonconvex Composite Optimization

We consider the problem of minimizing composite functions of the form $f(g(x))+h(x)$, where $f$ and $h$ are convex functions (which can be nonsmooth) and $g$ is a smooth vector mapping. In addition, we assume that $g$ is the average of a finite number of component mappings or the expectation over a family of random component mappings. We propose …

Optimization for Supervised Machine Learning: Randomized Algorithms for Data and Parameters

Many key problems in machine learning and data science are routinely modeled as optimization problems and solved via optimization algorithms. With the increase of the volume of data and the size and complexity of the statistical models used to formulate these often ill-conditioned optimization tasks, there is a need for new efficient algorithms able to …

Stochastic Variance-Reduced Prox-Linear Algorithms for Nonconvex Composite Optimization

We consider minimization of composite functions of the form $f(g(x))+h(x)$, where $f$ and $h$ are convex functions (which can be nonsmooth) and $g$ is a smooth vector mapping. In addition, we assume that $g$ is the average of a finite number of component mappings or the expectation over a family of random component mappings. We propose …

Variance Reduction of Stochastic Gradients Without Full Gradient Evaluation

A standard concept for reducing the variance of stochastic gradient approximations is based on occasional full gradient evaluations. In this paper an approach is considered that, while approximating a local minimizer of a sum of functions, also generates approximations of the gradient and the function values without relying on full …
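For context, the "standard concept" mentioned above can be illustrated with an SVRG-style loop, one well-known scheme that recomputes a full gradient at a reference point once per epoch. This is a minimal sketch of that baseline only, not of the method proposed in the paper; the function `svrg`, the callback `grad_i`, and the least-squares test problem are all hypothetical choices made for illustration.

```python
import numpy as np

def svrg(grad_i, x0, n, step, epochs, inner_steps, rng):
    """Minimize (1/n) * sum_i f_i(x) with an SVRG-style loop.

    grad_i(x, i) must return the gradient of the i-th component at x.
    """
    x_ref = np.asarray(x0, dtype=float).copy()
    for _ in range(epochs):
        # Periodic full gradient at the reference point (the expensive step).
        full_grad = sum(grad_i(x_ref, i) for i in range(n)) / n
        x = x_ref.copy()
        for _ in range(inner_steps):
            i = rng.integers(n)
            # Unbiased estimate of the full gradient at x; its variance
            # shrinks as x and x_ref both approach a minimizer.
            g = grad_i(x, i) - grad_i(x_ref, i) + full_grad
            x -= step * g
        x_ref = x
    return x_ref

# Illustrative problem (an assumption): f_i(x) = 0.5 * (a_i^T x - b_i)^2
rng = np.random.default_rng(0)
A = rng.standard_normal((50, 5))
x_true = rng.standard_normal(5)
b = A @ x_true

def grad_i(x, i):
    # Gradient of the i-th least-squares component.
    return (A[i] @ x - b[i]) * A[i]

x_hat = svrg(grad_i, np.zeros(5), n=50, step=0.01,
             epochs=50, inner_steps=200, rng=rng)
print(np.linalg.norm(x_hat - x_true))
```

The full-gradient pass at the start of each epoch is the cost that the approach described in the abstract aims to avoid.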

Inexact proximal stochastic second-order methods for nonconvex composite optimization

In this paper, we propose a framework of Inexact Proximal Stochastic Second-order (IPSS) methods for solving nonconvex optimization problems, whose objective function consists of an average of finitely many, possibly weakly, smooth functions and a convex but possibly nonsmooth function. At each iteration, IPSS inexactly solves a proximal subproblem constructed by using some positive …

SPIDER: Near-Optimal Non-Convex Optimization via Stochastic Path Integrated Differential Estimator

In this paper, we propose a new technique named Stochastic Path-Integrated Differential EstimatoR (SPIDER), which can be used to track many deterministic quantities of interest with significantly reduced computational cost. We apply SPIDER to two tasks, namely the stochastic first-order and zeroth-order methods. For the stochastic first-order method, combining SPIDER with normalized gradient descent, we propose …

A SMART Stochastic Algorithm for Nonconvex Optimization with Applications to Robust Machine Learning

Machine learning theory typically assumes that training data is unbiased and not adversarially generated. When real training data deviates from these assumptions, trained models make erroneous predictions, sometimes with disastrous effects. Robust losses, such as the Huber norm, are designed to mitigate the effects of such contaminated data, but they are limited to the regression …

Linear Convergence of Gradient and Proximal-Gradient Methods Under the Polyak-Lojasiewicz Condition

In 1963, Polyak proposed a simple condition that is sufficient to show a global linear convergence rate for gradient descent. This condition is a special case of the Lojasiewicz inequality proposed in the same year, and it does not require strong convexity (or even convexity). In this work, we show that this much-older Polyak-Lojasiewicz (PL) …
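For reference, the PL inequality is commonly stated as below, together with the linear rate it yields for gradient descent with step size $1/L$. The notation is an assumption introduced here rather than taken from the truncated abstract: $\mu > 0$ is the PL constant, $f^*$ the optimal value, and $L$ the Lipschitz constant of $\nabla f$.

```latex
% PL inequality: the squared gradient norm dominates the suboptimality gap
\frac{1}{2}\,\|\nabla f(x)\|^2 \;\ge\; \mu\,\bigl(f(x) - f^*\bigr)
\quad \text{for all } x.

% Gradient descent x_{k+1} = x_k - \tfrac{1}{L}\nabla f(x_k) then satisfies
f(x_k) - f^* \;\le\; \Bigl(1 - \frac{\mu}{L}\Bigr)^{k}\,\bigl(f(x_0) - f^*\bigr).
```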