Adaptive Accelerated Gradient Converging Methods under Hölderian Error Bound Condition

In this paper, we focus our study on the convergence of (proximal) gradient methods and accelerated (proximal) gradient methods for smooth (composite) optimization under a Hölderian error bound (HEB) condition. We first show that the proximal gradient (PG) method is automatically adaptive to HEB, while the accelerated proximal gradient (APG) method can be made adaptive to HEB by restarting, with an improved iteration complexity. In particular, PG and the adaptive APG method enjoy iteration complexities of $O(Lc^2\max(\frac{1}{\epsilon^{1-2\theta}}, \log(\frac{1}{\epsilon})))$ and $O(\sqrt{L}c\max(\frac{1}{\epsilon^{1/2-\theta}}, \log(\frac{1}{\epsilon})))$, respectively, for the convergence of the objective value, where $\theta\in(0,1/2]$ is the exponent and $c$ is the constant factor in the HEB condition. However, the number of iterations before restarting APG hinges on a possibly unknown parameter $c$. To address this issue, we propose adaptive gradient converging methods, which use the magnitude of the (proximal) gradient as a criterion for restart and termination. We develop adaptive accelerated gradient converging (adaAGC) methods for solving smooth (composite) optimization under the HEB condition with an explicit exponent $\theta$, and establish an iteration complexity of $O(\sqrt{L}c^{1/(2(1-\theta))}\max(\frac{1}{\epsilon^{(1-2\theta)/(2(1-\theta))}}, \log(\frac{1}{\epsilon})))$ for the convergence of the (proximal) gradient's norm. We also show that a slightly modified variant of PG has an iteration complexity of $O(Lc^{1/(1-\theta)}\max(\frac{1}{\epsilon^{(1-2\theta)/(1-\theta)}}, \log(\frac{1}{\epsilon})))$ for the convergence of the (proximal) gradient's norm. Furthermore, we demonstrate that these results have important implications and applications in machine learning: (i) if the considered objective function is coercive and semi-algebraic, PG's convergence speed is essentially $o(\frac{1}{t})$, where $t$ is the total number of iterations; (ii) if the objective function consists of an $\ell_1$, $\ell_\infty$, $\ell_{1,\infty}$, or Huber norm regularization and a convex smooth piecewise quadratic loss, which includes the least-squares loss, smoothed hinge loss, and Huber loss, the proposed adaAGC is parameter-free and enjoys faster linear convergence than PG without any additional assumptions (e.g., a restricted eigenvalue condition). Notably, for all the aforementioned problems, our linear convergence results are global rather than local. To the best of our knowledge, these improved results are the first of their kind.
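To make the restart-and-terminate-on-gradient-norm idea concrete, the following is a minimal Python sketch for a smooth (non-composite) objective. It assumes a user-supplied gradient oracle `grad_f` and smoothness constant `L` (both hypothetical names), and it shrinks a gradient-norm threshold geometrically to decide when to restart the momentum and when to stop; it is only an illustration of the criterion described above, not the paper's exact adaAGC procedure, which also covers the composite/proximal case and exploits the HEB exponent $\theta$.

```python
import numpy as np

def apg_with_gradient_restart(grad_f, L, x0, eps=1e-6, shrink=0.5, max_iters=10000):
    """Accelerated gradient descent restarted whenever the gradient norm
    drops below a geometrically shrinking threshold (illustrative sketch,
    not the paper's adaAGC algorithm)."""
    x = x0.copy()
    threshold = shrink * np.linalg.norm(grad_f(x0))   # first restart target
    total_iters = 0
    while total_iters < max_iters:
        # One APG (Nesterov) stage: run until the gradient norm falls
        # below the current threshold, then reset the momentum.
        y, x_prev, t = x.copy(), x.copy(), 1.0
        while total_iters < max_iters:
            g = grad_f(y)
            x_new = y - g / L                          # gradient step, step size 1/L
            t_new = (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0
            y = x_new + ((t - 1.0) / t_new) * (x_new - x_prev)
            x_prev, x, t = x_new, x_new, t_new
            total_iters += 1
            if np.linalg.norm(grad_f(x)) <= threshold:
                break                                  # stage goal met: restart momentum
        if threshold <= eps:
            return x, total_iters                      # gradient small enough: terminate
        threshold *= shrink                            # tighten target, start next stage
    return x, total_iters
```

The appeal of this criterion, as the abstract notes, is that the (proximal) gradient norm is directly computable at every iterate, whereas a restart schedule based on the objective gap would hinge on the typically unknown HEB constant $c$.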
