Gradient Descent using Duality Structures

Gradient descent is commonly used to solve optimization problems arising in machine learning, such as training neural networks. Although it seems to be effective for many different neural network training problems, it is unclear if the effectiveness of gradient descent can be explained using existing performance guarantees for the algorithm. We argue that existing analyses of gradient descent rely on assumptions that are too strong to be applicable in the case of multi-layer neural networks. To address this, we propose an algorithm, duality structure gradient descent (DSGD), that is amenable to a non-asymptotic performance analysis, under mild assumptions on the training set and network architecture. The algorithm can be viewed as a form of layer-wise coordinate descent, where at each iteration the algorithm chooses one layer of the network to update. The decision of what layer to update is done in a greedy fashion, based on a rigorous lower bound of the function decrease for each possible choice of layer. In the analysis, we bound the time required to reach approximate stationary points, in both the deterministic and stochastic settings. The convergence is measured in terms of a Finsler geometry that is derived from the network architecture and designed to confirm a Lipschitz-like property on the gradient of the training objective function. Numerical experiments in both the full batch and mini-batch settings suggest that the algorithm is a promising step towards methods for training neural networks that are both rigorous and efficient.



View Gradient Descent using Duality Structures