Substantiation of the Backpropagation Technique via the Hamilton-Pontryagin Formalism for Training Nonconvex Nonsmooth Neural Networks

The paper observes the similarity between the stochastic optimal control of discrete dynamical systems and the training of multilayer neural networks. It focuses on contemporary deep networks with nonconvex nonsmooth loss and activation functions. In the paper, the machine learning problems are treated as nonconvex nonsmooth stochastic optimization problems. As a model of nonsmooth nonconvex dependences, …
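
The control analogy mentioned above can be written out for the simplest case. The following is a sketch only, under assumptions that are not in the abstract: a deterministic, smooth setting with a single sample, layer outputs $x_k$ as states, layer weights $u_k$ as controls, and a terminal loss $\varphi$. The paper itself treats the nonsmooth stochastic case with generalized gradients, which this sketch does not cover.

```latex
% Assumed smooth, deterministic setting; symbols x_k, u_k, f_k, \varphi, \psi_k are illustrative.
\begin{align*}
x_{k+1} &= f_k(x_k, u_k), \quad k = 0, \dots, N-1, \qquad J(u) = \varphi(x_N), \\
H_k(x_k, u_k, \psi_{k+1}) &= \psi_{k+1}^{\top} f_k(x_k, u_k), \\
\psi_N &= \nabla \varphi(x_N), \qquad
\psi_k = \nabla_{x_k} H_k = \left( \frac{\partial f_k}{\partial x_k} \right)^{\top} \psi_{k+1}, \\
\nabla_{u_k} J &= \nabla_{u_k} H_k = \left( \frac{\partial f_k}{\partial u_k} \right)^{\top} \psi_{k+1}.
\end{align*}
```

Under these assumptions the backward recursion for the costate $\psi_k$ is the backward pass of backpropagation, and the gradient with respect to each layer's weights is read off from the Hamiltonian $H_k$.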

Generalized Gradients in Problems of Dynamic Optimization, Optimal Control, and Machine Learning

In this work, nonconvex nonsmooth problems of dynamic optimization, optimal control in discrete time (including feedback control), and machine learning are considered from a common point of view. An analogy is observed between tasks of controlling discrete dynamic systems and training multilayer neural networks with nonsmooth target function and connections. Methods for calculating generalized gradients …

Quasi-Newton Methods for Deep Learning: Forget the Past, Just Sample

We present two sampled quasi-Newton methods: sampled LBFGS and sampled LSR1. Contrary to the classical variants of these methods that sequentially build (inverse) Hessian approximations as the optimization progresses, our proposed methods sample points randomly around the current iterate to produce these approximations. As a result, the approximations constructed make use of more reliable (recent …
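
The sampling idea described above can be illustrated with a minimal Python sketch. It assumes that curvature pairs are formed as gradient differences between the current iterate and randomly sampled nearby points; the abstract does not say how the pairs are computed, so this choice, the Gaussian sampling, the quadratic test function, and all names (`grad_quadratic`, `sample_curvature_pairs`, `radius`) are hypothetical.

```python
import random

def grad_quadratic(A, x):
    """Gradient of f(x) = 0.5 * x^T A x for a symmetric matrix A (list of rows)."""
    return [sum(a_ij * x_j for a_ij, x_j in zip(row, x)) for row in A]

def sample_curvature_pairs(grad, x, num_pairs, radius, rng):
    """Build (s, y) pairs from random points near x instead of past iterates.

    Assumption: y is the gradient difference between the sampled point and x.
    """
    g_x = grad(x)
    pairs = []
    for _ in range(num_pairs):
        # Random displacement around the current iterate.
        s = [radius * rng.gauss(0.0, 1.0) for _ in x]
        x_sampled = [x_i + s_i for x_i, s_i in zip(x, s)]
        g_sampled = grad(x_sampled)
        y = [a - b for a, b in zip(g_sampled, g_x)]
        pairs.append((s, y))
    return pairs

# Toy quadratic with Hessian A; for this function y equals A s up to rounding.
A = [[2.0, 0.0], [0.0, 4.0]]
pairs = sample_curvature_pairs(
    lambda x: grad_quadratic(A, x), [1.0, -1.0], 3, 0.1, random.Random(0)
)
```

Pairs built this way would then feed a standard LBFGS or LSR1 update; because they are resampled at every iteration, none of them depends on stale iterates.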

Strong mixed-integer programming formulations for trained neural networks

We present strong mixed-integer programming (MIP) formulations for high-dimensional piecewise linear functions that correspond to trained neural networks. These formulations can be used for a number of important tasks, such as verifying that an image classification network is robust to adversarial inputs, or solving decision problems where the objective function is a machine learning model. …
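
For orientation, the standard big-M encoding of a single ReLU unit is sketched below. This is the common baseline formulation, not the strong formulation the paper proposes. It assumes a unit $y = \max(0, w^{\top} x + b)$ with known bounds $L \le w^{\top} x + b \le U$ and $L < 0 < U$; the symbols $w$, $b$, $L$, $U$, $z$ are introduced here for illustration.

```latex
% Baseline big-M model of y = max(0, w^T x + b); assumes L <= w^T x + b <= U with L < 0 < U.
\begin{align*}
y &\ge w^{\top} x + b, \\
y &\le w^{\top} x + b - L (1 - z), \\
y &\le U z, \\
y &\ge 0, \qquad z \in \{0, 1\}.
\end{align*}
```

With $z = 1$ the constraints force $y = w^{\top} x + b \ge 0$, and with $z = 0$ they force $y = 0$ and $w^{\top} x + b \le 0$, so the binary variable selects the active piece of the ReLU.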

Global Convergence in Deep Learning with Variable Splitting via the Kurdyka-Łojasiewicz Property

Deep learning has recently attracted a significant amount of attention due to its great empirical success. However, the effectiveness in training deep neural networks (DNNs) remains a mystery in the associated nonconvex optimizations. In this paper, we aim to provide some theoretical understanding of such optimization problems. In particular, the Kurdyka-Łojasiewicz (KL) property is established …

A Progressive Batching L-BFGS Method for Machine Learning

The standard L-BFGS method relies on gradient approximations that are not dominated by noise, so that search directions are descent directions, the line search is reliable, and quasi-Newton updating yields useful quadratic models of the objective function. All of this appears to call for a full batch approach, but since small batch sizes give rise …

Bounding and Counting Linear Regions of Deep Neural Networks

We investigate the complexity of deep neural networks (DNN) that represent piecewise linear (PWL) functions. In particular, we study the number of linear regions, i.e. pieces, that a PWL function represented by a DNN can attain, both theoretically and empirically. We present (i) tighter upper and lower bounds for the maximum number of linear regions …
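
To make the notion of "pieces" concrete, here is a minimal Python sketch that counts the distinct activation patterns of a one-hidden-layer ReLU network along a one-dimensional input grid. It is an illustrative grid-sampling check, not the counting method of the paper; the network weights and the function names are hypothetical, and a grid can miss regions narrower than its spacing.

```python
def activation_pattern(weights, biases, x):
    """Return which hidden ReLU units are active at scalar input x."""
    return tuple(w * x + b > 0 for w, b in zip(weights, biases))

def count_patterns_on_grid(weights, biases, lo, hi, steps):
    """Count distinct activation patterns seen on an evenly spaced grid in [lo, hi]."""
    seen = set()
    for i in range(steps + 1):
        x = lo + (hi - lo) * i / steps
        seen.add(activation_pattern(weights, biases, x))
    return len(seen)

# Hypothetical network: three hidden units with breakpoints at x = 0, 1, 2.
weights = [1.0, 1.0, 1.0]
biases = [0.0, -1.0, -2.0]
num_patterns = count_patterns_on_grid(weights, biases, -1.0, 3.0, 400)
```

For this toy network the three breakpoints split the line into four intervals, so the grid finds four activation patterns. Each pattern corresponds to one affine piece of the network's output, although in general neighboring patterns can share the same affine function.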

Deterministic Global Optimization with Artificial Neural Networks Embedded

Artificial neural networks (ANNs) are used in various applications for data-driven black-box modeling and subsequent optimization. Herein, we present an efficient method for deterministic global optimization of ANN embedded optimization problems. The proposed method is based on relaxations of algorithms using McCormick relaxations in a reduced-space [SIOPT, 20 (2009), pp. 573-601] including the convex and …