Amphithéâtre Marguerite de Navarre, Site Marcelin Berthelot
Open to all

Abstract

The convergence of batch and stochastic gradient descent is demonstrated in the case where the cost function is Lipschitz and strongly convex with respect to its parameters.
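As a reminder, here are the two hypotheses in the form needed for the exponential rate quoted below, with the Lipschitz condition stated on the gradient; these standard definitions, and the notation F for the cost, w for the parameter vector, L and \mu for the constants, are not spelled out in this summary and are assumptions of this note:

    % Gradient assumed L-Lipschitz (smoothness):
    \|\nabla F(w) - \nabla F(w')\| \le L\,\|w - w'\|
    % Strong convexity with constant \mu > 0:
    F(w') \ge F(w) + \nabla F(w)^\top (w' - w) + \frac{\mu}{2}\,\|w' - w\|^2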

For gradient descent, we show an exponential decay in t of the Euclidean distance between the parameter vector computed at iteration t and the optimal vector that minimizes the cost function. The exponent of this decay depends on the Lipschitz bound and on the strong convexity constant.
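As an illustration of this exponential decay, here is a minimal Python sketch, not taken from the course, that runs batch gradient descent on a strongly convex quadratic and prints the distance to the minimizer; the quadratic cost, the step size 1/L and the variable names are assumptions made for the example. For this quadratic, the distance contracts roughly by a factor (1 - mu/L) per iteration.

    import numpy as np

    # Strongly convex quadratic cost F(w) = 0.5 * w^T A w - b^T w (example chosen
    # here, not from the course). Its gradient is L-Lipschitz with L = lambda_max(A)
    # and the strong convexity constant is mu = lambda_min(A).
    rng = np.random.default_rng(0)
    M = rng.standard_normal((20, 10))
    A = M.T @ M + 0.1 * np.eye(10)      # positive definite -> strongly convex
    b = rng.standard_normal(10)

    L = np.linalg.eigvalsh(A).max()     # Lipschitz constant of the gradient
    mu = np.linalg.eigvalsh(A).min()    # strong convexity constant
    w_star = np.linalg.solve(A, b)      # exact minimizer, to measure the distance

    w = np.zeros(10)
    step = 1.0 / L                      # classical step size for batch gradient descent
    for t in range(1, 201):
        grad = A @ w - b                # full (batch) gradient
        w = w - step * grad
        if t % 50 == 0:
            # The Euclidean distance decays geometrically, roughly like (1 - mu/L)^t.
            print(t, np.linalg.norm(w - w_star))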

The stochastic gradient algorithm can be interpreted as a noisy version of the gradient descent algorithm. This noise depends on the variance of the gradient of the cost function over randomly selected examples. We prove a first theorem that gives an upper bound on the Euclidean distance between the parameter vector obtained at iteration t and the optimal vector that minimizes the cost function. An optimal choice of gradient step yields an upper bound that decreases as 1/t. These theorems extend to convex, non-differentiable cost functions by replacing the gradient with the notion of a subgradient, and the same type of result is obtained.
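To make the 1/t behaviour concrete, here is a small sketch, again an illustration rather than the course's own construction, of stochastic gradient descent on a least-squares cost with the decreasing step eta_t = 1/(mu t) classically associated with the O(1/t) bound; the data, the model and the constants are assumptions of the example.

    import numpy as np

    # Least-squares cost F(w) = (1/n) * sum_i 0.5 * (x_i^T w - y_i)^2, an example
    # chosen here to illustrate the stochastic gradient algorithm described above.
    rng = np.random.default_rng(1)
    n, d = 5000, 10
    X = rng.standard_normal((n, d))
    w_true = rng.standard_normal(d)
    y = X @ w_true + 0.1 * rng.standard_normal(n)   # noisy labels -> gradient noise

    # Strong convexity constant of F (smallest eigenvalue of the empirical covariance).
    mu = np.linalg.eigvalsh(X.T @ X / n).min()
    w_star = np.linalg.solve(X.T @ X, X.T @ y)      # minimizer of the empirical cost

    w = np.zeros(d)
    for t in range(1, 100001):
        i = rng.integers(n)                          # one randomly selected example
        grad_i = (X[i] @ w - y[i]) * X[i]            # noisy, unbiased gradient estimate
        w = w - grad_i / (mu * t)                    # decreasing step ~ 1/(mu t)
        if t in (1000, 10000, 100000):
            # The squared distance to the optimum decreases roughly like 1/t.
            print(t, np.linalg.norm(w - w_star) ** 2)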

Stochastic gradient algorithms are limited in accuracy by the random variability of the cost function over the examples. To converge to the optimal solution, hybrid algorithms are needed that behave like stochastic gradient descent as long as the parameters are far from the optimal solution, and then incorporate the examples seen in the past so as to behave like batch gradient descent once the number of iterations becomes large. Such hybrid algorithms converge to the optimal solution.
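The summary does not name a specific hybrid algorithm; one well-known scheme built on this idea is the stochastic average gradient (SAG) method, sketched below as an assumption. Each update refreshes the stored gradient of one randomly chosen example and steps along the average of all stored gradients, so early iterations resemble stochastic gradient descent while late iterations approach a full batch gradient step.

    import numpy as np

    # SAG-style hybrid update sketched for a least-squares cost (illustrative choice).
    # A table keeps the last gradient computed for each example; the update follows
    # the average of this table, which tends to the batch gradient as t grows.
    rng = np.random.default_rng(2)
    n, d = 2000, 10
    X = rng.standard_normal((n, d))
    y = X @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n)
    w_star = np.linalg.solve(X.T @ X, X.T @ y)       # minimizer, to track progress

    grad_table = np.zeros((n, d))                    # memory of past per-example gradients
    grad_sum = np.zeros(d)                           # running sum of the stored gradients
    w = np.zeros(d)
    L_max = np.max(np.sum(X ** 2, axis=1))           # largest per-example Lipschitz constant
    step = 1.0 / (16.0 * L_max)                      # conservative step size

    for t in range(1, 50001):
        i = rng.integers(n)
        g_new = (X[i] @ w - y[i]) * X[i]             # fresh gradient for example i
        grad_sum += g_new - grad_table[i]            # replace its stored contribution
        grad_table[i] = g_new
        w = w - step * grad_sum / n                  # step along the averaged gradient
        if t % 10000 == 0:
            print(t, np.linalg.norm(w - w_star))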

The convergence properties of deep neural network parameters remain poorly understood, because their cost functions are not convex. The optimization of these networks therefore retains a strong experimental component. For optimization to succeed, the density of local minima whose cost is much higher than that of the global minima must be low, which depends on the network architecture. A great deal of research is dedicated to understanding this problem and to studying the geometry of the neural network optimization landscape, based on the wavelet transform. In 2018-2019, a significant part of the research was dedicated to the construction of stochastic models for the synthesis of images, music and financial data, but also to the modeling of turbulence phenomena in physics and of telecommunication network properties.