Amphithéâtre Marguerite de Navarre, Site Marcelin Berthelot
Open to all

Abstract

This lecture studies batch and stochastic gradient descent algorithms, and their implementation in a neural network via the gradient backpropagation algorithm.

The gradient descent algorithm adjusts parameters to minimize a cost function, which quantifies the error between the estimated response and the correct response on training data. These parameters are modified iteratively by subtracting the gradient of the cost function. The algorithm therefore requires computing how the cost at the network output varies with the network parameters. In a neural network, the cost is calculated by composing linear operators with pointwise differentiable non-linearities. The derivative of such a composition of operators is obtained by applying the chain rule: derivatives are propagated backward through the network by successive multiplications of Jacobian matrices.
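The Jacobian products of the chain rule can be made concrete on a minimal example. The sketch below (an assumed architecture, not one specified in the lecture) uses a single hidden layer with a tanh non-linearity and a squared-error cost, and propagates the derivative of the cost from the output back to both weight layers:

```python
import math

# Minimal sketch (assumed architecture): one hidden layer, tanh non-linearity,
# squared-error cost. Backpropagation multiplies Jacobians from output to input.

def forward(W1, W2, x):
    # Linear operator followed by a pointwise differentiable non-linearity.
    z = [sum(w * xi for w, xi in zip(row, x)) for row in W1]
    h = [math.tanh(zi) for zi in z]
    y = sum(w * hi for w, hi in zip(W2, h))
    return z, h, y

def backward(W1, W2, x, target):
    # Chain rule: propagate derivatives by successive Jacobian products.
    z, h, y = forward(W1, W2, x)
    dy = 2.0 * (y - target)                                # d(cost)/dy
    dW2 = [dy * hi for hi in h]                            # through output layer
    dh = [dy * w for w in W2]                              # back through linear operator
    dz = [dhi * (1 - hi * hi) for dhi, hi in zip(dh, h)]   # tanh' = 1 - tanh^2
    dW1 = [[dzi * xi for xi in x] for dzi in dz]           # through input layer
    return dW1, dW2

W1 = [[0.1, -0.2], [0.3, 0.4]]
W2 = [0.5, -0.6]
x, target = [1.0, 2.0], 0.5
dW1, dW2 = backward(W1, W2, x, target)
```

A finite-difference check (perturbing one weight and recomputing the cost) is a simple way to validate gradients computed this way.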

The gradient is normally averaged over a batch containing all the training data, which is very computationally intensive. The stochastic gradient descent algorithm reduces this batch to a single example drawn at random from the database. When the cost function is strictly convex, we show that the gradient descent algorithm converges to a unique minimum. The convergence is exponential, with an exponent that depends on the conditioning of the Hessian.
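The role of the Hessian conditioning can be illustrated on a strictly convex quadratic cost, where the contraction factor per iteration is explicit. In this sketch (an illustrative example, not taken from the lecture), the Hessian is diagonal with eigenvalues mu and L, the step size is 1/L, and the error along the flattest direction shrinks by the factor 1 - mu/L at each step:

```python
import math

# Illustrative sketch: gradient descent on the strictly convex quadratic
# f(w) = 0.5 * (a*w[0]**2 + b*w[1]**2), whose Hessian is diag(a, b).
# With step 1/L (L = largest eigenvalue), the error along the flattest
# direction contracts by (1 - mu/L) per iteration: exponential convergence
# whose exponent depends on the conditioning kappa = L / mu.

a, b = 1.0, 10.0          # Hessian eigenvalues: mu = 1, L = 10
step = 1.0 / b            # step size 1/L
w = [1.0, 1.0]
errors = []
for _ in range(50):
    grad = [a * w[0], b * w[1]]
    w = [wi - step * gi for wi, gi in zip(w, grad)]
    errors.append(math.hypot(w[0], w[1]))
```

The steeper direction is eliminated in one step, while the flat direction decays like (1 - 1/kappa)^t, so an ill-conditioned Hessian (large kappa) slows convergence.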

The stochastic gradient descent algorithm converges much more slowly than the batch algorithm, because each iteration is noisy: the gradient varies with the particular example chosen. However, when the database is large enough, this algorithm may require fewer operations than batch gradient descent to approach the minimum of the cost function.
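The trade-off can be seen by counting per-example gradient evaluations. The sketch below (an illustrative least-squares problem, not from the lecture) fits a one-parameter linear model: each batch step averages the gradient over all n examples, while each stochastic step uses a single random example and so costs n times less:

```python
import random

# Illustrative sketch: fit y = w * x on n samples generated with w = 2.
# Batch GD averages the gradient over all n examples per step; SGD uses
# one randomly chosen example, so each of its iterations is n times cheaper.

random.seed(0)
n = 1000
data = [(x, 2.0 * x) for x in (random.uniform(-1, 1) for _ in range(n))]

def cost(w):
    return sum((w * x - y) ** 2 for x, y in data) / n

# Batch gradient descent: each step touches all n examples.
w_batch, batch_evals = 0.0, 0
for _ in range(20):
    g = sum(2 * (w_batch * x - y) * x for x, y in data) / n
    w_batch -= 0.5 * g
    batch_evals += n

# Stochastic gradient descent: each step touches one example.
w_sgd, sgd_evals = 0.0, 0
for _ in range(200):
    x, y = random.choice(data)
    w_sgd -= 0.5 * 2 * (w_sgd * x - y) * x
    sgd_evals += 1
```

Here the stochastic run performs far fewer per-example gradient evaluations (200 versus 20,000) while still approaching the minimum, which is the regime the abstract describes for large databases.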