Amphithéâtre Marguerite de Navarre, Site Marcelin Berthelot
Open to all

Abstract

The optimization of a neural network consists in estimating a vector of parameters θ that minimizes a risk computed over the training examples. This is done by gradient descent, which requires the risk to be differentiable.
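As a minimal sketch of this idea (not the course's code), the loop below minimizes an empirical risk by gradient descent; the linear model and squared-error risk are my own choices, picked only because the gradient has a simple closed form.

```python
import numpy as np

# Minimal sketch: gradient descent on a differentiable empirical risk.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))                      # training inputs x'
theta_true = rng.normal(size=5)
y = X @ theta_true + 0.1 * rng.normal(size=200)    # training outputs y'

def risk(theta):
    """Empirical risk: average squared error over the training examples."""
    residual = X @ theta - y
    return np.mean(residual ** 2)

def risk_gradient(theta):
    """Gradient of the risk with respect to theta (the risk must be differentiable)."""
    residual = X @ theta - y
    return 2.0 * X.T @ residual / len(y)

theta = np.zeros(5)
step = 0.1
for _ in range(500):
    theta -= step * risk_gradient(theta)           # gradient descent update

print("final risk:", risk(theta))
```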

For classification, the maximum likelihood principle is used to define a differentiable risk. Maximum likelihood seeks the parameters θ that maximize the log of the conditional probability of y' given x' over the examples (x', y') in the training database. Maximizing this likelihood is shown to be equivalent to minimizing the Kullback-Leibler divergence between the conditional data distribution and the distribution parameterized by θ. The risk is therefore defined by this divergence. When the conditional probability model is Gaussian, we obtain the quadratic regression risk.
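A sketch of the standard derivation makes this equivalence explicit; here p(y|x) denotes the conditional data distribution and p_θ(y|x) the model distribution (the notation is mine, not the abstract's).

```latex
% Maximum likelihood minimizes the expected Kullback-Leibler divergence.
\[
\frac{1}{n}\sum_{i=1}^{n} \log p_\theta(y'_i \mid x'_i)
\;\xrightarrow[\;n\to\infty\;]{}\;
\mathbb{E}_{x,y}\big[\log p_\theta(y \mid x)\big]
= -\,\mathbb{E}_{x}\Big[ D_{\mathrm{KL}}\big(p(\cdot \mid x)\,\|\,p_\theta(\cdot \mid x)\big)\Big]
+ \mathbb{E}_{x,y}\big[\log p(y \mid x)\big].
\]
% The last term does not depend on $\theta$, so maximizing the likelihood
% minimizes the expected KL divergence. With a Gaussian model
% $p_\theta(y \mid x) \propto \exp\!\big(-\|y - f_\theta(x)\|^2 / 2\sigma^2\big)$,
% the negative log-likelihood is $\|y - f_\theta(x)\|^2 / 2\sigma^2$ up to a
% constant, which is the quadratic regression risk.
```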

The most common way to perform classification with a neural network is to choose a conditional probability model defined by a softmax. The softmax assigns a probability distribution to a set of scores z_k computed for each class k, such that the probability of class k is close to 1 when z_k is the largest of all the scores z_k'. The maximum likelihood can then be computed analytically as a function of the z_k, and it is a differentiable function of them.
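A minimal numerical sketch of this construction (assuming NumPy; the function and variable names are mine) shows the softmax, the analytic negative log-likelihood, and its gradient with respect to the scores:

```python
import numpy as np

def softmax(z):
    """Probability of each class k; close to 1 for the largest score z_k."""
    z = z - z.max()                      # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def neg_log_likelihood(z, k):
    """-log p(k | z), computed analytically: log-sum-exp(z) - z_k."""
    z = z - z.max()
    return np.log(np.exp(z).sum()) - z[k]

def grad_neg_log_likelihood(z, k):
    """Gradient with respect to the scores z: softmax(z) - one_hot(k)."""
    g = softmax(z)
    g[k] -= 1.0
    return g

z = np.array([2.0, -1.0, 0.5])           # scores z_k for 3 classes
print(softmax(z))
print(neg_log_likelihood(z, 0), grad_neg_log_likelihood(z, 0))
```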

Logistic regression is a multi-class classifier for which the outputs z_k are affine functions of the input x. The likelihood maximization computed with a softmax is then a convex optimization problem in the parameters and therefore admits a unique solution. We show that this uniqueness comes from the introduction of a margin criterion that optimizes the position of the class boundaries.
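The sketch below illustrates such a multi-class logistic regression; the toy Gaussian data and the plain gradient-descent training are my own assumptions, not the course's experiment. The scores z = W x + b are affine in x, and the averaged negative log-likelihood is convex in (W, b).

```python
import numpy as np

rng = np.random.default_rng(0)
n_per_class, d, K = 100, 2, 3
means = np.array([[0.0, 0.0], [3.0, 0.0], [0.0, 3.0]])
X = np.vstack([rng.normal(m, 1.0, size=(n_per_class, d)) for m in means])
y = np.repeat(np.arange(K), n_per_class)

W = np.zeros((K, d))                       # affine scores: z = W x + b
b = np.zeros(K)
step = 0.5

def softmax_rows(Z):
    Z = Z - Z.max(axis=1, keepdims=True)   # stability shift per example
    E = np.exp(Z)
    return E / E.sum(axis=1, keepdims=True)

for _ in range(300):
    P = softmax_rows(X @ W.T + b)          # p_theta(k | x) for every example
    P[np.arange(len(y)), y] -= 1.0         # softmax - one_hot: gradient of the NLL in z
    W -= step * (P.T @ X) / len(y)         # convex problem: gradient descent converges
    b -= step * P.mean(axis=0)

pred = np.argmax(X @ W.T + b, axis=1)
print("training accuracy:", (pred == y).mean())
```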