Abstract
The optimization of a neural network consists in estimating a vector of parameters θ that minimizes a risk computed on the training examples. This estimation is done by gradient descent, so the risk must be differentiable.
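As a worked illustration of this setting (the notation below, a per-example loss ℓ, training pairs (x_i, y_i), and a learning rate η, is not fixed in the abstract and is assumed here), the empirical risk and its gradient-descent update can be written as

\[
R(\theta) = \frac{1}{n}\sum_{i=1}^{n} \ell\big(f_\theta(x_i), y_i\big),
\qquad
\theta \leftarrow \theta - \eta \, \nabla_\theta R(\theta).
\]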
For classification, the maximum likelihood principle is used to define a differentiable risk. Maximum likelihood seeks the parameters θ that maximize the log of the conditional probability of the labels y' given the inputs x' for the examples (x', y') in the training database. Maximizing the likelihood is shown to be equivalent to minimizing the Kullback-Leibler divergence between the conditional data distribution and the distribution parameterized by θ. The risk is therefore defined by this divergence. In the case where the conditional probability model is Gaussian, we obtain the quadratic regression risk.
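As a sketch of this equivalence (writing p_data for the conditional data distribution and p_θ for the model, symbols assumed here rather than taken from the abstract):

\[
\arg\max_\theta \; \mathbb{E}_{(x',y') \sim p_{\mathrm{data}}} \big[ \log p_\theta(y' \mid x') \big]
\;=\;
\arg\min_\theta \; \mathbb{E}_{x'} \, D_{\mathrm{KL}}\!\big( p_{\mathrm{data}}(\cdot \mid x') \,\|\, p_\theta(\cdot \mid x') \big),
\]

since the KL divergence differs from the negative expected log-likelihood only by the entropy of p_data, which does not depend on θ. With a Gaussian model \(p_\theta(y' \mid x') = \mathcal{N}\big(y'; f_\theta(x'), \sigma^2 I\big)\), the negative log-likelihood reduces, up to additive constants, to the quadratic risk \(\tfrac{1}{2\sigma^2}\,\|y' - f_\theta(x')\|^2\).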
The most common way to turn a neural network into a classifier is to choose a conditional probability model defined by a softmax. The softmax maps a set of scores z_k, one computed for each class k, to a probability distribution in which the probability of class k is close to 1 when z_k is clearly the largest of the scores z_k'. The log-likelihood can then be computed analytically as a function of the z_k, and is differentiable.
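A minimal NumPy sketch of this computation (the function names and the use of NumPy are illustrative assumptions, not part of the paper):

    import numpy as np

    def softmax(z):
        # Subtract the row maximum for numerical stability; the probabilities are unchanged.
        z = z - np.max(z, axis=-1, keepdims=True)
        e = np.exp(z)
        return e / np.sum(e, axis=-1, keepdims=True)

    def neg_log_likelihood(z, y):
        # z: (n, K) class scores z_k, y: (n,) integer class labels.
        # Negative log-likelihood of the softmax model, written directly as a function of the z_k.
        z = z - np.max(z, axis=-1, keepdims=True)
        log_probs = z - np.log(np.sum(np.exp(z), axis=-1, keepdims=True))
        return -np.mean(log_probs[np.arange(len(y)), y])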
Logistic regression is a multi-class classifier whose outputs z_k are affine functions of the input x. The negative log-likelihood computed with a softmax is then a convex function of the parameters and therefore admits a unique solution. We show that the uniqueness of the solution comes from the introduction of a margin criterion that optimizes the position of the decision boundaries.
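A hedged sketch of such a multi-class logistic regression, with affine scores z = Wx + b trained by gradient descent on the convex negative log-likelihood (the function name, learning rate, and step count are assumptions made for illustration):

    import numpy as np

    def fit_logistic_regression(X, y, num_classes, lr=0.1, steps=1000):
        # Affine scores z_k = W x + b; the softmax negative log-likelihood
        # is convex in (W, b), so plain gradient descent is enough here.
        n, d = X.shape
        W = np.zeros((d, num_classes))
        b = np.zeros(num_classes)
        Y = np.eye(num_classes)[y]            # one-hot targets
        for _ in range(steps):
            z = X @ W + b
            z = z - z.max(axis=1, keepdims=True)
            p = np.exp(z)
            p /= p.sum(axis=1, keepdims=True)
            grad = (p - Y) / n                # gradient of the mean NLL w.r.t. the scores z
            W -= lr * X.T @ grad
            b -= lr * grad.sum(axis=0)
        return W, b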