Abstract
Maximizing the likelihood amounts to minimizing a cost function equal to the negative log-likelihood. This minimization can be carried out by gradient descent, whose convergence depends on the Hessian of the cost function. Convergence is guaranteed when the Hessian is strictly positive definite, and the rate of the resulting exponential convergence is governed by its condition number.
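As a sketch, for a generic parametric model $p_\theta$ and i.i.d. samples $x_1,\dots,x_n$ (notation introduced here for illustration, not taken from the original), the cost function and the gradient-descent iteration can be written as
\[
C(\theta) \;=\; -\sum_{i=1}^{n} \log p_\theta(x_i),
\qquad
\theta_{k+1} \;=\; \theta_k - \eta\, \nabla C(\theta_k).
\]
If the Hessian satisfies $m I \preceq \nabla^2 C \preceq M I$ with $m > 0$, then with step size $\eta = 1/M$ the standard bound $C(\theta_k) - C(\theta^\ast) \le (1 - 1/\kappa)^k \bigl(C(\theta_0) - C(\theta^\ast)\bigr)$ holds, where $\kappa = M/m$ is the condition number: the larger $\kappa$, the slower the exponential convergence.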
We consider the special case of exponential distributions defined by a Gibbs energy that depends linearly on the parameters. We show that the Hessian is then always positive semidefinite, but can be ill-conditioned. Parameter optimization can be interpreted as a displacement on a Riemannian manifold, which is the viewpoint of information geometry. Finally, we consider the special case of multivariate Gaussian distributions.
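Under assumed notation (sufficient statistics $\varphi$, energy $E_\theta(x) = \langle \theta, \varphi(x) \rangle$ linear in $\theta$, partition function $Z(\theta)$; none of these symbols appear in the original), such a model and the Hessian of its negative log-likelihood take the form
\[
p_\theta(x) \;=\; \frac{1}{Z(\theta)}\, e^{-\langle \theta, \varphi(x) \rangle},
\qquad
\nabla_\theta^2 \bigl(-\log p_\theta(x)\bigr) \;=\; \nabla_\theta^2 \log Z(\theta) \;=\; \operatorname{Cov}_\theta\!\bigl[\varphi(X)\bigr],
\]
a covariance matrix, hence positive semidefinite, though possibly ill-conditioned. It coincides with the Fisher information, which equips the parameter space with the Riemannian metric underlying the information-geometric viewpoint.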