Salle 5, Site Marcelin Berthelot
Open to all

Abstract

This lecture introduces how a learning algorithm operates and the trade-off between the bias and the variance of prediction estimators. A learning algorithm takes as input a datum x, from which it predicts an approximation of the response y. Such an algorithm includes internal parameters that are optimized using training examples, so that the prediction is accurate on these examples. The aim is for this accuracy to carry over to other data of the "same type" as the training examples; the error made on such new data is the generalization error. A supervised learning algorithm computes its prediction with a function selected from a class of possible functions, for example by minimizing the average error over the training examples. This average training error is the empirical risk, which approximates the statistical risk, that is, the expected error over all data of the same type. A priori information about the problem is used to specify the approximation class and the way the risk is computed.
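In a standard notation (not fixed by the abstract), with a loss function measuring the prediction error, the statistical risk of a predictor f is its expected error, while the algorithm only has access to the empirical risk, the average error over the n training examples, which it minimizes over the approximation class:

\[
R(f) = \mathbb{E}_{(x,y)}\big[\ell(f(x),y)\big],
\qquad
\widehat{R}_n(f) = \frac{1}{n}\sum_{i=1}^{n} \ell(f(x_i),y_i),
\qquad
\widehat{f} = \arg\min_{f \in \mathcal{F}} \widehat{R}_n(f).
\]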

The lecture begins with a presentation of linear classification algorithms, which divide the data space into two parts separated by a hyperplane. The hyperplane is optimized to minimize the classification error. These algorithms are extended by a change of variable that replaces x with a new representation. A central issue is to choose this change of variable so as to minimize the generalization risk.
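As an illustration, the following Python sketch fits a separating hyperplane after a change of variable. It is not code from the lecture: the feature map phi and the perceptron update rule are only one possible choice for the new representation and for the optimization of the hyperplane.

import numpy as np

def phi(x):
    # Illustrative change of variable: map x in R^2 to its monomials of
    # degree at most 2 (the constant term is handled by the bias b below).
    x1, x2 = x[..., 0], x[..., 1]
    return np.stack([x1, x2, x1 * x2, x1**2, x2**2], axis=-1)

def train_perceptron(X, y, epochs=100):
    # Fit a classifier sign(<w, phi(x)> + b) with the perceptron rule:
    # update the weights whenever a training example is misclassified.
    Z = phi(X)
    w = np.zeros(Z.shape[1])
    b = 0.0
    for _ in range(epochs):
        for z, label in zip(Z, y):          # labels are +1 or -1
            if label * (z @ w + b) <= 0:    # classification error on this example
                w += label * z
                b += label
    return w, b

def predict(w, b, X):
    return np.sign(phi(X) @ w + b)

With a well-chosen change of variable, data that are not linearly separable in the original coordinates (for example, two classes separated by a circle) become separable by a hyperplane in the new representation.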

The generalization risk has two components. The first is an approximation error, due to the fact that the response is approximated within a restricted class of functions; this is the bias term. The bias increases as the approximation class is reduced. The second error term comes from the fact that the algorithm minimizes the error on the training examples rather than on all possible data. This induces a random fluctuation that depends on the choice of training examples, and whose maximum amplitude we want to control. This variability term depends on the complexity of the approximation class and decreases as the size of the class is reduced. Optimizing the approximation class is therefore a compromise between the bias and the variability of the risk.
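Writing \(\widehat{f}\) for the predictor learned from the training examples, \(f^{\ast}\) for the best possible predictor over all functions, and \(\mathcal{F}\) for the approximation class, this is the standard decomposition of the excess risk

\[
R(\widehat{f}) - R(f^{\ast})
= \underbrace{\Big(R(\widehat{f}) - \inf_{f\in\mathcal{F}} R(f)\Big)}_{\text{variability (estimation error)}}
+ \underbrace{\Big(\inf_{f\in\mathcal{F}} R(f) - R(f^{\ast})\Big)}_{\text{bias (approximation error)}},
\]

where the first term tends to grow with the size of the class while the second shrinks as the class grows.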

The statistical framework of PAC (probably approximately correct) learning consists of establishing conditions under which the generalization risk becomes arbitrarily small. When the approximation class has finite size, one can compute an upper bound on the maximum fluctuation of the risk, as a function of the cardinality of the class, that holds with probability arbitrarily close to 1. This makes it possible to establish conditions under which the learning algorithm is probably approximately correct.
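For a loss bounded in [0,1] and a finite class \(\mathcal{F}\), a bound of this type follows from Hoeffding's inequality combined with a union bound: for any confidence level δ in (0,1), with probability at least 1 − δ,

\[
\sup_{f\in\mathcal{F}} \big| R(f) - \widehat{R}_n(f) \big|
\;\leq\; \sqrt{\frac{\log\!\big(2\,|\mathcal{F}|/\delta\big)}{2n}},
\]

so the maximum fluctuation decreases with the number n of training examples and grows only logarithmically with the cardinality \(|\mathcal{F}|\) of the approximation class.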