Abstract
This lecture introduces the operation of a learning algorithm and the trade-off between the bias and variance of prediction estimators. A learning algorithm takes as input a datum x, from which it predicts an approximation of the response y. Such an algorithm includes internal parameters that are optimized using training examples, so that the prediction is accurate on these examples. The aim is for this accuracy to generalize to other data of the "same type" as the training examples; the prediction error on such new data is the generalization error. A supervised learning algorithm computes the prediction with a function selected from a class of possible functions, for example by minimizing the average error on the training examples. This average error is the empirical estimate of the statistical risk. A priori information about the problem is used to specify the approximation class and the definition of the risk.
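As a worked formulation of this principle (a sketch using notation assumed by this summary, not fixed by the lecture: a loss ℓ, a class of functions F, and n training examples), the algorithm selects

$$\hat f \;=\; \arg\min_{f \in \mathcal{F}} \; \widehat{R}(f), \qquad \widehat{R}(f) = \frac{1}{n}\sum_{i=1}^{n} \ell\big(f(x_i), y_i\big), \qquad R(f) = \mathbb{E}_{(x,y)}\big[\ell(f(x), y)\big],$$

where $\widehat{R}$ is the empirical risk computed on the training examples $(x_i, y_i)$ and $R$ is the statistical risk, that is, the expected error over all data of the same type.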
The lecture begins with a presentation of linear classification algorithms, which divide the data space into two parts separated by a hyperplane. The hyperplane is optimized to minimize the classification error. These algorithms are extended by a change of variable that replaces x with a new representation. A central issue is the choice of this change of variable so as to minimize the generalization risk.
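In a common formulation (the notation here is an assumption of this summary), a linear classifier combined with a change of variable Φ predicts

$$\hat y(x) \;=\; \operatorname{sign}\big(\langle w, \Phi(x) \rangle + b\big),$$

where $\Phi(x)$ is the new representation of x, and the hyperplane parameters $(w, b)$ are optimized on the training examples; the choice $\Phi(x) = x$ recovers classification by a hyperplane in the original data space.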
The generalization risk has two components. The first is an approximation error, due to the fact that the response is approximated within a restricted class of functions; this is the bias term. The bias increases as the approximation class is reduced. The second error term comes from the fact that the algorithm minimizes the error on the training examples rather than on all possible data. This induces a random fluctuation that depends on the choice of training examples, and whose maximum amplitude we want to control. This variability term depends on the complexity of the approximation class, and it decreases as the size of the class is reduced. Optimizing the approximation class is therefore a compromise between the bias and the variability of the risk.
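With the notation of the previous sketch (again an assumption of this summary), this decomposition of the excess risk of the selected function $\hat f$ can be written

$$R(\hat f) - \inf_{f} R(f) \;=\; \underbrace{\Big(\inf_{f \in \mathcal{F}} R(f) - \inf_{f} R(f)\Big)}_{\text{approximation error (bias)}} \;+\; \underbrace{\Big(R(\hat f) - \inf_{f \in \mathcal{F}} R(f)\Big)}_{\text{estimation error (variability)}},$$

where reducing the class $\mathcal{F}$ increases the first term and decreases the second, which is the compromise described above.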
The PAC (probably approximately correct) statistical framework consists in establishing conditions under which the generalization risk becomes arbitrarily small. When the class is of finite size, one can compute an upper bound on the maximum fluctuation of the risk, as a function of the cardinality of the approximation class, valid with probability arbitrarily close to 1. This makes it possible to establish conditions under which the learning algorithm is probably approximately correct.
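For a finite class and a loss bounded by 1 (assumptions of this sketch, which combines Hoeffding's inequality with a union bound; the lecture may use different constants), a bound of this type states that, with probability at least $1 - \delta$,

$$\sup_{f \in \mathcal{F}} \big| \widehat{R}(f) - R(f) \big| \;\le\; \sqrt{\frac{\log\!\big(2|\mathcal{F}|/\delta\big)}{2n}},$$

so the fluctuation of the risk is controlled by the cardinality $|\mathcal{F}|$ of the approximation class and the number $n$ of training examples, which yields a probably approximately correct guarantee once $n$ grows on the order of $\log|\mathcal{F}|$.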