Abstract
The first lecture maps out the three main areas of data science: signal processing, data modeling, and prediction. It introduces the major issues at stake in each of these fields, as well as the mathematical and computational concepts they call upon.
In signal processing, we want to compute an estimate of a signal x with d coefficients, based on measurements. The dimension d is typically greater than one million, whether the signal is a sound, an image, or any other observation. The aim of inverse problems is to improve signal quality: a measuring instrument applies a transformation to the input signal and adds errors, i.e. noise. Inverting this transformation while reducing the noise requires a priori information about the signal's properties. Signal compression is another application; its goal is to reduce the number of bits used to encode signals, in order to limit storage space or transmission time. Here again, the idea is to exploit a priori information on signal structure.
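To make this setting concrete, a standard way to write such an inverse problem is as a regularized estimate. The notation below (the operator A for the instrument's transformation, the noise w, the penalty R and its weight) is an illustrative assumption of this sketch, not taken from the lecture.

```latex
\documentclass{article}
\usepackage{amsmath,amssymb}
\begin{document}
% Measurements y are a transformed, noisy version of the signal x with d coefficients:
\[
  y = A x + w ,
\]
% and a regularized estimate inverts A while attenuating the noise; the penalty R
% encodes the a priori information on the signal, weighted by \lambda > 0:
\[
  \hat{x} \;\in\; \arg\min_{x \in \mathbb{R}^{d}} \; \| y - A x \|^{2} + \lambda\, R(x) .
\]
\end{document}
```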
Modeling involves capturing the nature and variability of the data, by estimating the distribution of the observed data. This distribution is that of a random vector assumed to have a probability density, which is a function of the large number d of variables in each data item. The main difficulty arises from this high dimension. Building such models is necessary for optimizing signal processing algorithms, for statistical physics, and for synthesizing new data. It is also useful for prediction.
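One common formalization of this modeling problem, given here only as a sketch, is maximum-likelihood estimation over a parameterized family of densities p_θ; the family and the criterion are assumptions of this illustration, not the lecture's specific choice.

```latex
\documentclass{article}
\usepackage{amsmath,amssymb}
\begin{document}
% Fit a parameterized density p_\theta to n observed data points x_1, ..., x_n,
% each with d variables, for instance by maximizing the log-likelihood:
\[
  \hat{\theta} \;\in\; \arg\max_{\theta} \; \frac{1}{n} \sum_{i=1}^{n} \log p_{\theta}(x_i),
  \qquad x_i \in \mathbb{R}^{d} .
\]
% The difficulty comes from the dimension d: without structural assumptions on
% p_\theta, the number of samples needed for an accurate estimate grows
% exponentially with d.
\end{document}
```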
A prediction computes an estimate of the answer y to a question from data x, which may include many variables. For example, y could be the name of an animal appearing in an image x, or a diagnosis estimated from medical data x. Supervised learning optimizes the parameters of prediction algorithms, using numerous examples composed of data x for which the answer y is known.
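A minimal sketch of this supervised setting is given below. The linear classifier, the synthetic data, and the use of scikit-learn are illustrative assumptions, not the lecture's example; the point is only the structure: fit parameters on labeled pairs (x, y), then predict y on new data.

```python
# Supervised learning sketch: optimize a predictor's parameters on examples
# where the answer y is known, then evaluate its predictions on held-out data.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n, d = 1000, 20                       # n labeled examples, d variables per data item
X = rng.normal(size=(n, d))           # data x (synthetic, for illustration)
w_true = rng.normal(size=d)
y = (X @ w_true > 0).astype(int)      # answer y, known for each training example

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)  # optimize parameters
print("test accuracy:", model.score(X_test, y_test))             # quality of the prediction
```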