Geometric data has revolutionized the way we perceive and interact with the three-dimensional world. More generally, data, whether geometric or not, has taken on an essential role in modern science and, beyond that, in society as a whole. Developing a geometric and topological approach to data analysis, building its mathematical and algorithmic foundations, and producing and disseminating fast, reliable software are some of the new goals of algorithmic geometry addressed in this final lecture.
Data are regarded as points in a space that is assumed to be equipped with a notion of distance (not necessarily the usual Euclidean distance). Most of the time, data points live in high-dimensional spaces, but they are generally not distributed uniformly throughout this huge space. The apparent complexity of the data does not usually reflect its intrinsic complexity. It is therefore important to look for compact models that faithfully represent the data and make it amenable to analysis. This is a major problem known as "dimension reduction". What is the intrinsic dimension of the data? Can we embed the data in a low-dimensional space without distorting the distances between points too much? Can we design algorithms whose complexity depends on the dimension of the data rather than on that of the ambient observation space? These questions have led to some remarkable results and to the development of very useful algorithms, as we saw for nearest neighbor search in high dimensions in the previous lecture.
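To make the embedding question concrete, here is a minimal sketch, not taken from the lecture, of a Johnson-Lindenstrauss-style random projection in NumPy; the function names, the target dimension, and the toy data are illustrative assumptions. It projects a high-dimensional point cloud onto a random low-dimensional subspace and measures how much pairwise distances are distorted.

```python
import numpy as np

def random_projection(points, target_dim, seed=None):
    """Project points (n x d) with a Gaussian random matrix scaled by
    1/sqrt(target_dim), so pairwise distances are roughly preserved
    (a Johnson-Lindenstrauss-style embedding)."""
    rng = np.random.default_rng(seed)
    n, d = points.shape
    proj = rng.normal(size=(d, target_dim)) / np.sqrt(target_dim)
    return points @ proj

def pairwise_dists(A):
    # ||a - b||^2 = ||a||^2 + ||b||^2 - 2 <a, b>
    sq = (A * A).sum(axis=1)
    d2 = sq[:, None] + sq[None, :] - 2 * A @ A.T
    return np.sqrt(np.maximum(d2, 0.0))

# Toy check: 200 points in dimension 10000, projected to dimension 500
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10_000))
Y = random_projection(X, target_dim=500, seed=1)

D_hi, D_lo = pairwise_dists(X), pairwise_dists(Y)
mask = ~np.eye(len(X), dtype=bool)
ratios = D_lo[mask] / D_hi[mask]
print(f"distance ratios lie in [{ratios.min():.3f}, {ratios.max():.3f}]")
```

The point of the sketch is that the target dimension needed for a given distortion depends only on the number of points (logarithmically) and on the tolerated error, not on the dimension of the observation space.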
The lecture introduces two geometric models for data. The first is that of clusters and data partitioning. The second seeks to approximate the data by low-dimensional triangulated manifolds, even when they are embedded in high-dimensional spaces. This assumption is justified by spectacular results such as Takens' theorem for time series or the Johnson-Lindenstrauss lemma and its variants. The manifold approximation results described in the previous lectures apply here, but to be usable in practice they must be analyzed from a statistical point of view and made robust to the noise that is always present in real data. This leads to the study of the distance to a probability measure and to stability theorems that extend the reconstruction results seen in previous lectures.
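As a pointer to how noise robustness enters, the following is a minimal NumPy sketch of the empirical distance to a measure; the function name, the mass parameter m, and the toy data are illustrative choices, not the lecture's notation. Instead of the distance to the nearest data point, one averages squared distances to a fraction m of the sample, which makes the distance function stable under outliers.

```python
import numpy as np

def distance_to_measure(x, sample, m=0.05):
    """Empirical distance from x to the uniform measure on `sample`:
    root mean of the squared distances to the k = ceil(m * n) nearest
    sample points. With k = 1 this is the usual distance to the point
    cloud, which a single outlier can perturb arbitrarily."""
    n = len(sample)
    k = max(1, int(np.ceil(m * n)))
    d2 = np.sum((sample - x) ** 2, axis=1)   # squared distances to all points
    nearest = np.partition(d2, k - 1)[:k]    # k smallest squared distances
    return np.sqrt(nearest.mean())

# Toy example: a noisy circle plus one far-away outlier
rng = np.random.default_rng(0)
theta = rng.uniform(0, 2 * np.pi, 300)
cloud = np.c_[np.cos(theta), np.sin(theta)] + 0.02 * rng.normal(size=(300, 2))
cloud = np.vstack([cloud, [[10.0, 10.0]]])   # the outlier

print(distance_to_measure(np.array([1.0, 0.0]), cloud))    # small: on the circle
print(distance_to_measure(np.array([10.0, 10.0]), cloud))  # large: the lone outlier
                                                            # barely lowers it
```

Averaging over a positive fraction of the sample is what makes the stability theorems possible: a small amount of mass placed far from the underlying shape changes the distance-to-measure function only slightly, whereas it changes the ordinary distance to the point cloud drastically.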