Abstract
Participating in a Machine Learning competition requires both advanced computer science skills and a mathematical and algorithmic understanding of Machine Learning models. This presentation explains the iterative process involved in achieving good results in a Machine Learning competition.
The proposed methodology breaks down into five phases, repeated until the end of the competition. It begins with a review of the state of the art on the subject, covering both scientific publications and similar past competitions. This is followed by an exploration of the data, to understand its structure and get an initial idea of which features have predictive power. The third phase builds a representation of the data that makes the most of these features: this is what we call feature engineering. Once a model evaluation procedure is in place, for example k-fold cross-validation, the fourth phase creates a battery of models, then compares and combines them to obtain the best possible predictive model. In the fifth phase, the data scientist hypothesizes new features that could provide a more relevant representation of the data and integrates them; the cycle then repeats to improve the results until the end of the competition.
Achieving excellent rankings in Machine Learning competitions therefore requires both precise knowledge of the models, in order to tune them well and to be aware of their limits, and creativity, in order to build a representation of the data that captures as much relevant information as possible.
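To make the evaluation and model-comparison phase described above more concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not the procedure used in the presentation: the synthetic dataset, the two toy models (a mean baseline and a one-feature least-squares line), the equal-weight blend, the RMSE metric, and all function names are hypothetical choices made only for this example.

```python
import random
import statistics


def k_fold_indices(n_samples, k, seed=0):
    """Shuffle the sample indices once and split them into k disjoint folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def fit_mean(xs, ys):
    """Baseline model (assumed for illustration): predict the training mean."""
    mean_y = statistics.fmean(ys)
    return lambda x: mean_y


def fit_linear(xs, ys):
    """Least-squares line y = a * x + b on a single feature."""
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    var_x = sum((x - mean_x) ** 2 for x in xs)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = cov_xy / var_x if var_x else 0.0
    b = mean_y - a * mean_x
    return lambda x: a * x + b


def fit_blend(xs, ys):
    """Combine the two models above by averaging their predictions."""
    m1, m2 = fit_mean(xs, ys), fit_linear(xs, ys)
    return lambda x: 0.5 * (m1(x) + m2(x))


def cross_val_rmse(fit, xs, ys, k=5, seed=0):
    """Train on k-1 folds, score on the held-out fold, average the k RMSEs."""
    fold_scores = []
    for held_out in k_fold_indices(len(xs), k, seed):
        held = set(held_out)
        train_x = [x for i, x in enumerate(xs) if i not in held]
        train_y = [y for i, y in enumerate(ys) if i not in held]
        model = fit(train_x, train_y)
        squared_errors = [(model(xs[i]) - ys[i]) ** 2 for i in held_out]
        fold_scores.append(statistics.fmean(squared_errors) ** 0.5)
    return statistics.fmean(fold_scores)


# Synthetic data (an assumption for this sketch): a noisy linear relationship.
rng = random.Random(42)
xs = [rng.uniform(0, 10) for _ in range(200)]
ys = [3.0 * x + 1.0 + rng.gauss(0, 1) for x in xs]

# Every candidate is scored with the same folds so the comparison is fair.
candidates = {"mean": fit_mean, "linear": fit_linear, "blend": fit_blend}
scores = {name: cross_val_rmse(fit, xs, ys) for name, fit in candidates.items()}
best = min(scores, key=scores.get)

for name, score in sorted(scores.items(), key=lambda item: item[1]):
    print(f"{name}: cross-validated RMSE = {score:.3f}")
print("best candidate:", best)
```

In this toy setting the blend does not beat the better of its two components, which is itself a useful reminder: combining models only helps when the evaluation procedure says so. In a real competition the candidates would be actual learners and the metric would be the one imposed by the organizers.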