• Author: Vince Lam
  • Full Title: Decoding Kaggle’s 2023 AI Report: Essential Tips for Machine Learning With Tabular Data 🔍📈
  • URL: tips-for-tabular-ml


  • “It is estimated that between 50% and 90% of practicing data scientists use tabular data as their primary type of data in their primary setting.”
  • Learning how to improve a model’s performance by a few decimal points may have a positive impact on a company’s bottom line, especially if it is serving millions of customers. However, depending on the business context, there comes a point of diminishing returns when trying to eke out that extra 0.0001 of performance. Because of the iterative nature of ML, it can be difficult to decide when “good” is “good enough”.
  • In competitions, models are judged by their performance: a metric that is easily quantified, understood by competitors, and decisive for rankings, prizes, and accolades. Performance therefore becomes the main focus. This results-first approach rewards black-box methods that do not consider explainability or interpretability. This is particularly relevant for ensembling, more on that later.
  • Feature engineering is the process of creating, selecting, and transforming variables to maximise their predictive power in models. It’s a dynamic, iterative, and highly time-consuming process. Feature engineering is well recognised as one of the most important parts, if not the most important part, of a tabular ML modelling pipeline, for both competitions and industry. It matters more for tabular data than for deep learning on computer vision tasks, where the focus is more on data augmentation.
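
The three activities named in the feature engineering highlight (creating, selecting, and transforming variables) can be illustrated with a minimal sketch. The rows, column names, and the `engineer_features` helper below are all hypothetical and are not taken from the article or the Kaggle report. Plain Python is used to keep the example dependency-free; the same steps map onto DataFrame column operations.

```python
import math
from datetime import date

# Hypothetical toy rows; every column name here is illustrative only.
rows = [
    {"listing_id": 1, "area_sqm": 100.0, "rooms": 4, "built_year": 1990, "sold_on": date(2023, 1, 15)},
    {"listing_id": 2, "area_sqm": 120.0, "rooms": 5, "built_year": 2005, "sold_on": date(2023, 7, 3)},
    {"listing_id": 3, "area_sqm": 45.0, "rooms": 2, "built_year": 1978, "sold_on": date(2022, 11, 28)},
]


def engineer_features(row):
    """Build a feature dict from one raw row (hypothetical helper)."""
    return {
        # Create: a ratio of two raw columns.
        "area_per_room": row["area_sqm"] / row["rooms"],
        # Create: a difference between two time-related columns.
        "age_at_sale": row["sold_on"].year - row["built_year"],
        # Create: a calendar part extracted from a date.
        "sold_month": row["sold_on"].month,
        # Transform: log1p compresses a right-skewed numeric column.
        "log_area": math.log1p(row["area_sqm"]),
        # Select: listing_id is an arbitrary identifier, so it is left out.
    }


features = [engineer_features(r) for r in rows]
for f in features:
    print(f)
```

Which derived columns actually help is an empirical question. In practice each candidate feature would be checked against a validation score before being kept, which is part of what makes the process iterative.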