Metadata

Highlights

  • In tabular machine learning, benchmarks are old news. When ML researchers develop a new machine learning algorithm, they pick a set of datasets like OpenML-CC18 and compare the performance of the new algorithm against the state-of-the-art algorithms.
    But the tabular benchmark situation is changing, and the change goes hand-in-hand with the rise of tabular foundation models. (View Highlight)
  • The changes are twofold: benchmarks are becoming more “live” and more focused on “capabilities,” at least through the lens of tabular foundation models.
    Live benchmarks, like TabArena, are typically very rigorous with a strict protocol for standardized pre-processing and evaluation. But what makes them “live” is that they come with a website and active maintenance, reflecting the current state-of-the-art. This is in contrast to static benchmarks, which may be a table in a PDF paper, without updates. If you want to learn more about TabArena, I have a full blog post: (View Highlight)
  • Live benchmarks are quite common in LLM development, where we have SWE-bench for assessing coding, LongBench for testing LLMs with long contexts, and Humanity’s Last Exam with a list of expert-level questions. Each benchmark addresses different “capabilities” of large language models. (View Highlight)
  • Tabular is, by nature, a narrow modality compared to language, which is more general-purpose (translation, coding, question-answering, …). However, even the tabular modality contains many tasks: classification, regression, quantile regression, missing data imputation, time series forecasting, and many, many more. If someone designs a new ML algorithm, they can benchmark it on any of these tasks (if the algorithm is flexible enough). (View Highlight)
  • The novel appeal of benchmarking tabular foundation models is their even greater flexibility, and that we are not testing an algorithm, but a fixed model. Whether you use TabICL for regression, quantile regression, or time series forecasting, it’s always the same pre-trained model, and due to in-context learning, the weights don’t change. This parallels LLMs, where we have pre-trained models with in-context learning. This invites us to reframe tasks as “capabilities.”
    I’m excited about seeing a proliferation of benchmarks to test “capabilities” beyond just classification and regression. For example, ScoringBench evaluates ML algorithms and tabular foundation models based on their capability to predict the full predictive distribution. (View Highlight)
  • While benchmarks have always been a catalyst in machine learning, it feels like the benchmark landscape for tabular ML is changing due to tabular foundation models. For example, just recently, the MulTaBench paper was put on arXiv. The benchmark contains 40 multimodal datasets, 20 of which are tabular plus text, and the other 20 are tabular plus image. Exactly the catalyst we need to move forward on multimodal tabular foundation models.
    A piece of evidence pointing toward the new benchmark situation is the TabPFN-3.0 model report. Out of the 20 main pages, 9 are “Experimental Results”, mostly benchmarks. For example, they test classification, regression, and quantile regression capabilities on TabArena, prediction with text columns on TabStar data, and relational data with RelBenchV1. Not all are “live” benchmarks, but they test different capabilities. (View Highlight)
  • I remain excited about tabular foundation models. The field already has strong momentum, and diverse benchmarks may serve as a catalyst. However, there is the risk of overly focusing on benchmarks. Think overfitting and benchmarks-as-marketing, as we are seeing with LLMs. That’s why it’s important that we have many diverse benchmarks from various parties. (View Highlight)
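
The highlights above mention quantile regression and scoring the full predictive distribution as "capabilities." As a rough illustration of what such a score can look like, here is a minimal sketch of the pinball (quantile) loss, averaged over several quantile levels. This is an assumption for illustration only: the function names are hypothetical, and it is not claimed to be the metric or protocol that ScoringBench or TabArena actually uses.

```python
def pinball_loss(y_true, y_pred, q):
    """Average pinball (quantile) loss for a quantile level q in (0, 1).

    Under-predictions are weighted by q and over-predictions by (1 - q),
    so the loss is minimized in expectation by the true q-quantile.
    """
    if not 0.0 < q < 1.0:
        raise ValueError("q must be strictly between 0 and 1")
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("y_true and y_pred must be non-empty and equally long")
    total = 0.0
    for y, pred in zip(y_true, y_pred):
        diff = y - pred
        total += q * diff if diff >= 0 else (q - 1.0) * diff
    return total / len(y_true)


def mean_pinball_loss(y_true, quantile_preds):
    """Average the pinball loss over several quantile levels.

    quantile_preds is a hypothetical container: a dict mapping each
    quantile level q to the list of predicted q-quantiles per row.
    """
    losses = [pinball_loss(y_true, preds, q) for q, preds in quantile_preds.items()]
    return sum(losses) / len(losses)


# Toy usage with made-up numbers (not real benchmark data).
y = [1.0, 2.0, 3.0]
preds = {
    0.1: [0.5, 1.5, 2.5],
    0.5: [1.0, 2.0, 3.0],
    0.9: [1.5, 2.5, 3.5],
}
score = mean_pinball_loss(y, preds)
```

Averaging over a grid of quantile levels gives a single number that rewards a well-calibrated, sharp predictive distribution rather than just a good point estimate, which is why a score of this kind is a natural fit for a "predictive distribution" capability.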