Real-TabPFN: Improving Tabular Foundation Models via Continued Pre-training With Real-World Data

Published: July 5, 2025 | arXiv ID: 2507.03971v1

By: Anurag Garg, Muhammad Ali, Noah Hollmann, and more

Potential Business Impact:

Boosts the accuracy of tabular prediction models, originally trained only on synthetic data, by continuing their training on curated real-world datasets.

Business Areas:
Predictive Analytics, Artificial Intelligence, Data and Analytics, Software

Foundation models for tabular data, like TabPFN, achieve strong performance on small datasets when pre-trained solely on synthetic data. We show that this performance can be significantly boosted by a targeted continued pre-training phase. Specifically, we demonstrate that leveraging a small, curated collection of large, real-world datasets for continued pre-training yields superior downstream predictive accuracy compared to using broader, potentially noisier corpora like CommonCrawl or GitTables. Our resulting model, Real-TabPFN, achieves substantial performance gains on 29 datasets from the OpenML AutoML Benchmark.
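The core idea, taking a model whose parameters were fit on synthetic data and running additional training steps on a small curated real dataset, can be illustrated with a minimal, self-contained sketch. This is an assumption-laden toy (a 1-D linear model trained by plain gradient descent), not the paper's actual TabPFN pipeline; all names and data below are illustrative.

```python
# Hypothetical sketch of "continued pre-training": start from parameters
# learned on synthetic data, then take a few more gradient steps on a
# small, curated "real-world" dataset. Illustrative only -- not the
# paper's actual method or data.

def mse_loss(w, b, xs, ys):
    """Mean squared error of the 1-D linear model y = w*x + b."""
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def continued_pretrain(w, b, xs, ys, lr=0.01, steps=200):
    """Plain gradient descent on the real data, initialized from the
    synthetically pre-trained parameters (w, b)."""
    n = len(xs)
    for _ in range(steps):
        grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Pretend synthetic pre-training left us near, but not at, the real relationship.
w0, b0 = 1.5, 0.0

# Small curated "real-world" dataset following y = 2x + 1 (made up for the demo).
real_x = [0.0, 1.0, 2.0, 3.0, 4.0]
real_y = [2 * x + 1 for x in real_x]

loss_before = mse_loss(w0, b0, real_x, real_y)
w1, b1 = continued_pretrain(w0, b0, real_x, real_y)
loss_after = mse_loss(w1, b1, real_x, real_y)
```

The same principle scales up in the paper: the continued phase reuses the pre-trained weights as initialization, so only a targeted, smaller corpus of real data is needed to improve downstream accuracy.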

Country of Origin
🇩🇪 Germany

Page Count
11 pages

Category
Computer Science:
Machine Learning (CS)