Real-TabPFN: Improving Tabular Foundation Models via Continued Pre-training With Real-World Data
By: Anurag Garg, Muhammad Ali, Noah Hollmann, and more
Potential Business Impact:
Makes computer models trained on fake data better by also learning from real data.
Foundation models for tabular data, like TabPFN, achieve strong performance on small datasets when pre-trained solely on synthetic data. We show that this performance can be significantly boosted by a targeted continued pre-training phase. Specifically, we demonstrate that leveraging a small, curated collection of large, real-world datasets for continued pre-training yields superior downstream predictive accuracy compared to using broader, potentially noisier corpora like CommonCrawl or GitTables. Our resulting model, Real-TabPFN, achieves substantial performance gains on 29 datasets from the OpenML AutoML Benchmark.
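The abstract only sketches the continued pre-training phase at a high level. As a rough illustration, the snippet below shows what such a phase could look like for a prior-fitted tabular model: the loop, the `sample_real_world_task` helper, and the model interface are all assumptions for illustration, not the paper's actual training code or the TabPFN library's API.

# Minimal sketch of continued pre-training on real-world tabular tasks.
# `model` is assumed to be a synthetically pre-trained, TabPFN-style
# in-context learner; `sample_real_world_task` is a hypothetical helper
# that draws support/query splits from curated real-world datasets.
import torch
import torch.nn.functional as F

def continued_pretraining(model, sample_real_world_task, steps=10_000, lr=1e-5):
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    model.train()
    for step in range(steps):
        # Each task is an in-context episode: a support set the model
        # conditions on, and a query set whose labels it must predict.
        x_support, y_support, x_query, y_query = sample_real_world_task()
        logits = model(x_support, y_support, x_query)
        loss = F.cross_entropy(logits, y_query)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return model

A small learning rate and a curated task sampler reflect the paper's framing that this is a targeted, secondary phase on top of synthetic pre-training rather than training from scratch.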
Similar Papers
TabPFN: One Model to Rule Them All?
Machine Learning (CS)
Teaches computers to learn from data faster.
nanoTabPFN: A Lightweight and Educational Reimplementation of TabPFN
Machine Learning (CS)
Makes smart computer models easy to learn.
TabPFN-2.5: Advancing the State of the Art in Tabular Foundation Models
Machine Learning (CS)
Makes computers learn from bigger, more complex data.