NeurIPS 2023 LLM Efficiency Fine-tuning Competition
By: Mark Saroufim, Yotam Perlitz, Leshem Choshen, and more
Potential Business Impact:
Makes AI smarter by cleaning its learning data.
Our analysis of the NeurIPS 2023 large language model (LLM) fine-tuning competition revealed two trends: top-performing models exhibit significant overfitting on benchmark datasets, mirroring the broader issue of benchmark overfitting on popular leaderboards, and data curation is essential for producing a high-performing LLM. The competition, which consisted of two stages (an open evaluation stage with publicly available tasks and a closed evaluation stage with unseen tasks), allowed us to assess the generalizability of fine-tuned LLMs. Our results highlight the limitations of current benchmark-based evaluation schemes for generative models and demonstrate the need for more robust evaluation methods. Notably, the winning submissions utilized standard open-source libraries and focused primarily on data curation. To facilitate further research and promote reproducibility, we release all competition entries, Docker files, and evaluation infrastructure, providing a valuable resource for the community to explore fine-tuning, overfitting, and reproducibility in LLMs.
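As a rough illustration of what such data curation can involve (a generic sketch, not the winning teams' actual pipelines), the snippet below deduplicates and length-filters instruction-tuning examples before fine-tuning; the file name train.jsonl and the field names "instruction" and "response" are illustrative assumptions.

    import hashlib
    import json

    def curate(records, min_chars=20, max_chars=4000):
        """Deduplicate and length-filter instruction-tuning examples.

        A generic sketch of common curation steps; the field names
        ("instruction", "response") are illustrative assumptions.
        """
        seen = set()
        kept = []
        for rec in records:
            text = rec.get("instruction", "") + "\n" + rec.get("response", "")
            # Drop examples that are too short or too long to be useful.
            if not (min_chars <= len(text) <= max_chars):
                continue
            # Exact-match deduplication via a content hash.
            digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
            if digest in seen:
                continue
            seen.add(digest)
            kept.append(rec)
        return kept

    if __name__ == "__main__":
        with open("train.jsonl") as f:  # hypothetical input file
            records = [json.loads(line) for line in f]
        print(f"kept {len(curate(records))} of {len(records)} examples")

Real curation pipelines typically go further (near-duplicate detection, quality scoring, decontamination against evaluation sets), but even simple filtering of this kind is the kind of step the abstract credits with driving the winning submissions.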
Similar Papers
NeurIPS 2025 E2LM Competition: Early Training Evaluation of Language Models
Artificial Intelligence
Tests how well new AI learns facts early on.
LENSLLM: Unveiling Fine-Tuning Dynamics for LLM Selection
Machine Learning (CS)
Finds the best AI for jobs faster.
ResearchCodeBench: Benchmarking LLMs on Implementing Novel Machine Learning Research Code
Artificial Intelligence
Helps computers write code from new science papers.