OpenCodeInstruct: A Large-scale Instruction Tuning Dataset for Code LLMs
By: Wasi Uddin Ahmad, Aleksander Ficek, Mehrzad Samadi, and more
Potential Business Impact:
Helps computers write better code.
Large Language Models (LLMs) have transformed software development by enabling code generation, automated debugging, and complex reasoning. However, their continued advancement is constrained by the scarcity of high-quality, publicly available supervised fine-tuning (SFT) datasets tailored for coding tasks. To bridge this gap, we introduce OpenCodeInstruct, the largest open-access instruction tuning dataset, comprising 5 million diverse samples. Each sample includes a programming question, solution, test cases, execution feedback, and LLM-generated quality assessments. We fine-tune various base models, including LLaMA and Qwen, across multiple scales (1B+, 3B+, and 7B+) using our dataset. Comprehensive evaluations on popular benchmarks (HumanEval, MBPP, LiveCodeBench, and BigCodeBench) demonstrate substantial performance improvements achieved by SFT with OpenCodeInstruct. We also present a detailed methodology encompassing seed data curation, synthetic instruction and solution generation, and filtering.
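To make the dataset's structure concrete, below is a minimal sketch of what one OpenCodeInstruct-style sample and an execution-based filtering pass might look like. The field names (`question`, `solution`, `test_cases`, `llm_quality_score`), the quality threshold, and the filtering logic are illustrative assumptions, not the authors' exact schema or pipeline.

```python
# Illustrative sketch only: field names, threshold, and filtering logic are
# assumptions, not the exact OpenCodeInstruct schema or pipeline.
import subprocess
import sys
import tempfile

sample = {
    "question": "Write a function add(a, b) that returns the sum of two numbers.",
    "solution": "def add(a, b):\n    return a + b\n",
    "test_cases": [
        "assert add(1, 2) == 3",
        "assert add(-1, 1) == 0",
    ],
    # Hypothetical LLM-generated quality assessment, normalized to [0, 1].
    "llm_quality_score": 0.92,
}

def passes_execution_check(sample, timeout=5):
    """Run the solution against its test cases in a subprocess and
    return pass/fail execution feedback as a filtering signal."""
    program = sample["solution"] + "\n" + "\n".join(sample["test_cases"])
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(program)
        path = f.name
    try:
        result = subprocess.run(
            [sys.executable, path], capture_output=True, timeout=timeout
        )
        return result.returncode == 0
    except subprocess.TimeoutExpired:
        return False

# Keep only samples whose solutions pass their tests and meet a quality bar.
if passes_execution_check(sample) and sample["llm_quality_score"] >= 0.8:
    print("sample retained for SFT")
```

Combining executed test outcomes with a model-judged quality score, as sketched here, is one plausible way to filter synthetic samples before fine-tuning; the paper's actual pipeline details are described in its methodology section.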
Similar Papers
Infinity Instruct: Scaling Instruction Selection and Synthesis to Enhance Language Models
Computation and Language
Makes free AI understand and talk better.
Data-efficient LLM Fine-tuning for Code Generation
Computation and Language
Trains computers to write better code faster.
Instruction Tuning of Large Language Models for Tabular Data Generation in One Day
CV and Pattern Recognition
Teaches computers to create good tables from simple instructions.