Score: 2

Efficient Code LLM Training via Distribution-Consistent and Diversity-Aware Data Selection

Published: July 3, 2025 | arXiv ID: 2507.02378v1

By: Weijie Lyu, Sheng-Jun Huang, Xuan Xia

Potential Business Impact:

Trains code-writing AI models to perform better with far less training data, cutting compute costs.

Business Areas:
Natural Language Processing, Artificial Intelligence, Data and Analytics, Software

Recent advancements in large language models (LLMs) have significantly improved code generation and program comprehension, accelerating the evolution of software engineering. Current methods primarily enhance model performance by leveraging vast amounts of data, focusing on data quantity while often overlooking data quality, thereby reducing training efficiency. To address this, we introduce an approach that uses a parametric model for code data selection, aimed at improving both training efficiency and model performance. Our method optimizes the parametric model to ensure distribution consistency and diversity within the selected subset, guaranteeing high-quality data. Experimental results demonstrate that, using only 10K samples, our method achieves gains of 2.4% (HumanEval) and 2.3% (MBPP) over the 92K full-sample baseline, outperforming other sampling approaches in both performance and efficiency. This underscores that our method effectively boosts model performance while significantly reducing computational costs.
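The abstract only names the two selection objectives (distribution consistency and diversity) at a high level; the paper's actual parametric selection model is not described here. As a rough illustration of how such a trade-off can be expressed, the following is a minimal sketch assuming samples are already embedded as vectors and selection is done greedily; the function name `select_subset`, the weighting scheme, and the greedy strategy are assumptions, not the authors' method.

```python
# Illustrative sketch only: greedily pick a subset whose mean embedding stays
# close to the full-data mean (distribution consistency) while each new pick
# stays far from already-selected samples (diversity). Weights and strategy
# are assumptions for illustration, not the paper's parametric model.
import numpy as np

def select_subset(embeddings: np.ndarray, k: int, div_weight: float = 1.0) -> list[int]:
    """Select k row indices from `embeddings` (n x d)."""
    n, _ = embeddings.shape
    full_mean = embeddings.mean(axis=0)
    selected: list[int] = []
    subset_sum = np.zeros_like(full_mean)

    for _ in range(k):
        best_idx, best_score = -1, -np.inf
        for i in range(n):
            if i in selected:
                continue
            # Distribution consistency: how close the subset mean would be
            # to the full-data mean if sample i were added.
            cand_mean = (subset_sum + embeddings[i]) / (len(selected) + 1)
            consistency = -np.linalg.norm(cand_mean - full_mean)
            # Diversity: distance from sample i to its nearest selected neighbour.
            diversity = min(
                (np.linalg.norm(embeddings[i] - embeddings[j]) for j in selected),
                default=0.0,
            )
            score = consistency + div_weight * diversity
            if score > best_score:
                best_idx, best_score = i, score
        selected.append(best_idx)
        subset_sum += embeddings[best_idx]
    return selected

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    fake_embeddings = rng.normal(size=(500, 32))  # stand-in for code-sample embeddings
    chosen = select_subset(fake_embeddings, k=10)
    print("selected indices:", chosen)
```

In this toy setup the consistency term keeps the selected subset representative of the full corpus, while the diversity term discourages near-duplicate picks; the paper instead learns these criteria through an optimized parametric model.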

Country of Origin
🇭🇰 🇨🇳 Hong Kong, China

Repos / Data Links

Page Count
11 pages

Category
Computer Science:
Computation and Language