Efficient Code LLM Training via Distribution-Consistent and Diversity-Aware Data Selection

📅 2025-07-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current large language models for code suffer from low data efficiency due to a prevalent “quantity-over-quality” training paradigm. To address this, we propose a parameterized data selection method that jointly enforces distributional consistency and sample diversity: a learnable model explicitly captures the distributional characteristics of high-quality code, enabling diversity-aware sampling while preserving the original data distribution. Evaluated on HumanEval and MBPP, our method achieves +2.4% and +2.3% absolute performance gains using only 10K samples—surpassing the baseline trained on the full 92K dataset, with substantially reduced computational cost. Our key contribution is formulating data selection as a diversity optimization problem subject to distributional constraints, thereby enabling, for the first time, intelligent, high-fidelity, and high-coverage code data curation under extreme data scarcity.
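The distribution-consistency objective above can be made concrete with a simple proxy: compare the cluster-proportion histogram of the selected subset against that of the full corpus. The sketch below is my own illustration, not the paper's implementation; the idea of clustering samples first and the `eps` smoothing term are assumptions.

```python
import numpy as np

def cluster_kl(full_labels, subset_labels, n_clusters, eps=1e-9):
    """KL divergence between the full set's and a subset's cluster proportions.

    A value near zero means the subset preserves the original data
    distribution; large values indicate the sampler has skewed it.
    """
    # Empirical cluster proportions for the full corpus (p) and the subset (q).
    p = np.bincount(full_labels, minlength=n_clusters) / len(full_labels)
    q = np.bincount(subset_labels, minlength=n_clusters) / len(subset_labels)
    # Smoothed KL(p || q); eps guards against empty clusters in the subset.
    return float(np.sum(p * np.log((p + eps) / (q + eps))))
```

A subset drawn with the same 80/20 cluster mix as the corpus scores near zero, while one drawn entirely from a single cluster scores well above it.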

📝 Abstract
Recent advancements in large language models (LLMs) have significantly improved code generation and program comprehension, accelerating the evolution of software engineering. Current methods primarily enhance model performance by leveraging vast amounts of data, focusing on data quantity while often overlooking data quality, thereby reducing training efficiency. To address this, we introduce an approach that uses a parametric model for code data selection, aimed at improving both training efficiency and model performance. Our method optimizes the parametric model to ensure distribution consistency and diversity within the selected subset, guaranteeing high-quality data. Experimental results demonstrate that, using only 10K samples, our method achieves gains of 2.4% (HumanEval) and 2.3% (MBPP) over the 92K full-sampled baseline, outperforming other sampling approaches in both performance and efficiency. This underscores that our method effectively boosts model performance while significantly reducing computational costs.
Problem

Research questions and friction points this paper is trying to address.

Low training efficiency of code LLMs under the prevailing "quantity-over-quality" paradigm
Selecting high-quality training data without distorting the original data distribution
Matching or exceeding full-data performance with far fewer samples
Innovation

Methods, ideas, or system contributions that make the work stand out.

Parametric model for code data selection
Ensures distribution consistency and diversity
Boosts performance with fewer samples
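A minimal sketch of how the two selection criteria might be combined in a single pass, assuming samples are represented by embedding vectors with precomputed cluster labels. The proportional per-cluster quotas (distribution consistency) and the greedy farthest-point step (diversity) are illustrative stand-ins for the paper's learned parametric model, not its actual algorithm.

```python
import numpy as np

def select_subset(embeddings, labels, budget, rng=None):
    """Distribution-consistent, diversity-aware subset selection (sketch).

    Allocates the budget across clusters in proportion to their size,
    then within each cluster greedily adds the sample farthest from
    those already chosen (max-min / farthest-point diversity).
    """
    rng = np.random.default_rng(rng)
    labels = np.asarray(labels)
    selected = []
    clusters = np.unique(labels)
    # Proportional quotas keep the subset's cluster mix close to the corpus's
    # (rounding may make the total differ slightly from `budget`).
    quotas = {c: max(1, round(budget * np.mean(labels == c))) for c in clusters}
    for c in clusters:
        idx = np.flatnonzero(labels == c)
        chosen = [rng.choice(idx)]  # seed each cluster with a random point
        while len(chosen) < min(quotas[c], len(idx)):
            remaining = np.setdiff1d(idx, chosen)
            # Distance from each candidate to its nearest already-chosen point.
            d = np.min(
                np.linalg.norm(
                    embeddings[remaining][:, None] - embeddings[chosen][None],
                    axis=-1,
                ),
                axis=1,
            )
            # Pick the candidate farthest from the current selection.
            chosen.append(remaining[np.argmax(d)])
        selected.extend(chosen)
    return np.array(selected)
```

On a corpus with an 80/20 cluster split and a budget of 10, this returns 8 samples from the large cluster and 2 from the small one, spread out within each cluster rather than clumped around a mode.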