SCoder: Iterative Self-Distillation for Bootstrapping Small-Scale Data Synthesizers to Empower Code LLMs

📅 2025-09-09
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing large language models (LLMs) for code rely heavily on proprietary models to generate large-scale instruction data, incurring prohibitive computational and financial costs. To address this, we propose an iterative self-distillation framework that bootstraps small-scale open-source LLMs (e.g., 7B-parameter models) into capable synthesizers of high-quality code instruction data, using multi-checkpoint sampling, multi-aspect automatic scoring, and gradient-based influence estimation. This approach reduces dependence on closed-source models and drastically lowers data construction costs. The SCoder models, fine-tuned from DeepSeek-Coder exclusively on the synthesized data, achieve state-of-the-art performance across major code generation benchmarks, demonstrating the feasibility and effectiveness of using compact open models to generate high-fidelity instruction data. Our core contributions are threefold: (i) the first integration of gradient-based influence estimation into instruction data filtering; (ii) a fully open, end-to-end pipeline for instruction data synthesis; and (iii) a paradigm that jointly ensures low cost, openness, and competitive performance.
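
At a high level, each self-distillation round generates candidate data from the synthesizer's checkpoints, selects and filters it, and retrains the synthesizer on its own filtered output. The Python sketch below is a schematic reading of that loop under stated assumptions, not the paper's implementation; `sample_fn`, `select_fn`, `filter_fn`, and the `synthesizer.finetune` interface are hypothetical placeholders for the three stages.

```python
def iterative_self_distillation(synthesizer, seed_prompts, sample_fn, select_fn,
                                filter_fn, n_rounds=3):
    """Schematic bootstrapping loop: generate -> select -> filter -> retrain.

    All stage interfaces here are hypothetical placeholders, not the paper's API.
    """
    dataset = []
    for _ in range(n_rounds):
        candidates = sample_fn(synthesizer, seed_prompts)  # multi-checkpoint sampling
        selected = select_fn(candidates)                   # multi-aspect scoring (initial selection)
        influential = filter_fn(selected)                  # gradient-based influence filtering
        dataset.extend(influential)
        synthesizer = synthesizer.finetune(influential)    # self-distill for the next round
    return dataset
```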

📝 Abstract
Existing code large language models (LLMs) often rely on large-scale instruction data distilled from proprietary LLMs for fine-tuning, which typically incurs high costs. In this paper, we explore the potential of small-scale open-source LLMs (e.g., 7B) as synthesizers for high-quality code instruction data construction. We first observe that the data synthesis capability of small-scale LLMs can be enhanced by training on a few superior data synthesis samples from proprietary LLMs. Building on this, we propose a novel iterative self-distillation approach to bootstrap small-scale LLMs, transforming them into powerful synthesizers that reduce reliance on proprietary LLMs and minimize costs. Concretely, in each iteration, to obtain diverse and high-quality self-distilled data, we design multi-checkpoint sampling and multi-aspect scoring strategies for initial data selection. Furthermore, to identify the most influential samples, we introduce a gradient-based influence estimation method for final data filtering. Based on the code instruction datasets from the small-scale synthesizers, we develop SCoder, a family of code generation models fine-tuned from DeepSeek-Coder. SCoder models achieve state-of-the-art code generation capabilities, demonstrating the effectiveness of our method.
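
To make the initial selection step concrete, here is a minimal Python sketch of multi-checkpoint sampling followed by multi-aspect scoring. The aspect names, the 1-10 scale, the threshold, and the `checkpoint.generate` / `score_fn` interfaces are illustrative assumptions, not details taken from the paper.

```python
from statistics import mean

# Hypothetical scoring aspects; the paper's actual rubric may differ.
ASPECTS = ("instruction_clarity", "solution_correctness", "educational_value")

def sample_candidates(checkpoints, seed_prompt, n_per_checkpoint=4, temperature=0.8):
    """Draw candidate instruction-response pairs from several synthesizer
    checkpoints to diversify the candidate pool."""
    pool = []
    for checkpoint in checkpoints:
        for _ in range(n_per_checkpoint):
            pool.append(checkpoint.generate(seed_prompt, temperature=temperature))
    return pool

def select_by_multi_aspect_score(pool, score_fn, threshold=7.0):
    """Keep candidates whose mean score over all aspects (assumed 1-10 scale)
    clears the threshold; score_fn(sample, aspect) is a hypothetical scorer."""
    return [s for s in pool if mean(score_fn(s, a) for a in ASPECTS) >= threshold]
```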
Problem

Research questions and friction points this paper is trying to address.

Reducing reliance on costly proprietary LLMs for code instruction data
Enhancing small-scale LLMs as synthesizers for code generation
Developing iterative self-distillation to bootstrap data synthesis capability
Innovation

Methods, ideas, or system contributions that make the work stand out.

Iterative self-distillation bootstraps small LLMs
Multi-checkpoint sampling enhances data diversity
Gradient-based influence estimation filters samples (see the sketch after this list)
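
One way to read the final filtering step is as a first-order influence proxy: score each candidate by the inner product between its training-loss gradient and the gradient of a held-out validation loss, then keep the top-scoring samples. The PyTorch sketch below illustrates this idea on a toy model; it is an assumed approximation for illustration, not the paper's exact estimator.

```python
import torch
from torch import nn

def flat_grad(loss, params):
    """Flatten the gradient of `loss` w.r.t. `params` into a single vector."""
    grads = torch.autograd.grad(loss, params)
    return torch.cat([g.reshape(-1) for g in grads])

def influence_scores(model, loss_fn, candidates, val_batch):
    """First-order influence proxy: dot product between each candidate's
    gradient and the validation gradient (higher = more helpful)."""
    params = [p for p in model.parameters() if p.requires_grad]
    val_x, val_y = val_batch
    val_grad = flat_grad(loss_fn(model(val_x), val_y), params)
    return [torch.dot(flat_grad(loss_fn(model(x), y), params), val_grad).item()
            for x, y in candidates]

# Toy usage: rank random regression samples against a validation batch.
model = nn.Linear(4, 1)
loss_fn = nn.MSELoss()
candidates = [(torch.randn(1, 4), torch.randn(1, 1)) for _ in range(8)]
val_batch = (torch.randn(16, 4), torch.randn(16, 1))
scores = influence_scores(model, loss_fn, candidates, val_batch)
top_k = sorted(range(len(candidates)), key=scores.__getitem__, reverse=True)[:4]
```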
🔎 Similar Papers
No similar papers found.
Xinyu Zhang
School of Computer Science, Beijing Institute of Technology
Changzhi Zhou
School of Computer Science, Beijing Institute of Technology
Linmei Hu
Beijing Institute of Technology
Large Language Models · Knowledge Graph · Multimodal
Luhao Zhang
School of Computer Science, Beijing Institute of Technology
Xiancai Chen
School of Computer Science, Peking University
Haomin Fu
LongCat Team, Meituan
Large Language Model · Code Language Model · Agentic Coding Model
Yang Yang
Meituan
Mengdi Zhang
Meituan