SEED: Customize Large Language Models with Sample-Efficient Adaptation for Code Generation

📅 2024-02-29
🏛️ arXiv.org
📈 Citations: 6
Influential: 0
🤖 AI Summary
To address the challenge of customizing large language models (LLMs) for code generation in low-resource settings, this paper proposes an error-driven adaptive fine-tuning framework. The method leverages the model’s own generation errors as self-supervised signals: a Self-Revise mechanism identifies and corrects erroneous outputs, while error-guided iterative fine-tuning and generation-feedback-driven parameter optimization enable efficient domain adaptation—without requiring additional human annotations and using only a few examples. Its core contribution is the novel “error-as-supervision” self-correction learning paradigm. Experiments demonstrate that the approach achieves an average relative improvement of 54.7% in Pass@1 across multiple code-generation benchmarks, significantly outperforming mainstream fine-tuning methods. Moreover, it exhibits strong generalization across diverse LLM architectures.

📝 Abstract
Although Large Language Models (LLMs) have made significant progress in code generation, they still struggle with code generation tasks in specific scenarios. These scenarios usually necessitate the adaptation of LLMs to fulfill specific needs, but the limited training samples available in practice lead to poor code generation performance. Therefore, how to effectively adapt LLMs to new scenarios with few training samples is a major challenge for current code generation. In this paper, we propose a novel adaptation approach named SEED, which stands for Sample-Efficient adaptation with Error-Driven learning for code generation. SEED leverages the errors made by LLMs as learning opportunities, using error revision to overcome their own shortcomings, thus achieving efficient learning. Specifically, SEED involves identifying erroneous code generated by LLMs, employing Self-Revise for code revision, optimizing the model with the revised code, and iterating the process for continuous improvement. Experimental results show that, compared to other mainstream fine-tuning approaches, SEED achieves superior performance with few training samples, showing an average relative improvement of 54.7% in Pass@1 on multiple code generation benchmarks. We also validate the effectiveness of Self-Revise, which generates revised code that optimizes the model more efficiently than the code samples from datasets. Moreover, SEED consistently demonstrates strong performance across various LLMs, underscoring its generalizability.
Problem

Research questions and friction points this paper is trying to address.

Adapting large language models to specific code generation scenarios
Overcoming limited training data for effective model adaptation
Improving code generation performance through error-driven learning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Error-driven learning leverages LLM mistakes for adaptation
Self-Revise mechanism iteratively improves code generation quality
Data-efficient optimization trains on revised code rather than raw dataset samples
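The error-driven loop behind these contributions (generate, identify failures, Self-Revise, fine-tune on the revisions, repeat) can be sketched in a few lines of Python. This is a hypothetical illustration, not the authors' implementation: `ToyModel`, `seed_adapt`, and `run_tests` are stand-ins I introduce here; real LLM generation, revision prompting, and parameter updates would replace the toy methods.

```python
def run_tests(code: str, test) -> bool:
    """Identify errors: a sample is erroneous if its code fails the test."""
    try:
        namespace = {}
        exec(code, namespace)
        return bool(test(namespace))
    except Exception:
        return False


class ToyModel:
    """Toy stand-in for an LLM: generates buggy code, can 'revise' it."""

    def __init__(self):
        self.learned = {}  # fine-tuning is mocked as memorization

    def generate(self, prompt: str) -> str:
        # Initially produces a buggy implementation (subtraction, not addition).
        return self.learned.get(prompt, "def add(a, b):\n    return a - b\n")

    def revise(self, prompt: str, code: str) -> str:
        # Self-Revise step: correct the model's own erroneous output.
        return code.replace("a - b", "a + b")

    def fine_tune(self, pairs):
        # Optimize the model on (prompt, revised code) pairs.
        for prompt, fix in pairs:
            self.learned[prompt] = fix


def seed_adapt(model, samples, iterations=3):
    """Iterate: generate -> find failures -> Self-Revise -> fine-tune."""
    for _ in range(iterations):
        revised = []
        for prompt, test in samples:
            code = model.generate(prompt)
            if run_tests(code, test):
                continue  # correct output carries no learning signal
            fix = model.revise(prompt, code)
            if run_tests(fix, test):
                revised.append((prompt, fix))  # keep only verified revisions
        if revised:
            model.fine_tune(revised)
    return model


model = ToyModel()
samples = [("write add(a, b)", lambda ns: ns["add"](2, 3) == 5)]
seed_adapt(model, samples)
print(model.generate("write add(a, b)"))  # now returns the revised, passing code
```

The key design choice the sketch mirrors is that only revisions verified against the tests enter the fine-tuning set, which is why the approach needs no extra human annotation.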
Xue Jiang
Key Laboratory of High Confidence Software Technologies (Peking University), Ministry of Education; School of Computer Science, Peking University, Beijing, China

Yihong Dong
Peking University
Code Generation, Large Language Models

Zhi Jin
Sun Yat-Sen University, Associate Professor

Ge Li
Full Professor of Computer Science, Peking University
Program Analysis, Program Generation, Deep Learning