🤖 AI Summary
To address the scarcity of high-quality instruction data and the prohibitive cost of manual construction in code generation, this paper proposes a tri-model co-evolutionary synthesis framework: an Instructor-LLM generates instructions, a Coder-LLM produces corresponding code, and a Judge-LLM automatically evaluates correctness; genetic operators—mutation, crossover, and selection—are applied to instructions. This work introduces the first Instructor-Coder-Judge co-evolution paradigm, enabling cold-start training with weak models and offering strong scalability and parallelism. Starting from only a small set of seed instructions, the framework efficiently synthesizes millions of high-quality instruction-code pairs. Experiments yield over 7.5 million samples; fine-tuning LLMs on this data significantly improves code generation performance, outperforming existing synthetic approaches and public datasets on benchmarks including HumanEval.
📝 Abstract
Large Language Models (LLMs) require high-quality instruction data for effective alignment, particularly in code generation tasks where expert-curated datasets are expensive to produce. We present Genetic-Instruct, a scalable algorithm for synthesizing large-scale, high-quality coding instructions using evolutionary principles. Starting from a small set of seed instructions, Genetic-Instruct generates diverse and challenging instruction-code pairs by leveraging an Instructor-LLM for instruction generation, a Coder-LLM for code synthesis, and a Judge-LLM for automatic quality evaluation. Our approach is highly parallelizable and remains effective even with small seed data and weaker generator models. We generated more than 7.5 million coding instructions with the proposed approach. We then evaluated the data by fine-tuning LLMs on the synthetic samples, demonstrating a significant improvement in their code generation capability compared to other synthetic generation approaches and publicly available datasets. Our results highlight the efficiency, scalability, and generalizability of the Genetic-Instruct framework.
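The Instructor-Coder-Judge loop described above can be sketched as a minimal evolutionary pipeline. The sketch below is an illustration only: the three `*_llm` functions are hypothetical stand-ins for real model calls (the paper does not publish this code), and the operator choices, population size, and acceptance logic are assumptions made for clarity.

```python
import random

# Hypothetical stubs for the three LLM roles; a real pipeline would call
# actual model endpoints here. The stubs are deterministic so the control
# flow can run end to end.

def instructor_llm(seed: str, operator: str) -> str:
    """Evolve an instruction with a genetic operator (mutation or crossover)."""
    return f"[{operator}] {seed}"

def coder_llm(instruction: str) -> str:
    """Generate candidate code for an instruction."""
    return f"def solve():\n    # solution for: {instruction}\n    return 42"

def judge_llm(instruction: str, code: str) -> bool:
    """Judge whether the code plausibly answers the instruction (stubbed)."""
    return "solution for" in code

def genetic_instruct(seeds, generations=2, population=4):
    """One possible sketch of the evolutionary synthesis loop:
    mutate/cross instructions, generate code, keep Judge-approved pairs."""
    pool = list(seeds)
    accepted = []
    for _ in range(generations):
        # Instructor-LLM applies genetic operators to the instruction pool.
        candidates = []
        for _ in range(population):
            if len(pool) >= 2 and random.random() < 0.5:
                a, b = random.sample(pool, 2)
                candidates.append(instructor_llm(f"{a} + {b}", "crossover"))
            else:
                candidates.append(instructor_llm(random.choice(pool), "mutation"))
        # Coder-LLM writes code; Judge-LLM filters (the selection step).
        for instr in candidates:
            code = coder_llm(instr)
            if judge_llm(instr, code):
                accepted.append((instr, code))
                pool.append(instr)  # surviving instructions rejoin the pool
    return accepted

pairs = genetic_instruct(["Reverse a string", "Sum a list of numbers"])
```

Because generation is independent per candidate, this loop parallelizes naturally across workers, which matches the scalability claim in the abstract.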