🤖 AI Summary
Large language models (LLMs) often generate syntactically fluent yet semantically misaligned or logically incorrect code, especially on structured reasoning tasks. Existing Chain-of-Thought (CoT) prompting is verbose and inefficient, while Chain-of-Draft (CoD) sacrifices output stability due to the model's inherent randomness. Method: We propose a reinforcement learning-guided Chain-of-Draft framework that formulates solution selection as a contextual bandit problem: it dynamically ranks multiple candidate drafts using interpretable features (e.g., code complexity and reasoning-structure consistency) and integrates policy-guided prompting with a pay-only-for-selected-output billing mechanism. Results: Evaluated on MBPP and BigCodeBench, our method matches or exceeds CoT and CoD in functional correctness while significantly improving token efficiency, reducing user-side inference cost by over 50%. It thus advances the trade-off among correctness, computational efficiency, and deployment sustainability.
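To make the selection step concrete, here is a minimal sketch of contextual-bandit draft selection. It is an illustrative assumption, not the paper's implementation: the `features` function, the specific LinUCB policy, and the toy reward are all hypothetical; the summary states only that selection is a contextual bandit over interpretable features such as code complexity and reasoning structure.

```python
# Hypothetical sketch: LinUCB-style contextual bandit over candidate drafts.
# Feature choices and the policy are assumptions for illustration only.
import ast
import numpy as np

def features(draft: str) -> np.ndarray:
    """Map a candidate draft to interpretable features:
    a bias term, code complexity (AST node count), and draft length."""
    try:
        n_nodes = sum(1 for _ in ast.walk(ast.parse(draft)))
    except SyntaxError:
        n_nodes = 0  # unparsable drafts get no complexity credit
    return np.array([1.0, n_nodes / 100.0, len(draft.split()) / 100.0])

class LinUCBSelector:
    """Shared-parameter LinUCB: picks the draft whose upper confidence
    bound on predicted reward is highest, then learns from feedback."""
    def __init__(self, dim: int = 3, alpha: float = 1.0):
        self.A = np.eye(dim)      # ridge-regularized design matrix
        self.b = np.zeros(dim)    # accumulated reward-weighted contexts
        self.alpha = alpha        # exploration width

    def select(self, drafts: list[str]) -> int:
        A_inv = np.linalg.inv(self.A)
        theta = A_inv @ self.b    # current reward-model estimate
        ucb = [features(d) @ theta
               + self.alpha * np.sqrt(features(d) @ A_inv @ features(d))
               for d in drafts]
        return int(np.argmax(ucb))

    def update(self, draft: str, reward: float) -> None:
        x = features(draft)
        self.A += np.outer(x, x)
        self.b += reward * x

# Toy usage: reward the selected draft if it passes tests, minus a token penalty.
selector = LinUCBSelector()
drafts = ["def add(a, b): return a + b",
          "def add(a, b):\n    s = a\n    s += b\n    return s"]
i = selector.select(drafts)
passed = True  # stand-in for running the benchmark's unit tests
reward = (1.0 if passed else 0.0) - 0.001 * len(drafts[i].split())
selector.update(drafts[i], reward)
```

Under the pay-only-for-selected-output scheme, only `drafts[i]` would be billed to the user, which is where the claimed cost reduction comes from.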
📝 Abstract
LLMs demonstrate surface-level fluency in code generation but struggle with structured reasoning tasks that require correctness and semantic alignment. While Chain-of-Thought (CoT) prompting enhances reasoning through intermediate steps, it suffers from verbosity and inefficiency. Chain-of-Draft (CoD) prompting offers more concise reasoning, but the stochastic nature of LLMs produces solutions of varying quality, making optimal selection challenging. We propose MultiCoD, a reinforcement learning framework that learns to select the most promising candidate from CoD-generated solutions. Our approach uses strategy-guided prompting to encourage diverse reasoning styles and models solution selection as a contextual bandit problem. The framework optimizes over interpretable features, including code complexity, reasoning structure, and strategic metadata, through a reward function that balances correctness, efficiency, and clarity. Experiments on MBPP, BigCodeBench, SWE-bench Verified, and Defects4J show that MultiCoD outperforms, and in some cases is on par with, standard prompting, CoT, and CoD baselines, while achieving cost and token efficiency from the user's perspective through a multi-candidate design that charges only for the selected output. This reduces user billing by over 50% and improves LLM response quality, making MultiCoD more sustainable and scalable for real-world deployment. Our code is available at https://anonymous.4open.science/r/MultiCoD.
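For concreteness, one plausible form of the reward balancing correctness, efficiency, and clarity is a weighted combination of a test-pass indicator, a token-cost penalty, and a clarity score. The weights and the exact terms below are assumptions for illustration, not the paper's published objective:

```latex
% Illustrative reward for a selected candidate c; the weights \lambda_i and
% the clarity term are assumptions, not taken from the paper.
r(c) = \lambda_1 \,\mathrm{pass}(c) - \lambda_2 \,\mathrm{tokens}(c) + \lambda_3 \,\mathrm{clarity}(c),
\qquad \lambda_1, \lambda_2, \lambda_3 > 0
```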