🤖 AI Summary
Large language models (LLMs) often generate syntactically fluent yet semantically misaligned or logically incorrect code, especially on structured reasoning tasks. Existing Chain-of-Thought (CoT) prompting is verbose and inefficient, while Chain-of-Draft (CoD) sacrifices output stability due to the model's inherent randomness. Method: We propose a reinforcement learning-guided Chain-of-Draft framework that formulates solution selection as a contextual bandit problem: it dynamically ranks multiple candidate drafts using interpretable features (e.g., code complexity and reasoning-structure consistency) and integrates policy-guided prompting with a pay-only-for-selected-output billing mechanism. Results: Evaluated on MBPP and BigCodeBench, our method matches or exceeds CoT and CoD in functional correctness while significantly improving token efficiency, reducing user-side inference cost by over 50%. It thus advances the trade-off among correctness, computational efficiency, and deployment sustainability.
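To make the selection step concrete, here is a minimal sketch of contextual-bandit draft selection. It is an illustrative assumption, not the paper's implementation: the `features` function, the specific LinUCB policy, and the toy reward are all hypothetical; the summary states only that selection is a contextual bandit over interpretable features such as code complexity and reasoning structure.

```python
# Hypothetical sketch: LinUCB-style contextual bandit over candidate drafts.
# Feature choices and the policy are assumptions for illustration only.
import ast
import numpy as np

def features(draft: str) -> np.ndarray:
    """Map a candidate draft to interpretable features:
    a bias term, code complexity (AST node count), and draft length."""
    try:
        n_nodes = sum(1 for _ in ast.walk(ast.parse(draft)))
    except SyntaxError:
        n_nodes = 0  # unparsable drafts get no complexity credit
    return np.array([1.0, n_nodes / 100.0, len(draft.split()) / 100.0])

class LinUCBSelector:
    """Shared-parameter LinUCB: picks the draft whose upper confidence
    bound on predicted reward is highest, then learns from feedback."""
    def __init__(self, dim: int = 3, alpha: float = 1.0):
        self.A = np.eye(dim)      # ridge-regularized design matrix
        self.b = np.zeros(dim)    # accumulated reward-weighted contexts
        self.alpha = alpha        # exploration width

    def select(self, drafts: list[str]) -> int:
        A_inv = np.linalg.inv(self.A)
        theta = A_inv @ self.b    # current reward-model estimate
        ucb = [features(d) @ theta
               + self.alpha * np.sqrt(features(d) @ A_inv @ features(d))
               for d in drafts]
        return int(np.argmax(ucb))

    def update(self, draft: str, reward: float) -> None:
        x = features(draft)
        self.A += np.outer(x, x)
        self.b += reward * x

# Toy usage: reward the selected draft if it passes tests, minus a token penalty.
selector = LinUCBSelector()
drafts = ["def add(a, b): return a + b",
          "def add(a, b):\n    s = a\n    s += b\n    return s"]
i = selector.select(drafts)
passed = True  # stand-in for running the benchmark's unit tests
reward = (1.0 if passed else 0.0) - 0.001 * len(drafts[i].split())
selector.update(drafts[i], reward)
```

Under the pay-only-for-selected-output scheme, only `drafts[i]` would be billed to the user, which is where the claimed cost reduction comes from.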
📝 Abstract
LLMs demonstrate surface-level fluency in code generation but struggle with structured reasoning tasks that require correctness and semantic alignment. While Chain-of-Thought (CoT) prompting enhances reasoning through intermediate steps, it suffers from verbosity and inefficiency. Chain-of-Draft (CoD) prompting offers more concise reasoning, but the stochastic nature of LLMs produces solutions of varying quality, making optimal selection challenging. We propose MultiCoD, a reinforcement learning framework that learns to select the most promising candidate from CoD-generated solutions. Our approach uses strategy-guided prompting to encourage diverse reasoning styles and models solution selection as a contextual bandit problem. The framework optimizes over interpretable features, including code complexity, reasoning structure, and strategic metadata, through a reward function that balances correctness, efficiency, and clarity. Experiments on MBPP, BigCodeBench, SWE-bench Verified, and Defects4J show that MultiCoD outperforms, and in some cases is on par with, standard prompting, CoT, and CoD baselines, while achieving cost and token efficiency from the user's perspective through a multi-candidate design that charges only for the selected output. This reduces user billing by over 50% and improves LLM response quality, making MultiCoD more sustainable and scalable for real-world deployment. Our code is available at https://anonymous.4open.science/r/MultiCoD.
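For concreteness, one plausible form of the reward balancing correctness, efficiency, and clarity is a weighted combination of a test-pass indicator, a token-cost penalty, and a clarity score. The weights and the exact terms below are assumptions for illustration, not the paper's published objective:

```latex
% Illustrative reward for a selected candidate c; the weights \lambda_i and
% the clarity term are assumptions, not taken from the paper.
r(c) = \lambda_1 \,\mathrm{pass}(c) - \lambda_2 \,\mathrm{tokens}(c) + \lambda_3 \,\mathrm{clarity}(c),
\qquad \lambda_1, \lambda_2, \lambda_3 > 0
```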