🤖 AI Summary
Large language models (LLMs) achieve strong performance on code generation benchmarks but remain limited in complex software engineering tasks—such as multi-file debugging, requirement comprehension, and system-level refactoring—due to insufficient reasoning depth and a lack of process controllability. To address these limitations, the authors propose CURA, a coding agent framework integrating Verbal Process Supervision (VPS): a structured mechanism that explicitly models reasoning steps, calibrates intermediate states via natural-language feedback, and couples multi-step code understanding with test-time process supervision. Built on reasoning-optimized models (e.g., o3-mini), CURA achieves a 3.65% absolute improvement over baseline models on challenging benchmarks such as BigCodeBench, attaining state-of-the-art performance. Its core contribution is shifting LLM-based coding agents from opaque generative pipelines toward transparent, interpretable, reasoning-centric frameworks that admit human intervention.
📝 Abstract
The emergence of large language models and their application as AI agents has significantly advanced the state of the art on code generation benchmarks, transforming modern software engineering tasks. However, even with reasoning models that leverage test-time compute, these systems still struggle with complex software engineering challenges. This work introduces CURA, a code understanding and reasoning agent system enhanced with verbal process supervision (VPS), achieving a 3.65% improvement over baseline models on challenging benchmarks such as BigCodeBench. Furthermore, CURA, when paired with the o3-mini model and VPS techniques, attains state-of-the-art performance. This work represents a step toward integrating reasoning-driven architectures with LLM-based code generation, enabling agentic reasoning for language models on complex software engineering tasks.
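To make the verbal-process-supervision idea concrete, the loop below is a minimal sketch of how such an agent might be structured: a generator proposes the next reasoning/code step, a verbal supervisor critiques the intermediate state in natural language, and the loop continues until the supervisor accepts. All names (`generate_step`, `verbal_supervisor`, `vps_agent`) and the toy acceptance criterion are illustrative assumptions, not the paper's actual API; in CURA the two roles would be backed by LLM calls.

```python
from dataclasses import dataclass, field

@dataclass
class Trace:
    steps: list = field(default_factory=list)     # reasoning/code steps so far
    feedback: list = field(default_factory=list)  # verbal critiques, one per step

def generate_step(task, trace):
    """Stand-in for an LLM proposing the next reasoning or code step."""
    return f"step {len(trace.steps)}: refine solution for {task!r}"

def verbal_supervisor(step, trace):
    """Stand-in for a VPS model that critiques the step in natural language.

    Returns (ok, critique). Here a toy criterion accepts once three
    steps have accumulated; a real supervisor would judge the content.
    """
    ok = len(trace.steps) >= 3
    critique = "looks complete" if ok else "intermediate state needs calibration"
    return ok, critique

def vps_agent(task, max_steps=5):
    """Generate-critique loop: supervision feedback gates each step."""
    trace = Trace()
    for _ in range(max_steps):
        trace.steps.append(generate_step(task, trace))
        ok, critique = verbal_supervisor(trace.steps[-1], trace)
        trace.feedback.append(critique)
        if ok:
            break
    return trace

trace = vps_agent("sort a list")
print(len(trace.steps))    # 3 with the toy acceptance criterion
print(trace.feedback[-1])  # "looks complete"
```

The point of the structure is that every intermediate state is paired with an explicit verbal critique, so the trajectory is inspectable and a human (or stronger model) can intervene mid-trajectory rather than only judging the final output.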