🤖 AI Summary
Current large language models (LLMs) lack the ability to recognize ambiguous user requirements and proactively seek clarifications during code generation, often leading to incorrect implementations. To address this, we propose the first formal modeling of “clarification awareness” as a foundational capability for code LLMs, introducing a clarification-aware joint optimization paradigm. Our approach synthesizes high-quality clarification-scenario data and integrates instruction tuning with multi-stage, fine-grained training to jointly enhance both clarification dialogue proficiency and programming competence. A key innovation is the engineering-inspired “confirm-then-code” reasoning mechanism, which explicitly decouples requirement validation from code synthesis. Experiments demonstrate substantial improvements: the model’s capacity to initiate meaningful clarification dialogues increases significantly, clarification accuracy improves by 42%, and user requirement satisfaction rises by 31%, all while preserving competitive performance across standard programming benchmarks.
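The "confirm-then-code" mechanism described above can be sketched as a two-stage loop: the model first judges whether the request is ambiguous, and only emits code once requirements are confirmed. The sketch below is illustrative, not the paper's implementation; `ask_model`, the `CLARIFY:` convention, and the marker list are all hypothetical stand-ins for a real chat-style LLM call.

```python
# Hypothetical sketch of a confirm-then-code loop. The stub LLM flags
# obviously underspecified prompts; a real system would route the
# clarifying question back to the user instead of auto-patching it.

AMBIGUOUS_MARKERS = ("etc.", "somehow", "appropriately", "as needed")

def ask_model(prompt: str) -> str:
    """Stub LLM: asks for clarification on vague prompts, else 'emits' code."""
    if any(marker in prompt for marker in AMBIGUOUS_MARKERS):
        return "CLARIFY: Which cases exactly should be handled?"
    return "def solution(xs):\n    return sorted(xs)"

def confirm_then_code(requirement: str, max_rounds: int = 3) -> str:
    """Resolve clarification rounds before generating any code."""
    for _ in range(max_rounds):
        reply = ask_model(requirement)
        if not reply.startswith("CLARIFY:"):
            return reply  # requirements confirmed -> code synthesis
        # Simulate the user's answer by making the prompt concrete.
        requirement = requirement.replace("etc.", "for positive integers")
    return ask_model(requirement)

print(confirm_then_code("Sort the list, handle edge cases etc."))
```

The key design point, decoupling requirement validation from code synthesis, shows up as the early `return` guarded by the clarification check: code is never produced while the prompt still trips the ambiguity test.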
📝 Abstract
Large language models (LLMs) have demonstrated remarkable capabilities in code generation tasks. However, a significant gap remains between their current performance and that of expert software engineers. A key differentiator is that human engineers actively seek clarification when faced with ambiguous requirements, while LLMs typically generate code regardless of uncertainties in the problem description. We present ClarifyCoder, a novel framework with synthetic data generation and instruction-tuning that enables LLMs to identify ambiguities and request clarification before proceeding with code generation. While recent work has focused on LLM-based agents for iterative code generation, we argue that the fundamental ability to recognize and query ambiguous requirements should be intrinsic to the models themselves. Our approach consists of two main components: (1) a data synthesis technique that augments existing programming datasets with scenarios requiring clarification to generate clarification-aware training data, and (2) a fine-tuning strategy that teaches models to prioritize seeking clarification over immediate code generation when faced with incomplete or ambiguous requirements. We further provide an empirical analysis of integrating ClarifyCoder with standard fine-tuning for a joint optimization of both clarify-awareness and coding ability. Experimental results demonstrate that ClarifyCoder significantly improves the communication capabilities of Code LLMs through meaningful clarification dialogues while maintaining code generation capabilities.
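The first component, synthesizing clarification-aware training data from existing programming datasets, could look roughly like the following: take a fully specified seed problem, delete one constraint to create an ambiguous variant, and pair that variant with a target clarifying question rather than code. This is a minimal sketch under assumed conventions (the function name, the example-dict schema, and the seed problem are all invented for illustration), not the paper's actual pipeline.

```python
# Illustrative clarification-aware data synthesis: degrade a well-specified
# problem by removing a constraint, and make the training target a question
# about the removed detail instead of a code solution.

def make_clarification_example(problem: str, constraint: str, question: str) -> dict:
    """Build one instruction-tuning pair from a seed problem.

    `constraint` must appear verbatim in `problem`; removing it creates
    the ambiguity that `question` asks about.
    """
    assert constraint in problem, "seed problem must contain the constraint"
    # Drop the constraint and normalize whitespace left behind.
    ambiguous = " ".join(problem.replace(constraint, "").split())
    return {
        "instruction": ambiguous,         # underspecified prompt shown to model
        "response": question,             # target behavior: ask, don't code
        "meta": {"removed": constraint},  # kept so an answer can be simulated
    }

seed = ("Write a function that sorts a list of integers "
        "in ascending order, treating duplicates as equal.")
example = make_clarification_example(
    seed,
    constraint="in ascending order,",
    question="Should the list be sorted in ascending or descending order?",
)
print(example["instruction"])
print(example["response"])
```

Fine-tuning on pairs like this one is what teaches the model to prefer a clarifying question over immediate code when the prompt is incomplete; mixing such pairs with ordinary problem-to-code pairs corresponds to the joint optimization of clarify-awareness and coding ability analyzed in the abstract.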