🤖 AI Summary
This work addresses the limitation of large language models (LLMs) in delivering proactive, personalized tutoring for complex real-world tasks, with a focus on programming education. It proposes a dialogue-based tutoring agent framework for coding tasks built around two components: the Trace-and-Verify (TRAVER) workflow, which combines knowledge tracing with turn-by-turn verification, and the DICT automated evaluation protocol, which closes the tutoring loop through controllable student simulation, code generation, and execution-based testing. The contributions are threefold: (1) an LLM-driven, proactive, guidance-oriented coding tutor; (2) an interpretable paradigm unifying fine-grained knowledge-state tracking with iterative verification; and (3) empirical results showing significantly higher task-completion rates, together with a systematic diagnosis of LLM bottlenecks in complex tutoring (such as planning consistency and feedback adaptability) and concrete directions for improvement.
📝 Abstract
Intelligent tutoring agents powered by large language models (LLMs) have been increasingly explored to deliver personalized guidance in areas such as language learning and science education. However, their capabilities in guiding users to solve complex real-world tasks remain underexplored. To address this limitation, we focus on coding tutoring, a challenging problem that requires tutors to proactively guide students toward completing predefined coding tasks. We propose a novel agent workflow, Trace-and-Verify (TRAVER), which combines knowledge tracing to estimate a student's knowledge state with turn-by-turn verification to ensure effective guidance toward task completion. We also introduce DICT, an automatic evaluation protocol that assesses tutor agents holistically using controlled student simulation and code generation tests. Extensive experiments reveal the challenges of coding tutoring and demonstrate that TRAVER achieves a significantly higher success rate. Although we use coding tutoring as an example in this paper, our results and findings can be extended beyond coding, providing valuable insights into advancing tutoring agents for a variety of tasks.
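The abstract describes a loop that alternates knowledge tracing (estimating the student's knowledge state each turn) with execution-based verification (checking whether the student's code now solves the task). The sketch below is a minimal toy illustration of that shape; every name (`Task`, `KnowledgeState`, `tutoring_loop`, the simulated student) is a hypothetical stand-in, not the paper's actual API, and the mastery-update rule is an invented placeholder.

```python
"""Toy sketch of a trace-and-verify tutoring loop, under the assumption that
the tutor (1) updates an estimated knowledge state from each student attempt
and (2) verifies the attempt by executing it against the task's tests."""
from dataclasses import dataclass, field

@dataclass
class Task:
    description: str
    tests: list  # (input, expected) pairs for execution-based verification

@dataclass
class KnowledgeState:
    # Estimated mastery per concept, updated each turn (knowledge tracing).
    mastery: dict = field(default_factory=dict)

def verify(code: str, task: Task) -> bool:
    """Execution-based check: run the student's code against the task tests."""
    env = {}
    try:
        exec(code, env)
        return all(env["solve"](x) == y for x, y in task.tests)
    except Exception:
        return False

def tutoring_loop(task, simulated_student, max_turns=5):
    state = KnowledgeState()
    for turn in range(max_turns):
        # Trace: the student attempts the task; the tutor reinforces the
        # concepts the attempt revealed as gaps (placeholder update rule).
        code, gaps = simulated_student(task, state)
        for concept in gaps:
            state.mastery[concept] = state.mastery.get(concept, 0.0) + 0.5
        # Verify: stop as soon as the student's code passes all tests.
        if verify(code, task):
            return turn + 1, True
    return max_turns, False

# Toy simulated student: fixes its off-by-one bug once the "loops" concept
# has been reinforced enough by the tutor's guidance.
def student(task, state):
    if state.mastery.get("loops", 0.0) >= 0.5:
        return "def solve(n):\n    return sum(range(1, n + 1))", []
    return "def solve(n):\n    return sum(range(1, n))", ["loops"]

task = Task("Sum the integers 1..n", tests=[(3, 6), (4, 10)])
turns, success = tutoring_loop(task, student)
print(turns, success)  # the toy student succeeds on the second turn
```

The point of the sketch is the control flow: tracing and verification interleave every turn, so guidance adapts to the student's state rather than being scripted up front.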