When Verification Hurts: Asymmetric Effects of Multi-Agent Feedback in Logic Proof Tutoring

📅 2026-03-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the insufficient reliability of step-level feedback from large language models in symbolic reasoning tutoring and the lack of fine-grained evaluation. The authors construct a knowledge-graph benchmark comprising 516 propositional logic proof states and propose a Tutor-Teacher-Judge multi-agent pipeline to assess feedback quality under varying information-access conditions. They introduce verification mechanisms and complexity metrics to analyze performance, revealing for the first time an asymmetric effect of verification: it improves performance when upstream feedback is unreliable, but reduces accuracy by 4–6 percentage points under high-reliability conditions due to over-specification. Furthermore, all methods exhibit a pronounced performance bottleneck on proof tasks at complexity levels 4–5 or higher, underscoring the need for adaptive tutoring architectures that route strategies based on both task complexity and feedback reliability.
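
The summary names the three agent roles but the page carries no code, so the sketch below shows one plausible wiring of the pipeline. The `llm_call` stub, the `ProofState` fields, and the prompt wording are all assumptions for illustration, not the authors' implementation.

```python
from dataclasses import dataclass


def llm_call(prompt: str) -> str:
    """Stand-in for any chat-completion client; plug in a real model here."""
    raise NotImplementedError


@dataclass
class ProofState:
    premises: list[str]       # e.g. ["p -> q", "p"]
    goal: str                 # e.g. "q"
    steps_so_far: list[str]   # the learner's derivation up to this point


def tutor_feedback(state: ProofState, partial_solution: list[str]) -> str:
    # Tutor role: sees only a partial solution path.
    return llm_call(
        f"Proof state: {state}\nPartial solution: {partial_solution}\n"
        "Give step-level feedback on the learner's latest step."
    )


def teacher_feedback(state: ProofState, full_derivation: list[str]) -> str:
    # Teacher role: sees the full verified derivation.
    return llm_call(
        f"Proof state: {state}\nFull derivation: {full_derivation}\n"
        "Give step-level feedback on the learner's latest step."
    )


def judge_verify(state: ProofState, tutor_output: str) -> str:
    # Judge role: verifies the Tutor's feedback before it reaches the learner.
    return llm_call(
        f"Proof state: {state}\nProposed feedback: {tutor_output}\n"
        "Check this feedback against the proof state; revise it if it misjudges the step."
    )
```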
📝 Abstract
Large language models (LLMs) are increasingly used for automated tutoring, but their reliability in structured symbolic domains remains unclear. We study step-level feedback for propositional logic proofs, which require precise symbolic reasoning aligned with a learner's current proof state. We introduce a knowledge-graph-grounded benchmark of 516 unique proof states with step-level annotations and difficulty metrics. Unlike prior tutoring evaluations that rely on model self-assessment or binary correctness, our framework enables fine-grained analysis of feedback quality against verified solution paths. We evaluate three role-specialized pipelines with varying solution access: Tutor (partial solution access), Teacher (full derivation access), and Judge (verification of Tutor feedback). Our results reveal a striking asymmetry: verification improves outcomes when upstream feedback is error-prone (<70% accuracy), but degrades performance by 4–6 percentage points through over-specification when feedback is already reliable (>85%). Critically, we identify a shared complexity ceiling; no model or pipeline reliably succeeds on proof states exceeding complexity 4–5. These findings challenge the assumption that adding verifiers or richer context universally improves tutoring, motivating adaptive, difficulty-aware architectures that route problems by estimated complexity and upstream reliability.
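
To make the closing recommendation concrete, here is a minimal routing-policy sketch built only from the numbers in the abstract. The thresholds (0.70, 0.85, complexity 4.5) mirror the reported bands; the function, route names, and the fallback for the uncharacterized middle band are hypothetical.

```python
def route_feedback(complexity: float, upstream_accuracy: float) -> str:
    """Pick a tutoring pipeline from estimated proof-state complexity and the
    measured accuracy of upstream (Tutor) feedback. Thresholds follow the
    bands reported in the abstract; the policy itself is illustrative."""
    if complexity >= 4.5:
        # Shared ceiling: no model or pipeline is reliable above complexity
        # 4-5, so hand the problem to a human tutor.
        return "escalate_to_human"
    if upstream_accuracy < 0.70:
        # Error-prone upstream feedback: verification helps, so add the Judge.
        return "tutor_plus_judge"
    if upstream_accuracy > 0.85:
        # Already-reliable feedback: the Judge over-specifies and costs
        # 4-6 percentage points, so skip verification.
        return "tutor_only"
    # The 70-85% band is not characterized in the abstract; defaulting to
    # verification here is an assumption, not a reported result.
    return "tutor_plus_judge"
```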
Problem

Research questions and friction points this paper is trying to address.

logic proof tutoring
multi-agent feedback
verification
large language models
symbolic reasoning
Innovation

Methods, ideas, or system contributions that make the work stand out.

multi-agent feedback
logic proof tutoring
verification asymmetry
knowledge-graph benchmark
adaptive tutoring architecture
👥 Authors
Tahreem Yasir, North Carolina State University
Sutapa Dey Tithi, North Carolina State University
Benyamin Tabarsi, North Carolina State University
Dmitri Droujkov, North Carolina State University
Sam Gilson, North Carolina State University
Yasitha Rajapaksha, North Carolina State University
Xiaoyi Tian, North Carolina State University
DongKuan Xu, North Carolina State University
Tiffany Barnes, Distinguished Professor of Computer Science, North Carolina State University (Educational data mining, Serious Games, Artificial Intelligence, Broadening Participation, CS Education)
Arun Ramesh