When Verification Hurts: Asymmetric Effects of Multi-Agent Feedback in Logic Proof Tutoring

📅 2026-03-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the insufficient reliability of step-level feedback from large language models in symbolic reasoning tutoring and the lack of fine-grained evaluation. The authors construct a knowledge-graph benchmark comprising 516 propositional logic proof states and propose a Tutor-Teacher-Judge multi-agent pipeline to assess feedback quality under varying information-access conditions. They introduce verification mechanisms and complexity metrics to analyze performance, revealing for the first time an asymmetric effect of verification: it improves performance when upstream feedback is unreliable, but reduces accuracy by 4–6 percentage points under high-reliability conditions due to over-specification. Furthermore, all methods exhibit a pronounced performance bottleneck on proof tasks at complexity levels 4–5 or higher, underscoring the need for adaptive tutoring architectures that route strategies based on both task complexity and feedback reliability.
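
The summary names the three agent roles but the page carries no code, so the sketch below shows one plausible wiring of the pipeline. The `llm_call` stub, the `ProofState` fields, and the prompt wording are all assumptions for illustration, not the authors' implementation.

```python
from dataclasses import dataclass


def llm_call(prompt: str) -> str:
    """Stand-in for any chat-completion client; plug in a real model here."""
    raise NotImplementedError


@dataclass
class ProofState:
    premises: list[str]       # e.g. ["p -> q", "p"]
    goal: str                 # e.g. "q"
    steps_so_far: list[str]   # the learner's derivation up to this point


def tutor_feedback(state: ProofState, partial_solution: list[str]) -> str:
    # Tutor role: sees only a partial solution path.
    return llm_call(
        f"Proof state: {state}\nPartial solution: {partial_solution}\n"
        "Give step-level feedback on the learner's latest step."
    )


def teacher_feedback(state: ProofState, full_derivation: list[str]) -> str:
    # Teacher role: sees the full verified derivation.
    return llm_call(
        f"Proof state: {state}\nFull derivation: {full_derivation}\n"
        "Give step-level feedback on the learner's latest step."
    )


def judge_verify(state: ProofState, tutor_output: str) -> str:
    # Judge role: verifies the Tutor's feedback before it reaches the learner.
    return llm_call(
        f"Proof state: {state}\nProposed feedback: {tutor_output}\n"
        "Check this feedback against the proof state; revise it if it misjudges the step."
    )
```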
📝 Abstract
Large language models (LLMs) are increasingly used for automated tutoring, but their reliability in structured symbolic domains remains unclear. We study step-level feedback for propositional logic proofs, which require precise symbolic reasoning aligned with a learner's current proof state. We introduce a knowledge-graph-grounded benchmark of 516 unique proof states with step-level annotations and difficulty metrics. Unlike prior tutoring evaluations that rely on model self-assessment or binary correctness, our framework enables fine-grained analysis of feedback quality against verified solution paths. We evaluate three role-specialized pipelines with varying solution access: Tutor (partial solution access), Teacher (full derivation access), and Judge (verification of Tutor feedback). Our results reveal a striking asymmetry: verification improves outcomes when upstream feedback is error-prone (<70% accuracy), but degrades performance by 4–6 percentage points through over-specification when feedback is already reliable (>85%). Critically, we identify a shared complexity ceiling; no model or pipeline reliably succeeds on proof states exceeding complexity 4–5. These findings challenge the assumption that adding verifiers or richer context universally improves tutoring, motivating adaptive, difficulty-aware architectures that route problems by estimated complexity and upstream reliability.
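
To make the closing recommendation concrete, here is a minimal routing-policy sketch built only from the numbers in the abstract. The thresholds (0.70, 0.85, complexity 4.5) mirror the reported bands; the function, route names, and the fallback for the uncharacterized middle band are hypothetical.

```python
def route_feedback(complexity: float, upstream_accuracy: float) -> str:
    """Pick a tutoring pipeline from estimated proof-state complexity and the
    measured accuracy of upstream (Tutor) feedback. Thresholds follow the
    bands reported in the abstract; the policy itself is illustrative."""
    if complexity >= 4.5:
        # Shared ceiling: no model or pipeline is reliable above complexity
        # 4-5, so hand the problem to a human tutor.
        return "escalate_to_human"
    if upstream_accuracy < 0.70:
        # Error-prone upstream feedback: verification helps, so add the Judge.
        return "tutor_plus_judge"
    if upstream_accuracy > 0.85:
        # Already-reliable feedback: the Judge over-specifies and costs
        # 4-6 percentage points, so skip verification.
        return "tutor_only"
    # The 70-85% band is not characterized in the abstract; defaulting to
    # verification here is an assumption, not a reported result.
    return "tutor_plus_judge"
```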
Problem

Research questions and friction points this paper is trying to address.

logic proof tutoring
multi-agent feedback
verification
large language models
symbolic reasoning
Innovation

Methods, ideas, or system contributions that make the work stand out.

multi-agent feedback
logic proof tutoring
verification asymmetry
knowledge-graph benchmark
adaptive tutoring architecture
👥 Authors
Tahreem Yasir, North Carolina State University
Sutapa Dey Tithi, North Carolina State University
Benyamin Tabarsi, North Carolina State University
Dmitri Droujkov, North Carolina State University
Sam Gilson, North Carolina State University
Yasitha Rajapaksha, North Carolina State University
Xiaoyi Tian, North Carolina State University
DongKuan Xu, North Carolina State University
Tiffany Barnes, Distinguished Professor of Computer Science, North Carolina State University (Educational data mining, Serious Games, Artificial Intelligence, Broadening Participation, CS Education)
Arun Ramesh