🤖 AI Summary
Large vision-language models (VLMs) remain unreliable at scientific and mathematical reasoning, and conventional final-answer evaluation often masks intermediate reasoning errors. To address this, we propose TRACE, a framework that decomposes complex problems into verifiable substeps via Auxiliary Reasoning Sets (ARS), enabling transparent, fine-grained assessment of reasoning trajectories through consistency-based metrics. TRACE further defines confidence regions to explicitly distinguish reliable from unreliable reasoning paths. Experiments demonstrate that ARS consistency strongly correlates with final-answer correctness (Spearman's ρ > 0.92), enables precise localization of erroneous steps, and significantly enhances reasoning interpretability and robustness. By providing trustworthy, process-level supervision signals, TRACE facilitates effective debugging and optimization of VLMs, offering a principled approach to reasoning evaluation beyond black-box answer scoring.
📝 Abstract
Reliable mathematical and scientific reasoning remains an open challenge for large vision-language models. Standard final-answer evaluation often masks reasoning errors, allowing silent failures to persist. To address this gap, we introduce TRACE, a framework for Transparent Reasoning And Consistency Evaluation that diagnoses reasoning trajectories rather than only end results. At its core, TRACE leverages Auxiliary Reasoning Sets (ARS): compact sub-question and answer pairs that decompose complex problems. By evaluating intermediate steps through consistency-based metrics, TRACE exposes failures overlooked by standard evaluation. Our experiments show that consistency across ARS correlates with final-answer correctness and helps pinpoint the reasoning steps where failures arise, offering actionable signals for model improvement. Furthermore, TRACE defines confidence regions that distinguish reliable from unreliable reasoning paths, supporting effective filtering, debugging, and model refinement.
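The key quantities the abstract describes, a per-problem ARS consistency score, its Spearman correlation with final-answer correctness, and a threshold-style confidence region, can be sketched in a few lines. This is an illustrative sketch only: `SubStep`, `ars_consistency`, the exact-match rule, and the 0.7 threshold are assumptions made for this example, not the paper's actual implementation.

```python
from dataclasses import dataclass

# Illustrative sketch: the names and the exact-match rule below are
# assumptions for this example, not TRACE's actual implementation.

@dataclass
class SubStep:
    """One auxiliary sub-question with the model's answer and a reference."""
    question: str
    model_answer: str
    reference_answer: str

def ars_consistency(steps: list) -> float:
    """Fraction of auxiliary sub-steps whose model answer matches the reference."""
    if not steps:
        return 0.0
    hits = sum(
        s.model_answer.strip().lower() == s.reference_answer.strip().lower()
        for s in steps
    )
    return hits / len(steps)

def _avg_ranks(xs):
    """1-based ranks, with tied values assigned their average rank."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman_rho(x, y):
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    rx, ry = _avg_ranks(x), _avg_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / (vx * vy) ** 0.5

# Toy data: per-problem ARS consistency vs. final-answer correctness.
consistency = [1.0, 0.75, 0.5, 0.25, 0.0]
final_correct = [1, 1, 0, 0, 0]
rho = spearman_rho(consistency, final_correct)

# A confidence region could then be a simple threshold on consistency,
# flagging low-consistency trajectories as unreliable (0.7 is arbitrary).
reliable = [c >= 0.7 for c in consistency]
```

In this toy setting, trajectories whose sub-steps mostly agree with the references also tend to produce correct final answers, which is the kind of rank correlation the paper reports at a much stronger level (ρ > 0.92) on real data.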