TRACE: A Framework for Analyzing and Enhancing Stepwise Reasoning in Vision-Language Models

📅 2025-12-05
📈 Citations: 0
Influential: 0
🤖 AI Summary
Large vision-language models (VLMs) exhibit insufficient reliability in scientific and mathematical reasoning, and conventional final-answer evaluation often masks intermediate reasoning errors. To address this, we propose TRACE, a framework that decomposes complex problems into verifiable substeps via Auxiliary Reasoning Sets (ARS), enabling transparent, fine-grained assessment of reasoning trajectories through consistency-based metrics. TRACE further defines confidence regions to explicitly distinguish reliable from unreliable reasoning paths. Experiments demonstrate that ARS consistency strongly correlates with final-answer correctness (Spearman's ρ > 0.92), enables precise localization of erroneous steps, and significantly enhances reasoning interpretability and robustness. By providing trustworthy, process-level supervision signals, TRACE facilitates effective debugging and optimization of VLMs, offering a principled approach to reasoning evaluation that goes beyond black-box answer scoring.

📝 Abstract
Reliable mathematical and scientific reasoning remains an open challenge for large vision-language models. Standard final-answer evaluation often masks reasoning errors, allowing silent failures to persist. To address this gap, we introduce TRACE, a framework for Transparent Reasoning And Consistency Evaluation that diagnoses reasoning trajectories rather than only end results. At its core, TRACE leverages Auxiliary Reasoning Sets (ARS), compact sub-question and answer pairs that decompose complex problems, to evaluate intermediate steps through consistency-based metrics and expose failures overlooked by standard evaluation. Our experiments show that consistency across ARS correlates with final-answer correctness and helps pinpoint the reasoning steps where failures arise, offering actionable signals for model improvement. Furthermore, TRACE defines confidence regions that distinguish reliable from unreliable reasoning paths, supporting effective filtering, debugging, and model refinement.
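The paper does not include reference code here, but the core idea of scoring a trajectory against an Auxiliary Reasoning Set can be sketched as follows. All names and the exact-match agreement metric are our illustrative assumptions, not TRACE's published definitions:

```python
# Hypothetical sketch: score a reasoning trajectory by agreement between
# the model's intermediate claims and an Auxiliary Reasoning Set (ARS).
# Function and key names are illustrative, not the paper's API.

def ars_consistency(model_steps: dict[str, str], ars: dict[str, str]) -> float:
    """Fraction of ARS sub-questions whose expected answer matches the
    model's corresponding intermediate claim (0.0 = none, 1.0 = all)."""
    if not ars:
        return 0.0
    matches = sum(
        1 for question, expected in ars.items()
        if model_steps.get(question, "").strip().lower() == expected.strip().lower()
    )
    return matches / len(ars)

# Example: three sub-questions, the model's steps agree on two of them.
steps = {"base": "3", "height": "4", "area_of_triangle": "5"}
ars = {"base": "3", "height": "4", "area_of_triangle": "6"}
print(ars_consistency(steps, ars))  # two of three agree → 2/3
```

In practice the paper's metrics would compare richer step representations than normalized strings; the sketch only shows the shape of a per-trajectory consistency signal.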
Problem

Research questions and friction points this paper is trying to address.

Diagnoses reasoning errors in vision-language models beyond final answers.
Evaluates intermediate steps using consistency metrics and auxiliary sub-questions.
Distinguishes reliable from unreliable reasoning paths for model improvement.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Decomposes complex problems with auxiliary reasoning sets
Evaluates intermediate steps using consistency-based metrics
Defines confidence regions to filter unreliable reasoning paths
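The confidence-region idea above can be illustrated with a minimal sketch. The thresholds and labels are our assumptions for illustration; the paper's actual regions may be defined differently:

```python
# Hypothetical sketch of TRACE-style confidence regions: bucket reasoning
# trajectories by their ARS consistency score. The thresholds (0.8 / 0.4)
# and labels are illustrative assumptions, not values from the paper.

def confidence_region(score: float, hi: float = 0.8, lo: float = 0.4) -> str:
    """Map a consistency score in [0, 1] to a reliability label."""
    if score >= hi:
        return "reliable"
    if score <= lo:
        return "unreliable"
    return "uncertain"

# Filter a batch of trajectories, keeping only the reliable ones.
trajectories = [("t1", 0.95), ("t2", 0.50), ("t3", 0.20)]
labels = {tid: confidence_region(score) for tid, score in trajectories}
print(labels)  # {'t1': 'reliable', 't2': 'uncertain', 't3': 'unreliable'}
```

Used this way, the regions give a simple, process-level gate for filtering or flagging trajectories before relying on their final answers.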