🤖 AI Summary
Large language models (LLMs) frequently generate spurious chain-of-thought (CoT) reasoning on multiple-choice question answering (MCQA) tasks that are effectively unsolvable for them, producing confident but incorrect answers. This work models per-question solvability, identifying an intermediate learnable regime within the model's capability boundary, and uses solvability estimates to calibrate the credibility of reasoning paths. Methodologically, solvability estimation is integrated into an outcome-supervised reward model, and a group-relative advantage reinforcement learning objective is designed that explicitly penalizes spurious reasoning on low-solvability questions during training. Experiments on mathematical and multimodal MCQA benchmarks show gains of +12.7% in reasoning-process correctness and +5.3% in final answer accuracy, improving both the reliability and robustness of LLM reasoning, particularly on hard or ambiguous questions.
📝 Abstract
Reasoning quality in large language models depends not only on producing correct answers but also on generating valid intermediate steps. We study this through multiple-choice question answering (MCQA), which provides a controlled setting with fixed answer options. Our analysis shows that when questions are effectively unsolvable for a model, spurious chains of thought (CoTs) are more likely to appear, leading to false positives. By estimating the solvability of each question, we uncover an intermediate regime where learning is most effective. Building on this insight, we adapt outcome-supervised reward models and reinforcement learning with group-relative advantage to incorporate solvability into their objectives. Across experiments on math and multimodal datasets, these modifications consistently yield higher rates of process-correct reasoning and, in reinforcement learning, improved answer accuracy as well. Our results highlight solvability as a key factor for reducing hallucinations and increasing reliability in CoT reasoning.
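The solvability-aware group-relative advantage described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: solvability is estimated here as the empirical pass rate over a group of sampled rollouts, and the threshold and penalty values are assumptions chosen for illustration.

```python
import statistics

def solvability(rewards):
    """Estimate question solvability as the empirical pass rate over a
    group of sampled rollouts (hypothetical estimator, one of several
    possible choices)."""
    return sum(r > 0 for r in rewards) / len(rewards)

def group_relative_advantages(rewards, low_solv_threshold=0.25, penalty=0.5):
    """GRPO-style group-relative advantages, modulated by solvability.

    On low-solvability questions, rewards for correct final answers are
    down-weighted, since a correct answer there is more likely to stem
    from a spurious chain of thought. The threshold and penalty are
    illustrative assumptions, not the paper's settings.
    """
    if solvability(rewards) < low_solv_threshold:
        # Penalize likely-spurious successes instead of fully rewarding them.
        rewards = [r - penalty if r > 0 else r for r in rewards]
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against all-equal groups
    return [(r - mean) / std for r in rewards]

# Rollout rewards for one question: 1 = final answer correct, 0 = incorrect.
# Only 1 of 8 rollouts succeeds, so the question is treated as low-solvability
# and the lone success receives a dampened (but still positive) advantage.
advs = group_relative_advantages([1, 0, 0, 0, 0, 0, 0, 0])
```

The key design point mirrored from the abstract is that the advantage signal is computed relative to the group and then reshaped by the solvability estimate, so low-solvability questions contribute a weaker incentive toward answer-only correctness.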