🤖 AI Summary
Large language models (LLMs) frequently generate spurious chain-of-thought (CoT) reasoning on multiple-choice question answering (MCQA) tasks that are effectively unsolvable for them, producing confident but incorrect answers. This work models per-question solvability, identifying an intermediate learnable regime within the model's capability boundary, and uses solvability estimates to calibrate the credibility of reasoning paths. Methodologically, solvability estimation is integrated into an outcome-supervised reward model, and a group-relative advantage reinforcement learning objective is designed that explicitly penalizes spurious reasoning on low-solvability questions during training. Experiments on mathematical and multimodal MCQA benchmarks show gains of +12.7% in reasoning-process correctness and +5.3% in final answer accuracy, improving both the reliability and robustness of LLM reasoning, particularly on hard or ambiguous questions.
📝 Abstract
Reasoning quality in large language models depends not only on producing correct answers but also on generating valid intermediate steps. We study this through multiple-choice question answering (MCQA), which provides a controlled setting with fixed answer options. Our analysis shows that when questions are effectively unsolvable for a model, spurious chains of thought (CoTs) are more likely to appear, leading to false positives. By estimating the solvability of each question, we uncover an intermediate regime where learning is most effective. Building on this insight, we adapt outcome-supervised reward models and reinforcement learning with group-relative advantage to incorporate solvability into their objectives. Across experiments on math and multimodal datasets, these modifications consistently yield higher rates of process-correct reasoning and, in reinforcement learning, improved answer accuracy as well. Our results highlight solvability as a key factor for reducing hallucinations and increasing reliability in CoT reasoning.
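The solvability-aware group-relative advantage described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: solvability is estimated here as the empirical pass rate over a group of sampled rollouts, and the threshold and penalty values are assumptions chosen for illustration.

```python
import statistics

def solvability(rewards):
    """Estimate question solvability as the empirical pass rate over a
    group of sampled rollouts (hypothetical estimator, one of several
    possible choices)."""
    return sum(r > 0 for r in rewards) / len(rewards)

def group_relative_advantages(rewards, low_solv_threshold=0.25, penalty=0.5):
    """GRPO-style group-relative advantages, modulated by solvability.

    On low-solvability questions, rewards for correct final answers are
    down-weighted, since a correct answer there is more likely to stem
    from a spurious chain of thought. The threshold and penalty are
    illustrative assumptions, not the paper's settings.
    """
    if solvability(rewards) < low_solv_threshold:
        # Penalize likely-spurious successes instead of fully rewarding them.
        rewards = [r - penalty if r > 0 else r for r in rewards]
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against all-equal groups
    return [(r - mean) / std for r in rewards]

# Rollout rewards for one question: 1 = final answer correct, 0 = incorrect.
# Only 1 of 8 rollouts succeeds, so the question is treated as low-solvability
# and the lone success receives a dampened (but still positive) advantage.
advs = group_relative_advantages([1, 0, 0, 0, 0, 0, 0, 0])
```

The key design point mirrored from the abstract is that the advantage signal is computed relative to the group and then reshaped by the solvability estimate, so low-solvability questions contribute a weaker incentive toward answer-only correctness.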