🤖 AI Summary
This work investigates the hypothetical reasoning capabilities of multimodal large language models (MLLMs) under predefined perturbations and identifies a pervasive failure in compositional hypothetical reasoning. To address this, we introduce MARS-Bench—the first dedicated benchmark for evaluating hypothetical reasoning in MLLMs—and propose Active Deduction (AD), a novel reinforcement learning paradigm. AD explicitly guides models through stepwise, prompt-driven composite reasoning, multimodal instruction tuning, and predefined sensitivity modeling. Crucially, AD achieves the first simultaneous improvement in both hypothetical reasoning and general-purpose question answering (QA). On MARS-Bench, it boosts hypothetical reasoning accuracy by an average of 23.6% across 12 prominent open- and closed-source MLLMs, without degrading general QA performance. Furthermore, AD provides an interpretable framework for analyzing reasoning trajectories, enabling transparent diagnosis and validation of hypothetical inference processes.
📝 Abstract
Recently, Multimodal Large Language Models (MLLMs) have achieved significant success across multiple disciplines due to their exceptional instruction-following capabilities and extensive world knowledge. However, whether these MLLMs possess human-like compositional reasoning abilities remains an open problem. To unveil their reasoning behaviors, we first curate a Multimodal Assumptive Reasoning Benchmark (MARS-Bench) in this paper. Interestingly, we find that most prevalent MLLMs can be easily fooled by the introduction of a presupposition into the question, even though such presuppositions appear trivial to human reasoners. We also propose Active Deduction (AD), a simple yet effective reinforcement learning paradigm that encourages the model to actively perform composite deduction before reaching a final decision. Equipped with the proposed AD method, an MLLM demonstrates significant improvements in assumptive reasoning abilities without compromising its general-purpose question-answering performance. We further provide extensive evaluations of both open-source and closed-source MLLMs on MARS-Bench, along with experimental analyses of the AD method.
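To make the evaluation idea concrete, the sketch below shows one way an assumptive perturbation could be constructed and scored. The prompt template, the toy model, and the exact-match scoring rule are illustrative assumptions for this sketch, not the actual MARS-Bench protocol.

```python
# Hedged sketch: constructing a presupposition-perturbed question and scoring
# a model on the original vs. perturbed variant. All names and templates here
# are hypothetical, chosen only to illustrate the failure mode described above.

def add_presupposition(question: str, assumption: str) -> str:
    """Prefix a question with a hypothetical presupposition."""
    return f"Suppose that {assumption}. {question}"

def accuracy(preds, golds):
    """Fraction of exact-match answers."""
    return sum(p == g for p, g in zip(preds, golds)) / len(golds)

# Toy example: a model that ignores the stated assumption answers the base
# question correctly but fails the perturbed variant.
base_q = "What color is the apple in the image?"
perturbed_q = add_presupposition(base_q, "the apple has been painted blue")

def naive_model(question: str) -> str:
    # Always answers from surface perception, ignoring any stated assumption.
    return "red"

preds = [naive_model(base_q), naive_model(perturbed_q)]
golds = ["red", "blue"]
print(accuracy(preds, golds))  # 0.5: correct on the base question only
```

Under this toy setup, the accuracy gap between the base and perturbed questions is exactly the kind of failure the benchmark is designed to surface.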