🤖 AI Summary
This work addresses the weak performance of current vision-language models (VLMs) on multimodal symbolic logical reasoning, that is, deriving new facts from multimodal input via formal logic. To this end, we introduce MuSLR, the first multimodal benchmark grounded in formal logical rules, comprising 1,093 instances across seven domains. We also propose LogiCAM, a modular framework that decouples multimodal perception from atomic and compound logical inference, combining chain-of-thought reasoning with logic-rule-driven derivation. On MuSLR, the best of seven evaluated state-of-the-art VLMs, GPT-4.1, achieves only 46.8% accuracy; LogiCAM improves its chain-of-thought performance by 14.13%, with especially pronounced gains on complex logics such as first-order logic. A systematic error analysis further shows that roughly 70% of failures stem from cross-modal logical misalignment, the dominant failure mode. All benchmark data, annotations, and source code are publicly released to foster reproducible research in multimodal logical reasoning.
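The summary above describes LogiCAM only at a high level. As a rough illustration of what "logic-rule-driven inference over symbolic facts extracted from each modality" could look like, here is a minimal, hypothetical Python sketch; the `Rule` class, `forward_chain` helper, and all fact and rule names are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch of the decoupling idea described above:
# perception extracts symbolic facts from each modality, then a purely
# symbolic, deterministic engine forward-chains formal rules over them.
from dataclasses import dataclass

@dataclass(frozen=True)
class Rule:
    premises: frozenset  # atomic facts that must all hold
    conclusion: str      # new fact derived when they do

def forward_chain(facts, rules, max_depth=9):
    """Apply rules (generalized modus ponens) until no new facts appear."""
    derived = set(facts)
    for _ in range(max_depth):  # MuSLR reasoning depths range from 2 to 9
        new = {r.conclusion for r in rules
               if r.premises <= derived and r.conclusion not in derived}
        if not new:
            break
        derived |= new
    return derived

# Illustrative facts: one extracted by a vision module from the image,
# one stated in the accompanying text premise.
facts = {"red_light(scene)", "vehicle_approaching(scene)"}
rules = [
    Rule(frozenset({"red_light(scene)", "vehicle_approaching(scene)"}),
         "must_stop(scene)"),
    Rule(frozenset({"must_stop(scene)"}), "brake(scene)"),
]

print(sorted(forward_chain(facts, rules)))
# ['brake(scene)', 'must_stop(scene)', 'red_light(scene)', 'vehicle_approaching(scene)']
```

The design point mirrored here is the decoupling itself: perception populates `facts` from image and text, while inference is symbolic and deterministic, so derivations can be checked against formal rules rather than trusted as free-form generation.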
📝 Abstract
Multimodal symbolic logical reasoning, which aims to deduce new facts from multimodal input via formal logic, is critical in high-stakes applications such as autonomous driving and medical diagnosis, as its rigorous, deterministic reasoning helps prevent serious consequences. To evaluate such capabilities of current state-of-the-art vision-language models (VLMs), we introduce MuSLR, the first benchmark for multimodal symbolic logical reasoning grounded in formal logical rules. MuSLR comprises 1,093 instances across 7 domains, including 35 types of atomic symbolic logic and 976 logical combinations, with reasoning depths ranging from 2 to 9. We evaluate 7 state-of-the-art VLMs on MuSLR and find that they all struggle with multimodal symbolic reasoning, with the best model, GPT-4.1, achieving only 46.8% accuracy. We therefore propose LogiCAM, a modular framework that applies formal logical rules to multimodal inputs, boosting GPT-4.1's Chain-of-Thought performance by 14.13% and delivering even larger gains on complex logics such as first-order logic. We also conduct a comprehensive error analysis, showing that around 70% of failures stem from logical misalignment between modalities, offering key insights to guide future improvements. All data and code are publicly available at https://llm-symbol.github.io/MuSLR.
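To make "reasoning depth" concrete, a hypothetical depth-2 first-order chain of the kind the benchmark targets (not an actual MuSLR instance) could be written as:

$$
\frac{P(a) \qquad \forall x\,\bigl(P(x)\rightarrow Q(x)\bigr)}{Q(a)}
\qquad\text{then}\qquad
\frac{Q(a) \qquad \forall x\,\bigl(Q(x)\rightarrow R(x)\bigr)}{R(a)}
$$

Assuming depth counts one rule application (universal instantiation plus modus ponens) per step, chaining these two inferences gives depth 2; the benchmark's harder instances extend such chains up to depth 9, with premises split across image and text.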