🤖 AI Summary
This work addresses the weak performance of current vision-language models (VLMs) on multimodal symbolic logical reasoning, that is, deriving new facts from multimodal input via formal logic. To this end, we introduce MuSLR, the first multimodal benchmark grounded in formal logical rules, comprising 1,093 instances across seven domains. We also propose LogiCAM, a modular framework that decouples multimodal perception from atomic and compound logical inference, combining chain-of-thought reasoning with logic-rule-driven derivation. On MuSLR, the best of seven evaluated state-of-the-art VLMs, GPT-4.1, achieves only 46.8% accuracy; LogiCAM improves its chain-of-thought performance by 14.13%, with especially pronounced gains on complex logics such as first-order logic. A systematic error analysis further shows that roughly 70% of failures stem from cross-modal logical misalignment, the dominant failure mode. All benchmark data, annotations, and source code are publicly released to foster reproducible research in multimodal logical reasoning.
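The summary above describes LogiCAM only at a high level. As a rough illustration of what "logic-rule-driven inference over symbolic facts extracted from each modality" could look like, here is a minimal, hypothetical Python sketch; the `Rule` class, `forward_chain` helper, and all fact and rule names are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch of the decoupling idea described above:
# perception extracts symbolic facts from each modality, then a purely
# symbolic, deterministic engine forward-chains formal rules over them.
from dataclasses import dataclass

@dataclass(frozen=True)
class Rule:
    premises: frozenset  # atomic facts that must all hold
    conclusion: str      # new fact derived when they do

def forward_chain(facts, rules, max_depth=9):
    """Apply rules (generalized modus ponens) until no new facts appear."""
    derived = set(facts)
    for _ in range(max_depth):  # MuSLR reasoning depths range from 2 to 9
        new = {r.conclusion for r in rules
               if r.premises <= derived and r.conclusion not in derived}
        if not new:
            break
        derived |= new
    return derived

# Illustrative facts: one extracted by a vision module from the image,
# one stated in the accompanying text premise.
facts = {"red_light(scene)", "vehicle_approaching(scene)"}
rules = [
    Rule(frozenset({"red_light(scene)", "vehicle_approaching(scene)"}),
         "must_stop(scene)"),
    Rule(frozenset({"must_stop(scene)"}), "brake(scene)"),
]

print(sorted(forward_chain(facts, rules)))
# ['brake(scene)', 'must_stop(scene)', 'red_light(scene)', 'vehicle_approaching(scene)']
```

The design point mirrored here is the decoupling itself: perception populates `facts` from image and text, while inference is symbolic and deterministic, so derivations can be checked against formal rules rather than trusted as free-form generation.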
📝 Abstract
Multimodal symbolic logical reasoning, which aims to deduce new facts from multimodal input via formal logic, is critical in high-stakes applications such as autonomous driving and medical diagnosis, as its rigorous, deterministic reasoning helps prevent serious consequences. To evaluate such capabilities of current state-of-the-art vision-language models (VLMs), we introduce MuSLR, the first benchmark for multimodal symbolic logical reasoning grounded in formal logical rules. MuSLR comprises 1,093 instances across 7 domains, including 35 types of atomic symbolic logic and 976 logical combinations, with reasoning depths ranging from 2 to 9. We evaluate 7 state-of-the-art VLMs on MuSLR and find that they all struggle with multimodal symbolic reasoning, with the best model, GPT-4.1, achieving only 46.8% accuracy. We therefore propose LogiCAM, a modular framework that applies formal logical rules to multimodal inputs, boosting GPT-4.1's Chain-of-Thought performance by 14.13% and delivering even larger gains on complex logics such as first-order logic. We also conduct a comprehensive error analysis, showing that around 70% of failures stem from logical misalignment between modalities, offering key insights to guide future improvements. All data and code are publicly available at https://llm-symbol.github.io/MuSLR.
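To make "reasoning depth" concrete, a hypothetical depth-2 first-order chain of the kind the benchmark targets (not an actual MuSLR instance) could be written as:

$$
\frac{P(a) \qquad \forall x\,\bigl(P(x)\rightarrow Q(x)\bigr)}{Q(a)}
\qquad\text{then}\qquad
\frac{Q(a) \qquad \forall x\,\bigl(Q(x)\rightarrow R(x)\bigr)}{R(a)}
$$

Assuming depth counts one rule application (universal instantiation plus modus ponens) per step, chaining these two inferences gives depth 2; the benchmark's harder instances extend such chains up to depth 9, with premises split across image and text.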