Seeing is Understanding: Unlocking Causal Attention into Modality-Mutual Attention for Multimodal LLMs

📅 2025-03-04
📈 Citations: 0
Influential: 0
🤖 AI Summary
Multimodal large language models (MLLMs) suffer from vision-language misalignment: their generated responses are not factually consistent with the given image-text inputs. The authors trace this to the causal attention mechanism of decoder-only LLMs, which prevents earlier image tokens from incorporating information from later text tokens. They propose AKI, which unlocks causal attention into modality-mutual attention (MMA) so that image tokens can attend to text tokens. MMA introduces no additional parameters and does not increase training time, and the design is intended to generalize across modalities. Across 12 mainstream multimodal understanding benchmarks, AKI improves performance by 7.2% on average. The code and a lightweight model, AKI-4B, are publicly released.

📝 Abstract
Recent Multimodal Large Language Models (MLLMs) have demonstrated significant progress in perceiving and reasoning over multimodal inquiries, ushering in a new research era for foundation models. However, vision-language misalignment in MLLMs has emerged as a critical challenge, where the textual responses generated by these models are not factually aligned with the given text-image inputs. Existing efforts to address vision-language misalignment have focused on developing specialized vision-language connectors or leveraging visual instruction tuning from diverse domains. In this paper, we tackle this issue from a fundamental yet unexplored perspective by revisiting the core architecture of MLLMs. Most MLLMs are typically built on decoder-only LLMs consisting of a causal attention mechanism, which limits the ability of earlier modalities (e.g., images) to incorporate information from later modalities (e.g., text). To address this problem, we propose AKI, a novel MLLM that unlocks causal attention into modality-mutual attention (MMA) to enable image tokens to attend to text tokens. This simple yet effective design allows AKI to achieve superior performance in 12 multimodal understanding benchmarks (+7.2% on average) without introducing additional parameters and increasing training time. Our MMA design is intended to be generic, allowing for application across various modalities, and scalable to accommodate diverse multimodal scenarios. The code is publicly available at https://github.com/sony/aki, and we will release our AKI-4B model to encourage further advancements in MLLMs across various directions.
Problem

Research questions and friction points this paper is trying to address.

Addresses vision-language misalignment in Multimodal Large Language Models (MLLMs).
Proposes AKI, a model enabling image tokens to attend to text tokens.
Improves performance in multimodal understanding benchmarks without extra parameters.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces modality-mutual attention (MMA) mechanism
Enables image tokens to attend to text tokens
Improves performance across 12 multimodal understanding benchmarks by 7.2% on average
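The "unlocking" described above can be pictured as a change to the attention mask: under a standard causal mask, image tokens placed at the start of the sequence can never see the text prompt that follows them, whereas MMA lets them attend to those text tokens. The sketch below is an illustrative reconstruction based on the abstract, not the authors' implementation; the function name and the assumption that all image tokens precede the text tokens are ours.

```python
def modality_mutual_mask(num_image_tokens: int, num_text_tokens: int):
    """Build a boolean attention mask (True = attention allowed).

    Assumes the sequence layout [image tokens..., text tokens...].
    Starts from the standard causal (lower-triangular) mask, then
    additionally allows every image token to attend to every text
    token, while text tokens keep their usual causal constraint.
    """
    n = num_image_tokens + num_text_tokens
    # Causal baseline: token i may attend to token j only if j <= i.
    mask = [[j <= i for j in range(n)] for i in range(n)]
    # Unlock: image rows may also attend to all text columns.
    for i in range(num_image_tokens):
        for j in range(num_image_tokens, n):
            mask[i][j] = True
    return mask
```

In a real decoder this mask would replace the causal mask inside the self-attention layers, which is why the change adds no parameters: only the masking pattern differs.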
Wei-Yao Wang
Sony Group Corporation, Creative AI Lab
NLP · Interactive Foundation Models · Multimodal Learning · VLM
Zhao Wang
Sony Group Corporation, Tokyo, Japan
Helen Suzuki
Sony Group Corporation, Tokyo, Japan
Yoshiyuki Kobayashi
Sony Group Corporation, Tokyo, Japan