Look Less, Reason More: Block-wise Attention Skipping for Efficient Multimodal LLMs

📅 2026-06-07
🤖 AI Summary
This work addresses the computational inefficiency of multimodal large language models (MLLMs) when processing long visual token sequences, which stems from the quadratic complexity of self-attention and from redundant visual self-attention in deeper layers. To mitigate this, the authors propose Visual-Skip (V-Skip), a training-free inference optimization method that identifies and skips saturated visual self-attention modules in deeper layers, thereby inducing block-wise structured sparsity. V-Skip decouples spatial interaction from semantic evolution and uses a lightweight few-shot calibration procedure to select a task-specific sparsity path. Experiments reported by the authors show that V-Skip accelerates inference across multiple mainstream MLLMs while preserving 94.16%–100.31% of original task performance.
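To give a rough sense of why skipping visual self-attention matters, the sketch below counts the query-key pairs scored per layer under causal attention, with and without visual queries. It is an illustration under stated assumptions, not the paper's cost model: the function names, the "visual tokens first" layout, and the token and layer counts are all hypothetical, and the count ignores projection and FFN cost.

```python
def attention_pairs(n_visual: int, n_text: int, skip_visual_queries: bool) -> int:
    """Count causal query-key pairs scored in one attention layer.

    Assumed layout: all visual tokens first, then all text tokens.
    """
    n = n_visual + n_text
    total = n * (n + 1) // 2  # query i attends to keys 0..i
    if not skip_visual_queries:
        return total
    # With visual tokens first, a visual query only ever sees visual keys.
    visual_rows = n_visual * (n_visual + 1) // 2
    return total - visual_rows


def retained_fraction(n_visual: int, n_text: int, num_layers: int,
                      skipped_layers: set) -> float:
    """Fraction of attention pairs still computed across the whole stack."""
    dense = num_layers * attention_pairs(n_visual, n_text, False)
    sparse = sum(
        attention_pairs(n_visual, n_text, layer in skipped_layers)
        for layer in range(num_layers)
    )
    return sparse / dense


if __name__ == "__main__":
    # Hypothetical sizes: 576 visual tokens, 64 text tokens, 32 layers,
    # visual queries skipped in the deeper half of the stack.
    frac = retained_fraction(576, 64, 32, set(range(16, 32)))
    print(f"attention pairs retained: {frac:.1%}")
```

Because visual tokens usually far outnumber text tokens, the visual rows make up most of the attention matrix, so removing them in a layer removes most of that layer's attention work.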
📝 Abstract
Multimodal Large Language Models (MLLMs) face a significant inference bottleneck due to the quadratic computational cost of self-attention over long visual token sequences. We identify a critical inefficiency in current architectures: Visual Attention Saturation. Our analysis reveals that visual tokens rapidly establish their spatial structure and intra-modal relationships in early layers, rendering visual-to-visual self-attention in deeper layers computationally redundant. Conversely, Feed-Forward Networks (FFNs) in these layers remain essential for projecting visual features into the evolving textual semantic space. Leveraging this insight, we present Visual-Skip (V-Skip), a training-free inference paradigm that decouples spatial interaction from semantic evolution. Rather than discarding tokens, V-Skip imposes block-wise structured sparsity by selectively bypassing saturated visual self-attention modules. Furthermore, recognizing that different downstream tasks demand distinct reasoning depths, V-Skip employs a lightweight, few-shot calibration to dynamically route the task-optimal sparsity path. Extensive experiments demonstrate that V-Skip bypasses redundant visual attention to achieve block-wise sparsity while retaining 94.16% to 100.31% of the original performance across diverse MLLMs. Ultimately, we show that to reason more effectively, models do not need to discard what they see; they simply need to "look less" at the right depth.
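The two ideas in the abstract can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: `block_forward`, `pick_skip_path`, the `min_retention` threshold, and the toy `mean_attn` and `half_ffn` stand-ins are all hypothetical names. It assumes that in a skipped block visual tokens stop acting as attention queries but still pass through the FFN, and that calibration picks the sparsest candidate path whose few-shot score stays close to the dense score.

```python
from typing import Callable, FrozenSet, List, Sequence

Vector = List[float]


def block_forward(hidden: List[Vector], is_visual: Sequence[bool],
                  attn: Callable[[List[Vector], List[int]], List[Vector]],
                  ffn: Callable[[Vector], Vector],
                  skip_visual_attn: bool) -> List[Vector]:
    """One residual block; optionally bypass attention for visual tokens."""
    # Positions that act as attention queries in this block.
    queries = [i for i in range(len(hidden))
               if not (skip_visual_attn and is_visual[i])]
    attn_out = attn(hidden, queries)  # one update vector per query index
    after_attn = [list(h) for h in hidden]
    for i, delta in zip(queries, attn_out):
        after_attn[i] = [a + b for a, b in zip(hidden[i], delta)]
    # The FFN still runs on every token, visual ones included.
    return [[a + b for a, b in zip(h, ffn(h))] for h in after_attn]


def pick_skip_path(candidates: Sequence[FrozenSet[int]],
                   score_fn: Callable[[FrozenSet[int]], float],
                   min_retention: float = 0.95) -> FrozenSet[int]:
    """Pick the candidate skipping the most layers while the calibration
    score stays at or above min_retention times the dense score."""
    dense = score_fn(frozenset())
    best: FrozenSet[int] = frozenset()
    for path in candidates:
        if len(path) > len(best) and score_fn(path) >= min_retention * dense:
            best = path
    return best


# Toy stand-ins so the sketch runs end to end.
def mean_attn(hidden: List[Vector], queries: List[int]) -> List[Vector]:
    dim = len(hidden[0])
    mean = [sum(h[d] for h in hidden) / len(hidden) for d in range(dim)]
    return [mean for _ in queries]


def half_ffn(h: Vector) -> Vector:
    return [0.5 * x for x in h]


if __name__ == "__main__":
    hidden = [[2.0], [4.0], [6.0]]
    is_visual = [True, True, False]
    print(block_forward(hidden, is_visual, mean_attn, half_ffn, True))
```

In a real model, `score_fn` would run the handful of calibration examples with the given layers skipped and return a task metric; here it is left as a callable so the selection logic stays independent of any particular framework.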
Problem

Research questions and friction points this paper is trying to address.

Multimodal Large Language Models
Visual Attention Saturation
Inference Efficiency
Self-Attention
Computational Redundancy
Innovation

Methods, ideas, or system contributions that make the work stand out.

Visual Attention Saturation
Block-wise Sparsity
Training-free Inference
Multimodal LLMs
Dynamic Routing