Improving Visual Token Reduction via Rectifying Distortions for Efficient Multimodal LLM Inference

📅 2026-06-01

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the high computational cost and inference latency that multimodal large language models incur when processing large numbers of visual tokens. Existing visual token reduction methods overlook positional and attentional consistency between the full and reduced token sequences, which distorts the resulting representation. The authors propose RESTORE, a framework that rectifies both kinds of distortion while remaining efficient. It calibrates attention by augmenting attention weights based on relative distances, restoring visual attention lost during reduction, and it uses an anchor selection strategy for token merging that limits the information lost when features are averaged. Across multiple benchmarks, the authors report that RESTORE consistently improves the accuracy of various reduction methods and achieves state-of-the-art performance while maintaining computational efficiency.
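To make the cost argument concrete, the short sketch below counts the query-key pairs that a single self-attention layer scores for a given sequence length. The token counts (576 visual tokens, 64 text tokens, a reduced budget of 64 visual tokens) and the function name `attention_pairs` are illustrative assumptions, not figures or code from the paper.

```python
def attention_pairs(num_visual: int, num_text: int) -> int:
    """Number of query-key pairs scored by one full self-attention layer.

    Every token attends to every token, so the count grows with the
    square of the total sequence length.
    """
    seq_len = num_visual + num_text
    return seq_len * seq_len


# Assumed, illustrative token counts (not taken from the paper).
full = attention_pairs(num_visual=576, num_text=64)    # 640 * 640
reduced = attention_pairs(num_visual=64, num_text=64)  # 128 * 128

print(full, reduced, full / reduced)
```

Under these assumed counts, keeping one ninth of the visual tokens cuts the number of scored pairs by a factor of 25, which is why reducing visual tokens is attractive even before any accuracy considerations.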

📝 Abstract

Recent advancements in Multimodal Large Language Models (MLLMs) have achieved remarkable success in vision-language tasks, yet the quadratic computational complexity arising from the vast number of visual tokens incurs significant memory and latency bottlenecks. While visual token reduction (VTR) strategies have been explored to mitigate this burden, existing methods overlook the positional and attentional consistency between the full and reduced sequences, resulting in a distorted representation. To this end, we propose RESTORE, a novel VTR framework that rectifies the positional and attentional distortions while maintaining efficiency. Specifically, we present a simple yet effective calibration method that restores lost visual attention by augmenting attention weights based on relative distances. We also introduce a distinctive anchor selection for token merging to mitigate information loss during feature averaging. Experimental results on multiple benchmarks demonstrate that our method consistently improves the accuracy of various reduction methods, achieving state-of-the-art performance while maintaining computational efficiency.
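As a rough illustration of anchor-based token merging in general, the sketch below assigns each non-anchor token to its most similar anchor and averages each group into one output token. It is not the RESTORE algorithm: the abstract does not specify how anchors are chosen or how merging is weighted, so the cosine-similarity assignment, the plain mean, the function names, and the hand-picked `anchor_ids` are all assumptions made for this example.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def merge_to_anchors(tokens, anchor_ids):
    """Merge tokens into len(anchor_ids) outputs (hypothetical helper).

    Each non-anchor token joins the group of its most similar anchor,
    and every group is reduced to the mean of its members.
    anchor_ids are assumed to be distinct indices into tokens.
    """
    groups = {i: [tokens[i]] for i in anchor_ids}
    for j, tok in enumerate(tokens):
        if j in groups:
            continue  # anchors already seed their own group
        best = max(anchor_ids, key=lambda i: cosine(tok, tokens[i]))
        groups[best].append(tok)

    merged = []
    for i in anchor_ids:
        members = groups[i]
        dim = len(tokens[i])
        merged.append(
            [sum(m[d] for m in members) / len(members) for d in range(dim)]
        )
    return merged


# Toy input: four 2-D "visual tokens" forming two obvious clusters.
tokens = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
merged = merge_to_anchors(tokens, anchor_ids=[0, 2])
print(merged)
```

The averaging step is where information can be lost: if an anchor attracts dissimilar tokens, their mean represents none of them well. That is the kind of loss the abstract says its anchor selection is meant to mitigate.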

Problem

Research questions and friction points this paper is trying to address.

Visual Token Reduction

Multimodal LLM

Computational Complexity

Representation Distortion

Attention Consistency

Innovation

Methods, ideas, or system contributions that make the work stand out.

Visual Token Reduction

Positional Distortion Rectification

Attention Calibration