Learning Free Token Reduction for Multi-Modal LLM

πŸ“… 2025-01-29
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
To address the high computational cost and inference latency of visual reasoning with multimodal large language models (MLLMs), this paper proposes a training-free, plug-and-play cross-temporal visual token compression method. The approach introduces a "learning-free" paradigm that requires no gradient-based optimization and leaves the original model architecture unchanged. It employs handcrafted spatial downsampling and temporal frame selection strategies, jointly constrained by semantic fidelity, to mitigate visual redundancy while preserving representation quality. Evaluated on Video-QA benchmarks, the method achieves a 2.3× throughput improvement and a substantial latency reduction, with accuracy degradation limited to less than 0.4%, striking a strong trade-off between efficiency and performance.
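The two-stage pipeline described above (spatial downsampling plus similarity-gated temporal frame selection) can be sketched as follows. This is an illustrative reconstruction, not the paper's exact algorithm: the function name `compress_video_tokens`, the pooling factor, and the cosine-similarity threshold are all assumptions made for the example.

```python
import numpy as np

def compress_video_tokens(feats, pool=2, sim_thresh=0.95):
    """Training-free visual token reduction (illustrative sketch).

    feats:      (T, H, W, C) per-frame features from a frozen vision encoder.
    pool:       spatial downsampling factor (non-overlapping average pooling).
    sim_thresh: a frame is dropped when its pooled feature is more similar
                than this to the last kept frame (temporal redundancy gate).
    """
    T, H, W, C = feats.shape

    # Stage 1 -- spatial downsampling: average-pool each frame's token grid.
    h, w = H // pool, W // pool
    pooled = (feats[:, :h * pool, :w * pool, :]
              .reshape(T, h, pool, w, pool, C)
              .mean(axis=(2, 4)))                      # (T, h, w, C)

    # Stage 2 -- temporal frame selection: keep a frame only if it differs
    # semantically (low cosine similarity) from the previously kept frame.
    def frame_vec(t):
        v = pooled[t].reshape(-1)
        return v / (np.linalg.norm(v) + 1e-8)

    kept, last = [0], frame_vec(0)
    for t in range(1, T):
        cur = frame_vec(t)
        if float(last @ cur) < sim_thresh:
            kept.append(t)
            last = cur
    return pooled[kept]                                # (T', h, w, C), T' <= T
```

Because both stages are handcrafted and parameter-free, the routine can sit between the vision encoder and the LLM of most MLLM stacks without retraining, which is the "plug-and-play" property the summary highlights.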

πŸ“ Abstract
Vision-Language Models (VLMs) have achieved remarkable success across a range of multimodal tasks; however, their practical deployment is often constrained by high computational costs and prolonged inference times. Since the vision modality typically carries more information than the text modality, compressing visual prompts offers a promising solution to alleviate these challenges. Existing approaches predominantly focus on refining model architectures or directly reducing the number of visual tokens. However, these methods often compromise inference performance due to a lack of consideration for the unique spatial and temporal characteristics of visual data. In this work, we propose a token compression paradigm that operates on both spatial and temporal dimensions. Our approach includes a learning-free, plug-and-play compression pipeline that can be seamlessly integrated into most Multimodal Large Language Model (MLLM) frameworks. By leveraging this method, we enhance the model inference capability while simultaneously reducing its computational cost. Experimental results on the Video-QA task demonstrate the effectiveness of the proposed approach, showcasing significant improvements in efficiency without sacrificing performance.
Problem

Research questions and friction points this paper is trying to address.

Vision-Language Models
Efficiency Optimization
Spatial-Temporal Characteristics
Innovation

Methods, ideas, or system contributions that make the work stand out.

Efficiency Enhancement
Visual-Language Modeling
Multi-modal Model Optimization
Zihui Zhao
Shenzhen Key Laboratory of Ubiquitous Data Enabling, Tsinghua Shenzhen International Graduate School, Tsinghua University, Shenzhen, Guangdong 518055, China
Yingxin Li
Tsinghua University
LLM · VLM · Efficient ML
Yang Li
Shenzhen Key Laboratory of Ubiquitous Data Enabling, Tsinghua Shenzhen International Graduate School, Tsinghua University, Shenzhen, Guangdong 518055, China