QG-VTC: Question-Guided Visual Token Compression in MLLMs for Efficient VQA

📅 2025-04-01
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the high GPU memory consumption and computational overhead in open-domain visual question answering (VQA) with multimodal large language models (MLLMs)—caused by redundant image tokens—this paper proposes a question-guided visual token compression method. The approach introduces a novel cross-modal alignment mechanism that maps question embeddings into the visual feature space via a pretrained text encoder and a learnable feed-forward layer. It further employs a cross-layer progressive soft compression strategy, dynamically selecting salient visual tokens based on relevance scores. Crucially, the method preserves discriminative information essential for VQA while reducing the number of visual tokens to only 1/8 of the original count—achieving performance on par with full-token baselines. This yields substantial reductions in GPU memory usage and inference latency, establishing a new paradigm for efficient MLLM inference.

📝 Abstract
Recent advances in Multi-modal Large Language Models (MLLMs) have shown significant progress in open-world Visual Question Answering (VQA). However, integrating visual information increases the number of processed tokens, leading to higher GPU memory usage and computational overhead. Images often contain more redundant information than text, and not all visual details are pertinent to specific questions. To address these challenges, we propose QG-VTC, a novel question-guided visual token compression method for MLLM-based VQA tasks. QG-VTC employs a pretrained text encoder and a learnable feed-forward layer to embed user questions into the vision encoder's feature space, then computes correlation scores between the question embeddings and visual tokens. By selecting the most relevant tokens and softly compressing others, QG-VTC ensures fine-tuned relevance to user needs. Additionally, a progressive strategy applies this compression across different vision encoder layers, gradually reducing the number of tokens. This approach maximizes retention of question-relevant information while discarding irrelevant details. Experimental results show that our method achieves performance on par with uncompressed models using just 1/8 of the visual tokens. The code and model will be publicly available on GitHub.
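The scoring-and-selection step described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function name, the cosine-similarity scoring, the single merged summary token, and the use of a plain projection matrix in place of the paper's learnable feed-forward layer are all assumptions.

```python
import numpy as np

def question_guided_compress(visual_tokens, question_emb, W_proj, keep_ratio=0.125):
    """Keep the visual tokens most relevant to the question; softly merge the rest.

    visual_tokens: (N, D_v) patch tokens from the vision encoder
    question_emb:  (D_t,)   pooled question embedding from a text encoder
    W_proj:        (D_t, D_v) stand-in for the learnable feed-forward layer
                   that maps the question into the visual feature space
    """
    q = question_emb @ W_proj                         # project question into visual space
    # cosine similarity between each visual token and the projected question
    scores = (visual_tokens @ q) / (
        np.linalg.norm(visual_tokens, axis=1) * np.linalg.norm(q) + 1e-8
    )
    k = max(1, int(len(visual_tokens) * keep_ratio))
    order = np.argsort(-scores)
    kept, rest = order[:k], order[k:]
    out = visual_tokens[kept]                         # most question-relevant tokens
    if len(rest):
        # "soft compression": merge the remaining tokens into one summary
        # token, weighted by their relevance scores (softmax over scores)
        w = np.exp(scores[rest] - scores[rest].max())
        w = (w / w.sum())[:, None]
        out = np.vstack([out, (w * visual_tokens[rest]).sum(axis=0, keepdims=True)])
    return out

rng = np.random.default_rng(0)
tokens = rng.standard_normal((64, 32))   # 64 visual tokens of dim 32
q_emb = rng.standard_normal(16)          # question embedding of dim 16
W = rng.standard_normal((16, 32)) * 0.02
out = question_guided_compress(tokens, q_emb, W)
print(out.shape)  # (9, 32): 8 kept tokens + 1 merged summary token
```

With `keep_ratio=0.125`, the hard-selected tokens match the paper's 1/8 budget; merging the discarded tokens rather than dropping them outright is one plausible reading of "softly compressing others".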
Problem

Research questions and friction points this paper is trying to address.

Reduces GPU memory and computational overhead in MLLMs
Compresses irrelevant visual tokens for efficient VQA
Maintains performance while using fewer visual tokens
Innovation

Methods, ideas, or system contributions that make the work stand out.

Question-guided visual token compression for efficiency
Progressive token reduction across vision encoder layers
Retains relevant info while discarding redundant details
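The progressive strategy reduces tokens in stages across vision-encoder layers rather than all at once. A toy schedule is sketched below; the number of stages and the per-stage ratios are assumptions for illustration, not the paper's reported settings.

```python
# Progressive compression schedule: halve the token count at each of three
# stages so the final count reaches 1/8 of the original, matching the
# paper's overall budget. Stage placement within the encoder is assumed.
def progressive_schedule(num_tokens, stages=(0.5, 0.5, 0.5)):
    counts = [num_tokens]
    for r in stages:
        counts.append(max(1, int(counts[-1] * r)))
    return counts

print(progressive_schedule(576))  # [576, 288, 144, 72] -> 72 = 576 / 8
```

Spreading the reduction across layers lets later encoder layers refine relevance scores on an already-pruned token set, which is one motivation for a gradual schedule over a single aggressive cut.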
👤 Authors
Shuai Li
Beijing Jiaotong University, Beijing, 100044, China
Jian Xu
State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences, Beijing, 100190, China
Xiao-Hui Li
Huawei; The Hong Kong University of Science and Technology
Chao Deng
State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences, Beijing, 100190, China
Lin-Lin Huang
Beijing Jiaotong University, Beijing, 100044, China