🤖 AI Summary
To address the high GPU memory consumption and computational overhead in open-domain visual question answering (VQA) with multimodal large language models (MLLMs)—caused by redundant image tokens—this paper proposes a question-guided visual token compression method. The approach introduces a novel cross-modal alignment mechanism that maps question embeddings into the visual feature space via a pretrained text encoder and a learnable feed-forward layer. It further employs a cross-layer progressive soft compression strategy, dynamically selecting salient visual tokens based on relevance scores. Crucially, the method preserves discriminative information essential for VQA while reducing the number of visual tokens to only 1/8 of the original count—achieving performance on par with full-token baselines. This yields substantial reductions in GPU memory usage and inference latency, establishing a new paradigm for efficient MLLM inference.
📝 Abstract
Recent advances in Multi-modal Large Language Models (MLLMs) have shown significant progress in open-world Visual Question Answering (VQA). However, integrating visual information increases the number of processed tokens, leading to higher GPU memory usage and computational overhead. Images often contain more redundant information than text, and not all visual details are pertinent to a given question. To address these challenges, we propose QG-VTC, a novel question-guided visual token compression method for MLLM-based VQA tasks. QG-VTC employs a pretrained text encoder and a learnable feed-forward layer to embed the user's question into the vision encoder's feature space, then computes correlation scores between the question embedding and the visual tokens. By selecting the most relevant tokens and softly compressing the rest, QG-VTC keeps the retained tokens aligned with the user's question. Additionally, a progressive strategy applies this compression across different vision encoder layers, gradually reducing the number of tokens. This approach maximizes retention of question-relevant information while discarding irrelevant details. Experimental results show that our method achieves performance on par with uncompressed models using just 1/8 of the visual tokens. The code and model will be publicly available on GitHub.
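The selection-and-soft-compression step described above can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: it assumes the question embedding has already been projected into the vision feature space (the role of the pretrained text encoder plus learnable feed-forward layer), uses cosine similarity as the correlation score, and merges the unselected tokens into a single score-weighted token as a stand-in for the paper's soft compression. The function name and `keep_ratio` parameter are hypothetical.

```python
import numpy as np

def qg_token_compress(visual_tokens, question_emb, keep_ratio=0.5):
    """Hypothetical sketch of question-guided visual token compression.

    visual_tokens: (N, d) visual tokens from one vision-encoder layer.
    question_emb:  (d,) question embedding, assumed already projected
                   into the vision feature space.
    Returns (k + 1, d): the k most question-relevant tokens (in score
    order) plus one soft-merged token summarizing the rest.
    """
    # Correlation score: cosine similarity between each visual token
    # and the question embedding.
    v = visual_tokens / np.linalg.norm(visual_tokens, axis=1, keepdims=True)
    q = question_emb / np.linalg.norm(question_emb)
    scores = v @ q                                  # (N,)

    k = max(1, int(len(visual_tokens) * keep_ratio))
    order = np.argsort(-scores)
    kept = visual_tokens[order[:k]]                 # most relevant tokens

    rest = order[k:]
    if len(rest) == 0:
        return kept

    # "Soft compression": merge the remaining tokens into one token,
    # weighted by their softmaxed relevance scores, so weakly relevant
    # information is summarized rather than discarded outright.
    w = np.exp(scores[rest] - scores[rest].max())
    w /= w.sum()
    merged = (w[:, None] * visual_tokens[rest]).sum(axis=0, keepdims=True)
    return np.concatenate([kept, merged], axis=0)
```

The progressive strategy would then apply such a step at several encoder layers, e.g. `for _ in range(3): tokens = qg_token_compress(tokens, q)`, so the token count shrinks gradually toward the reported 1/8 rather than in one aggressive cut.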