Efficient LLaMA-3.2-Vision by Trimming Cross-attended Visual Features

📅 2025-04-01
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address excessive KV-cache growth and high inference overhead caused by the large number of image tokens in the cross-attention layers of large vision-language models (LVLMs), this paper proposes a training-free, sparsity-driven method for dynamic visual token pruning. Unlike existing approaches that exploit self-attention sparsity, this work explicitly models and leverages the intrinsic sparsity of cross-attention maps to compress visual features across cross-attention layers. The method requires no fine-tuning or additional training and is plug-and-play on architectures such as LLaMA-3.2-Vision. Experiments show that trimming visual tokens by 50% substantially reduces GPU memory consumption and inference latency while maintaining performance on multimodal understanding benchmarks (e.g., MMBench, OCRBench). The core contribution is the first explicit use of cross-attention sparsity for visual token compression, extending beyond the prevailing self-attention-centric paradigm and opening a new direction for efficient LVLM inference.

📝 Abstract
Visual token reduction lowers inference costs caused by extensive image features in large vision-language models (LVLMs). Unlike related studies that prune tokens in self-attention-only LVLMs, our work uniquely addresses cross-attention-based models, which achieve superior performance. We identify that the key-value (KV) cache size for image tokens in cross-attention layers significantly exceeds that of text tokens in self-attention layers, posing a major compute bottleneck. To mitigate this issue, we exploit the sparse nature of cross-attention maps to selectively prune redundant visual features. Our Trimmed Llama effectively reduces KV cache demands without requiring additional training. By benefiting from 50%-reduced visual features, our model reduces inference latency and memory usage while achieving benchmark parity.
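The pruning idea described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes a cross-attention map of shape (heads, text_queries, visual_tokens), and the scoring rule (summing attention mass per visual token) and the `prune_visual_tokens` helper are hypothetical; only the 50% keep ratio comes from the paper.

```python
# Hypothetical sketch of sparsity-driven visual token pruning.
# Assumes a cross-attention map of shape (heads, text_queries, visual_tokens);
# the scoring rule is an illustrative assumption, not the paper's exact method.
import numpy as np

def prune_visual_tokens(attn_map: np.ndarray, keep_ratio: float = 0.5) -> np.ndarray:
    """Return indices of visual tokens to keep, ranked by attention mass."""
    # Aggregate the attention each visual token receives across heads and queries.
    scores = attn_map.sum(axis=(0, 1))          # shape: (visual_tokens,)
    n_keep = max(1, int(scores.size * keep_ratio))
    keep = np.argsort(scores)[-n_keep:]         # indices of top-scoring tokens
    return np.sort(keep)                        # restore original token order

# Toy example: 2 heads, 3 text queries, 8 visual tokens.
rng = np.random.default_rng(0)
attn = rng.random((2, 3, 8))
kept = prune_visual_tokens(attn, keep_ratio=0.5)
print(kept.size)  # 4 visual tokens survive a 50% trim
```

Because only the kept indices feed the cross-attention key/value projections, the KV cache for image tokens shrinks in proportion to the trim ratio.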
Problem

Research questions and friction points this paper is trying to address.

The KV cache for image tokens in cross-attention layers far exceeds that for text tokens in self-attention layers
Existing token-pruning methods target self-attention-only LVLMs, leaving cross-attention-based models unaddressed
Extensive image features inflate inference latency and memory usage
Innovation

Methods, ideas, or system contributions that make the work stand out.

Trims cross-attended visual features guided by attention-map sparsity
Reduces KV cache size without fine-tuning or extra training
Cuts inference latency and memory usage at a 50% visual-token reduction
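A back-of-envelope calculation shows why trimming visual tokens pays off in KV-cache memory. All dimensions below (token count, layer count, head count, head size) are illustrative assumptions, not LLaMA-3.2-Vision's actual configuration; only the 50% trim ratio comes from the paper.

```python
# Back-of-envelope KV-cache sizing for cross-attention image tokens.
# Dimensions are illustrative assumptions, not the model's real configuration.
def kv_cache_bytes(n_tokens, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    # Factor of 2 accounts for the separate key and value tensors (fp16 elems).
    return 2 * n_tokens * n_layers * n_kv_heads * head_dim * bytes_per_elem

full = kv_cache_bytes(n_tokens=6400, n_layers=8, n_kv_heads=8, head_dim=128)
trimmed = kv_cache_bytes(n_tokens=3200, n_layers=8, n_kv_heads=8, head_dim=128)
print(full // 2**20, trimmed // 2**20)  # MiB before vs. after a 50% trim
```

Since cache size is linear in the token count, a 50% token trim halves the image-token KV cache regardless of the other dimensions.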
Jewon Lee (Nota Inc.)
Ki-Ung Song (AI Research Engineer at Nota AI)
Seungmin Yang (Nota Inc.)
Donguk Lim (Nota Inc.)
Jaeyeon Kim (Nota Inc.)
Wooksu Shin (Nota Inc.)
Bo-Kyeong Kim (Nota Inc.)
Yong Jae Lee (University of Wisconsin-Madison)
Tae-Ho Kim (Nota Inc.)