Dynamic-LLaVA: Efficient Multimodal Large Language Models via Dynamic Vision-language Context Sparsification

📅 2024-12-01
🏛️ arXiv.org
📈 Citations: 6
Influential: 0
🤖 AI Summary
Multimodal large language models (MLLMs) suffer from rapidly escalating computational and memory overhead during autoregressive decoding, as costs scale linearly with output length. While existing visual context compression methods improve prefill efficiency, their benefits diminish significantly during decoding. To address this, we propose a dynamic vision-language contextual sparsification framework—the first to adaptively prune context based on inference phase (prefill vs. decode) and KV cache availability, overcoming the limitations of static pruning. Our approach integrates dynamic context pruning, phase-aware sparsity scheduling, KV-cache-aware memory optimization, and joint vision-language redundancy modeling. It enables end-to-end efficient inference: 75% reduction in prefill computation, 50% reduction in decode computation without KV caching, and 50% reduction in memory footprint with KV caching—while maintaining or improving multi-task accuracy and incurring negligible accuracy degradation.

📝 Abstract
Multimodal Large Language Models (MLLMs) have achieved remarkable success in vision understanding, reasoning, and interaction. However, inference computation and memory grow progressively with the generation of output tokens during decoding, directly limiting the efficiency of MLLMs. Existing methods attempt to reduce vision context redundancy to achieve efficient MLLMs. Unfortunately, the efficiency benefits of vision context reduction in the prefill stage gradually diminish during the decoding stage. To address this problem, we propose Dynamic-LLaVA, a dynamic vision-language context sparsification framework that dynamically reduces the redundancy of the vision context in the prefill stage and decreases the memory and computation overhead of the generated language context during decoding. Dynamic-LLaVA designs a tailored sparsification inference scheme for each inference mode, i.e., prefill and decoding with and without KV cache, to achieve efficient inference of MLLMs. In practice, Dynamic-LLaVA reduces computation consumption by $\sim$75% in the prefill stage. Meanwhile, throughout the entire generation process, Dynamic-LLaVA reduces computation consumption by $\sim$50% when decoding without KV cache, while saving $\sim$50% GPU memory overhead when decoding with KV cache, due to the vision-language context sparsification. Extensive experiments also demonstrate that Dynamic-LLaVA achieves efficient inference for MLLMs with negligible understanding and generation ability degradation, or even performance gains, compared to full-context inference baselines. Code is available at https://github.com/Osilly/dynamic_llava.
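To make the prefill-stage idea concrete, the following is a minimal sketch of vision-token sparsification: keep only a fraction of the image tokens before they enter the language model. The paper trains a predictor to score token importance end-to-end; here a simple L2-norm score stands in for that learned predictor, and the token count (576 patches, hidden size 1024) is just an illustrative LLaVA-like configuration, not the authors' exact setup.

```python
import numpy as np

def prune_vision_tokens(tokens: np.ndarray, keep_ratio: float = 0.25) -> np.ndarray:
    """Keep the top-`keep_ratio` fraction of vision tokens for one sequence.

    Dynamic-LLaVA learns an importance predictor; the L2 norm used below is
    an illustrative stand-in, not the paper's scoring function.
    """
    num_tokens, dim = tokens.shape
    num_keep = max(1, int(num_tokens * keep_ratio))
    scores = np.linalg.norm(tokens, axis=-1)          # one score per token
    keep = np.sort(np.argsort(scores)[-num_keep:])    # top-k, original order
    return tokens[keep]

# Hypothetical example: 576 image patch tokens with hidden size 1024.
vision = np.random.randn(576, 1024)
pruned = prune_vision_tokens(vision, keep_ratio=0.25)
print(pruned.shape)  # (144, 1024) -> ~75% of prefill vision tokens dropped
```

Keeping 25% of the vision tokens is what yields the roughly 75% prefill-computation reduction the abstract cites, since attention and FFN cost in the prefill pass scale with the number of input tokens.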
Problem

Research questions and friction points this paper is trying to address.

Vision context in MLLMs is highly redundant, inflating inference cost
Memory and computation overhead grow throughout decoding as output tokens accumulate
Prefill-only compression methods yield diminishing efficiency gains during decoding
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dynamic vision-language context sparsification framework (Dynamic-LLaVA)
Tailored sparsification schemes for prefill and for decoding with and without KV cache
~75% less prefill computation, ~50% less decoding computation (without KV cache), ~50% less GPU memory (with KV cache)
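The decode-stage contribution can be illustrated with a generic KV-cache sparsification sketch: when the cache exceeds a budget, evict the positions with the least recent attention mass so memory stays bounded as generation proceeds. This is a common eviction heuristic used here only for illustration; Dynamic-LLaVA's actual decision module is learned, and the shapes and budget below are hypothetical.

```python
import numpy as np

def sparsify_kv_cache(keys, values, attn_mass, budget):
    """Retain only the `budget` cached positions with the highest recent
    attention mass. A generic eviction heuristic standing in for the
    paper's learned sparsification module (illustrative only)."""
    seq_len = keys.shape[0]
    if seq_len <= budget:
        return keys, values
    keep = np.sort(np.argsort(attn_mass)[-budget:])  # keep temporal order
    return keys[keep], values[keep]

# Hypothetical single-head cache: 100 cached positions, head dim 64.
keys = np.random.randn(100, 64)
values = np.random.randn(100, 64)
attn_mass = np.random.rand(100)  # e.g. attention received over recent steps
k2, v2 = sparsify_kv_cache(keys, values, attn_mass, budget=50)
print(k2.shape, v2.shape)  # (50, 64) (50, 64)
```

Halving the retained cache in this way is what a ~50% GPU-memory saving when decoding with KV cache corresponds to; without a KV cache, dropping the same positions instead halves the tokens re-processed at each step, cutting decoding computation.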
Wenxuan Huang
CUHK & ECNU
Artificial General Intelligence · MLLM · LLM · AIGC · Model Acceleration
Zijie Zhai
East China Normal University
Yunhang Shen
Xiamen University
Shaoshen Cao
Xiaohongshu
Fei Zhao
Nanjing University
Xiangfeng Xu
East China Normal University
Zheyu Ye
Imperial College London
Language Models · AI Agents
Shaohui Lin
East China Normal University