🤖 AI Summary
Long-context vision-language models (VLMs) suffer high inference latency during the prefill phase on multimodal inputs (e.g., video + text) with million-token sequences, a consequence of quadratic attention complexity. To address this, the work identifies a distinctive grid-structured sparsity pattern inherent in video tokens, arising from their temporal and spatial locality. The proposed method, MMInference, is a modality-aware permuted sparse attention mechanism: sparse patterns are optimized offline for each attention head, and the sparse distribution is constructed dynamically from the input, requiring no model architecture changes or fine-tuning. By combining grid-pattern modeling with custom GPU sparse kernels, the method achieves up to 8.3× prefill speedup on 1M-token inputs while maintaining accuracy. It generalizes across diverse downstream tasks, including Video QA, captioning, and VisionNIAH, and is compatible with leading long-context VLMs such as LongVILA and LLaVA-Video.
📝 Abstract
The integration of long-context capabilities with visual understanding unlocks unprecedented potential for Vision Language Models (VLMs). However, the quadratic attention complexity of the pre-filling phase remains a significant obstacle to real-world deployment. To overcome this limitation, we introduce MMInference (Multimodality Million tokens Inference), a dynamic sparse attention method that accelerates the pre-filling stage for long-context multi-modal inputs. First, our analysis reveals that the temporal and spatial locality of video input leads to a unique sparse pattern, the Grid pattern. Moreover, VLMs exhibit markedly different sparse distributions across modalities. We introduce a permutation-based method that leverages the Grid pattern and handles modality-boundary issues. By searching offline for the optimal sparse pattern of each head, MMInference constructs the sparse distribution dynamically based on the input. We also provide optimized GPU kernels for efficient sparse computation. Notably, MMInference integrates seamlessly into existing VLM pipelines without any model modifications or fine-tuning. Experiments on multi-modal benchmarks, including Video QA, Captioning, VisionNIAH, and Mixed-Modality NIAH, with state-of-the-art long-context VLMs (LongVILA, LLaVA-Video, VideoChat-Flash, Qwen2.5-VL) show that MMInference accelerates the pre-filling stage by up to 8.3× at 1M tokens while maintaining accuracy. Our code is available at https://aka.ms/MMInference.
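To make the Grid pattern and the permutation idea concrete, here is a minimal NumPy sketch (not the paper's implementation; `stride` and `local_window` are illustrative parameters) of a causal attention mask where each query attends to keys at fixed stride offsets, roughly the same spatial position in earlier video frames, plus a small local window. A stable sort by phase then permutes the scattered grid columns into contiguous blocks, the kind of layout block-sparse GPU kernels handle efficiently.

```python
import numpy as np

def grid_sparse_mask(n_tokens: int, stride: int, local_window: int) -> np.ndarray:
    """Causal attention mask with a grid pattern: each query attends to
    keys whose offset is a multiple of `stride` (e.g., the same spatial
    position in earlier frames) plus a small local window of neighbors."""
    q = np.arange(n_tokens)[:, None]   # query positions (rows)
    k = np.arange(n_tokens)[None, :]   # key positions (columns)
    causal = k <= q
    grid = (q - k) % stride == 0       # periodic "grid" columns
    local = (q - k) < local_window     # recent neighbors
    return causal & (grid | local)

mask = grid_sparse_mask(n_tokens=16, stride=4, local_window=2)

# A phase-grouping permutation gathers tokens that share the same index
# modulo `stride`, so the scattered grid entries become contiguous blocks.
perm = np.argsort(np.arange(16) % 4, kind="stable")
permuted = mask[np.ix_(perm, perm)]

# The grid mask keeps far fewer entries than full causal attention.
kept, full = int(mask.sum()), 16 * 17 // 2
print(f"density: {kept}/{full}")
```

In the real method the pattern per head is chosen by an offline search and the actual sparse indices are built dynamically from the input at inference time; this sketch only illustrates the mask structure that makes such kernels fast.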