LLaVA-UHD v3: Progressive Visual Compression for Efficient Native-Resolution Encoding in MLLMs

📅 2025-11-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the high computational overhead of native-resolution global visual encoding in multimodal large language models (MLLMs), this paper proposes a progressive visual compression method built upon the Vision Transformer (ViT) architecture. The approach combines refined patch embedding with flexible patch-size scaling and hierarchical windowed token compression deployed across ViT layers, enabling efficient, fine-grained visual modeling without compromising model generality. Experiments show that the resulting encoder reduces time-to-first-token by 2.4× compared to MoonViT within an identical MLLM architecture, and that the full model further reduces it by 1.9× relative to Qwen2-VL, while maintaining competitive performance across diverse multimodal understanding benchmarks. This work provides a lightweight, flexible, and high-performance solution for high-resolution visual encoding in MLLMs.

📝 Abstract
Visual encoding followed by token condensing has become the standard architectural paradigm in multimodal large language models (MLLMs). Many recent MLLMs increasingly favor global native-resolution visual encoding over slice-based methods. To investigate this trend, we systematically compare their behavior on vision-language understanding and attention patterns, revealing that global encoding enhances overall capability but at the expense of greater computational overhead. To address this issue, we present LLaVA-UHD v3, an MLLM centered upon our proposed Progressive Visual Compression (PVC) method, which can be seamlessly integrated into a standard Vision Transformer (ViT) to enable efficient native-resolution encoding. The PVC approach consists of two key modules: (i) refined patch embedding, which supports flexible patch-size scaling for fine-grained visual modeling, and (ii) windowed token compression, hierarchically deployed across ViT layers to progressively aggregate local token representations. Jointly modulated by these two modules, a widely pretrained ViT can be reconfigured into an efficient architecture while largely preserving generality. Evaluated across extensive benchmarks, the transformed ViT, termed ViT-UHD, demonstrates competitive performance with MoonViT while reducing TTFT (time-to-first-token) by 2.4x, when developed within an identical MLLM architecture. Building upon ViT-UHD, LLaVA-UHD v3 also achieves performance competitive with Qwen2-VL, while further reducing TTFT by 1.9x. We will release all code and checkpoints to support future research on efficient MLLMs.
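To make the windowed token compression idea concrete, below is a minimal PyTorch sketch of how such a stage could work: tokens laid out on an h×w grid are partitioned into non-overlapping windows, and each window is merged into a single token, shrinking the sequence at that ViT depth. The module name, the concatenate-then-project aggregation, and the choice of layers at which it is applied are illustrative assumptions, not the paper's exact design.

```python
# Hedged sketch of a "windowed token compression" stage as described at a high
# level in the abstract. All names and the aggregation scheme are assumptions.
import torch
import torch.nn as nn

class WindowedTokenCompression(nn.Module):
    def __init__(self, dim: int, window: int = 2):
        super().__init__()
        self.window = window
        # Simple learned aggregation: concatenate each window's tokens, then project.
        self.proj = nn.Linear(dim * window * window, dim)

    def forward(self, x: torch.Tensor, h: int, w: int) -> torch.Tensor:
        # x: (B, h*w, C) token sequence on an h x w grid.
        B, N, C = x.shape
        s = self.window
        assert N == h * w and h % s == 0 and w % s == 0
        x = x.view(B, h // s, s, w // s, s, C)
        # Group each window's tokens together: (B, (h/s)*(w/s), s*s*C).
        x = x.permute(0, 1, 3, 2, 4, 5).reshape(B, -1, s * s * C)
        return self.proj(x)  # (B, (h*w)/(s*s), C)

# Inserting such a stage at several depths of a ViT would shrink the token
# count by window^2 at each stage, which is the "progressive" part.
```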
Problem

Research questions and friction points this paper is trying to address.

Global native-resolution visual encoding in MLLMs creates high computational overhead
Standard Vision Transformers lack efficient compression for native-resolution processing
Existing MLLM architectures suffer from high time-to-first-token (TTFT) latency
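As a rough illustration of the first point, the snippet below (not taken from the paper) estimates how the token count, and hence the quadratic self-attention cost, grows when a ViT encodes a native-resolution image globally instead of a fixed low-resolution input. The resolutions and patch size are arbitrary example values.

```python
# Back-of-the-envelope illustration of why global native-resolution encoding is
# expensive: token count grows with image area / patch_size^2, and ViT
# self-attention FLOPs grow roughly quadratically with token count.
def num_tokens(height_px: int, width_px: int, patch: int = 14) -> int:
    return (height_px // patch) * (width_px // patch)

low = num_tokens(448, 448)      # ~1,024 tokens at a typical fixed resolution
high = num_tokens(2016, 1512)   # ~15,552 tokens for a native-resolution document image
print(low, high)                # 1024 15552
print((high / low) ** 2)        # ~231x more attention cost, roughly
```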
Innovation

Methods, ideas, or system contributions that make the work stand out.

Progressive Visual Compression for efficient encoding
Refined patch embedding with flexible scaling
Windowed token compression across ViT layers
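The sketch below illustrates one way "refined patch embedding with flexible scaling" could be realized: the patch size is chosen per input so that the native-resolution token count stays within a budget, and the patch-projection kernel is resampled to that size. The budget heuristic, the kernel interpolation, and all names here are assumptions for illustration; the paper's actual mechanism may differ.

```python
# Hedged sketch of a patch embedding with flexible patch-size scaling.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FlexiblePatchEmbed(nn.Module):
    def __init__(self, dim: int = 768, base_patch: int = 14, max_tokens: int = 4096):
        super().__init__()
        self.base_patch = base_patch
        self.max_tokens = max_tokens
        # Base projection kernel, defined at the base patch size.
        self.weight = nn.Parameter(torch.randn(dim, 3, base_patch, base_patch) * 0.02)
        self.bias = nn.Parameter(torch.zeros(dim))

    def forward(self, img: torch.Tensor) -> torch.Tensor:
        # img: (B, 3, H, W) at native resolution.
        _, _, H, W = img.shape
        patch = self.base_patch
        # Grow the patch size until the token count fits the budget (assumed heuristic).
        while (H // patch) * (W // patch) > self.max_tokens:
            patch += self.base_patch
        # Resample the projection kernel to the chosen patch size.
        w = F.interpolate(self.weight, size=(patch, patch), mode="bilinear", align_corners=False)
        # Patchify and project: (B, num_tokens, dim).
        return F.conv2d(img, w, self.bias, stride=patch).flatten(2).transpose(1, 2)
```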
Authors
Shichu Sun
School of Advanced Interdisciplinary Sciences, University of Chinese Academy of Sciences
Yichen Zhang
Tsinghua University
Haolin Song
University of Chinese Academy of Sciences
Zonghao Guo
University of Chinese Academy of Sciences
Chi Chen
Tsinghua University
Yidan Zhang
PhD Student, The Chinese University of Hong Kong, Shenzhen
Computer vision, deep learning
Yuan Yao
Tsinghua University
Zhiyuan Liu
Tsinghua University
Maosong Sun
Professor of Computer Science and Technology, Tsinghua University
Natural Language Processing, Artificial Intelligence, Social Computing