LLaVA-UHD v3: Progressive Visual Compression for Efficient Native-Resolution Encoding in MLLMs

📅 2025-11-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the high computational overhead of native-resolution global visual encoding in multimodal large language models (MLLMs), this paper proposes a progressive visual compression method built upon the Vision Transformer (ViT) architecture. The approach combines refined patch embedding with flexible patch-size scaling and hierarchical windowed token compression deployed across ViT layers, enabling efficient, fine-grained visual modeling without compromising model generality. Experiments show that the resulting encoder reduces time-to-first-token by 2.4× compared to MoonViT within an identical MLLM architecture, and that the full model further reduces it by 1.9× relative to Qwen2-VL, while maintaining competitive performance across diverse multimodal understanding benchmarks. This work provides a lightweight, flexible, and high-performance solution for high-resolution visual encoding in MLLMs.

📝 Abstract
Visual encoding followed by token condensing has become the standard architectural paradigm in multimodal large language models (MLLMs). Many recent MLLMs increasingly favor global native-resolution visual encoding over slice-based methods. To investigate this trend, we systematically compare their behavior on vision-language understanding and attention patterns, revealing that global encoding enhances overall capability but at the expense of greater computational overhead. To address this issue, we present LLaVA-UHD v3, an MLLM centered upon our proposed Progressive Visual Compression (PVC) method, which can be seamlessly integrated into a standard Vision Transformer (ViT) to enable efficient native-resolution encoding. The PVC approach consists of two key modules: (i) refined patch embedding, which supports flexible patch-size scaling for fine-grained visual modeling, and (ii) windowed token compression, hierarchically deployed across ViT layers to progressively aggregate local token representations. Jointly modulated by these two modules, a widely pretrained ViT can be reconfigured into an efficient architecture while largely preserving generality. Evaluated across extensive benchmarks, the transformed ViT, termed ViT-UHD, demonstrates competitive performance with MoonViT while reducing TTFT (time-to-first-token) by 2.4x, when developed within an identical MLLM architecture. Building upon ViT-UHD, LLaVA-UHD v3 also achieves performance competitive with Qwen2-VL, while further reducing TTFT by 1.9x. We will release all code and checkpoints to support future research on efficient MLLMs.
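To make the windowed token compression idea concrete, below is a minimal PyTorch sketch of how such a stage could work: tokens laid out on an h×w grid are partitioned into non-overlapping windows, and each window is merged into a single token, shrinking the sequence at that ViT depth. The module name, the concatenate-then-project aggregation, and the choice of layers at which it is applied are illustrative assumptions, not the paper's exact design.

```python
# Hedged sketch of a "windowed token compression" stage as described at a high
# level in the abstract. All names and the aggregation scheme are assumptions.
import torch
import torch.nn as nn

class WindowedTokenCompression(nn.Module):
    def __init__(self, dim: int, window: int = 2):
        super().__init__()
        self.window = window
        # Simple learned aggregation: concatenate each window's tokens, then project.
        self.proj = nn.Linear(dim * window * window, dim)

    def forward(self, x: torch.Tensor, h: int, w: int) -> torch.Tensor:
        # x: (B, h*w, C) token sequence on an h x w grid.
        B, N, C = x.shape
        s = self.window
        assert N == h * w and h % s == 0 and w % s == 0
        x = x.view(B, h // s, s, w // s, s, C)
        # Group each window's tokens together: (B, (h/s)*(w/s), s*s*C).
        x = x.permute(0, 1, 3, 2, 4, 5).reshape(B, -1, s * s * C)
        return self.proj(x)  # (B, (h*w)/(s*s), C)

# Inserting such a stage at several depths of a ViT would shrink the token
# count by window^2 at each stage, which is the "progressive" part.
```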
Problem

Research questions and friction points this paper is trying to address.

Global native-resolution visual encoding in MLLMs creates high computational overhead
Standard Vision Transformers lack efficient compression for native-resolution processing
Existing MLLM architectures suffer from high time-to-first-token (TTFT) latency
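As a rough illustration of the first point, the snippet below (not taken from the paper) estimates how the token count, and hence the quadratic self-attention cost, grows when a ViT encodes a native-resolution image globally instead of a fixed low-resolution input. The resolutions and patch size are arbitrary example values.

```python
# Back-of-the-envelope illustration of why global native-resolution encoding is
# expensive: token count grows with image area / patch_size^2, and ViT
# self-attention FLOPs grow roughly quadratically with token count.
def num_tokens(height_px: int, width_px: int, patch: int = 14) -> int:
    return (height_px // patch) * (width_px // patch)

low = num_tokens(448, 448)      # ~1,024 tokens at a typical fixed resolution
high = num_tokens(2016, 1512)   # ~15,552 tokens for a native-resolution document image
print(low, high)                # 1024 15552
print((high / low) ** 2)        # ~231x more attention cost, roughly
```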
Innovation

Methods, ideas, or system contributions that make the work stand out.

Progressive Visual Compression for efficient encoding
Refined patch embedding with flexible scaling
Windowed token compression across ViT layers
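The sketch below illustrates one way "refined patch embedding with flexible scaling" could be realized: the patch size is chosen per input so that the native-resolution token count stays within a budget, and the patch-projection kernel is resampled to that size. The budget heuristic, the kernel interpolation, and all names here are assumptions for illustration; the paper's actual mechanism may differ.

```python
# Hedged sketch of a patch embedding with flexible patch-size scaling.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FlexiblePatchEmbed(nn.Module):
    def __init__(self, dim: int = 768, base_patch: int = 14, max_tokens: int = 4096):
        super().__init__()
        self.base_patch = base_patch
        self.max_tokens = max_tokens
        # Base projection kernel, defined at the base patch size.
        self.weight = nn.Parameter(torch.randn(dim, 3, base_patch, base_patch) * 0.02)
        self.bias = nn.Parameter(torch.zeros(dim))

    def forward(self, img: torch.Tensor) -> torch.Tensor:
        # img: (B, 3, H, W) at native resolution.
        _, _, H, W = img.shape
        patch = self.base_patch
        # Grow the patch size until the token count fits the budget (assumed heuristic).
        while (H // patch) * (W // patch) > self.max_tokens:
            patch += self.base_patch
        # Resample the projection kernel to the chosen patch size.
        w = F.interpolate(self.weight, size=(patch, patch), mode="bilinear", align_corners=False)
        # Patchify and project: (B, num_tokens, dim).
        return F.conv2d(img, w, self.bias, stride=patch).flatten(2).transpose(1, 2)
```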
Authors
Shichu Sun
School of Advanced Interdisciplinary Sciences, University of Chinese Academy of Sciences
Yichen Zhang
Tsinghua University
Haolin Song
University of Chinese Academy of Sciences
Zonghao Guo
University of Chinese Academy of Sciences
Chi Chen
Tsinghua University
Yidan Zhang
PhD Student, The Chinese University of Hong Kong, Shenzhen
Computer vision, deep learning
Yuan Yao
Tsinghua University
Zhiyuan Liu
Tsinghua University
Maosong Sun
Professor of Computer Science and Technology, Tsinghua University
Natural Language Processing, Artificial Intelligence, Social Computing