ParGo: Bridging Vision-Language with Partial and Global Views

📅 2024-08-23

🏛️ arXiv.org

📈 Citations: 3

✨ Influential: 0

🤖 AI Summary

Existing vision-language projectors rely on global attention and overemphasize prominent image regions, which limits fine-grained understanding. To address this, we propose ParGo, a Partial-Global projector that connects separately pre-trained vision encoders and LLMs by integrating partial and global views of the image. To train it, we collect ParGoCap-1M-PT, a detail-captioned dataset of 1 million images paired with high-quality captions. On the MME benchmark, ParGo improves on the conventional Q-Former projector by 259.96 points, and it outperforms other projectors most clearly on tasks that emphasize detail perception.


📝 Abstract

This work presents ParGo, a novel Partial-Global projector designed to connect the vision and language modalities for Multimodal Large Language Models (MLLMs). Unlike previous works that rely on global attention-based projectors, ParGo bridges the representation gap between separately pre-trained vision encoders and LLMs by integrating global and partial views, which alleviates the overemphasis on prominent regions. To facilitate the effective training of ParGo, we collect a large-scale detail-captioned image-text dataset named ParGoCap-1M-PT, consisting of 1 million images paired with high-quality captions. Extensive experiments on several MLLM benchmarks demonstrate the effectiveness of ParGo, highlighting its superiority in aligning the vision and language modalities. Compared to the conventional Q-Former projector, ParGo achieves an improvement of 259.96 on the MME benchmark. Furthermore, our experiments reveal that ParGo significantly outperforms other projectors, particularly in tasks that emphasize detail perception ability.
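The abstract does not say how the partial and global views are combined. As a rough illustration only, the sketch below shows one plausible reading: a set of query tokens cross-attends to image patch features, where "partial" queries are masked so that each sees only one contiguous chunk of patches and "global" queries see all of them. The function names, the contiguous chunking, the token counts, and the omission of learned projection weights are all assumptions made for this example, not the paper's implementation.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax; masked entries (-inf) become exactly 0.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def partial_global_mask(num_partial, num_global, num_patches):
    """Boolean mask of shape (num_partial + num_global, num_patches).

    True means the query token may attend to that patch.
    Hypothetical layout: each partial token gets one contiguous chunk
    of patches, and every global token sees all patches.
    """
    if num_partial > num_patches:
        raise ValueError("need at least one patch per partial token")
    mask = np.zeros((num_partial + num_global, num_patches), dtype=bool)
    bounds = np.linspace(0, num_patches, num_partial + 1).astype(int)
    for i in range(num_partial):
        mask[i, bounds[i]:bounds[i + 1]] = True
    mask[num_partial:, :] = True
    return mask


def masked_cross_attention(queries, patches, mask):
    """Single-head cross-attention without learned projections (assumed)."""
    d = queries.shape[-1]
    scores = queries @ patches.T / np.sqrt(d)
    scores = np.where(mask, scores, -np.inf)  # block disallowed patches
    weights = softmax(scores, axis=-1)
    return weights @ patches, weights


# Illustrative sizes only: 4 partial tokens, 2 global tokens, 16 patches.
rng = np.random.default_rng(0)
mask = partial_global_mask(num_partial=4, num_global=2, num_patches=16)
queries = rng.standard_normal((6, 8))
patches = rng.standard_normal((16, 8))
tokens, weights = masked_cross_attention(queries, patches, mask)
```

Under this reading, the partial rows of `weights` are nonzero only inside their own chunk, so a single prominent region cannot dominate every output token, while the global rows still carry whole-image context.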

Problem

Research questions and friction points this paper is trying to address.

Visual Attention Bias

Detail Recognition

Image-Text Integration

Innovation

Methods, ideas, or system contributions that make the work stand out.

ParGo

Large Language Models

Visual Detail Processing

🔎 Similar Papers

ViGoR: Improving Visual Grounding of Large Vision Language Models with Fine-Grained Reward Modeling

2024-02-09 · European Conference on Computer Vision · Citations: 29
