🤖 AI Summary
Existing vision-language projectors for Multimodal Large Language Models (MLLMs) rely on global attention and tend to overemphasize prominent image regions, which limits fine-grained understanding. To address this, we propose ParGo, a Partial-Global projector that integrates partial and global views of the visual features to bridge the representation gap between separately pre-trained vision encoders and LLMs. To train ParGo effectively, we collect ParGoCap-1M-PT, a large-scale dataset of 1 million images paired with high-quality, detail-rich captions. On the MME benchmark, ParGo improves on the conventional Q-Former projector by 259.96 points, and it outperforms other projectors most clearly on tasks that emphasize detail perception.
📝 Abstract
This work presents ParGo, a novel Partial-Global projector designed to connect the vision and language modalities for Multimodal Large Language Models (MLLMs). Unlike previous works that rely on global attention-based projectors, ParGo bridges the representation gap between separately pre-trained vision encoders and LLMs by integrating global and partial views, which alleviates the overemphasis on prominent regions. To facilitate effective training of ParGo, we collect a large-scale detail-captioned image-text dataset named ParGoCap-1M-PT, consisting of 1 million images paired with high-quality captions. Extensive experiments on several MLLM benchmarks demonstrate the effectiveness of ParGo and its superiority in aligning the vision and language modalities. Compared to the conventional Q-Former projector, ParGo achieves an improvement of 259.96 on the MME benchmark. Furthermore, our experiments show that ParGo significantly outperforms other projectors, particularly on tasks that emphasize detail perception.
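The abstract does not specify how the partial and global views are combined, so the following is only an illustrative sketch under stated assumptions, not the authors' implementation. It assumes a query-based projector in which a few hypothetical "partial" query tokens may each attend to one contiguous slice of image patches, while "global" query tokens attend to all patches. The function names, the contiguous-slice masking rule, and all sizes are assumptions made for illustration.

```python
import numpy as np


def masked_cross_attention(queries, features, mask):
    """Single-head cross-attention; mask[i, j] is True where query i may attend to patch j."""
    d = queries.shape[-1]
    scores = queries @ features.T / np.sqrt(d)
    # Disallowed positions get -inf so they receive zero weight after the softmax.
    scores = np.where(mask, scores, -np.inf)
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ features, weights


def build_partial_global_mask(n_partial, n_global, n_patches):
    """Assumed masking rule: each partial query sees one contiguous slice; global queries see everything."""
    mask = np.zeros((n_partial + n_global, n_patches), dtype=bool)
    bounds = np.linspace(0, n_patches, n_partial + 1).astype(int)
    for i in range(n_partial):
        mask[i, bounds[i]:bounds[i + 1]] = True
    mask[n_partial:, :] = True
    return mask


# Toy sizes, chosen only for illustration.
rng = np.random.default_rng(0)
n_patches, dim = 16, 8
n_partial, n_global = 4, 2

features = rng.normal(size=(n_patches, dim))            # stand-in for vision-encoder patch features
queries = rng.normal(size=(n_partial + n_global, dim))  # stand-in for learnable query tokens

mask = build_partial_global_mask(n_partial, n_global, n_patches)
tokens, weights = masked_cross_attention(queries, features, mask)
print(tokens.shape)  # one projected token per query
```

In this sketch, the partial tokens cannot be dominated by a prominent region outside their slice, which is one plausible way to read "alleviates the overemphasis on prominent regions"; the actual ParGo design may differ.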