PRIMA: Multi-Image Vision-Language Models for Reasoning Segmentation

📅 2024-12-19
🏛️ arXiv.org
📈 Citations: 8
Influential: 2
🤖 AI Summary
Existing vision-language models exhibit a capability gap between pixel-level grounding (single-image) and multi-image reasoning (coarse-grained), hindering fine-grained visual comparison and localization across images. To bridge this gap, we introduce *multi-image pixel-level reasoning segmentation*—a novel task requiring joint understanding of multiple images with precise pixel-level alignment. We propose PRIMA, the first vision-language model unifying multi-image comprehension and pixel-level grounding. Its core innovations include: (1) a unified multi-image alignment encoder with pixel-adaptive feature mapping; (2) an efficient cross-image visual query module reducing computational overhead by 25.3%; and (3) M⁴Seg, the first large-scale benchmark for this task (∼224K QA pairs). Trained via joint cross-image attention, instruction tuning, and contrastive learning, PRIMA achieves significant gains over state-of-the-art methods on M⁴Seg, demonstrating both the effectiveness and the generalizability of multi-image pixel-grounded reasoning.

📝 Abstract
Despite significant advancements in Large Vision-Language Models (LVLMs), existing pixel-grounding models operate on single-image settings, limiting their ability to perform detailed, fine-grained comparisons across multiple images. Conversely, current multi-image understanding models lack pixel-level grounding. Our work addresses this gap by introducing the task of multi-image pixel-grounded reasoning segmentation, and PRIMA, a novel LVLM that integrates pixel-level grounding with robust multi-image reasoning capabilities to produce contextually rich, pixel-grounded explanations. Central to PRIMA is an efficient vision module that queries fine-grained visual representations across multiple images, reducing TFLOPs by 25.3%. To support training and evaluation, we curate M⁴Seg, a new reasoning segmentation benchmark consisting of ∼224K question-answer pairs that require fine-grained visual understanding across multiple images. Experimental results demonstrate that PRIMA outperforms state-of-the-art baselines.
Problem

Research questions and friction points this paper is trying to address.

How to enable fine-grained, pixel-grounded comparisons across multiple images
How to integrate pixel-level grounding with multi-image reasoning to produce rich explanations
Lack of benchmarks for multi-image reasoning segmentation tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Integrates pixel-level grounding with multi-image reasoning
Uses SQuARE module to inject cross-image relational context
Introduces the M⁴Seg benchmark for multi-image segmentation evaluation