Reason Twice: Segmentation via Candidate Discovery and Comparative Reasoning

📅 2026-06-08

🤖 AI Summary

This work addresses two limitations of existing multimodal large language models (MLLMs) in complex reasoning-based image segmentation: limited training data and a gap between the MLLM and its mask generation module. The authors propose Rea2Seg, a two-stage framework that first derives candidate masks from the attention maps of a segmentation MLLM, then uses an MLLM to reason comparatively over the query and the candidate masks, scoring each one. The candidates are reranked and the highest-scoring mask is returned, which reframes segmentation as candidate discovery followed by discriminative selection. The study also introduces ReasonSeg-SGDR, a benchmark that evaluates perception, grounding, and reasoning across discriminative recognition, spatial reasoning, geometric reasoning, and multi-step reasoning, with fine-grained masks. Experiments on ReasonSeg and ReasonSeg-SGDR are reported to demonstrate the effectiveness of the framework.

📝 Abstract

The rapid development of pretrained foundation models has enabled more general image segmentation. Multimodal large language models (MLLMs) have been widely explored for image segmentation with complex queries that require high-level reasoning. Despite promising progress, existing methods are often constrained by limited training data and the gap between MLLMs and mask generation modules. To better transfer MLLMs' perception and reasoning ability to complex reasoning-based segmentation tasks, we propose Rea2Seg, a two-stage framework for mask generation and selection. Specifically, the framework first identifies potential regions as candidate masks based on the attention maps of a segmentation MLLM. It then employs an MLLM to reason over the question and candidate masks and assign a score to each mask. The final segmentation result is obtained by reranking the candidates and selecting the highest-scoring mask, which reformulates image segmentation as candidate discovery followed by discriminative mask selection. We also observe that a large portion of the questions in existing benchmarks focus on commonsense reasoning and usually do not fully require joint visual observation and reasoning. To address this issue, we introduce a new benchmark, ReasonSeg-SGDR, that comprehensively evaluates a model's perception, grounding, and reasoning abilities across multiple dimensions, including discriminative recognition, spatial reasoning, geometric reasoning, and multi-step reasoning, with fine-grained mask generation. In addition, we collect training data to enhance MLLMs' ability to jointly understand multimodal queries and candidate masks, and to assign scores through reasoning. Experimental results on the proposed benchmark and ReasonSeg demonstrate the effectiveness of the unified mask generation and selection framework.
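
The two stages described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not say how attention maps are turned into candidate masks or how the MLLM produces scores, so the multi-level thresholding, the function names (`candidate_masks_from_attention`, `select_mask`), and the area-based `toy_score` stand-in for the MLLM scorer are all assumptions made for illustration.

```python
import numpy as np

def candidate_masks_from_attention(attn, thresholds=(0.3, 0.5, 0.7)):
    """Stage 1 (assumed): threshold one attention map at several levels."""
    attn = np.asarray(attn, dtype=float)
    span = attn.max() - attn.min()
    # Min-max normalise so the thresholds are comparable across images.
    norm = (attn - attn.min()) / span if span > 0 else np.zeros_like(attn)
    candidates = []
    for t in thresholds:
        mask = norm >= t
        # Skip empty masks and exact duplicates of earlier candidates.
        if mask.any() and not any(np.array_equal(mask, c) for c in candidates):
            candidates.append(mask)
    return candidates

def select_mask(candidates, score_fn):
    """Stage 2: score every candidate and return (best_mask, scores)."""
    if not candidates:
        raise ValueError("no candidate masks to rank")
    scores = [float(score_fn(m)) for m in candidates]
    best = int(np.argmax(scores))
    return candidates[best], scores

def toy_score(mask, target_area=4):
    """Hypothetical stand-in for the MLLM scorer: prefers a given mask area."""
    return -abs(int(mask.sum()) - target_area)

# Toy 4x4 attention map with a bright 2x2 region and a faint halo.
attn = np.array([
    [0.0, 0.1, 0.1, 0.0],
    [0.1, 0.9, 1.0, 0.2],
    [0.1, 0.8, 0.9, 0.4],
    [0.0, 0.2, 0.4, 0.1],
])
candidates = candidate_masks_from_attention(attn)
best_mask, scores = select_mask(candidates, toy_score)
```

In a real system `score_fn` would wrap a call to an MLLM that sees the query, the image, and each candidate mask. Keeping it as a plain callable keeps the selection step independent of how candidates were discovered.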

Problem

Research questions and friction points this paper is trying to address.

image segmentation

multimodal large language models

complex reasoning

mask generation

visual reasoning

Innovation

Methods, ideas, or system contributions that make the work stand out.

two-stage segmentation

candidate discovery

comparative reasoning