🤖 AI Summary
Existing multimodal large language models (MLLMs) exhibit significant limitations in pixel-level understanding, yet coarse-grained evaluation tasks—such as VQA and visual grounding—fail to accurately assess their fine-grained capabilities. Moreover, current segmentation approaches relying on implicit tokens and external decoders compromise the textual output space, degrade linguistic competence, and lack scalability.
Method: We propose the Human-Like Mask Annotation Task (HLMAT), which reformulates image segmentation as a multi-step, text-based click decision process. This enables end-to-end pixel-level understanding and high-quality mask generation without architectural modifications or implicit tokens.
Contribution/Results: We introduce an interactive segmentation paradigm grounded in human-like annotation trajectories, which turns pixel-level understanding into a learnable, evaluable, multi-step visual decision task. SegAgent, a model fine-tuned on these trajectories, models segmentation as a sequence of text-based clicks and is further strengthened by adaptations of the StaR policy improvement method and PRM-guided tree search. It achieves performance comparable to state-of-the-art methods on segmentation benchmarks, supports additional tasks such as mask refinement and annotation filtering, and provides a protocol for evaluating pixel-level understanding in MLLMs.
📝 Abstract
While MLLMs have demonstrated adequate image understanding capabilities, they still struggle with pixel-level comprehension, limiting their practical applications. Current evaluation tasks like VQA and visual grounding remain too coarse to assess fine-grained pixel comprehension accurately. Though segmentation is foundational for pixel-level understanding, existing methods often require MLLMs to generate implicit tokens, decoded through external pixel decoders. This approach disrupts the MLLM's text output space, potentially compromising language capabilities and reducing flexibility and extensibility, while failing to reflect the model's intrinsic pixel-level understanding. Thus, we introduce the Human-Like Mask Annotation Task (HLMAT), a new paradigm where MLLMs mimic human annotators using interactive segmentation tools. Modeling segmentation as a multi-step Markov Decision Process, HLMAT enables MLLMs to iteratively generate text-based click points, achieving high-quality masks without architectural changes or implicit tokens. Through this setup, we develop SegAgent, a model fine-tuned on human-like annotation trajectories, which achieves performance comparable to state-of-the-art (SOTA) methods and supports additional tasks like mask refinement and annotation filtering. HLMAT provides a protocol for assessing fine-grained pixel understanding in MLLMs and introduces a vision-centric, multi-step decision-making task that facilitates exploration of MLLMs' visual reasoning abilities. Our adaptations of the policy improvement method StaR and PRM-guided tree search further enhance model robustness in complex segmentation tasks, laying a foundation for future advancements in fine-grained visual perception and multi-step decision-making for MLLMs.
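To make the multi-step click formulation concrete, the following is a minimal toy sketch of the annotation loop, not the SegAgent implementation. Everything in it is an assumption made for illustration: the `positive (x, y)` / `negative (x, y)` action text format, the 16x16 grid, the square-patch `apply_click` that stands in for an interactive segmentation tool, and `oracle_policy`, which peeks at the ground-truth mask only so that the loop can run. In HLMAT, the MLLM would instead produce the action text from the image and the current mask.

```python
import re
from itertools import product

SIZE = 16    # toy image is SIZE x SIZE pixels (assumption)
RADIUS = 2   # each click affects a (2 * RADIUS + 1)-wide square (assumption)


def format_action(x, y, positive):
    """Serialize a click as plain text (hypothetical action format)."""
    label = "positive" if positive else "negative"
    return f"{label} ({x}, {y})"


def parse_action(text):
    """Parse the text action back into (x, y, positive)."""
    m = re.fullmatch(r"(positive|negative) \((\d+), (\d+)\)", text.strip())
    if m is None:
        raise ValueError(f"unparseable action: {text!r}")
    return int(m.group(2)), int(m.group(3)), m.group(1) == "positive"


def apply_click(mask, x, y, positive):
    """Toy stand-in for an interactive segmentation tool.

    A positive click adds a square patch to the mask; a negative click
    removes it. A new set is returned; the input mask is not modified.
    """
    patch = {
        (px, py)
        for px in range(max(0, x - RADIUS), min(SIZE, x + RADIUS + 1))
        for py in range(max(0, y - RADIUS), min(SIZE, y + RADIUS + 1))
    }
    return mask | patch if positive else mask - patch


def iou(a, b):
    """Intersection over union of two pixel sets."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def oracle_policy(mask, target):
    """Hypothetical stand-in for the MLLM policy.

    Greedily returns the text action that most improves IoU against the
    ground-truth target, or None when no single click improves it.
    """
    best_score, best_action = iou(mask, target), None
    for x, y, positive in product(range(SIZE), range(SIZE), (True, False)):
        score = iou(apply_click(mask, x, y, positive), target)
        if score > best_score:
            best_score, best_action = score, format_action(x, y, positive)
    return best_action


def annotate(target, max_steps=5):
    """Run the loop: state = current mask, action = one text click."""
    mask, actions, scores = set(), [], []
    for _ in range(max_steps):
        action = oracle_policy(mask, target)
        if action is None:  # the policy chooses to stop
            break
        mask = apply_click(mask, *parse_action(action))
        actions.append(action)
        scores.append(iou(mask, target))
    return mask, actions, scores


# Ground-truth object: a 7 x 5 rectangle of pixels.
target = {(x, y) for x in range(3, 10) for y in range(4, 9)}
mask, actions, scores = annotate(target)
```

Because every action is ordinary text, the policy's output space stays entirely within the language model's vocabulary, and the recorded `actions` list is the kind of trajectory a model could be fine-tuned on. The per-step `scores` also hint at how a step-level reward signal could rank candidate clicks, although the sketch does not implement PRM-guided tree search.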