🤖 AI Summary
To address the tripartite trade-off among conversational capability, segmentation accuracy, and inference speed in multimodal large language models (MLLMs) for segmentation tasks, this paper introduces a novel *full-mask prediction* paradigm. It treats image patches as fill-in-the-blank units and employs non-autoregressive, parallel bidirectional spatial context modeling to jointly generate textual responses and complete segmentation masks in a single forward pass—effectively decoupling text generation from mask prediction. This is the first approach to unify text and mask generation within an MLLM framework, eliminating the sequential bottleneck inherent in autoregressive segmentation methods. Evaluated on multiple segmentation benchmarks, our method substantially surpasses state-of-the-art approaches: it achieves significant gains in segmentation accuracy while preserving strong conversational proficiency and accelerates inference by 2–5×. To our knowledge, it is the first work to simultaneously optimize all three objectives—accuracy, efficiency, and dialogue capability—in a single MLLM-based segmentation architecture.
📝 Abstract
Integrating segmentation into Multimodal Large Language Models (MLLMs) presents a core trilemma: simultaneously preserving dialogue ability, achieving high segmentation performance, and ensuring fast inference. Prevailing paradigms are forced into a compromise. Embedding prediction methods introduce a conflicting pixel-level objective that degrades the MLLM's general dialogue abilities. The alternative, next-token prediction, reframes segmentation as an autoregressive task, which preserves dialogue but forces a trade-off between poor segmentation performance with sparse outputs or prohibitive inference speeds with rich ones. We resolve this trilemma with all-mask prediction, a novel paradigm that decouples autoregressive dialogue generation from non-autoregressive mask prediction. We present STAMP: Simultaneous Textual All-Mask Prediction, an MLLM that embodies this paradigm. After generating a textual response, STAMP predicts an entire segmentation mask in a single forward pass by treating it as a parallel "fill-in-the-blank" task over image patches. This design maintains the MLLM's dialogue ability by avoiding conflicting objectives, enables high segmentation performance by leveraging rich, bidirectional spatial context for all mask tokens, and achieves exceptional speed. Extensive experiments show that STAMP significantly outperforms state-of-the-art methods across multiple segmentation benchmarks, providing a solution that excels in dialogue, segmentation, and speed without compromise.