Unlocking the Potential of MLLMs in Referring Expression Segmentation via a Light-weight Mask Decoder

📅 2025-08-06
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the challenge in referring expression segmentation (RES) where multimodal large language models (MLLMs) struggle to simultaneously achieve high accuracy and low computational cost, this paper proposes MLLMSeg—a novel framework that fully exploits fine-grained features from deep layers of the MLLM’s vision encoder and fuses them with semantic representations from a large language model. It introduces a lightweight mask decoder (34M parameters) and a Detail-enhanced Semantic-consistent Fusion module (DSFF). Crucially, MLLMSeg avoids auxiliary vision encoders or reliance on the computationally heavy Segment Anything Model (SAM, 632M parameters). Extensive experiments demonstrate that MLLMSeg outperforms both SAM-based and existing lightweight RES methods across multiple benchmarks, achieving superior accuracy–efficiency trade-offs and enhanced practical deployability.

📝 Abstract
Referring Expression Segmentation (RES) aims to segment image regions specified by referring expressions and has become popular with the rise of multimodal large language models (MLLMs). While MLLMs excel at semantic understanding, their token-generation paradigm struggles with pixel-level dense prediction. Existing RES methods either couple MLLMs with the parameter-heavy Segment Anything Model (SAM), which has 632M network parameters, or adopt SAM-free lightweight pipelines that sacrifice accuracy. To address this trade-off between performance and cost, we propose MLLMSeg, a novel framework that fully exploits the visual detail features already encoded in the MLLM vision encoder, without introducing an extra visual encoder. In addition, we propose a detail-enhanced and semantic-consistent feature fusion module (DSFF) that fully integrates the detail-related visual features with the semantic-related features output by the large language model (LLM) of the MLLM. Finally, we establish a light-weight mask decoder with only 34M network parameters that optimally leverages detailed spatial features from the visual encoder and semantic features from the LLM to achieve precise mask prediction. Extensive experiments demonstrate that our method generally surpasses both SAM-based and SAM-free competitors, striking a better balance between performance and cost. Code is available at https://github.com/jcwang0602/MLLMSeg.
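The abstract describes DSFF only at a high level, so the following is a minimal PyTorch-style sketch of one plausible reading: project the vision encoder's spatial detail features and the LLM's semantic embedding into a shared dimension, then fuse them with a small convolutional block. All class names, argument names, and dimensions here are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn


class DSFFSketch(nn.Module):
    """Illustrative detail-enhanced, semantic-consistent fusion (assumed design).

    `detail_feat`: spatial feature map from the MLLM vision encoder, (B, C_v, H, W).
    `semantic_feat`: pooled segmentation-token embedding from the LLM, (B, C_l).
    """

    def __init__(self, vis_dim: int = 1024, llm_dim: int = 4096, hidden_dim: int = 256):
        super().__init__()
        self.vis_proj = nn.Conv2d(vis_dim, hidden_dim, kernel_size=1)
        self.sem_proj = nn.Linear(llm_dim, hidden_dim)
        # Fuse by conditioning the detail map on the broadcast semantic embedding.
        self.fuse = nn.Sequential(
            nn.Conv2d(hidden_dim * 2, hidden_dim, kernel_size=3, padding=1),
            nn.GELU(),
            nn.Conv2d(hidden_dim, hidden_dim, kernel_size=3, padding=1),
        )

    def forward(self, detail_feat: torch.Tensor, semantic_feat: torch.Tensor) -> torch.Tensor:
        v = self.vis_proj(detail_feat)              # (B, D, H, W)
        s = self.sem_proj(semantic_feat)            # (B, D)
        s = s[:, :, None, None].expand_as(v)        # broadcast over spatial positions
        return self.fuse(torch.cat([v, s], dim=1))  # (B, D, H, W)


if __name__ == "__main__":
    dsff = DSFFSketch()
    out = dsff(torch.randn(1, 1024, 32, 32), torch.randn(1, 4096))
    print(out.shape)  # torch.Size([1, 256, 32, 32])
```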
Problem

Research questions and friction points this paper is trying to address.

Enhancing pixel-level dense prediction in MLLMs for RES
Balancing accuracy and cost in referring expression segmentation
Integrating visual detail features with semantic understanding efficiently
Innovation

Methods, ideas, or system contributions that make the work stand out.

Lightweight mask decoder with only 34M parameters (see the sketch after this list)
Detail-enhanced semantic-consistent feature fusion module
Reuses the MLLM's own vision encoder, with no auxiliary visual encoder or SAM
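To give a feel for what a decoder in this size class can look like, here is a hypothetical lightweight convolutional mask head that upsamples the fused feature into mask logits, followed by a quick parameter count. It is a sketch under assumed dimensions, not the paper's 34M decoder.

```python
import torch
import torch.nn as nn


class LightMaskDecoderSketch(nn.Module):
    """Rough sketch of a lightweight mask head (not the paper's exact decoder).

    Takes a fused (B, D, H, W) feature, e.g. from a DSFF-like module, and
    upsamples it 4x into a single-channel mask logit map.
    """

    def __init__(self, hidden_dim: int = 256):
        super().__init__()
        self.head = nn.Sequential(
            nn.ConvTranspose2d(hidden_dim, hidden_dim // 2, kernel_size=2, stride=2),
            nn.GELU(),
            nn.ConvTranspose2d(hidden_dim // 2, hidden_dim // 4, kernel_size=2, stride=2),
            nn.GELU(),
            nn.Conv2d(hidden_dim // 4, 1, kernel_size=1),
        )

    def forward(self, fused: torch.Tensor) -> torch.Tensor:
        return self.head(fused)  # (B, 1, 4H, 4W) mask logits


decoder = LightMaskDecoderSketch()
n_params = sum(p.numel() for p in decoder.parameters())
# Orders of magnitude below SAM's 632M; the paper's full decoder is 34M.
print(f"{n_params / 1e6:.2f}M parameters")
```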
Jingchao Wang
East China Normal University
AI
Zhijian Wu
Medical Artificial Intelligence Laboratory, Westlake University
Dingjiang Huang
School of Data Science and Engineering, East China Normal University
Yefeng Zheng
Professor, Westlake University, Hangzhou, China, IEEE Fellow, AIMBE Fellow
AI in Health, Medical Imaging, Computer Vision, Natural Language Processing, Large Language Model
Hong Wang
School of Life Science and Technology, Xi’an Jiaotong University