Segmentation as a Plug-and-Play Capability for Frozen Multimodal LLMs

📅 2025-10-19
📈 Citations: 0
Influential: 0
🤖 AI Summary
Frozen multimodal large language models (MLLMs) struggle to support pixel-level segmentation tasks, while full fine-tuning often degrades their pretrained output space and cross-task generalization. Method: We propose a plug-and-play, fine-tuning-free segmentation enhancement method. Its core innovation lies in leveraging self-attention maps to identify spatially salient keypoints and introducing a lightweight trainable head that bridges frozen visual features to a mask decoder, without modifying any pretrained parameters. The method interfaces solely with the intermediate visual features of the frozen MLLM, preserving its original output distribution and linguistic understanding. Contribution/Results: Our approach achieves performance competitive with or superior to full-model fine-tuning across multiple segmentation benchmarks, while fully retaining the MLLM's generalization capacity in open-ended dialogue, cross-modal reasoning, and other vision-language tasks, establishing a new paradigm for unified, multi-task vision-language modeling.
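
The paper's code is not shown here, but the core idea (mine salient keypoints from the frozen model's self-attention, gather the corresponding frozen visual features, and project them through a small trainable head into point-wise prompts for a mask decoder) can be sketched roughly as follows. This is a minimal illustration assuming a PyTorch setup; the module names, tensor shapes, and the top-k saliency heuristic are assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn

class KeypointHead(nn.Module):
    """Trainable bridge from frozen MLLM features to mask-decoder point prompts.

    Hypothetical sketch: the shapes and the top-k saliency heuristic are
    illustrative assumptions, not the paper's actual design.
    """

    def __init__(self, feat_dim=1024, point_dim=256, num_points=16):
        super().__init__()
        self.num_points = num_points
        # The only trainable parameters; the MLLM itself stays frozen.
        self.proj = nn.Sequential(
            nn.Linear(feat_dim, point_dim),
            nn.GELU(),
            nn.Linear(point_dim, point_dim),
        )

    def forward(self, attn_maps, visual_feats):
        # attn_maps:    (B, heads, Q, N) self-attention over N image patches
        # visual_feats: (B, N, feat_dim) intermediate features of the frozen MLLM
        saliency = attn_maps.mean(dim=(1, 2))                 # (B, N) per-patch saliency
        idx = saliency.topk(self.num_points, dim=-1).indices  # (B, K) salient patches
        idx = idx.unsqueeze(-1).expand(-1, -1, visual_feats.size(-1))
        points = torch.gather(visual_feats, 1, idx)           # (B, K, feat_dim)
        return self.proj(points)                              # (B, K, point_dim)

head = KeypointHead()
attn = torch.rand(2, 8, 32, 196)   # fake attention maps over 14x14 patches
feats = torch.randn(2, 196, 1024)  # fake frozen visual features
print(head(attn, feats).shape)     # torch.Size([2, 16, 256])
```

The resulting point-wise features would then be passed to the mask decoder as prompts, leaving the MLLM's own text outputs untouched.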

📝 Abstract
Integrating diverse visual capabilities into a unified model is a significant trend in Multimodal Large Language Models (MLLMs). Among these, the inclusion of segmentation poses a distinct set of challenges. To equip MLLMs with pixel-level segmentation abilities, prevailing methods require finetuning the model to produce specific outputs compatible with a mask decoder. This process typically alters the model's output space and compromises its intrinsic generalization, which undermines the goal of building a unified model. We introduce LENS (Leveraging kEypoiNts for MLLMs' Segmentation), a novel plug-and-play solution. LENS attaches a lightweight, trainable head to a completely frozen MLLM. By refining the spatial cues embedded in attention maps, LENS extracts keypoints and encodes them into point-wise features directly compatible with the mask decoder. Extensive experiments validate our approach: LENS achieves segmentation performance competitive with or superior to that of retraining-based methods. Crucially, it does so while fully preserving the MLLM's generalization capabilities, which are significantly degraded by finetuning approaches. As such, the attachable design of LENS establishes an efficient and powerful paradigm for extending MLLMs, paving the way for truly multi-talented, unified models.
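
Since the method only reads intermediate activations of the frozen MLLM, the backbone can be tapped non-invasively, for example with forward hooks. The sketch below is a hedged illustration of that pattern using a toy stand-in module; the real layer path inside an MLLM's vision tower is an assumption.

```python
import torch
import torch.nn as nn

class DummyVisionTower(nn.Module):
    """Toy stand-in for the frozen MLLM's vision encoder."""

    def __init__(self, dim=64):
        super().__init__()
        self.block = nn.Linear(dim, dim)  # stands in for a transformer block

    def forward(self, x):
        return self.block(x)

captured = {}

def save_activation(name):
    def hook(module, inputs, output):
        captured[name] = output.detach()  # read-only tap, no gradient into the backbone
    return hook

tower = DummyVisionTower()
tower.eval()
for p in tower.parameters():
    p.requires_grad_(False)  # backbone stays frozen; only the attached head trains

# In a real MLLM the hook would sit on an intermediate vision layer,
# e.g. something like mllm.vision_tower.blocks[-2] (this path is an assumption).
tower.block.register_forward_hook(save_activation("visual_feats"))

x = torch.randn(1, 49, 64)  # fake patch tokens for a 7x7 grid
_ = tower(x)
print(captured["visual_feats"].shape)  # torch.Size([1, 49, 64])
```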
Problem

Research questions and friction points this paper is trying to address.

Enabling pixel-level segmentation in frozen multimodal LLMs without finetuning
Preserving model generalization while adding segmentation capabilities
Creating plug-and-play segmentation for unified multimodal models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Plug-and-play segmentation for frozen MLLMs
Lightweight head refines attention maps for keypoints
Preserves generalization without finetuning the model
Jiazhen Liu
Georgia Institute of Technology
Robotics · Robot Learning · Task and Motion Planning
Long Chen
Department of Computer Science and Engineering, The Hong Kong University of Science and Technology