MAOAM: Unified Object and Material Selection with Vision-Language Models

📅 2026-06-02

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing selection methods driven by vision-language models (VLMs) are object-centric and support only a single interaction modality, so they struggle to select non-object semantics such as materials. This work proposes MAOAM, a framework that unifies pixel-level segmentation of both objects and materials under text and click-based interactions. MAOAM pairs a VLM with a segmentation head and is trained with a multi-task objective that includes an auxiliary visual question answering (VQA) task, using a hybrid dataset of synthetic and real images with material annotations. Although it is trained only on single-modality prompts, selection improves when text and clicks are combined at inference. Experiments show accurate and coherent selection masks across diverse objects, materials, and interaction scenarios, which makes material-level operations in image editing more practical and robust.

📝 Abstract

Selection is a core operation in interactive image editing. To be practical, a user should be able to specify and disambiguate the desired selection region through either text or click-based interactions, and the system should support selecting not only objects but also regions defined by other criteria, such as materials. Material-based selection is valuable for tasks like re-texturing surfaces or editing instances of a specific material. However, existing vision-language model (VLM) based selection methods are object-centric and typically support a single interaction modality, limiting their applicability. We therefore present Mask Any Object And Material (MAOAM), a unified selection framework that enables precise object- and material-level selection across both text- and click-based interactions. MAOAM leverages a VLM with a segmentation head to produce pixel-accurate masks from user prompts: the VLM interprets the user's selection intent (object- or material-level) and encodes visual entities, attributes, and spatial relations, while the segmentation head decodes the output token into a mask. A key challenge is the lack of material selection datasets with text annotations. We propose a scalable data generation pipeline: we collect real and synthetic images with material masks, and leverage VLMs to generate material descriptions with rich visual semantics. We train MAOAM with a multi-task objective over click- and text-based selection, along with an auxiliary VQA task derived from the material descriptions to facilitate deeper material understanding. Despite being trained with uni-modal prompts, our model exhibits an emergent improvement in selection when text and clicks are combined at inference, enabling flexible image editing workflows. Experiments demonstrate accurate and coherent selections across diverse objects, materials, and interaction scenarios, highlighting robustness in practice.
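
The multi-task objective mentioned in the abstract can be pictured as a weighted sum of three per-task losses: click-based selection, text-based selection, and the auxiliary VQA task. The sketch below is illustrative only. The abstract does not specify the loss functions or how the tasks are balanced, so the binary cross-entropy mask loss, the `LossWeights` values, and the names `mask_bce` and `multitask_loss` are all assumptions, not the authors' implementation.

```python
import math
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class LossWeights:
    # Hypothetical weights: the source does not state how the tasks are balanced.
    click: float = 1.0
    text: float = 1.0
    vqa: float = 1.0


def mask_bce(pred_probs: Sequence[float], target_mask: Sequence[int],
             eps: float = 1e-7) -> float:
    """Mean binary cross-entropy between per-pixel probabilities and a 0/1 mask.

    BCE is an assumed stand-in for whatever mask loss the paper actually uses.
    """
    total = 0.0
    for p, t in zip(pred_probs, target_mask):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -(t * math.log(p) + (1 - t) * math.log(1.0 - p))
    return total / len(target_mask)


def multitask_loss(click_loss: float, text_loss: float, vqa_loss: float,
                   weights: Optional[LossWeights] = None) -> float:
    """Weighted sum of the click, text, and VQA task losses."""
    w = weights or LossWeights()
    return w.click * click_loss + w.text * text_loss + w.vqa * vqa_loss


# Toy example: flattened 4-pixel masks for one material region.
target = [1, 1, 0, 0]
click_pred = [0.9, 0.8, 0.2, 0.1]  # mask predicted from a click prompt
text_pred = [0.7, 0.6, 0.4, 0.3]   # mask predicted from a text prompt
vqa_loss = 0.5                     # placeholder for a language-modelling loss

click_l = mask_bce(click_pred, target)
text_l = mask_bce(text_pred, target)
total = multitask_loss(click_l, text_l, vqa_loss)
print(f"click={click_l:.3f} text={text_l:.3f} total={total:.3f}")
```

Each training sample in this sketch carries a single prompt type, which matches the abstract's statement that the model is trained with uni-modal prompts; combining text and clicks happens only at inference.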

Problem

Research questions and friction points this paper is trying to address.

object selection

material selection

vision-language models

interactive image editing

multimodal interaction

Innovation

Methods, ideas, or system contributions that make the work stand out.

vision-language model

material segmentation

interactive selection