Bring Adaptive Binding Prototypes to Generalized Referring Expression Segmentation

📅 2024-05-24
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address key challenges in Generalized Referring Expression Segmentation (GRES), including multi-object references, ambiguous descriptions, and expressions that lack a specific object reference, this paper proposes a query-driven adaptive prototype binding mechanism. The method dynamically binds query vectors to instances of different categories or to different parts of the same instance, overcoming the rigidity of conventional encoder-decoder architectures in prototype generation. Coupled with region-aware prototype generation and a lightweight decoder design, the approach avoids the need for complex feature fusion modules. On standard benchmarks, the method significantly outperforms prior state-of-the-art methods on all three gRefCOCO splits, surpasses them on RefCOCO+ and G-Ref, and remains highly competitive on RefCOCO. The summary also credits the design with more flexible decoding and better generalization across diverse referring expression scenarios.

📝 Abstract
Referring Expression Segmentation (RES) has attracted rising attention, aiming to identify and segment objects based on natural language expressions. While substantial progress has been made in RES, the emergence of Generalized Referring Expression Segmentation (GRES) introduces new challenges by allowing expressions to describe multiple objects or to lack a specific object reference. Existing RES methods usually rely on sophisticated encoder-decoder and feature fusion modules, and they struggle to generate class prototypes that match each instance individually when confronted with the complex referents and binary labels of GRES. In this paper, reevaluating the differences between RES and GRES, we propose a novel Model with Adaptive Binding Prototypes (MABP) that adaptively binds queries to object features in the corresponding region. It enables different query vectors to match instances of different categories or different parts of the same instance, significantly expanding the decoder's flexibility, dispersing global pressure across all queries, and easing the demands on the encoder. Experimental results demonstrate that MABP significantly outperforms state-of-the-art methods on all three splits of the gRefCOCO dataset. MABP also surpasses state-of-the-art methods on the RefCOCO+ and G-Ref datasets, and achieves very competitive results on RefCOCO. Code is available at https://github.com/buptLwz/MABP
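
The idea of binding each query to the object features of its own region can be illustrated with a toy sketch. This is not the authors' implementation (see the repository above for that); it is a minimal pure-Python illustration under loud assumptions: the image is a small grid of feature vectors, regions are fixed rectangular cells, every query starts from the same text vector, and each query becomes a prototype by adding a scaled mean of its region's features. The names `region_ranges`, `bind_queries`, `segment`, and the `alpha` parameter are hypothetical.

```python
def region_ranges(height, width, grid):
    """Split an H x W map into grid x grid rectangular regions (assumed layout)."""
    regions = []
    for gy in range(grid):
        for gx in range(grid):
            rows = range(gy * height // grid, (gy + 1) * height // grid)
            cols = range(gx * width // grid, (gx + 1) * width // grid)
            regions.append((rows, cols))
    return regions


def bind_queries(features, text_query, grid, alpha=0.5):
    """Return one prototype per region.

    Each prototype is the shared text query plus alpha times the mean feature
    of the region the query is bound to (a stand-in for a learned update).
    """
    height, width = len(features), len(features[0])
    prototypes = []
    for rows, cols in region_ranges(height, width, grid):
        count = len(rows) * len(cols)
        mean = [
            sum(features[y][x][c] for y in rows for x in cols) / count
            for c in range(len(text_query))
        ]
        prototypes.append([t + alpha * m for t, m in zip(text_query, mean)])
    return prototypes


def segment(features, text_query, grid=2, alpha=0.5):
    """Score each pixel only against the prototype bound to its region.

    The final binary mask is the union of the per-query masks, so several
    targets (or none at all) can be returned for a single expression.
    """
    height, width = len(features), len(features[0])
    mask = [[0] * width for _ in range(height)]
    prototypes = bind_queries(features, text_query, grid, alpha)
    for proto, (rows, cols) in zip(prototypes, region_ranges(height, width, grid)):
        for y in rows:
            for x in cols:
                score = sum(p * f for p, f in zip(proto, features[y][x]))
                if score > 0:  # assumed decision threshold on the raw score
                    mask[y][x] = 1
    return mask
```

With a 4x4 feature map and `grid=2`, `bind_queries` yields four prototypes, one per cell. Because each query only has to explain its own region, two target objects in different cells are picked up by different queries, and a map with no matching features produces an empty mask, which mirrors the multi-target and no-target cases that distinguish GRES from RES.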
Problem

Research questions and friction points this paper is trying to address.

Multi-object recognition
Natural language description
Image segmentation
Innovation

Methods, ideas, or system contributions that make the work stand out.

MABP
Multimodal Matching
Object Recognition
Weize Li
Beijing Key Laboratory of Network System and Network Culture, Key Laboratory of Interactive Technology and Experience System Ministry of Culture and Tourism, School of Artificial Intelligence, Beijing University of Posts and Telecommunications, Beijing 100876, China
Zhicheng Zhao
Associate Professor at the School of Artificial Intelligence, Anhui University
Computer Vision
Haochen Bai
Beijing Key Laboratory of Network System and Network Culture, Key Laboratory of Interactive Technology and Experience System Ministry of Culture and Tourism, School of Artificial Intelligence, Beijing University of Posts and Telecommunications, Beijing 100876, China
Fei Su
Beijing Key Laboratory of Network System and Network Culture, Key Laboratory of Interactive Technology and Experience System Ministry of Culture and Tourism, School of Artificial Intelligence, Beijing University of Posts and Telecommunications, Beijing 100876, China