Reasoning Segmentation for Images and Videos: A Survey

📅 2025-05-24
🤖 AI Summary
This paper addresses the gap between visual perception and human-like reasoning by surveying “Reasoning Segmentation” (RS), a paradigm that performs open-vocabulary segmentation of objects in images and videos from implicit textual queries whose interpretation requires commonsense knowledge and logical inference. We establish the first comprehensive survey framework for RS, formally defining its three core characteristics: language-driven segmentation, knowledge-enhanced representation, and dynamic reasoning. Our systematic review encompasses 26 methods, 29 benchmark datasets, and standardized evaluation protocols. Technically, we unify multimodal large models, vision-language alignment, prompt engineering, and knowledge injection into a coherent taxonomic structure. Through empirical analysis, we identify key performance bottlenecks and propose six promising future directions: scalable reasoning architectures, embodied RS, causal segmentation, neuro-symbolic integration, interactive RS, and foundation-model-based RS. This work lays the conceptual and methodological groundwork for advancing reasoning-aware visual understanding.

📝 Abstract
Reasoning Segmentation (RS) aims to delineate objects based on implicit text queries, the interpretation of which requires reasoning and knowledge integration. Unlike the traditional formulation of segmentation problems that relies on fixed semantic categories or explicit prompting, RS bridges the gap between visual perception and human-like reasoning capabilities, facilitating more intuitive human-AI interaction through natural language. Our work presents the first comprehensive survey of RS for image and video processing, examining 26 state-of-the-art methods together with a review of the corresponding evaluation metrics, as well as 29 datasets and benchmarks. We also explore existing applications of RS across diverse domains and identify their potential extensions. Finally, we identify current research gaps and highlight promising future directions.
Problem

Research questions and friction points this paper is trying to address.

Delineate objects using implicit text queries requiring reasoning
Bridge visual perception and human-like reasoning via natural language
Survey reasoning segmentation methods, metrics, datasets, and applications
Innovation

Methods, ideas, or system contributions that make the work stand out.

Delineates objects via implicit text queries
Bridges visual perception and reasoning
Comprehensive survey of 26 methods
Yiqing Shen
Johns Hopkins
Chenjia Li
Johns Hopkins University
Fei Xiong
Amazon Web Services
Jeong-O Jeong
Amazon Web Services
Tianpeng Wang
Amazon Web Services
Michael Latman
Amazon Web Services
Mathias Unberath
Johns Hopkins University
Medical Robotics, Computer Vision, AI/ML, Extended Reality, HCI