Segment-to-Act: Label-Noise-Robust Action-Prompted Video Segmentation Towards Embodied Intelligence

📅 2025-09-20
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses label noise in action-driven video object segmentation, systematically investigating—for the first time—the joint impact of text prompt noise (e.g., category flipping or noun substitution) and mask annotation noise (e.g., boundary perturbation). To this end, we introduce ActiSeg-NL, the first multimodal noise benchmark tailored for embodied intelligence. We further propose a parallel mask head architecture: a dual-branch design separately models original action semantics and robust segmentation representations, integrating adversarial mask perturbations and text replacement strategies to enhance generalization. Experiments uncover correlations between noise types and model failure modes, establishing a unified evaluation protocol across text-only, boundary-only, and mixed-noise settings. Results demonstrate that our method significantly improves foreground-background trade-off performance, consistently outperforming existing label-noise learning approaches under diverse noise conditions.

📝 Abstract
Embodied intelligence relies on accurately segmenting objects actively involved in interactions. Action-based video object segmentation addresses this by linking segmentation with action semantics, but it depends on large-scale annotations and prompts that are costly, inconsistent, and prone to multimodal noise such as imprecise masks and referential ambiguity. To date, this challenge remains unexplored. In this work, we take the first step by studying action-based video object segmentation under label noise, focusing on two sources: textual prompt noise (category flips and within-category noun substitutions) and mask annotation noise (perturbed object boundaries that mimic imprecise supervision). Our contributions are threefold. First, we introduce two types of label noise for the action-based video object segmentation task. Second, we build ActiSeg-NL, the first benchmark for action-based video object segmentation under label noise, adapt six label-noise learning strategies to this setting, and establish protocols for evaluating them under textual, boundary, and mixed noise. Third, we provide a comprehensive analysis linking noise types to failure modes and robustness gains, and we introduce a Parallel Mask Head Mechanism (PMHM) to address mask annotation noise. Qualitative evaluations further reveal characteristic failure modes, including boundary leakage and mislocalization under boundary perturbations, as well as occasional identity substitutions under textual flips. Our comparative analysis reveals that different learning strategies exhibit distinct robustness profiles, governed by a foreground-background trade-off: some achieve balanced performance while others prioritize foreground accuracy at the cost of background precision. The established benchmark and source code will be made publicly available at https://github.com/mylwx/ActiSeg-NL.
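The two noise sources described above can be simulated with simple synthetic corruptions. The sketch below is illustrative only: the category vocabulary, noise rates, and function names are hypothetical, not the paper's actual benchmark-generation code, which may use different perturbation magnitudes and sampling rules.

```python
import random

# Hypothetical category vocabulary mapping each noun to within-category
# synonyms; the real ActiSeg-NL label space is not specified here.
CATEGORIES = {"cup": ["mug", "glass"], "knife": ["blade", "cutter"]}

def corrupt_prompt(noun: str, flip_rate: float = 0.3, rng=random) -> str:
    """Textual prompt noise: with probability flip_rate, either flip the
    category to a different one or substitute a within-category noun."""
    if rng.random() >= flip_rate:
        return noun
    if rng.random() < 0.5:  # category flip
        return rng.choice([c for c in CATEGORIES if c != noun])
    return rng.choice(CATEGORIES.get(noun, [noun]))  # noun substitution

def perturb_mask(mask, grow: bool = True):
    """Mask annotation noise: dilate (grow=True) or erode (grow=False) a
    binary mask by one pixel to mimic an imprecise object boundary."""
    h, w = len(mask), len(mask[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            neigh = [mask[y2][x2]
                     for y2 in range(max(0, y - 1), min(h, y + 2))
                     for x2 in range(max(0, x - 1), min(w, x + 2))]
            out[y][x] = int(any(neigh)) if grow else int(all(neigh))
    return out
```

In practice, a boundary-noise benchmark would apply such dilations and erosions (possibly combined with random shifts) only near the mask contour; the full-image morphology here keeps the sketch short.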
Problem

Research questions and friction points this paper is trying to address.

Addresses action-based video object segmentation under label noise conditions
Tackles textual prompt noise and mask annotation noise in segmentation
Establishes benchmark for evaluating robustness against multimodal supervision noise
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introducing label noise types for action-based video segmentation
Creating ActiSeg-NL benchmark with noise evaluation protocols
Proposing the Parallel Mask Head Mechanism (PMHM) to counter mask annotation noise
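The dual-branch idea behind the parallel mask head can be sketched as two per-pixel scoring branches whose outputs are fused into one prediction. The fusion rule, branch roles, and names below are assumptions for illustration; the paper's actual PMHM architecture is not specified in this summary and may differ.

```python
def fuse_parallel_heads(semantic_logits, robust_logits, alpha=0.5, thresh=0.0):
    """Fuse per-pixel logits from a hypothetical action-semantics branch
    and a noise-robust segmentation branch into one binary mask.
    alpha weights the robust branch; thresh binarizes the fused score."""
    h, w = len(semantic_logits), len(semantic_logits[0])
    mask = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            fused = ((1 - alpha) * semantic_logits[y][x]
                     + alpha * robust_logits[y][x])
            mask[y][x] = int(fused > thresh)
    return mask
```

Keeping the branches parallel rather than stacked lets each one specialize (one fits the possibly noisy action semantics, the other a smoothed segmentation target), with disagreement resolved only at fusion time.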