SeeTraceAct: Visibility-Aware Latent Planning from Cross-Embodiment Demonstration Videos

📅 2026-06-01

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing vision-language-action models often fail to precisely localize small target regions when performing a novel task from a single demonstration video, which lowers their success rates. This work proposes SeeTraceAct, a demo-conditioned framework that improves spatial grounding under one-shot demonstration conditions through visibility-aware prediction of future end-effector traces and latent-space planning. To support reproducible evaluation, the authors introduce RoboCasa-DC, a demo-conditioned extension of RoboCasa with episode-paired humanoid demonstration videos. Experiments show that SeeTraceAct achieves the best success rate across all four RoboCasa-DC settings and improves average task success by 12.5 percentage points on a real Franka Panda arm conditioned on human demonstrations.

📝 Abstract

Vision-language-action models (VLAs) are promising general-purpose robot policies, but adapting them to new tasks typically requires costly task-specific teleoperation data. As an alternative, we study one-shot demo-conditioned VLAs, where a robot policy is conditioned on a single demonstration video of an unseen task. We find that existing end-to-end approaches often struggle when successful execution requires precisely localizing small target regions. To address this limitation, we propose SeeTraceAct, a demo-conditioned VLA framework that encourages precise spatial grounding through visibility-aware prediction of future end-effector traces. To enable reproducible evaluation with cross-embodiment demonstrations, we introduce and release RoboCasa-DC, a demo-conditioned extension of RoboCasa with episode-paired humanoid videos. Experiments on RoboCasa-DC and a real-world benchmark, where a Franka Panda arm is conditioned on human demonstrations, show that SeeTraceAct outperforms baselines, achieving the best success rate across all four RoboCasa-DC settings and improving real-world average success by 12.5 percentage points.
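To make "visibility-aware prediction of future end-effector traces" more concrete, here is a minimal sketch of one way such a training objective could be written. It is not taken from the paper: the function name `visibility_aware_trace_loss`, the 2D image-space trace representation, and the specific combination of a masked squared error with a binary cross-entropy term are all assumptions made purely for illustration.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float]


def visibility_aware_trace_loss(
    pred_points: Sequence[Point],
    pred_vis_prob: Sequence[float],
    true_points: Sequence[Point],
    true_visible: Sequence[bool],
    vis_weight: float = 1.0,
) -> float:
    """Hypothetical loss for a predicted end-effector trace.

    Position error is averaged only over steps where the end effector is
    visible; a binary cross-entropy term supervises the predicted
    visibility probability at every step. (Illustrative assumption, not
    the paper's objective.)
    """
    n = len(true_points)
    if n == 0 or not (len(pred_points) == len(pred_vis_prob) == len(true_visible) == n):
        raise ValueError("all inputs must be non-empty and of equal length")

    eps = 1e-7  # clamp probabilities to keep log() finite
    pos_sum, n_visible, bce_sum = 0.0, 0, 0.0
    for (px, py), p, (tx, ty), visible in zip(
        pred_points, pred_vis_prob, true_points, true_visible
    ):
        p = min(max(p, eps), 1.0 - eps)
        bce_sum += -math.log(p) if visible else -math.log(1.0 - p)
        if visible:
            # Occluded steps have no reliable 2D target, so they are skipped here.
            pos_sum += (px - tx) ** 2 + (py - ty) ** 2
            n_visible += 1

    pos_loss = pos_sum / n_visible if n_visible else 0.0
    return pos_loss + vis_weight * bce_sum / n
```

Under these assumptions, a step where the end effector is occluded contributes nothing to the position term, so the model is not penalized for a trace point it cannot observe, while it is still trained to predict that the point is hidden.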

Problem

Research questions and friction points this paper is trying to address.

vision-language-action models

one-shot demonstration

spatial grounding

cross-embodiment

robot policy

Innovation

Methods, ideas, or system contributions that make the work stand out.

visibility-aware planning

demo-conditioned VLA

cross-embodiment demonstration