SeeTraceAct: Visibility-Aware Latent Planning from Cross-Embodiment Demonstration Videos

📅 2026-06-01
📈 Citations: 0
Influential: 0

🤖 AI Summary
Existing vision-language-action models struggle to precisely localize small target regions when performing novel tasks from a single demonstration video, which lowers their success rates. This work proposes SeeTraceAct, a framework that improves spatial grounding under one-shot demonstration conditioning by combining visibility-aware end-effector trace prediction with latent-space planning. To support reproducible evaluation, the authors introduce RoboCasa-DC, a demo-conditioned extension of RoboCasa with episode-paired cross-embodiment (humanoid) demonstration videos. Experiments show that SeeTraceAct achieves the best success rate across all four RoboCasa-DC settings and improves average task success by 12.5 percentage points on a real Franka Panda robot conditioned on human demonstrations.
📝 Abstract
Vision-language-action models (VLAs) are promising general-purpose robot policies, but adapting them to new tasks typically requires costly task-specific teleoperation data. As an alternative, we study one-shot demo-conditioned VLAs, where a robot policy is conditioned on a single demonstration video of an unseen task. We find that existing end-to-end approaches often struggle when successful execution requires precisely localizing small target regions. To address this limitation, we propose SeeTraceAct, a demo-conditioned VLA framework that encourages precise spatial grounding through visibility-aware prediction of future end-effector traces. To enable reproducible evaluation with cross-embodiment demonstrations, we introduce and release RoboCasa-DC, a demo-conditioned extension of RoboCasa with episode-paired humanoid videos. Experiments on RoboCasa-DC and a real-world benchmark, where a Franka Panda arm is conditioned on human demonstrations, show that SeeTraceAct outperforms baselines, achieving the best success rate across all four RoboCasa-DC settings and improving real-world average success by 12.5 percentage points.
Problem

Research questions and friction points this paper is trying to address.

vision-language-action models
one-shot demonstration
spatial grounding
cross-embodiment
robot policy
Innovation

Methods, ideas, or system contributions that make the work stand out.

visibility-aware planning
demo-conditioned VLA
cross-embodiment demonstration
end-effector trace prediction
RoboCasa-DC
Jaehyeon Son
Georgia Institute of Technology
Junhyun Kim
Georgia Institute of Technology
Kyle Kam
Georgia Institute of Technology
Jeremiah Coholich
Georgia Institute of Technology
Seok Joon Kim
Georgia Institute of Technology
Jinhoo Kim
Georgia Institute of Technology
Chris Dongjoo Kim
Ai2
Machine Learning, Data Quality, Multimodal data, Real-time Post-Training
Jaemin Cho
PhD Student at UNC Chapel Hill
Multimodal Learning, Natural Language Processing, Machine Learning
Dieter Fox
University of Washington and AI2
Robotics, Artificial Intelligence, Computer Vision
Zsolt Kira
Associate Professor, Georgia Institute of Technology
Machine Learning, Perception, Robotics, Artificial Intelligence