🤖 AI Summary
Robots struggle to generalize skills across diverse scenes from a single demonstration, largely because they lack a transferable and interpretable task-space representation. To address this, we propose TReF-6, a method that automatically constructs a 6DoF Task-Relevant Frame (TRF) from a single demonstration trajectory. TReF-6 identifies an influence point via geometric trajectory analysis to define the origin of a local coordinate frame, and integrates a vision-language model with Grounded-SAM for semantic grounding and scene-adaptive alignment. This TRF extends Dynamic Movement Primitives (DMPs) beyond conventional start–end imitation, enabling functionally consistent geometric–semantic transfer that preserves task intent during generalization. Experiments demonstrate robustness to trajectory noise in simulation and successful end-to-end deployment on a physical robot, with effective generalization across varied object configurations in one-shot cross-scene manipulation.
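As an illustration of the trajectory-geometry step, the sketch below picks the point of maximum discrete curvature as the influence point and builds a right-handed local frame there. The curvature criterion, the axis convention, and the function names are assumptions for illustration; the paper's actual geometric analysis may differ.

```python
import numpy as np

def influence_point(traj):
    """Pick an influence point from trajectory geometry.

    traj: (N, 3) array of demonstrated end-effector positions.
    Here the point of maximum discrete curvature is used as a stand-in
    criterion; TReF-6's actual geometric analysis may differ.
    """
    d1 = np.gradient(traj, axis=0)                 # velocity estimate
    d2 = np.gradient(d1, axis=0)                   # acceleration estimate
    speed = np.linalg.norm(d1, axis=1) + 1e-9
    # Discrete curvature: kappa = |v x a| / |v|^3
    kappa = np.linalg.norm(np.cross(d1, d2), axis=1) / speed**3
    return int(np.argmax(kappa))

def local_frame(traj, idx):
    """Build a right-handed frame at the influence point.

    x follows the local motion direction, z the curvature normal;
    this is one plausible convention, not necessarily the paper's.
    """
    origin = traj[idx]
    d1 = np.gradient(traj, axis=0)
    d2 = np.gradient(d1, axis=0)
    x_axis = d1[idx] / (np.linalg.norm(d1[idx]) + 1e-9)
    n = d2[idx] - np.dot(d2[idx], x_axis) * x_axis  # normal component of acceleration
    z_axis = n / (np.linalg.norm(n) + 1e-9)
    y_axis = np.cross(z_axis, x_axis)
    R = np.stack([x_axis, y_axis, z_axis], axis=1)  # columns are frame axes
    return origin, R
```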
📝 Abstract
Robots often struggle to generalize from a single demonstration due to the lack of a transferable and interpretable spatial representation. In this work, we introduce TReF-6, a method that infers a simplified, abstracted 6DoF Task-Relevant Frame from a single trajectory. Our approach identifies an influence point purely from the trajectory geometry to define the origin of a local frame, which serves as a reference for parameterizing a Dynamic Movement Primitive (DMP). This influence point captures the task's spatial structure, extending the standard DMP formulation beyond start–goal imitation. The inferred frame is semantically grounded via a vision-language model and localized in novel scenes by Grounded-SAM, enabling functionally consistent skill generalization. We validate TReF-6 in simulation and demonstrate robustness to trajectory noise. We further deploy an end-to-end pipeline on real-world manipulation tasks, showing that TReF-6 supports one-shot imitation learning that preserves task intent across diverse object configurations.
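To illustrate how a task frame extends a DMP beyond start–goal imitation, the following minimal sketch rolls out a standard Ijspeert-style discrete DMP in local task-frame coordinates and maps the result back to the world through an assumed frame pose (`R`, `origin`), e.g. one recovered by localizing the grounded frame in a new scene. The function name, basis parameterization, and gains are hypothetical and not the paper's implementation.

```python
import numpy as np

def rollout_dmp_in_frame(y0, g, weights, centers, widths, R, origin,
                         alpha_z=25.0, beta_z=6.25, alpha_x=1.0,
                         tau=1.0, dt=0.01, steps=200):
    """Roll out a standard discrete DMP expressed in a local task frame.

    y0, g   : start and goal positions in the task frame, shape (3,)
    weights : (K, 3) forcing-term weights learned from the demonstration
    centers, widths : (K,) Gaussian basis parameters of the canonical phase
    R, origin : frame orientation (3, 3) and position (3,) in the new scene,
                assumed given (e.g. from semantic localization).
    Returns the reproduced trajectory in world coordinates, shape (steps, 3).
    """
    y, v, x = y0.astype(float).copy(), np.zeros(3), 1.0
    out = []
    for _ in range(steps):
        psi = np.exp(-widths * (x - centers) ** 2)            # Gaussian basis activations
        f = (psi @ weights) * x * (g - y0) / (psi.sum() + 1e-9)  # forcing term
        dv = (alpha_z * (beta_z * (g - y) - v) + f) / tau      # transformation system
        v += dv * dt
        y += v / tau * dt
        x += -alpha_x * x / tau * dt                           # canonical system decay
        out.append(origin + R @ y)                             # map back to world frame
    return np.array(out)
```

Because the demonstration is encoded relative to the inferred frame, relocating (`R`, `origin`) to a new object configuration reproduces the motion with the same spatial structure rather than merely stretching it between new start and goal points.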