🤖 AI Summary
Existing robotic action embedding spaces often lack structure, interpretability, and cross-embodiment transferability, and remain largely confined to specific tasks and platforms. This work proposes a phase-anchored disentangled representation framework that models motion periodicity through an FFT-parameterized phase manifold, complemented by an aperiodic pose branch and motion-semantic distillation, to construct a unified action embedding space. Anchoring multiple humanoid robots to this human-motion-pretrained manifold yields a general-purpose representation that applies across diverse humanoid platforms. The resulting embeddings are interpretable by design, achieve strong cross-embodiment action retrieval, and give consistent gains on downstream robot tasks.
📝 Abstract
Learning a good action embedding space is fundamental to scalable robot policy learning, yet existing methods treat action latents as task-specific intermediates rather than as representations in their own right. The resulting latents are unstructured, embodiment-specific, and weakly tied to motion semantics, limiting interpretability, controllability, and transferability across robots. We position the action embedding space itself as a first-class design target, with downstream policy quality emerging from representation quality. Exploiting the intrinsic periodicity of motion, we factorize motion into a phase manifold that captures cyclic structure via FFT-parametric coefficients, together with a pose branch that conditions the manifold on non-periodic configuration detail. Combined with motion-semantic distillation, this factorized structure yields a cross-embodiment motion manifold that is interpretable and embodiment-agnostic by design. Anchoring multiple humanoid robots to a shared human-pretrained manifold then produces a unified action embedding space across diverse platforms, achieving strong cross-embodiment retrieval and consistent gains on downstream robot tasks.
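To make the phrase "FFT-parametric coefficients" concrete, here is a minimal NumPy sketch of one way periodic parameters can be read off a latent motion signal. The abstract does not specify the parameterization, so everything below is an assumption made for illustration: the function names `fft_phase_parameters` and `phase_manifold_points`, the choice of the dominant non-DC frequency bin, and the 2D (amplitude, phase) embedding per channel are hypothetical and are not the authors' implementation.

```python
import numpy as np


def fft_phase_parameters(latent, dt):
    """Estimate periodic parameters for each latent channel.

    Hypothetical helper (not from the paper). Each row of `latent` is
    modelled as offset + amplitude * cos(2*pi*frequency*t + phase), and the
    parameters are read from the dominant non-DC bin of a real FFT.

    latent: array of shape (channels, samples)
    dt:     sampling interval in seconds
    """
    latent = np.asarray(latent, dtype=float)
    n_channels, n_samples = latent.shape

    spectrum = np.fft.rfft(latent, axis=1)    # complex coefficients per channel
    freqs = np.fft.rfftfreq(n_samples, d=dt)  # bin centre frequencies in Hz

    # Pick the strongest bin per channel, skipping the DC bin at index 0.
    power = np.abs(spectrum[:, 1:]) ** 2
    k = np.argmax(power, axis=1) + 1
    rows = np.arange(n_channels)
    dominant = spectrum[rows, k]

    return {
        "amplitude": 2.0 * np.abs(dominant) / n_samples,  # exact below Nyquist
        "frequency": freqs[k],
        "offset": spectrum[:, 0].real / n_samples,        # mean of the signal
        "phase": np.angle(dominant),                      # radians in (-pi, pi]
    }


def phase_manifold_points(params):
    """Map each channel to a 2D point: radius = amplitude, angle = phase.

    This 2D layout is an assumed, illustrative form of a phase manifold.
    """
    a, p = params["amplitude"], params["phase"]
    return np.stack([a * np.cos(p), a * np.sin(p)], axis=1)


# Usage: a single synthetic channel sampled at 32 Hz for 2 seconds.
t = np.arange(64) / 32.0
signal = 0.3 + 1.5 * np.cos(2 * np.pi * 2.0 * t + 0.7)
demo_params = fft_phase_parameters(signal[None, :], dt=1 / 32.0)
demo_points = phase_manifold_points(demo_params)
```

For a sinusoid whose frequency falls exactly on an FFT bin, as in the usage lines, this recovers the amplitude, frequency, offset, and phase up to floating-point error. Real motion latents are neither single-frequency nor bin-aligned, so a learned model would likely need a softer estimate (for example, a power-weighted mean frequency) and the separate pose branch described above for the non-periodic part.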