🤖 AI Summary
Existing diffusion-based visuomotor policies model SE(3) poses as flat Euclidean vectors, which leads to manifold drift, broken equivariance, and non-geodesic trajectories. To address these issues, this work proposes the Lie Diffuser Actor (LDA), which defines an intrinsic diffusion process directly on the SE(3) manifold. LDA injects noise via a left-invariant stochastic differential equation, predicts scores in the tangent space, and maps samples back to the manifold using the exponential map. By construction, this formulation keeps samples on the manifold, and the authors state that it also guarantees coordinate-frame equivariance and geodesic optimality of the generated trajectories. On the CALVIN ABC→D benchmark, LDA improves average task length by 7.3% (from 3.27 to 3.51), and in real-robot experiments it outperforms the baseline on the majority of tasks.
📝 Abstract
Diffusion-based Vision-Language-Action policies achieve remarkable success in robotic manipulation, yet they commit a fundamental geometric error we term the $\textbf{Euclidean Fallacy}$: representing SE(3) poses as flat $\mathbb{R}^{12}$ vectors. This approximation induces (1) manifold drift that violates SO(3) constraints, (2) broken equivariance under coordinate transformations, and (3) non-geodesic trajectories with excessive kinematic cost. We introduce $\textbf{Lie Diffuser Actor (LDA)}$, a diffusion framework operating intrinsically on SE(3). Our method injects noise through left-invariant SDEs, predicts scores in the tangent space, and retracts samples via the exponential map. This formulation eliminates manifold drift by construction while guaranteeing coordinate-frame equivariance and geodesic optimality. On CALVIN ABC$\rightarrow$D, LDA improves average task length from $3.27$ to $3.51$ ($+7.3\%$). We further validate our method on a real robot, where it outperforms the baseline on the majority of tasks.
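The contrast between flat $\mathbb{R}^{12}$ noising and noising through the exponential map can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function names (`hat`, `exp_se3`, `diffuse_on_se3`, `diffuse_euclidean`, `orthogonality_error`), the geometric Euler-Maruyama discretization, the isotropic noise scale `sigma`, and the step count are all assumptions chosen for illustration. The sketch only shows the forward noising step; the learned tangent-space score model is omitted.

```python
import numpy as np

def hat(xi):
    """Map a 6-vector (v, w) to its 4x4 matrix in the Lie algebra se(3)."""
    v, w = xi[:3], xi[3:]
    X = np.zeros((4, 4))
    X[:3, :3] = np.array([[0.0, -w[2], w[1]],
                          [w[2], 0.0, -w[0]],
                          [-w[1], w[0], 0.0]])
    X[:3, 3] = v
    return X

def exp_se3(xi):
    """Closed-form exponential map from se(3) to SE(3)."""
    v, w = xi[:3], xi[3:]
    theta = np.linalg.norm(w)
    W = hat(xi)[:3, :3]
    if theta < 1e-8:
        # Taylor expansion near zero rotation to avoid division by zero.
        R = np.eye(3) + W + 0.5 * W @ W
        V = np.eye(3) + 0.5 * W + W @ W / 6.0
    else:
        A = np.sin(theta) / theta
        B = (1.0 - np.cos(theta)) / theta**2
        C = (theta - np.sin(theta)) / theta**3
        R = np.eye(3) + A * W + B * W @ W
        V = np.eye(3) + B * W + C * W @ W
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = V @ v
    return T

def diffuse_on_se3(T, steps, sigma, rng):
    """Assumed discretization of a left-invariant SDE: each step right-multiplies
    the pose by exp of a Gaussian tangent vector, so the result stays in SE(3)."""
    for _ in range(steps):
        T = T @ exp_se3(sigma * rng.standard_normal(6))
    return T

def diffuse_euclidean(T, steps, sigma, rng):
    """Flat baseline: treat (R, t) as a vector in R^12 and add Gaussian noise."""
    flat = np.concatenate([T[:3, :3].ravel(), T[:3, 3]])
    for _ in range(steps):
        flat = flat + sigma * rng.standard_normal(12)
    out = np.eye(4)
    out[:3, :3] = flat[:9].reshape(3, 3)
    out[:3, 3] = flat[9:]
    return out

def orthogonality_error(T):
    """Frobenius norm of R^T R - I; zero for a valid rotation block."""
    R = T[:3, :3]
    return np.linalg.norm(R.T @ R - np.eye(3))

T0 = exp_se3(np.array([0.1, -0.2, 0.3, 0.4, 0.5, -0.6]))
T_lie = diffuse_on_se3(T0, 50, 0.1, np.random.default_rng(0))
T_euc = diffuse_euclidean(T0, 50, 0.1, np.random.default_rng(0))
print("SE(3) noising, orthogonality error:", orthogonality_error(T_lie))
print("R^12 noising, orthogonality error:", orthogonality_error(T_euc))
```

In this sketch the on-manifold sample keeps a valid rotation block up to floating-point error, while the flat baseline drifts away from SO(3). Because the noise is applied by right multiplication, changing the reference frame with a fixed transform `G` commutes with noising: `diffuse_on_se3(G @ T0, ...)` equals `G @ diffuse_on_se3(T0, ...)` when both calls use the same random seed.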