🤖 AI Summary
This work addresses the limited cross-domain reusability of sparse 3D encoders, which are typically learned through downstream task objectives and are therefore tied to a particular data distribution, policy architecture, and action parameterization. The authors propose Sparse2Act, an observation-action alignment framework that uses task-space end-effector actions as compact geometric supervision. During pretraining, masked sparse 3D tokens are aligned with the workspace motion paired with each observation; downstream policies reuse only the encoder initialization and keep their own architectures and action spaces. The method reaches 86.9% average success on LIBERO-10 after 500 fine-tuning steps, transfers from LIBERO to Meta-World-5 with 73.4% average success, and achieves 72.5% average success across four real-world tasks after simulation pretraining and limited real-data fine-tuning.
📝 Abstract
Explicit 3D representations are attractive for manipulation because they expose object shape, workspace geometry, and robot-object relations in metric coordinates. However, sparse 3D encoders are often learned through downstream task objectives, tying the representation to a particular data distribution, policy architecture, and action parameterization. We introduce Sparse2Act, an observation-action alignment framework for pretraining sparse point-cloud encoders. The key idea is to use task-space end-effector actions as geometric supervision: masked sparse 3D tokens are trained to organize scene features around the workspace motion paired with the observation. After pretraining, only the encoder initialization is reused by downstream policies, allowing them to retain their own architectures and action spaces, including joint-space commands. On the LIBERO-10 benchmark, our method achieves 86.9% average success after 500 fine-tuning steps. The same pretrained encoder supports LIBERO-to-Meta-World cross-domain transfer, achieving 73.4% average success on the Meta-World-5 benchmark. Ablations on the objective and decoder capacity show that the gains come from the masked action-alignment signal and remain useful across downstream action decoders. In real-world experiments, simulation pretraining followed by limited real-data fine-tuning achieves an average success rate of 72.5% across four tasks, demonstrating effective sim-to-real transfer. These results suggest that robot actions can provide compact geometric supervision for reusable sparse 3D representations.
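The pretraining idea in the abstract (mask sparse 3D tokens, then supervise the encoder with the paired task-space action) can be illustrated with a toy NumPy sketch. This is not the authors' implementation: the abstract does not specify the tokenizer, encoder, or loss, so the voxel tokenizer, the linear encoder, the mean pooling, the MSE regression objective, the 7-D action layout, and all function names below are assumptions chosen only to make the data flow concrete.

```python
import numpy as np


def voxelize(points, voxel_size):
    """Quantize an (N, 3) point cloud into sparse tokens: one metric-space
    center per occupied voxel. (Assumed tokenizer, for illustration only.)"""
    coords = np.floor(points / voxel_size).astype(np.int64)
    occupied = np.unique(coords, axis=0)
    return (occupied + 0.5) * voxel_size


def mask_tokens(tokens, mask_ratio, rng):
    """Hide a random fraction of tokens. Returns the visible tokens and a
    boolean mask over the original tokens (True = visible)."""
    n = tokens.shape[0]
    n_keep = max(1, int(round(n * (1.0 - mask_ratio))))
    keep = np.zeros(n, dtype=bool)
    keep[rng.permutation(n)[:n_keep]] = True
    return tokens[keep], keep


def encode(tokens, w_enc):
    """Toy encoder: per-token linear layer with ReLU, then mean pooling
    into a single scene feature vector."""
    return np.maximum(tokens @ w_enc, 0.0).mean(axis=0)


def action_alignment_loss(scene_feat, w_head, action):
    """Assumed objective: mean squared error between an action predicted
    from the scene feature and the recorded end-effector action."""
    pred = scene_feat @ w_head
    return float(np.mean((pred - action) ** 2))


def demo_loss(seed=0):
    """Run one forward pass on synthetic data and return the loss."""
    rng = np.random.default_rng(seed)
    points = rng.uniform(-0.5, 0.5, size=(2048, 3))    # synthetic scene
    action = rng.normal(size=7)                        # assumed 7-D action
    tokens = voxelize(points, voxel_size=0.05)
    visible, _ = mask_tokens(tokens, mask_ratio=0.6, rng=rng)
    w_enc = rng.normal(size=(3, 32)) * 0.1             # encoder weights
    w_head = rng.normal(size=(32, 7)) * 0.1            # pretraining-only head
    return action_alignment_loss(encode(visible, w_enc), w_head, action)
```

In this sketch, only `w_enc` plays the role of the encoder that would be handed to a downstream policy; `w_head` stands in for a pretraining-only component that is discarded, which mirrors the abstract's statement that only the encoder initialization is reused.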