🤖 AI Summary
This work addresses the challenge of poor generalization in robotic imitation learning when deploying policies across heterogeneous tasks, where models tend to average over multimodal demonstrations. To overcome this, the authors propose LAR-MoE, a two-stage framework that first constructs a joint latent space between observations and future actions via teacher-student co-training, then uses this latent representation to guide unsupervised expert routing in a Mixture-of-Experts (MoE) architecture for skill decomposition and efficient policy learning. The key innovation is integrating latent-space alignment into the MoE routing mechanism, which enables expert specialization without supervision, preventing routing collapse while maintaining parameter efficiency. Experiments demonstrate that LAR-MoE achieves a 95.2% average success rate on the LIBERO benchmark with only 150 million parameters, and that it enables zero-shot transfer to ex vivo porcine tissue in a surgical bowel-grasping task, matching the performance of supervised MoE approaches.
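The first stage described above (a joint latent space between observations and future actions, learned by teacher-student co-training) can be sketched in a few lines. This is a hypothetical NumPy illustration, not the paper's implementation: the linear encoders `W_student`/`W_teacher`, the shapes, and the plain MSE agreement loss are all assumptions made for clarity, and the real method presumably adds regularization to avoid the trivial all-zero latent.

```python
import numpy as np

# Hypothetical stage-1 sketch: a student encoder maps observations and a
# teacher encoder maps future actions into one shared latent space, and
# both are trained to agree there (names/shapes are illustrative only).
rng = np.random.default_rng(0)
d_obs, d_act, d_latent = 8, 3, 4
W_student = rng.normal(size=(d_obs, d_latent)) * 0.5   # observation encoder
W_teacher = rng.normal(size=(d_act, d_latent)) * 0.5   # future-action encoder

def cotrain_step(obs, fut_act, lr=0.05):
    """One joint gradient step on the latent-agreement (MSE) loss.

    Only the alignment objective is shown; collapse-prevention terms
    from the actual method are omitted in this sketch.
    """
    global W_student, W_teacher
    z_s = obs @ W_student        # student latent from observations
    z_t = fut_act @ W_teacher    # teacher latent from future actions
    err = z_s - z_t              # disagreement in the joint latent space
    loss = float((err ** 2).mean())
    scale = 2.0 / err.size       # gradient of the mean-squared error
    W_student -= lr * scale * (obs.T @ err)
    W_teacher += lr * scale * (fut_act.T @ err)
    return loss

# Toy data: paired observations and future actions from demonstrations.
obs = rng.normal(size=(16, d_obs))
fut_act = rng.normal(size=(16, d_act))
losses = [cotrain_step(obs, fut_act) for _ in range(100)]
```

Running the loop drives the two encoders toward a shared representation; the agreement loss shrinks as the latents align.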
📝 Abstract
Imitation learning enables robots to acquire manipulation skills from demonstrations, yet deploying a policy across tasks with heterogeneous dynamics remains challenging, as models tend to average over distinct behavioral modes present in the demonstrations. Mixture-of-Experts (MoE) architectures address this by activating specialized subnetworks, but require meaningful skill decompositions for expert routing. We introduce Latent-Aligned Routing for Mixture of Experts (LAR-MoE), a two-stage framework that decouples unsupervised skill discovery from policy learning. In pre-training, we learn a joint latent representation between observations and future actions through student-teacher co-training. In a post-training stage, the expert routing is regularized to follow the structure of the learned latent space, preventing expert collapse while maintaining parameter efficiency. We evaluate LAR-MoE in simulation and on hardware. On the LIBERO benchmark, our method achieves a 95.2% average success rate with 150M parameters. On a surgical bowel grasping and retraction task, LAR-MoE matches a supervised MoE baseline without requiring any phase annotations, and transfers zero-shot to ex vivo porcine tissue. Our findings suggest that latent-aligned routing provides a principled alternative to supervised skill decomposition, enabling structured expert specialization from unlabeled demonstrations.
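The post-training idea, routing regularized to follow the learned latent structure, can also be sketched concretely. The following NumPy snippet is an illustrative assumption, not the authors' code: the linear experts, the router, the latent cluster `centroids`, and the choice of a KL penalty between the routing distribution and a soft latent cluster assignment are all stand-ins for whichever networks and regularizer the paper actually uses.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

# Hypothetical latent-aligned MoE sketch (shapes/weights are illustrative):
# each expert is a small linear map, the router mixes their outputs, and an
# auxiliary loss pulls the routing distribution toward soft cluster
# assignments in a pretrained latent space.
rng = np.random.default_rng(0)
n_experts, d_obs, d_latent, d_act = 4, 8, 6, 3
W_router = rng.normal(size=(d_obs, n_experts)) * 0.1
experts = rng.normal(size=(n_experts, d_obs, d_act)) * 0.1
centroids = rng.normal(size=(n_experts, d_latent))  # latent cluster centers

def route(obs):
    """Mixture weights over experts, computed from the observation."""
    return softmax(obs @ W_router)

def policy(obs):
    """Gate-weighted combination of the experts' action predictions."""
    gates = route(obs)                                  # (B, E)
    acts = np.einsum('bd,eda->bea', obs, experts)       # (B, E, A)
    return np.einsum('be,bea->ba', gates, acts)         # (B, A)

def alignment_loss(obs, z):
    """KL(latent soft-assignment || router): routing follows latent structure.

    z is the pretrained latent code for each observation; minimizing this
    term specializes experts per latent cluster without phase labels.
    """
    gates = route(obs)
    dist = ((z[:, None, :] - centroids[None]) ** 2).sum(-1)  # (B, E)
    target = softmax(-dist)  # soft assignment of z to cluster centroids
    kl = (target * (np.log(target + 1e-9) - np.log(gates + 1e-9))).sum(-1)
    return float(kl.mean())
```

In training, `alignment_loss` would be added to the imitation objective so that each expert is consistently selected for one region of the latent space, which is one plausible way such a scheme could prevent expert collapse.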