🤖 AI Summary
This work addresses two obstacles in humanoid loco-manipulation: dependence on costly teleoperation data and poor sim-to-real transfer. It introduces LEGS, a hybrid simulation framework that brings 3D Gaussian Splatting (3DGS) into humanoid robot training and decouples demonstration data from scene appearance by combining a mesh foreground, driven by procedurally generated motion primitives, with photorealistic 3DGS-reconstructed backgrounds. A two-stage color calibration aligns rendered images with the robot's deployment camera, and because motion is recorded independently of appearance, the same demonstrations can be re-rendered at low cost to improve policy generalization. Vision-language-action (VLA) policies fine-tuned exclusively on LEGS-synthesized data match or surpass those trained on teleoperation data on three pick-and-place tasks with the Unitree G1 robot. The LEGS-trained policies also maintain task success under combined shifts in scene and object appearance, whereas teleoperation-based baselines fail completely.
📝 Abstract
Training vision-language-action (VLA) policies for humanoid loco-manipulation is constrained by the high cost and complexity of collecting human teleoperation demonstrations. VLA policies fine-tuned in simulators have, until now, failed to transfer effectively to real humanoid loco-manipulation tasks. We present LEGS (Loco-manipulation via Embodied Gaussian Splatting), a hybrid simulator that composites a mesh foreground (robot, objects, props) over a photorealistic 3D Gaussian Splatting (3DGS) background reconstructed from a handheld scene capture. LEGS uses a procedural motion-primitive generator to synthesize labeled demonstrations at scale without human teleoperation, and a deterministic two-stage color calibration to align the rendered 3DGS image to the robot's deployment camera. On a Unitree G1 humanoid robot, across three pick-and-place tasks of increasing whole-body difficulty and three VLA backbones (psi_0, pi_0.5, GR00T N1.6), a policy trained purely on LEGS data matches or exceeds one trained on human teleoperation demonstrations in every experiment. It also outperforms a mesh-only simulation baseline that ablates the 3DGS background, showing that photorealistic rendering is a key enabler for synthetic data transfer. Because LEGS records humanoid motion independently of scene appearance, the same auto-generated demonstrations can be re-rendered under new backgrounds and object meshes to augment training data for robustness to scene variations, covering a new scene at more than 15x lower cost than teleoperation. Under combined object-and-scene appearance shift, the policy trained on re-rendered LEGS-AUG data maintains task success while the baseline trained on teleoperation data fails entirely. Our project page is located at https://legsvla.github.io/.
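To make the compositing step concrete, the following is a minimal sketch of placing a rendered mesh foreground over a rendered background using the standard alpha "over" operator. It is an illustration only: the abstract does not specify how LEGS performs compositing (depth handling, shadows, and the two-stage color calibration are not modeled here), and the function name `composite_over` and its parameters `fg_rgb`, `fg_alpha`, and `bg_rgb` are hypothetical.

```python
import numpy as np


def composite_over(fg_rgb, fg_alpha, bg_rgb):
    """Blend a foreground over an opaque background with the 'over' operator.

    Hypothetical helper, not the LEGS implementation.
    fg_rgb:   (H, W, 3) foreground colors in [0, 1] (e.g. a mesh render)
    fg_alpha: (H, W)    foreground coverage in [0, 1]
    bg_rgb:   (H, W, 3) background colors in [0, 1] (e.g. a 3DGS render)
    """
    fg_rgb = np.asarray(fg_rgb, dtype=np.float32)
    bg_rgb = np.asarray(bg_rgb, dtype=np.float32)
    # Add a trailing axis so alpha broadcasts across the three color channels.
    alpha = np.asarray(fg_alpha, dtype=np.float32)[..., None]
    # Per pixel: out = alpha * foreground + (1 - alpha) * background.
    return alpha * fg_rgb + (1.0 - alpha) * bg_rgb


# Usage: a white foreground covering the left half of a black background.
height, width = 4, 6
foreground = np.ones((height, width, 3), dtype=np.float32)
background = np.zeros((height, width, 3), dtype=np.float32)
coverage = np.zeros((height, width), dtype=np.float32)
coverage[:, : width // 2] = 1.0
frame = composite_over(foreground, coverage, background)
```

In a sketch like this the background array is just an input, so swapping in a different background while reusing the same foreground render is a single function call. This mirrors, in a simplified way, why recording motion independently of scene appearance makes re-rendering cheap.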