Proc4Gem: Foundation models for physical agency through procedural generation

📅 2025-03-11
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the longstanding decoupling of physical interaction and semantic understanding in embodied intelligence. Methodologically, it introduces a simulation-to-real training paradigm that jointly integrates contact dynamics and environment semantics. Specifically, it employs procedural generation to synthesize contact-rich, semantically diverse trajectories with accurate physics; combines rigid-body simulation with photorealistic rendering; and leverages multimodal instruction tuning to establish a physically grounded, semantically aligned training pipeline. Crucially, it demonstrates zero-shot transfer of a multimodal foundation model (Gemini), fine-tuned purely on simulation data, to real-world contact manipulation on a quadruped robot, enabling language-guided behaviors such as pushing unseen objects to novel target locations. Experiments show that the model completes complex physical interaction tasks without any real-world fine-tuning, validating that simulation-driven training can endow foundation models with genuine physical agency in the real world.

📝 Abstract
In robot learning, it is common to either ignore the environment semantics, focusing on tasks like whole-body control which only require reasoning about robot-environment contacts, or conversely to ignore contact dynamics, focusing on grounding high-level movement in vision and language. In this work, we show that advances in generative modeling, photorealistic rendering, and procedural generation allow us to tackle tasks requiring both. By generating contact-rich trajectories with accurate physics in semantically-diverse simulations, we can distill behaviors into large multimodal models that directly transfer to the real world: a system we call Proc4Gem. Specifically, we show that a foundation model, Gemini, fine-tuned on only simulation data, can be instructed in language to control a quadruped robot to push an object with its body to unseen targets in unseen real-world environments. Our real-world results demonstrate the promise of using simulation to imbue foundation models with physical agency. Videos can be found at our website: https://sites.google.com/view/proc4gem
Problem

Research questions and friction points this paper is trying to address.

Integrate environment semantics and contact dynamics in robot learning.
Use generative modeling and procedural generation for realistic simulations.
Transfer simulation-trained behaviors to real-world robotic tasks via language instructions.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Generative modeling for contact-rich trajectories
Procedural generation in diverse simulations
Language-instructed real-world robot control
Yixin Lin
DeepMind
Jan Humplik
Research Scientist, DeepMind
AI, Robotics
Sandy H. Huang
Google DeepMind
Leonard Hasenclever
Research Scientist at DeepMind
Machine Learning, Reinforcement Learning, Statistics
Francesco Romano
Google DeepMind
Stefano Saliceti
Google DeepMind
Daniel Zheng
Google DeepMind
José Enrique Chen
Google DeepMind
Catarina Barros
Google DeepMind
Adrian Collister
Google DeepMind
Matt Young
Google DeepMind
Adil Dostmohamed
Google DeepMind
Ben Moran
Google DeepMind
Ken Caluwaerts
Google DeepMind
Robotics, Reinforcement Learning, Machine Learning, Legged Locomotion, Tensegrity
M. Giustina
Google DeepMind
Joss Moore
Google DeepMind
Kieran Connell
Google DeepMind
Francesco Nori
Google DeepMind
N. Heess
Google DeepMind
Steven Bohez
Google DeepMind
deep learning, reinforcement learning, robotics
Arunkumar Byravan
Google DeepMind