Learning to play: A Multimodal Agent for 3D Game-Play

📅 2025-10-19
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Real-time multimodal reasoning in 3D first-person video games lacks text-driven agents capable of cross-game generalization. Method: We propose the first text-conditioned behavioral cloning framework for cross-game generalization, comprising: (1) a large-scale, diverse 3D gameplay dataset with fine-grained natural language instruction annotations; (2) an inverse dynamics model that synthesizes pseudo-action sequences from unlabeled video data, substantially expanding training coverage; and (3) a lightweight, custom neural architecture enabling end-to-end, real-time text-to-action inference on consumer-grade GPUs. Results: Our agent accurately interprets natural language instructions and generates semantically appropriate actions across multiple unseen 3D games. It achieves, for the first time, low-latency, closed-loop, cross-game multimodal interaction—establishing a scalable training paradigm and benchmark resource for embodied AI.

Technology Category

Application Category

📝 Abstract
We argue that 3-D first-person video games are a challenging environment for real-time multi-modal reasoning. We first describe our dataset of human game-play, collected across a large variety of 3-D first-person games, which is both substantially larger and more diverse compared to prior publicly disclosed datasets, and contains text instructions. We demonstrate that we can learn an inverse dynamics model from this dataset, which allows us to impute actions on a much larger dataset of publicly available videos of human game play that lack recorded actions. We then train a text-conditioned agent for game playing using behavior cloning, with a custom architecture capable of realtime inference on a consumer GPU. We show the resulting model is capable of playing a variety of 3-D games and responding to text input. Finally, we outline some of the remaining challenges such as long-horizon tasks and quantitative evaluation across a large set of games.
Problem

Research questions and friction points this paper is trying to address.

Developing multimodal agents for 3D game-play
Learning inverse dynamics models from human gameplay data
Training real-time text-conditioned agents via behavior cloning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Inverse dynamics model imputes actions on videos
Behavior cloning trains text-conditioned game agent
Custom architecture enables realtime GPU inference
🔎 Similar Papers
No similar papers found.