From Zero to Hero: Training-Free Custom Concept Spawning in World Models

📅 2026-06-01

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing autoregressive world models cannot incorporate user-specified visual concepts on demand during interactive navigation: unseen regions are filled in solely from the model's priors, leaving the user no control over scene construction. This work proposes SPAWN, the first method capable of precisely injecting and propagating arbitrary visual concepts into pre-trained models without any additional training. Concepts can be specified via images or text and span multiple scales, from characters to landmarks. By combining a windowed latent injection mechanism with a first-slot anchor swapping strategy within the context memory of image-to-video backbone models, SPAWN embeds external concepts into the generative process while keeping lighting, scale, and viewpoint consistent and preserving spatiotemporal coherence. Experiments demonstrate the feasibility and effectiveness of training-free concept injection for interactive video generation.

📝 Abstract

Autoregressive world models have emerged as a powerful paradigm for interactive video generation, allowing users to navigate dynamically generated environments through actions. These models are typically conditioned on a text prompt and/or a single reference frame, from which the entire world is generated. Yet the moment the user navigates beyond what is visible in that frame, the unseen regions are populated by the base model's priors, with no mechanism for the user to specify what should appear and where. This is a fundamental limitation for applications such as gaming, interactive storytelling, and simulation, where controllable scene composition is essential. We refer to this missing capability as concept spawning: introducing a user-specified visual concept into a world model, analogous to spawning in a game engine. We introduce SPAWN (Swapping Pinned Anchor with Windowed iNjection), a training-free method for concept spawning. SPAWN exploits a structural property of image-to-video backbones: the first slot of the context memory is pinned to the reference frame and acts as a foundational anchor for every generated chunk. By swapping this anchor with an external concept latent over a short injection window and then letting the original anchor return, we cause the concept to propagate naturally through the rollout via the model's own memory. SPAWN supports concepts from fine-grained entities such as characters and props to large-scale elements such as buildings and landmarks, and accepts either a concept image or a text description as input. Experiments show that SPAWN integrates concepts with consistent lighting, scale, and perspective while preserving identity and temporal coherence, demonstrating that controllable concept spawning is achievable in existing autoregressive world models without any training.
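
The anchor-swapping idea described in the abstract can be illustrated with a toy rollout loop. This is only a minimal sketch under stated assumptions, not the authors' implementation: the names `InjectionWindow`, `rollout_with_spawn`, and `toy_generate_chunk` are hypothetical, latents are plain lists of floats rather than tensors, and the "model" is a stand-in that averages its context. The only behavior taken from the abstract is that slot 0 of the context memory normally holds the reference-frame anchor, is replaced by the concept latent for a short window of chunks, and then reverts.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Latent = List[float]  # toy stand-in for a latent tensor (assumption)


@dataclass
class InjectionWindow:
    """Hypothetical description of when the concept occupies slot 0."""
    start: int   # first chunk index at which the concept replaces the anchor
    length: int  # number of consecutive chunks that see the concept

    def active(self, chunk_idx: int) -> bool:
        return self.start <= chunk_idx < self.start + self.length


def toy_generate_chunk(context: List[Latent]) -> Latent:
    """Stand-in for the backbone: element-wise mean of the context latents."""
    dim = len(context[0])
    return [sum(lat[d] for lat in context) / len(context) for d in range(dim)]


def rollout_with_spawn(
    anchor: Latent,
    concept: Latent,
    window: InjectionWindow,
    num_chunks: int,
    generate_chunk: Callable[[List[Latent]], Latent] = toy_generate_chunk,
    memory_size: int = 4,
) -> Tuple[List[Latent], List[str]]:
    """Autoregressive rollout in which slot 0 is swapped during the window."""
    recent: List[Latent] = []    # previously generated chunks
    chunks: List[Latent] = []
    first_slots: List[str] = []  # records what occupied slot 0 per chunk
    keep = memory_size - 1       # slots left over for recent chunks
    for i in range(num_chunks):
        use_concept = window.active(i)
        first_slot = concept if use_concept else anchor
        history = recent[-keep:] if keep > 0 else []
        # Slot 0 is the (possibly swapped) anchor, followed by recent chunks.
        chunk = generate_chunk([first_slot] + history)
        chunks.append(chunk)
        recent.append(chunk)
        first_slots.append("concept" if use_concept else "anchor")
    return chunks, first_slots
```

With the toy generator, an anchor of `[0.0]`, and a concept of `[1.0]`, the chunks produced after the window closes remain nonzero even though slot 0 has reverted to the original anchor. This mirrors, in a very simplified way, the abstract's claim that the concept propagates through the model's own memory once it has been injected.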

Problem

Research questions and friction points this paper is trying to address.

concept spawning

world models

interactive video generation

controllable scene composition

autoregressive models

Innovation

Methods, ideas, or system contributions that make the work stand out.

concept spawning

training-free

autoregressive world models