Teleportraits: Training-Free People Insertion into Any Scene

📅 2025-10-07
🤖 AI Summary
This work addresses zero-shot, end-to-end person insertion: embedding a person from a single reference image into an arbitrary background scene while preserving identity, clothing, pose, and anatomical structure. The method builds on a pre-trained text-to-image diffusion model and jointly reasons about spatial layout, pose, and personalized appearance. It combines image inversion with classifier-free guidance to achieve affordance-aware global editing, and uses mask-guided self-attention for fine-grained identity fidelity. Crucially, no model fine-tuning or auxiliary training data is required. Experiments across diverse backgrounds demonstrate state-of-the-art performance: seamless foreground-background integration, photorealistic edge coherence, consistent illumination, and strong identity preservation, surpassing prior approaches in both qualitative and quantitative evaluations.
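
The overall recipe can be pictured as deterministic DDIM inversion of the scene followed by guided re-denoising. Below is a minimal, self-contained sketch of that pattern; the `NoisePredictor` stand-in, the toy noise schedule, and all tensor shapes are illustrative assumptions, not the paper's implementation:

```python
# Hedged sketch: DDIM inversion + classifier-free-guided re-denoising,
# assuming a generic epsilon-prediction diffusion model.
import torch
import torch.nn as nn

class NoisePredictor(nn.Module):
    """Stand-in for a pre-trained text-to-image U-Net (assumption)."""
    def __init__(self, channels=4):
        super().__init__()
        self.net = nn.Conv2d(channels, channels, 3, padding=1)

    def forward(self, x, t, cond):
        # A real model would also condition on timestep t and text embedding cond.
        return self.net(x)

T = 50                                            # number of DDIM steps (assumed)
alphas_cumprod = torch.linspace(0.9999, 0.01, T)  # toy schedule, not the real one

@torch.no_grad()
def ddim_invert(model, x0, cond):
    """Deterministic DDIM inversion: image latent -> noise latent."""
    x = x0
    for t in range(T - 1):
        a_t, a_next = alphas_cumprod[t], alphas_cumprod[t + 1]
        eps = model(x, t, cond)
        x0_pred = (x - (1 - a_t).sqrt() * eps) / a_t.sqrt()
        x = a_next.sqrt() * x0_pred + (1 - a_next).sqrt() * eps
    return x

@torch.no_grad()
def ddim_sample_cfg(model, x_T, cond, uncond, guidance_scale=7.5):
    """DDIM sampling from the inverted latent with classifier-free guidance."""
    x = x_T
    for t in reversed(range(1, T)):
        a_t, a_prev = alphas_cumprod[t], alphas_cumprod[t - 1]
        eps_c = model(x, t, cond)
        eps_u = model(x, t, uncond)
        eps = eps_u + guidance_scale * (eps_c - eps_u)  # CFG combination
        x0_pred = (x - (1 - a_t).sqrt() * eps) / a_t.sqrt()
        x = a_prev.sqrt() * x0_pred + (1 - a_prev).sqrt() * eps
    return x

model = NoisePredictor()
scene = torch.randn(1, 4, 64, 64)       # background scene latent (placeholder)
cond, uncond = None, None               # text embeddings in a real pipeline
x_T = ddim_invert(model, scene, uncond)           # invert the scene
edited = ddim_sample_cfg(model, x_T, cond, uncond)  # re-denoise with guidance
print(edited.shape)  # torch.Size([1, 4, 64, 64])
```

Because inversion is deterministic, re-denoising the inverted latent without guidance would roughly reconstruct the original scene; the guidance term is what pushes the trajectory toward a scene that now contains the person.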

📝 Abstract
The task of realistically inserting a human from a reference image into a background scene is highly challenging, requiring the model to (1) determine the correct location and pose of the person and (2) perform high-quality personalization conditioned on the background. Previous approaches often treat these as separate problems, overlooking their interconnection, and typically rely on training to achieve high performance. In this work, we introduce a unified training-free pipeline that leverages pre-trained text-to-image diffusion models. We show that diffusion models inherently possess the knowledge to place people in complex scenes without requiring task-specific training. By combining inversion techniques with classifier-free guidance, our method achieves affordance-aware global editing, seamlessly inserting people into scenes. Furthermore, our proposed mask-guided self-attention mechanism ensures high-quality personalization, preserving the subject's identity, clothing, and body features from just a single reference image. To the best of our knowledge, we are the first to perform realistic human insertion into scenes in a training-free manner, achieving state-of-the-art results on diverse composite scene images with excellent preservation of both background and subject identity.
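
For context, classifier-free guidance forms the denoising direction by extrapolating from the unconditional prediction toward the conditional one (this is the standard formulation; the notation below is not taken from the paper):

$$\hat{\epsilon}_\theta(x_t, c) = \epsilon_\theta(x_t, \varnothing) + w\,\bigl(\epsilon_\theta(x_t, c) - \epsilon_\theta(x_t, \varnothing)\bigr)$$

With guidance scale $w > 1$, the condition $c$ (e.g. a prompt describing the person in the scene) is amplified, which is what lets the pipeline steer the inverted scene latent toward a placement the scene actually affords.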
Problem

Research questions and friction points this paper is trying to address.

Inserting people realistically into any background scene
Achieving training-free human placement using diffusion models
Preserving subject identity and features from single reference
Innovation

Methods, ideas, or system contributions that make the work stand out.

Training-free pipeline using pre-trained diffusion models
Mask-guided self-attention for identity preservation (sketched after this list)
Affordance-aware editing combining inversion with guidance
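
The mask-guided self-attention idea can be sketched as follows: queries inside the person region attend to key/value tokens cached from the reference image, while background queries are blocked from them, so appearance transfers only where the person is inserted. The function name, masking rule, and toy shapes below are assumptions for illustration, not the paper's code:

```python
# Hedged sketch of a mask-guided self-attention step (assumed formulation).
import torch
import torch.nn.functional as F

def mask_guided_self_attention(q, k_gen, v_gen, k_ref, v_ref, mask):
    """
    q:            (B, N, D) queries from the generated image's attention layer
    k_gen, v_gen: (B, N, D) keys/values from the generated image
    k_ref, v_ref: (B, N, D) keys/values cached from the reference (person) image
    mask:         (B, N) 1 where a token lies inside the inserted-person region
    """
    # Concatenate generated and reference tokens as attention targets.
    k = torch.cat([k_gen, k_ref], dim=1)  # (B, 2N, D)
    v = torch.cat([v_gen, v_ref], dim=1)
    scores = q @ k.transpose(1, 2) / q.shape[-1] ** 0.5  # (B, N, 2N)

    # Background queries (mask == 0) are blocked from reference tokens,
    # so only the person region pulls appearance from the reference.
    n = k_gen.shape[1]
    block = (mask == 0).unsqueeze(-1).expand(-1, -1, k_ref.shape[1])
    scores[..., n:] = scores[..., n:].masked_fill(block, float("-inf"))
    return F.softmax(scores, dim=-1) @ v

# Toy shapes: a 16x16 latent grid flattened to N = 256 tokens, D = 64 channels.
B, N, D = 1, 256, 64
q = torch.randn(B, N, D)
k_gen, v_gen = torch.randn(B, N, D), torch.randn(B, N, D)
k_ref, v_ref = torch.randn(B, N, D), torch.randn(B, N, D)
mask = (torch.rand(B, N) > 0.5).long()
out = mask_guided_self_attention(q, k_gen, v_gen, k_ref, v_ref, mask)
print(out.shape)  # torch.Size([1, 256, 64])
```

Concatenating the reference keys/values lets masked queries copy fine-grained appearance (face, clothing texture) directly from the reference features, without any fine-tuning of the attention weights, which matches the training-free claim.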