🤖 AI Summary
This paper addresses the problem of generating human portraits and compositing them naturally into a scene under scene constraints. We propose a three-stage cascaded generation framework: first, a Wasserstein GAN generates a semantically consistent initial human layout conditioned on the global scene skeletal structure; second, a lightweight linear skeleton refinement network enhances geometric consistency and contextual coherence; third, an image-conditional generative model synthesizes high-fidelity portraits. Our method is the first to employ scene-level skeletal structure as a cross-scale guiding signal, enabling joint modeling of position, pose, and scale. Extensive quantitative evaluations demonstrate significant improvements over state-of-the-art methods: an 18.7% gain in pose plausibility (PCKh) and a 23.4% reduction in FID (lower FID indicates better generation quality).
📝 Abstract
Person image generation is an intriguing yet challenging problem, and it becomes even more difficult under constrained situations. In this work, we propose a novel pipeline to generate and insert contextually relevant person images into an existing scene while preserving the global semantics. More specifically, we aim to insert a person such that the location, pose, and scale of the inserted person blend in with those of the existing persons in the scene. Our method uses three individual networks in a sequential pipeline. First, we predict the potential location and the skeletal structure of the new person by conditioning a Wasserstein Generative Adversarial Network (WGAN) on the existing human skeletons present in the scene. Next, the predicted skeleton is refined through a shallow linear network to achieve higher structural accuracy in the generated image. Finally, the target image is generated from the refined skeleton using another generative network conditioned on a given image of the target person. In our experiments, we achieve high-resolution photo-realistic generation results while preserving the general context of the scene. We conclude our paper with multiple qualitative and quantitative benchmarks on the results.
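The data flow of the three sequential stages described above can be illustrated with a minimal sketch. It is not the authors' implementation: every function name, the keypoint representation, and the arithmetic inside each stage are hypothetical placeholders that stand in for the trained networks (the conditional WGAN, the shallow linear refiner, and the image-conditioned generator). Only the order of the stages and what each one consumes and produces follow the abstract.

```python
import random
from typing import Dict, List, Tuple

# Assumption: a skeleton is a list of 2D keypoints in image coordinates.
Keypoint = Tuple[float, float]
Skeleton = List[Keypoint]


def propose_skeleton(scene: List[Skeleton], rng: random.Random) -> Skeleton:
    """Stage 1 placeholder for the WGAN generator conditioned on scene skeletons.

    Here: average each joint over the existing people and shift the whole
    skeleton by one random offset, which plays the role of the noise input.
    """
    n_joints = len(scene[0])
    dx, dy = rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0)
    return [
        (
            sum(s[j][0] for s in scene) / len(scene) + dx,
            sum(s[j][1] for s in scene) / len(scene) + dy,
        )
        for j in range(n_joints)
    ]


def refine_skeleton(skel: Skeleton, weight: float = 1.0, bias: float = 0.0) -> Skeleton:
    """Stage 2 placeholder for the shallow linear refinement network.

    Here: one scalar linear map applied to every coordinate (identity by default).
    """
    return [(weight * x + bias, weight * y + bias) for x, y in skel]


def render_person(skel: Skeleton, target_image: str) -> Dict[str, object]:
    """Stage 3 placeholder for the image-conditioned generator.

    Here: only records the two inputs such a generator would receive.
    """
    return {"pose": skel, "appearance": target_image}


def insert_person(scene: List[Skeleton], target_image: str, seed: int = 0) -> Dict[str, object]:
    """Run the three stages in sequence: propose, refine, render."""
    rng = random.Random(seed)
    coarse = propose_skeleton(scene, rng)
    refined = refine_skeleton(coarse)
    return render_person(refined, target_image)


# Two existing people, each with two joints (hypothetical toy data).
scene = [[(10.0, 20.0), (10.0, 40.0)], [(30.0, 22.0), (30.0, 42.0)]]
result = insert_person(scene, "target.png")
```

In this sketch the refinement stage only sees the proposed skeleton and the final stage only sees the refined skeleton plus the target person's image, which mirrors the sequential design stated in the abstract. Whether the actual networks receive additional inputs is not specified there.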