🤖 AI Summary
This paper targets the lack of pose and illumination control in natural-language-driven personalized virtual try-on for fashion e-commerce. It proposes FashionPose, which the authors describe as the first unified text-to-pose-to-relighting generation framework. Instead of relying on explicit pose annotations, the method predicts a 2D human pose from the text description, uses a diffusion model to generate a high-fidelity image of the dressed person, and then applies a lightweight relighting module for flexible lighting control. All three stages are guided by the same textual input. According to the abstract, the design enables accurate pose alignment and faithful garment rendering, and the experiments demonstrate fine-grained pose synthesis and efficient, consistent relighting, making the framework a practical option for personalized virtual fashion display.
📝 Abstract
Realistic and controllable garment visualization is critical for fashion e-commerce, where users expect personalized previews under diverse poses and lighting conditions. Existing methods often rely on predefined poses, limiting semantic flexibility and illumination adaptability. To address this, we introduce FashionPose, the first unified text-to-pose-to-relighting generation framework. Given a natural language description, our method first predicts a 2D human pose, then employs a diffusion model to generate high-fidelity person images, and finally applies a lightweight relighting module, all guided by the same textual input. By replacing explicit pose annotations with text-driven conditioning, FashionPose enables accurate pose alignment, faithful garment rendering, and flexible lighting control. Experiments demonstrate fine-grained pose synthesis and efficient, consistent relighting, providing a practical solution for personalized virtual fashion display.
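The three-stage data flow described in the abstract (text to 2D pose, pose to image, image to relit image, all driven by one prompt) can be sketched as a simple composition of functions. This is only an illustrative sketch, not the FashionPose implementation: the class `TryOnPipeline`, the stage names, and the toy stand-ins `toy_pose`, `toy_generator`, and `toy_relight` are all hypothetical, and the keyword rules inside them merely take the place of the learned pose predictor, diffusion model, and relighting module.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Keypoints = List[Tuple[float, float]]  # 2D joints as (x, y) in normalized [0, 1] coordinates
Image = List[List[float]]              # toy grayscale image with values in [0, 1]


@dataclass
class TryOnPipeline:
    """Hypothetical wrapper: chains three stages that all receive the same prompt."""
    predict_pose: Callable[[str], Keypoints]
    generate_image: Callable[[str, Keypoints], Image]
    relight: Callable[[Image, str], Image]

    def __call__(self, prompt: str) -> Image:
        pose = self.predict_pose(prompt)           # stage 1: text -> 2D pose
        image = self.generate_image(prompt, pose)  # stage 2: text + pose -> image
        return self.relight(image, prompt)         # stage 3: image + text -> relit image


def toy_pose(prompt: str) -> Keypoints:
    # Placeholder for a learned text-to-pose model: a keyword sets the wrist height.
    wrist_y = 0.2 if "raised" in prompt.lower() else 0.7
    return [(0.5, 0.1), (0.5, 0.4), (0.3, wrist_y), (0.7, wrist_y)]


def toy_generator(prompt: str, pose: Keypoints, size: int = 8) -> Image:
    # Placeholder for a diffusion model: marks each joint on a flat gray canvas.
    image = [[0.5] * size for _ in range(size)]
    for x, y in pose:
        row = min(int(y * size), size - 1)
        col = min(int(x * size), size - 1)
        image[row][col] = 1.0
    return image


def toy_relight(image: Image, prompt: str) -> Image:
    # Placeholder for a relighting module: a keyword selects a global brightness gain.
    gain = 0.5 if "dim" in prompt.lower() else 1.0
    return [[min(1.0, pixel * gain) for pixel in row] for row in image]


pipeline = TryOnPipeline(toy_pose, toy_generator, toy_relight)
result = pipeline("a model with raised arms under dim lighting")
```

Passing the stages in as callables keeps the sketch agnostic about how each one is built; a real system would substitute trained models for the three toy functions while keeping the single-prompt interface.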