🤖 AI Summary
To address the low facial realism and high computational cost of distilled diffusion models (e.g., FLUX.1-schnell) in portrait generation, this paper proposes a “Synthetic Paired Distillation Enhancement” paradigm. We first empirically verify that distortion patterns between distilled models and their baselines exhibit domain-level consistency specifically for human faces. Leveraging this insight, we construct a fully synthetic paired dataset and train a lightweight U-Net-based image-to-image enhancement module to perform unsupervised post-hoc refinement of distilled outputs. Crucially, our method requires no real-image annotations or fine-tuning of the backbone diffusion model, significantly lowering deployment barriers. On portrait generation tasks, enhanced outputs achieve visual quality comparable to FLUX.1-dev while reducing inference latency by 82%. This yields substantial improvements in cost-efficiency for large-scale AI image generation.
📝 Abstract
This study presents a novel approach to enhance the cost-to-quality ratio of image generation with diffusion models. We hypothesize that differences between distilled (e.g. FLUX.1-schnell) and baseline (e.g. FLUX.1-dev) models are consistent and, therefore, learnable within a specialized domain, like portrait generation. We generate a synthetic paired dataset and train a fast image-to-image translation head. Using two sets of low- and high-quality synthetic images, our model is trained to refine the output of a distilled generator (e.g., FLUX.1-schnell) to a level comparable to a baseline model like FLUX.1-dev, which is more computationally intensive. Our results show that the pipeline, which combines a distilled version of a large generative model with our enhancement layer, delivers similar photorealistic portraits to the baseline version with up to an 82% decrease in computational cost compared to FLUX.1-dev. This study demonstrates the potential for improving the efficiency of AI solutions involving large-scale image generation.