RealGen: Photorealistic Text-to-Image Generation via Detector-Guided Rewards

📅 2025-11-29
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current text-to-image models (e.g., GPT-Image-1, Qwen-Image) suffer from prominent AI artifacts—such as oversmoothed skin and unrealistic facial specular highlights—hindering photorealistic fidelity. To address this, we propose Detector Reward, a novel reward mechanism that jointly quantifies semantic and feature-level artifacts via multi-granularity image detectors, enabling end-to-end reward-driven optimization. Our framework unifies LLM-based prompt refinement, diffusion-based image generation, and GRPO-based reinforcement learning for joint training. Furthermore, we introduce RealBench—the first automated benchmark for photorealism evaluation—designed to assess perceptual realism, detail fidelity, and aesthetic quality. Extensive experiments demonstrate that our method consistently outperforms GPT-Image-1, Qwen-Image, and FLUX-Krea across all dimensions. Notably, RealBench scores exhibit strong correlation with human perceptual judgments, validating its effectiveness as an objective realism metric.

📝 Abstract
With the continuous advancement of image generation technology, advanced models such as GPT-Image-1 and Qwen-Image have achieved remarkable text-to-image consistency and world knowledge. However, these models still fall short in photorealistic image generation. Even on simple T2I tasks, they tend to produce "fake" images with distinct AI artifacts, often characterized by "overly smooth skin" and "oily facial sheens". To recapture the original goal of "indistinguishable-from-reality" generation, we propose RealGen, a photorealistic text-to-image framework. RealGen integrates an LLM component for prompt optimization and a diffusion model for realistic image generation. Inspired by adversarial generation, RealGen introduces a "Detector Reward" mechanism, which quantifies artifacts and assesses realism using both semantic-level and feature-level synthetic image detectors. We leverage this reward signal with the GRPO algorithm to optimize the entire generation pipeline, significantly enhancing image realism and detail. Furthermore, we propose RealBench, an automated evaluation benchmark employing Detector-Scoring and Arena-Scoring. It enables human-free photorealism assessment, yielding results that are more accurate and aligned with real user experience. Experiments demonstrate that RealGen significantly outperforms general models like GPT-Image-1 and Qwen-Image, as well as specialized photorealistic models like FLUX-Krea, in terms of realism, detail, and aesthetics. The code is available at https://github.com/yejy53/RealGen.
Problem

Research questions and friction points this paper is trying to address.

Addresses AI artifacts in text-to-image generation
Enhances photorealism using detector-guided reward mechanisms
Proposes automated benchmark for realism evaluation
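
The abstract names Arena-Scoring as one half of RealBench but does not say how it is computed. As a rough illustration only, the sketch below assumes it aggregates pairwise comparisons between images from different models into per-model win rates. The function name `arena_win_rates`, the tuple layout, the tie handling, and the model names are all hypothetical, and the automated judge that would pick each winner is left outside the code.

```python
from collections import defaultdict


def arena_win_rates(comparisons):
    """Aggregate pairwise outcomes into a win rate per model.

    `comparisons` is an iterable of (model_a, model_b, winner) tuples, where
    `winner` is one of the two names, or None for a tie. This layout is an
    assumption made for illustration, not RealBench's actual format.
    """
    wins = defaultdict(float)
    games = defaultdict(int)
    for model_a, model_b, winner in comparisons:
        if winner not in (model_a, model_b, None):
            raise ValueError(f"winner {winner!r} is not one of the compared models")
        games[model_a] += 1
        games[model_b] += 1
        if winner is None:
            # A tie counts as half a win for each side (an assumed convention).
            wins[model_a] += 0.5
            wins[model_b] += 0.5
        else:
            wins[winner] += 1.0
    return {model: wins[model] / games[model] for model in games}


# Hypothetical outcomes for three placeholder models.
comparisons = [
    ("model_x", "model_y", "model_x"),
    ("model_x", "model_z", "model_x"),
    ("model_y", "model_z", None),
]
rates = arena_win_rates(comparisons)
```

A plain win rate is the simplest possible aggregate; a rating scheme such as Elo would be another option, and the source does not indicate which one RealBench uses.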
Innovation

Methods, ideas, or system contributions that make the work stand out.

Integrates LLM for prompt optimization and diffusion model
Introduces Detector Reward mechanism for artifact quantification
Uses GRPO algorithm to optimize generation pipeline realism
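
To make the last two points concrete, here is a minimal sketch of how detector outputs could become a reward and how GRPO-style group-relative advantages are then derived from it. It assumes each detector returns a probability that an image is synthetic; the function names, the equal weighting, and the sample numbers are illustrative assumptions, not the paper's actual reward definition.

```python
from statistics import mean, pstdev


def detector_reward(p_fake_semantic, p_fake_feature, w_semantic=0.5):
    """Combine two detector outputs into one realism reward in [0, 1].

    Each argument is a hypothetical detector's probability that the image is
    synthetic. The linear weighting is an assumption, not the paper's formula.
    """
    for p in (p_fake_semantic, p_fake_feature):
        if not 0.0 <= p <= 1.0:
            raise ValueError("detector outputs must be probabilities in [0, 1]")
    realism_semantic = 1.0 - p_fake_semantic
    realism_feature = 1.0 - p_fake_feature
    return w_semantic * realism_semantic + (1.0 - w_semantic) * realism_feature


def group_relative_advantages(rewards, eps=1e-6):
    """Standardize rewards within one prompt's group of samples (GRPO-style)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps keeps the division finite when every sample gets the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four images sampled for the same prompt, given as made-up
# (semantic detector, feature detector) "probability of fake" pairs.
group = [(0.9, 0.8), (0.4, 0.5), (0.2, 0.1), (0.6, 0.7)]
rewards = [detector_reward(s, f) for s, f in group]
advantages = group_relative_advantages(rewards)
```

Images that the detectors find harder to flag as synthetic receive positive advantages relative to their group, which is the signal a GRPO update would push the generator toward.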
Authors

Junyan Ye, SYSU (Computer Vision and Deep Learning)
Leiqi Zhu, Shanghai AI Lab
Yuncheng Guo, Shanghai AI Lab
Dongzhi Jiang, MMLab, CUHK
Zilong Huang, ByteDance Inc. (Multi-modal Learning, Computer Vision)
Yifan Zhang, Tsinghua University
Zhiyuan Yan, Peking University
Haohuan Fu, Tsinghua University
Conghui He, Shanghai AI Laboratory (Data-centric AI, LLM, Document Intelligence)
Weijia Li, Sun Yat-Sen University