Echo-4o: Harnessing the Power of GPT-4o Synthetic Images for Improved Image Generation

📅 2025-08-13
📈 Citations: 0
Influential: 0
🤖 AI Summary
Problem: Open-source image generation models suffer from insufficient coverage of rare scenarios and weak text–image alignment. Method: To address these limitations, we use GPT-4o to generate high-fidelity synthetic images and construct Echo-4o-Image, a 180K-scale dataset that compensates for the scarcity of long-tail scenarios and the text–image misalignment found in real-world data. We fine-tune the unified multimodal model Bagel on this dataset to obtain Echo-4o, and introduce two more challenging benchmarks, GenEval++ and Imagine-Bench, to evaluate complex instruction following and imaginative content generation. Results: Echo-4o performs strongly on standard benchmarks, and applying Echo-4o-Image to other foundation models such as OmniGen2 and BLIP3-o yields consistent gains across multiple metrics, indicating that the dataset transfers well. These results suggest that clean, controllable synthetic data can compensate for blind spots in real-world datasets used to train open-source image generation models.

📝 Abstract
Recently, GPT-4o has garnered significant attention for its strong performance in image generation, yet open-source models still lag behind. Several studies have explored distilling image data from GPT-4o to enhance open-source models, achieving notable progress. However, a key question remains: given that real-world image datasets already constitute a natural source of high-quality data, why should we use GPT-4o-generated synthetic data? In this work, we identify two key advantages of synthetic images. First, they can complement rare scenarios in real-world datasets, such as surreal fantasy or multi-reference image generation, which frequently occur in user queries. Second, they provide clean and controllable supervision. Real-world data often contains complex background noise and inherent misalignment between text descriptions and image content, whereas synthetic images offer pure backgrounds and long-tailed supervision signals, facilitating more accurate text-to-image alignment. Building on these insights, we introduce Echo-4o-Image, a 180K-scale synthetic dataset generated by GPT-4o, harnessing the power of synthetic image data to address blind spots in real-world coverage. Using this dataset, we fine-tune the unified multimodal generation baseline Bagel to obtain Echo-4o. In addition, we propose two new evaluation benchmarks for a more accurate and challenging assessment of image generation capabilities: GenEval++, which increases instruction complexity to mitigate score saturation, and Imagine-Bench, which focuses on evaluating both the understanding and generation of imaginative content. Echo-4o demonstrates strong performance across standard benchmarks. Moreover, applying Echo-4o-Image to other foundation models (e.g., OmniGen2, BLIP3-o) yields consistent performance gains across multiple metrics, highlighting the dataset's strong transferability.
Problem

Research questions and friction points this paper is trying to address.

Enhancing open-source image generation with GPT-4o synthetic data
Addressing rare scenarios and noise in real-world image datasets
Improving text-to-image alignment through clean synthetic supervision
Innovation

Methods, ideas, or system contributions that make the work stand out.

Echo-4o-Image, a 180K-scale synthetic dataset generated by GPT-4o
Echo-4o, obtained by fine-tuning the unified multimodal generation baseline Bagel
Proposed GenEval++ and Imagine-Bench evaluation benchmarks
Junyan Ye
SYSU
Computer Vision and Deep Learning
Dongzhi Jiang
MMLab, CUHK
Zihao Wang
Sun Yat-sen University
Leqi Zhu
Shanghai Artificial Intelligence Laboratory
Zhenghao Hu
Sun Yat-Sen University
Remote Sensing, 3D Building Reconstruction, Deep Learning
Zilong Huang
ByteDance Inc.
Multi-modal Learning, Computer Vision
Jun He
Sun Yat-sen University
Zhiyuan Yan
Peking University
Jinghua Yu
Sun Yat-sen University
Hongsheng Li
CUHK MMLab
Conghui He
Shanghai AI Laboratory
Data-centric AI, LLM, Document Intelligence
Weijia Li
Sun Yat-sen University