TextAtlas5M: A Large-scale Dataset for Dense Text Image Generation

📅 2025-02-11
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing text-to-image generation methods exhibit poor performance on long-text rendering, primarily due to the absence of high-quality, high-density long-text image benchmarks. Method: We introduce TextAtlas5M—a large-scale synthetic benchmark comprising 5 million images—and TextAtlasEval, a human-annotated, cross-domain evaluation set of 3,000 samples covering real-world scenarios including advertisements, signage, and infographics. Our data construction paradigm integrates multi-source automated synthesis with OCR-based verification, semantic consistency checking, and human re-annotation. Contribution/Results: Experiments reveal that state-of-the-art closed-source models (e.g., GPT-4o + DALL·E-3) achieve only ~42% text accuracy on TextAtlasEval, while open-source models typically score below 20%. This work establishes the first dedicated, multi-source, high-density, high-fidelity benchmark explicitly designed for evaluating long-text image generation—setting a new standard for the field.
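
The summary reports "text accuracy" figures but this page does not define the metric. As a rough sketch only, assuming the score compares the text requested in the prompt against text read back from the generated image by some OCR engine, two common measures are word-level accuracy and character error rate. The function names and the sample strings below are hypothetical illustrations, not the benchmark's actual evaluation code, and the OCR step itself is left out.

```python
from collections import Counter


def word_accuracy(target: str, ocr_text: str) -> float:
    """Fraction of target words that also occur in the OCR output.

    Matching is case-insensitive and multiset-based, so a word that is
    requested twice must also be recognised twice to count twice.
    """
    target_words = target.lower().split()
    if not target_words:
        return 0.0
    available = Counter(ocr_text.lower().split())
    matched = 0
    for word in target_words:
        if available[word] > 0:
            available[word] -= 1
            matched += 1
    return matched / len(target_words)


def char_error_rate(target: str, ocr_text: str) -> float:
    """Levenshtein edit distance divided by the target length."""
    if not target:
        return 0.0 if not ocr_text else 1.0
    # prev[j] holds the distance between the processed target prefix
    # and the first j characters of ocr_text.
    prev = list(range(len(ocr_text) + 1))
    for i, t_ch in enumerate(target, start=1):
        curr = [i]
        for j, o_ch in enumerate(ocr_text, start=1):
            cost = 0 if t_ch == o_ch else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1] / len(target)


# Hypothetical example: the prompt asked for a three-word slogan and
# the (assumed) OCR output misspells one word.
requested = "big summer sale"
recognised = "big sumer sale"
print(word_accuracy(requested, recognised))
print(char_error_rate(requested, recognised))
```

Word accuracy penalises any misspelled word in full, while character error rate gives partial credit for near misses, which is why evaluations of dense text rendering often report both.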

📝 Abstract
Text-conditioned image generation has gained significant attention in recent years, and models are processing increasingly long and comprehensive text prompts. In everyday life, dense and intricate text appears in contexts like advertisements, infographics, and signage, where the integration of both text and visuals is essential for conveying complex information. However, despite these advances, the generation of images containing long-form text remains a persistent challenge, largely due to the limitations of existing datasets, which often focus on shorter and simpler text. To address this gap, we introduce TextAtlas5M, a novel dataset specifically designed to evaluate long-text rendering in text-conditioned image generation. Our dataset consists of 5 million long-text generated and collected images across diverse data types, enabling comprehensive evaluation of large-scale generative models on long-text image generation. We further curate TextAtlasEval, a human-improved test set of 3,000 samples across 3 data domains, establishing one of the most extensive benchmarks for text-conditioned generation. Evaluations suggest that the TextAtlasEval benchmark presents significant challenges even for the most advanced proprietary models (e.g., GPT-4o with DALL·E 3), while their open-source counterparts show an even larger performance gap. This evidence positions TextAtlas5M as a valuable dataset for training and evaluating future-generation text-conditioned image generation models.
Problem

Research questions and friction points this paper is trying to address.

Challenges in generating images with long-form text.
Limitations of existing datasets for text-conditioned image generation.
Need for comprehensive evaluation of long-text rendering models.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Large-scale dataset for text-image generation
Focus on long-form text rendering
Human-improved test set for evaluation

Alex Jinpeng Wang
Central South University
Dongxing Mao
Central South University
Jiawei Zhang
North University of China
Weiming Han
North University of China
Zhuobai Dong
Central South University
Linjie Li
Microsoft
Vision and Language
Yiqi Lin
National University of Singapore
Zhengyuan Yang
Principal Researcher, Microsoft
Computer Vision, Multimedia, Multimodal, Post-Training, Agentic RL
Libo Qin
Central South University
Fuwei Zhang
North University of China
Lijuan Wang
Microsoft
Min Li
Central South University