World-To-Image: Grounding Text-to-Image Generation with Agent-Driven World Knowledge

📅 2025-10-05
📈 Citations: 0
Influential: 0
🤖 AI Summary
Text-to-image (T2I) models suffer significant performance degradation when generating novel or out-of-distribution (OOD) entities due to knowledge cutoffs and insufficient semantic alignment. To address this, the authors propose World-To-Image, a framework that employs a web-search agent to dynamically retrieve relevant online images and performs multimodal prompt optimization, injecting external knowledge and enhancing the prompt at inference time. The method requires no model fine-tuning and achieves knowledge-guided generation in only 2.7 iterations on average. On the NICE benchmark it improves semantic accuracy by 8.1% over state-of-the-art approaches, and it achieves superior semantic consistency and visual aesthetic quality under LLMGrader and ImageReward evaluations. The core contribution is a scalable, low-overhead "retrieve-optimize" closed loop that brings agent-driven web knowledge into T2I generation to tackle OOD concepts.

📝 Abstract
While text-to-image (T2I) models can synthesize high-quality images, their performance degrades significantly when prompted with novel or out-of-distribution (OOD) entities due to inherent knowledge cutoffs. We introduce World-To-Image, a novel framework that bridges this gap by empowering T2I generation with agent-driven world knowledge. We design an agent that dynamically searches the web to retrieve images for concepts unknown to the base model. This information is then used to perform multimodal prompt optimization, steering powerful generative backbones toward an accurate synthesis. Critically, our evaluation goes beyond traditional metrics, utilizing modern assessments like LLMGrader and ImageReward to measure true semantic fidelity. Our experiments show that World-To-Image substantially outperforms state-of-the-art methods in both semantic alignment and visual aesthetics, achieving +8.1% improvement in accuracy-to-prompt on our curated NICE benchmark. Our framework achieves these results with high efficiency in less than three iterations, paving the way for T2I systems that can better reflect the ever-changing real world. Our demo code is available here: https://github.com/mhson-kyle/World-To-Image
Problem

Research questions and friction points this paper is trying to address.

Addressing performance degradation with novel entities in text-to-image generation
Bridging knowledge gaps using agent-driven web search for T2I models
Improving semantic alignment and visual aesthetics in generated images
Innovation

Methods, ideas, or system contributions that make the work stand out.

Agent-driven web search retrieves unknown concept images
Multimodal prompt optimization steers generative backbones
Framework achieves high efficiency in under three iterations
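The three innovations above form a retrieve-optimize loop: search the web for reference images, optimize the prompt with them, generate, and stop once the output aligns with the request. A minimal sketch of that loop is below; all function names (`search`, `optimize`, `generate`, `score`) are illustrative placeholders supplied by the caller, not the authors' actual API, and the threshold value is an assumption.

```python
def world_to_image(prompt, search, optimize, generate, score,
                   max_iters=3, threshold=0.9):
    """Hypothetical sketch of the retrieve-optimize closed loop.

    search:   agent that retrieves reference images from the web
    optimize: multimodal prompt optimization using those references
    generate: frozen T2I backbone (no fine-tuning involved)
    score:    semantic-alignment scorer (e.g. an ImageReward-style metric)
    """
    best_image, best_score = None, float("-inf")
    current_prompt = prompt
    for _ in range(max_iters):
        references = search(current_prompt)                    # retrieve OOD concept images
        current_prompt = optimize(current_prompt, references)  # refine the prompt
        image = generate(current_prompt, references)           # synthesize with the backbone
        s = score(image, prompt)                               # grade against the original request
        if s > best_score:
            best_image, best_score = image, s
        if s >= threshold:                                     # early exit; the paper reports
            break                                              # ~2.7 iterations on average
    return best_image, best_score
```

With `max_iters=3` the loop mirrors the "under three iterations" efficiency claim: it exits as soon as the scorer is satisfied rather than always running to the cap.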
Moo Hyun Son
The Hong Kong University of Science and Technology
Jintaek Oh
The Hong Kong University of Science and Technology
Sun Bin Mun
Georgia Institute of Technology
Jaechul Roh
University of Massachusetts Amherst
Sehyun Choi
TwelveLabs