🤖 AI Summary
Large multimodal models (LMMs) often violate verifiable facts in image generation, especially for fine-grained attributes and time-sensitive scenarios. To address this, we introduce *Factual Image Generation* (FIG), a new task requiring generated images to faithfully reflect real-world knowledge. We propose ORIG, an open retrieval-augmented framework that iteratively retrieves multimodal evidence from the web, dynamically filters it for relevance, and progressively integrates the refined knowledge into the generation process, thereby injecting up-to-date, trustworthy multimodal information. We also construct FIG-Eval, the first benchmark for FIG, covering perceptual, compositional, and temporal dimensions of factual consistency. Extensive experiments show that ORIG significantly improves both factual consistency and visual quality over strong baselines, providing the first systematic evidence of open retrieval's value for factual grounding in image generation.
📝 Abstract
Large Multimodal Models (LMMs) have achieved remarkable progress in generating photorealistic and prompt-aligned images, but they often produce outputs that contradict verifiable knowledge, especially when prompts involve fine-grained attributes or time-sensitive events. Conventional retrieval-augmented approaches attempt to address this issue by introducing external information, yet they are fundamentally incapable of grounding generation in accurate and evolving knowledge due to their reliance on static sources and shallow evidence integration. To bridge this gap, we introduce ORIG, an agentic open multimodal retrieval-augmented framework for Factual Image Generation (FIG), a new task that requires both visual realism and factual grounding. ORIG iteratively retrieves and filters multimodal evidence from the web and incrementally integrates the refined knowledge into enriched prompts to guide generation. To support systematic evaluation, we build FIG-Eval, a benchmark spanning ten categories across perceptual, compositional, and temporal dimensions. Experiments demonstrate that ORIG substantially improves factual consistency and overall image quality over strong baselines, highlighting the potential of open multimodal retrieval for factual image generation.
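The retrieve-filter-integrate loop described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the function names, the relevance threshold, and the canned "retrieval" results are all hypothetical stand-ins for ORIG's actual web retrieval, evidence filtering, and prompt-enrichment components.

```python
# Hedged sketch of an ORIG-style iterative retrieve-filter-integrate loop.
# All names and data below are illustrative assumptions, not the paper's API.

def retrieve_evidence(query: str) -> list[tuple[str, float]]:
    # Stand-in for open web retrieval: returns (snippet, relevance) pairs.
    # A real system would query a search engine for text and images.
    corpus = {
        "eiffel tower": [
            ("The Eiffel Tower is about 330 m tall.", 0.9),
            ("Paris hosts many landmarks.", 0.3),
        ],
    }
    return corpus.get(query.lower(), [])

def filter_evidence(candidates: list[tuple[str, float]],
                    threshold: float = 0.5) -> list[str]:
    # Dynamic filtering: keep only snippets scored above a relevance threshold.
    return [text for text, score in candidates if score >= threshold]

def enrich_prompt(prompt: str, facts: list[str]) -> str:
    # Progressive integration: fold the retained facts into the prompt
    # that will condition the image generator.
    if not facts:
        return prompt
    return prompt + " Facts: " + " ".join(facts)

def orig_loop(prompt: str, queries: list[str]) -> str:
    # Each round retrieves fresh evidence, filters it, and refines the
    # prompt before it is handed to the image-generation model.
    for query in queries:
        facts = filter_evidence(retrieve_evidence(query))
        prompt = enrich_prompt(prompt, facts)
    return prompt  # in the real system, this enriched prompt guides generation

enriched = orig_loop("A photo of the Eiffel Tower.", ["Eiffel Tower"])
```

Here the low-relevance snippet is dropped while the factual one is folded into the prompt, mirroring how iterative open retrieval can ground generation in verifiable, current knowledge.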