MemoGen: Can Past Experience Improve Future Text-to-Image Generation?

📅 2026-06-02

🤖 AI Summary
Current text-to-image models struggle with implicit visual constraints, relational reasoning, and prompts that require external knowledge, and they do not reuse experience from earlier generations. This work proposes MemoGen, a training-free framework for test-time self-evolution that adds a reusable experience memory to image generation. Its agentic evolution layer forms a closed loop of task comprehension, external evidence retrieval, constraint formulation, result evaluation, and memory storage, so generation strategies keep improving without any update to model parameters. Built on the Qwen-Image backbone, MemoGen surpasses strong baselines such as Nano Banana Pro and GPT-Image-1 on knowledge- and reasoning-intensive benchmarks such as WISE and Mind-Bench after only two rounds of evolution.
📝 Abstract
Modern text-to-image models have achieved strong visual synthesis, yet remain unreliable when prompts require implicit visual constraints, relational reasoning, or external knowledge. Existing retrieval-augmented and agentic generation methods mitigate this issue by acquiring external knowledge, references, or refined prompts for the current request, yet they typically treat each generation as an isolated episode and do not systematically preserve past successes or failures for future use. In this work, we ask whether a text-to-image system can continually improve from its own generation experience without updating the underlying generator. We propose MemoGen, a training-free framework that augments existing image generators with an agentic evolution layer. For each task, MemoGen explicitly infers visual requirements, retrieves external evidence and references when necessary, translates them into executable generation constraints, evaluates the generated result, and stores task understanding, reference choices, visual feedback, successful strategies, and failure lessons as reusable experience memory. Across evolution rounds, the agent retrieves relevant experience to improve similar future generations, selectively repairing previously failed cases while preserving successful ones, thereby enabling test-time self-evolution without parameter updates. Extensive experiments on knowledge-intensive and reasoning-oriented benchmarks demonstrate the effectiveness of this paradigm: after only two evolution rounds, MemoGen built upon the open-source Qwen-Image backbone surpasses strong proprietary systems such as Nano Banana Pro and GPT-Image-1 on WISE and Mind-Bench, showing that explicit experience memory can serve as a powerful continual learning signal for reliable text-to-image generation.
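The loop described in the abstract (retrieve relevant experience, derive constraints, generate, evaluate, store the outcome, then retry only the failed cases) can be illustrated with a minimal Python sketch. This is not the authors' implementation: every name here (`Experience`, `ExperienceMemory`, `evolve`, the word-overlap similarity, the `pass_score` threshold, and the toy `infer_constraints`, `generate`, and `evaluate` stand-ins) is a hypothetical chosen for illustration, and the "image" is just a string.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Experience:
    """One stored episode (hypothetical schema, not from the paper)."""
    prompt: str
    constraints: list[str]
    score: float
    lesson: str  # empty string when the attempt succeeded


def jaccard(a: str, b: str) -> float:
    """Word-overlap similarity; an assumed stand-in for a real retriever."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


@dataclass
class ExperienceMemory:
    records: list[Experience] = field(default_factory=list)

    def add(self, exp: Experience) -> None:
        self.records.append(exp)

    def retrieve(self, prompt: str, k: int = 3, min_sim: float = 0.2) -> list[Experience]:
        """Return up to k stored episodes whose prompts resemble this one."""
        scored = [(jaccard(prompt, r.prompt), r) for r in self.records]
        scored = [pair for pair in scored if pair[0] >= min_sim]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [r for _, r in scored[:k]]


def evolve(prompts, memory, infer_constraints, generate, evaluate,
           rounds: int = 2, pass_score: float = 0.5) -> dict:
    """Run evolution rounds without touching the generator's parameters."""
    best: dict = {}
    for _ in range(rounds):
        for prompt in prompts:
            # Preserve earlier successes: only failed cases are retried.
            if prompt in best and best[prompt][0] >= pass_score:
                continue
            past = memory.retrieve(prompt)
            constraints = infer_constraints(prompt, past)
            image = generate(prompt, constraints)
            score, lesson = evaluate(prompt, image)
            memory.add(Experience(prompt, constraints, score, lesson))
            if prompt not in best or score > best[prompt][0]:
                best[prompt] = (score, image)
    return best


# Toy stand-ins so the sketch runs end to end (all hypothetical).
def infer_constraints(prompt: str, past: list[Experience]) -> list[str]:
    lessons = [e.lesson for e in past if e.lesson]
    return list(dict.fromkeys(["follow the prompt literally", *lessons]))


def generate(prompt: str, constraints: list[str]) -> str:
    return f"{prompt} | " + "; ".join(constraints)  # a string, not an image


def evaluate(prompt: str, image: str) -> tuple[float, str]:
    needed = "show the Eiffel Tower"
    return (1.0, "") if needed in image else (0.0, needed)


if __name__ == "__main__":
    memory = ExperienceMemory()
    result = evolve(["the most famous landmark in Paris"], memory,
                    infer_constraints, generate, evaluate, rounds=2)
    print(result)
```

In this toy setup the first round fails because no constraint mentions the landmark, the failure lesson is stored, and the second round retrieves that lesson and passes. A real system would replace the stand-ins with an image generator, a visual evaluator, and a stronger retriever.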
Problem

Research questions and friction points this paper is trying to address.

text-to-image generation
experience memory
relational reasoning
external knowledge
continual learning
Innovation

Methods, ideas, or system contributions that make the work stand out.

experience memory
retrieval-augmented generation
test-time self-evolution
training-free framework
agentic evolution
Authors

Wenshuo Chen
Shandong University undergraduate student
Generative Models, XAI

Kuimou Yu
The Hong Kong University of Science and Technology (Guangzhou)

Bowen Tian
The Hong Kong University of Science and Technology (Guangzhou)
Model Fusion, Neural Network Functionals, Semi-Supervised Learning

Jianfei Song
LimX Dynamics Technology Co., Ltd.

Shaofeng Liang
The Hong Kong University of Science and Technology (Guangzhou)

Haozhe Jia
The Hong Kong University of Science and Technology (Guangzhou)

Kan Cheng
The Hong Kong University of Science and Technology (Guangzhou); Shandong University

Haosen Li
The Hong Kong University of Science and Technology (Guangzhou)

Kaishen Yuan
The Hong Kong University of Science and Technology (Guangzhou)

Lei Wang
Griffith University, Data61/CSIRO
Action Recognition, Computer Vision, Machine Learning, Deep Learning, Pattern Recognition

Jiemin Wu
The Hong Kong University of Science and Technology (Guangzhou)
Nonlinear Dynamic Systems, Approximate Inference Optimization, Large Language Models

Songning Lai
HKUST(GZ)
Machine Learning, Deep Learning, Multimodal, XAI

Yutao Yue
The Hong Kong University of Science and Technology (Guangzhou); Institute of Deep Perception Technology, Jiangsu Industrial Technology Research Institute (JITRI)