OctoT2I: A Self-Evolving Agentic Text-to-Image Router

📅 2026-06-01

🤖 AI Summary
This work addresses the trade-off between generation quality and inference efficiency in text-to-image synthesis, where existing agentic methods often rely on handcrafted priors and single-path decision paradigms. The authors reformulate the task as a joint optimization of both objectives and introduce a stateful multi-round routing strategy. The approach combines a self-constructed knowledge base, a state memory mechanism, and a “Propose–Solve–Evaluate–Learn” (PSEL) loop to enable unsupervised self-evolution: the system autonomously defines conceptual dimensions, learns each tool's capabilities along them, and routes each request to the most suitable model. On GenEval, the method achieves a competitive score of 0.96 while delivering a 90.3% inference speedup and a 56.6% energy-efficiency gain over the Flow-GRPO baseline, striking a markedly better balance between performance and computational efficiency.
📝 Abstract
The explosive growth of Text-to-Image (T2I) models, from large-scale versions to lightweight, real-time ones, now faces diminishing marginal returns from single-model scaling. Agentic T2I methods emerged to alleviate this bottleneck by using multiple models. However, existing agentic T2I methods suffer from three key challenges: reliance on expensive handcrafted priors or human annotations, rigid single-path decision mechanisms, and neglect of inference efficiency. To address these challenges, we introduce OctoT2I, a novel agentic framework that reformulates the T2I task as a joint optimization of generation quality and inference efficiency. OctoT2I implements a stateful, multi-round routing strategy that adaptively selects the most suitable tool based on its knowledge and memory. This strategy is enabled by a knowledge base built from scratch by our novel Self-Evolving Mechanism. This mechanism, which requires no human supervision, first autonomously defines foundational Conceptual Dimensions (e.g., style, color, count) and then intelligently explores their combinations via an iterative "Propose-Solve-Evaluate-Learn" (PSEL) loop. The PSEL loop efficiently discovers each tool's capability frontier, driving continuous improvement without external guidance. Extensive experiments demonstrate that OctoT2I achieves competitive performance (0.96) on GenEval while delivering a 90.3% inference speedup and a 56.6% energy-efficiency gain over the leading baseline (Flow-GRPO), striking an exceptional balance between performance and efficiency. Code and models will be made available.
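The abstract describes the routing idea only at a high level, so the following is a minimal illustrative sketch, not the authors' implementation. It assumes a knowledge base that stores per-tool success counts for each conceptual dimension, a scalar cost per tool, and a scoring rule of estimated success minus weighted cost. All names (`Tool`, `KnowledgeBase`, `route`, `generate`, `cost_weight`) and the scoring rule itself are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Tool:
    name: str
    cost: float  # hypothetical relative inference cost (higher = slower)


@dataclass
class KnowledgeBase:
    # Maps (tool name, dimension) -> [successes, trials]. Assumed structure.
    stats: dict = field(default_factory=dict)

    def update(self, tool_name, dims, success):
        """Learn step: record one evaluated outcome for every dimension."""
        for d in dims:
            entry = self.stats.setdefault((tool_name, d), [0, 0])
            entry[0] += int(success)
            entry[1] += 1

    def estimate(self, tool_name, dims):
        """Laplace-smoothed success rate, taken over the weakest dimension."""
        rates = []
        for d in dims:
            s, n = self.stats.get((tool_name, d), (0, 0))
            rates.append((s + 1) / (n + 2))
        return min(rates) if rates else 0.5


def route(tools, kb, dims, tried, cost_weight=0.1):
    """Pick the untried tool with the best quality-minus-cost score."""
    candidates = [t for t in tools if t.name not in tried]
    if not candidates:
        return None
    return max(
        candidates,
        key=lambda t: kb.estimate(t.name, dims) - cost_weight * t.cost,
    )


def generate(dims, tools, kb, solve, evaluate, max_rounds=3):
    """Stateful multi-round loop: route, solve, evaluate, learn, retry."""
    tried = []  # state memory for this request
    for _ in range(max_rounds):
        tool = route(tools, kb, dims, tried)
        if tool is None:
            break
        image = solve(tool, dims)
        ok = evaluate(image, dims)
        kb.update(tool.name, dims, ok)
        tried.append(tool.name)
        if ok:
            return tool.name, tried
    return None, tried
```

In this sketch a cheap tool is tried first while the knowledge base is empty, and repeated failures on a dimension such as `count` gradually shift the first choice to a costlier tool. The `solve` and `evaluate` callables stand in for the actual image generators and the evaluator, which the abstract does not specify.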
Problem

Research questions and friction points this paper is trying to address.

Text-to-Image
Agentic T2I
inference efficiency
multi-model routing
human annotation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Self-Evolving Mechanism
Agentic Text-to-Image Routing
Multi-round Adaptive Routing
Conceptual Dimensions
PSEL Loop