🤖 AI Summary
Existing text-to-image generation methods primarily target single-turn tasks and offer little support for multi-turn, iterative creative editing, while dialogue-based single-agent systems suffer from intent drift and incoherent edits. To address this, the authors propose Talk2Image, a multi-agent collaborative framework for multi-turn text-guided image generation and editing. The approach decomposes complex editing tasks, assigns specialized roles (e.g., Intent Analyst, Edit Executor, Consistency Evaluator), and employs a multi-perspective feedback mechanism to ensure progressive alignment with user intent and continuous image refinement. It integrates dialogue-history-aware intent modeling, structured task orchestration, and tri-dimensional evaluation spanning semantic, visual, and temporal coherence. Experiments show that the system outperforms existing single-agent conversational baselines in editing controllability, cross-turn consistency, and user satisfaction. The work presents a scalable, cooperative paradigm for interactive AI-generated content.
📝 Abstract
Text-to-image generation has driven remarkable advances in diverse media applications, yet most methods focus on single-turn scenarios and struggle with iterative, multi-turn creative tasks. Recent dialogue-based systems attempt to bridge this gap, but their single-agent, sequential paradigm often causes intention drift and incoherent edits. To address these limitations, we present Talk2Image, a novel multi-agent system for interactive image generation and editing in multi-turn dialogue scenarios. Our approach integrates three key components: intention parsing from dialogue history, task decomposition and collaborative execution across specialized agents, and feedback-driven refinement based on a multi-view evaluation mechanism. Talk2Image enables step-by-step alignment with user intention and consistent image editing. Experiments demonstrate that Talk2Image outperforms existing baselines in controllability, coherence, and user satisfaction across iterative image generation and editing tasks.
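The three components named in the abstract (intention parsing, task decomposition with execution, and feedback-driven refinement) can be illustrated with a toy control loop. This is only a minimal sketch under stated assumptions, not the Talk2Image implementation: every name here (`Session`, `parse_intention`, `decompose`, `execute`, `evaluate`, `run_turn`) is hypothetical, the "image" is a plain list of applied edit descriptions rather than pixels, and the splitting and scoring rules are deliberately trivial stand-ins for what would be model-backed agents.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Session:
    """Dialogue state. The 'image' is a placeholder: a list of applied edits."""
    history: List[str] = field(default_factory=list)
    image: List[str] = field(default_factory=list)


def parse_intention(history: List[str]) -> Dict[str, object]:
    """Hypothetical parser: latest request plus earlier turns as context."""
    return {"request": history[-1], "context": history[:-1]}


def decompose(intention: Dict[str, object]) -> List[str]:
    """Toy decomposition rule (an assumption): split the request on ' and '."""
    request = str(intention["request"])
    return [part.strip() for part in request.split(" and ") if part.strip()]


def execute(image: List[str], subtask: str) -> List[str]:
    """Stand-in for a specialized agent: records the subtask as applied."""
    return image + [subtask]


def evaluate(image: List[str], subtasks: List[str], previous: List[str]) -> Dict[str, object]:
    """Toy multi-view check: which subtasks are missing, and are earlier edits kept?"""
    return {
        "missing": [t for t in subtasks if t not in image],
        "coherent": image[: len(previous)] == previous,
    }


def run_turn(session: Session, user_message: str, max_rounds: int = 3) -> Dict[str, object]:
    """Process one dialogue turn: parse, decompose, execute, then refine on feedback."""
    session.history.append(user_message)
    subtasks = decompose(parse_intention(session.history))
    previous = list(session.image)
    candidate = list(previous)
    pending = list(subtasks)
    report: Dict[str, object] = {}
    for _ in range(max_rounds):
        for subtask in pending:
            candidate = execute(candidate, subtask)
        report = evaluate(candidate, subtasks, previous)
        pending = list(report["missing"])  # feedback decides what to retry
        if not pending and report["coherent"]:
            break
    session.image = candidate
    return report


if __name__ == "__main__":
    session = Session()
    run_turn(session, "draw a red house and add a tree")
    print(run_turn(session, "make the sky purple"))
    print(session.image)
```

In this sketch the evaluator's report drives the next round: only subtasks reported as missing are retried, and the loop stops once nothing is missing and earlier edits are preserved. A real system would replace each function with a model-backed agent and score actual images.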