🤖 AI Summary
Text-to-image diffusion models face significant bottlenecks in generating images from complex, multi-object prompts and in producing stylistically diverse outputs.
Method: We propose the Multi-Expert Planning and Generation (MEPG) framework, which employs a position- and style-aware large language model (LLM) for fine-grained semantic instruction decomposition, coupled with spatial-semantic expert modules for joint layout planning and style synthesis. A dynamic expert routing mechanism with attention-based gating enables localized, per-region generation while preserving global coherence. The method integrates LLM fine-tuning, a multi-expert diffusion architecture, and cross-region generation techniques, and it supports lightweight replacement of expert models as well as interactive editing.
Contribution/Results: Experiments show that, with the same backbone model, MEPG significantly outperforms baseline models in both image quality and style diversity.
📝 Abstract
Text-to-image diffusion models have achieved remarkable image quality, but they still struggle with complex, multi-element prompts, and limited stylistic diversity. To address these limitations, we propose a Multi-Expert Planning and Generation Framework (MEPG) that synergistically integrates position- and style-aware large language models (LLMs) with spatial-semantic expert modules. The framework comprises two core components: (1) a Position-Style-Aware (PSA) module that utilizes a supervised fine-tuned LLM to decompose input prompts into precise spatial coordinates and style-encoded semantic instructions; and (2) a Multi-Expert Diffusion (MED) module that implements cross-region generation through dynamic expert routing across both local regions and global areas. During the generation process for each local region, specialized models (e.g., realism experts, stylization specialists) are selectively activated for each spatial partition via attention-based gating mechanisms. The architecture supports lightweight integration and replacement of expert models, providing strong extensibility. Additionally, an interactive interface enables real-time spatial layout editing and per-region style selection from a portfolio of experts. Experiments show that MEPG significantly outperforms baseline models with the same backbone in both image quality and style diversity.
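
As a rough illustration of the region-wise routing the abstract describes, the following minimal Python sketch blends the outputs of two toy experts inside each planned region using softmax gating weights, and falls back to a single global expert elsewhere. It is not the authors' implementation: the abstract does not give the gating formula, so the scaled dot-product scoring, the `Region` record, the `gate_weights` and `route` helpers, the style query vectors, and the scalar toy experts are all assumptions made for illustration.

```python
import math
from dataclasses import dataclass


@dataclass
class Region:
    """Hypothetical per-region plan, standing in for the PSA module's output."""
    box: tuple   # (x0, y0, x1, y1) in pixel coordinates, x1/y1 exclusive
    style: str   # style tag used to look up a query vector
    prompt: str  # sub-prompt for this region (unused by the toy experts)


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def gate_weights(query, expert_keys):
    """Assumed gating rule: scaled dot product of query and keys, then softmax."""
    scale = math.sqrt(len(query))
    names = list(expert_keys)
    scores = [
        sum(q * k for q, k in zip(query, expert_keys[name])) / scale
        for name in names
    ]
    return dict(zip(names, softmax(scores)))


def route(regions, style_queries, expert_keys, experts, width, height, global_expert):
    """Fill a canvas with the global expert, then overwrite each region
    with a gate-weighted blend of all experts' predictions."""
    canvas = [
        [experts[global_expert](x, y) for x in range(width)]
        for y in range(height)
    ]
    for region in regions:
        weights = gate_weights(style_queries[region.style], expert_keys)
        x0, y0, x1, y1 = region.box
        for y in range(y0, y1):
            for x in range(x0, x1):
                canvas[y][x] = sum(
                    w * experts[name](x, y) for name, w in weights.items()
                )
    return canvas


# Toy experts: each returns one scalar per pixel instead of a real prediction.
experts = {"realism": lambda x, y: 1.0, "stylized": lambda x, y: -1.0}
expert_keys = {"realism": [1.0, 0.0], "stylized": [0.0, 1.0]}
style_queries = {"photo": [4.0, 0.0], "watercolor": [0.0, 4.0]}
regions = [
    Region((0, 0, 2, 2), "photo", "a red fox"),
    Region((2, 0, 4, 2), "watercolor", "a pine tree"),
]
canvas = route(regions, style_queries, expert_keys, experts,
               width=4, height=3, global_expert="realism")
```

In this toy setup, the left region leans toward the realism expert and the right region toward the stylized one, while pixels outside both boxes keep the global expert's value. Because experts are looked up by name in plain dictionaries, adding or swapping one only changes the `experts` and `expert_keys` entries, which loosely mirrors the lightweight expert replacement mentioned in the abstract.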