🤖 AI Summary
Existing banner generation models achieve high visual fidelity but struggle to satisfy commercial design requirements such as structured layout, precise typography, and brand consistency. To address this, we propose MIMO, an end-to-end generative framework built on multimodal agents and reflective optimization. MIMO uses a hierarchical agent architecture that integrates collaborative planning among multiple large language models (LLMs), multimodal understanding, diffusion-based image generation, and iterative reflective reasoning. Given only a natural-language prompt and a logo image, it automatically detects design flaws and optimizes layout, typography, and brand elements (e.g., color palette, logo placement). On real-world advertising benchmarks, MIMO significantly outperforms state-of-the-art diffusion models and LLM-based baselines in both visual quality and design compliance, including alignment, spacing, and brand-color consistency.
📝 Abstract
Recent generative models such as GPT-4o have shown strong capabilities in producing high-quality images with accurate text rendering. However, commercial design tasks like advertising banners demand more than visual fidelity: they also require structured layouts, precise typography, and consistent branding. In this paper, we introduce MIMO (Mirror In-the-Model), an agentic refinement framework for automatic ad banner generation. MIMO combines a hierarchical multimodal agent system (MIMO-Core) with a coordination loop (MIMO-Loop) that explores multiple stylistic directions and iteratively improves design quality. Requiring only a simple natural-language prompt and a logo image as input, MIMO automatically detects and corrects multiple types of errors during generation. Experiments show that MIMO significantly outperforms existing diffusion- and LLM-based baselines in real-world banner design scenarios.
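The abstract describes a coordination loop that detects design flaws and iteratively corrects them. As a rough intuition for that generate-critique-refine pattern, here is a minimal Python sketch; every class, function, and threshold below is an illustrative assumption for exposition, not the paper's actual API or architecture.

```python
# Minimal sketch of an agentic generate-critique-refine loop in the spirit of
# the MIMO-Loop described above. All names and scores here are hypothetical
# stand-ins, not the paper's real components.
from dataclasses import dataclass


@dataclass
class Banner:
    layout: str
    typography: str
    brand_score: float  # 0.0 (off-brand) .. 1.0 (fully on-brand); assumed metric


def generate(prompt: str, logo: str) -> Banner:
    # Stand-in for the diffusion-based generator: produce a first draft.
    return Banner(layout="rough", typography="default", brand_score=0.4)


def critique(banner: Banner) -> list[str]:
    # Stand-in for the multimodal critic: return detected design flaws.
    flaws = []
    if banner.brand_score < 0.9:
        flaws.append("brand-color mismatch")
    if banner.layout == "rough":
        flaws.append("misaligned layout")
    return flaws


def refine(banner: Banner, flaws: list[str]) -> Banner:
    # Stand-in for the corrective edit step: address each reported flaw.
    if "misaligned layout" in flaws:
        banner.layout = "grid-aligned"
    if "brand-color mismatch" in flaws:
        banner.brand_score = min(1.0, banner.brand_score + 0.3)
    return banner


def refinement_loop(prompt: str, logo: str, max_iters: int = 5) -> Banner:
    # Iterate generation and critique until no flaws remain or budget runs out.
    banner = generate(prompt, logo)
    for _ in range(max_iters):
        flaws = critique(banner)
        if not flaws:
            break
        banner = refine(banner, flaws)
    return banner
```

With these toy components, `refinement_loop("summer sale banner", "logo.png")` converges in a few iterations to a draft the critic no longer flags; the real system would replace each stand-in with an LLM planner, a multimodal critic, and a diffusion generator.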