🤖 AI Summary
Existing automated visual presentation generation methods suffer from disorganized layouts, misaligned text-image content, and inaccurate textual summarization, limiting their applicability in formal domains such as business and scientific communication. To address these challenges, we propose PreGenie, a two-stage multi-MLLM collaborative agent architecture built on the Slidev framework. In Stage I, an analysis agent parses input documents and generates initial slide code; in Stage II, a review agent iteratively refines this code using visual rendering feedback. The approach introduces a closed-loop code-rendering-feedback mechanism that integrates multimodal understanding, modular agent coordination, and end-to-end rendering. Experiments show that the method significantly outperforms state-of-the-art baselines in aesthetic quality, text-image consistency, and alignment with human design preferences, demonstrating strong practical utility and cross-scenario adaptability.
📝 Abstract
Visual presentations are vital for effective communication. Early attempts to automate their creation using deep learning often faced issues such as poorly organized layouts, inaccurate text summarization, and a lack of image understanding, leading to mismatched visuals and text. These limitations restrict their application in formal contexts like business and scientific research. To address these challenges, we propose PreGenie, an agentic and modular framework powered by multimodal large language models (MLLMs) for generating high-quality visual presentations. PreGenie is built on the Slidev presentation framework, where slides are rendered from Markdown code. It operates in two stages: (1) Analysis and Initial Generation, which summarizes multimodal input and generates initial code, and (2) Review and Re-generation, which iteratively reviews intermediate code and rendered slides to produce final, high-quality presentations. Each stage leverages multiple MLLMs that collaborate and share information. Comprehensive experiments demonstrate that PreGenie excels in multimodal understanding, outperforming existing models in both aesthetics and content consistency, while aligning more closely with human design preferences.
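The two-stage pipeline described above can be sketched as a simple control loop: Stage I produces initial Slidev Markdown from the input, and Stage II repeatedly renders and reviews that code until the reviewer accepts it. The sketch below is purely illustrative; the function names (`analyze_and_generate`, `render`, `review`, `pregenie`), the toy acceptance check, and the text-only "rendering" are all invented stand-ins, not the paper's actual implementation, which uses MLLM agents and real Slidev rendering.

```python
# Hypothetical sketch of the PreGenie two-stage loop (all names invented).

def analyze_and_generate(document: str) -> str:
    """Stage I stand-in: summarize the input and emit initial Slidev Markdown."""
    title = document.splitlines()[0] if document else "Untitled"
    return f"---\ntheme: default\n---\n\n# {title}\n"

def render(slide_code: str) -> str:
    """Stand-in for Slidev rendering; the real system renders slides to images."""
    return slide_code

def review(slide_code: str, rendering: str) -> tuple[str, bool]:
    """Stage II stand-in: inspect the rendering and propose revised code.

    Returns (revised_code, accepted). A real review agent would be an MLLM
    judging layout, text-image consistency, and aesthetics.
    """
    accepted = "# " in rendering  # toy acceptance criterion
    return slide_code, accepted

def pregenie(document: str, max_rounds: int = 3) -> str:
    """Run Stage I once, then the Stage II review loop up to max_rounds times."""
    code = analyze_and_generate(document)      # Stage I
    for _ in range(max_rounds):                # Stage II: closed feedback loop
        rendering = render(code)
        code, accepted = review(code, rendering)
        if accepted:
            break
    return code
```

The key design point this mirrors is that review operates on the *rendered* output, not just the code, closing the code-rendering-feedback loop the paper describes.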