🤖 AI Summary
Existing approaches predominantly adopt sequential text-to-visual generation and therefore cannot simultaneously produce coherent textual narratives, dynamic scene graphs, visual imagery, and affective soundscapes. They also suffer from insufficient cross-modal consistency in spatiotemporal structure, semantic relations, and emotional expression. This paper proposes a multimodal narrative co-generation framework that uses a large language model as the narrative engine and integrates a dynamic scene graph management mechanism with a multimodal affective consistency control framework. A tripartite collaboration among a *narrator* module (text generation), a *director* module (scene graph and image synthesis), and an *affective controller* enables real-time, joint evolution of four modalities (text, scene graph, image, and soundscape) with tight spatiotemporal and affective alignment. Qualitative evaluations on narrative prompts from diverse genres show improvements over cascaded baselines in narrative depth, visual fidelity, and emotional resonance, supporting efficient creative prototyping and immersive storytelling.
📝 Abstract
We introduce Aether Weaver, a novel, integrated framework for multimodal narrative co-generation that overcomes limitations of sequential text-to-visual pipelines. Our system concurrently synthesizes textual narratives, dynamic scene graph representations, visual scenes, and affective soundscapes, driven by a tightly integrated co-generation mechanism. At its core, the Narrator, a large language model, generates narrative text and multimodal prompts. The Director acts as a dynamic scene graph manager: it analyzes the text to build and maintain a structured representation of the story's world, ensuring spatiotemporal and relational consistency for visual rendering and subsequent narrative generation. Additionally, a Narrative Arc Controller guides the high-level story structure and influences multimodal affective consistency, complemented by an Affective Tone Mapper that ensures congruent emotional expression across all modalities. Through qualitative evaluations on a diverse set of narrative prompts spanning various genres, we demonstrate that Aether Weaver significantly enhances narrative depth, visual fidelity, and emotional resonance compared to cascaded baseline approaches. This integrated framework provides a robust platform for rapid creative prototyping and immersive storytelling experiences.
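To make the Narrator and Director interplay concrete, the following is a minimal sketch of a co-generation loop, not the authors' implementation. The `SceneGraph` class, the `co_generate` function, the triple format, and the prompt strings are all hypothetical names chosen for illustration. The language model and the text-to-graph parser are replaced by hard-coded beats, and the tone label stands in for the output of an affective mapping step.

```python
from dataclasses import dataclass, field


@dataclass
class SceneGraph:
    """Toy story-world state: entities plus (subject, relation, object) triples."""
    entities: set = field(default_factory=set)
    relations: set = field(default_factory=set)

    def apply(self, added, removed=()):
        """Update the graph with triples extracted from one narrative beat."""
        for triple in removed:
            self.relations.discard(triple)
        for subj, rel, obj in added:
            self.entities.update((subj, obj))
            self.relations.add((subj, rel, obj))

    def to_prompt(self):
        """Serialize the current relations for reuse in downstream prompts."""
        return "; ".join(f"{s} {r} {o}" for s, r, o in sorted(self.relations))


def co_generate(beats):
    """Run a Narrator-to-Director loop over pre-scripted beats (assumption).

    Each beat is (text, added_triples, removed_triples, tone). In a real
    system a language model would write the text and a parser would extract
    the triples; here both are hard-coded stand-ins.
    """
    graph = SceneGraph()
    frames = []
    for text, added, removed, tone in beats:
        graph.apply(added, removed)
        state = graph.to_prompt()
        # Every modality is conditioned on the same graph state and tone.
        frames.append({
            "text": text,
            "image_prompt": f"{state} | mood: {tone}",
            "sound_prompt": f"ambient soundscape, mood: {tone}",
        })
    return graph, frames


beats = [
    ("Mara enters the lighthouse.",
     [("Mara", "inside", "lighthouse")], [], "uneasy"),
    ("She climbs to the lamp room.",
     [("Mara", "inside", "lamp room")],
     [("Mara", "inside", "lighthouse")], "tense"),
]
graph, frames = co_generate(beats)
print(frames[-1]["image_prompt"])  # Mara inside lamp room | mood: tense
```

The design point of the sketch is that the image and sound prompts are derived from one shared graph state and one tone label per beat, rather than from the previous modality's output as in a cascaded pipeline.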