🤖 AI Summary
This work addresses the cross-modal generation of visual illustrations from narrative text. We propose an LLM-driven generation pipeline: a large language model first extracts and verbalizes the scene knowledge implicitly evoked by story text, producing structured image prompts; these prompts are then fed to text-to-image models to synthesize illustrations. To support modeling and evaluation, we introduce SceneIllustrations, a new benchmark dataset for narrative scene illustration featuring human-annotated pairwise quality judgments of generated illustrations. Through analysis of this dataset and experiments modeling illustration quality, we demonstrate that LLMs can effectively verbalize implicit scene knowledge, and that this capability benefits both generating and evaluating illustrations. The SceneIllustrations dataset is publicly released to support future research on cross-modal narrative transformation.
📝 Abstract
Generative AI has made it possible to readily transform content from one medium to another. This capability is especially powerful for storytelling, where visual illustrations can illuminate a story originally expressed in text. In this paper, we focus on the task of narrative scene illustration: automatically generating an image that depicts a scene in a story. Motivated by recent progress in text-to-image models, we consider a pipeline that uses LLMs as an interface for prompting text-to-image models to generate scene illustrations from raw story text. We apply variations of this pipeline to a prominent story corpus to synthesize illustrations for scenes in these stories, then conduct a human annotation task to obtain pairwise quality judgments for the illustrations. The outcome of this process is the SceneIllustrations dataset, which we release as a new resource for future work on cross-modal narrative transformation. Through our analysis of this dataset and experiments modeling illustration quality, we demonstrate that LLMs can effectively verbalize the scene knowledge implicitly evoked by story text, and that this capability is impactful for both generating and evaluating illustrations.