🤖 AI Summary
This work addresses the cross-modal generation of visual illustrations from narrative text. We propose an LLM-driven generation pipeline: a large language model first extracts and verbalizes the scene knowledge implicitly evoked by story text, producing structured image prompts; these prompts are then fed to text-to-image models to synthesize illustrations. To support modeling and evaluation, we introduce SceneIllustrations, a new benchmark dataset for narrative scene illustration featuring human-annotated pairwise quality judgments of generated illustrations. Through analysis of this dataset and experiments modeling illustration quality, we demonstrate that LLMs can effectively verbalize implicit scene knowledge, and that this capability benefits both generating and evaluating illustrations. The SceneIllustrations dataset is publicly released to support future research on cross-modal narrative transformation.
📝 Abstract
Generative AI has made it possible to readily transform content from one medium to another. This capability is especially powerful for storytelling, where visual illustrations can illuminate a story originally expressed in text. In this paper, we focus on the task of narrative scene illustration: automatically generating an image that depicts a scene in a story. Motivated by recent progress in text-to-image models, we consider a pipeline that uses LLMs as an interface for prompting text-to-image models to generate scene illustrations from raw story text. We apply variations of this pipeline to a prominent story corpus to synthesize illustrations for scenes in these stories, then conduct a human annotation task to obtain pairwise quality judgments for the illustrations. The outcome of this process is the SceneIllustrations dataset, which we release as a new resource for future work on cross-modal narrative transformation. Through our analysis of this dataset and experiments modeling illustration quality, we demonstrate that LLMs can effectively verbalize the scene knowledge implicitly evoked by story text, and that this capability is impactful for both generating and evaluating illustrations.