StoryTailor: A Zero-Shot Pipeline for Action-Rich Multi-Subject Visual Narratives

📅 2026-02-24
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
This work addresses the challenge of generating multi-frame, action-rich visual narratives in a zero-shot setting, where it remains difficult to maintain action semantic fidelity, subject identity consistency, and cross-frame background continuity simultaneously. The authors propose an efficient pipeline that, given only a long textual prompt, a subject reference image, and bounding boxes, produces temporally coherent and identity-stable image sequences on a single RTX 4090 GPU. The method integrates three techniques: Gaussian-Centered Attention (GCA) to mitigate interference from overlapping bounding boxes, Action-Boost Singular Value Reweighting (AB-SVR) to enhance action semantics, and a Selective Forgetting Cache (SFC) to establish cross-frame semantic associations. Experiments show a 10–15% improvement on the CLIP-T metric, superior DreamSim scores over strong baselines, competitive CLIP-I performance, and faster inference than FluxKontext, achieving both expressive visuals and stable scene progression.
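The Gaussian-Centered Attention idea described above can be pictured as a per-box attention bias that peaks at each subject's box center, so that overlapping boxes interfere less at their peripheries. A minimal sketch follows; the function name, the `sigma_scale` heuristic, and the bias shape are illustrative assumptions, not the paper's actual implementation:

```python
import numpy as np

def gaussian_center_bias(h, w, box, sigma_scale=0.5):
    """Sketch of a Gaussian attention bias centered on a subject box.

    box: (x0, y0, x1, y1) grounding box in pixel coordinates.
    Returns an (h, w) map in (0, 1] peaking at the box center,
    decaying toward (and beyond) the box edges. sigma_scale is a
    hypothetical knob tying the Gaussian width to the box size.
    """
    x0, y0, x1, y1 = box
    cy, cx = (y0 + y1) / 2.0, (x0 + x1) / 2.0
    sy, sx = sigma_scale * (y1 - y0), sigma_scale * (x1 - x0)
    ys, xs = np.mgrid[0:h, 0:w]
    return np.exp(-(((ys - cy) / sy) ** 2 + ((xs - cx) / sx) ** 2) / 2.0)
```

Multiplying each subject's cross-attention map by such a bias would down-weight the box periphery, which is where two overlapping boxes compete.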

๐Ÿ“ Abstract
Generating multi-frame, action-rich visual narratives without fine-tuning faces a threefold tension: action text faithfulness, subject identity fidelity, and cross-frame background continuity. We propose StoryTailor, a zero-shot pipeline that runs on a single RTX 4090 (24 GB) and produces temporally coherent, identity-preserving image sequences from a long narrative prompt, per-subject references, and grounding boxes. Three synergistic modules drive the system: Gaussian-Centered Attention (GCA), which dynamically focuses on each subject's core region and eases grounding-box overlaps; Action-Boost Singular Value Reweighting (AB-SVR), which amplifies action-related directions in the text embedding space; and a Selective Forgetting Cache (SFC), which retains transferable background cues, forgets nonessential history, and selectively surfaces retained cues to build cross-scene semantic ties. Compared with baseline methods, experiments show that CLIP-T improves by up to 10–15%, DreamSim is lower than strong baselines, and CLIP-I stays in a visually acceptable, competitive range. With matched resolution and steps on a 24 GB GPU, inference is faster than FluxKontext. Qualitatively, StoryTailor delivers expressive interactions and evolving yet stable scenes.
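The abstract describes AB-SVR as amplifying action-related directions in the text embedding space. One natural reading is an SVD of the prompt's token embeddings with singular values boosted in proportion to how much energy each direction carries on action tokens. The sketch below follows that reading; the function name, the `action_mask` input, and the boost schedule are all assumptions rather than the paper's method:

```python
import numpy as np

def ab_svr(text_embeddings, action_mask, boost=1.5):
    """Sketch of singular value reweighting toward action semantics.

    text_embeddings: (seq_len, dim) prompt token embeddings.
    action_mask: boolean (seq_len,) flagging action-related tokens
                 (a hypothetical input; how the paper identifies
                 action tokens is not stated in this summary).
    Directions whose left-singular vectors put more energy on the
    action tokens get their singular values scaled up to `boost`.
    """
    U, S, Vt = np.linalg.svd(text_embeddings, full_matrices=False)
    # Energy of each singular direction on the action-token rows.
    action_energy = (U[action_mask] ** 2).sum(axis=0)  # shape (k,)
    weights = 1.0 + (boost - 1.0) * action_energy / (action_energy.max() + 1e-8)
    # Reassemble with reweighted singular values; shape is preserved.
    return (U * (S * weights)) @ Vt
```

With `boost=1.0` the weights collapse to 1 and the embeddings are reconstructed unchanged, which makes the knob easy to ablate.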
Problem

Research questions and friction points this paper is trying to address.

action-rich visual narratives
subject identity fidelity
cross-frame background continuity
zero-shot generation
multi-frame coherence
Innovation

Methods, ideas, or system contributions that make the work stand out.

Zero-Shot Generation
Gaussian-Centered Attention
Action-Boost Singular Value Reweighting
Selective Forgetting Cache
Visual Narrative Coherence
Jinghao Hu
School of Computing, Northwest University, Xi'an
Yuhe Zhang
School of Computing, Northwest University, Xi'an
GuoHua Geng
School of Computing, Northwest University, Xi'an
Kang Li
School of Computing, Northwest University, Xi'an
Han Zhang
Associate Professor at School of Automation and Intelligent Sensing, Shanghai Jiao Tong University
Control Theory, Inverse Optimal Control, Rehabilitation Robots, SLAM