🤖 AI Summary
This work addresses character identity inconsistency and visual style drift in story generation by proposing a two-stage optimization framework. First, it introduces a Group-Shared Attention (GSA) mechanism that enables lossless cross-frame identity modeling within the attention layers, preserving character consistency without requiring external encoders. Second, it applies Direct Preference Optimization (DPO) to jointly enhance visual fidelity and narrative coherence through alignment with human preferences. Notably, this is the first approach to leverage holistic preference learning for the joint optimization of identity and style. Evaluated on the ViStoryBench benchmark, the method achieves a new state of the art, improving Character Identity (CIDS) by +10.0 and Style Consistency (CSD) by +18.7 while maintaining high generation quality.
📝 Abstract
Story visualization requires generating sequential imagery that aligns semantically with an evolving narrative while maintaining rigorous consistency in character identity and visual style. However, existing methods often struggle with subject inconsistency and identity drift, particularly when depicting complex interactions or extended narrative arcs. To address these challenges, we propose a cohesive two-stage framework for robust and consistent story generation. First, we introduce Group-Shared Attention (GSA), a mechanism that fosters intrinsic consistency by enabling lossless cross-sample information flow within the attention layers. This allows the model to structurally encode identity correspondence across frames without relying on external encoders. Second, we leverage Direct Preference Optimization (DPO) to align generated outputs with human aesthetic and narrative standards. Unlike conventional methods that rely on conflicting auxiliary losses, our approach simultaneously enhances visual fidelity and identity preservation by learning from holistic preference data. Extensive evaluations on the ViStoryBench benchmark demonstrate that our method establishes a new state of the art, significantly outperforming strong baselines with gains of +10.0 in Character Identity (CIDS) and +18.7 in Style Consistency (CSD), all while preserving high-fidelity generation.
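To make the two stages concrete, the NumPy sketch below illustrates (a) an attention layer whose keys and values are pooled across all frames in a story group, so each frame's queries can attend to identity cues from every other frame, and (b) the standard DPO objective on (preferred, rejected) pairs. All function names, shapes, and the choice of `beta` are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def group_shared_attention(q, k, v):
    """Toy group-shared attention (illustrative, not the paper's code).

    q, k, v: arrays of shape (frames, tokens, dim). Keys and values are
    pooled across the whole group, so each frame attends to every frame's
    tokens and identity information flows across frames losslessly inside
    the attention layer, with no external identity encoder.
    """
    F, N, D = q.shape
    k_shared = k.reshape(F * N, D)            # pool keys across the group
    v_shared = v.reshape(F * N, D)            # pool values across the group
    scores = q @ k_shared.T / np.sqrt(D)      # (F, N, F*N) cross-frame scores
    return softmax(scores, axis=-1) @ v_shared  # (F, N, D)

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO loss: -log sigmoid(beta * implicit reward margin),
    where the margin compares policy-vs-reference log-likelihoods of the
    preferred (w) and rejected (l) samples."""
    margin = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
    return -np.log(1.0 / (1.0 + np.exp(-beta * margin)))
```

As a sanity check, a zero margin gives a loss of log 2, and raising the preferred sample's likelihood under the policy lowers the loss, which is the preference-alignment pressure the second stage relies on.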