Infinite-Story: A Training-Free Consistent Text-to-Image Generation

📅 2025-11-17
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
To address identity and style inconsistency in text-to-image generation across multiple prompts, this paper proposes a training-free, inference-time consistency control framework. The method builds on a scale-wise autoregressive architecture and introduces a novel identity prompt replacement mechanism to mitigate contextual bias in the text encoder; it further incorporates unified attention guidance and adaptive style injection modules to ensure cross-prompt identity preservation and style stability. All operations are performed solely during inference, requiring no model fine-tuning. Experiments demonstrate that the approach significantly improves cross-prompt identity and style consistency while maintaining prompt fidelity, achieving state-of-the-art performance. It generates images in 1.72 seconds per sample, over six times faster than the current fastest method.

πŸ“ Abstract
We present Infinite-Story, a training-free framework for consistent text-to-image (T2I) generation tailored for multi-prompt storytelling scenarios. Built upon a scale-wise autoregressive model, our method addresses two key challenges in consistent T2I generation: identity inconsistency and style inconsistency. To overcome these issues, we introduce three complementary techniques: Identity Prompt Replacement, which mitigates context bias in text encoders to align identity attributes across prompts; and a unified attention guidance mechanism comprising Adaptive Style Injection and Synchronized Guidance Adaptation, which jointly enforce global style and identity appearance consistency while preserving prompt fidelity. Unlike prior diffusion-based approaches that require fine-tuning or suffer from slow inference, Infinite-Story operates entirely at test time, delivering high identity and style consistency across diverse prompts. Extensive experiments demonstrate that our method achieves state-of-the-art generation performance, while offering over 6X faster inference (1.72 seconds per image) than the existing fastest consistent T2I models, highlighting its effectiveness and practicality for real-world visual storytelling.
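The abstract describes Identity Prompt Replacement only at a high level: it aligns identity attributes across prompts by mitigating context bias in the text encoder. One minimal reading is to substitute each prompt's subject phrase with a single fixed, detailed identity description before text encoding, so the encoder sees identical identity tokens regardless of the surrounding context. The sketch below illustrates that reading; the function name, the example phrases, and the plain string substitution are assumptions for illustration, not the paper's actual implementation.

```python
import re

def replace_identity(prompts, identity_phrase, canonical_identity):
    """Substitute each prompt's subject phrase with one canonical
    identity description, so the text encoder receives the same
    identity tokens in every prompt (illustrative sketch only)."""
    pattern = re.compile(re.escape(identity_phrase), flags=re.IGNORECASE)
    return [pattern.sub(canonical_identity, p) for p in prompts]

prompts = [
    "a boy riding a bicycle in the park",
    "a boy reading a book by the window",
]
aligned = replace_identity(
    prompts, "a boy",
    "a young boy with short brown hair in a red jacket",
)
```

After the substitution, every prompt carries the same identity description, which is the cross-prompt alignment effect the summary attributes to this component.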
Problem

Research questions and friction points this paper is trying to address.

Addresses identity inconsistency in multi-prompt text-to-image storytelling
Solves style inconsistency across different prompts in visual generation
Eliminates need for fine-tuning while maintaining prompt fidelity
Innovation

Methods, ideas, or system contributions that make the work stand out.

Training-free framework for consistent text-to-image generation
Identity Prompt Replacement mitigates context bias in encoders
Unified attention guidance ensures global style consistency
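The page does not spell out how the unified attention guidance works. A common way such guidance is realized in related consistency work is shared attention: each image's self-attention also attends to an anchor image's keys and values, pulling its appearance and style statistics toward the anchor. The NumPy sketch below illustrates only that general idea; the function, the anchor choice, and all shapes are assumptions, not Infinite-Story's actual Adaptive Style Injection or Synchronized Guidance Adaptation.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax over the given axis
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def shared_attention(q, k, v, k_anchor, v_anchor):
    """Self-attention in which a frame's queries attend to its own
    tokens plus an anchor frame's tokens, biasing its features toward
    the anchor. Shapes: (tokens, dim). Illustrative sketch only."""
    k_all = np.concatenate([k, k_anchor], axis=0)
    v_all = np.concatenate([v, v_anchor], axis=0)
    scores = q @ k_all.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v_all

rng = np.random.default_rng(0)
q, k, v = (rng.normal(size=(4, 8)) for _ in range(3))
k_a, v_a = (rng.normal(size=(6, 8)) for _ in range(2))
out = shared_attention(q, k, v, k_a, v_a)  # shape (4, 8)
```

Because the anchor's keys and values are shared across all images in the story, every frame mixes in the same reference features, which is one plausible mechanism for the global style and identity consistency the bullets describe.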
Jihun Park (DGIST, South Korea)
Kyoungmin Lee (DGIST, South Korea)
Jongmin Gim (DGIST, South Korea)
Hyeonseo Jo (DGIST, South Korea)
Minseok Oh (DGIST, South Korea)
Wonhyeok Choi (DGIST, South Korea)
Kyumin Hwang (DGIST, South Korea)
Jaeyeul Kim (DGIST, South Korea)
Minwoo Choi (DGIST, South Korea)
Sunghoon Im (EECS, DGIST)

Computer Vision · Deep Learning · Robot Vision