AlignedGen: Aligning Style Across Generated Images

📅 2025-09-21
📈 Citations: 0
Influential citations: 0
📄 PDF
🤖 AI Summary
Diffusion models exhibit poor style consistency when generating images from identical style prompts, and existing training-free methods, built around U-Net architectures, struggle to adapt to high-performance Diffusion Transformers (DiTs), often introducing artifacts and degrading text-image alignment. This work identifies conflicting positional encodings as the root cause of naive attention sharing's failure in DiTs and proposes Shifted Position Embedding (ShiftPE), which assigns each image a non-overlapping set of positional indices. Building on this, Advanced Attention Sharing (AAS), a suite of three techniques, fully exploits attention sharing within the DiT, and an efficient query/key/value feature extraction algorithm allows external reference images to guide the style. Experiments demonstrate substantial improvements in style consistency, elimination of artifacts such as object duplication, compatibility with mainstream DiT variants, and preservation of both generation quality and text fidelity.
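For readers unfamiliar with the mechanism, the sketch below shows what naive attention sharing means here: each image in the batch attends to the keys and values of all images, so style features can flow between them. This is an illustrative PyTorch sketch, not the paper's code; the function name and tensor shapes are assumptions.

```python
import torch
import torch.nn.functional as F

def shared_self_attention(q, k, v):
    """Naive attention sharing: every image in the batch attends to the
    concatenated keys/values of ALL images, letting style features flow
    across images. Shapes: (batch, heads, tokens, dim)."""
    b, h, n, d = k.shape
    # Fold the batch axis into the token axis so each query sees the
    # tokens of every image, then broadcast back to the full batch.
    k_all = k.permute(1, 0, 2, 3).reshape(h, b * n, d).unsqueeze(0).expand(b, -1, -1, -1)
    v_all = v.permute(1, 0, 2, 3).reshape(h, b * n, d).unsqueeze(0).expand(b, -1, -1, -1)
    # With standard RoPE, token i of every image carries the SAME position,
    # so the shared attention receives conflicting positional signals; this
    # is the failure mode ShiftPE is designed to remove.
    return F.scaled_dot_product_attention(q, k_all, v_all)
```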

📝 Abstract
Despite their generative power, diffusion models struggle to maintain style consistency across images conditioned on the same style prompt, hindering their practical deployment in creative workflows. While several training-free methods attempt to solve this, they are constrained to the U-Net architecture, which not only leads to low-quality results and artifacts like object repetition but also renders them incompatible with the superior Diffusion Transformer (DiT). To address these issues, we introduce AlignedGen, a novel training-free framework that enhances style consistency across images generated by DiT models. Our work first reveals a critical insight: naive attention sharing fails in DiT due to conflicting positional signals from improper position embeddings. We introduce Shifted Position Embedding (ShiftPE), an effective solution that resolves this conflict by allocating a non-overlapping set of positional indices to each image. Building on this foundation, we develop Advanced Attention Sharing (AAS), a suite of three techniques meticulously designed to fully unleash the potential of attention sharing within the DiT. Furthermore, to broaden the applicability of our method, we present an efficient query, key, and value feature extraction algorithm, enabling our method to seamlessly incorporate external images as style references. Extensive experimental results validate that our method effectively enhances style consistency across generated images while maintaining precise text-to-image alignment.
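The abstract's key fix, ShiftPE, amounts to giving each image in the shared-attention batch its own disjoint range of position indices before rotary embeddings are applied. Below is a minimal sketch under that reading; the offset scheme and names are assumptions, not the released code, and real DiTs such as FLUX use multi-axis position ids where the 1D version here only illustrates the disjoint-block idea.

```python
import torch

def shifted_position_ids(batch_size: int, num_tokens: int, stride: int | None = None):
    """Allocate a non-overlapping block of positional indices to each image.

    Without the shift, every image reuses indices 0..num_tokens-1, so rotary
    embeddings make tokens from different images collide positionally once
    attention is shared. Shifting each image's block by a disjoint offset
    removes that conflict. `stride` (the gap between blocks) is a free
    parameter here; the paper's exact allocation may differ.
    """
    if stride is None:
        stride = num_tokens  # disjoint, back-to-back blocks
    base = torch.arange(num_tokens)              # (num_tokens,)
    offsets = torch.arange(batch_size) * stride  # (batch_size,)
    return base.unsqueeze(0) + offsets.unsqueeze(1)  # (batch_size, num_tokens)

# Example: 3 images of 16 tokens each -> image 0 uses indices 0..15,
# image 1 uses 16..31, image 2 uses 32..47.
ids = shifted_position_ids(batch_size=3, num_tokens=16)
```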
Problem

Research questions and friction points this paper is trying to address.

Diffusion models struggle with style consistency across images
Existing methods cause artifacts and are incompatible with DiT
Position embedding conflicts prevent effective attention sharing
Innovation

Methods, ideas, or system contributions that make the work stand out.

Shifted Position Embedding resolves positional conflicts
Advanced Attention Sharing unleashes DiT attention potential
Efficient feature extraction enables external style references (sketched below)
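The last point, feature extraction from external references, plausibly reduces to caching the reference image's key/value features during a single pass and splicing them into the generation batch. The sketch below shows one way to capture such features with forward hooks; the `to_k`/`to_v` naming is an assumption borrowed from diffusers-style DiT code, not the paper's algorithm.

```python
import torch.nn as nn

class KVCache:
    """Record the key/value features a reference image produces in each
    attention layer, so a later generation pass can attend to them and
    inherit the reference style. Module names below are illustrative."""
    def __init__(self):
        self.store = {}

    def hook(self, name):
        def _capture(module, inputs, output):
            self.store[name] = output.detach()
        return _capture

def register_kv_hooks(model: nn.Module, cache: KVCache):
    """Attach forward hooks to the key/value projections of every attention
    block. Assumption: blocks expose `to_k` / `to_v` projections, as in
    diffusers-style DiT implementations."""
    handles = []
    for name, module in model.named_modules():
        if name.endswith(("to_k", "to_v")):
            handles.append(module.register_forward_hook(cache.hook(name)))
    return handles

# One denoising pass over the (noised) reference image fills `cache.store`;
# the cached K/V can then be concatenated into the shared-attention call
# above, with the reference assigned its own shifted position-index block.
```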
Authors

Jiexuan Zhang, School of Electronic and Computer Engineering, Peking University
Yiheng Du, UC Berkeley
Qian Wang, School of Electronic and Computer Engineering, Peking University
Weiqi Li, School of Electronic and Computer Engineering, Peking University
Yu Gu, School of Electronic and Computer Engineering, Peking University
Jian Zhang, School of Electronic and Computer Engineering, Peking University