Qwen-Image-Layered: Towards Inherent Editability via Layer Decomposition

📅 2025-12-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current visual generation models suffer from content entanglement in image editing, undermining editing consistency. To address this, we propose Qwen-Image-Layered, an end-to-end diffusion model that achieves, for the first time, learnable semantic decomposition of a single input image into disentangled RGBA layers—endowing images with intrinsic editability. Methodologically, we introduce a variable-layer decomposition architecture (VLD-MMDiT), a unified RGBA latent-space VAE, and a multi-stage transfer training strategy; we also construct the first high-quality, PSD-driven multilayer image generation pipeline. Experiments demonstrate that our model significantly outperforms state-of-the-art methods in decomposition fidelity, layer-count flexibility, and editing independence. It supports decomposition into arbitrary numbers of layers and enables fine-grained, layer-level editing—establishing a novel paradigm for consistent, semantically grounded image editing.

📝 Abstract
Recent visual generative models often struggle with consistency during image editing due to the entangled nature of raster images, where all visual content is fused into a single canvas. In contrast, professional design tools employ layered representations, allowing isolated edits while preserving consistency. Motivated by this, we propose **Qwen-Image-Layered**, an end-to-end diffusion model that decomposes a single RGB image into multiple semantically disentangled RGBA layers, enabling **inherent editability**, where each RGBA layer can be independently manipulated without affecting other content. To support variable-length decomposition, we introduce three key components: (1) an RGBA-VAE to unify the latent representations of RGB and RGBA images; (2) a VLD-MMDiT (Variable Layers Decomposition MMDiT) architecture capable of decomposing a variable number of image layers; and (3) a Multi-stage Training strategy to adapt a pretrained image generation model into a multilayer image decomposer. Furthermore, to address the scarcity of high-quality multilayer training images, we build a pipeline to extract and annotate multilayer images from Photoshop documents (PSD). Experiments demonstrate that our method significantly surpasses existing approaches in decomposition quality and establishes a new paradigm for consistent image editing. Our code and models are released at https://github.com/QwenLM/Qwen-Image-Layered.
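The layered representation the abstract appeals to rests on standard alpha compositing: a stack of RGBA layers can always be flattened back into one RGB canvas with the "over" operator. The sketch below is generic image math, not code from the paper; the function name is illustrative.

```python
import numpy as np

def composite_layers(layers, background=None):
    """Flatten a stack of RGBA layers (bottom to top) into one RGB image.

    layers: list of float arrays of shape (H, W, 4), values in [0, 1].
    Applies the standard "over" alpha-compositing operator per layer.
    """
    h, w, _ = layers[0].shape
    out = np.zeros((h, w, 3)) if background is None else background.astype(float)
    for layer in layers:
        rgb, alpha = layer[..., :3], layer[..., 3:4]
        out = rgb * alpha + out * (1.0 - alpha)  # "over": new over accumulated
    return out
```

Decomposition, as proposed in the paper, is the inverse problem: recover a stack of such layers whose composite reproduces the input image.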
Problem

Research questions and friction points this paper is trying to address.

Raster images fuse all visual content into a single canvas, entangling unrelated elements
Editing one element therefore tends to alter surrounding content, breaking consistency
High-quality multilayer training data (e.g., from PSD files) is scarce
Innovation

Methods, ideas, or system contributions that make the work stand out.

End-to-end decomposition of a single RGB image into a variable number of RGBA layers
RGBA-VAE unifying RGB and RGBA latents, plus VLD-MMDiT for variable-length decomposition
Multi-stage training to adapt a pretrained generator, backed by a PSD-derived data pipeline
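The contract implied by these components can be stated as a function signature: an RGB image and a requested layer count go in, that many RGBA arrays come out, with the constraint that their composite reconstructs the input. The stub below only fixes this interface; it is NOT the released model, and its trivial body (everything in the bottom layer) is a placeholder.

```python
import numpy as np

def decompose(image, num_layers):
    """Hypothetical interface of a variable-layer decomposer.

    image: (H, W, 3) float array in [0, 1].
    Returns `num_layers` RGBA arrays of shape (H, W, 4).
    Placeholder logic: the input occupies the opaque bottom layer,
    and all remaining layers are fully transparent.
    """
    h, w, _ = image.shape
    layers = [np.zeros((h, w, 4)) for _ in range(num_layers)]
    layers[0][..., :3] = image
    layers[0][..., 3] = 1.0
    return layers
```

The actual model replaces the placeholder body with the VLD-MMDiT diffusion process operating in the RGBA-VAE latent space, but the in/out shapes are the same.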