🤖 AI Summary
This work proposes LaDe (Layered Media Design), a latent diffusion framework for generating layered media designs from natural language prompts. Existing layered generation methods either fix the number of output layers or require every layer to cover a spatially continuous region, so the layer count grows linearly with design complexity. LaDe instead generates a variable number of semantically meaningful RGBA layers and unifies three tasks in one model: text-to-image generation, text-to-layers design generation, and design decomposition. It combines three components: an LLM-based prompt expander that turns a short user intent into structured per-layer descriptions, a Latent Diffusion Transformer with 4D Rotary Position Embedding (RoPE) that jointly generates the full design and its constituent layers, and an RGBA variational autoencoder that decodes each layer with full alpha-channel support. On the Crello test set, LaDe outperforms Qwen-Image-Layered in text-to-layers generation, with two VLM-as-a-judge evaluators (GPT-4o mini and Qwen3-VL) confirming improved text-to-layer alignment.
📝 Abstract
Media design layer generation enables the creation of fully editable, layered design documents such as posters, flyers, and logos using only natural language prompts. Existing methods either restrict outputs to a fixed number of layers or require each layer to contain only spatially continuous regions, causing the layer count to scale linearly with design complexity. We propose LaDe (Layered Media Design), a latent diffusion framework that generates a flexible number of semantically meaningful layers. LaDe combines three components: an LLM-based prompt expander that transforms a short user intent into structured per-layer descriptions that guide the generation, a Latent Diffusion Transformer with a 4D RoPE positional encoding mechanism that jointly generates the full media design and its constituent RGBA layers, and an RGBA VAE that decodes each layer with full alpha-channel support. By conditioning on layer samples during training, our unified framework supports three tasks: text-to-image generation, text-to-layers media design generation, and media design decomposition. We compare LaDe to Qwen-Image-Layered on text-to-layers and image-to-layers tasks on the Crello test set. LaDe outperforms Qwen-Image-Layered in text-to-layers generation by improving text-to-layer alignment, as validated by two VLM-as-a-judge evaluators (GPT-4o mini and Qwen3-VL).
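To make the relationship between the RGBA layers and the full media design concrete, the sketch below flattens a stack of layers with the standard Porter-Duff "over" operator. It is an illustration only, not LaDe's code: the abstract does not say how layers are composited, and the straight (non-premultiplied) alpha convention, the back-to-front layer order, and the names `over` and `flatten` are all assumptions made for this example.

```python
from typing import List, Tuple

# Assumed representation: straight-alpha RGBA with channels in [0, 1].
Pixel = Tuple[float, float, float, float]
Layer = List[List[Pixel]]  # rows of pixels

TRANSPARENT: Pixel = (0.0, 0.0, 0.0, 0.0)


def over(top: Pixel, bottom: Pixel) -> Pixel:
    """Porter-Duff 'over' operator for straight-alpha pixels."""
    tr, tg, tb, ta = top
    br, bg, bb, ba = bottom
    out_a = ta + ba * (1.0 - ta)
    if out_a == 0.0:
        return TRANSPARENT  # both pixels fully transparent

    def blend(t: float, b: float) -> float:
        # Weight each colour by its effective coverage, then un-premultiply.
        return (t * ta + b * ba * (1.0 - ta)) / out_a

    return (blend(tr, br), blend(tg, bg), blend(tb, bb), out_a)


def flatten(layers: List[Layer]) -> Layer:
    """Composite equally sized layers back to front (layers[0] is the bottom)."""
    height, width = len(layers[0]), len(layers[0][0])
    canvas: Layer = [[TRANSPARENT] * width for _ in range(height)]
    for layer in layers:
        canvas = [
            [over(layer[y][x], canvas[y][x]) for x in range(width)]
            for y in range(height)
        ]
    return canvas


# Hypothetical 1x2 design: an opaque white background and a badge layer
# that is half-transparent red on the left and empty on the right.
background: Layer = [[(1.0, 1.0, 1.0, 1.0), (1.0, 1.0, 1.0, 1.0)]]
badge: Layer = [[(1.0, 0.0, 0.0, 0.5), (0.0, 0.0, 0.0, 0.0)]]
flat = flatten([background, badge])
print(flat[0][0])  # (1.0, 0.5, 0.5, 1.0)
print(flat[0][1])  # (1.0, 1.0, 1.0, 1.0)
```

Because each layer carries its own alpha channel, a layer can hold several disconnected regions (the badge above could have pixels anywhere on the canvas), which is the property that lets a layered design avoid one layer per spatially continuous region.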