🤖 AI Summary
Existing sketch colorization methods suffer from imprecise text-guided colorization, heavy reliance on manual prompt engineering, and artifact-prone image references. Addressing these challenges for animation production, this paper proposes a diffusion-based coloring framework that disentangles structure and color: it leverages sketches as geometric guidance and RGB images as color references. The authors introduce a split cross-attention mechanism with LoRA fine-tuning modules to independently model and controllably edit foreground and background features. Additionally, spatial masking guidance and a switchable inference mode are incorporated to mitigate inter-region interference and spatial artifacts. Experiments demonstrate that the method consistently produces high-fidelity, artifact-free results—even under severe geometric misalignment—outperforming state-of-the-art approaches in qualitative evaluation, quantitative metrics (e.g., LPIPS, FID), and user studies. Ablation studies validate the effectiveness of each component.
📝 Abstract
Sketch colorization plays an important role in animation and digital illustration production. However, existing methods still face problems: text-guided methods fail to provide accurate color and style references, hint-guided methods still require manual operation, and image-referenced methods are prone to artifacts. To address these limitations, we propose a diffusion-based framework inspired by real-world animation production workflows. Our approach leverages the sketch as spatial guidance and an RGB image as the color reference, and separately extracts the foreground and background from the reference image with spatial masks. In particular, we introduce a split cross-attention mechanism with LoRA (Low-Rank Adaptation) modules. They are trained separately on foreground and background regions to control the corresponding key and value embeddings in cross-attention. This design allows the diffusion model to integrate foreground and background information independently, preventing interference and eliminating spatial artifacts. During inference, we design switchable inference modes for diverse use scenarios by changing which modules are activated in the framework. Extensive qualitative and quantitative experiments, along with user studies, demonstrate our advantages over existing methods in generating high-quality, artifact-free results with geometrically mismatched references. Ablation studies further confirm the effectiveness of each component. Code is available at https://github.com/tellurion-kanata/colorizeDiffusion.
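The abstract does not give implementation details, but the core idea of the split cross-attention—attending to foreground and background reference embeddings separately and merging the outputs with a spatial mask—can be sketched as follows. This is a minimal, single-head numpy illustration: the function names, shapes, and the mask-weighted merge are assumptions for exposition, and in the actual framework the two reference streams would come from separate LoRA-adapted key/value projections inside a diffusion U-Net.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # Scaled dot-product attention: (N, d) queries over (M, d) keys/values.
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v

def split_cross_attention(q, kv_fg, kv_bg, fg_mask):
    """Hypothetical split cross-attention sketch.

    q:       (N, d) query features (e.g. sketch/latent tokens)
    kv_fg:   (M_f, d) foreground reference embeddings
    kv_bg:   (M_b, d) background reference embeddings
    fg_mask: (N, 1) spatial mask, 1 where a query token is foreground

    Each region attends only to its own reference embeddings, and the
    two outputs are merged by the mask so the streams never interfere.
    """
    out_fg = attention(q, kv_fg, kv_fg)
    out_bg = attention(q, kv_bg, kv_bg)
    return fg_mask * out_fg + (1.0 - fg_mask) * out_bg
```

With an all-ones mask the output reduces to plain cross-attention on the foreground reference alone, which is one way to view the paper's switchable inference modes: activating or deactivating a stream amounts to fixing the mask or dropping one set of key/value embeddings.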