🤖 AI Summary
Existing reference-based sketch coloring methods assume semantic and spatial alignment among sketches, reference images, and ground-truth colorings during training; however, significant spatial misalignment commonly occurs during inference, causing distribution shift, overfitting, and artifacts. To address this train-inference mismatch, we propose a decoupled reference representation framework: a reference image is encoded separately by a semantic encoder (capturing high-level category structure) and a style encoder (modeling local texture and color), and their features are injected into the sketch features at distinct stages; further, a multi-granularity loss jointly optimizes semantic consistency, style fidelity, and chromatic accuracy. This design enhances robustness to spatial misalignment and supports flexible reference selection. Quantitative and qualitative evaluations on multiple benchmarks demonstrate consistent superiority over state-of-the-art methods. A user study further confirms significant improvements in both coloring quality and reference style preservation.
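The multi-granularity loss in the summary can be sketched as a weighted sum of three terms. The specific choices below (cosine distance for semantics, L1 for style, MSE for chroma, unit weights) are illustrative assumptions for exposition, not the paper's actual formulation:

```python
import numpy as np

def cosine_distance(a, b):
    """1 - cosine similarity between two flattened feature vectors."""
    a, b = a.ravel(), b.ravel()
    return 1.0 - float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8)

def multi_granularity_loss(pred, target, sem_pred, sem_ref,
                           style_pred, style_ref,
                           w_sem=1.0, w_style=1.0, w_color=1.0):
    """Hypothetical weighted sum of semantic, style, and chromatic terms.

    pred / target : colorized output and ground truth, shape (H, W, 3)
    sem_*         : high-level semantic embeddings
    style_*       : low-level style feature maps
    Weights and per-term distances are assumptions, not the paper's.
    """
    l_sem = cosine_distance(sem_pred, sem_ref)                 # semantic consistency
    l_style = float(np.mean(np.abs(style_pred - style_ref)))   # style fidelity (L1)
    l_color = float(np.mean((pred - target) ** 2))             # chromatic accuracy (MSE)
    return w_sem * l_sem + w_style * l_style + w_color * l_color
```

With identical inputs every term vanishes (up to the stabilizing epsilon), and perturbing any one component raises the total, which is the behavior a joint objective of this form should have.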
📝 Abstract
Reference-based sketch colorization methods have garnered significant attention for their potential applications in animation and digital illustration production. However, most existing methods are trained on image triplets of sketch, reference, and ground truth that are semantically and spatially similar, whereas real-world references and sketches often exhibit substantial misalignment. This mismatch in data distribution between training and inference leads to overfitting, resulting in artifacts and significant quality degradation in colorization results. To address this issue, we conduct an in-depth analysis of reference representations, defined as the intermediaries that transfer information from reference to sketch. Building on this analysis, we introduce a novel framework that leverages distinct reference representations to optimize different aspects of the colorization process. Our approach decomposes colorization into modular stages, allowing region-specific reference injection to enhance visual quality and reference similarity while mitigating spatial artifacts. Specifically, we first train a backbone network guided by high-level semantic embeddings. We then introduce a background encoder and a style encoder, trained in separate stages, to enhance low-level feature transfer and improve reference similarity. This design also enables flexible inference modes suited to a variety of use cases. Extensive qualitative and quantitative evaluations, together with a user study, demonstrate the superior performance of our proposed method compared to existing approaches. Code and pre-trained weights will be made publicly available upon paper acceptance.
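The staged, decoupled injection described in the abstract can be illustrated with a toy sketch: an early stage broadcasts a projected semantic embedding to every spatial position, and a later stage transfers low-level style statistics via AdaIN-style modulation. Both operations ignore the reference's spatial layout, which is the point of the design. All names, shapes, and the choice of modulation here are assumptions for illustration; the paper's actual injection modules are not specified in this abstract:

```python
import numpy as np

def semantic_inject(sketch_feat, sem_emb, proj):
    """Early stage: project the semantic embedding and add it to every
    spatial position, so high-level guidance is layout-independent.

    sketch_feat : (C, H, W) sketch features
    sem_emb     : (D,) semantic embedding of the reference
    proj        : (C, D) hypothetical learned projection
    """
    bias = proj @ sem_emb                       # (C,)
    return sketch_feat + bias[:, None, None]    # (C, H, W)

def style_inject(feat, style_feat, eps=1e-6):
    """Late stage: AdaIN-style modulation -- replace per-channel mean/std
    of the features with those of the reference style features, again
    without requiring any spatial alignment."""
    mu_s = style_feat.mean(axis=(1, 2), keepdims=True)
    std_s = style_feat.std(axis=(1, 2), keepdims=True)
    mu = feat.mean(axis=(1, 2), keepdims=True)
    std = feat.std(axis=(1, 2), keepdims=True)
    return (feat - mu) / (std + eps) * std_s + mu_s
```

Because the two encoders feed different stages, a reference with the right colors but the wrong layout still modulates the sketch correctly: the semantic bias carries category-level intent, while the style stage only consumes channel-wise statistics.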