Enhancing Reference-based Sketch Colorization via Separating Reference Representations

📅 2025-08-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing reference-based sketch coloring methods assume semantic and spatial alignment among sketches, reference images, and ground-truth colorings during training; however, significant spatial misalignment commonly occurs during inference, causing distribution shift, overfitting, and artifacts. To address this train-inference mismatch, we propose a decoupled reference representation framework: a reference image is encoded separately by a semantic encoder (capturing high-level category structure) and a style encoder (modeling local texture and color), with their features injected into sketch features in distinct stages; further, a multi-granularity loss jointly optimizes semantic consistency, style fidelity, and chromatic accuracy. This design enhances robustness to spatial misalignment and supports flexible reference selection. Quantitative and qualitative evaluations on multiple benchmarks demonstrate consistent superiority over state-of-the-art methods. A user study further confirms significant improvements in both coloring quality and reference style preservation.
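The paper's actual architecture is not reproduced on this page, but the core idea in the summary, encoding the reference twice and injecting each representation at a different stage, can be sketched abstractly. The NumPy toy below is a hypothetical illustration under assumed mechanisms (global pooling for semantics, channel statistics plus AdaIN-style renormalization for style); the function names and injection schemes are stand-ins, not the authors' method:

```python
import numpy as np

rng = np.random.default_rng(0)

def semantic_encode(reference):
    # High-level representation: global average over spatial dimensions,
    # deliberately discarding layout so spatial misalignment cannot leak in.
    return reference.mean(axis=(0, 1))                                  # (C,)

def style_encode(reference):
    # Low-level representation: channel-wise color/texture statistics.
    return reference.mean(axis=(0, 1)), reference.std(axis=(0, 1))      # (C,), (C,)

def inject_semantic(sketch_feat, sem_vec):
    # Stage 1 (assumed FiLM-like conditioning): scale sketch features
    # by the reference's semantic vector, channel by channel.
    return sketch_feat * (1.0 + sem_vec)

def inject_style(sketch_feat, style_mean, style_std, eps=1e-5):
    # Stage 2 (assumed AdaIN-style): renormalize sketch feature statistics
    # toward the reference's channel statistics.
    mu = sketch_feat.mean(axis=(0, 1), keepdims=True)
    sigma = sketch_feat.std(axis=(0, 1), keepdims=True)
    return (sketch_feat - mu) / (sigma + eps) * style_std + style_mean

H, W, C = 8, 8, 4
sketch_feat = rng.normal(size=(H, W, C))   # stand-in for encoded sketch features
reference   = rng.normal(size=(H, W, C))   # stand-in for encoded reference features

sem = semantic_encode(reference)
mu_r, sigma_r = style_encode(reference)
out = inject_style(inject_semantic(sketch_feat, sem), mu_r, sigma_r)

# After stage 2 the output's channel statistics match the reference's,
# regardless of how the reference was spatially arranged.
print(np.allclose(out.mean(axis=(0, 1)), mu_r))
```

Because both encoders reduce away the reference's spatial layout, a spatially misaligned reference yields the same conditioning signals, which is the robustness property the summary attributes to the decoupled design.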

📝 Abstract
Reference-based sketch colorization methods have garnered significant attention for their potential application in animation and digital illustration production. However, most existing methods are trained with image triplets of sketch, reference, and ground truth that are semantically and spatially similar, while real-world references and sketches often exhibit substantial misalignment. This mismatch in data distribution between training and inference leads to overfitting, consequently resulting in artifacts and significant quality degradation in colorization results. To address this issue, we conduct an in-depth analysis of reference representations, defined as the intermediary that transfers information from reference to sketch. Building on this analysis, we introduce a novel framework that leverages distinct reference representations to optimize different aspects of the colorization process. Our approach decomposes colorization into modular stages, allowing region-specific reference injection to enhance visual quality and reference similarity while mitigating spatial artifacts. Specifically, we first train a backbone network guided by high-level semantic embeddings. We then introduce a background encoder and a style encoder, trained in separate stages, to enhance low-level feature transfer and improve reference similarity. This design also enables flexible inference modes suited to a variety of use cases. Extensive qualitative and quantitative evaluations, together with a user study, demonstrate the superior performance of our proposed method compared to existing approaches. Code and pre-trained weights will be made publicly available upon paper acceptance.
Problem

Research questions and friction points this paper is trying to address.

Addresses sketch-reference misalignment in colorization
Reduces artifacts from training-inference data mismatch
Enhances color transfer via modular representation separation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Decomposes colorization into modular stages
Uses separate background and style encoders
Enables region-specific reference injection
Dingkun Yan
Tokyo University of Science, Japan
Xinrui Wang
The University of Tokyo, Japan
Zhuoru Li
Project HAT, China
Suguru Saito
Tokyo University of Science, Japan
Yusuke Iwasawa
The University of Tokyo
Yutaka Matsuo
The University of Tokyo, Japan
Jiaxian Guo
Google Research