🤖 AI Summary
This work addresses a limitation of existing painting process generation methods: they rely heavily on task-specific datasets and labor-intensive manual annotations, which hinders generalization to arbitrary input images. To overcome this, we propose a fully self-supervised framework that requires no ground-truth drawing sequences. Methodologically, we introduce a self-supervised video dataset construction pipeline that leverages depth estimation and differentiable stroke rendering. We design depth fusion layers that explicitly model two fundamental human drawing behaviors, refinement and layering, and integrate them into a video diffusion architecture to generate temporally coherent videos that reverse the drawing process by progressively removing strokes from a reference image. Experiments demonstrate that our approach produces high-fidelity, temporally consistent, human-like painting videos across diverse image categories, significantly outperforming supervised baselines. To our knowledge, this is the first method capable of universal painting process generation without any manually annotated drawing data.
📝 Abstract
Humans can intuitively decompose an image into a sequence of strokes to create a painting, yet existing methods for generating drawing processes are limited to specific data types and often rely on expensive human-annotated datasets. We propose a novel self-supervised framework for generating drawing processes from any type of image, treating the task as a video generation problem. Our approach reverses the drawing process by progressively removing strokes from a reference image, simulating a human-like creation sequence. Crucially, our method does not require costly datasets of real human drawing processes; instead, we leverage depth estimation and stroke rendering to construct a self-supervised dataset. We model human drawings as "refinement" and "layering" processes and introduce depth fusion layers to enable video generation models to learn and replicate human drawing behavior. Extensive experiments validate the effectiveness of our approach, demonstrating its ability to generate realistic drawings without the need for real drawing process data.
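
To make the "layering" idea and the reversed, stroke-removal ordering more concrete, the following is a minimal sketch and not the authors' pipeline. It assumes that a depth map is already available, that larger depth values mean farther from the viewer, and that farther regions are painted first. It also reveals raw pixels instead of rendering strokes, and it involves no video model. The function names `layering_frames` and `removal_sequence` are hypothetical and chosen only for illustration.

```python
from typing import List

Grid = List[List[float]]


def layering_frames(image: Grid, depth: Grid, num_layers: int,
                    blank: float = 1.0) -> List[Grid]:
    """Build a pseudo drawing sequence from a blank canvas to the full image.

    Assumption (not from the source): larger depth = farther away, and
    farther pixels are painted before nearer ones.
    Returns num_layers + 1 frames, the first one blank.
    """
    h, w = len(image), len(image[0])
    # Pixel coordinates ordered from farthest to nearest.
    order = sorted(((y, x) for y in range(h) for x in range(w)),
                   key=lambda p: depth[p[0]][p[1]], reverse=True)
    canvas = [[blank] * w for _ in range(h)]
    frames = [[row[:] for row in canvas]]
    done = 0
    for k in range(1, num_layers + 1):
        # Reveal an equal share of the remaining pixels in each layer.
        target = len(order) * k // num_layers
        for y, x in order[done:target]:
            canvas[y][x] = image[y][x]
        done = target
        frames.append([row[:] for row in canvas])
    return frames


def removal_sequence(frames: List[Grid]) -> List[Grid]:
    """Reverse the drawing order: start at the full image, end blank."""
    return frames[::-1]


image = [[0.2, 0.4], [0.6, 0.8]]
depth = [[0.9, 0.9], [0.1, 0.1]]  # top row is farther away
frames = layering_frames(image, depth, num_layers=2)
reversed_frames = removal_sequence(frames)
```

In this toy setup the reversed sequence starts from the reference image and ends on a blank canvas, which mirrors the stroke-removal direction described in the abstract; playing it backwards gives the forward drawing order.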