Self-Supervised Weight Templates for Scalable Vision Model Initialization

📅 2026-01-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work proposes SWEET, a scalable framework for initializing vision models that overcomes the inflexibility of conventional pre-trained models in adapting to architectures of varying depth and width. SWEET leverages self-supervised learning to construct a shared-weight template together with lightweight, size-specific scalers, enabling modular parameter generation under Tucker tensor decomposition constraints. A key innovation is the introduction of a stochastic width-scaling strategy during training, which substantially enhances cross-width generalization. Extensive experiments demonstrate that SWEET achieves state-of-the-art performance across diverse tasks—including image classification, object detection, semantic segmentation, and generative modeling—while enabling efficient initialization and transfer of vision models at multiple scales.

📝 Abstract
The increasing scale and complexity of modern models underscore the importance of pre-trained weights. However, deployment often demands architectures of varying sizes, exposing the limitations of the conventional pre-train-then-fine-tune paradigm. To address this, we propose SWEET, a self-supervised framework that performs constraint-based pre-training to enable scalable initialization for vision tasks. Instead of pre-training a fixed-size model, we learn a shared weight template and size-specific weight scalers under Tucker-based factorization, which promotes modularity and supports flexible adaptation to architectures with varying depths and widths. Target models are then initialized by composing and reweighting the template through lightweight weight scalers, whose parameters can be learned efficiently from minimal training data. To further enhance flexibility in width expansion, we introduce width-wise stochastic scaling, which regularizes the template along width-related dimensions and encourages robust, width-invariant representations for improved cross-width generalization. Extensive experiments on classification, detection, segmentation, and generation tasks demonstrate the state-of-the-art performance of SWEET for initializing variable-sized vision models.
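To make the idea concrete, here is a minimal sketch of Tucker-style weight composition: a shared core tensor (the "weight template") is contracted with small, size-specific factor matrices (the "weight scalers") to produce weights for a target model of any chosen width and depth. All shapes, names, and the random factors are illustrative assumptions, not the paper's actual implementation or learned parameters.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical ranks for the shared template (illustrative choice).
r_out, r_in, r_layer = 16, 16, 8
core = rng.standard_normal((r_out, r_in, r_layer))  # shared weight template

def compose_weights(d_out, d_in, n_layers, core, rng):
    """Compose per-layer weight matrices from a shared Tucker core via
    mode products with lightweight, size-specific factor matrices.
    In SWEET these factors would be learned; here they are random stand-ins."""
    U = rng.standard_normal((d_out, core.shape[0])) / np.sqrt(core.shape[0])
    V = rng.standard_normal((d_in, core.shape[1])) / np.sqrt(core.shape[1])
    S = rng.standard_normal((n_layers, core.shape[2])) / np.sqrt(core.shape[2])
    # W[l] = core x_1 U x_2 V x_3 S[l]  ->  shape (n_layers, d_out, d_in)
    return np.einsum('oil,po,qi,nl->npq', core, U, V, S)

# One fixed-size core can initialize models of different widths/depths:
small = compose_weights(48, 48, 2, core, rng)   # (2, 48, 48)
large = compose_weights(128, 96, 6, core, rng)  # (6, 128, 96)
print(small.shape, large.shape)
```

The point of the factorization is that only the small factor matrices depend on the target size, so adapting to a new width or depth means fitting a few lightweight scalers rather than re-pre-training the full model.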
Problem

Research questions and friction points this paper is trying to address.

scalable initialization
vision models
model scaling
weight templates
self-supervised pre-training
Innovation

Methods, ideas, or system contributions that make the work stand out.

self-supervised learning
scalable model initialization
weight template
Tucker decomposition
width-wise stochastic scaling
Yucheng Xie
School of Computer Science and Engineering, Southeast University, Nanjing, China; Key Laboratory of New Generation Artificial Intelligence Technology and Its Interdisciplinary Applications (Southeast University), Ministry of Education, China
Fu Feng
School of Computer Science and Engineering, Southeast University, Nanjing, China; Key Laboratory of New Generation Artificial Intelligence Technology and Its Interdisciplinary Applications (Southeast University), Ministry of Education, China
Ruixiao Shi
School of Computer Science and Engineering, Southeast University, Nanjing, China; Key Laboratory of New Generation Artificial Intelligence Technology and Its Interdisciplinary Applications (Southeast University), Ministry of Education, China
Jing Wang
Nanjing University
Yong Rui
School of Computer Science and Engineering, Southeast University, Nanjing, China; Key Laboratory of New Generation Artificial Intelligence Technology and Its Interdisciplinary Applications (Southeast University), Ministry of Education, China
Xin Geng
School of Computer Science and Engineering, Southeast University
Artificial Intelligence · Pattern Recognition · Machine Learning