🤖 AI Summary
This work bridges the gap between variational modeling and modern self-supervised learning by proposing the first decoder-free variational self-supervised learning framework. To address the limitations of reconstruction-based objectives, the method employs a dual-encoder architecture with a momentum-updated teacher network, replacing pixel-level reconstruction with cross-view denoising. It introduces cosine-based KL divergence and log-likelihood terms to enforce semantic alignment in high-dimensional latent spaces. By eliminating the explicit decoder, the framework significantly improves training efficiency and representation quality. Extensive experiments on CIFAR-10/100 and ImageNet-100 demonstrate competitive or superior performance compared to state-of-the-art methods including BYOL and MoCo v3. These results validate the effectiveness and scalability of integrating probabilistic modeling principles—particularly variational inference—with non-contrastive self-supervision, offering a novel paradigm for efficient, principled representation learning.
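The momentum-updated teacher mentioned above is the standard exponential moving average (EMA) used in BYOL/MoCo-style methods: after each student gradient step, the teacher's weights drift slowly toward the student's. A minimal sketch, using plain Python lists in place of tensors (real implementations update parameter tensors in-place without gradients):

```python
# Minimal EMA teacher update sketch: t <- m*t + (1-m)*s.
# Parameters are plain lists of floats here for illustration only;
# the momentum value 0.99 is a typical choice, not the paper's setting.

def ema_update(teacher_params, student_params, momentum=0.99):
    """Blend teacher parameters toward the student's: t <- m*t + (1-m)*s."""
    return [momentum * t + (1.0 - momentum) * s
            for t, s in zip(teacher_params, student_params)]

# With momentum 0.99 the teacher moves 1% of the way toward the
# student each step, giving a slowly varying, stable target network.
teacher = [0.0, 1.0]
student = [1.0, 1.0]
teacher = ema_update(teacher, student, momentum=0.99)
```

Because the teacher changes slowly, it provides a stable target distribution (here, the dynamic prior) while the student is trained by gradient descent.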
📝 Abstract
We present Variational Self-Supervised Learning (VSSL), a novel framework that combines variational inference with self-supervised learning to enable efficient, decoder-free representation learning. Unlike traditional VAEs that rely on input reconstruction via a decoder, VSSL symmetrically couples two encoders with Gaussian outputs. A momentum-updated teacher network defines a dynamic, data-dependent prior, while the student encoder produces an approximate posterior from augmented views. The reconstruction term in the ELBO is replaced with a cross-view denoising objective, preserving the analytical tractability of the Gaussian KL divergence. We further introduce cosine-based formulations of the KL and log-likelihood terms to enhance semantic alignment in high-dimensional latent spaces. Experiments on CIFAR-10, CIFAR-100, and ImageNet-100 show that VSSL achieves competitive or superior performance to leading self-supervised methods, including BYOL and MoCo v3. VSSL offers a scalable, probabilistically grounded approach to learning transferable representations without generative reconstruction, bridging the gap between variational modeling and modern self-supervised techniques.
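The "analytical tractability" claim rests on the closed-form KL divergence between two diagonal Gaussians, here the student posterior and the teacher-defined prior. A minimal sketch below implements that closed form, plus one *hypothetical* reading of the cosine-based modification (weighting the KL by the misalignment of the two means); the paper's exact formulation may differ and is not reproduced here:

```python
import math

def gaussian_kl(mu_q, sigma_q, mu_p, sigma_p):
    """Closed-form KL(N(mu_q, diag sigma_q^2) || N(mu_p, diag sigma_p^2)).

    Per-dimension: log(sp/sq) + (sq^2 + (mq - mp)^2) / (2 sp^2) - 1/2.
    """
    return sum(
        math.log(sp / sq) + (sq**2 + (mq - mp)**2) / (2 * sp**2) - 0.5
        for mq, sq, mp, sp in zip(mu_q, sigma_q, mu_p, sigma_p)
    )

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

def cosine_scaled_kl(mu_q, sigma_q, mu_p, sigma_p):
    """Hypothetical cosine weighting: down-weight the KL when the
    posterior and prior means already point in the same direction."""
    return (1.0 - cosine(mu_q, mu_p)) * gaussian_kl(mu_q, sigma_q, mu_p, sigma_p)
```

The motivation for a cosine-based term is that in high-dimensional latent spaces, angular alignment of embeddings is a better proxy for semantic similarity than Euclidean closeness, which the plain Gaussian KL penalizes.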