🤖 AI Summary
Existing unsupervised methods struggle to recover semantic factors in real-world data, while supervised approaches often rely on adversarial training or classifiers, suffering from instability and poor scalability. This work proposes a weakly supervised variational autoencoder framework that leverages contrastive supervision—specifically, the InfoNCE loss—to map designated factors into independent subspaces. By integrating KL regularization to impose a Gaussian structure on the latent space, the model achieves disentangled representations with explicit factor control. Notably, the approach eliminates the need for adversarial training or additional annotations. Evaluated on multiple real-world datasets including CelebA, it attains state-of-the-art disentanglement performance using fixed hyperparameters, enabling high-quality factor alignment and controllable attribute swapping.
📝 Abstract
Disentangled representation learning aims to map independent factors of variation to independent representation components. On one hand, purely unsupervised approaches have proven successful on fully disentangled synthetic data, but fail to recover semantic factors from real data without strong inductive biases. On the other hand, supervised approaches are unstable and hard to scale to large attribute sets because they rely on adversarial objectives or auxiliary classifiers. We introduce \textsc{XFactors}, a weakly-supervised VAE framework that disentangles and provides explicit control over a chosen set of factors. Building on the Disentangled Information Bottleneck perspective, we decompose the representation into factor-specific subspaces $\mathcal{T}_1,\ldots,\mathcal{T}_K$ and a residual subspace $\mathcal{S}$. Each target factor is encoded in its assigned $\mathcal{T}_i$ through contrastive supervision: an InfoNCE loss pulls together latents sharing the same factor value and pushes apart mismatched pairs. In parallel, KL regularization imposes a Gaussian structure on both $\mathcal{S}$ and the aggregated factor subspaces, organizing the geometry without additional supervision for non-targeted factors and avoiding adversarial training and classifiers. Across multiple datasets, with fixed hyperparameters, \textsc{XFactors} achieves state-of-the-art disentanglement scores and yields consistent qualitative factor alignment in the corresponding subspaces, enabling controlled factor swapping via latent replacement. We further demonstrate that our method scales gracefully with increasing latent capacity and evaluate it on the real-world dataset CelebA. Our code is available at \href{https://github.com/ICML26-anon/XFactors}{github.com/ICML26-anon/XFactors}.
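The two mechanisms the abstract names can be sketched concretely: an InfoNCE loss over per-factor latents (pull same-factor-value pairs together, push mismatched pairs apart) and factor swapping by replacing one subspace $\mathcal{T}_i$ of a latent with a donor's. This is a minimal illustrative sketch, not the paper's implementation; the function names, the cosine similarity, and the temperature value are assumptions made here for clarity.

```python
import numpy as np

def info_nce(anchor, positive, negatives, temperature=0.1):
    """InfoNCE loss for a single anchor latent: the positive shares the
    anchor's factor value, the negatives do not. Cosine similarity and
    temperature=0.1 are illustrative choices (the abstract does not
    specify the similarity function or temperature)."""
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    # Logit 0 is the positive pair; the rest are mismatched pairs.
    logits = np.array([cos(anchor, positive)] +
                      [cos(anchor, n) for n in negatives]) / temperature
    logits -= logits.max()  # numerical stability before the softmax
    # Cross-entropy with the positive as the target class (index 0).
    return -logits[0] + np.log(np.exp(logits).sum())

def swap_factor(z_target, z_donor, factor_slice):
    """Controlled factor swap via latent replacement: copy the donor's
    coordinates for one factor subspace T_i (here a hypothetical
    contiguous slice) into the target latent, leaving the residual
    subspace S and all other T_j untouched."""
    out = z_target.copy()
    out[factor_slice] = z_donor[factor_slice]
    return out
```

Minimizing `info_nce` drives same-factor latents together within each subspace, which is what makes the subsequent `swap_factor` replacement transfer exactly the targeted attribute.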