Improving Joint Embedding Predictive Architecture with Diffusion Noise

📅 2025-07-20
🤖 AI Summary
This work addresses the limited representation capability of self-supervised learning (SSL) by proposing N-JEPA, a framework that integrates diffusion-style noise with masked image modeling. Methodologically, it reformulates diffusion noise as a structured masking process, introduces learnable mask tokens augmented with positional embeddings, and designs a multi-stage noise scheduling strategy to enhance feature robustness. Built upon the JEPA architecture, N-JEPA performs noise-aware hierarchical feature reconstruction, enabling discriminative representation learning guided by generative priors. Extensive experiments demonstrate that N-JEPA significantly outperforms state-of-the-art SSL methods, including MAE and SimMIM, on downstream classification benchmarks such as ImageNet-1K, validating both the effectiveness and generalizability of noise-driven representation learning. The code will be made publicly available.

📝 Abstract
Self-supervised learning (SSL) has become a highly successful approach to feature learning and is widely applied to downstream tasks. It has proven especially effective for discriminative tasks, where it surpasses currently popular generative models. Generative models, however, perform better at image generation and detail enhancement. It is therefore natural to look for a connection between SSL and generative models that can further enhance the representation capacity of SSL. Because generative models create new samples by approximating the data distribution, such modeling should also lead to a semantic understanding of the raw visual data, which is necessary for recognition tasks. This motivates us to combine the core principle of diffusion models, diffusion noise, with SSL to learn a competitive recognition model. Specifically, diffusion noise can be viewed as a particular state of a mask, which reveals a close relationship between masked image modeling (MIM) and diffusion models. In this paper, we propose N-JEPA (Noise-based JEPA), which incorporates diffusion noise into MIM through the position embeddings of masked tokens. A multi-level noise schedule serves as a series of feature augmentations that further enhance the robustness of the model. We perform a comprehensive study to confirm its effectiveness on downstream classification tasks. The code will be released publicly soon.
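The view of diffusion noise as a particular state of a mask can be illustrated with the standard forward-diffusion step, x_t = sqrt(a) * x_0 + sqrt(1 - a) * eps, where a is the cumulative signal-retention factor. The sketch below is only an illustration under stated assumptions: the abstract does not specify the schedule, so a DDPM-style linear beta schedule is assumed, and the function names `linear_alpha_bar` and `add_diffusion_noise` are hypothetical, not taken from the paper's code.

```python
import math
import random


def linear_alpha_bar(num_steps, beta_start=1e-4, beta_end=0.02):
    """Cumulative signal-retention factors for a linear beta schedule.

    Assumption: DDPM-style defaults; the paper's actual schedule may differ.
    """
    alpha_bars = []
    prod = 1.0
    for t in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * t / max(num_steps - 1, 1)
        prod *= 1.0 - beta
        alpha_bars.append(prod)
    return alpha_bars


def add_diffusion_noise(x, alpha_bar, rng):
    """Forward diffusion step: sqrt(alpha_bar) * x + sqrt(1 - alpha_bar) * eps."""
    s = math.sqrt(alpha_bar)
    n = math.sqrt(1.0 - alpha_bar)
    return [s * v + n * rng.gauss(0.0, 1.0) for v in x]


if __name__ == "__main__":
    rng = random.Random(0)
    patch = [0.5, -1.0, 2.0]          # a toy patch feature
    bars = linear_alpha_bar(1000)
    for t in (0, 499, 999):           # early, middle, and final timestep
        print(t, round(bars[t], 5), add_diffusion_noise(patch, bars[t], rng))
```

At `alpha_bar = 1` the input passes through unchanged (an unmasked patch), while at `alpha_bar = 0` the output is pure Gaussian noise that carries no information about the input (a fully masked patch). Intermediate values act as a soft, partial mask.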
Problem

Research questions and friction points this paper is trying to address.

Enhancing SSL with diffusion noise for better representation
Bridging SSL and generative models for improved recognition
Using noise schedules to boost model robustness
Innovation

Methods, ideas, or system contributions that make the work stand out.

Combines diffusion noise with self-supervised learning
Injects diffusion noise through the position embeddings of masked tokens
Applies a multi-level noise schedule as a series of feature augmentations
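One way to picture the contributions listed above is as a routine that builds the predictor's queries for masked patches at several noise levels. This is a minimal sketch under assumptions, not the authors' implementation: the function name `noisy_mask_queries`, the additive combination of mask token and position embedding, and the use of plain Python lists in place of tensors are all hypothetical choices made for illustration.

```python
import math
import random


def noisy_mask_queries(pos_embed, masked_idx, mask_token, alpha_bars, seed=0):
    """Return one list of predictor queries per noise level.

    pos_embed:  per-patch positional embeddings (a list of float lists)
    masked_idx: indices of the patches hidden from the context encoder
    mask_token: shared mask-token vector (learnable in a real model)
    alpha_bars: signal-retention factors in [0, 1], one per noise level
    All names and the additive design are assumptions for illustration.
    """
    rng = random.Random(seed)
    levels = []
    for a in alpha_bars:
        s = math.sqrt(a)
        n = math.sqrt(1.0 - a)
        queries = []
        for i in masked_idx:
            # Diffuse the position embedding of the masked patch.
            noised = [s * p + n * rng.gauss(0.0, 1.0) for p in pos_embed[i]]
            # Combine it with the shared mask token to form the query.
            queries.append([m + q for m, q in zip(mask_token, noised)])
        levels.append(queries)
    return levels


if __name__ == "__main__":
    pos = [[0.1 * i, 0.2 * i, 0.3 * i] for i in range(4)]  # 4 patches, dim 3
    token = [1.0, 1.0, 1.0]
    for level in noisy_mask_queries(pos, [1, 3], token, [1.0, 0.9, 0.5]):
        print(level)
```

Each noise level yields a differently perturbed view of the same masked positions, which is the sense in which a multi-level schedule can act as a series of feature augmentations.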
Yuping Qiu
The Hong Kong University of Science and Technology (Guangzhou)
Diffusion models, controllable image/video generation, multi-modal understanding and generation
Rui Zhu
The Chinese University of Hong Kong, Shenzhen
Ying-cong Chen
The Hong Kong University of Science and Technology (Guangzhou)