🤖 AI Summary
3D representation learning suffers from scarce large-scale 3D data, and existing cross-modal methods that compensate with 2D data tend to have large parameter counts, slow training, and deployment difficulty. This paper proposes CrossJEPA, the first non-masked joint embedding predictive architecture for cross-modal (image-to-point-cloud) learning. Its key contributions are: (1) a frozen teacher model coupled with a one-time target-embedding caching mechanism, which purifies the supervision signal and substantially improves training efficiency; and (2) knowledge distillation from an image foundation model combined with a cross-domain-conditioned predictor, enabling lightweight, aligned cross-modal representations. Evaluated on ModelNet40 and ScanObjectNN, CrossJEPA achieves linear-probe accuracies of 94.2% and 88.3%, respectively, using only 14.1M parameters and approximately six hours of training on a single GPU. It attains state-of-the-art performance while offering strong deployment efficiency and scalability.
📝 Abstract
Image-to-point cross-modal learning has emerged to address the scarcity of large-scale 3D datasets in 3D representation learning. However, current methods that leverage 2D data often result in large, slow-to-train models, making them computationally expensive and difficult to deploy in resource-constrained environments. The architecture design of such models is therefore critical, determining their performance, memory footprint, and compute efficiency. The Joint-embedding Predictive Architecture (JEPA) has gained wide popularity in self-supervised learning for its simplicity and efficiency, but has been under-explored in cross-modal settings, partly due to the misconception that masking is intrinsic to JEPA. In this light, we propose CrossJEPA, a simple Cross-modal Joint Embedding Predictive Architecture that harnesses the knowledge of an image foundation model and trains a predictor to infer embeddings of specific rendered 2D views from corresponding 3D point clouds, thereby introducing a JEPA-style pretraining strategy beyond masking. By conditioning the predictor on cross-domain projection information, CrossJEPA purifies the supervision signal from semantics exclusive to the target domain. We further exploit the frozen teacher design with a one-time target embedding caching mechanism, yielding amortized efficiency. CrossJEPA achieves a new state-of-the-art in linear probing on the synthetic ModelNet40 (94.2%) and the real-world ScanObjectNN (88.3%) benchmarks, using only 14.1M pretraining parameters (8.5M in the point encoder), and about 6 pretraining hours on a standard single GPU. These results position CrossJEPA as a performant, memory-efficient, and fast-to-train framework for 3D representation learning via knowledge distillation. We analyze CrossJEPA intuitively, theoretically, and empirically, and extensively ablate our design choices. Code will be made available.
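The pretraining recipe described above (a frozen image teacher whose view embeddings are cached once, and a point-cloud student whose predictor is conditioned on projection information) can be sketched in a toy form. This is a minimal illustration with hypothetical shapes and linear stand-ins for the real networks, not the authors' implementation; the teacher, encoder, and predictor here are plain matrices, and the "projection information" is reduced to a one-hot view id.

```python
import numpy as np

# Toy sketch of CrossJEPA-style pretraining (hypothetical dimensions and
# linear models; the real system uses an image foundation model as teacher
# and a point encoder + predictor as student).
rng = np.random.default_rng(0)
D_PT, D_VIEW, D_EMB, N_VIEWS = 32, 64, 16, 4

# Frozen teacher: fixed weights, never updated during pretraining.
W_teacher = rng.standard_normal((D_VIEW, D_EMB)) * 0.1
def teacher_embed(view_pixels):
    return view_pixels @ W_teacher

# One-time target caching: the teacher runs once per rendered 2D view;
# its embeddings are stored and reused every epoch (amortized efficiency).
views = rng.standard_normal((N_VIEWS, D_VIEW))   # fake rendered views of one shape
target_cache = teacher_embed(views)              # teacher is never called again

# Student: point encoder plus a predictor conditioned on projection info
# (here simplified to a one-hot view id appended to the point embedding).
points = rng.standard_normal(D_PT)               # fake point-cloud feature
W_enc = rng.standard_normal((D_PT, D_EMB)) * 0.1
W_pred = rng.standard_normal((D_EMB + N_VIEWS, D_EMB)) * 0.1

def mse(view_id):
    z = points @ W_enc                           # point embedding
    x = np.concatenate([z, np.eye(N_VIEWS)[view_id]])
    err = x @ W_pred - target_cache[view_id]     # regress the cached target
    return float(err @ err) / D_EMB

loss_before = mse(0)

# Analytic gradient descent on the predictor for view 0:
# d/dW ||xW - t||^2 / D = (2/D) * outer(x, xW - t)
lr = 0.05
for _ in range(200):
    z = points @ W_enc
    x = np.concatenate([z, np.eye(N_VIEWS)[0]])
    err = x @ W_pred - target_cache[0]
    W_pred -= lr * (2.0 / D_EMB) * np.outer(x, err)

loss_after = mse(0)                              # should be well below loss_before
```

The caching step is where the claimed training-time savings come from: because the teacher is frozen, its forward passes can be precomputed once rather than repeated every epoch, and only the lightweight point encoder and predictor are updated.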