CrossJEPA: Cross-Modal Joint-Embedding Predictive Architecture for Efficient 3D Representation Learning from 2D Images

📅 2025-11-23
📈 Citations: 0
Influential: 0
🤖 AI Summary
3D representation learning suffers from scarce 3D data, and existing cross-modal methods that compensate with 2D data tend to be parameter-heavy, slow to train, and hard to deploy. This paper proposes CrossJEPA, the first non-masked joint-embedding predictive architecture for cross-modal (image-to-point-cloud) learning. Its key contributions are: (1) a frozen teacher model coupled with a one-time target-embedding caching mechanism, which purifies the supervision signal and substantially improves training efficiency; and (2) knowledge distillation from an image foundation model through a cross-domain conditioned predictor, yielding lightweight, aligned cross-modal representations. Evaluated on ModelNet40 and ScanObjectNN, CrossJEPA achieves linear-probe accuracies of 94.2% and 88.3%, respectively, using only 14.1M parameters and about six hours of training on a single GPU. It attains state-of-the-art performance while offering strong deployment efficiency and scalability.

📝 Abstract
Image-to-point cross-modal learning has emerged to address the scarcity of large-scale 3D datasets in 3D representation learning. However, current methods that leverage 2D data often result in large, slow-to-train models, making them computationally expensive and difficult to deploy in resource-constrained environments. The architecture design of such models is therefore critical, determining their performance, memory footprint, and compute efficiency. The Joint-embedding Predictive Architecture (JEPA) has gained wide popularity in self-supervised learning for its simplicity and efficiency, but has been under-explored in cross-modal settings, partly due to the misconception that masking is intrinsic to JEPA. In this light, we propose CrossJEPA, a simple Cross-modal Joint Embedding Predictive Architecture that harnesses the knowledge of an image foundation model and trains a predictor to infer embeddings of specific rendered 2D views from corresponding 3D point clouds, thereby introducing a JEPA-style pretraining strategy beyond masking. By conditioning the predictor on cross-domain projection information, CrossJEPA purifies the supervision signal from semantics exclusive to the target domain. We further exploit the frozen teacher design with a one-time target embedding caching mechanism, yielding amortized efficiency. CrossJEPA achieves a new state-of-the-art in linear probing on the synthetic ModelNet40 (94.2%) and the real-world ScanObjectNN (88.3%) benchmarks, using only 14.1M pretraining parameters (8.5M in the point encoder), and about 6 pretraining hours on a standard single GPU. These results position CrossJEPA as a performant, memory-efficient, and fast-to-train framework for 3D representation learning via knowledge distillation. We analyze CrossJEPA intuitively, theoretically, and empirically, and extensively ablate our design choices. Code will be made available.
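The frozen-teacher design with one-time target caching described in the abstract can be sketched minimally. The snippet below is an illustrative NumPy stand-in, not the paper's implementation: `W_teacher`, the 32x32 view size, and the 64-dimensional embedding are all assumed for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical frozen teacher: a fixed random projection standing in for a
# large pretrained image foundation model (shapes are illustrative only).
W_teacher = rng.standard_normal((3 * 32 * 32, 64))

def cache_target_embeddings(rendered_views):
    """One-time pass: embed every rendered 2D view with the frozen teacher.

    Because the teacher's weights never change, these target embeddings are
    constant for the whole pretraining run, so they can be computed once and
    reused every epoch, amortizing the teacher's cost.
    """
    flat = rendered_views.reshape(len(rendered_views), -1)
    return flat @ W_teacher

views = rng.standard_normal((8, 3, 32, 32))  # 8 rendered 2D views
cache = cache_target_embeddings(views)       # shape: (8, 64), computed once
```

After this pass, the teacher never needs to run again during pretraining, which is the source of the amortized efficiency the abstract refers to.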
Problem

Research questions and friction points this paper is trying to address.

Addresses inefficient 3D representation learning from 2D images
Reduces computational cost and model size for cross-modal training
Enables efficient knowledge transfer between 2D and 3D modalities
Innovation

Methods, ideas, or system contributions that make the work stand out.

Cross-modal joint embedding predictive architecture for 3D learning
Predicts 2D view embeddings from 3D point clouds
Uses frozen teacher with target embedding caching
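The bullets above can be sketched as a single JEPA-style training objective: a point encoder embeds the 3D shape, and a predictor, conditioned on projection information, must match the cached teacher embedding of the corresponding rendered view. All shapes, the linear stand-in modules `W_enc` and `W_pred`, and the azimuth/elevation conditioning below are assumptions for illustration, not the paper's exact design.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy stand-ins for the point encoder and view-conditioned predictor.
W_enc = rng.standard_normal((1024 * 3, 64)) * 0.01
W_pred = rng.standard_normal((64 + 2, 64)) * 0.1

def jepa_loss(points, view_angles, cached_targets):
    """Predict the frozen teacher's 2D-view embedding from a 3D point cloud.

    Concatenating the camera angles lets the predictor, rather than the
    point encoder, absorb view-specific image semantics, purifying the
    supervision signal that reaches the 3D encoder.
    """
    z = points.reshape(len(points), -1) @ W_enc            # 3D embeddings
    pred = np.concatenate([z, view_angles], axis=1) @ W_pred
    return float(np.mean((pred - cached_targets) ** 2))    # match cached targets

points = rng.standard_normal((8, 1024, 3))  # batch of point clouds
angles = rng.standard_normal((8, 2))        # e.g. azimuth/elevation per view
targets = rng.standard_normal((8, 64))      # one-time cached teacher outputs
loss = jepa_loss(points, angles, targets)
```

In training, only the point encoder and predictor would receive gradients; the cached targets stay fixed, so no masking or teacher forward pass is needed per step.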
👥 Authors
Avishka Perera, University of Moratuwa
Kumal Hewagamage, University of Moratuwa
Saeedha Nazar, University of Moratuwa
Kavishka Abeywardana, University of Moratuwa
Hasitha Gallella, University of Moratuwa
Ranga Rodrigo, Department of Electronic and Telecommunication Engineering, University of Moratuwa
Mohamed Afham, Technische Universität Darmstadt