🤖 AI Summary
3D representation learning suffers from scarce large-scale 3D data, and existing cross-modal methods that compensate with 2D data tend to have large parameter counts, slow training, and deployment difficulty. This paper proposes CrossJEPA, the first non-masked joint embedding predictive architecture for cross-modal (image-to-point-cloud) learning. Its key contributions are: (1) a frozen teacher model coupled with a one-time target-embedding caching mechanism, which purifies the supervision signal and substantially improves training efficiency; and (2) knowledge distillation from an image foundation model combined with a cross-domain-conditioned predictor, enabling lightweight, aligned cross-modal representations. Evaluated on ModelNet40 and ScanObjectNN, CrossJEPA achieves linear-probe accuracies of 94.2% and 88.3%, respectively, using only 14.1M parameters and approximately six hours of training on a single GPU. It attains state-of-the-art performance while offering strong deployment efficiency and scalability.
📝 Abstract
Image-to-point cross-modal learning has emerged to address the scarcity of large-scale 3D datasets in 3D representation learning. However, current methods that leverage 2D data often result in large, slow-to-train models, making them computationally expensive and difficult to deploy in resource-constrained environments. The architecture design of such models is therefore critical, determining their performance, memory footprint, and compute efficiency. The Joint-embedding Predictive Architecture (JEPA) has gained wide popularity in self-supervised learning for its simplicity and efficiency, but has been under-explored in cross-modal settings, partly due to the misconception that masking is intrinsic to JEPA. In this light, we propose CrossJEPA, a simple Cross-modal Joint Embedding Predictive Architecture that harnesses the knowledge of an image foundation model and trains a predictor to infer embeddings of specific rendered 2D views from corresponding 3D point clouds, thereby introducing a JEPA-style pretraining strategy beyond masking. By conditioning the predictor on cross-domain projection information, CrossJEPA purifies the supervision signal from semantics exclusive to the target domain. We further exploit the frozen teacher design with a one-time target embedding caching mechanism, yielding amortized efficiency. CrossJEPA achieves a new state-of-the-art in linear probing on the synthetic ModelNet40 (94.2%) and the real-world ScanObjectNN (88.3%) benchmarks, using only 14.1M pretraining parameters (8.5M in the point encoder), and about 6 pretraining hours on a standard single GPU. These results position CrossJEPA as a performant, memory-efficient, and fast-to-train framework for 3D representation learning via knowledge distillation. We analyze CrossJEPA intuitively, theoretically, and empirically, and extensively ablate our design choices. Code will be made available.
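The pretraining recipe described above (a frozen image teacher whose view embeddings are cached once, and a point-cloud student whose predictor is conditioned on projection information) can be sketched in a toy form. This is a minimal illustration with hypothetical shapes and linear stand-ins for the real networks, not the authors' implementation; the teacher, encoder, and predictor here are plain matrices, and the "projection information" is reduced to a one-hot view id.

```python
import numpy as np

# Toy sketch of CrossJEPA-style pretraining (hypothetical dimensions and
# linear models; the real system uses an image foundation model as teacher
# and a point encoder + predictor as student).
rng = np.random.default_rng(0)
D_PT, D_VIEW, D_EMB, N_VIEWS = 32, 64, 16, 4

# Frozen teacher: fixed weights, never updated during pretraining.
W_teacher = rng.standard_normal((D_VIEW, D_EMB)) * 0.1
def teacher_embed(view_pixels):
    return view_pixels @ W_teacher

# One-time target caching: the teacher runs once per rendered 2D view;
# its embeddings are stored and reused every epoch (amortized efficiency).
views = rng.standard_normal((N_VIEWS, D_VIEW))   # fake rendered views of one shape
target_cache = teacher_embed(views)              # teacher is never called again

# Student: point encoder plus a predictor conditioned on projection info
# (here simplified to a one-hot view id appended to the point embedding).
points = rng.standard_normal(D_PT)               # fake point-cloud feature
W_enc = rng.standard_normal((D_PT, D_EMB)) * 0.1
W_pred = rng.standard_normal((D_EMB + N_VIEWS, D_EMB)) * 0.1

def mse(view_id):
    z = points @ W_enc                           # point embedding
    x = np.concatenate([z, np.eye(N_VIEWS)[view_id]])
    err = x @ W_pred - target_cache[view_id]     # regress the cached target
    return float(err @ err) / D_EMB

loss_before = mse(0)

# Analytic gradient descent on the predictor for view 0:
# d/dW ||xW - t||^2 / D = (2/D) * outer(x, xW - t)
lr = 0.05
for _ in range(200):
    z = points @ W_enc
    x = np.concatenate([z, np.eye(N_VIEWS)[0]])
    err = x @ W_pred - target_cache[0]
    W_pred -= lr * (2.0 / D_EMB) * np.outer(x, err)

loss_after = mse(0)                              # should be well below loss_before
```

The caching step is where the claimed training-time savings come from: because the teacher is frozen, its forward passes can be precomputed once rather than repeated every epoch, and only the lightweight point encoder and predictor are updated.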