Multi-View Foundation Models

📅 2025-12-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing single-view vision foundation models (e.g., DINO, SAM, CLIP) suffer from inconsistent feature representations for the same 3D point across multiple views, limiting their applicability in multi-view understanding. To address this, we propose a multi-view foundation model framework that operates without explicit 3D reconstruction. Our core innovation is a learnable, intermediate-layer 3D-aware attention mechanism that implicitly encodes cross-view geometric constraints directly in image space, jointly optimized via contrastive learning to enforce feature consistency. Built upon a Transformer architecture, the model processes multiple input views end-to-end, enabling explicit cross-view feature alignment. Evaluated on surface normal estimation and multi-view segmentation, our approach significantly improves feature matching accuracy—achieving consistent quantitative gains over single-view baselines across all standard metrics.

📝 Abstract
Foundation models are vital tools in various Computer Vision applications. They take as input a single RGB image and output a deep feature representation that is useful for many downstream tasks. However, when we have multiple views of the same 3D scene, they operate on each image independently and do not always produce consistent features for the same 3D point. We propose a way to convert a Foundation Model into a Multi-View Foundation Model. Such a model takes as input a set of images and outputs a feature map for each image such that the features of corresponding points are as consistent as possible. This approach bypasses the need to build a consistent 3D model of the features and allows direct manipulation in the image space. Specifically, we show how to augment Transformer-based foundation models (e.g., DINO, SAM, CLIP) with intermediate 3D-aware attention layers that help match features across different views. As leading examples, we show surface normal estimation and multi-view segmentation tasks. Quantitative experiments show that our method improves feature matching considerably compared to current foundation models.
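The core mechanism described in the abstract, an intermediate attention layer in which patch tokens from every view attend to the tokens of all views jointly, can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the function name `cross_view_attention`, the single-head formulation, and the plain numpy projections are all assumptions for clarity (the actual model presumably also injects geometric cues into the attention, which is omitted here).

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_view_attention(feats, Wq, Wk, Wv):
    """Single-head attention across all views at once (illustrative).

    feats: (V, N, D) array of patch features for V views, N tokens each.
    Flattening the views into one token set lets a token in one view
    attend to tokens in every other view, so features of corresponding
    3D points can be pulled toward each other.
    """
    V, N, D = feats.shape
    tokens = feats.reshape(V * N, D)            # merge views into one sequence
    q, k, v = tokens @ Wq, tokens @ Wk, tokens @ Wv
    attn = softmax(q @ k.T / np.sqrt(D))        # (V*N, V*N) cross-view weights
    out = attn @ v
    return out.reshape(V, N, D)                  # back to per-view feature maps
```

In a real model this layer would sit between frozen Transformer blocks of the base foundation model and be trained with a contrastive objective that pulls features of matched points together across views.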
Problem

Research questions and friction points this paper is trying to address.

Enhance feature consistency across multiple 2D views of a 3D scene
Convert single-view foundation models to multi-view without 3D reconstruction
Improve tasks like surface normal estimation and multi-view segmentation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Convert single-view foundation models to multi-view
Add 3D-aware attention layers for cross-view consistency
Bypass 3D reconstruction by operating in image space
Leo Segre — PhD student, Computer Vision
Or Hirschorn — Tel Aviv University
Shai Avidan — Tel Aviv University