Geometry-Preserving Unsupervised Alignment for Heterogeneous Foundation Models

📅 2026-06-02
📈 Citations: 0
Influential: 0

🤖 AI Summary
This work addresses the semantic and geometric incompatibility between vision-language models (VLMs) and vision-only foundation models (VFMs) by proposing the GPUA framework, described as the first approach to adapt cross-lingual alignment principles to the fusion of heterogeneous vision foundation models. GPUA treats VFM features as a "visual language" and learns an unsupervised orthogonal mapping that carries them into the semantic space of a VLM. Neither model's parameters are updated, and the intrinsic geometric structure of the VFM features is preserved. The method is label-free, task-agnostic, and adds minimal computational overhead. Experiments across multiple benchmarks show improved cross-model compatibility and performance gains in zero-shot recognition and segmentation.
📝 Abstract
Foundation models have driven rapid progress in computer vision, yet the two dominant paradigms, vision-language foundation models (VLMs) and vision-only foundation models (VFMs), remain only partially compatible. VLMs offer language-grounded semantic alignment but are often visually coarse, while VFMs learn discriminative perceptual geometry but lack semantic grounding. We propose GPUA (Geometry-Preserving Unsupervised Alignment), a framework that integrates the complementary strengths of VFMs and VLMs. Inspired by cross-lingual alignment, GPUA treats VFM features as a visual language and learns an orthogonal mapping that translates the VFM space into the VLM semantic space, preserving geometry and narrowing the modality gap without labels or model parameter updates. GPUA is task-agnostic and requires only feature-level access to pretrained models. Experiments across diverse benchmarks demonstrate improved cross-model compatibility and strong gains in downstream zero-shot recognition and segmentation with negligible overhead. Code is available at https://github.com/Yuteam14/GPUA
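To illustrate why an orthogonal mapping preserves geometry, here is a minimal sketch using the classical orthogonal Procrustes solution. It is not the authors' procedure: it assumes, purely for illustration, that the two feature sets are already paired row by row and share the same dimensionality, and it uses synthetic data. The names `fit_orthogonal_map`, `vfm_feats`, and `vlm_feats` are hypothetical.

```python
import numpy as np


def fit_orthogonal_map(source, target):
    """Orthogonal Procrustes: argmin_W ||source @ W - target||_F with W^T W = I.

    Illustrative stand-in only; assumes row-paired features of equal dimension.
    """
    u, _, vt = np.linalg.svd(source.T @ target)
    return u @ vt


# Synthetic stand-ins for features from two frozen models (assumption).
rng = np.random.default_rng(0)
n, d = 512, 16
vfm_feats = rng.standard_normal((n, d))

# Build a "VLM space" as a hidden rotation of the VFM space plus small noise.
true_rot, _ = np.linalg.qr(rng.standard_normal((d, d)))
vlm_feats = vfm_feats @ true_rot + 0.01 * rng.standard_normal((n, d))

W = fit_orthogonal_map(vfm_feats, vlm_feats)
mapped = vfm_feats @ W

# Orthogonality means inner products (and hence distances and angles)
# between samples are unchanged by the mapping.
gram_before = vfm_feats @ vfm_feats.T
gram_after = mapped @ mapped.T
print("W is orthogonal:", np.allclose(W.T @ W, np.eye(d)))
print("Gram matrix preserved:", np.allclose(gram_before, gram_after))
```

Because `W` is constrained to be orthogonal, the mapped features keep every pairwise inner product of the original VFM features, which is the sense in which such a mapping is geometry-preserving. Only the feature matrices are touched, consistent with the abstract's requirement of feature-level access to frozen models.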
Problem

Research questions and friction points this paper is trying to address.

foundation models
heterogeneous alignment
geometry preservation
unsupervised learning
cross-model compatibility
Innovation

Methods, ideas, or system contributions that make the work stand out.

Geometry-Preserving
Unsupervised Alignment
Foundation Models
Cross-Modal Integration
Zero-Shot Transfer