Conan-embedding-v3: Fusing Modality-Specific Models for Omni-Modal Embedding

📅 2026-06-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge of building a unified embedding space for omni-modal retrieval across text, images, video, documents, and audio, where modalities differ substantially in data distribution, model architecture, and optimization dynamics. The authors propose a decouple-fuse-recover framework: modality-specific specialist models are first trained independently, and their task vectors are then fused into a single dense backbone. Fusion leaves the audio projector calibrated to the audio-specialist backbone, which causes a large audio retrieval regression (Projector Drift). To repair it, the method fine-tunes all projector parameters while keeping the backbone frozen, then applies balanced multi-modal rehearsal. The resulting model scores 74.9 on MMEB and 55.61 on the 30-task MAEB audio suite.
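As a rough illustration of what fusing task vectors into one backbone can look like, the sketch below uses plain weighted task arithmetic: each specialist's task vector is its weights minus the shared base weights, and the fused backbone is the base plus a weighted sum of those vectors. The summary does not state the paper's exact fusion rule, so the function names (`task_vector`, `fuse`), the coefficients, and the toy weights are all assumptions made for illustration.

```python
from typing import Dict, List

# A "model" here is a hypothetical mapping from parameter name to a flat weight list.
Weights = Dict[str, List[float]]


def task_vector(specialist: Weights, base: Weights) -> Weights:
    """Per-parameter difference between a specialist and the shared base."""
    return {
        name: [s - b for s, b in zip(specialist[name], base[name])]
        for name in base
    }


def fuse(base: Weights, specialists: List[Weights], coeffs: List[float]) -> Weights:
    """Add a weighted sum of task vectors to the base weights.

    Assumption: simple task arithmetic; the paper may use a different rule.
    """
    fused = {name: list(values) for name, values in base.items()}
    for specialist, coeff in zip(specialists, coeffs):
        delta = task_vector(specialist, base)
        for name in fused:
            fused[name] = [f + coeff * d for f, d in zip(fused[name], delta[name])]
    return fused


# Toy weights, invented for this example only.
base = {"layer.weight": [0.0, 1.0, 2.0]}
image_specialist = {"layer.weight": [1.0, 1.0, 2.0]}
video_specialist = {"layer.weight": [0.0, 3.0, 2.0]}

fused = fuse(base, [image_specialist, video_specialist], [0.5, 0.5])
print(fused)
```

In this toy case each specialist changes a different coordinate, so the fused weights pick up half of each change without interference. Real specialists overlap, which is one reason the choice of coefficients matters.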
📝 Abstract
Omni-modal retrieval promises a single embedding space for text, image, video, document, and audio inputs, but building such a unified retriever is difficult since these modalities differ in data distribution, architecture, and optimization dynamics. In this work, we present Conan-embedding-v3, a decouple-fuse-recover framework for omni-modal retrieval. Conan-embedding-v3 first trains modality specialists independently and fuses their task vectors into a single dense backbone, a strategy we call Decoupled Specialist Fusion. We show that this fusion composes visual, video, and document retrieval capabilities, but also exposes a failure mode for projector-based modalities: when audio is attached through an external encoder and projector, fusing the backbone leaves the projector calibrated to the audio-specialist backbone, causing a large audio retrieval regression despite copying all audio-specific modules unchanged. We call this failure Projector Drift. To repair it, Conan-embedding-v3 applies Projector Recovery (i.e., full-parameter fine-tuning of the projector while keeping the backbone frozen) followed by balanced multi-modal rehearsal. The resulting model supports these retrieval pathways in one backbone, achieving a score of 74.9 on MMEB and 55.61 on the 30-task MAEB audio suite.
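The Projector Drift mechanism can be made concrete with a deliberately tiny scalar model. In the sketch below, the backbone and the projector are each a single multiplier, the projector starts out calibrated to the audio-specialist backbone, and swapping in a different (fused) backbone weight makes the unchanged projector produce wrong embeddings. Recovery then updates only the projector by gradient descent while the backbone stays frozen. Everything here is an assumption made for illustration: the scalar model, the squared-error objective (the paper's retrieval loss is not given in the abstract), the numeric values, and the names `embed`, `mse`, and `recover_projector`. The rehearsal stage is not modeled.

```python
def embed(backbone_w: float, projector_w: float, audio_feature: float) -> float:
    """Toy pipeline: the projector maps the audio feature, then the backbone embeds it."""
    return backbone_w * (projector_w * audio_feature)


def mse(backbone_w: float, projector_w: float, data) -> float:
    """Mean squared error between produced and target embeddings."""
    return sum((embed(backbone_w, projector_w, x) - y) ** 2 for x, y in data) / len(data)


def recover_projector(backbone_w, projector_w, data, lr=0.05, steps=200):
    """Gradient descent on the projector only; backbone_w is never updated (frozen)."""
    for _ in range(steps):
        grad = sum(
            2 * (embed(backbone_w, projector_w, x) - y) * backbone_w * x
            for x, y in data
        ) / len(data)
        projector_w -= lr * grad
    return projector_w


# Hypothetical values: the projector is calibrated to the audio-specialist backbone.
audio_backbone_w = 2.0
projector_w = 1.5
features = (0.5, 1.0, 1.5)
data = [(x, embed(audio_backbone_w, projector_w, x)) for x in features]

# After fusion the backbone weight differs, but the projector was copied unchanged.
fused_backbone_w = 1.2

loss_before_fusion = mse(audio_backbone_w, projector_w, data)
loss_after_fusion = mse(fused_backbone_w, projector_w, data)   # drift: error appears
recovered_w = recover_projector(fused_backbone_w, projector_w, data)
loss_after_recovery = mse(fused_backbone_w, recovered_w, data)

print(loss_before_fusion, loss_after_fusion, loss_after_recovery)
```

The toy makes one point only: copying the projector unchanged does not preserve behavior, because the projector's output is consumed by a backbone that is no longer the one it was fitted to.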
Problem

Research questions and friction points this paper is trying to address.

omni-modal retrieval
multi-modal embedding
modality fusion
projector drift
unified embedding space
Innovation

Methods, ideas, or system contributions that make the work stand out.

Decoupled Specialist Fusion
Projector Drift
Projector Recovery
Omni-Modal Embedding
Multi-Modal Rehearsal