CLaMR: Contextualized Late-Interaction for Multimodal Content Retrieval

📅 2025-06-06
📈 Citations: 0
Influential: 0
🤖 AI Summary
This paper addresses the noise and degraded retrieval quality that arise when the modalities of online videos are modeled independently. To tackle this, the authors propose a dynamic modality selection mechanism that adaptively fuses the modalities most relevant to a query: video frames, transcribed speech, on-screen (OCR) text, and metadata. Key contributions are: (1) a novel modality-aware contrastive loss that explicitly models inter-modal relevance; (2) MultiVENT 2.0++, a large-scale synthetic dataset with modality-targeted queries to support dynamic-selection training; and (3) a unified backbone with a late-interaction architecture enabling learnable modality-weighted fusion. Experiments show improvements of +25.6 and +35.4 nDCG@10 over the best single- and multi-modal baselines on MultiVENT 2.0++, and accuracy gains of +3.50% and +1.42% on the long-video QA benchmarks Video-MME and LongVideoBench, respectively.
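The paper does not spell out its scoring function here, but late-interaction retrieval is conventionally ColBERT-style MaxSim: each query token takes its maximum similarity over all document tokens, and these maxima are summed. A minimal numpy sketch, assuming jointly indexing the four modalities amounts to concatenating their token embeddings so each query token can match whichever modality is most relevant (the modality names and dimensions below are illustrative, not from the paper):

```python
import numpy as np

def maxsim_score(query_tokens: np.ndarray, doc_tokens: np.ndarray) -> float:
    """Late-interaction (MaxSim) score: for each query token, take its max
    cosine similarity over all document tokens, then sum over query tokens."""
    # L2-normalize so dot products are cosine similarities
    q = query_tokens / np.linalg.norm(query_tokens, axis=1, keepdims=True)
    d = doc_tokens / np.linalg.norm(doc_tokens, axis=1, keepdims=True)
    sim = q @ d.T  # shape: (num_query_tokens, num_doc_tokens)
    return float(sim.max(axis=1).sum())

# Hypothetical per-modality token embeddings for one video (embedding dim 4)
rng = np.random.default_rng(0)
modalities = {
    "frames":   rng.normal(size=(5, 4)),
    "speech":   rng.normal(size=(8, 4)),
    "ocr_text": rng.normal(size=(3, 4)),
    "metadata": rng.normal(size=(2, 4)),
}
# Joint indexing sketch: concatenate all modality tokens into one document,
# so MaxSim can route each query token to the best-matching modality.
doc = np.concatenate(list(modalities.values()), axis=0)
query = rng.normal(size=(4, 4))
print(maxsim_score(query, doc))
```

One property of this design: since the max is taken over the union of all modality tokens, the joint score can never be worse than scoring against any single modality alone, which is the intuition behind indexing modalities jointly rather than as independent sources.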

📝 Abstract
Online video web content is richly multimodal: a single video blends vision, speech, ambient audio, and on-screen text. Retrieval systems typically treat these modalities as independent retrieval sources, which can lead to noisy and subpar retrieval. We explore multimodal video content retrieval, where relevance can be scored from one particular modality or jointly across multiple modalities simultaneously. Consequently, an effective retriever must dynamically choose which modality (or set of modalities) best addresses the query. We introduce CLaMR, a multimodal, late-interaction retriever that jointly indexes 4 modalities: video frames, transcribed speech, on-screen text, and metadata. CLaMR jointly encodes all modalities with a unified multimodal backbone for improved contextualization and is trained to enhance dynamic modality selection via two key innovations. First, given the lack of training data for multimodal retrieval, we introduce MultiVENT 2.0++, a large-scale synthetic training dataset built on MultiVENT 2.0 (event-centric videos in various languages paired with queries) with modality-targeted queries. Next, we propose a modality-aware loss that jointly trains according to a standard contrastive objective alongside an objective for learning correct modality usage. On the test sets of MultiVENT 2.0++ and MSRVTT, conventional aggregation strategies, such as averaging similarities for baseline retrievers, degrade performance by introducing noise from irrelevant modalities. In contrast, CLaMR consistently outperforms existing retrievers: on MultiVENT 2.0++, CLaMR improves nDCG@10 by 25.6 over the best single-modality retriever and by 35.4 over the best multi-modality retriever. We illustrate CLaMR's downstream utility on long-video QA, retrieving relevant frames and obtaining a 3.50% boost over LanguageBind on Video-MME and 1.42% over dense sampling on LongVideoBench.
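The abstract describes training with a standard contrastive objective alongside an objective for correct modality usage, but does not give the exact formulation. A minimal numpy sketch under that reading: an InfoNCE-style contrastive term over candidate videos, plus a cross-entropy term supervising which modality a modality-targeted query should rely on. The weighting `alpha` and the way modality usage is represented (a logit per modality) are assumptions for illustration:

```python
import numpy as np

def softmax(x: np.ndarray) -> np.ndarray:
    z = x - x.max()
    e = np.exp(z)
    return e / e.sum()

def modality_aware_loss(scores: np.ndarray, pos_idx: int,
                        modality_logits: np.ndarray, target_modality: int,
                        alpha: float = 0.5) -> float:
    """Sketch of a modality-aware objective (alpha is an assumed hyperparameter):
    - contrastive term: InfoNCE over the positive video vs. all candidates
    - modality-usage term: cross-entropy pushing the model's modality
      weights toward the modality the synthetic query targets."""
    contrastive = -np.log(softmax(scores)[pos_idx])
    modality_usage = -np.log(softmax(modality_logits)[target_modality])
    return float(contrastive + alpha * modality_usage)

# Illustrative numbers: 3 candidate videos, 4 modalities, query targets OCR text
scores = np.array([2.0, 0.5, 0.1])          # positive video at index 0
modality_logits = np.array([0.1, 0.2, 1.5, 0.0])  # frames, speech, ocr, meta
print(modality_aware_loss(scores, 0, modality_logits, target_modality=2))
```

The second term is what distinguishes this from a plain contrastive setup: it gives the model an explicit training signal for dynamic modality selection rather than leaving modality routing to emerge implicitly.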
Problem

Research questions and friction points this paper is trying to address.

Improving multimodal video content retrieval accuracy
Dynamic selection of relevant modalities for queries
Addressing noise from irrelevant modalities in retrieval
Innovation

Methods, ideas, or system contributions that make the work stand out.

Jointly indexes video frames, transcribed speech, on-screen text, and metadata
Uses unified multimodal backbone for contextualization
Introduces modality-aware loss for dynamic selection