Overview of the EReL@MIR 2025 Multimodal Document Retrieval Challenge (Track 1)

📅 2026-06-02

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This report gives an overview of the Multimodal Document Retrieval Challenge, which asks participants to build a single retrieval system for visually rich, text-dense documents. The system must handle both closed-set page retrieval within long documents (MMDocIR) and open-domain retrieval of Wikipedia-style passages (M2KR). The report describes the challenge design, datasets, and evaluation protocol, and analyses the three winning systems, all of which build on decoder-based multimodal LLM embedders from the Qwen2-VL family rather than on CLIP-style encoders. The three systems differ in how they reach the top: fine-tuned ensembles, training-free multi-route fusion with a vision-language re-ranker, or zero-shot late interaction. The training-free system finished within 0.1 point of the fine-tuned winner on the macro-average of mean Recall@{1,3,5}, which suggests that strong multimodal retrieval is achievable without task-specific training. The challenge drew 455 entrants and 586 submissions across 22 teams.

📝 Abstract

Retrieval over visually-rich documents (pages that interleave text with figures, tables, and charts) is essential for multimodal retrieval-augmented generation, yet most retrievers still discard the visual channel. The Multimodal Document Retrieval Challenge, Track 1 of the MIR Challenge at the first EReL@MIR workshop, co-located with The Web Conference 2025, asks participants to build a single retrieval system that handles two complementary regimes: closed-set document page retrieval within long documents from a text query (MMDocIR), and open-domain retrieval of Wikipedia-style passages from an image or image-plus-text query (M2KR). Systems are ranked by the macro-average of mean Recall@{1,3,5} over the two tasks. The challenge drew 455 entrants and 586 submissions across 22 teams. This report describes the challenge design, datasets, and evaluation protocol; reports the final standings; and analyses the three winning teams' systems. All three build on decoder-based Multimodal-LLM embedders from the Qwen2-VL family rather than on CLIP-style encoders, and differ chiefly in whether they reach the top through fine-tuned ensembles, training-free multi-route fusion with a strong vision-language re-ranker, or zero-shot late interaction. The training-free system finished within 0.1 point of the fine-tuned winner.
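
The ranking metric described in the abstract can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the organizers' evaluation script: it assumes Recall@k is the fraction of a query's relevant items found in the top k results, that "mean Recall@{1,3,5}" averages the three cut-offs within a task, and that the macro-average weights the two tasks equally. The function names and the toy query, page, and passage IDs are hypothetical.

```python
from statistics import mean


def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of one query's relevant items that appear in its top-k results."""
    relevant = set(relevant_ids)
    hits = len(relevant.intersection(ranked_ids[:k]))
    return hits / len(relevant)


def task_score(rankings, ground_truth, ks=(1, 3, 5)):
    """Average Recall@k over queries for each k, then average over the cut-offs."""
    per_k = [
        mean(recall_at_k(rankings[q], ground_truth[q], k) for q in ground_truth)
        for k in ks
    ]
    return mean(per_k)


def challenge_score(task_scores):
    """Macro-average: every task counts equally, regardless of its query count."""
    return mean(task_scores)


# Toy data (hypothetical IDs), standing in for the two tasks.
mmdocir_rankings = {
    "q1": ["p3", "p1", "p7", "p2", "p9"],
    "q2": ["p4", "p8", "p5", "p6", "p0"],
}
mmdocir_truth = {"q1": ["p1"], "q2": ["p6"]}

m2kr_rankings = {"q1": ["d2", "d5", "d1"]}
m2kr_truth = {"q1": ["d2"]}

mmdocir = task_score(mmdocir_rankings, mmdocir_truth)  # 0.5
m2kr = task_score(m2kr_rankings, m2kr_truth)           # 1.0
overall = challenge_score([mmdocir, m2kr])             # 0.75
print(mmdocir, m2kr, overall)
```

Because the final step averages per-task scores rather than pooling all queries, a task with fewer queries carries the same weight as a larger one, so a system cannot rank well by excelling on only one of the two regimes.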

Problem

Research questions and friction points this paper is trying to address.

multimodal retrieval

visually-rich documents

document retrieval

image-text query

retrieval-augmented generation

Innovation

Methods, ideas, or system contributions that make the work stand out.

multimodal retrieval

Qwen2-VL

training-free fusion