Argus-Retriever: Vision-LLM Late-Interaction Retrieval with Region-Aware Query-Conditioned MoE for Visual Document Retrieval

📅 2026-06-02

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing visual document retrieval models encode each page once, independently of the query, so the same static representation must serve queries that target tables, charts, or layout structure. The authors propose Argus, a query-conditioned late-interaction retriever built on Qwen3.5-VL. Argus adds a query-aware, region-level mixture-of-experts (MoE) routing mechanism that produces query-dependent document representations $\mathbf{D}(q)$. It combines spatial region pooling, a compact query context vector, and MaxSim scoring, so its output stays compatible with ColPali-style multi-vector indexing. The Argus-9B variant reaches 92.67 NDCG@5 on ViDoRe V1 and 86.0 NDCG@5 on the combined V1+V2 leaderboard, the highest reported value for an open late-interaction model on the combined leaderboard. Wrapped in a Qwen3.6-27B agentic retrieval pipeline, it improves NDCG@10 on the public ViDoRe V3 tasks from 60.28 to 64.80.
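For readers unfamiliar with MaxSim, the following is a minimal sketch of late-interaction scoring in plain Python: each query token vector is matched to its most similar document vector, and the per-token maxima are summed. The function names `dot` and `maxsim` and the toy 2-dimensional vectors are illustrative assumptions, not code or data from the paper.

```python
from typing import Dict, List, Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def maxsim(query_vecs: List[Vector], doc_vecs: List[Vector]) -> float:
    """Late-interaction score: for every query vector take the best-matching
    document vector, then sum those maxima over the query."""
    if not query_vecs or not doc_vecs:
        raise ValueError("query and document must each have at least one vector")
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)


# Toy example (hypothetical embeddings): two query tokens, two candidate pages.
query = [[1.0, 0.0], [0.0, 1.0]]
page_a = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
page_b = [[0.5, 0.5], [0.6, 0.0]]

scores: Dict[str, float] = {
    "page_a": maxsim(query, page_a),
    "page_b": maxsim(query, page_b),
}
# Rank pages from highest to lowest MaxSim score.
ranking = sorted(scores, key=scores.get, reverse=True)
```

Because the score depends only on a bag of document vectors, any model that emits one such bag per page can reuse the same multi-vector index and scoring code.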

📝 Abstract

Late-interaction vision-language retrievers represent each document page as many visual token embeddings and score queries with MaxSim. In systems such as ColPali, ColQwen, ColNomic, and Nemotron ColEmbed, the document embeddings are produced without seeing the query, so the same page is represented identically for a table lookup, a chart question, and a layout-sensitive evidence request. We introduce Argus, a family of query-conditioned late-interaction retrievers built on Qwen3.5-VL. Argus adds a region-aware Mixture-of-Experts module: the query encoder produces both retrieval embeddings and a compact context vector, the document page is pooled into spatial regions, and a query-aware router selects latent experts per region before MaxSim. The output remains a multi-vector index compatible with ColPali-style retrieval, but the document representation is now dependent on the query (i.e., $\mathbf{D}(q)$). All Argus models use a 1024-dimensional retrieval head, compared with the 2560-dimensional and 4096-dimensional heads of recent state-of-the-art systems, and are trained on roughly 9% of the available public supervision rather than the full pool. The 9B model reaches 92.67 NDCG@5 on ViDoRe V1 and 86.0 NDCG@5 on the combined V1+V2 leaderboard, the highest reported value for an open late-interaction model on the combined leaderboard. Wrapped in a Qwen3.6-27B agentic retrieval pipeline on ViDoRe V3, Argus-9B further improves its NDCG@10 from 60.28 to 64.80 over public tasks, showing that the same retriever serves both as a strong standalone system and as a search primitive for iterative LLM agents.
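The region-aware routing step described in the abstract can be illustrated with a small, purely hypothetical sketch. It assumes mean pooling of patch embeddings within each region, a linear router over the concatenated region and query-context vectors, softmax gating with top-k selection, and linear experts. None of these choices, names (`mean_pool`, `route_region`, `condition_page`), or toy weights are confirmed by the source; they only show how the same page can yield different vectors $\mathbf{D}(q)$ for different queries.

```python
import math
from typing import List, Tuple

Vector = List[float]
Matrix = List[List[float]]


def mean_pool(vectors: List[Vector]) -> Vector:
    """Average a region's patch embeddings into one vector (assumed pooling)."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def softmax(logits: Vector) -> Vector:
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix: Matrix, vec: Vector) -> Vector:
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def route_region(region: Vector, context: Vector, router: Matrix,
                 experts: List[Matrix], top_k: int = 1) -> Tuple[Vector, List[int]]:
    """Pick top-k experts from [region; context] and mix their outputs."""
    probs = softmax(matvec(router, region + context))
    chosen = sorted(range(len(probs)), key=probs.__getitem__, reverse=True)[:top_k]
    norm = sum(probs[i] for i in chosen)
    out = [0.0] * len(region)
    for i in chosen:
        expert_out = matvec(experts[i], region)
        out = [o + (probs[i] / norm) * e for o, e in zip(out, expert_out)]
    return out, chosen


def condition_page(regions: List[List[Vector]], context: Vector, router: Matrix,
                   experts: List[Matrix], top_k: int = 1):
    """Return query-conditioned region vectors D(q) and the experts used."""
    routed = [route_region(mean_pool(r), context, router, experts, top_k)
              for r in regions]
    return [vec for vec, _ in routed], [ids for _, ids in routed]


# Toy setup (all values hypothetical): 2 regions, 2 experts, 2-d embeddings.
regions = [[[1.0, 0.0], [0.8, 0.2]], [[0.0, 1.0], [0.2, 0.8]]]
router = [[1.0, 0.0, 5.0, 0.0],   # expert 0 logit: region[0] + 5 * context[0]
          [0.0, 1.0, 0.0, 5.0]]   # expert 1 logit: region[1] + 5 * context[1]
experts = [[[1.0, 0.0], [0.0, 1.0]],   # expert 0: identity map
           [[2.0, 0.0], [0.0, 2.0]]]   # expert 1: scale by 2

page_for_table, experts_for_table = condition_page(regions, [1.0, 0.0], router, experts)
page_for_chart, experts_for_chart = condition_page(regions, [0.0, 1.0], router, experts)
_, experts_neutral = condition_page(regions, [0.0, 0.0], router, experts)
```

In this toy setup the two query contexts route every region to different experts, so the page vectors passed to MaxSim differ per query, while a zero context lets the region content alone decide the expert.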

Problem

Research questions and friction points this paper is trying to address.

late-interaction retrieval

query-conditioned representation

visual document retrieval

region-aware modeling

vision-language models

Innovation

Methods, ideas, or system contributions that make the work stand out.

query-conditioned retrieval

region-aware MoE

late-interaction retriever