Zero-shot Multimodal Document Retrieval via Cross-modal Question Generation

πŸ“… 2025-08-23
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
To address poor zero-shot cross-domain and cross-lingual retrieval over private multimodal documents, this paper proposes PREMIR, a framework that leverages a multimodal large language model (MLLM) to generate cross-modal pre-questions (preQs) before retrieval. By matching queries against preQs drawn from multiple complementary modalities, PREMIR expands matching from a single embedding space to the token level, overcoming the limitations of conventional single-vector-space retrieval. Crucially, it requires no target-domain annotations and supports closed-domain deployment and multilingual scenarios. PREMIR achieves state-of-the-art results on out-of-distribution benchmarks, significantly outperforming strong baselines, and ablation studies and qualitative analyses confirm its cross-domain generalization, robustness, and practical applicability.

πŸ“ Abstract
Rapid advances in Multimodal Large Language Models (MLLMs) have expanded information retrieval beyond purely textual inputs, enabling retrieval from complex real-world documents that combine text and visuals. However, most documents are private, either owned by individuals or confined within corporate silos, and current retrievers struggle when faced with unseen domains or languages. To address this gap, we introduce PREMIR, a simple yet effective framework that leverages the broad knowledge of an MLLM to generate cross-modal pre-questions (preQs) before retrieval. Unlike earlier multimodal retrievers that compare embeddings in a single vector space, PREMIR leverages preQs from multiple complementary modalities to expand the scope of matching to the token level. Experiments show that PREMIR achieves state-of-the-art performance on out-of-distribution benchmarks, including closed-domain and multilingual settings, outperforming strong baselines across all retrieval metrics. We confirm the contribution of each component through in-depth ablation studies, and qualitative analyses of the generated preQs further highlight the model's robustness in real-world settings.
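To make the preQ idea concrete, below is a minimal sketch of how such a retrieve-by-pre-question pipeline could be wired up. It is an illustration under stated assumptions, not the authors' implementation: `mllm_generate` is a hypothetical stand-in for a multimodal LLM call, and token-level matching is approximated here with a simple Jaccard overlap rather than the paper's actual scoring.

```python
# Minimal sketch of a PREMIR-style pipeline (illustrative only).
# `mllm_generate` is a hypothetical stand-in for a multimodal LLM call;
# the token matcher is a simple Jaccard overlap, not the paper's scorer.
from dataclasses import dataclass, field


@dataclass
class Document:
    doc_id: str
    text: str
    image_path: str | None = None
    pre_questions: list[str] = field(default_factory=list)


def mllm_generate(content: str, image_path: str | None = None) -> list[str]:
    """Hypothetical MLLM call: return questions this content could answer."""
    # A real system would prompt a multimodal LLM with the text and/or image.
    return [f"placeholder question about: {content[:40]}"]


def index_document(doc: Document) -> Document:
    # Offline co-expansion: generate cross-modal pre-questions (preQs)
    # from the text alone and, if present, from the text plus the visual.
    doc.pre_questions += mllm_generate(doc.text)
    if doc.image_path:
        doc.pre_questions += mllm_generate(doc.text, image_path=doc.image_path)
    return doc


def token_overlap(a: str, b: str) -> float:
    # Token-level match: Jaccard overlap between whitespace token sets.
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / max(1, len(ta | tb))


def retrieve(query: str, corpus: list[Document], k: int = 5) -> list[Document]:
    # Score each document by its best-matching preQ, so the query is
    # compared against many small questions instead of one document vector.
    scored = [
        (max((token_overlap(query, q) for q in d.pre_questions), default=0.0), d)
        for d in corpus
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [d for _, d in scored[:k]]


if __name__ == "__main__":
    corpus = [index_document(Document("d1", "quarterly revenue report with sales charts", "report.png"))]
    print([d.doc_id for d in retrieve("what was the quarterly revenue", corpus, k=1)])
```

The design point this captures is the expansion step: each document is expanded offline into many questions it could answer, drawn from both its text and its visuals, so a query only needs to match one of those questions at the token level rather than the whole document in a single embedding space.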
Problem

Research questions and friction points this paper is trying to address.

Retrieving multimodal documents from unseen domains
Overcoming limitations of single-vector-space embedding comparison
Handling private documents across different languages and domains
Innovation

Methods, ideas, or system contributions that make the work stand out.

Leverages an MLLM to generate cross-modal pre-questions (preQs) before retrieval
Expands the scope of matching to the token level using preQs from multiple complementary modalities
Achieves state-of-the-art performance on out-of-distribution benchmarks
Yejin Choi
Stanford University / NVIDIA
Natural Language Processing Β· Deep Learning Β· Artificial Intelligence Β· Commonsense Reasoning
Jaewoo Park
Yonsei University
Janghan Yoon
Yonsei University
Saejin Kim
Yonsei University
Artificial Intelligence
Jaehyun Jeon
Yonsei University
Youngjae Yu
Seoul National University