🤖 AI Summary
Existing multi-vector visual document retrieval models incur substantial computational and storage overhead and lack a unified mechanism for trading accuracy against efficiency along both vector dimensionality and encoder depth. This work proposes a two-dimensional multimodal Matryoshka training framework that, for the first time, enables budget elasticity along both vector width and network depth, so a single model can be configured on demand for different inference resource constraints without retraining for each setting. By combining multi-vector representations from vision-language models, two-dimensional Matryoshka nested learning, and a ColPali-style retrieval architecture, the method significantly outperforms naive truncation baselines across diverse backbone networks, achieving substantial reductions in storage and computational costs while maintaining or even improving retrieval performance.
📝 Abstract
Multi-vector visual document retrievers achieve strong fine-grained matching by representing each page with multiple vectors from deep Vision-Language Models (VLMs), but this design makes deployment expensive in both storage and computation. Existing efficiency techniques usually optimize only part of this budget, leaving multimodal retrievers without a unified way to trade accuracy against both vector width and encoder depth. We therefore propose MM-Matryoshka, a 2D Matryoshka training framework for budget-elastic Visual Document Retrieval (VDR) that makes ColPali-style multi-vector retrieval elastic along both the embedding dimension and the encoder layer. At inference time, a single retriever can select a budget along both axes without training separate models for different budgets. Comprehensive experiments across multiple representative backbones show that MM-Matryoshka retains significantly higher quality than direct truncation baselines while substantially reducing storage and computational overhead, offering robust budget elasticity for efficient VDR.
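To make the idea of a 2D budget concrete, the following is a minimal sketch of how a (layer, dimension) budget could be applied at inference time in a ColPali-style late-interaction retriever. It is not the authors' implementation: the function names, the dictionary layout keyed by layer index, the re-normalization after truncation, and the 2-bytes-per-value storage estimate are all assumptions made for illustration. The scoring rule is the standard MaxSim late-interaction sum (for each query vector, take the best-matching page vector, then add the maxima).

```python
import numpy as np


def truncate_and_normalize(vectors: np.ndarray, dim: int) -> np.ndarray:
    """Keep the first `dim` coordinates of each vector, then L2-normalize.

    Assumption: Matryoshka-style training makes prefixes of the embedding usable.
    """
    sliced = vectors[:, :dim]
    norms = np.linalg.norm(sliced, axis=1, keepdims=True)
    return sliced / np.clip(norms, 1e-12, None)


def maxsim_score(query_vecs: np.ndarray, page_vecs: np.ndarray) -> float:
    """Late-interaction score: sum over query vectors of the max similarity."""
    sim = query_vecs @ page_vecs.T  # shape: (num_query_vecs, num_page_vecs)
    return float(sim.max(axis=1).sum())


def score_at_budget(query_by_layer, page_by_layer, layer: int, dim: int) -> float:
    """Score a query against a page at a chosen (layer, dim) budget.

    `query_by_layer` and `page_by_layer` are hypothetical dicts that map an
    encoder layer index to that layer's multi-vector output.
    """
    q = truncate_and_normalize(query_by_layer[layer], dim)
    p = truncate_and_normalize(page_by_layer[layer], dim)
    return maxsim_score(q, p)


def storage_bytes(num_vectors: int, dim: int, bytes_per_value: int = 2) -> int:
    """Index size of one page, assuming 2 bytes per value (e.g. float16)."""
    return num_vectors * dim * bytes_per_value


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Hypothetical outputs at two exit layers: 1024 page vectors, 16 query vectors.
    page = {12: rng.normal(size=(1024, 128)), 24: rng.normal(size=(1024, 128))}
    query = {12: rng.normal(size=(16, 128)), 24: rng.normal(size=(16, 128))}
    for layer, dim in [(24, 128), (24, 32), (12, 32)]:
        score = score_at_budget(query, page, layer, dim)
        print(layer, dim, round(score, 3), storage_bytes(1024, dim))
```

In this sketch, narrowing the dimension shrinks the per-page index linearly, while choosing an earlier layer reduces encoder compute. The random vectors only exercise the code path; they say nothing about retrieval quality at any budget.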