Closing the Indexing-Decoding Gap in Multimodal Generative Retrieval via Prefix Retention Optimization

📅 2026-06-08
🤖 AI Summary
This work addresses a mismatch between indexing and decoding in multimodal generative retrieval: identifiers are learned without regard to how beam search ranks their prefixes, so the prefix of the target identifier can be pruned in the early decoding steps and never recovered. The study formally characterizes this "indexing-decoding gap", derives a survival bound for prefix retention, and introduces PRO (prefix retention optimization), a unified framework built on the standard pipeline of residual quantization and trie-constrained beam search. PRO improves prefix discriminability in both indexing and decoding through three mechanisms: prefix ranking distillation, vocabulary scheduling that grows codebook sizes from shallow to deep quantization levels, and geometric score fusion. Evaluated on nine multimodal retrieval tasks, PRO increases retention of target identifier prefixes and outperforms existing multimodal generative retrieval baselines.
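To make the pruning failure concrete, here is a minimal toy sketch of trie-constrained beam search in plain Python. It is not the paper's implementation. The identifier set, the per-level log-probabilities, and the function names (`build_trie`, `trie_beam_search`, `target_survives`) are all assumptions made for illustration, and the scores are context-independent, whereas a real decoder conditions on the prefix.

```python
import math


def build_trie(identifiers):
    """Build a nested-dict trie over the valid identifier code sequences."""
    root = {}
    for ident in identifiers:
        node = root
        for code in ident:
            node = node.setdefault(code, {})
    return root


def trie_beam_search(trie, step_logprobs, beam_width):
    """Beam search that only expands prefixes present in the trie.

    step_logprobs[t][code] is a toy log-probability for emitting `code`
    at level t. Returns the list of surviving prefixes after each level.
    """
    beam = [((), 0.0, trie)]
    history = []
    for logprobs in step_logprobs:
        candidates = []
        for prefix, score, node in beam:
            for code, child in node.items():  # trie constraint
                candidates.append((prefix + (code,), score + logprobs[code], child))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beam = candidates[:beam_width]  # pruned prefixes never come back
        history.append([prefix for prefix, _, _ in beam])
    return history


def target_survives(history, target):
    """True if every prefix of `target` is still in the beam at its level."""
    return all(target[: t + 1] in level for t, level in enumerate(history))


# Hypothetical two-level identifiers and scores (illustrative only).
identifiers = [(0, 0), (0, 1), (1, 0), (2, 1)]
target = (2, 1)
step_logprobs = [
    {0: math.log(0.5), 1: math.log(0.3), 2: math.log(0.2)},
    {0: math.log(0.1), 1: math.log(0.9)},
]

trie = build_trie(identifiers)
narrow = trie_beam_search(trie, step_logprobs, beam_width=2)
wide = trie_beam_search(trie, step_logprobs, beam_width=3)
print(target_survives(narrow, target), target_survives(wide, target))
```

In this toy setup the target `(2, 1)` has the second-best full-sequence score, yet a beam of width 2 drops it at the first level because its one-code prefix ranks third. Widening the beam to 3 keeps it.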
πŸ“ Abstract
Multimodal generative retrieval formulates multimodal retrieval as discrete identifier generation, eliminating the need for explicit similarity search over external embeddings. Existing approaches construct identifiers via residual quantization and decode them with trie-constrained beam search. This combination introduces an indexing-decoding gap: identifier learning objectives, including reconstruction and contrastive losses, do not explicitly enforce prefix discriminability during decoding. As a result, even well-optimized identifiers can be irreversibly pruned early in beam search due to low-rank prefixes. We theoretically characterize this gap and derive a survival bound that relates prefix retention to three controllable factors in indexing and decoding. Building on this bound, we propose PRO, prefix retention optimization, a unified framework comprising three mechanisms: (i) prefix ranking distillation aligns quantized prefix rankings with those induced by pre-quantization embeddings using a listwise loss; (ii) vocabulary scheduling increases codebook sizes from shallow to deep residual quantization levels to reduce early competition from non-target prefixes; and (iii) geometric score fusion vectorizes each candidate prefix and incorporates its similarity to the query into beam search scoring, further reducing the indexing-decoding mismatch. Experiments on nine multimodal retrieval tasks show that PRO improves retention of target identifier prefixes and outperforms existing multimodal generative retrieval baselines.
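The abstract does not give formulas for these mechanisms, so the following is only a hedged sketch of how two of them might look in plain Python. It assumes a ListNet-style cross-entropy for the listwise distillation loss, a prefix vector formed by summing the selected codewords (the partial residual-quantization reconstruction), and an additive fusion of log-probability with weighted cosine similarity. The function names, the `weight` parameter, and the toy codebooks are hypothetical. The codebook sizes (2, then 3) grow with depth only to echo the idea of vocabulary scheduling.

```python
import math


def log_softmax(scores):
    """Numerically stable log-softmax over a list of scores."""
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - log_z for s in scores]


def listwise_distillation_loss(teacher_scores, student_scores):
    """ListNet-style cross-entropy between teacher and student rankings.

    Teacher scores stand in for similarities from pre-quantization
    embeddings; student scores stand in for quantized-prefix similarities.
    """
    teacher_logp = log_softmax(teacher_scores)
    student_logp = log_softmax(student_scores)
    return -sum(math.exp(tp) * sp for tp, sp in zip(teacher_logp, student_logp))


def prefix_vector(prefix, codebooks):
    """Sum the codewords chosen at each level of the prefix."""
    vec = [0.0] * len(codebooks[0][0])
    for level, code in enumerate(prefix):
        for i, x in enumerate(codebooks[level][code]):
            vec[i] += x
    return vec


def cosine(a, b):
    """Cosine similarity, defined as 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def fused_score(logprob, query, prefix, codebooks, weight=1.0):
    """Assumed fusion rule: decoder log-prob plus weighted query similarity."""
    return logprob + weight * cosine(query, prefix_vector(prefix, codebooks))


# Hypothetical codebooks: 2 codewords at level 0, 3 at level 1.
codebooks = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.1, 0.0], [0.0, 0.1], [-0.1, 0.0]],
]
query = [0.0, 1.0]

# The decoder prefers prefix (0,), but the query is closer to prefix (1,).
score_non_target = fused_score(-0.5, query, (0,), codebooks)
score_target = fused_score(-1.0, query, (1,), codebooks)
print(score_target > score_non_target)

# The loss is lower when the student ranks candidates like the teacher does.
teacher = [2.0, 1.0, 0.0]
print(listwise_distillation_loss(teacher, [2.0, 1.0, 0.0])
      < listwise_distillation_loss(teacher, [0.0, 1.0, 2.0]))
```

Under these assumptions, the similarity term lets a prefix the decoder ranks lower overtake a competitor when its partial reconstruction is closer to the query. This is the kind of correction the abstract attributes to geometric score fusion.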
Problem

Research questions and friction points this paper is trying to address.

multimodal generative retrieval
indexing-decoding gap
prefix discriminability
beam search
identifier generation
Innovation

Methods, ideas, or system contributions that make the work stand out.

prefix retention optimization
multimodal generative retrieval
indexing-decoding gap
residual quantization
trie-constrained beam search