Fine-grained Fragment Retrieval in Multi-modal Long-form Dialogues

📅 2026-06-03

🤖 AI Summary
This work addresses the challenge of retrieving coherent, topic-relevant multimodal dialogue segments—comprising both text and images—rather than isolated utterances, in long multi-turn conversations. It formally introduces the fine-grained fragment retrieval (FFR) task and presents MLDR, the longest multimodal dialogue retrieval dataset to date. The authors propose two tailored solutions: F2RVLM, a generative model trained via reinforcement learning for single-dialogue scenarios, and FFRS, a two-stage system combining offline indexing with online fine-grained reasoning for large-scale corpora. Extensive experiments demonstrate that both approaches significantly outperform existing baselines on MLDR and a real-world WeChat test set, confirming their effectiveness and practical utility in multimodal dialogue understanding and retrieval.
📝 Abstract
With the widespread adoption of multi-modal communication platforms, long-form dialogues interleaving text and images have become increasingly common. Users often need to retrieve coherent dialogue fragments related to specific topics, rather than isolated utterances. We propose Fine-grained Fragment Retrieval (FFR), which locates semantically relevant multi-utterance, multi-image fragments in multi-modal long-form dialogues. We explore two settings: (1) FFR within Single-Dialogue, retrieving fragments from a given dialogue; and (2) FFR within Dialogue Corpus, retrieving from a large-scale corpus for open-domain scenarios. For (1), we introduce F2RVLM, a generation-based retrieval model trained with reinforcement learning, using multi-objective rewards and difficulty-aware curriculum sampling to enhance fragment coherence. For (2), we develop FFRS, a two-stage system combining offline fragment-level indexing with online retrieval. Specifically, each dialogue is decomposed into minimal semantic fragments encoded by a Fragment Embedding Model (FEM) into a vector database; at inference, FEM rapidly recalls Top-K candidates, and F2RVLM performs fine-grained reasoning to identify the most relevant sub-content. To support FFR, we construct MLDR, the longest multi-modal dialogue retrieval dataset to date, and a WeChat-based real-world test set. Experiments on both benchmarks demonstrate that F2RVLM and FFRS consistently achieve superior performance across single-dialogue and corpus-level FFR.
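The two-stage flow described for FFRS (offline fragment indexing, Top-K recall, then fine-grained selection) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' implementation: `toy_embed` is a hypothetical bag-of-words stand-in for the Fragment Embedding Model, `toy_refine` is a hypothetical word-overlap stand-in for F2RVLM, the fragment IDs and texts are made up, and images are omitted entirely.

```python
import math
from typing import Callable

Vector = list[float]

# Hypothetical toy vocabulary; a real FEM would produce learned multimodal embeddings.
VOCAB = ["hotel", "flight", "dinner", "photo"]


def toy_embed(text: str) -> Vector:
    """Stand-in for FEM: count how often each vocabulary word occurs."""
    words = text.lower().split()
    return [float(words.count(term)) for term in VOCAB]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity, returning 0.0 when either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_index(fragments: dict[str, str],
                embed: Callable[[str], Vector]) -> dict[str, Vector]:
    """Offline stage: embed every fragment once and store the vectors."""
    return {frag_id: embed(text) for frag_id, text in fragments.items()}


def recall_top_k(query_vec: Vector, index: dict[str, Vector], k: int) -> list[str]:
    """Online stage 1: return the IDs of the k most similar fragments."""
    ranked = sorted(index, key=lambda fid: cosine(query_vec, index[fid]), reverse=True)
    return ranked[:k]


def toy_refine(query: str, candidates: list[tuple[str, str]]) -> str:
    """Stand-in for F2RVLM: pick the candidate sharing the most query words."""
    query_words = set(query.lower().split())
    best_id, _ = max(candidates,
                     key=lambda c: len(query_words & set(c[1].lower().split())))
    return best_id


def retrieve(query: str, fragments: dict[str, str], index: dict[str, Vector],
             k: int = 2) -> str:
    """Online stages 1 and 2: coarse recall, then fine-grained selection."""
    candidate_ids = recall_top_k(toy_embed(query), index, k)
    return toy_refine(query, [(fid, fragments[fid]) for fid in candidate_ids])


# Made-up fragments keyed by "<dialogue>-<fragment>" IDs.
fragments = {
    "d1-f1": "which hotel did we book near the beach hotel photo",
    "d1-f2": "flight leaves at nine flight gate changed",
    "d2-f1": "dinner photo from last night dinner was great",
}
index = build_index(fragments, toy_embed)
best = retrieve("hotel photo", fragments, index, k=2)
```

Separating the stages means the expensive reasoning model only sees the K recalled candidates instead of the full corpus, which is the motivation the abstract gives for pairing fast recall with fine-grained reasoning.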
Problem

Research questions and friction points this paper is trying to address.

fine-grained fragment retrieval
multi-modal dialogue
long-form dialogue
fragment coherence
dialogue retrieval
Innovation

Methods, ideas, or system contributions that make the work stand out.

Fine-grained Fragment Retrieval
Multi-modal Dialogue
Reinforcement Learning
Fragment Embedding Model
Two-stage Retrieval
Hanbo Bi
Pattern Recognition Center, WeChat AI, Tencent Inc, China; Aerospace Information Research Institute, Chinese Academy of Sciences
Zhiqiang Yuan
Fudan University
Chongyang Li
Pattern Recognition Center, WeChat AI, Tencent Inc, China; Aerospace Information Research Institute, Chinese Academy of Sciences
Qiwei Yan
Pattern Recognition Center, WeChat AI, Tencent Inc, China; Aerospace Information Research Institute, Chinese Academy of Sciences
Zexi Jia
Pattern Recognition Center, WeChat AI, Tencent Inc, China
Jiapei Zhang
Pattern Recognition Center, WeChat AI, Tencent Inc, China
Xiaoyue Duan
Beihang University
image/video generation, music generation
Yingchao Feng
Aerospace Information Research Institute, Chinese Academy of Sciences
Machine learning in vision, Statistical and structural pattern recognition, Image/video analysis and understanding, Remote sensing image understanding, Machine learning and data mining with applications to remote sensing
Jinchao Zhang
WeChat AI - Pattern Recognition Center
Deep Learning, Natural Language Processing, Machine Translation, Dialogue System
Jie Zhou
Tencent WeChat AI
NLP