TVIR: Building Deep Research Agents Towards Text–Visual Interleaved Report Generation

📅 2026-06-01
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
This work addresses a limitation of existing deep research agents and benchmarks: they remain largely text-centric, with limited evaluation of whether visual elements are factually reliable and aligned with the surrounding analysis. To bridge this gap, it introduces TVIR, which comprises a benchmark (TVIR-Bench) and a baseline agent (TVIR-Agent). TVIR-Bench contains 100 expert-curated multimodal deep research tasks in which visual elements must serve specific analytical sub-goals, and it is paired with a dual-path evaluation framework that combines Textual Assessment and Visual Assessment. TVIR-Agent is a hierarchical multi-agent framework that constructs outlines, retrieves images, generates charts with traceable sources, and composes reports through context-aware sequential writing. In experiments across nine deep research systems, TVIR-Agent achieves strong overall performance, underscoring the importance of explicit multimodal design and evaluation in evidence-driven report generation.
📝 Abstract
Deep Research Agents have shown strong capability in multi-step information retrieval, reasoning, and long-form report generation, but existing benchmarks and systems remain predominantly text-centric, with limited evaluation of whether visual elements are factually reliable and well aligned with the surrounding analysis. To address this gap, we introduce TVIR (Text–Visual Interleaved Report Generation), which includes TVIR-Bench, a benchmark of 100 expert-curated multimodal deep research tasks that require visual elements to serve specific analytical sub-goals, and TVIR-Agent, a hierarchical multi-agent framework that serves as a strong baseline for constructing outlines, retrieving images, generating charts with traceable sources, and composing reports through context-aware sequential writing. We further develop a dual-path evaluation framework that combines Textual Assessment and Visual Assessment. Experiments across nine deep research systems show that TVIR-Agent achieves strong overall performance, underscoring the importance of explicit multimodal design and evaluation for evidence-driven report generation.
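The four stages the abstract attributes to TVIR-Agent (outline construction, image retrieval, chart generation with traceable sources, and context-aware sequential writing) can be pictured with a toy pipeline. This is only an illustrative sketch, not the authors' implementation: every name below (Visual, Section, build_outline, retrieve_image, generate_chart, write_report) and the placeholder source strings are assumptions made for the example, and each stage is a stub where a real system would call a model or a search tool.

```python
from dataclasses import dataclass, field


@dataclass
class Visual:
    kind: str      # "image" or "chart"
    caption: str
    source: str    # provenance, kept so each visual stays traceable


@dataclass
class Section:
    title: str
    text: str = ""
    visuals: list = field(default_factory=list)


def build_outline(task: str) -> list:
    # Stub planner: a real agent would derive sections from the task itself.
    return [Section(f"{task}: background"), Section(f"{task}: analysis")]


def retrieve_image(section: Section) -> Visual:
    # Stub retriever: returns a placeholder image with a recorded source.
    return Visual("image", f"Image for {section.title}", "hypothetical://image-search")


def generate_chart(section: Section) -> Visual:
    # Stub chart generator: records the (hypothetical) data source it plotted.
    return Visual("chart", f"Chart for {section.title}", "hypothetical://dataset")


def write_report(task: str) -> list:
    sections = build_outline(task)
    context = ""  # text written so far, carried forward to later sections
    for section in sections:
        section.visuals = [retrieve_image(section), generate_chart(section)]
        refs = "; ".join(
            f"[{v.kind}: {v.caption} | source: {v.source}]" for v in section.visuals
        )
        # Stub writer: only notes how much prior context was available.
        section.text = f"(context chars: {len(context)}) Draft for {section.title}. {refs}"
        context += section.text
    return sections
```

Calling write_report("EV market") returns two sections, each carrying an image and a chart with a recorded source; the second section is written with the first section's text available as context, which is the sequential-writing idea in miniature.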
Problem

Research questions and friction points this paper is trying to address.

Deep Research Agents
Text-Visual Interleaved Report Generation
Multimodal Evaluation
Visual Reliability
Evidence-Driven Report Generation
Innovation

Methods, ideas, or system contributions that make the work stand out.

multimodal report generation
deep research agents
text–visual interleaving
hierarchical multi-agent framework
dual-path evaluation
Authors

Xinkai Ma
Nanjing University
Zhiqi Bai
Nanjing University
Dingling Zhang
Nanjing University
Pei Liu
Nanjing University
Yishuo Yuan
Nanjing University
He Zhu
Nanjing University
Jiakai Wang
Zhongguancun Laboratory
Adversarial examples, Trustworthy AI
Qianqian Xie
Wuhan University
NLP, LLM
Yifan Zhao
Nanjing University
Xinlong Yang
Peking University | Chongqing University
Multi-modal Learning, Large Language Model
Hao Cong
Nanjing University
Zhiheng Yao
Nanjing University
Fengxia Xie
Nanjing University
Zihao Xu
Nanjing University
Haoran Xu
Nanjing University
Zhaohui Wang
Nanjing University
Minghao Liu
Nanjing University
Shirong Lin
Nanjing University
Yingshui Tan
Nanjing University
Yuchi Xu
Nanjing University
Wenbo Su
Nanjing University
Zhaoxiang Zhang
Institute of Automation, Chinese Academy of Sciences
Computer Vision, Pattern Recognition, Biologically-inspired Learning
Bo Zheng
Nanjing University
Jiaheng Liu
Nanjing University