KnowMT-Bench: Benchmarking Knowledge-Intensive Long-Form Question Answering in Multi-Turn Dialogues

📅 2025-09-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing benchmarks predominantly focus on single-turn question answering and lack systematic evaluation of factual accuracy and information efficiency in knowledge-intensive multi-turn, long-form dialogues—particularly in domains such as medicine, finance, and law. To address this gap, we introduce KnowMT-Bench, the first dedicated benchmark for knowledge-intensive multi-turn QA. It features dynamically generated multi-turn dialogue histories and a human-validated automated evaluation pipeline that quantifies how contextual noise degrades model performance. Experimental results demonstrate that multi-turn interaction significantly reduces both factual accuracy and information efficiency; however, retrieval-augmented generation (RAG) effectively mitigates—and in some cases reverses—this degradation, substantially improving answer quality. This work fills a critical void in evaluating multi-turn QA for knowledge-intensive applications and establishes a new paradigm for assessing the reliability of large language models in professional, domain-specific settings.

📝 Abstract
Multi-Turn Long-Form Question Answering (MT-LFQA) is a key application paradigm of Large Language Models (LLMs) in knowledge-intensive domains. However, existing benchmarks are limited to single-turn dialogue, while multi-turn dialogue benchmarks typically assess other orthogonal capabilities rather than knowledge-intensive factuality. To bridge this critical gap, we introduce **KnowMT-Bench**, the *first-ever* benchmark designed to systematically evaluate MT-LFQA for LLMs across knowledge-intensive fields, including medicine, finance, and law. To faithfully assess the model's real-world performance, KnowMT-Bench employs a dynamic evaluation setting where models generate their own multi-turn dialogue histories given logically progressive question sequences. The factual capability and information delivery efficiency of the *final-turn* answer are then evaluated using a human-validated automated pipeline. Our experiments reveal that multi-turn contexts degrade performance: factual capability declines due to the contextual noise from self-generated histories, while information efficiency drops as models become more verbose with increasing dialogue length. We then investigate mitigation strategies, demonstrating that retrieval-augmented generation (RAG) can effectively alleviate and even reverse this factual degradation. These findings underscore the importance of our benchmark in evaluating and enhancing the conversational factual capabilities of LLMs in real-world knowledge-intensive applications. Code is available at https://github.com/hardenyu21/KnowMT-Bench.
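The dynamic evaluation setting described above can be sketched as a simple loop: the model answers a logically progressive question sequence while its own earlier answers accumulate as dialogue history, and only the final-turn answer is scored. This is an illustrative sketch, not the benchmark's actual code; the function names and prompt format (`run_dynamic_dialogue`, `Q:`/`A:` transcript layout) are assumptions.

```python
def run_dynamic_dialogue(model, questions):
    """Ask each question with the model's self-generated history prepended;
    return the final-turn answer plus the full transcript."""
    history = []  # (question, answer) pairs produced so far
    for q in questions:
        # Context grows with the model's OWN previous answers,
        # so noise in early turns propagates into later turns.
        context = "\n".join(f"Q: {pq}\nA: {pa}" for pq, pa in history)
        prompt = f"{context}\nQ: {q}\nA:" if context else f"Q: {q}\nA:"
        answer = model(prompt)
        history.append((q, answer))
    final_answer = history[-1][1]  # only the final turn is evaluated
    return final_answer, history

# Toy stand-in for an LLM: reports how many questions its prompt contains.
def toy_model(prompt):
    turns = prompt.count("Q:")
    return f"answer after seeing {turns} question(s)"

final, transcript = run_dynamic_dialogue(
    toy_model,
    ["What is myopia?", "How is it measured?", "What are treatment options?"],
)
```

Swapping `toy_model` for a real LLM call keeps the structure intact: the key design point is that the history is self-generated rather than gold-standard, matching the paper's real-world setting.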
Problem

Research questions and friction points this paper is trying to address.

Benchmarking knowledge-intensive multi-turn question answering
Assessing factual accuracy degradation in long dialogues
Evaluating retrieval-augmented generation mitigation strategies
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces first multi-turn knowledge-intensive dialogue benchmark
Uses dynamic evaluation with self-generated dialogue histories
Demonstrates retrieval-augmented generation mitigates factual degradation
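The RAG mitigation the paper reports can be illustrated with a minimal sketch: before answering the final turn, retrieve domain passages and prepend them to the prompt, countering noise accumulated from the self-generated history. The word-overlap retriever and the `answer_with_rag` helper below are toy assumptions, not the benchmark's actual retrieval setup.

```python
def retrieve(query, corpus, k=2):
    """Rank corpus passages by naive word overlap with the query."""
    q_words = set(query.lower().split())
    scored = sorted(
        corpus,
        key=lambda p: len(q_words & set(p.lower().split())),
        reverse=True,
    )
    return scored[:k]

def answer_with_rag(model, question, history, corpus):
    """Prepend retrieved evidence to the final-turn prompt."""
    evidence = retrieve(question, corpus)
    context = "\n".join(f"Q: {q}\nA: {a}" for q, a in history)
    prompt = (
        "Evidence:\n" + "\n".join(evidence) + "\n"
        + context + f"\nQ: {question}\nA:"
    )
    return model(prompt)

corpus = [
    "Myopia is corrected with concave lenses.",
    "Interest rates affect bond prices inversely.",
    "Contract law requires offer and acceptance.",
]
reply = answer_with_rag(
    lambda p: p.splitlines()[1],  # toy model: parrots the top evidence line
    "How is myopia corrected?",
    [("What is myopia?", "A refractive error.")],
    corpus,
)
```

Because the evidence is grounded in the corpus rather than in the model's earlier turns, this setup illustrates why retrieval can offset, and sometimes reverse, the factual degradation caused by noisy self-generated history.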
Junhao Chen
Hong Kong University of Science and Technology (Guangzhou)
Yu Huang
Hong Kong University of Science and Technology (Guangzhou)
Siyuan Li
Hong Kong University of Science and Technology (Guangzhou)
Rui Yao
Hong Kong University of Science and Technology (Guangzhou)
Hanqian Li
M.Phil @HKUST(GZ)
Hanyu Zhang
Lecturer, Nankai University
Jungang Li
Hong Kong University of Science and Technology (Guangzhou)
Jian Chen
Hong Kong University of Science and Technology (Guangzhou)
Bowen Wang
The University of Osaka
Xuming Hu
Assistant Professor, HKUST(GZ) / HKUST