FactGuard: Leveraging Multi-Agent Systems to Generate Answerable and Unanswerable Questions for Enhanced Long-Context LLM Extraction

📅 2025-04-08
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing long-context extractive QA models exhibit weak refusal capability and poor factual consistency. Method: We propose a multi-agent collaborative data augmentation framework in which specialized agents for fact extraction, question generation, and contradiction verification cooperate to automatically synthesize high-quality answerable/unanswerable question pairs over long contexts (8K–128K tokens). Our approach integrates evidence-chain-guided unanswerable question synthesis and context-aware hard-example mining, eliminating reliance on manual annotation. Contribution/Results: This is the first work to enable large-scale, controllable hard-example generation without human labeling, filling a critical gap since SQuAD 2.0. We introduce FactGuard-Bench, a benchmark of 25,220 instances, on which seven mainstream LLMs achieve only 61.79% average accuracy, underscoring the severity of the problem. Our method significantly improves refusal robustness and factual consistency, establishing a new paradigm for trustworthy training of long-context LLMs.

📝 Abstract
Extractive reading comprehension systems are designed to locate the correct answer to a question within a given text. However, a persistent challenge lies in ensuring these models maintain high accuracy in answering questions while reliably recognizing unanswerable queries. Despite significant advances in large language models (LLMs) for reading comprehension, this issue remains critical, particularly as the length of supported contexts continues to expand. To address this challenge, we propose an innovative data augmentation methodology grounded in a multi-agent collaborative framework. Unlike traditional methods, such as the costly human annotation process required for datasets like SQuAD 2.0, our method autonomously generates evidence-based question-answer pairs and systematically constructs unanswerable questions. Using this methodology, we developed the FactGuard-Bench dataset, which comprises 25,220 examples of both answerable and unanswerable question scenarios, with context lengths ranging from 8K to 128K. Experimental evaluations conducted on seven popular LLMs reveal that even the most advanced models achieve only 61.79% overall accuracy. Furthermore, we emphasize the importance of a model's ability to reason about unanswerable questions to avoid generating plausible but incorrect answers. By implementing efficient data selection and generation within the multi-agent collaborative framework, our method significantly reduces the traditionally high costs associated with manual annotation and provides valuable insights for the training and optimization of LLMs.
Problem

Research questions and friction points this paper is trying to address.

Improving accuracy in distinguishing answerable and unanswerable questions for LLMs
Reducing manual annotation costs for long-context question-answer datasets
Enhancing LLM reasoning to avoid plausible but incorrect answers
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-agent system generates question-answer pairs
Autonomously constructs unanswerable questions systematically
Reduces manual annotation costs efficiently
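The agent pipeline outlined above can be sketched as a simple loop. This is a minimal, hypothetical illustration: the agent names, the rule-based stand-in logic (sentence splitting, string containment), and the `synthesize` helper are assumptions for readability, not the paper's implementation, which uses LLM-backed agents.

```python
# Hypothetical sketch of the multi-agent augmentation loop: a fact-extraction
# agent proposes evidence, a question-generation agent turns evidence into QA
# pairs, and a contradiction/verification agent keeps only supported answers.
# Unanswerable questions are built from facts absent from the context.
from dataclasses import dataclass
from typing import List, Optional, Tuple

@dataclass
class QAExample:
    context: str
    question: str
    answer: Optional[str]  # None marks an unanswerable question

def fact_extraction_agent(context: str) -> List[str]:
    """Extract candidate evidence sentences (stand-in: naive sentence split)."""
    return [s.strip() for s in context.split(".") if s.strip()]

def question_generation_agent(fact: str) -> Tuple[str, str]:
    """Turn one evidence sentence into a (question, answer) pair (stand-in)."""
    return (f"According to the context, what is stated about: '{fact}'?", fact)

def contradiction_agent(context: str, answer: str) -> bool:
    """Verify the answer is supported by the context (stand-in: containment)."""
    return answer in context

def synthesize(context: str, distractor_fact: str) -> List[QAExample]:
    examples = []
    for fact in fact_extraction_agent(context):
        question, answer = question_generation_agent(fact)
        if contradiction_agent(context, answer):
            examples.append(QAExample(context, question, answer))  # answerable
    # Evidence-guided unanswerable question: ask about an unsupported fact.
    question, _ = question_generation_agent(distractor_fact)
    if not contradiction_agent(context, distractor_fact):
        examples.append(QAExample(context, question, None))  # unanswerable
    return examples

data = synthesize(
    "The model was trained on 128K contexts. It uses three agents",
    "The model was trained on synthetic images",
)
```

In the full framework each stand-in function would be an LLM call, and the contradiction agent's rejections feed the hard-example mining step, but the control flow, verified answerable pairs plus deliberately unsupported unanswerable ones, is the same.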
Qian-Wen Zhang
Tencent Technology
Fang Li
Tencent YouTu Lab, Beijing, China
Jie Wang
Tencent YouTu Lab, Beijing, China
Lingfeng Qiao
Tencent YouTu Lab, Beijing, China
Yifei Yu
Tencent YouTu Lab, Beijing, China
Di Yin
Tencent
LLM · NLP · MLLM
Xing Sun
Tencent YouTu Lab
LLM · MLLM · Agent