ACL-Verbatim: hallucination-free question answering for research

📅 2026-05-20

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses hallucination in large language models (LLMs) used for academic question answering by proposing an extractive, verifiable approach. The authors contribute a novel ground-truth dataset, built from papers in the ACL Anthology, that maps natural-language queries to verbatim text spans in source documents. Queries are generated synthetically with a custom pipeline based on the ScIRGen methodology, paired with paper chunks retrieved by VerbatimRAG, and annotated by NLP researchers; the dataset is then used to train and evaluate a variety of extractive models. On this benchmark, a 150M-parameter ModernBERT token classifier trained on silver supervision from the pipeline attains the best word-level F1 (53.6), ahead of the strongest evaluated LLM-based extractor (48.7).
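To make the idea of verbatim extraction concrete, here is a minimal sketch of how per-word relevance labels from a token classifier could be turned into verbatim spans. The function name `labels_to_spans`, the binary 0/1 label scheme, and whitespace tokenization are assumptions made for illustration; the source does not describe the system's actual decoding step.

```python
from __future__ import annotations


def labels_to_spans(words: list[str], labels: list[int]) -> list[str]:
    """Merge contiguous runs of words labelled 1 into verbatim spans.

    Hypothetical helper: assumes one binary label per whitespace-separated
    word, which may differ from the labelling scheme used in the paper.
    """
    if len(words) != len(labels):
        raise ValueError("words and labels must have the same length")

    spans: list[str] = []
    current: list[str] = []
    for word, label in zip(words, labels):
        if label == 1:
            # Word is part of a relevant span: extend the current run.
            current.append(word)
        elif current:
            # Run ended: emit it unchanged, exactly as it appears in the chunk.
            spans.append(" ".join(current))
            current = []
    if current:
        spans.append(" ".join(current))
    return spans


chunk = "We train a ModernBERT token classifier on silver labels".split()
predicted = [0, 0, 0, 1, 1, 1, 0, 1, 1]
print(labels_to_spans(chunk, predicted))
```

Because every returned span is copied from the retrieved chunk rather than generated, the output cannot contain text that is absent from the source, which is the property the extractive approach relies on.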

📝 Abstract

Academic researchers need efficient and reliable methods for collecting high-quality information from trusted sources, but modern tools for AI-assisted research still suffer from the tendency of Large Language Models (LLMs) to produce factually inaccurate or nonsensical output, commonly referred to as hallucinations. We apply the extractive question answering system VerbatimRAG to research papers in the ACL Anthology, directly mapping user queries to verbatim text spans in retrieved documents. We contribute a novel ground truth dataset for the task of mapping user queries to relevant text spans in research papers, and use it to train and evaluate a variety of extractive models. Human annotation is performed by NLP researchers and is based on synthetic user queries generated using a custom pipeline based on the ScIRGen methodology, paired with chunks of research papers retrieved by VerbatimRAG. On this benchmark, a 150M-parameter ModernBERT token classifier trained on silver supervision from our pipeline achieves the best word-level F1 (53.6), ahead of the strongest evaluated LLM extractor (48.7).
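The reported metric is word-level F1. As a rough sketch of what such a score measures, the snippet below computes F1 over the sets of word positions marked relevant by a model and by the gold annotation. The function name `word_level_f1`, the position-set representation, and the handling of empty sets are assumptions for illustration; the paper's exact evaluation protocol (tokenization, averaging across examples) is not given in this summary.

```python
from __future__ import annotations


def word_level_f1(pred_idx: set[int], gold_idx: set[int]) -> float:
    """F1 between predicted and gold sets of relevant word positions.

    Assumed convention: two empty sets count as a perfect match (1.0),
    and any case with no overlap scores 0.0.
    """
    if not pred_idx and not gold_idx:
        return 1.0
    true_positives = len(pred_idx & gold_idx)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred_idx)  # share of predicted words that are correct
    recall = true_positives / len(gold_idx)     # share of gold words that were recovered
    return 2 * precision * recall / (precision + recall)


# The model marks words 0-3 as relevant; annotators marked words 2-5.
print(word_level_f1({0, 1, 2, 3}, {2, 3, 4, 5}))
```

In this example two of the four predicted words are correct and two of the four gold words are recovered, so precision, recall, and F1 are all 0.5.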

Problem

Research questions and friction points this paper is trying to address.

hallucination

question answering

academic research

extractive QA

Large Language Models

Innovation

Methods, ideas, or system contributions that make the work stand out.

extractive question answering

hallucination mitigation

VerbatimRAG