🤖 AI Summary
This work addresses the challenge of reduced question-answering accuracy in multi-release software systems, where documentation across releases is highly similar yet contains subtle differences that confuse existing QA systems. To tackle this issue, the authors propose QAMR, a chatbot built on a retrieval-augmented generation (RAG) framework tailored to multi-release documentation. The framework combines a dual-chunking strategy, which tunes chunk sizes separately for retrieval and for answer generation, with query rewriting and context-selection mechanisms. Evaluated on both real-world industrial data and a public benchmark, QAMR achieves an answer correctness of 88.5% and a retrieval accuracy of 90%, improvements of 16.5% and 12% over a baseline RAG chatbot, respectively, while also reducing response time by 8%.
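The dual-chunking idea described above retrieves over small chunks but hands the language model the larger surrounding context for generation. A minimal sketch of one way to realize this, assuming a parent-chunk mapping and using a toy token-overlap scorer as a stand-in for QAMR's actual retriever (all names and sizes here are illustrative, not from the paper):

```python
def build_chunks(sentences, gen_size=4, ret_size=2):
    """Split a document (a list of sentences) into large generation
    chunks, then subdivide each into small retrieval chunks.
    Returns (retrieval_chunks, parent_of), where parent_of maps a
    retrieval-chunk index to the text of its enclosing generation chunk."""
    gen_chunks = [sentences[i:i + gen_size] for i in range(0, len(sentences), gen_size)]
    retrieval_chunks, parent_of = [], {}
    for gchunk in gen_chunks:
        parent_text = " ".join(gchunk)
        for j in range(0, len(gchunk), ret_size):
            parent_of[len(retrieval_chunks)] = parent_text
            retrieval_chunks.append(" ".join(gchunk[j:j + ret_size]))
    return retrieval_chunks, parent_of

def retrieve_context(query, retrieval_chunks, parent_of):
    """Score the small retrieval chunks by token overlap with the query
    (a stand-in for embedding similarity), then return the best match's
    larger parent chunk as context for answer generation."""
    q = set(query.lower().split())
    scores = [len(q & set(c.lower().split())) for c in retrieval_chunks]
    best = max(range(len(scores)), key=scores.__getitem__)
    return parent_of[best]
```

For example, with per-sentence retrieval chunks (`ret_size=1`) grouped into two-sentence generation chunks, a query matching a single sentence returns the whole surrounding pair as generation context, letting retrieval granularity and generation context be tuned independently.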
📝 Abstract
Companies regularly have to contend with multi-release systems, where several versions of the same software are in operation simultaneously. Question answering over documents from multi-release systems poses challenges because different releases have distinct yet overlapping documentation. Motivated by the observed inaccuracy of state-of-the-art question-answering techniques on multi-release system documents, we propose QAMR, a chatbot designed to answer questions across multi-release system documentation. QAMR enhances traditional retrieval-augmented generation (RAG) to ensure accuracy in the face of highly similar yet distinct documentation for different releases. It achieves this through a novel combination of pre-processing, query rewriting, and context selection. In addition, QAMR employs a dual-chunking strategy to enable separately tuned chunk sizes for retrieval and answer generation, improving overall question-answering accuracy. We evaluate QAMR using a public software-engineering benchmark as well as a collection of real-world, multi-release system documents from our industry partner, Ciena. Our evaluation yields five main findings: (1) QAMR outperforms a baseline RAG-based chatbot, achieving an average answer correctness of 88.5% and an average retrieval accuracy of 90%, which correspond to improvements of 16.5% and 12%, respectively. (2) An ablation study shows that QAMR's mechanisms for handling multi-release documents directly improve answer accuracy. (3) QAMR achieves a 19.6% average gain in answer correctness and a 14.0% average gain in retrieval accuracy over the best of its component-ablated variants. (4) QAMR reduces response time by 8% on average relative to the baseline. (5) The automatically computed accuracy metrics used in our evaluation strongly correlate with expert human assessments, validating the reliability of our methodology.
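Among the components named in the abstract, query rewriting is the one most directly aimed at the multi-release ambiguity problem. The abstract does not specify QAMR's actual rewriting rules, so the sketch below is an assumption for illustration only: it pulls a release identifier out of the query when one is mentioned (and known), falls back to a default release otherwise, and prefixes the query with it so retrieval can be restricted to that release's documents.

```python
import re

# Matches forms like "release 2.1", "version 3", or "v2.0" (illustrative).
_REL = re.compile(r"\b(?:release|version|v)\s*([0-9]+(?:\.[0-9]+)*)\b", re.I)

def rewrite_query(query, known_releases, default_release):
    """Hypothetical version-aware query rewriting: extract a known
    release identifier from the query (or fall back to a default) and
    return (rewritten_query, release) so the retriever can filter by
    release before scoring chunks."""
    m = _REL.search(query)
    release = m.group(1) if m and m.group(1) in known_releases else default_release
    cleaned = re.sub(r"\s{2,}", " ", _REL.sub("", query)).strip()
    return f"[release {release}] {cleaned}", release
```

In this design, the release tag is resolved before retrieval so that near-duplicate chunks from other releases never enter the candidate pool, which is one plausible way to counter the cross-release confusion the abstract describes.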