Improving Answer Extraction in Context-based Question Answering Systems Using LLMs

πŸ“… 2026-06-04
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF

career value

192K/year
πŸ€– AI Summary
This study addresses the limitations of existing question-answering systems in handling complex or ambiguous queries, which often stem from insufficient contextual understanding, inconsistent responses, and poor cross-domain generalization. Building upon pre-trained large language models such as RoBERTa-base, the authors perform supervised fine-tuning on the SQuAD1.1 dataset to significantly enhance the model’s ability to accurately comprehend context and extract precise answers. Experimental results demonstrate that the fine-tuned model achieves strong performance across multiple evaluation metrics, including ROUGE-L (86.84%), BLEU (28.24%), and BERTScore (95.38%). These improvements effectively mitigate issues related to irrelevant or vague responses, thereby validating the approach’s notable gains in answer accuracy, relevance, and cross-domain adaptability.
πŸ“ Abstract
Question answering (QA) systems have achieved notable progress with the advent of large language models (LLMs). However, they still face challenges in accurately extracting and generating precise answers from given contexts, particularly when dealing with complex or ambiguous queries. Existing approaches often struggle with contextual understanding, answer consistency, and generalization across diverse domains. In this work, we propose a question answering system based on large language models, where the input consists of a textual context and a corresponding question, and the output is a concise and accurate answer. The motivation behind this research lies in addressing the limitations of current QA systems, particularly their tendency to produce irrelevant or imprecise responses despite having access to the correct context. Our methodology involves fine-tuning a pre-trained LLM on a benchmark QA dataset to improve its contextual comprehension and answer extraction capabilities. Specifically, we utilize the Stanford Question Answering Dataset (SQuAD1.1), which provides high-quality context-question-answer triplets for supervised training and evaluation. Experimental results show that the fine-tuned Roberta-base model achieves the highest performance, attaining a ROUGE-L score of 86.84%, a BLEU score of 28.24%, and a BERTScore of 95.38%. These results indicate strong accuracy and answer relevance, demonstrating the effectiveness of the proposed approach for context-based question answering tasks. Furthermore, the findings confirm that targeted fine-tuning substantially improves the reliability and precision of QA systems.
Problem

Research questions and friction points this paper is trying to address.

question answering
answer extraction
context understanding
large language models
answer accuracy
Innovation

Methods, ideas, or system contributions that make the work stand out.

fine-tuning
large language models
context-based question answering
answer extraction
SQuAD