Enhancing Retrieval-Augmented Generation with Entity Linking for Educational Platforms

📅 2025-12-05

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Problem: Semantic-similarity-based RAG systems for Italian educational question answering often lack factual accuracy because of lexical ambiguity and domain-specific terminology. Method: The paper proposes a hybrid re-ranking framework that integrates Wikidata entity linking, covering named entity recognition and disambiguation, into the retrieval pipeline. It combines semantic and entity-based signals through three re-ranking strategies: hybrid score weighting, reciprocal rank fusion (RRF), and a cross-encoder re-ranker. All three are evaluated on a domain-specific academic benchmark and on the general-purpose SQuAD-it benchmark. Contribution/Results: RRF significantly outperforms both the baseline and the cross-encoder in the target academic domain, whereas the cross-encoder performs best in the general-domain setting. The results point to domain adaptation and entity-aware re-ranking as key mechanisms for improving factual precision, and they offer empirical evidence for building more trustworthy domain-specific RAG systems in Italian.

📝 Abstract

In the era of Large Language Models (LLMs), Retrieval-Augmented Generation (RAG) architectures are gaining significant attention for their ability to ground language generation in reliable knowledge sources. Despite their effectiveness in many areas, RAG systems based solely on semantic similarity often fail to ensure factual accuracy in specialized domains, where terminological ambiguity can affect retrieval relevance. This study proposes an enhanced RAG architecture that integrates a factual signal derived from Entity Linking to improve the accuracy of educational question-answering systems in Italian. The system includes a Wikidata-based Entity Linking module and implements three re-ranking strategies to combine semantic and entity-based information: a hybrid score weighting model, reciprocal rank fusion, and a cross-encoder re-ranker. Experiments were conducted on two benchmarks: a custom academic dataset and the standard SQuAD-it dataset. Results show that, in domain-specific contexts, the hybrid scheme based on reciprocal rank fusion significantly outperforms both the baseline and the cross-encoder approach, while the cross-encoder achieves the best results on the general-domain dataset. These findings confirm a domain-mismatch effect and highlight the importance of domain adaptation and hybrid ranking strategies for factual precision and reliability in retrieval-augmented generation. They also demonstrate the potential of entity-aware RAG systems in educational environments, fostering adaptive and reliable AI-based tutoring tools.
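To make the reciprocal rank fusion step concrete, here is a minimal Python sketch of how a semantic ranking and an entity-based ranking could be fused. It is an illustration under stated assumptions, not the paper's implementation: the function name, the passage ids, and the smoothing constant `k=60` (a commonly used default for RRF) are all assumptions, since the abstract does not give these details.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of passage ids into a single ranking.

    Each passage receives 1 / (k + rank) from every list it appears in
    (ranks start at 1). k is an assumed smoothing constant.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, passage_id in enumerate(ranking, start=1):
            scores[passage_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by passage id for determinism.
    return sorted(scores, key=lambda pid: (-scores[pid], pid))


# Hypothetical rankings: one from embedding similarity, one from
# entity overlap between the query and each passage.
semantic_ranking = ["p3", "p1", "p2", "p4"]
entity_ranking = ["p1", "p4", "p3"]

fused = reciprocal_rank_fusion([semantic_ranking, entity_ranking])
print(fused)
```

Because RRF uses only rank positions, it needs no score normalization across the two signals, which is one plausible reason to prefer it when semantic and entity scores live on different scales.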

Problem

Research questions and friction points this paper is trying to address.

RAG systems based solely on semantic similarity often fail to ensure factual accuracy in educational question answering

Terminological ambiguity in specialized domains degrades retrieval relevance

Semantic and entity-based signals need to be combined effectively when re-ranking retrieved passages

Innovation

Methods, ideas, or system contributions that make the work stand out.

Integrates Entity Linking with RAG for factual accuracy

Uses hybrid re-ranking strategies combining semantic and entity signals

Shows a domain-mismatch effect: reciprocal rank fusion performs best on the academic dataset, while the cross-encoder performs best on SQuAD-it
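The hybrid score weighting idea listed above can be sketched as a weighted sum of a normalized semantic score and an entity-overlap score. This is a sketch under assumptions, not the authors' model: the Jaccard overlap as the entity signal, the min-max normalization, the weight `alpha`, and the placeholder entity ids (`Q_A`, `Q_B`, `Q_C`, which are not real Wikidata QIDs) are all hypothetical choices made for illustration.

```python
def jaccard(a, b):
    """Jaccard overlap between two collections of entity ids."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def hybrid_rerank(query_entities, candidates, alpha=0.6):
    """Re-rank candidates by alpha * semantic + (1 - alpha) * entity overlap.

    candidates: list of (passage_id, semantic_score, passage_entities).
    alpha is an assumed weight; the source does not state the value used.
    """
    if not candidates:
        return []
    sem_scores = [score for _, score, _ in candidates]
    low, high = min(sem_scores), max(sem_scores)
    span = high - low
    scored = []
    for passage_id, score, entities in candidates:
        # Min-max normalize the semantic score so both signals lie in [0, 1].
        sem_norm = (score - low) / span if span else 1.0
        ent_score = jaccard(query_entities, entities)
        scored.append((passage_id, alpha * sem_norm + (1 - alpha) * ent_score))
    return sorted(scored, key=lambda item: -item[1])


# Placeholder entity ids standing in for linked Wikidata entities.
query_entities = {"Q_B", "Q_C"}
candidates = [
    ("p1", 0.82, {"Q_A"}),
    ("p2", 0.80, {"Q_B"}),
    ("p3", 0.60, {"Q_B", "Q_C"}),
]

ranked = hybrid_rerank(query_entities, candidates)
print([passage_id for passage_id, _ in ranked])
```

In this toy input, `p1` has the highest semantic score but shares no entity with the query, so the entity signal moves `p2` ahead of it.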

🔎 Similar Papers

EntGPT: Entity Linking with Generative Large Language Models