An Open-Source Dual-Loss Embedding Model for Semantic Retrieval in Higher Education

📅 2025-05-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address inaccurate semantic retrieval in academic texts, particularly course syllabi, caused by lexical diversity, implicit expression, and structural heterogeneity, this paper introduces open-source embedding models tailored for higher education. Methodologically, it proposes a dual-loss joint training framework that combines MultipleNegativesRankingLoss with CosineSimilarityLoss to optimize semantic ranking and similarity calibration together. The authors construct a synthetic, human-annotated dataset for education (3,197 sentence pairs), integrating LLM-generated candidates with expert semantic labeling, and establish a cross-syllabus retrieval evaluation framework. Evaluated on real syllabi from 28 universities, the fine-tuned models outperform open-source baselines including all-MiniLM-L6-v2, and the dual-loss model narrows the gap with OpenAI's text-embedding-3 series. The models support downstream applications such as RAG systems, academic Q&A bots, and learning management system integrations.

📝 Abstract
Recent advances in AI have catalyzed the adoption of intelligent educational tools, yet many semantic retrieval systems remain ill-suited to the unique linguistic and structural characteristics of academic content. This study presents two open-source embedding models fine-tuned for educational question answering, particularly in the context of course syllabi. A synthetic dataset of 3,197 sentence pairs, spanning synonymous terminology, paraphrased questions, and implicit-explicit mappings, was constructed through a combination of manual curation and large language model (LLM)-assisted generation. Two training strategies were evaluated: (1) a baseline model fine-tuned using MultipleNegativesRankingLoss (MNRL), and (2) a dual-loss model that combines MNRL with CosineSimilarityLoss to improve both semantic ranking and similarity calibration. Evaluations were conducted on 28 university course syllabi using a fixed set of natural language questions categorized into course, faculty, and teaching assistant information. Results demonstrate that both fine-tuned models outperform strong open-source baselines, including all-MiniLM-L6-v2 and multi-qa-MiniLM-L6-cos-v1, and that the dual-loss model narrows the performance gap with high-performing proprietary embeddings such as OpenAI's text-embedding-3 series. This work contributes reusable, domain-aligned embedding models and provides a replicable framework for educational semantic retrieval, supporting downstream applications such as academic chatbots, retrieval-augmented generation (RAG) systems, and learning management system (LMS) integrations.
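The dual-loss strategy described above pairs a ranking objective (MNRL, which treats other in-batch positives as negatives) with a regression objective (cosine-similarity MSE against annotated scores). A minimal from-scratch NumPy sketch of the two losses and their combination is shown below; the mixing weight `alpha` and the similarity `scale` are illustrative assumptions, not values reported by the paper, which uses the Sentence-Transformers implementations of these losses.

```python
import numpy as np

def l2_normalize(x):
    """Row-normalize embeddings to unit length."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def mnrl_loss(anchors, positives, scale=20.0):
    """MultipleNegativesRankingLoss: cross-entropy over in-batch
    cosine similarities, where row i's correct class is column i
    (its paired positive) and every other positive is a negative."""
    a = l2_normalize(anchors)
    p = l2_normalize(positives)
    scores = scale * (a @ p.T)  # (B, B) scaled cosine similarity matrix
    log_probs = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))

def cosine_similarity_loss(emb1, emb2, labels):
    """CosineSimilarityLoss: MSE between the predicted cosine
    similarity of each pair and its human-annotated score."""
    cos = np.sum(l2_normalize(emb1) * l2_normalize(emb2), axis=1)
    return np.mean((cos - labels) ** 2)

def dual_loss(anchors, positives, labels, alpha=0.5):
    """Weighted combination of the two objectives.
    alpha is a hypothetical mixing ratio for illustration."""
    return (alpha * mnrl_loss(anchors, positives)
            + (1 - alpha) * cosine_similarity_loss(anchors, positives, labels))
```

In practice this would be wired into a training loop with gradients; the sketch only makes the objective explicit: MNRL shapes the relative ordering of candidates, while the cosine term calibrates absolute similarity values against annotated labels.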
Problem

Research questions and friction points this paper is trying to address.

Enhancing semantic retrieval for academic content in higher education
Improving embedding models for educational question answering
Bridging performance gap with proprietary embeddings using dual-loss training
Innovation

Methods, ideas, or system contributions that make the work stand out.

Open-source dual-loss embedding model for education
Synthetic dataset with manual and LLM-assisted generation
Combines MNRL with CosineSimilarityLoss to jointly optimize ranking and similarity calibration
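The cross-syllabus evaluation described above ranks candidate syllabus sentences against a fixed set of natural-language questions by embedding similarity. A minimal retrieval sketch follows; the function name and top-k interface are assumptions for illustration, not the paper's API.

```python
import numpy as np

def retrieve_top_k(query_emb, corpus_embs, k=3):
    """Rank corpus sentences by cosine similarity to the query
    embedding and return the top-k (index, score) pairs."""
    q = query_emb / np.linalg.norm(query_emb)
    c = corpus_embs / np.linalg.norm(corpus_embs, axis=1, keepdims=True)
    sims = c @ q                      # cosine similarity per corpus row
    order = np.argsort(-sims)[:k]     # highest similarity first
    return [(int(i), float(sims[i])) for i in order]
```

With embeddings from the fine-tuned model, a query such as "Who is the teaching assistant?" would be matched against all sentences of a syllabus this way, which is the retrieval step that downstream RAG systems and academic chatbots build on.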