🤖 AI Summary
To address inaccurate semantic retrieval in academic texts—particularly course syllabi—caused by lexical diversity, implicit expression, and structural heterogeneity, this paper introduces two open-source embedding models tailored for higher education. Methodologically, it proposes a dual-loss joint training framework combining MultipleNegativesRankingLoss and CosineSimilarityLoss to jointly optimize semantic ranking and similarity calibration. The authors construct a synthetic, expert-annotated dataset for educational semantic retrieval (3,197 sentence pairs), integrating LLM-generated candidates with human semantic labeling, and establish a cross-syllabus retrieval evaluation framework. Evaluated on real syllabi from 28 universities, the fine-tuned models significantly outperform open-source baselines such as all-MiniLM-L6-v2, and the dual-loss model narrows the gap with OpenAI's text-embedding-3 series. The models are intended for downstream use in RAG systems, academic Q&A bots, and learning management system integrations.
📝 Abstract
Recent advances in AI have catalyzed the adoption of intelligent educational tools, yet many semantic retrieval systems remain ill-suited to the unique linguistic and structural characteristics of academic content. This study presents two open-source embedding models fine-tuned for educational question answering, particularly in the context of course syllabi. A synthetic dataset of 3,197 sentence pairs, spanning synonymous terminology, paraphrased questions, and implicit-explicit mappings, was constructed through a combination of manual curation and large language model (LLM)-assisted generation. Two training strategies were evaluated: (1) a baseline model fine-tuned using MultipleNegativesRankingLoss (MNRL), and (2) a dual-loss model that combines MNRL with CosineSimilarityLoss to improve both semantic ranking and similarity calibration. Evaluations were conducted on 28 university course syllabi using a fixed set of natural language questions categorized into course, faculty, and teaching assistant information. Results demonstrate that both fine-tuned models outperform strong open-source baselines, including all-MiniLM-L6-v2 and multi-qa-MiniLM-L6-cos-v1, and that the dual-loss model narrows the performance gap with high-performing proprietary embeddings such as OpenAI's text-embedding-3 series. This work contributes reusable, domain-aligned embedding models and provides a replicable framework for educational semantic retrieval, supporting downstream applications such as academic chatbots, retrieval-augmented generation (RAG) systems, and learning management system (LMS) integrations.
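The dual-loss strategy described above can be made concrete with a short sketch. MultipleNegativesRankingLoss treats each anchor's paired sentence as the positive and all other in-batch pairs as negatives (a cross-entropy over scaled cosine similarities), while CosineSimilarityLoss regresses the cosine similarity of a pair toward its annotated score. The sketch below is a minimal PyTorch illustration of the two terms, assuming a simple weighted sum with weight `alpha`; the actual combination weights and batching used in the paper are not specified here, so `alpha` and `scale` are assumptions, not reported hyperparameters.

```python
import torch
import torch.nn.functional as F


def mnrl_loss(anchors: torch.Tensor, positives: torch.Tensor, scale: float = 20.0) -> torch.Tensor:
    """MultipleNegativesRankingLoss: each anchor's true positive sits on the
    diagonal of the in-batch similarity matrix; every other row entry is a negative."""
    # (B, B) matrix of scaled cosine similarities between all anchor/positive pairs
    sims = F.cosine_similarity(anchors.unsqueeze(1), positives.unsqueeze(0), dim=-1) * scale
    labels = torch.arange(anchors.size(0), device=anchors.device)
    return F.cross_entropy(sims, labels)


def cosine_similarity_loss(emb_a: torch.Tensor, emb_b: torch.Tensor,
                           scores: torch.Tensor) -> torch.Tensor:
    """CosineSimilarityLoss: regress pairwise cosine similarity toward the
    human-annotated similarity score (calibration term)."""
    sims = F.cosine_similarity(emb_a, emb_b, dim=-1)
    return F.mse_loss(sims, scores)


def dual_loss(anchors, positives, emb_a, emb_b, scores, alpha: float = 0.5) -> torch.Tensor:
    """Weighted sum of the ranking and calibration terms.
    alpha is an illustrative assumption, not a value from the paper."""
    return alpha * mnrl_loss(anchors, positives) + (1.0 - alpha) * cosine_similarity_loss(emb_a, emb_b, scores)
```

In practice the same effect is achieved in the sentence-transformers library by passing two training objectives (one dataloader/loss pair per term) to the trainer, so each batch contributes either a ranking or a calibration gradient rather than an explicit weighted sum.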