🤖 AI Summary
Persian has long been underrepresented in large-scale text embedding research, which limits natural language understanding and retrieval quality for the language. To address this gap, we present Hakim, a Persian text embedding model designed for chatbots and retrieval-augmented generation (RAG) systems, including retrieval tasks that must take message history into account. We introduce three new datasets (Corpesia, Pairsia-sup, and Pairsia-unsup) to support supervised and unsupervised training, along with a new baseline model built on the BERT architecture and a RetroMAE-based model. Hakim improves on existing approaches by 8.5% on the FaMTEB benchmark, outperforming all previously developed Persian language models. Accuracy is consistently higher across various Persian NLP tasks, and the RetroMAE-based model is particularly effective for textual information retrieval. Together, these contributions provide a new foundation for Persian language understanding.
📝 Abstract
Recent advancements in text embedding have significantly improved natural language understanding across many languages, yet Persian remains notably underrepresented in large-scale embedding research. In this paper, we present Hakim, a novel state-of-the-art Persian text embedding model that achieves an 8.5% performance improvement over existing approaches on the FaMTEB benchmark, outperforming all previously developed Persian language models. As part of this work, we introduce three new datasets (Corpesia, Pairsia-sup, and Pairsia-unsup) to support supervised and unsupervised training scenarios. Additionally, Hakim is designed for applications in chatbots and retrieval-augmented generation (RAG) systems, particularly retrieval tasks that require incorporating message history. We also propose a new baseline model built on the BERT architecture. Our language model consistently achieves higher accuracy across various Persian NLP tasks, while the RetroMAE-based model proves particularly effective for textual information retrieval applications. Together, these contributions establish a new foundation for advancing Persian language understanding.
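To make the idea of retrieval that incorporates message history more concrete, here is a minimal sketch. It is not Hakim's method or API: `toy_embed` is a placeholder bag-of-words embedding standing in for a real embedding model, and `build_query`, `retrieve`, and the `max_turns` parameter are hypothetical names chosen for illustration. The sketch only assumes that recent turns are concatenated with the current message before embedding, which is one simple way to give the retriever conversational context.

```python
import math
from collections import Counter


def toy_embed(text):
    """Placeholder embedding (assumption): bag-of-words token counts.

    A real system would call a text embedding model here instead.
    """
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a if token in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_query(history, message, max_turns=2):
    """Join the last `max_turns` history turns with the current message."""
    recent = history[-max_turns:] if max_turns > 0 else []
    return " ".join(recent + [message])


def retrieve(history, message, passages, max_turns=2):
    """Return the passage most similar to the history-aware query."""
    query_vec = toy_embed(build_query(history, message, max_turns))
    return max(passages, key=lambda p: cosine(query_vec, toy_embed(p)))


passages = [
    "the train ticket price is fifty dollars",
    "the museum ticket price is ten dollars",
]
history = ["tell me about the museum"]
message = "what is the ticket price"

# The message alone is ambiguous; the history turn mentions the museum.
best = retrieve(history, message, passages)
print(best)
```

In this toy setup the follow-up question matches both passages equally well on its own, and only the history turn separates them. A trained embedding model would capture this through learned representations rather than word overlap.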