Hakim: Farsi Text Embedding Model

📅 2025-05-13

🤖 AI Summary

Persian has long been underrepresented in large-scale text embedding research, which limits the quality of Persian natural language understanding and retrieval. To address this gap, the paper presents Hakim, a Persian text embedding model designed with chatbot and retrieval-augmented generation (RAG) use in mind, including retrieval that takes message history into account. The work introduces three new datasets (Corpesia, Pairsia-sup, and Pairsia-unsup) to support supervised and unsupervised contrastive training, and describes models built on BERT and RetroMAE backbones. Hakim is reported to improve on existing Persian embedding models by 8.5% on the FaMTEB benchmark, with the RetroMAE-based model proving particularly effective for textual information retrieval.

📝 Abstract

Recent advancements in text embedding have significantly improved natural language understanding across many languages, yet Persian remains notably underrepresented in large-scale embedding research. In this paper, we present Hakim, a novel state-of-the-art Persian text embedding model that achieves an 8.5% performance improvement over existing approaches on the FaMTEB benchmark, outperforming all previously developed Persian language models. As part of this work, we introduce three new datasets (Corpesia, Pairsia-sup, and Pairsia-unsup) to support supervised and unsupervised training scenarios. Additionally, Hakim is designed for applications in chatbots and retrieval-augmented generation (RAG) systems, particularly retrieval tasks that require incorporating message history. We also propose a new baseline model built on the BERT architecture. Our language model consistently achieves higher accuracy across various Persian NLP tasks, while the RetroMAE-based model proves particularly effective for textual information retrieval applications. Together, these contributions establish a new foundation for advancing Persian language understanding.
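The abstract does not say how message history enters the retrieval query, so the following is only a minimal sketch of one common approach: concatenate the most recent turns with the current message, embed the result, and rank documents by cosine similarity. The names build_query, retrieve, and toy_embed are hypothetical, and toy_embed is a hashed bag-of-words stand-in used to keep the example self-contained; it is not the Hakim model.

```python
import hashlib
import math


def toy_embed(text, dim=64):
    """Stand-in embedding (NOT Hakim): hashed bag-of-words, L2-normalized."""
    vec = [0.0] * dim
    for token in text.lower().split():
        # Map each token to a bucket with a stable hash.
        idx = int(hashlib.md5(token.encode("utf-8")).hexdigest(), 16) % dim
        vec[idx] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec


def cosine(a, b):
    """Cosine similarity for vectors that are already L2-normalized."""
    return sum(x * y for x, y in zip(a, b))


def build_query(history, message, max_turns=2):
    """Assumed strategy: prepend the last `max_turns` turns to the message."""
    recent = history[-max_turns:] if max_turns > 0 else []
    return " ".join(recent + [message])


def retrieve(history, message, documents, embed=toy_embed, top_k=1):
    """Rank documents against the history-aware query; return the top_k."""
    query_vec = embed(build_query(history, message))
    scored = [(cosine(query_vec, embed(doc)), doc) for doc in documents]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:top_k]]
```

In practice the embed argument would be replaced by a call to a real embedding model that returns normalized vectors. The point of the sketch is that a follow-up such as "And tomorrow?" carries little retrievable signal on its own, while the preceding turn supplies the topic.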

Problem

Research questions and friction points this paper is trying to address.

Underrepresentation of Persian in text embedding research

Need for improved Persian NLP task accuracy

Lack of Persian datasets for supervised and unsupervised training

Innovation

Methods, ideas, or system contributions that make the work stand out.

Novel Persian text embedding model Hakim

Three new training datasets: Corpesia, Pairsia-sup, and Pairsia-unsup

A new BERT-based baseline model, plus a RetroMAE-based model for retrieval
