AI Summary
Existing text embedding models exhibit poor robustness on Southeast Asian (SEA) languages and are often difficult to reproduce due to reliance on non-public data. This work proposes the first open and reproducible text embedding framework specifically designed for SEA languages, trained exclusively on publicly available data. Through a systematic investigation of the impact of data composition, training objectives, and encoder initialization on embedding robustness, the study demonstrates that a contrastive learning objective combined with a multilingual base encoder yields state-of-the-art performance on the SEA-BED benchmark. The resulting framework provides a fully transparent, reproducible, and analyzable suite of embedding models, establishing a reliable foundation for future research and applications in low-resource SEA language processing.
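The summary names a contrastive learning objective but does not give its exact form. As a rough illustration only, the sketch below assumes the widely used in-batch InfoNCE formulation, in which each query's paired text is the positive and the other texts in the batch serve as negatives. The function name `info_nce_loss`, the helper `l2_normalize`, and the `temperature=0.05` default are hypothetical choices for this example and are not taken from the paper.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products equal cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def info_nce_loss(queries, positives, temperature=0.05):
    """Mean in-batch InfoNCE loss (assumed formulation, not the paper's confirmed one).

    positives[i] is the matching text embedding for queries[i];
    every other positives[j] in the batch is treated as a negative.
    """
    q = [l2_normalize(v) for v in queries]
    p = [l2_normalize(v) for v in positives]
    total = 0.0
    for i, qi in enumerate(q):
        # Temperature-scaled cosine similarity between query i and every candidate.
        logits = [sum(a * b for a, b in zip(qi, pj)) / temperature for pj in p]
        # Numerically stable log-sum-exp over all candidates.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        # Cross-entropy with the matching pair as the target class.
        total += log_denominator - logits[i]
    return total / len(q)


# Toy check: correctly paired embeddings should score a lower loss than mispaired ones.
aligned = info_nce_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
shuffled = info_nce_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

In practice the embeddings would come from the multilingual base encoder rather than hand-written vectors, and the loss would be computed with an autodiff framework so that gradients flow back into the encoder.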
Abstract
Text embeddings are fundamental to many downstream applications, making robustness important for real-world NLP. However, most recent state-of-the-art embedding models are not reproducible because they rely on closed or undisclosed training data, and they remain insufficiently robust for Southeast Asian languages. We present SEA-Embedding, a fully open and reproducible text-embedding pipeline for Southeast Asian languages trained only on publicly available data, and use it to study three core factors of robust embedding design: data composition, training objective, and base encoder initialization. SEA-Embedding achieves state-of-the-art results on SEA-BED while enabling systematic and reproducible analysis of robust text embeddings for the region.