🤖 AI Summary
Cross-lingual document alignment on large-scale web data must balance accuracy against efficiency. This paper introduces the Bidirectional MaxSim score (BiMax), a lightweight document-level similarity metric built on multilingual sentence embeddings. BiMax scores a document pair by bidirectional matching: each segment (e.g., a sentence) in the source document is matched to its most similar segment in the target document, and vice versa, which avoids computationally expensive Optimal Transport (OT) optimization. On the WMT16 bilingual document alignment task, BiMax achieves accuracy comparable to OT while running approximately 100× faster. The paper also analyzes the performance of current state-of-the-art multilingual sentence embedding models. The alignment methods are publicly released as the open-source tool EmbDA.
📝 Abstract
Document alignment, which pairs documents across source and target languages within the same web domain, is a necessary step in hierarchical mining (Bañón et al., 2020; Morishita et al., 2022). Several high-precision methods based on sentence embeddings have been developed, such as TK-PERT (Thompson and Koehn, 2020) and Optimal Transport (OT) (Clark et al., 2019; El-Kishky and Guzmán, 2020). However, given the massive scale of web mining data, both accuracy and speed must be considered. In this paper, we propose a cross-lingual Bidirectional MaxSim score (BiMax) for computing doc-to-doc similarity that is more efficient than the OT method. On the WMT16 bilingual document alignment task, BiMax attains accuracy comparable to OT with an approximately 100-fold speedup. We also conduct a comprehensive analysis of the performance of current state-of-the-art multilingual sentence embedding models. All the alignment methods in this paper are publicly available as a tool called EmbDA (https://github.com/EternalEdenn/EmbDA).
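The abstract does not give the BiMax formula, so the following is only a minimal sketch of a bidirectional max-similarity score under stated assumptions: segment embeddings are already computed by some multilingual sentence encoder, similarity is cosine, each direction averages the per-segment maxima, and the two directions are averaged with equal weight. The function name `bimax_score` is hypothetical and is not taken from EmbDA.

```python
import numpy as np


def bimax_score(src_emb: np.ndarray, tgt_emb: np.ndarray) -> float:
    """Illustrative bidirectional max-similarity score between two documents.

    src_emb: (n_src, dim) segment embeddings of the source document.
    tgt_emb: (n_tgt, dim) segment embeddings of the target document.
    Assumption: cosine similarity, mean aggregation, equal direction weights.
    """
    # L2-normalize rows so that dot products are cosine similarities.
    src = src_emb / np.linalg.norm(src_emb, axis=1, keepdims=True)
    tgt = tgt_emb / np.linalg.norm(tgt_emb, axis=1, keepdims=True)

    # Pairwise similarity matrix of shape (n_src, n_tgt).
    sim = src @ tgt.T

    # Source-to-target: best target match for each source segment, averaged.
    src_to_tgt = sim.max(axis=1).mean()
    # Target-to-source: best source match for each target segment, averaged.
    tgt_to_src = sim.max(axis=0).mean()

    return float(0.5 * (src_to_tgt + tgt_to_src))


if __name__ == "__main__":
    # Toy embeddings standing in for real sentence-encoder output.
    src_doc = np.array([[1.0, 0.0], [0.0, 1.0]])
    tgt_doc = np.array([[1.0, 0.0]])
    print(bimax_score(src_doc, tgt_doc))
```

Under these assumptions, scoring one document pair costs a single matrix product plus two max reductions, with no iterative solver. That is the kind of saving over OT the abstract refers to, although the sketch makes no claim about the measured speedup.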