BiMax: Bidirectional MaxSim Score for Document-Level Alignment

📅 2025-10-17

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

To balance accuracy and efficiency in cross-lingual document alignment on large-scale web data, this paper introduces the Bidirectional MaxSim score (BiMax), a lightweight document-level similarity metric built on multilingual sentence embeddings. BiMax scores a candidate document pair with maximum-similarity matching in both directions: each sentence embedding in the source document is matched to its most similar sentence embedding in the target document, and vice versa. This avoids the computationally expensive optimization required by Optimal Transport (OT). On the WMT16 bilingual document alignment task, BiMax achieves accuracy comparable to OT while running approximately 100 times faster. The paper also analyzes how current state-of-the-art multilingual sentence embedding models perform on this task. All alignment methods from the paper are publicly available in the open-source tool EmbDA.
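The scoring idea can be illustrated with a minimal NumPy sketch. It assumes each document is already represented as a matrix of sentence embeddings (one row per sentence), that similarity is cosine similarity, and that the two directional scores are averaged with equal weight. None of these details are stated in the summary, and the function names `l2_normalize`, `max_sim`, and `bimax` are hypothetical, not the EmbDA API.

```python
import numpy as np


def l2_normalize(x):
    """Scale each row to unit length so dot products equal cosine similarity."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.clip(norms, 1e-12, None)


def max_sim(src, tgt):
    """Directional score: for every source sentence take its best cosine
    match among the target sentences, then average over source sentences."""
    sims = l2_normalize(src) @ l2_normalize(tgt).T  # shape (n_src, n_tgt)
    return float(sims.max(axis=1).mean())


def bimax(src, tgt):
    """Assumed bidirectional score: mean of the two directional scores."""
    return 0.5 * (max_sim(src, tgt) + max_sim(tgt, src))


# Toy 2-dimensional "embeddings": the target document holds both source
# sentences plus one extra sentence that has no exact counterpart.
src_doc = np.array([[1.0, 0.0], [0.0, 1.0]])
tgt_doc = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
score = bimax(src_doc, tgt_doc)
```

In this toy case the source-to-target direction scores 1.0, while the target-to-source direction is pulled below 1.0 by the unmatched extra sentence, which shows why scoring in both directions is more informative than scoring in one. The whole computation is a single matrix product followed by row and column maxima, with no iterative solver as in OT.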

📝 Abstract

Document alignment is necessary for hierarchical mining (Bañón et al., 2020; Morishita et al., 2022), which aligns documents across source and target languages within the same web domain. Several high-precision sentence embedding-based methods have been developed, such as TK-PERT (Thompson and Koehn, 2020) and Optimal Transport (OT) (Clark et al., 2019; El-Kishky and Guzmán, 2020). However, given the massive scale of web mining data, both accuracy and speed must be considered. In this paper, we propose a cross-lingual Bidirectional MaxSim score (BiMax) for computing doc-to-doc similarity, which improves efficiency compared to the OT method. On the WMT16 bilingual document alignment task, BiMax attains accuracy comparable to OT with an approximate 100-fold speed increase. We also conduct a comprehensive analysis of the performance of current state-of-the-art multilingual sentence embedding models. All the alignment methods in this paper are publicly available as a tool called EmbDA (https://github.com/EternalEdenn/EmbDA).
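Once a doc-to-doc similarity is available, aligning the documents of one web domain amounts to choosing pairs from a source-by-target score matrix. The sketch below shows one common strategy, greedy one-to-one matching by descending score. This selection rule is an assumption made for illustration; the abstract does not say how pairs are selected, and `greedy_align` is a hypothetical name, not part of EmbDA.

```python
import numpy as np


def greedy_align(scores):
    """Greedy one-to-one matching over a (n_src, n_tgt) score matrix:
    repeatedly accept the highest-scoring pair whose source and target
    documents are both still unmatched."""
    n_src, n_tgt = scores.shape
    order = np.argsort(-scores, axis=None)  # flat indices, best score first
    used_src, used_tgt, pairs = set(), set(), []
    for flat in order:
        i, j = divmod(int(flat), n_tgt)  # recover (row, column)
        if i in used_src or j in used_tgt:
            continue
        pairs.append((i, j))
        used_src.add(i)
        used_tgt.add(j)
        if len(pairs) == min(n_src, n_tgt):
            break
    return pairs


# Hypothetical scores for 2 source and 3 target documents in one web domain.
scores = np.array([[0.90, 0.80, 0.10],
                   [0.85, 0.30, 0.20]])
pairs = greedy_align(scores)
```

Here source 1 scores highest against target 0, but that target is already taken by source 0, so source 1 falls back to target 1. Filling the score matrix costs one similarity computation per candidate pair, which is why the per-pair cost of the metric dominates at web scale.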

Problem

Research questions and friction points this paper is trying to address.

Improving efficiency in cross-lingual document alignment

Achieving high accuracy while accelerating similarity computation

Evaluating multilingual sentence embeddings for document matching

Innovation

Methods, ideas, or system contributions that make the work stand out.

Bidirectional MaxSim score for document similarity

Improves efficiency over the Optimal Transport (OT) method

Achieves accuracy comparable to OT on WMT16 with an approximately 100x speedup
