🤖 AI Summary
This work addresses the underexplored yet critical task of document-level semantic difference recognition in multilingual settings. To this end, the authors introduce SwissGov-RSD, the first naturalistic, document-level, cross-lingual benchmark for this task, comprising 224 multi-parallel documents in English–German, English–French, and English–Italian with token-level difference annotations by human annotators. SwissGov-RSD bridges three key gaps in prior work: language coverage (cross-lingual rather than monolingual), text granularity (document-level rather than sentence-level), and data provenance (naturalistic rather than synthetic texts). The authors systematically evaluate open- and closed-source large language models as well as encoder models under multiple fine-tuning settings, using the human annotations as the gold standard. Results show substantially lower accuracy on SwissGov-RSD than on monolingual, sentence-level, and synthetic benchmarks, underscoring the difficulty of cross-lingual, document-level semantic difference recognition.
📝 Abstract
Recognizing semantic differences across documents, especially in different languages, is crucial for text generation evaluation and multilingual content alignment. However, as a standalone task it has received little attention. We address this by introducing SwissGov-RSD, the first naturalistic, document-level, cross-lingual dataset for semantic difference recognition. It encompasses a total of 224 multi-parallel documents in English–German, English–French, and English–Italian with token-level difference annotations by human annotators. We evaluate a variety of open-source and closed-source large language models as well as encoder models across different fine-tuning settings on this new benchmark. Our results show that current automatic approaches perform poorly compared to their performance on monolingual, sentence-level, and synthetic benchmarks, revealing a considerable gap for both LLMs and encoder models. We make our code and datasets publicly available.