SwissGov-RSD: A Human-annotated, Cross-lingual Benchmark for Token-level Recognition of Semantic Differences Between Related Documents

📅 2025-12-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the underexplored yet critical task of document-level semantic difference recognition in multilingual settings. To this end, the authors introduce SwissGov-RSD, the first naturalistic, document-level, cross-lingual benchmark for this task, covering English–German, English–French, and English–Italian parallel texts: 224 multi-parallel documents with token-level differences manually annotated. SwissGov-RSD bridges three key gaps in prior work: language coverage (cross-lingual rather than monolingual), text granularity (document-level rather than sentence-level), and data provenance (real-world rather than synthetic texts). The authors systematically evaluate open- and closed-source large language models as well as encoder models under multiple fine-tuning settings, measuring performance against the human annotations as the gold standard. Results reveal substantially lower accuracy on SwissGov-RSD than on monolingual, sentence-level, or synthetic semantic-alignment benchmarks, underscoring the difficulty of cross-lingual, document-level semantic comparison.
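The summary notes that models are scored against human token-level annotations as the gold standard. A minimal sketch of what such a token-level evaluation could look like, assuming a simple set-based comparison of annotated token indices (an illustrative assumption, not the paper's exact protocol):

```python
# Hypothetical sketch of token-level evaluation for semantic-difference
# recognition. Gold and predicted annotations are modeled as sets of
# token indices marked as semantically divergent.

def token_f1(gold, pred):
    """Return (precision, recall, F1) over divergent-token indices."""
    gold, pred = set(gold), set(pred)
    tp = len(gold & pred)  # tokens correctly flagged as divergent
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Example: gold marks tokens 2 and 5 as divergent; a model predicts 2 and 7.
p, r, f = token_f1({2, 5}, {2, 7})  # each is 0.5
```

In practice, scores would be aggregated over all document pairs in the benchmark, e.g. micro-averaged across tokens.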

📝 Abstract
Recognizing semantic differences across documents, especially documents in different languages, is crucial for text generation evaluation and multilingual content alignment. However, as a standalone task it has received little attention. We address this by introducing SwissGov-RSD, the first naturalistic, document-level, cross-lingual dataset for semantic difference recognition. It encompasses a total of 224 multi-parallel documents in English-German, English-French, and English-Italian with token-level difference annotations by human annotators. We evaluate a variety of open-source and closed-source large language models as well as encoder models across different fine-tuning settings on this new benchmark. Our results show that current automatic approaches perform poorly compared to their performance on monolingual, sentence-level, and synthetic benchmarks, revealing a considerable gap for both LLMs and encoder models. We make our code and datasets publicly available.
Problem

Research questions and friction points this paper is trying to address.

Develops a cross-lingual benchmark for token-level semantic difference recognition
Evaluates LLMs and encoder models on this new multilingual document-level dataset
Reveals a performance gap in models for naturalistic cross-lingual semantic comparison
Innovation

Methods, ideas, or system contributions that make the work stand out.

Cross-lingual dataset for semantic difference recognition
Token-level human annotations in multiple languages
Benchmark evaluation of LLMs and encoder models
Michelle Wastl
Department of Computational Linguistics, University of Zurich
Jannis Vamvas
University of Zurich
Rico Sennrich
Department of Computational Linguistics, University of Zurich