X-MADAM-RAG: Diagnosing and Handling Chinese-English Evidence Conflict in Retrieval-Augmented Generation

📅 2026-06-11
🤖 AI Summary
This study examines conflicting evidence between Chinese and English sources in multilingual retrieval-augmented generation (RAG), where retrieved documents may support incompatible answers and mislead the model. The authors introduce X-RAMDocs-ZHEN, a controlled bilingual Chinese-English conflict benchmark, and propose X-MADAM-RAG, an interpretable framework that handles evidence conflict through per-document candidate extraction, visible-evidence repair, deterministic candidate grouping, and conflict-aware aggregation. With Qwen2.5-7B-Instruct, the method achieves a strict accuracy of 0.9667 and a conflict-aware success rate of 0.9767 on the controlled benchmark. However, a rule-only extractor scores 1.0000 on the same data, which shows that the benchmark is highly templated. On a naturalized stress test that removes explicit answer templates, X-MADAM-RAG drops to 0.3000 strict accuracy, below both baselines, while a privileged oracle remains perfect, pointing to document-level evidence extraction as the main bottleneck.
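The four stages named in the summary can be pictured with a minimal Python sketch. This is not the authors' implementation: the `Doc` type, the `normalize` key, the substring-based reading of "visible-evidence repair", the status labels, and the colon-based `toy_extract` (a stand-in for the per-document LLM call) are all assumptions made for illustration.

```python
import re
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class Doc:
    lang: str  # "zh" or "en"
    text: str


def normalize(s: str) -> str:
    """Deterministic grouping key (assumed): collapse whitespace and casefold."""
    return re.sub(r"\s+", " ", s).strip().casefold()


def repair(candidate: Optional[str], doc: Doc) -> Optional[str]:
    """Assumed reading of visible-evidence repair: keep a candidate only
    if it literally appears in the document it was extracted from."""
    key = normalize(candidate) if candidate else ""
    if key and key in normalize(doc.text):
        return candidate.strip()
    return None


def run_pipeline(docs: List[Doc], extract: Callable[[Doc], Optional[str]]) -> dict:
    groups: Dict[str, List[dict]] = defaultdict(list)
    for idx, doc in enumerate(docs):
        cand = repair(extract(doc), doc)      # stages 1 and 2
        if cand is not None:                  # stage 3: group by normalized key
            groups[normalize(cand)].append(
                {"doc": idx, "lang": doc.lang, "surface": cand}
            )
    # Stage 4: report a conflict instead of silently picking one candidate.
    if not groups:
        status = "no_evidence"
    elif len(groups) == 1:
        status = "agreement"
    else:
        status = "conflict"
    answers = sorted(members[0]["surface"] for members in groups.values())
    return {"status": status, "answers": answers, "support": dict(groups)}


def toy_extract(doc: Doc) -> Optional[str]:
    """Hypothetical extractor: take whatever follows the last colon."""
    parts = re.split(r"[:：]", doc.text)
    return parts[-1].strip(" .。") if len(parts) > 1 else None


DOCS = [
    Doc("en", "Official record: Canberra."),
    Doc("zh", "官方记录：Sydney。"),
    Doc("en", "Travel blog with no answer here"),
]

out = run_pipeline(DOCS, toy_extract)
print(out["status"], out["answers"])  # conflict ['Canberra', 'Sydney']
```

Keeping grouping and aggregation free of model calls is what makes those stages deterministic in this sketch; all model-dependent behavior is confined to `extract`.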
📝 Abstract
Retrieval-augmented generation (RAG) systems may receive evidence that is not merely noisy but mutually contradictory. This issue becomes particularly salient in multilingual settings, where retrieved Chinese and English evidence may support incompatible answer candidates. We study this problem through X-RAMDocs-ZHEN, a controlled Chinese-English benchmark derived from RAMDocs for diagnosing evidence conflict in RAG. The benchmark contains 300 examples across six balanced conditions, including monolingual support, bilingual agreement, reversed conflict directions, and conflict with optional noise. We further examine X-MADAM-RAG, an interpretable pipeline that decomposes evidence handling into per-document candidate extraction, visible-evidence repair, deterministic candidate grouping, and conflict-aware aggregation. On the original controlled benchmark with Qwen2.5-7B-Instruct, X-MADAM-RAG achieves 0.9667 strict accuracy and 0.9767 conflict-aware success, outperforming an evidence-normalized single-call baseline. However, a zero-call rule-only extractor reaches 1.0000 on the same benchmark, revealing strong template regularity. To probe this limitation, we construct a deterministic naturalized stress test that removes explicit answer templates while preserving candidate strings. On its 100-sample subset, rule-only extraction falls to 0.0000, but X-MADAM-RAG also drops to 0.3000 strict accuracy, below both naive and evidence-normalized baselines. A privileged oracle remains perfect, indicating that document-level extraction is the main bottleneck. These findings position X-RAMDocs-ZHEN and X-MADAM-RAG as diagnostic tools for controlled evidence conflict rather than as evidence of general hallucination detection or robustness to natural retrieval.
Problem

Research questions and friction points this paper is trying to address.

retrieval-augmented generation
evidence conflict
multilingual RAG
Chinese-English contradiction
hallucination detection
Innovation

Methods, ideas, or system contributions that make the work stand out.

retrieval-augmented generation
multilingual evidence conflict
interpretable pipeline
controlled benchmark
candidate extraction