WebFAQ: A Multilingual Collection of Natural Q&A Datasets for Dense Retrieval

📅 2025-02-28
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing multilingual QA datasets suffer from limited language coverage, low annotation quality, and insufficient scale. To address these issues, this paper introduces WebFAQ, a large-scale multilingual FAQ collection built from schema.org annotations, covering 75 languages and 96 million naturally occurring question-answer pairs. It underpins 20 monolingual dense retrieval benchmarks and yields bilingual parallel corpora across 1,000+ language pairs. The framework incorporates LLM-assessed translation evaluation to improve the accuracy of these bilingual corpora, and combines fine-tuning of an in-domain pretrained XLM-RoBERTa model, near-duplicate detection, and state-of-the-art bitext mining, yielding substantial retrieval performance gains on WebFAQ and in zero-shot cross-benchmark evaluations. All data, models, and code are publicly released on GitHub and Hugging Face to advance research in multilingual dense retrieval.
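The near-duplicate detection mentioned above can take many forms; one simple, widely used approach is Jaccard similarity over word shingles. A minimal sketch of that idea (illustrative only; the paper does not specify its exact filtering method, and the 0.8 threshold is an assumption for this example):

```python
# Sketch of shingle-based near-duplicate detection, one common way to
# deduplicate QA pairs. Not the WebFAQ pipeline; the 0.8 threshold is
# an illustrative choice.

def shingles(text: str, n: int = 3):
    """Return the set of overlapping word n-grams in a text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(max(1, len(words) - n + 1))}

def jaccard(a: set, b: set) -> float:
    """Jaccard similarity |a ∩ b| / |a ∪ b| between two shingle sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def is_near_duplicate(q1: str, q2: str, threshold: float = 0.8) -> bool:
    return jaccard(shingles(q1), shingles(q2)) >= threshold
```

In practice, pairwise Jaccard over 96 million pairs is infeasible, so such checks are usually approximated with MinHash or locality-sensitive hashing.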

📝 Abstract
We present WebFAQ, a large-scale collection of open-domain question answering datasets derived from FAQ-style schema.org annotations. In total, the data collection consists of 96 million natural question-answer (QA) pairs across 75 languages, including 47 million (49%) non-English samples. WebFAQ further serves as the foundation for 20 monolingual retrieval benchmarks with a total size of 11.2 million QA pairs (5.9 million non-English). These datasets are carefully curated through refined filtering and near-duplicate detection, yielding high-quality resources for training and evaluating multilingual dense retrieval models. To empirically confirm WebFAQ's efficacy, we use the collected QAs to fine-tune an in-domain pretrained XLM-RoBERTa model. Through this process of dataset-specific fine-tuning, the model achieves significant retrieval performance gains, which generalize beyond WebFAQ to other multilingual retrieval benchmarks evaluated in a zero-shot setting. Last but not least, we utilize WebFAQ to construct a set of QA-aligned bilingual corpora spanning over 1000 language pairs using state-of-the-art bitext mining and automated LLM-assessed translation evaluation. Due to our advanced, automated method of bitext dataset generation, the resulting bilingual corpora demonstrate higher translation quality compared to similar datasets. WebFAQ and all associated resources are publicly available on GitHub and HuggingFace.
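Since the QA pairs are derived from FAQ-style schema.org annotations, extraction essentially amounts to parsing `FAQPage` JSON-LD blocks embedded in web pages. A minimal sketch of that idea (a simplified illustration, not the authors' pipeline):

```python
import json

# Minimal sketch of extracting QA pairs from a schema.org FAQPage
# JSON-LD block. Real pages embed this inside <script type="application/ld+json">.
jsonld = """
{
  "@context": "https://schema.org",
  "@type": "FAQPage",
  "mainEntity": [
    {
      "@type": "Question",
      "name": "What is WebFAQ?",
      "acceptedAnswer": {
        "@type": "Answer",
        "text": "A multilingual collection of natural QA pairs."
      }
    }
  ]
}
"""

def extract_qa_pairs(raw: str):
    """Return (question, answer) tuples from a FAQPage JSON-LD string."""
    data = json.loads(raw)
    if data.get("@type") != "FAQPage":
        return []
    pairs = []
    for entity in data.get("mainEntity", []):
        if entity.get("@type") != "Question":
            continue
        question = entity.get("name", "").strip()
        answer = entity.get("acceptedAnswer", {}).get("text", "").strip()
        if question and answer:
            pairs.append((question, answer))
    return pairs

pairs = extract_qa_pairs(jsonld)
```

Because site owners write these annotations for search engines, the questions and answers are naturally occurring rather than crowd-sourced, which is what gives the collection its scale.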
Problem

Research questions and friction points this paper is trying to address.

Existing multilingual QA datasets offer limited language coverage, annotation quality, and scale for dense retrieval.
Multilingual dense retrievers lack large-scale, naturally occurring in-domain QA data for fine-tuning.
Bilingual parallel corpora with reliable translation quality are scarce across many language pairs.
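Fine-tuning a dense retriever on QA pairs, as the second point describes, typically means training a bi-encoder with an in-batch contrastive loss, where the other answers in a batch serve as negatives. A toy sketch of that objective (3-dim made-up embeddings, plain Python instead of a real training framework, and not the paper's actual training code):

```python
import math

# Toy illustration of the in-batch contrastive objective commonly used
# to fine-tune dense retrievers such as an XLM-RoBERTa bi-encoder.
# Real encoders emit hundreds of dimensions; these vectors are invented.

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def in_batch_loss(q_embs, a_embs):
    """Mean negative log-likelihood of each question matching its own
    answer, with the batch's other answers acting as negatives."""
    total = 0.0
    for i, q in enumerate(q_embs):
        scores = [dot(q, a) for a in a_embs]
        log_z = math.log(sum(math.exp(s) for s in scores))
        total += log_z - scores[i]  # -log softmax at the gold index
    return total / len(q_embs)

questions = [[1.0, 0.0, 0.2], [0.0, 1.0, 0.1]]
answers   = [[0.9, 0.1, 0.0], [0.1, 0.9, 0.0]]
loss = in_batch_loss(questions, answers)
```

Minimizing this loss pulls each question's embedding toward its own answer and away from the others, which is what produces the retrieval gains the summary reports.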
Innovation

Methods, ideas, or system contributions that make the work stand out.

Large-scale multilingual QA dataset construction from schema.org FAQ annotations
Fine-tuning an in-domain pretrained XLM-RoBERTa model for retrieval performance gains
Automated bilingual corpus generation via bitext mining with LLM-assessed translation evaluation
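The bitext mining behind the bilingual corpora scores candidate sentence pairs by embedding similarity; a standard criterion is the ratio-margin score of Artetxe and Schwenk (2019), which normalizes cosine similarity by each sentence's nearest-neighbor similarities. A toy sketch of that scoring (2-dim invented embeddings; the paper names only "state-of-the-art bitext mining", so treat this as one plausible instance):

```python
import math

# Sketch of the ratio-margin criterion used in modern bitext mining
# (Artetxe & Schwenk, 2019). The 2-dim vectors below stand in for
# multilingual sentence embeddings and are invented for illustration.

def cos(u, v):
    num = sum(a * b for a, b in zip(u, v))
    den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return num / den

def margin_score(x, y, tgt_pool, src_pool, k=2):
    """cos(x, y) divided by the mean similarity of x and y to their k
    nearest neighbors; scores well above 1 suggest a translation pair."""
    def knn_term(v, pool):
        sims = sorted((cos(v, p) for p in pool), reverse=True)[:k]
        return sum(sims) / (2 * len(sims))
    return cos(x, y) / (knn_term(x, tgt_pool) + knn_term(y, src_pool))

src = [[1.0, 0.0], [0.0, 1.0]]   # "source-language" sentence embeddings
tgt = [[0.9, 0.1], [0.1, 0.9]]   # "target-language" sentence embeddings
match = margin_score(src[0], tgt[0], tgt, src)
mismatch = margin_score(src[0], tgt[1], tgt, src)
```

The margin normalization suppresses "hub" sentences that are similar to everything, and the LLM-assessed translation evaluation the summary describes would then act as a second filter on the mined pairs.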