MGen: Millions of Naturally Occurring Generics in Context

📅 2025-09-30
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Existing research on generic sentences is hindered by the absence of large-scale, diverse corpora of naturally occurring generics. To address this, we introduce MGen, the first large-scale corpus of naturally occurring generics, comprising over 4 million sentences that cover 11 quantifiers, all extracted with their original long-text contexts such as full web pages and academic papers. Methodologically, we propose a hybrid rule- and model-based pipeline for automatic extraction and cleaning that preserves the original discourse context and enables fine-grained annotation of quantifier types. MGen substantially enhances lexical, syntactic, and pragmatic diversity while improving ecological validity; linguistic analysis reveals that generics can be long sentences (averaging over 16 words) and are frequently used to express generalizations about people. Publicly released, MGen constitutes the largest and most diverse resource of naturally occurring generics to date, enabling robust research in genericity identification, language modeling, and genericity quantification.

📝 Abstract
MGen is a dataset of over 4 million naturally occurring generic and quantified sentences extracted from diverse textual sources. Sentences in the dataset come with long context documents, drawn from websites and academic papers, and cover 11 different quantifiers. We analyze the features of generic sentences in the dataset and report two findings: generics can be long sentences (averaging over 16 words), and speakers often use them to express generalisations about people. MGen is the biggest and most diverse dataset of naturally occurring generic sentences, opening the door to large-scale computational research on genericity. It is publicly available at https://gustavocilleruelo.com/mgen
Problem

Research questions and friction points this paper is trying to address.

Collects 4 million naturally occurring generic sentences from diverse sources
Analyzes linguistic features of generics across 11 quantifiers and contexts
Enables large-scale computational research on genericity with a public dataset
Innovation

Methods, ideas, or system contributions that make the work stand out.

Extracted millions of generic sentences from diverse texts
Included long context documents from websites and papers
Covered 11 different quantifiers for comprehensive analysis
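
As a rough illustration of what a rule-based first pass over quantified sentences might look like, here is a minimal sketch. It is not the authors' pipeline: the quantifier list, the function names `label_quantifier` and `mean_length`, and the sentence-initial matching rule are all assumptions made for this example, and the page does not enumerate the 11 quantifiers MGen actually covers.

```python
import re
from typing import Iterable, Optional

# Hypothetical quantifier list, for illustration only. The source says MGen
# covers 11 quantifiers but does not list them on this page.
QUANTIFIERS = ["all", "most", "many", "some", "few", "no"]

# Match one of the quantifiers as a whole word at the start of the sentence.
_PATTERN = re.compile(r"^(" + "|".join(QUANTIFIERS) + r")\b", re.IGNORECASE)


def label_quantifier(sentence: str) -> Optional[str]:
    """Return the sentence-initial quantifier in lowercase, or None.

    None only means that no listed quantifier opens the sentence; deciding
    whether such a sentence is a generic would need a further step.
    """
    match = _PATTERN.match(sentence.strip())
    return match.group(1).lower() if match else None


def mean_length(sentences: Iterable[str]) -> float:
    """Average number of whitespace-separated tokens per sentence."""
    lengths = [len(s.split()) for s in sentences]
    return sum(lengths) / len(lengths) if lengths else 0.0


if __name__ == "__main__":
    examples = ["Most birds can fly.", "Birds fly.", "Nobody knows why."]
    for sentence in examples:
        print(sentence, "->", label_quantifier(sentence))
    print("mean length:", mean_length(examples))
```

The word-boundary check keeps "Nobody" from being labelled as "no". A pure surface rule like this cannot tell a bare-plural generic from any other unquantified sentence, which is presumably where a model-based step would come in.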