Automatic Extraction of Clausal Embedding Based on Large-Scale English Text Data

📅 2025-06-16
🤖 AI Summary
Research on English embedded clauses has largely relied on schematically constructed examples rather than large-scale, naturally occurring data. This work presents a method for detecting and annotating embedded clauses in natural text by combining constituency parsing with a set of parsing heuristics. The extraction tool is evaluated on the Golden Embedded Clause Set (GECS), a hand-annotated dataset of naturally occurring English sentences containing embedded clauses. The authors then apply the tool to the open-source Dolma corpus to produce a large-scale dataset of naturally occurring embedded clauses, giving linguists access to statistical information and attested examples of these constructions.

📝 Abstract
For linguists, embedded clauses have been of special interest because of their intricate distribution of syntactic and semantic features. Yet, current research relies on schematically created language examples to investigate these constructions, missing out on statistical information and naturally-occurring examples that can be gained from large language corpora. Thus, we present a methodological approach for detecting and annotating naturally-occurring examples of English embedded clauses in large-scale text data using constituency parsing and a set of parsing heuristics. Our tool has been evaluated on our dataset Golden Embedded Clause Set (GECS), which includes hand-annotated examples of naturally-occurring English embedded clause sentences. Finally, we present a large-scale dataset of naturally-occurring English embedded clauses which we have extracted from the open-source corpus Dolma using our extraction tool.
Problem

Research questions and friction points this paper is trying to address.

Detect English embedded clauses in large text data
Overcome reliance on artificial language examples
Provide statistical insights from natural language corpora
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses constituency parsing for clause detection
Applies parsing heuristics for annotation
Extracts clauses from large-scale text data
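
As a rough illustration of the idea in the list above, the sketch below reads a Penn Treebank style bracketed parse and reports each `SBAR` node that sits directly under a `VP`, together with that `VP`'s head verb. Everything here is an assumption made for illustration: the parse string is hand-written rather than produced by a parser, the label names follow Penn Treebank conventions, and the single "SBAR directly under VP" rule is a stand-in for the heuristics the paper describes, which are not specified on this page. The helper names `Node`, `parse_brackets`, and `embedded_clauses` are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A constituency-tree node; children are Node objects or token strings."""
    label: str
    children: list = field(default_factory=list)

    def leaves(self):
        """Return the tokens under this node in left-to-right order."""
        out = []
        for child in self.children:
            if isinstance(child, Node):
                out.extend(child.leaves())
            else:
                out.append(child)
        return out


def parse_brackets(text):
    """Parse a bracketed tree such as "(S (NP (NNP Mary)) ...)" into Nodes."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read():
        nonlocal pos
        assert tokens[pos] == "(", "expected an opening bracket"
        pos += 1
        node = Node(tokens[pos])  # the constituent label
        pos += 1
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                node.children.append(read())
            else:
                node.children.append(tokens[pos])  # a word token
                pos += 1
        pos += 1  # consume the closing bracket
        return node

    return read()


def embedded_clauses(tree):
    """Collect (verb, clause_text) for every SBAR that is a direct child of a VP.

    Assumed heuristic for illustration only, not the paper's rule set.
    """
    results = []

    def visit(node):
        if node.label == "VP":
            # First VB* child is taken as the embedding verb (assumption).
            verb = next(
                (c.leaves()[0] for c in node.children
                 if isinstance(c, Node) and c.label.startswith("VB")),
                None,
            )
            for child in node.children:
                if isinstance(child, Node) and child.label == "SBAR":
                    results.append((verb, " ".join(child.leaves())))
        for child in node.children:
            if isinstance(child, Node):
                visit(child)

    visit(tree)
    return results


# Hand-written parse, standing in for real parser output.
PARSE = ("(S (NP (NNP Mary)) (VP (VBZ knows) "
         "(SBAR (IN that) (S (NP (NNP John)) (VP (VBD left))))))")

print(embedded_clauses(parse_brackets(PARSE)))
```

A rule this simple deliberately skips `SBAR` nodes attached to noun phrases (relative clauses), which is one reason a practical tool would need a richer set of heuristics than this single check.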
Iona Carslaw
School of Informatics, University of Edinburgh
Sivan Milton
School of Informatics, University of Edinburgh
Nicolas Navarre
School of Informatics, University of Edinburgh
Ciyang Qing
Department of Linguistics and English Language, University of Edinburgh
Wataru Uegaki
School of Philosophy, Psychology & Language Sciences, University of Edinburgh