Automatic Extraction of Clausal Embedding Based on Large-Scale English Text Data

📅 2025-06-16

🤖 AI Summary

Existing research on English embedded clauses relies heavily on manually constructed examples and so misses the statistical information and naturally occurring examples that large corpora can provide. This work presents a method for automatically detecting and annotating embedded clauses in large-scale English text by combining constituency parsing with a set of parsing heuristics. The tool is evaluated on the Golden Embedded Clause Set (GECS), a hand-annotated dataset of naturally occurring English sentences containing embedded clauses. The authors then apply the tool to the open-source corpus Dolma and present the resulting large-scale dataset of naturally occurring embedded clauses, which is intended to support empirical linguistic research on embedded structures.

📝 Abstract

For linguists, embedded clauses have been of special interest because of their intricate distribution of syntactic and semantic features. Yet, current research relies on schematically created language examples to investigate these constructions, missing out on statistical information and naturally-occurring examples that can be gained from large language corpora. Thus, we present a methodological approach for detecting and annotating naturally-occurring examples of English embedded clauses in large-scale text data using constituency parsing and a set of parsing heuristics. Our tool has been evaluated on our dataset Golden Embedded Clause Set (GECS), which includes hand-annotated examples of naturally-occurring English embedded clause sentences. Finally, we present a large-scale dataset of naturally-occurring English embedded clauses which we have extracted from the open-source corpus Dolma using our extraction tool.

Problem

Research questions and friction points this paper is trying to address.

Detect English embedded clauses in large text data

Overcome reliance on artificial language examples

Provide statistical insights from natural language corpora

Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses constituency parsing for clause detection

Applies parsing heuristics for annotation

Extracts clauses from large-scale text data
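
To make the idea of combining constituency parses with heuristics concrete, here is a minimal sketch. It is not the authors' tool or rule set: it assumes the parser has already produced a Penn Treebank-style bracketed string, and it uses a single hypothetical heuristic (an SBAR node whose immediate parent is a VP counts as an embedded complement clause). The function names `parse_bracketed`, `leaves`, and `embedded_clauses` are illustrative only.

```python
import re


def parse_bracketed(text):
    """Parse a Penn Treebank-style bracketed string into (label, children) tuples."""
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def read():
        nonlocal pos
        assert tokens[pos] == "("
        pos += 1
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read())
            else:
                children.append(tokens[pos])  # a leaf word
                pos += 1
        pos += 1  # consume the closing ")"
        return (label, children)

    return read()


def leaves(node):
    """Return the words under a node, left to right."""
    if isinstance(node, str):
        return [node]
    words = []
    for child in node[1]:
        words.extend(leaves(child))
    return words


def embedded_clauses(node, parent_label=None):
    """Yield the text of every SBAR node whose immediate parent is a VP.

    Assumption: this one rule stands in for a fuller heuristic set.
    """
    if isinstance(node, str):
        return
    label, children = node
    if label == "SBAR" and parent_label == "VP":
        yield " ".join(leaves(node))
    for child in children:
        yield from embedded_clauses(child, label)


# Hand-written example parse (not taken from the paper or from Dolma).
PARSE = (
    "(S (NP (NNP Mary)) (VP (VBZ thinks) "
    "(SBAR (IN that) (S (NP (NNP John)) (VP (VBD left))))))"
)
clauses = list(embedded_clauses(parse_bracketed(PARSE)))
print(clauses)
```

Under this rule a relative clause attached to a noun phrase (an SBAR under NP) is skipped, which illustrates why a realistic extractor would need several heuristics rather than one.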
