🤖 AI Summary
Problem: Existing dependency annotations of CHILDES data are heterogeneous and limited in scale, which hinders standardized dependency parsing and research on child language acquisition. Method: The work follows a "gold + silver" annotation paradigm: (1) harmonizing and validating over 48k previously dependency-annotated sentences from 11 children and their caregivers under unified Universal Dependencies (UD) v2 guidelines, covering both child and caregiver utterances; and (2) automatically generating an additional 1M silver-standard sentences via rule-based and model-based approaches. The pipeline also includes transcription alignment, noise cleaning, and dependency consistency checks across speakers and source corpora. Contribution/Results: The result, UD-English-CHILDES, is the first officially released UD treebank derived from CHILDES data. It gives child language research a consistently annotated UD v2 resource at scale, supporting child language parsing, acquisition modeling, and cross-lingual dependency transfer research.
📝 Abstract
CHILDES is a widely used resource of transcribed child and child-directed speech. This paper introduces UD-English-CHILDES, the first officially released Universal Dependencies (UD) treebank derived from previously dependency-annotated CHILDES data with consistent and unified annotation guidelines. Our corpus harmonizes annotations from 11 children and their caregivers, totaling over 48k sentences. We validate existing gold-standard annotations under the UD v2 framework and provide an additional 1M silver-standard sentences, offering a consistent resource for computational and linguistic research.
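UD treebanks are distributed in the tab-separated CoNLL-U format, so a reader for that format is the usual starting point for working with a resource like this one. The following is a minimal sketch using only the Python standard library. The sample utterance, its annotation, the `sent_id`, and the names `Token` and `read_conllu` are illustrative assumptions and are not taken from UD-English-CHILDES.

```python
from typing import List, NamedTuple


class Token(NamedTuple):
    """A subset of the ten CoNLL-U columns (hypothetical helper type)."""
    id: int
    form: str
    upos: str
    head: int
    deprel: str


# Invented example sentence in CoNLL-U layout; not from the actual treebank.
SAMPLE = (
    "# sent_id = example-1\n"
    "# text = you want juice\n"
    "1\tyou\tyou\tPRON\t_\t_\t2\tnsubj\t_\t_\n"
    "2\twant\twant\tVERB\t_\t_\t0\troot\t_\t_\n"
    "3\tjuice\tjuice\tNOUN\t_\t_\t2\tobj\t_\t_\n"
    "\n"
)


def read_conllu(text: str) -> List[List[Token]]:
    """Parse CoNLL-U text into a list of sentences (lists of tokens)."""
    sentences: List[List[Token]] = []
    current: List[Token] = []
    for line in text.splitlines():
        if not line.strip():
            # A blank line ends the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        if line.startswith("#"):
            # Sentence-level comments such as sent_id and text.
            continue
        cols = line.split("\t")
        if len(cols) != 10:
            raise ValueError(f"expected 10 columns, got {len(cols)}")
        if not cols[0].isdigit():
            # Skip multiword token ranges (e.g. 1-2) and empty nodes (e.g. 1.1).
            continue
        current.append(
            Token(int(cols[0]), cols[1], cols[3], int(cols[6]), cols[7])
        )
    if current:
        sentences.append(current)
    return sentences


sentences = read_conllu(SAMPLE)
root = next(tok for tok in sentences[0] if tok.head == 0)
print(root.form, root.deprel)
```

In this format the root of each dependency tree is the token whose `head` is 0, which is how the sketch locates the main predicate of the utterance.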