🤖 AI Summary
Cross-domain constituency parsing is hindered by the scarcity of multi-domain treebanks. Because LLMs perform poorly when asked to parse constituency structure directly, the paper proposes LLM back generation, which runs roughly in the reverse direction of parsing: given an incomplete cross-domain constituency tree whose only leaf nodes are domain keywords, an LLM fills in the missing words, yielding a controllably generated cross-domain treebank. A span-level contrastive learning pre-training strategy is further introduced so the parser can make full use of the generated treebank. Evaluated on the five target domains of the MCTB benchmark, the approach achieves state-of-the-art average performance against various baselines, supporting both generative treebank construction and span-level contrastive pre-training for cross-domain constituency parsing.
📝 Abstract
Cross-domain constituency parsing remains an unsolved challenge in computational linguistics, since available multi-domain constituency treebanks are limited. In this paper, we investigate automatic treebank generation with large language models (LLMs). Because LLMs perform poorly at constituency parsing itself, we propose a novel treebank generation method, LLM back generation, which resembles the reverse process of constituency parsing. LLM back generation takes an incomplete cross-domain constituency tree with only domain keyword leaf nodes as input and fills in the missing words to generate a cross-domain constituency treebank. In addition, we introduce a span-level contrastive learning pre-training strategy to make full use of the LLM back generation treebank for cross-domain constituency parsing. We verify the effectiveness of our LLM back generation treebank coupled with contrastive learning pre-training on the five target domains of MCTB. Experimental results show that our approach achieves state-of-the-art average performance compared with various baselines.
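To make the back generation idea concrete, here is a minimal sketch of how the model input could be built: leaves of a bracketed constituency tree that are not domain keywords are replaced with a mask token, and the masked tree is wrapped in a fill-in prompt. The function names, the `[MASK]` token, and the prompt wording are illustrative assumptions, not the paper's released code; the actual completion call would go to whatever LLM client the reader uses.

```python
# Sketch of back-generation input construction (assumed details, not the
# paper's implementation). Trees are bracketed strings; non-keyword leaves
# become [MASK] placeholders for the LLM to fill.

MASK = "[MASK]"

def mask_tree(bracketed: str, keywords: set[str]) -> str:
    """Replace every leaf token that is not a domain keyword with MASK.

    In a bracketed tree, a leaf is a token immediately followed by a
    closing parenthesis, e.g. the word `stocks` in `(NN stocks)`.
    """
    tokens = bracketed.replace("(", " ( ").replace(")", " ) ").split()
    out = []
    for i, tok in enumerate(tokens):
        is_leaf = tok not in "()" and i + 1 < len(tokens) and tokens[i + 1] == ")"
        out.append(MASK if is_leaf and tok not in keywords else tok)
    return " ".join(out)

def build_prompt(masked_tree: str) -> str:
    """Wrap the masked tree in a structure-preserving fill-in instruction."""
    return (
        "Fill every [MASK] leaf in this constituency tree with a single word "
        "so the sentence is fluent, without changing any brackets or labels:\n"
        + masked_tree
    )

# Example: keep the finance keyword `stocks`, let the LLM regenerate the rest.
tree = "(S (NP (NN stocks)) (VP (VBD fell) (ADVP (RB sharply))))"
print(build_prompt(mask_tree(tree, {"stocks"})))
```

Keeping the brackets and labels fixed while regenerating only the leaves is what makes the output a tree with known gold structure: the syntax comes from the skeleton, and only the lexical items are generated.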
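The abstract does not spell out the contrastive objective, so the following is one plausible instantiation rather than the paper's exact formulation: spans are represented by endpoint differences (a common choice in span-based parsers), two encoder views of the same gold span form a positive pair, and all other spans in the batch serve as negatives under an InfoNCE-style loss. The pairing scheme, the span representation, and all names here are assumptions.

```python
# Hedged sketch of span-level contrastive pre-training (InfoNCE-style).
# Pairing scheme and span representation are assumptions for illustration.
import torch
import torch.nn.functional as F

def span_reps(hidden: torch.Tensor, spans: torch.Tensor) -> torch.Tensor:
    """Represent span (i, j) by the endpoint difference h[j] - h[i]."""
    return hidden[spans[:, 1]] - hidden[spans[:, 0]]

def span_contrastive_loss(view_a, view_b, spans, temperature=0.1):
    """InfoNCE over spans: the i-th span in view_a should match the i-th
    span in view_b; every other span in the batch acts as a negative."""
    za = F.normalize(span_reps(view_a, spans), dim=-1)  # (n_spans, d)
    zb = F.normalize(span_reps(view_b, spans), dim=-1)
    logits = za @ zb.T / temperature                    # pairwise similarities
    targets = torch.arange(za.size(0))                  # diagonal = positives
    return F.cross_entropy(logits, targets)

# Toy usage: 16 hidden states of size 64, e.g. from two dropout passes
# over the same sentence, plus gold spans from the generated treebank.
hidden_a, hidden_b = torch.randn(16, 64), torch.randn(16, 64)
gold_spans = torch.tensor([[0, 3], [3, 7], [7, 12], [0, 12]])
print(span_contrastive_loss(hidden_a, hidden_b, gold_spans).item())
```

Pre-training a span encoder this way pushes representations of gold constituents together and apart from other spans, which is how the generated treebank's structural signal can be exploited before standard supervised parsing fine-tuning.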