Contrastive Learning on LLM Back Generation Treebank for Cross-domain Constituency Parsing

📅 2025-05-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
Cross-domain constituency parsing is hindered by the scarcity of multi-domain treebanks. To address this, we propose a novel LLM-based paradigm for reverse treebank generation: given incomplete phrase-structure trees annotated with domain-specific keywords, a large language model (LLM) completes the lexical items while preserving syntactic structure, yielding high-quality, controllably generated cross-domain treebanks. We further introduce phrase-span-level contrastive pre-training to enhance syntactic structure awareness. Our method reformulates syntactic parsing in reverse as structure-guided text completion, enabling, for the first time, controllable treebank generation with explicit grammatical consistency optimization. Evaluated on the MCTB benchmark across five domains, our approach achieves state-of-the-art average performance, significantly outperforming supervised transfer and data-augmentation baselines. This validates the efficacy of both generative treebank construction and structured pre-training for cross-domain constituency parsing.
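The back generation setup lends itself to a short illustration. Below is a minimal sketch of how the incomplete-tree input could be constructed, assuming NLTK-style bracketed trees; the keyword set, mask token, and prompt wording are illustrative assumptions, not the paper's exact implementation.

```python
# A minimal sketch of LLM back generation input construction, assuming
# NLTK-style bracketed trees; the mask token and prompt wording are
# hypothetical stand-ins for the paper's actual setup.
from nltk import Tree

MASK = "[MASK]"

def mask_tree(tree: Tree, domain_keywords: set[str]) -> Tree:
    """Keep the full syntactic skeleton, but blank out every leaf
    that is not a domain keyword."""
    masked = tree.copy(deep=True)
    for pos in masked.treepositions("leaves"):
        if masked[pos].lower() not in domain_keywords:
            masked[pos] = MASK
    return masked

def back_generation_prompt(masked: Tree) -> str:
    """Turn the masked tree into a fill-in instruction for the LLM."""
    skeleton = masked.pformat(margin=10**6)  # single-line bracketing
    return (
        "Complete the sentence by replacing every [MASK] leaf with one "
        "word, without changing the bracketed constituency structure:\n"
        f"{skeleton}"
    )

tree = Tree.fromstring(
    "(S (NP (DT the) (NN patient)) "
    "(VP (VBD reported) (NP (JJ severe) (NN dizziness))))"
)
print(back_generation_prompt(mask_tree(tree, {"patient", "dizziness"})))
```

Because the bracketed skeleton is fixed, the LLM only chooses lexical items, which is what makes the generated treebank controllable: the gold parse of each output sentence is known by construction.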

📝 Abstract
Cross-domain constituency parsing remains an unsolved challenge in computational linguistics because available multi-domain constituency treebanks are limited. We investigate automatic treebank generation by large language models (LLMs) in this paper. Since LLMs perform poorly on constituency parsing itself, we propose a novel treebank generation method, LLM back generation, which resembles the reverse process of constituency parsing. LLM back generation takes an incomplete cross-domain constituency tree with only domain-keyword leaf nodes as input and fills in the missing words to generate a cross-domain constituency treebank. In addition, we introduce a span-level contrastive learning pre-training strategy to make full use of the LLM back generation treebank for cross-domain constituency parsing. We verify the effectiveness of our LLM back generation treebank coupled with contrastive learning pre-training on five target domains of MCTB. Experimental results show that our approach achieves state-of-the-art average performance compared with various baselines.
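For the span-level contrastive pre-training, the following is a minimal sketch of one plausible instantiation, assuming boundary-difference span features and SimCSE-style dropout views as positive pairs; the paper's actual objective may select positives and negatives differently.

```python
# A minimal sketch of a span-level contrastive objective in PyTorch,
# assuming span representations built by subtracting boundary hidden
# states (a common parser featurization) and two dropout-perturbed
# encoder passes as the two "views". Illustrative, not the paper's
# exact pre-training loss.
import torch
import torch.nn.functional as F

def span_reprs(hidden: torch.Tensor, spans: torch.Tensor) -> torch.Tensor:
    """hidden: (seq_len, d); spans: (n_spans, 2) of [start, end) offsets.
    Represent each span as the difference of its boundary states."""
    return hidden[spans[:, 1] - 1] - hidden[spans[:, 0]]

def span_contrastive_loss(view_a: torch.Tensor,
                          view_b: torch.Tensor,
                          temperature: float = 0.1) -> torch.Tensor:
    """InfoNCE: the same span under two encoder views is the positive
    pair; every other span in the batch is a negative."""
    a = F.normalize(view_a, dim=-1)
    b = F.normalize(view_b, dim=-1)
    logits = a @ b.t() / temperature          # (n_spans, n_spans)
    targets = torch.arange(a.size(0))         # diagonal = positives
    return F.cross_entropy(logits, targets)

# Toy usage: 5 spans over a 12-token sentence, 16-dim encoder states.
h1, h2 = torch.randn(12, 16), torch.randn(12, 16)   # two dropout views
spans = torch.tensor([[0, 3], [3, 7], [7, 12], [0, 7], [3, 12]])
loss = span_contrastive_loss(span_reprs(h1, spans), span_reprs(h2, spans))
print(loss.item())
```

Pulling together representations of the same gold span while pushing apart other spans gives the encoder an explicit signal about constituent boundaries, which is the structure awareness the pre-training stage is meant to instill.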
Problem

Research questions and friction points this paper is trying to address.

Addressing limited multi-domain constituency treebank availability
Improving LLM performance on constituency parsing tasks
Enhancing cross-domain parsing via contrastive learning
Innovation

Methods, ideas, or system contributions that make the work stand out.

LLM back generation for treebank creation
Span-level contrastive learning pre-training
Cross-domain parsing with incomplete trees
Peiming Guo
PhD Student, Harbin Institute of Technology (Shenzhen)
Natural Language Processing · Large Language Model · Code Intelligence
Meishan Zhang
Associate Professor, Harbin Institute of Technology at Shenzhen
Natural Language Processing · Computational Linguistics · Syntax Parsing · Sentiment Analysis · Machine
Jianling Li
School of New Media and Communication, Tianjin University, China
Min Zhang
Institute of Computing and Intelligence, Harbin Institute of Technology (Shenzhen), China
Yue Zhang
School of Engineering, Westlake University, China