Contrastive Learning on LLM Back Generation Treebank for Cross-domain Constituency Parsing

📅 2025-05-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
Cross-domain constituency parsing is hindered by the scarcity of multi-domain treebanks. To address this, we propose a novel LLM-based paradigm for reverse treebank generation: given incomplete phrase-structure trees annotated with domain-specific keywords, a large language model (LLM) completes the lexical items while preserving syntactic structure, yielding high-quality, controllably generated cross-domain treebanks. We further introduce phrase-span-level contrastive pre-training to enhance syntactic structure awareness. Our method reformulates syntactic parsing in reverse as structure-guided text completion, enabling, for the first time, controllable treebank generation with explicit grammatical consistency optimization. Evaluated on the MCTB benchmark across five domains, our approach achieves state-of-the-art average performance, significantly outperforming supervised transfer and data-augmentation baselines. This validates the efficacy of both generative treebank construction and structured pre-training for cross-domain constituency parsing.
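The back generation setup lends itself to a short illustration. Below is a minimal sketch of how the incomplete-tree input could be constructed, assuming NLTK-style bracketed trees; the keyword set, mask token, and prompt wording are illustrative assumptions, not the paper's exact implementation.

```python
# A minimal sketch of LLM back generation input construction, assuming
# NLTK-style bracketed trees; the mask token and prompt wording are
# hypothetical stand-ins for the paper's actual setup.
from nltk import Tree

MASK = "[MASK]"

def mask_tree(tree: Tree, domain_keywords: set[str]) -> Tree:
    """Keep the full syntactic skeleton, but blank out every leaf
    that is not a domain keyword."""
    masked = tree.copy(deep=True)
    for pos in masked.treepositions("leaves"):
        if masked[pos].lower() not in domain_keywords:
            masked[pos] = MASK
    return masked

def back_generation_prompt(masked: Tree) -> str:
    """Turn the masked tree into a fill-in instruction for the LLM."""
    skeleton = masked.pformat(margin=10**6)  # single-line bracketing
    return (
        "Complete the sentence by replacing every [MASK] leaf with one "
        "word, without changing the bracketed constituency structure:\n"
        f"{skeleton}"
    )

tree = Tree.fromstring(
    "(S (NP (DT the) (NN patient)) "
    "(VP (VBD reported) (NP (JJ severe) (NN dizziness))))"
)
print(back_generation_prompt(mask_tree(tree, {"patient", "dizziness"})))
```

Because the bracketed skeleton is fixed, the LLM only chooses lexical items, which is what makes the generated treebank controllable: the gold parse of each output sentence is known by construction.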

📝 Abstract
Cross-domain constituency parsing remains an unsolved challenge in computational linguistics because available multi-domain constituency treebanks are limited. We investigate automatic treebank generation by large language models (LLMs) in this paper. Since LLMs perform poorly on constituency parsing itself, we propose a novel treebank generation method, LLM back generation, which resembles the reverse process of constituency parsing. LLM back generation takes an incomplete cross-domain constituency tree with only domain-keyword leaf nodes as input and fills in the missing words to generate a cross-domain constituency treebank. In addition, we introduce a span-level contrastive learning pre-training strategy to make full use of the LLM back generation treebank for cross-domain constituency parsing. We verify the effectiveness of our LLM back generation treebank coupled with contrastive learning pre-training on five target domains of MCTB. Experimental results show that our approach achieves state-of-the-art average performance compared with various baselines.
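For the span-level contrastive pre-training, the following is a minimal sketch of one plausible instantiation, assuming boundary-difference span features and SimCSE-style dropout views as positive pairs; the paper's actual objective may select positives and negatives differently.

```python
# A minimal sketch of a span-level contrastive objective in PyTorch,
# assuming span representations built by subtracting boundary hidden
# states (a common parser featurization) and two dropout-perturbed
# encoder passes as the two "views". Illustrative, not the paper's
# exact pre-training loss.
import torch
import torch.nn.functional as F

def span_reprs(hidden: torch.Tensor, spans: torch.Tensor) -> torch.Tensor:
    """hidden: (seq_len, d); spans: (n_spans, 2) of [start, end) offsets.
    Represent each span as the difference of its boundary states."""
    return hidden[spans[:, 1] - 1] - hidden[spans[:, 0]]

def span_contrastive_loss(view_a: torch.Tensor,
                          view_b: torch.Tensor,
                          temperature: float = 0.1) -> torch.Tensor:
    """InfoNCE: the same span under two encoder views is the positive
    pair; every other span in the batch is a negative."""
    a = F.normalize(view_a, dim=-1)
    b = F.normalize(view_b, dim=-1)
    logits = a @ b.t() / temperature          # (n_spans, n_spans)
    targets = torch.arange(a.size(0))         # diagonal = positives
    return F.cross_entropy(logits, targets)

# Toy usage: 5 spans over a 12-token sentence, 16-dim encoder states.
h1, h2 = torch.randn(12, 16), torch.randn(12, 16)   # two dropout views
spans = torch.tensor([[0, 3], [3, 7], [7, 12], [0, 7], [3, 12]])
loss = span_contrastive_loss(span_reprs(h1, spans), span_reprs(h2, spans))
print(loss.item())
```

Pulling together representations of the same gold span while pushing apart other spans gives the encoder an explicit signal about constituent boundaries, which is the structure awareness the pre-training stage is meant to instill.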
Problem

Research questions and friction points this paper is trying to address.

Addressing limited multi-domain constituency treebank availability
Improving LLM performance on constituency parsing tasks
Enhancing cross-domain parsing via contrastive learning
Innovation

Methods, ideas, or system contributions that make the work stand out.

LLM back generation for treebank creation
Span-level contrastive learning pre-training
Cross-domain parsing with incomplete trees
Peiming Guo
PhD Student, Harbin Institute of Technology (Shenzhen)
Natural Language Processing · Large Language Model · Code Intelligence
Meishan Zhang
Associate Professor, Harbin Institute of Technology at Shenzhen
Natural Language Processing · Computational Linguistics · Syntax Parsing · Sentiment Analysis · Machine
Jianling Li
School of New Media and Communication, Tianjin University, China
Min Zhang
Institute of Computing and Intelligence, Harbin Institute of Technology (Shenzhen), China
Yue Zhang
School of Engineering, Westlake University, China