Parsing the Switch: LLM-Based UD Annotation for Complex Code-Switched and Low-Resource Languages

📅 2025-06-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
Low-resource code-switched language pairs (e.g., Spanish–English, Spanish–Guarani) suffer degraded dependency parsing performance because annotated data is scarce. To address this, we propose the BiLingua Parser, an LLM-driven few-shot prompting framework for Universal Dependencies (UD) annotation of code-switched text. It combines cross-lingual dependency structure modeling with expert-in-the-loop verification. We release the first UD treebank for Spanish–Guarani, enabling systematic analysis of syntactic patterns at code-switching points and their contextual constraints. After human post-editing, the parser achieves a labeled attachment score (LAS) of 95.29%, substantially outperforming state-of-the-art multilingual parsers. All datasets and source code are publicly released, providing a reusable methodology and infrastructure for building syntactic resources for low-resource code-switched languages.

📝 Abstract
Code-switching presents a complex challenge for syntactic analysis, especially in low-resource language settings where annotated data is scarce. While recent work has explored the use of large language models (LLMs) for sequence-level tagging, few approaches systematically investigate how well these models capture syntactic structure in code-switched contexts. Moreover, existing parsers trained on monolingual treebanks often fail to generalize to multilingual and mixed-language input. To address this gap, we introduce the BiLingua Parser, an LLM-based annotation pipeline designed to produce Universal Dependencies (UD) annotations for code-switched text. First, we develop a prompt-based framework for Spanish-English and Spanish-Guaraní data, combining few-shot LLM prompting with expert review. Second, we release two annotated datasets, including the first Spanish-Guaraní UD-parsed corpus. Third, we conduct a detailed syntactic analysis of switch points across language pairs and communicative contexts. Experimental results show that BiLingua Parser achieves up to 95.29% LAS after expert revision, significantly outperforming prior baselines and multilingual parsers. These results show that LLMs, when carefully guided, can serve as practical tools for bootstrapping syntactic resources in under-resourced, code-switched environments. Data and source code are available at https://github.com/N3mika/ParsingProject
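For readers unfamiliar with the reported metric, the labeled attachment score (LAS) is the fraction of tokens whose predicted head index *and* dependency label both match the gold annotation. A minimal sketch, using a hypothetical toy code-switched clause (the token triples below are illustrative, not from the released corpus):

```python
# Labeled attachment score (LAS): a token counts as correct only when both
# its predicted head index and its dependency label match the gold tree.

def las(gold, pred):
    """Return LAS over aligned (head, deprel) sequences for one sentence."""
    assert len(gold) == len(pred), "sequences must be token-aligned"
    correct = sum(
        1 for (gh, gl), (ph, pl) in zip(gold, pred) if gh == ph and gl == pl
    )
    return correct / len(gold)

# Toy clause "Ella está coding": (head index, dependency label) per token.
gold = [(2, "nsubj"), (3, "aux"), (0, "root")]
pred = [(2, "nsubj"), (3, "aux"), (0, "root")]
print(las(gold, pred))  # 1.0

# One wrong head attachment drops the score to 2/3.
pred_bad = [(3, "nsubj"), (3, "aux"), (0, "root")]
print(round(las(gold, pred_bad), 3))  # 0.667
```

A corpus-level LAS, as reported in the abstract, is the same ratio computed over all tokens in all sentences rather than averaged per sentence.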
Problem

Research questions and friction points this paper is trying to address.

Analyzing syntactic structure in code-switched low-resource languages
Improving multilingual parser performance on mixed-language input
Generating Universal Dependencies annotations for code-switched text
Innovation

Methods, ideas, or system contributions that make the work stand out.

LLM-based prompt framework for code-switched text
First Spanish-Guarani UD-parsed corpus released
Combines few-shot LLM prompting with expert review
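The prompting idea behind these contributions can be sketched as follows: assemble a prompt containing a few UD-annotated in-context examples, send it to an LLM, and parse the tab-separated reply into dependency rows for expert review. The example sentence, prompt wording, column layout, and the stubbed `call_llm()` are all assumptions for illustration; the paper's actual prompts and model setup are in the released repository.

```python
# Sketch of few-shot LLM prompting for UD annotation of code-switched text.
# FEW_SHOT, the prompt wording, and call_llm() are illustrative assumptions.

FEW_SHOT = """\
Sentence: Ella quiere un sandwich.
1\tElla\tnsubj\t2
2\tquiere\troot\t0
3\tun\tdet\t4
4\tsandwich\tobj\t2
"""

def build_prompt(sentence: str) -> str:
    """Prepend annotated in-context examples to the target sentence."""
    return (
        "Annotate the code-switched sentence with Universal Dependencies.\n"
        "Output one token per line: ID, FORM, DEPREL, HEAD.\n\n"
        f"{FEW_SHOT}\nSentence: {sentence}\n"
    )

def parse_reply(reply: str):
    """Turn tab-separated reply lines into (id, form, deprel, head) rows."""
    rows = []
    for line in reply.strip().splitlines():
        idx, form, deprel, head = line.split("\t")
        rows.append((int(idx), form, deprel, int(head)))
    return rows

def call_llm(prompt: str) -> str:
    # Stub standing in for a real LLM API call; returns a canned reply.
    return "1\tElla\tnsubj\t2\n2\tquiere\troot\t0\n3\tcode\tobj\t2"

tree = parse_reply(call_llm(build_prompt("Ella quiere code.")))
print(tree[0])  # (1, 'Ella', 'nsubj', 2)
```

In the pipeline described above, the parsed rows would then go to a human expert for verification and post-editing before entering the treebank.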