🤖 AI Summary
This work addresses the longstanding scarcity of high-quality syntactic resources for African languages in natural language processing by introducing AfriSUD, the first large-scale collection of dependency treebanks covering nine Sub-Saharan African languages. Built in a community-led effort within the Surface-Syntactic Universal Dependencies (SUD) framework and verified by native speakers, AfriSUD captures key typological features such as agglutination and tone. The authors evaluate non-transformer baselines, multilingual pretrained encoders, and large language models on part-of-speech tagging and dependency parsing, and find that all of them show clear limitations across the nine languages. The findings point to a significant "syntax gap" and position AfriSUD as a unified, high-quality benchmark for future research on these underrepresented languages.
📝 Abstract
Despite their linguistic diversity and global significance, African languages remain underrepresented in NLP research and resources. We aim to bridge this gap by introducing AfriSUD, the first large-scale collection of syntactically annotated treebanks for nine diverse African languages spanning major language families and regions across Sub-Saharan Africa. Using the Surface-Syntactic Universal Dependencies (SUD) framework, our community-led effort provides high-quality, native-speaker-verified data that capture key typological features such as agglutination and tone. We evaluate a range of models on AfriSUD for part-of-speech tagging and dependency parsing, including non-transformer baselines, multilingual pretrained encoders, and LLMs. Our results reveal a significant syntax gap: models still show clear limitations across the nine languages, suggesting that existing architectures may not fully capture the structural diversity of African-language syntax.
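The abstract does not say which metrics are reported. As an illustration only, the sketch below assumes the measures commonly used for these two tasks: tagging accuracy for part-of-speech tagging, and unlabeled and labeled attachment scores (UAS, LAS) for dependency parsing. The function name `score_sentence`, the token layout, and the toy sentence are hypothetical and are not taken from AfriSUD; the relation labels follow SUD naming (`subj`, `comp:obj`, `mod`).

```python
from typing import Dict, List, Tuple

# One token as (upos, head_index, relation); head_index 0 marks the root.
# This layout is an assumption made for the sketch, not the AfriSUD format.
Token = Tuple[str, int, str]


def score_sentence(gold: List[Token], pred: List[Token]) -> Dict[str, float]:
    """Compute POS accuracy, UAS, and LAS for one sentence."""
    if len(gold) != len(pred):
        raise ValueError("gold and predicted sentences must have the same length")
    if not gold:
        raise ValueError("cannot score an empty sentence")

    n = len(gold)
    # POS accuracy: fraction of tokens with the correct tag.
    pos_correct = sum(g[0] == p[0] for g, p in zip(gold, pred))
    # UAS: fraction of tokens attached to the correct head.
    head_correct = sum(g[1] == p[1] for g, p in zip(gold, pred))
    # LAS: fraction of tokens with both the correct head and the correct relation.
    labeled_correct = sum(g[1] == p[1] and g[2] == p[2] for g, p in zip(gold, pred))

    return {
        "upos_acc": pos_correct / n,
        "uas": head_correct / n,
        "las": labeled_correct / n,
    }


# Toy four-token sentence, invented for illustration.
gold = [("DET", 2, "det"), ("NOUN", 3, "subj"), ("VERB", 0, "root"), ("NOUN", 3, "comp:obj")]
# The prediction attaches the last token to the right head,
# but gets its tag and its relation wrong.
pred = [("DET", 2, "det"), ("NOUN", 3, "subj"), ("VERB", 0, "root"), ("VERB", 3, "mod")]

scores = score_sentence(gold, pred)
print(scores)
```

In this toy case every head is right, so UAS is 1.0, while the single wrong relation label lowers LAS to 0.75. A treebank-level score would normally be computed over all tokens in the test set rather than averaged per sentence.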