AfriSUD: A Dependency Treebank Collection for Evaluating Models on African Languages

📅 2026-06-10
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the longstanding scarcity of high-quality syntactic resources for African languages in natural language processing by introducing AfriSUD, the first large-scale collection of dependency treebanks covering nine sub-Saharan African languages. Developed within the Surface-Syntactic Universal Dependencies (SUD) framework and verified by native speakers, AfriSUD captures key typological features such as agglutination and tone. In a community-led effort, the authors evaluate non-transformer baselines, multilingual pretrained encoders, and large language models on part-of-speech tagging and dependency parsing, and find that all of them show clear limitations across the nine languages. The findings point to a significant “syntax gap” and position AfriSUD as a unified benchmark for future research on these underrepresented languages.
📝 Abstract
Despite their linguistic diversity and global significance, African languages remain underrepresented in NLP research and resources. We aim to bridge this gap by introducing AfriSUD, the first large-scale collection of syntactically annotated treebanks for nine diverse African languages spanning major language families and regions across Sub-Saharan Africa. Using the Surface-Syntactic Universal Dependencies (SUD) framework, our community-led effort provides high-quality, native-speaker-verified data that capture key typological features such as agglutination and tone. We evaluate a range of models on AfriSUD for part-of-speech tagging and dependency parsing, including non-transformer baselines, multilingual pretrained encoders, and LLMs. Our results reveal a significant syntax gap: models still show clear limitations across the nine languages, suggesting that existing architectures may not fully capture the structural diversity of African-language syntax.
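To make the two evaluation tasks concrete, here is a minimal sketch of how tagging and parsing output might be scored against a gold treebank. It rests on several assumptions not stated in the abstract: that the data is in the tab-separated CoNLL-U format commonly used for UD and SUD treebanks, and that parsing is scored with unlabeled and labeled attachment scores (UAS and LAS). The function names `read_conllu` and `score` are hypothetical, and the toy English sentence is illustrative only, not AfriSUD data.

```python
from typing import Dict, List, Tuple

# One token reduced to the three fields scored here: (UPOS, HEAD, DEPREL).
Token = Tuple[str, int, str]


def read_conllu(text: str) -> List[List[Token]]:
    """Parse CoNLL-U text into sentences of (upos, head, deprel) tuples."""
    sentences: List[List[Token]] = []
    current: List[Token] = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:  # a blank line ends the current sentence
            if current:
                sentences.append(current)
                current = []
            continue
        if line.startswith("#"):  # sentence-level comment
            continue
        cols = line.split("\t")
        # Skip multiword-token ranges ("1-2") and empty nodes ("1.1").
        if "-" in cols[0] or "." in cols[0]:
            continue
        current.append((cols[3], int(cols[6]), cols[7]))
    if current:
        sentences.append(current)
    return sentences


def score(gold: List[List[Token]], pred: List[List[Token]]) -> Dict[str, float]:
    """Return token-level UPOS accuracy, UAS and LAS (assumed metrics)."""
    if len(gold) != len(pred):
        raise ValueError("gold and predicted sentence counts differ")
    total = pos_ok = uas_ok = las_ok = 0
    for g_sent, p_sent in zip(gold, pred):
        if len(g_sent) != len(p_sent):
            raise ValueError("gold and predicted tokenization differ")
        for (g_pos, g_head, g_rel), (p_pos, p_head, p_rel) in zip(g_sent, p_sent):
            total += 1
            pos_ok += g_pos == p_pos
            head_ok = g_head == p_head
            uas_ok += head_ok                     # correct head
            las_ok += head_ok and g_rel == p_rel  # correct head and label
    if total == 0:
        raise ValueError("no tokens to score")
    return {"upos_acc": pos_ok / total, "uas": uas_ok / total, "las": las_ok / total}


def _row(idx: int, form: str, upos: str, head: int, rel: str) -> str:
    """Build one 10-column CoNLL-U line with unused fields left as '_'."""
    return "\t".join([str(idx), form, "_", upos, "_", "_", str(head), rel, "_", "_"])


# Toy gold and predicted analyses with SUD-style relation labels.
GOLD = "\n".join([
    _row(1, "She", "PRON", 2, "subj"),
    _row(2, "reads", "VERB", 0, "root"),
    _row(3, "books", "NOUN", 2, "comp:obj"),
]) + "\n"
PRED = "\n".join([
    _row(1, "She", "PRON", 2, "subj"),
    _row(2, "reads", "VERB", 0, "root"),
    _row(3, "books", "VERB", 2, "mod"),
]) + "\n"

result = score(read_conllu(GOLD), read_conllu(PRED))
print(result)
```

On the toy pair, all three heads match but one tag and one relation label differ, so UAS is 1.0 while tagging accuracy and LAS are both 2/3. A real evaluation would more likely rely on an established scorer than on a hand-rolled one like this.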
Problem

Research questions and friction points this paper is trying to address.

African languages
syntactic annotation
treebank
NLP resource gap
linguistic diversity
Innovation

Methods, ideas, or system contributions that make the work stand out.

AfriSUD
Universal Dependencies
African languages
dependency parsing
syntax gap
Happy Buzaaba
Princeton University; Laboratory for Artificial Intelligence, Princeton University
Cheikh Mouhamadou Bamba Dione
Gaston Berger University
David Ifeoluwa Adelani
McGill University and Mila - Quebec AI Institute and Canada CIFAR AI Chair
Natural language processing, Multilinguality, Multilingual NLP, AfricaNLP, Low-resource NLP
Sylvain Kahane
University Paris Nanterre, Modyco & CNRS / Institut Universitaire de France
syntax, dependency grammar, treebank, quantitative typology, spoken language
Kim Gerdes
Paris-Saclay University
Bruno Guillaume
CNRS; Inria; LORIA; Université de Lorraine
Kevin Guan
Princeton University
Aremu Anuoluwapo
University of Trento
Naome A. Etori
Department of Computer Science and Engineering, University of Minnesota-Twin Cities
AI, NLP, Healthcare, HCI, Computational Social Science
Shamsuddeen Hassan Muhammad
Bayero University, Kano, & Google DeepMind Academic Fellow at Imperial College London
Natural Language Processing, Sentiment Analysis, AfricaNLP, Low-resource NLP, Multilinguality
Utitofon Inyang
Binghamton University
Peter Nabende
Makerere University
David Sabiiti Bamutura
Mbarara University of Science and Technology; Chalmers University of Technology
Andiswa Bukula
Penn State University
Chinedu Uchechukwu
Nnamdi Azikiwe University
Rooweither Mabuya
South African Centre for Digital Language Resources
Idris Akinade
University of Ibadan
Christiane Fellbaum
Princeton University