🤖 AI Summary
This paper identifies a systematic representational gap in large language models (LLMs) regarding Nigerian Pidgin (Naija): current multilingual LLMs effectively cover only West African Pidgin English (WAPE), despite significant typological differences between Naija and WAPE in word order, lexicon, and grammar—rendering them non-interchangeable. Method: We conduct the first comprehensive investigation through statistical linguistic analysis, construction of cross-variety parallel corpora, machine translation evaluation (BLEU/chrF), and zero-/few-shot prompting experiments. Contribution/Results: Results demonstrate over 60% performance degradation of mainstream LLMs on Naija-specific tasks. We propose a targeted fine-tuning framework for low-resource creoles, grounded in empirical diagnostics of dialectal divergence. This work advances theoretical foundations and practical methodologies for equitable multilingual LLM development and inclusive technological access for marginalized languages.
📝 Abstract
Nigeria is a multilingual country with more than 500 languages. Naija is a Nigerian Pidgin spoken by approximately 120 million people in Nigeria; it is a mixed language that draws on, for example, English, Portuguese, Yoruba, Hausa, and Igbo. Although it has been mainly a spoken language until recently, several platforms now publish exclusively in Naija, such as Naija Wikipedia. However, non-native speakers find it hard to distinguish Naija from West African Pidgin English (WAPE), a broader pidgin spoken across West Africa that is more simplified and understandable by a wider audience in Ghana, Nigeria, and Cameroon. The BBC news platform publishes exclusively in WAPE to cater for several countries in West Africa. In this paper, we show through statistical analyses and machine translation experiments that these two creole varieties do not represent each other (i.e., there are linguistic differences in word order and vocabulary) and that Generative AI operates only on WAPE. In other words, Naija is under-represented in Generative AI, and it is hard to teach it to LLMs with only a few examples.