🤖 AI Summary
African languages—representing a significant portion of the world’s linguistic diversity—are severely underrepresented in multimodal AI, particularly in image captioning, due to scarce annotated data and limited model support.
Method: This work introduces the first large-scale vision-to-language framework for 20 African languages. It constructs a semantically aligned, high-quality multilingual image-caption dataset; designs a dynamic quality assurance pipeline integrating context-aware translation, model ensembling (SigLIP + NLLB-200), and adaptive token replacement; and develops a unified, 0.5B-parameter vision-to-text architecture optimized for low-resource settings.
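The dynamic quality-assurance loop described above can be sketched as follows. This is a minimal, hypothetical illustration, not the authors' implementation: `score_alignment` stands in for a SigLIP-style image-text similarity model and `retranslate` for an NLLB-200 re-translation call, both stubbed out here.

```python
def score_alignment(image_id: str, caption: str) -> float:
    """Stub for a SigLIP-style image-text alignment score in [0, 1].

    Toy heuristic only: longer captions score higher, capped at 1.0.
    A real pipeline would embed the image and caption and compare them.
    """
    return min(len(caption.split()) / 10, 1.0)


def retranslate(caption: str) -> str:
    """Stub for re-translation with an NLLB-200-style model."""
    return caption + " (retranslated)"


def quality_filter(pairs, threshold=0.5, max_rounds=2):
    """Re-translate low-scoring captions up to max_rounds, then drop failures."""
    kept = []
    for image_id, caption in pairs:
        for _ in range(max_rounds):
            if score_alignment(image_id, caption) >= threshold:
                kept.append((image_id, caption))
                break
            caption = retranslate(caption)  # adaptive substitution step
    return kept
```

For example, `quality_filter([("i1", "a dog runs across the green park grass today"), ("i2", "a dog")])` keeps only the first pair, since the second never clears the alignment threshold even after re-translation.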
Contribution/Results: The work releases the first open-source, African-language-focused image-captioning dataset and corresponding pre-trained models. The framework establishes a new multilingual generation paradigm that balances accuracy and scalability, achieving substantial performance gains on cross-modal tasks for low-resource languages. This advances inclusive, equitable multimodal AI and lays a foundation for future research on under-resourced language modalities.
📝 Abstract
Multimodal AI research has overwhelmingly focused on high-resource languages, hindering the democratization of advancements in the field. To address this, we present AfriCaption, a comprehensive framework for multilingual image captioning in 20 African languages. Our contributions are threefold: (i) a curated dataset built on Flickr8k, featuring semantically aligned captions generated via a context-aware selection and translation process; (ii) a dynamic, context-preserving pipeline that ensures ongoing quality through model ensembling and adaptive substitution; and (iii) the AfriCaption model, a 0.5B-parameter vision-to-text architecture that integrates SigLIP and NLLB-200 for caption generation across under-represented languages. Together, these components establish the first scalable image-captioning resource for under-represented African languages, laying the groundwork for truly inclusive multimodal AI.