AI Summary
Urdu multimodal named entity recognition (MNER) is hindered by low-resource constraints: annotated datasets and standardized baselines are absent. Method: We introduce Twitter2015-Urdu, the first benchmark dataset for Urdu MNER, and propose U-MNER, a lightweight cross-modal fusion framework. U-MNER jointly leverages Urdu-BERT for textual feature extraction and ResNet for visual feature extraction, incorporates a linguistically grounded modality alignment mechanism tailored to Urdu syntactic properties, and integrates a novel cross-modal interaction module. We additionally design a rule-based fine-grained entity annotation protocol. Contribution/Results: Experiments demonstrate that U-MNER achieves state-of-the-art performance on Twitter2015-Urdu, significantly outperforming existing methods. This work establishes the first standardized baseline for Urdu MNER and provides both critical data resources and a reproducible technical framework to advance MNER research for low-resource languages.
Abstract
The emergence of multimodal content, particularly text and images on social media, has positioned Multimodal Named Entity Recognition (MNER) as an increasingly important area of research within Natural Language Processing. Despite progress in high-resource languages such as English, MNER remains underexplored for low-resource languages like Urdu. The primary challenges include the scarcity of annotated multimodal datasets and the lack of standardized baselines. To address these challenges, we introduce the U-MNER framework and release the Twitter2015-Urdu dataset, a pioneering resource for Urdu MNER. Adapted from the widely used Twitter2015 dataset, it is annotated with Urdu-specific grammar rules. We establish benchmark baselines by evaluating both text-based and multimodal models on this dataset, providing comparative analyses to support future research on Urdu MNER. The U-MNER framework integrates textual and visual context using Urdu-BERT for text embeddings and ResNet for visual feature extraction, with a Cross-Modal Fusion Module to align and fuse information. Our model achieves state-of-the-art performance on the Twitter2015-Urdu dataset, laying the groundwork for further MNER research in low-resource languages.
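The abstract describes fusing Urdu-BERT text embeddings with ResNet visual features through a Cross-Modal Fusion Module, but does not specify the module's internals. As a minimal illustrative sketch (not the paper's implementation), a common fusion pattern lets each text token attend over visual region features via scaled dot-product attention and adds the attended visual summary back to the token embedding; all function names and the toy dimensions below are assumptions:

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of scores.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def cross_modal_fuse(text_feats, vis_feats):
    """Toy cross-modal attention fusion (hypothetical sketch):
    each text token attends over visual region vectors, and the
    attended visual summary is added residually to the token.
    Both modalities are assumed projected to a shared dimension d."""
    d = len(text_feats[0])
    scale = math.sqrt(d)
    fused = []
    for t in text_feats:
        # Scaled dot-product scores of this token against each visual region.
        scores = [sum(ti * vi for ti, vi in zip(t, v)) / scale for v in vis_feats]
        weights = softmax(scores)
        # Attention-weighted sum of visual features.
        summary = [sum(w * v[i] for w, v in zip(weights, vis_feats))
                   for i in range(d)]
        # Residual connection: token embedding plus visual summary.
        fused.append([ti + si for ti, si in zip(t, summary)])
    return fused

# Two text tokens and three visual regions with d = 2 (illustrative numbers).
text = [[1.0, 0.0], [0.0, 1.0]]
vis = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
out = cross_modal_fuse(text, vis)
print(len(out), len(out[0]))
```

In a full system such as the one the abstract outlines, the text vectors would come from Urdu-BERT's contextual embeddings and the visual vectors from ResNet region or grid features, with learned projection layers replacing the identity mapping assumed here.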