Celebrity Profiling on Short Urdu Text using Twitter Followers' Feed

📅 2025-10-10

🤖 AI Summary
Demographic profiling (gender, age, profession, fame) has rarely been studied for low-resource languages such as Urdu. This paper profiles celebrities from the tweets of their followers, treating follower-written text as a proxy for the celebrity rather than relying on self-reported profiles or account metadata. It builds what it presents as the first dataset of short Urdu tweets written by followers, then trains and compares Logistic Regression, SVM, Random Forest, CNN, and LSTM models on it. Gender prediction performs best (accuracy = 0.65, cRank = 0.65), while age, profession, and fame prediction reach moderate results. The findings indicate that follower-based profiling is feasible for a low-resource language.

📝 Abstract
Social media has become an essential part of the digital age, serving as a platform for communication, interaction, and information sharing. Celebrities are among the most active users and often reveal aspects of their personal and professional lives through online posts. Platforms such as Twitter provide an opportunity to analyze language and behavior for understanding demographic and social patterns. Since followers frequently share linguistic traits and interests with the celebrities they follow, textual data from followers can be used to predict celebrity demographics. However, most existing research in this field has focused on English and other high-resource languages, leaving Urdu largely unexplored. This study applies modern machine learning and deep learning techniques to the problem of celebrity profiling in Urdu. A dataset of short Urdu tweets from followers of subcontinent celebrities was collected and preprocessed. Multiple algorithms were trained and compared, including Logistic Regression, Support Vector Machines, Random Forests, Convolutional Neural Networks, and Long Short-Term Memory networks. The models were evaluated using accuracy, precision, recall, F1-score, and cumulative rank (cRank). The best performance was achieved for gender prediction with a cRank of 0.65 and an accuracy of 0.65, followed by moderate results for age, profession, and fame prediction. These results demonstrate that follower-based linguistic features can be effectively leveraged using machine learning and neural approaches for demographic prediction in Urdu, a low-resource language.
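
The evaluation metrics named in the abstract can be illustrated with a short sketch. Accuracy and macro-averaged precision, recall, and F1 follow their standard definitions. The abstract does not define cRank; the sketch assumes the convention used in celebrity-profiling shared tasks, where cRank is the harmonic mean of per-trait F1 scores, and the paper's exact definition may differ. The function names `accuracy`, `macro_prf`, and `c_rank` are hypothetical.

```python
from collections import Counter
from statistics import harmonic_mean


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the gold labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_prf(y_true, y_pred):
    """Macro-averaged precision, recall, and F1 over the labels in y_true."""
    labels = sorted(set(y_true))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted label p, but it was wrong
            fn[t] += 1  # gold label t was missed
    precisions, recalls, f1s = [], [], []
    for label in labels:
        pred_total = tp[label] + fp[label]
        gold_total = tp[label] + fn[label]
        prec = tp[label] / pred_total if pred_total else 0.0
        rec = tp[label] / gold_total if gold_total else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        precisions.append(prec)
        recalls.append(rec)
        f1s.append(f1)
    n = len(labels)
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n


def c_rank(per_trait_f1):
    """ASSUMPTION: cRank as the harmonic mean of per-trait F1 scores."""
    return harmonic_mean(per_trait_f1)
```

For example, `macro_prf(gold_gender, predicted_gender)` would be computed once per trait, and `c_rank` would then combine the resulting F1 values into a single score under the stated assumption.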

Problem

Research questions and friction points this paper is trying to address.

Predicting celebrity demographics from Urdu Twitter followers' text
Applying machine learning to Urdu celebrity profiling using tweets
Addressing the low-resource language gap in social media demographic analysis

Innovation

Methods, ideas, or system contributions that make the work stand out.

Using follower tweets for celebrity profiling
Applying machine learning to Urdu language analysis
Comparing multiple algorithms for demographic prediction
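
The follower-feed idea listed above can be sketched as a feature-building step: the tweets of a celebrity's followers are pooled and turned into one feature vector per celebrity, which a classifier could then consume. This is a minimal illustration, not the paper's pipeline. The input layout (pairs of celebrity id and follower tweet), the choice of character trigrams, and the function names are all assumptions; the summary does not state which features or preprocessing the authors used.

```python
from collections import Counter, defaultdict


def pool_follower_tweets(rows):
    """Group follower tweets by the celebrity they are attached to.

    rows: iterable of (celebrity_id, tweet_text) pairs (hypothetical layout).
    """
    pooled = defaultdict(list)
    for celebrity_id, tweet in rows:
        pooled[celebrity_id].append(tweet.strip())
    return dict(pooled)


def char_ngrams(text, n=3):
    """Count overlapping character n-grams; texts shorter than n yield none."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def celebrity_features(rows, n=3):
    """Build one n-gram count vector per celebrity from follower tweets.

    ASSUMPTION: character n-grams stand in for whatever features the
    paper actually extracts; they work on Unicode text such as Urdu
    without needing a tokenizer.
    """
    features = {}
    for celebrity_id, tweets in pool_follower_tweets(rows).items():
        counts = Counter()
        for tweet in tweets:
            counts.update(char_ngrams(tweet, n))
        features[celebrity_id] = counts
    return features
```

Pooling at the celebrity level means each label (for example, gender) is attached to one aggregated vector rather than to individual short tweets, which is one plausible way to make follower text act as a proxy for the celebrity.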