🤖 AI Summary
This work addresses the lack of large-scale, longitudinal Telegram datasets, a bottleneck for cross-domain research on information diffusion and extremist content. It introduces what the authors describe as the largest publicly available Telegram dataset to date, comprising metadata for about 500,000 channels and message-level metadata for about 120 million posts across roughly 71,000 public channels. The dataset combines a channel network built from message forwards with three enrichments: language identification, per-channel active posting periods, and entities extracted from messages. It is intended to support downstream tasks such as information diffusion modeling, extremist content tracking, and analysis of multilingual communities, and to make social media studies on Telegram reproducible.
📝 Abstract
Telegram is a globally popular instant messaging platform known for its strong emphasis on security, privacy, and unique social networking features. It has recently become the subject of research across several domains, such as social media influence, propaganda, and extremism. This paper introduces TeleScope, an extensive dataset suite that, to our knowledge, is the largest of its kind. It comprises metadata for about 500K Telegram channels and downloaded message metadata for about 71K public channels, accounting for around 120M crawled messages. We also release channel connections and user interaction data, built using Telegram's message-forwarding feature, to support use cases such as studying information spread and message-forwarding patterns. In addition, we provide data enrichments, such as language detection, active message posting periods for each channel, and Telegram entities extracted from messages, that enable online discourse analysis beyond what is possible with the original data alone. The dataset is designed for diverse applications, independent of specific research objectives, and is sufficiently versatile to facilitate the replication of social media studies comparable to those conducted on platforms like X (formerly Twitter).
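As a rough illustration of how channel connections might be derived from message forwards, the sketch below counts forwards as weighted directed edges between channels. The record layout and the field names `channel_id` and `forwarded_from` are assumptions made for this example; they are not taken from the released dataset's schema.

```python
from collections import Counter

# Hypothetical message records. The field names are illustrative
# assumptions, not the schema of the released dataset.
messages = [
    {"channel_id": 1, "forwarded_from": 2},
    {"channel_id": 1, "forwarded_from": 2},
    {"channel_id": 3, "forwarded_from": 1},
    {"channel_id": 3, "forwarded_from": None},  # original post, not a forward
]


def forward_edges(records):
    """Count forwards as weighted edges (source channel, forwarding channel)."""
    edges = Counter()
    for rec in records:
        source = rec.get("forwarded_from")
        if source is not None:
            edges[(source, rec["channel_id"])] += 1
    return edges


edges = forward_edges(messages)
for (source, target), weight in sorted(edges.items()):
    print(f"{source} -> {target}: {weight} forward(s)")
```

An edge list of this shape can be loaded into a graph library to examine forwarding patterns; the edge direction used here (from the original channel to the channel that forwarded the message) is one possible convention.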