🤖 AI Summary
Telegram channels play a dual role in information dissemination: they serve as public announcement platforms and as vectors for extremist content. To study both, we construct the largest publicly available Telegram channel dataset to date, comprising 120,979 channels and 403 million messages across 12 major languages. Methodologically, we design a distributed crawling framework built on the Telegram API and combine it with multilingual detection, LDA topic modeling, and network graph analysis to systematically examine how questionable news and conspiracy theories spread. Our key contributions are: (1) the first publicly released complete dataset of open Telegram channels; (2) the identification and annotation of the Sabmyk conspiracy network, comprising 2,147 interlinked channels; (3) an open-source, fully reproducible data acquisition and analysis toolchain; and (4) an analysis of topic distributions (e.g., politics, health, conspiracy theories) and cross-channel coordinated dissemination patterns, particularly within English-language channels.
📝 Abstract
Telegram is one of the most popular instant messaging apps in today's digital age. In addition to providing a private messaging service, Telegram, through its channels, is an effective medium for rapidly broadcasting content to a large audience (e.g., COVID-19 announcements) but, unfortunately, also for disseminating radical ideologies and coordinating attacks (e.g., the Capitol Hill riot). This paper presents the TGDataset, a new dataset that includes 120,979 Telegram channels and over 400 million messages, making it, to the best of our knowledge, the largest collection of Telegram channels. After a brief introduction to the data collection process, we analyze the languages used within our dataset and the topics covered by English channels. Finally, we discuss some use cases in which our dataset can be extremely useful for better understanding the Telegram ecosystem and for studying the diffusion of questionable news. In addition to the raw dataset, we release the scripts we used to analyze it and the list of channels belonging to the network of a new conspiracy theory called Sabmyk.
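To make the notion of a channel network concrete, the following is a minimal sketch of how interlinked channels could be grouped. It assumes, purely for illustration, that two channels are linked whenever one forwards a message from the other; the record fields `channel` and `forwarded_from` and both function names are hypothetical and are not taken from the TGDataset release or its scripts.

```python
from collections import defaultdict, deque


def build_forward_graph(messages):
    """Build an undirected channel graph from message records.

    Assumption: each record is a dict with a 'channel' key and an optional
    'forwarded_from' key naming the channel the message was forwarded from.
    """
    graph = defaultdict(set)
    for msg in messages:
        src = msg["channel"]
        graph[src]  # register the channel even if it never forwards anything
        origin = msg.get("forwarded_from")
        if origin and origin != src:
            graph[src].add(origin)
            graph[origin].add(src)
    return graph


def connected_components(graph):
    """Return the sets of mutually reachable channels, largest first."""
    seen = set()
    components = []
    for start in graph:
        if start in seen:
            continue
        seen.add(start)
        queue = deque([start])
        component = set()
        while queue:  # breadth-first search from `start`
            node = queue.popleft()
            component.add(node)
            for neighbor in graph.get(node, ()):
                if neighbor not in seen:
                    seen.add(neighbor)
                    queue.append(neighbor)
        components.append(component)
    return sorted(components, key=len, reverse=True)


if __name__ == "__main__":
    sample = [
        {"channel": "a", "forwarded_from": "b"},
        {"channel": "b", "forwarded_from": "c"},
        {"channel": "d", "forwarded_from": None},
        {"channel": "e", "forwarded_from": "d"},
    ]
    for group in connected_components(build_forward_graph(sample)):
        print(sorted(group))
```

Connected components are the simplest possible grouping; a study of a specific network such as Sabmyk would likely need additional evidence (shared links, naming patterns, or content) before attributing channels to it.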