🤖 AI Summary
Constructing disease-specific digital cohorts from social media remains challenging because user-generated content is noisy and heterogeneous, and clinically grounded semantic filtering is lacking. Method: This paper proposes a user screening method based on the *metric backbone* of a biomedical knowledge graph, that is, the minimal subgraph that preserves all shortest-path distances. We construct a weighted knowledge graph by mapping multi-source social media texts to biomedical terminology through a dictionary, then extract the metric backbone automatically to identify, without supervision, users whose linguistic patterns align with clinical semantics. Contribution/Results: In an epilepsy case study spanning X, Instagram, Reddit, and the Epilepsy Foundation of America (EFA) forum, users from condition-focused platforms contribute 87% of backbone edges versus only 12% from general-purpose platforms. Manual annotation shows that 73% of non-backbone users misuse clinical terms, confirming the backbone’s efficacy in suppressing semantic noise. The approach offers a semantics-driven, unsupervised filtering paradigm for digital cohort construction.
📝 Abstract
Social media data allow researchers to construct large digital cohorts to study the interplay between human behavior and medical treatment. Identifying the users most relevant to a specific health problem is, however, a challenge because social media sites vary in the generality of their discourse. To filter relevant users on any social media site, we have developed a general method and tested it on epilepsy discourse. We analyzed the text of posts by users who mention epilepsy drugs at least once on the general-purpose social media sites X and Instagram, the epilepsy-focused Reddit subgroup (r/Epilepsy), and the Epilepsy Foundation of America (EFA) forums. We used a curated medical terminology dictionary to generate a knowledge graph (KG) from each social media site, in which nodes represent terms and edge weights denote the strength of association between pairs of terms in the collected text. Our method is based on computing the metric backbone of each KG, which yields the subgraph of edges that participate in shortest paths. By comparing the subset of users who contribute to the backbone with the subset who do not, we show that users of epilepsy-focused social media contribute to the KG backbone in a much higher proportion than users of general-purpose social media. Furthermore, using human annotation of Instagram posts, we demonstrate that users who do not contribute to the backbone are much more likely to use dictionary terms in a manner inconsistent with their biomedical meaning and are rightly excluded from the cohort of interest.
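The pipeline described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify how edge weights are computed, so the sketch assumes a Jaccard proximity over posts, converted to a distance with 1/p - 1, and uses naive whitespace tokenization. The function names (`build_graph`, `metric_backbone`, `split_users`) are hypothetical. An edge is kept in the backbone when its direct distance is no longer than the shortest path between its endpoints.

```python
import heapq
from collections import defaultdict
from itertools import combinations


def build_graph(posts, dictionary):
    """Build a term co-occurrence graph from (user, text) posts.

    Returns (dist, edge_users): dist maps a sorted term pair to a distance,
    edge_users maps the same pair to the users whose posts contain both terms.
    """
    term_posts = defaultdict(set)
    edge_users = defaultdict(set)
    for i, (user, text) in enumerate(posts):
        # Naive whitespace tokenization; an assumption for illustration only.
        terms = sorted(set(text.lower().split()) & dictionary)
        for t in terms:
            term_posts[t].add(i)
        for pair in combinations(terms, 2):
            edge_users[pair].add(user)
    dist = {}
    for a, b in edge_users:
        shared = len(term_posts[a] & term_posts[b])
        total = len(term_posts[a] | term_posts[b])
        # Assumed weighting: Jaccard proximity p = shared / total,
        # mapped to the distance 1/p - 1 = total / shared - 1.
        dist[(a, b)] = total / shared - 1
    return dist, dict(edge_users)


def shortest_distances(source, adj):
    """Dijkstra's algorithm from one source over non-negative distances."""
    best = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > best[u]:
            continue  # stale heap entry
        for v, w in adj[u].items():
            if d + w < best.get(v, float("inf")):
                best[v] = d + w
                heapq.heappush(heap, (d + w, v))
    return best


def metric_backbone(dist, tol=1e-12):
    """Return the edges whose direct distance equals the shortest-path distance."""
    adj = defaultdict(dict)
    for (a, b), w in dist.items():
        adj[a][b] = w
        adj[b][a] = w
    backbone = set()
    for a in list(adj):
        best = shortest_distances(a, adj)
        for b, w in adj[a].items():
            # Edge keys are sorted pairs, so visit each edge once via a < b.
            if a < b and w <= best[b] + tol:
                backbone.add((a, b))
    return backbone


def split_users(edge_users, backbone):
    """Split users into backbone contributors and everyone else."""
    on = set().union(*(u for e, u in edge_users.items() if e in backbone))
    everyone = set().union(*edge_users.values())
    return on, everyone - on
```

In this sketch a user counts as a backbone contributor if at least one of their co-occurrences lies on a backbone edge; the paper may define contribution differently. For a toy graph with distances a-b = 1, b-c = 1, and a-c = 3, the edge a-c is dropped because the indirect path through b is shorter, so a user who contributes only to a-c would be filtered out.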