🤖 AI Summary
Balancing interpretability and accuracy remains challenging in topic modeling of unstructured social media text. Method: This paper proposes an end-to-end topic extraction framework tailored to autism-community tweets, introducing a novel three-stage paradigm: “embedding compression → clustering → agent-based chain-of-thought (CoT) generation.” It integrates Tweet-BERT embeddings, UMAP dimensionality reduction, non-negative matrix factorization (NMF), and a collaborative dual-LLM mechanism (one model for topic generation and one for quality verification), enhanced by agent-oriented CoT prompting for semantic refinement and interpretable representation. Results: Evaluated on real-world data, the framework achieves a 32% improvement in topic coherence and attains a Cohen’s Kappa of 0.81 in human evaluation. It operates with minimal supervision and exhibits strong cross-domain transferability, establishing a generalizable, robust paradigm for domain-specific community discourse analysis.
📝 Abstract
Thematic analysis of social media posts offers valuable insight into public discourse, yet traditional methods often struggle to capture the complexity and nuance of unstructured, large-scale text data. This study introduces a novel methodology for thematic analysis that integrates tweet embeddings from pre-trained language models, dimensionality reduction and matrix factorization, and generative AI to identify and refine latent themes. Our approach clusters compressed tweet representations and employs generative AI to extract and articulate themes through agentic Chain of Thought (CoT) prompting, with a secondary LLM for quality assurance. We apply this methodology to tweets from the autistic community, a group that increasingly uses social media to discuss their experiences and challenges. By automating the thematic extraction process, we aim to uncover key insights while maintaining the richness of the original discourse. This autism case study demonstrates the utility of the proposed approach in improving thematic analysis of social media data, offering a scalable and adaptable framework that can be applied to diverse contexts. The results highlight the potential of combining machine learning and generative AI to enhance the depth and accuracy of theme identification in online communities.
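As a rough illustration of the theme-generation and quality-assurance steps described above, the following minimal Python sketch loops over already-clustered tweets, builds a step-by-step prompt for each cluster, and accepts a theme only when a second model approves it. It is a sketch under stated assumptions, not the authors' implementation: the function names `build_cot_prompt` and `extract_themes`, the prompt wording, the retry limit, and the convention that the theme label sits on the last line of the reply are all hypothetical, and the two LLMs are passed in as plain callables so that any client could be plugged in.

```python
from typing import Callable, Dict, List, Optional, Sequence


def build_cot_prompt(tweets: Sequence[str]) -> str:
    """Assemble a step-by-step (chain-of-thought style) prompt for one cluster.

    The wording is an illustrative assumption, not the prompt used in the study.
    """
    numbered = "\n".join(f"{i}. {t}" for i, t in enumerate(tweets, start=1))
    return (
        "The following tweets were grouped into one cluster:\n"
        f"{numbered}\n"
        "Step 1: List the topics that recur across the tweets.\n"
        "Step 2: Explain what these topics have in common.\n"
        "Step 3: On the final line, state one short theme label."
    )


def extract_themes(
    clusters: Dict[int, List[str]],
    generate: Callable[[str], str],
    verify: Callable[[str, Sequence[str]], bool],
    max_attempts: int = 3,
) -> Dict[int, Optional[str]]:
    """Ask a generator LLM for a theme per cluster; keep it only if a verifier LLM approves.

    `generate` and `verify` are hypothetical stand-ins for the two models.
    A cluster maps to None when no candidate is accepted within `max_attempts`.
    """
    themes: Dict[int, Optional[str]] = {}
    for cluster_id, tweets in clusters.items():
        prompt = build_cot_prompt(tweets)
        accepted: Optional[str] = None
        for _ in range(max_attempts):
            lines = generate(prompt).strip().splitlines()
            if not lines:
                continue  # empty reply: try again
            candidate = lines[-1].strip()  # assumed: label is on the last line
            if verify(candidate, tweets):
                accepted = candidate
                break
        themes[cluster_id] = accepted
    return themes
```

In practice, `generate` would wrap a call to the theme-generating model and `verify` a call to the secondary model; here they can be replaced with simple stub functions to exercise the control flow without any API access.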