Text Clustering as Classification with LLMs

๐Ÿ“… 2024-09-30
๐Ÿ›๏ธ arXiv.org
๐Ÿ“ˆ Citations: 0
โœจ Influential: 0
๐Ÿ“„ PDF
๐Ÿค– AI Summary
High annotation costs and reliance on embedding fine-tuning and complex similarity computations hinder traditional text clustering. To address these challenges, this paper proposes an LLM-native clustering paradigm that reformulates clustering as a zero-shot or few-shot classification task. Through prompt engineering, the method has large language models (e.g., GPT, Qwen) autonomously generate, normalize, and merge semantic labels, bypassing embedding fine-tuning and explicit clustering algorithms entirely. The authors present this as the first approach to achieve end-to-end text clustering without embedding optimization or dedicated clustering modules. Extensive experiments on multiple benchmark datasets show performance on par with or superior to state-of-the-art embedding-based methods, indicating strong generalizability and practical utility; the implementation is publicly available.
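The label normalization and merging step in the summary can be sketched as follows. This is a minimal illustration, not the paper's exact procedure: `normalize`, `merge_labels`, and the similarity threshold are assumptions standing in for whatever prompting or heuristic the authors use.

```python
# Hypothetical sketch of normalizing and merging LLM-generated labels.
# The 0.8 similarity threshold is an illustrative assumption.
from difflib import SequenceMatcher

def normalize(label: str) -> str:
    """Lowercase a raw label and collapse surrounding/inner whitespace."""
    return " ".join(label.lower().split())

def merge_labels(raw_labels: list[str], threshold: float = 0.8) -> list[str]:
    """Collapse near-duplicate labels into one canonical label set."""
    merged: list[str] = []
    for label in map(normalize, raw_labels):
        # Keep the label only if it is not close to an already-kept one.
        if not any(SequenceMatcher(None, label, kept).ratio() >= threshold
                   for kept in merged):
            merged.append(label)
    return merged

labels = merge_labels(["Sports", " sports ", "World News", "world news", "Business"])
# Near-duplicates differing only in case or whitespace collapse to one entry.
```

In the paper this deduplication is done by the LLM itself; a string-similarity pass like the above only conveys the shape of the step.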

๐Ÿ“ Abstract
Text clustering remains valuable in real-world applications where manual labeling is cost-prohibitive. It facilitates efficient organization and analysis of information by grouping similar texts based on their representations. However, implementing this approach necessitates fine-tuned embedders for downstream data and sophisticated similarity metrics. To address this issue, this study presents a novel framework for text clustering that effectively leverages the in-context learning capacity of Large Language Models (LLMs). Instead of fine-tuning embedders, we propose to transform text clustering into a classification task via an LLM. First, we prompt the LLM to generate potential labels for a given dataset. Second, after integrating similar labels generated by the LLM, we prompt the LLM to assign the most appropriate label to each sample in the dataset. Our framework has been experimentally proven to achieve comparable or superior performance to state-of-the-art clustering methods that employ embeddings, without requiring complex fine-tuning or clustering algorithms. We make our code available to the public for utilization at https://github.com/ECNU-Text-Computing/Text-Clustering-via-LLM.
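The two-step pipeline described in the abstract can be sketched as below. `fake_llm` is a deterministic stand-in for a real chat-completion call (e.g., to GPT or Qwen), and the prompt wording and response parsing are illustrative assumptions, not the paper's actual prompts.

```python
# Sketch of clustering-as-classification: (1) elicit candidate labels for the
# dataset, (2) ask the model to assign one label per sample.

def fake_llm(prompt: str) -> str:
    # Stand-in for an LLM API call; returns canned responses for this demo.
    if prompt.startswith("Generate"):
        return "sports, politics"
    if "goal in the final minute" in prompt:
        return "sports"
    return "politics"

def cluster_via_classification(texts: list[str]) -> dict[str, str]:
    # Step 1: prompt the LLM for candidate cluster labels over the dataset.
    labels = [l.strip() for l in fake_llm(
        "Generate a short list of topic labels for these texts: "
        + " | ".join(texts)
    ).split(",")]
    # Step 2: prompt the LLM to pick the best label for each sample.
    return {
        text: fake_llm(f"Choose one label from {labels} for this text: {text}")
        for text in texts
    }

assignments = cluster_via_classification([
    "A goal in the final minute decided the match.",
    "The senate passed the new budget bill.",
])
```

With a real model in place of `fake_llm`, no embedder fine-tuning or clustering algorithm is needed: the label set and the assignments both come from prompting alone.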
Problem

Research questions and friction points this paper is trying to address.

Large Language Models
Text Clustering
Cost-Effective Classification
Innovation

Methods, ideas, or system contributions that make the work stand out.

Large Language Models
Simplified Text Clustering
Automated Text Classification
๐Ÿ”Ž Similar Papers
No similar papers found.
Chen Huang
Singapore University of Technology and Design
Guoxiu He
School of Economics and Management, East China Normal University