Text Clustering as Classification with LLMs

๐Ÿ“… 2024-09-30
๐Ÿ›๏ธ arXiv.org
๐Ÿ“ˆ Citations: 0
โœจ Influential: 0
๐Ÿ“„ PDF
๐Ÿค– AI Summary
High annotation costs and reliance on embedding fine-tuning and complex similarity computations hinder traditional text clustering. To address these challenges, this paper proposes an LLM-native clustering paradigm that reformulates clustering as a zero-shot or few-shot classification task. Through prompt engineering, the method has large language models (e.g., GPT, Qwen) autonomously generate, normalize, and merge semantic labels, bypassing embedding fine-tuning and explicit clustering algorithms entirely. The authors present this as the first approach to achieve end-to-end text clustering without embedding optimization or dedicated clustering modules. Extensive experiments on multiple benchmark datasets show performance on par with or superior to state-of-the-art embedding-based methods, indicating strong generalizability and practical utility; the implementation is publicly available.
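The label normalization and merging step in the summary can be sketched as follows. This is a minimal illustration, not the paper's exact procedure: `normalize`, `merge_labels`, and the similarity threshold are assumptions standing in for whatever prompting or heuristic the authors use.

```python
# Hypothetical sketch of normalizing and merging LLM-generated labels.
# The 0.8 similarity threshold is an illustrative assumption.
from difflib import SequenceMatcher

def normalize(label: str) -> str:
    """Lowercase a raw label and collapse surrounding/inner whitespace."""
    return " ".join(label.lower().split())

def merge_labels(raw_labels: list[str], threshold: float = 0.8) -> list[str]:
    """Collapse near-duplicate labels into one canonical label set."""
    merged: list[str] = []
    for label in map(normalize, raw_labels):
        # Keep the label only if it is not close to an already-kept one.
        if not any(SequenceMatcher(None, label, kept).ratio() >= threshold
                   for kept in merged):
            merged.append(label)
    return merged

labels = merge_labels(["Sports", " sports ", "World News", "world news", "Business"])
# Near-duplicates differing only in case or whitespace collapse to one entry.
```

In the paper this deduplication is done by the LLM itself; a string-similarity pass like the above only conveys the shape of the step.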

๐Ÿ“ Abstract
Text clustering remains valuable in real-world applications where manual labeling is cost-prohibitive. It facilitates efficient organization and analysis of information by grouping similar texts based on their representations. However, implementing this approach necessitates fine-tuned embedders for downstream data and sophisticated similarity metrics. To address this issue, this study presents a novel framework for text clustering that effectively leverages the in-context learning capacity of Large Language Models (LLMs). Instead of fine-tuning embedders, we propose to transform text clustering into a classification task via an LLM. First, we prompt the LLM to generate potential labels for a given dataset. Second, after integrating similar labels generated by the LLM, we prompt the LLM to assign the most appropriate label to each sample in the dataset. Our framework has been experimentally proven to achieve comparable or superior performance to state-of-the-art clustering methods that employ embeddings, without requiring complex fine-tuning or clustering algorithms. We make our code available to the public for utilization at https://github.com/ECNU-Text-Computing/Text-Clustering-via-LLM.
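The two-step pipeline described in the abstract can be sketched as below. `fake_llm` is a deterministic stand-in for a real chat-completion call (e.g., to GPT or Qwen), and the prompt wording and response parsing are illustrative assumptions, not the paper's actual prompts.

```python
# Sketch of clustering-as-classification: (1) elicit candidate labels for the
# dataset, (2) ask the model to assign one label per sample.

def fake_llm(prompt: str) -> str:
    # Stand-in for an LLM API call; returns canned responses for this demo.
    if prompt.startswith("Generate"):
        return "sports, politics"
    if "goal in the final minute" in prompt:
        return "sports"
    return "politics"

def cluster_via_classification(texts: list[str]) -> dict[str, str]:
    # Step 1: prompt the LLM for candidate cluster labels over the dataset.
    labels = [l.strip() for l in fake_llm(
        "Generate a short list of topic labels for these texts: "
        + " | ".join(texts)
    ).split(",")]
    # Step 2: prompt the LLM to pick the best label for each sample.
    return {
        text: fake_llm(f"Choose one label from {labels} for this text: {text}")
        for text in texts
    }

assignments = cluster_via_classification([
    "A goal in the final minute decided the match.",
    "The senate passed the new budget bill.",
])
```

With a real model in place of `fake_llm`, no embedder fine-tuning or clustering algorithm is needed: the label set and the assignments both come from prompting alone.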
Problem

Research questions and friction points this paper is trying to address.

Large Language Models
Text Clustering
Cost-Effective Classification
Innovation

Methods, ideas, or system contributions that make the work stand out.

Large Language Models
Simplified Text Clustering
Automated Text Classification
๐Ÿ”Ž Similar Papers
No similar papers found.
Chen Huang
Singapore University of Technology and Design
Guoxiu He
School of Economics and Management, East China Normal University