🤖 AI Summary
This work addresses two limitations of conventional clustering algorithms, namely their reliance on predefined similarity metrics and their poor generalization to multimodal and zero-shot settings, by proposing an in-context unsupervised clustering method built on large language models (LLMs). The core methodology brings in-context learning to unsupervised clustering for the first time: it constructs a data relational graph from the LLM's self-attention matrices and refines the clustering via spectral clustering, jointly optimized with a next-token prediction loss. Through prompt engineering, the framework achieves unified encoding and clustering across text, numerical, and image modalities, enabling zero-shot clustering as well as text-conditioned image clustering. Experiments demonstrate strong zero-shot performance on text-encoded numerical data, and fine-tuning yields significant gains on both numerical and image benchmarks. Moreover, attention patterns inherently reflect cluster structure, extending the reach of LLMs in unsupervised learning.
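The attention-to-clusters pipeline described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the block-structured `attn` matrix below is a synthetic stand-in for a real LLM self-attention matrix over the in-context items, and the symmetrization step is one common way to turn a (non-symmetric) attention matrix into an affinity graph for spectral clustering.

```python
import numpy as np
from sklearn.cluster import SpectralClustering

# Synthetic stand-in for an LLM self-attention matrix over n in-context items:
# two groups of items that attend strongly within their own group.
rng = np.random.default_rng(0)
n = 20
attn = rng.uniform(0.0, 0.1, size=(n, n))
attn[:10, :10] += 0.8   # group 1 attends mostly to itself
attn[10:, 10:] += 0.8   # group 2 attends mostly to itself

# Attention is not symmetric; average with its transpose to get an affinity matrix.
affinity = (attn + attn.T) / 2

# Spectral clustering on the attention-derived affinity graph.
labels = SpectralClustering(
    n_clusters=2, affinity="precomputed", random_state=0
).fit_predict(affinity)
```

With salient block structure in the attention pattern, the spectral step recovers the two groups, which mirrors the paper's observation that attention matrices alone support surprisingly competitive clustering.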
📝 Abstract
We propose In-Context Clustering (ICC), a flexible LLM-based procedure for clustering data from diverse distributions. Unlike traditional clustering algorithms constrained by predefined similarity measures, ICC flexibly captures complex relationships among inputs through an attention mechanism. We show that pretrained LLMs exhibit impressive zero-shot clustering capabilities on text-encoded numeric data, with attention matrices showing salient cluster patterns. Spectral clustering using attention matrices offers surprisingly competitive performance. We further enhance the clustering capabilities of LLMs on numeric and image data through fine-tuning using the Next Token Prediction (NTP) loss. Moreover, the flexibility of LLM prompting enables text-conditioned image clustering, a capability that classical clustering methods lack. Our work extends in-context learning to an unsupervised setting, showcasing the effectiveness and flexibility of LLMs for clustering. Our code is available at https://agenticlearning.ai/icc.
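To make the "text-encoded numeric data" idea concrete, the sketch below serializes points into a clustering prompt, optionally with a text condition. The prompt format and function name are illustrative assumptions, not the prompt used by ICC.

```python
def build_clustering_prompt(points, k, condition=None):
    """Hypothetical sketch: serialize 2-D points into a text prompt
    asking an LLM to assign each point to one of k clusters.
    The exact format is an assumption, not ICC's actual prompt."""
    lines = [f"Cluster the following {len(points)} points into {k} groups."]
    if condition:
        # Text-conditioned clustering: a criterion like "color" or "shape".
        lines.append(f"Cluster by: {condition}")
    for i, (x, y) in enumerate(points):
        lines.append(f"Point {i}: ({x:.2f}, {y:.2f})")
    lines.append("Answer with one cluster id per point.")
    return "\n".join(lines)

prompt = build_clustering_prompt([(0.1, 0.2), (5.0, 5.1)], k=2)
```

Because the data and the clustering criterion are both expressed in text, swapping the `condition` string changes what the model clusters by, which is the flexibility classical similarity-based methods lack.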