🤖 AI Summary
This study addresses the challenge of clustering categorical data, where the absence of intrinsic ordering among attribute values often prevents meaningful semantic structure from being captured, limiting clustering performance. To overcome this, the work proposes the first integration of large language models (LLMs) into categorical clustering: LLMs generate semantic descriptions of attribute values, which are then used to construct enhanced embeddings. An attention mechanism adaptively fuses these external semantic representations with the original data, yielding a semantic-aware representation that mitigates the unreliability of co-occurrence statistics in low-sample regimes. Evaluated on eight benchmark datasets, the proposed method outperforms seven state-of-the-art approaches, achieving average improvements of 19–27% in clustering performance and demonstrating the efficacy of LLM-driven semantic augmentation in unsupervised categorical clustering.
📝 Abstract
Categorical data are prevalent in domains such as healthcare, marketing, and bioinformatics, where clustering serves as a fundamental tool for pattern discovery. A core challenge in categorical data clustering lies in measuring similarity among attribute values that lack inherent ordering or distance. Without appropriate similarity measures, values are often treated as equidistant, creating a semantic gap that obscures latent structures and degrades clustering quality. Although existing methods infer value relationships from within-dataset co-occurrence patterns, such inference becomes unreliable when samples are limited, leaving the semantic context of the data underexplored. To bridge this gap, we present ARISE (Attention-weighted Representation with Integrated Semantic Embeddings), which draws on external semantic knowledge from Large Language Models (LLMs) to construct semantic-aware representations that complement the metric space of categorical data for accurate clustering. Specifically, an LLM is used to describe attribute values for representation enhancement, and the LLM-enhanced embeddings are combined with the original data to uncover semantically prominent clusters. Experiments on eight benchmark datasets demonstrate consistent improvements over seven representative counterparts, with gains of 19–27%. Code is available at https://github.com/develop-yang/ARISE
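The pipeline the abstract describes — embedding LLM-generated descriptions of attribute values, then fusing them with the original categorical data via attention-style weights — can be sketched as follows. This is an illustrative toy, not the authors' implementation: the semantic vectors are stand-ins for real LLM-description embeddings, and the fusion weights are free parameters rather than learned ones.

```python
import numpy as np

# Toy categorical dataset: rows are samples, columns are attributes.
data = [["red", "small"], ["blue", "large"], ["red", "large"]]

# Hypothetical semantic embeddings per attribute value. In the real method
# these would come from embedding LLM-generated descriptions of each value.
sem = {
    "red":   np.array([0.9, 0.1]),
    "blue":  np.array([0.1, 0.9]),
    "small": np.array([0.8, 0.2]),
    "large": np.array([0.2, 0.8]),
}

def one_hot(data):
    """One-hot encode the categorical data column by column."""
    blocks = []
    for col in zip(*data):
        values = sorted(set(col))
        idx = {v: i for i, v in enumerate(values)}
        block = np.zeros((len(col), len(values)))
        for r, v in enumerate(col):
            block[r, idx[v]] = 1.0
        blocks.append(block)
    return np.hstack(blocks)

def semantic_repr(data, sem):
    """Concatenate the semantic embeddings of each sample's values."""
    return np.array([np.concatenate([sem[v] for v in row]) for row in data])

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def fuse(data, sem, scores=(0.0, 0.0)):
    """Attention-style fusion of the one-hot view and the semantic view.

    Softmax-normalized weights control the contribution of each view;
    in the actual method such weights would be learned adaptively.
    """
    w = softmax(np.array(scores))
    return np.hstack([w[0] * one_hot(data), w[1] * semantic_repr(data, sem)])

X = fuse(data, sem)
print(X.shape)  # fused representation: (n_samples, one_hot_dim + semantic_dim)
```

The fused matrix `X` can then be fed to any standard clustering routine (e.g. k-means), which is the sense in which the semantic view "complements" the original metric space.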