AI Summary
Short text clustering suffers from semantic sparsity, weak discriminative representation learning, and unreliable pseudo-labels. To address these challenges, we propose POTA, a novel framework that (1) integrates instance-level attention into optimal transport (OT) constraints to jointly model semantic consistency among instances and global cluster-level structure; (2) introduces semantic-regularized OT solving to adaptively estimate imbalanced cluster distributions and generate robust pseudo-labels; and (3) leverages these pseudo-labels to guide contrastive representation learning. Extensive experiments demonstrate that POTA achieves significant improvements over state-of-the-art methods across multiple short text clustering benchmarks, particularly under severe class imbalance. The framework exhibits strong generalization capability in heterogeneous and skewed data regimes. Our implementation is publicly available.
Abstract
Short text clustering has gained significant attention in the data mining community. However, the limited information contained in short texts often leads to low-discriminative representations, which makes clustering more difficult. This paper proposes a novel short text clustering framework, called Reliable Pseudo-labeling via Optimal Transport with Attention for Short Text Clustering (POTA), that generates reliable pseudo-labels to aid discriminative representation learning for clustering. Specifically, POTA first implements an instance-level attention mechanism to capture the semantic relationships among samples, which are then incorporated as a regularization term into an optimal transport (OT) problem. Solving this OT problem yields reliable pseudo-labels that simultaneously account for sample-to-sample semantic consistency and sample-to-cluster global structure information. Additionally, the proposed OT formulation can adaptively estimate cluster distributions, making POTA well-suited to datasets with varying degrees of imbalance. We then use the pseudo-labels to guide contrastive learning, which produces discriminative representations and enables efficient clustering. Extensive experiments demonstrate that POTA outperforms state-of-the-art methods. The code is available at: https://github.com/YZH0905/POTA-STC/tree/main.
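To make the pseudo-labeling step more concrete, the following is a minimal sketch of how an entropic OT problem can turn sample-to-cluster scores into pseudo-labels with log-domain Sinkhorn iterations. It is not the POTA implementation: the attention-based regularization term and the adaptive estimation of cluster distributions are not specified in this abstract and are not reproduced here. The function name `sinkhorn_pseudo_labels`, the parameters `epsilon` and `n_iters`, and the fixed `cluster_probs` marginal are all assumptions made for illustration.

```python
import numpy as np


def _logsumexp(a, axis):
    """Numerically stable log(sum(exp(a))) along one axis."""
    m = a.max(axis=axis, keepdims=True)
    return (m + np.log(np.exp(a - m).sum(axis=axis, keepdims=True))).squeeze(axis)


def sinkhorn_pseudo_labels(logits, cluster_probs=None, epsilon=0.5, n_iters=200):
    """Hypothetical helper: entropic OT assignment of n samples to k clusters.

    logits        : (n, k) sample-to-cluster scores from any model.
    cluster_probs : (k,) target cluster proportions, strictly positive and
                    summing to 1 (assumed fixed here; uniform if None).
    Returns the (n, k) transport plan and the hard pseudo-labels.
    """
    logits = np.asarray(logits, dtype=float)
    n, k = logits.shape
    if cluster_probs is None:
        cluster_probs = np.full(k, 1.0 / k)
    log_rows = np.full(n, -np.log(n))  # each sample carries mass 1/n
    log_cols = np.log(np.asarray(cluster_probs, dtype=float))

    # The cost is the negative log-softmax of the scores, so the log of the
    # Gibbs kernel exp(-cost / epsilon) equals log_softmax / epsilon.
    log_p = logits - _logsumexp(logits, axis=1)[:, None]
    log_kernel = log_p / epsilon

    f = np.zeros(n)  # log-domain row scaling
    g = np.zeros(k)  # log-domain column scaling
    for _ in range(n_iters):
        f = log_rows - _logsumexp(log_kernel + g[None, :], axis=1)
        g = log_cols - _logsumexp(log_kernel + f[:, None], axis=0)

    plan = np.exp(log_kernel + f[:, None] + g[None, :])
    return plan, plan.argmax(axis=1)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    scores = rng.normal(size=(12, 3))  # stand-in for model outputs
    plan, labels = sinkhorn_pseudo_labels(scores, cluster_probs=[0.6, 0.3, 0.1])
    print(labels)
    print(plan.sum(axis=0))  # column masses follow cluster_probs
```

Passing a non-uniform `cluster_probs` shows how the column marginal constrains the cluster sizes. In an imbalanced setting such as the one the abstract targets, that marginal would presumably be estimated rather than supplied by hand.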