Reliable Pseudo-labeling via Optimal Transport with Attention for Short Text Clustering

📅 2025-01-25

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Short text clustering suffers from semantic sparsity, weak discriminative representation learning, and unreliable pseudo-labels. To address these challenges, we propose POTA—a novel framework that (1) integrates instance-level attention into optimal transport (OT) constraints to jointly model semantic consistency among instances and global cluster-level structure; (2) introduces semantic-regularized OT solving to adaptively estimate imbalanced cluster distributions and generate robust pseudo-labels; and (3) leverages these pseudo-labels to guide contrastive representation learning. Extensive experiments demonstrate that POTA achieves significant improvements over state-of-the-art methods across multiple short text clustering benchmarks, particularly under severe class imbalance. The framework exhibits strong generalization capability in heterogeneous and skewed data regimes. Our implementation is publicly available.
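As a rough illustration of what instance-level attention over a batch of short texts could look like, the sketch below turns sentence embeddings into a row-normalized similarity matrix. The function name `instance_attention`, the cosine-similarity scoring, and the `temperature` parameter are assumptions made for this example; the summary does not specify POTA's actual attention formulation.

```python
import numpy as np

def instance_attention(embeddings, temperature=0.5):
    """Row-stochastic attention weights between samples (hypothetical helper).

    Assumption: attention is a softmax over cosine similarities with
    self-similarity masked out. This is an illustrative choice only.
    """
    # L2-normalize so the dot product equals cosine similarity.
    z = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = z @ z.T / temperature
    # Mask the diagonal so a sample does not attend to itself.
    np.fill_diagonal(sim, -np.inf)
    # Numerically stable softmax over each row.
    sim -= sim.max(axis=1, keepdims=True)
    weights = np.exp(sim)
    return weights / weights.sum(axis=1, keepdims=True)

# Toy usage with random vectors standing in for text embeddings.
rng = np.random.default_rng(0)
toy_embeddings = rng.normal(size=(6, 4))
attention = instance_attention(toy_embeddings)
```

Each row of `attention` sums to one and describes how strongly a sample relates to every other sample in the batch, which is the kind of sample-to-sample signal the summary says is fed into the OT constraints.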

📝 Abstract

Short text clustering has gained significant attention in the data mining community. However, the limited valuable information contained in short texts often leads to low-discriminative representations, increasing the difficulty of clustering. This paper proposes a novel short text clustering framework, called Reliable Pseudo-labeling via Optimal Transport with Attention for Short Text Clustering (POTA), that generates reliable pseudo-labels to aid discriminative representation learning for clustering. Specifically, POTA first implements an instance-level attention mechanism to capture the semantic relationships among samples, which are then incorporated as a regularization term into an optimal transport (OT) problem. Solving this OT problem yields reliable pseudo-labels that simultaneously account for sample-to-sample semantic consistency and sample-to-cluster global structure information. Additionally, the proposed OT can adaptively estimate cluster distributions, making POTA well-suited for datasets with varying degrees of imbalance. The pseudo-labels are then used to guide contrastive learning, which generates discriminative representations and achieves efficient clustering. Extensive experiments demonstrate that POTA outperforms state-of-the-art methods. The code is available at: https://github.com/YZH0905/POTA-STC/tree/main.
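To make the idea of OT-based pseudo-labeling concrete, here is a minimal sketch of the standard entropy-regularized formulation solved with Sinkhorn iterations. It is not the method from the paper: the cluster marginal is fixed rather than adaptively estimated, and no attention-based regularization term is included. The function name `sinkhorn_pseudo_labels` and the parameters `epsilon` and `n_iters` are assumptions for this example.

```python
import numpy as np

def sinkhorn_pseudo_labels(probs, cluster_marginal=None, epsilon=0.1, n_iters=200):
    """Entropic OT plan between samples and clusters (illustrative sketch).

    probs: (n, k) predicted cluster probabilities for n samples.
    cluster_marginal: target cluster proportions; uniform if None.
    Assumption: plain Sinkhorn with fixed marginals, unlike POTA's
    adaptive, attention-regularized OT.
    """
    n, k = probs.shape
    if cluster_marginal is None:
        cluster_marginal = np.full(k, 1.0 / k)
    row_marginal = np.full(n, 1.0 / n)

    # Gibbs kernel exp(log p / epsilon), shifted for numerical stability.
    log_kernel = np.log(probs + 1e-12) / epsilon
    log_kernel -= log_kernel.max()
    kernel = np.exp(log_kernel)

    u = np.ones(n)
    v = np.ones(k)
    for _ in range(n_iters):
        # Alternately rescale to match the row and column marginals.
        u = row_marginal / (kernel @ v)
        v = cluster_marginal / (kernel.T @ u)

    # Transport plan: entry (i, j) is the mass of sample i sent to cluster j.
    return u[:, None] * kernel * v[None, :]

# Toy usage: softmax of random logits as stand-in model predictions.
rng = np.random.default_rng(0)
logits = rng.normal(size=(12, 3))
toy_probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
plan = sinkhorn_pseudo_labels(toy_probs)
pseudo_labels = plan.argmax(axis=1)
```

The marginal constraint is what keeps the pseudo-labels from collapsing into a single cluster. With a uniform `cluster_marginal` the plan assumes balanced clusters, which is exactly the limitation the abstract says POTA addresses by estimating cluster distributions adaptively.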

Problem

Research questions and friction points this paper is trying to address.

Short Text Clustering

Data Mining

Accuracy Improvement

Innovation

Methods, ideas, or system contributions that make the work stand out.

Probabilistic Optimal Transport

Attention Mechanism

Short Text Clustering

🔎 Similar Papers

Text Clustering as Classification with LLMs