Reliable Pseudo-labeling via Optimal Transport with Attention for Short Text Clustering

📅 2025-01-25
🤖 AI Summary
Short text clustering suffers from semantic sparsity, weak discriminative representation learning, and unreliable pseudo-labels. To address these challenges, we propose POTA, a novel framework that (1) integrates instance-level attention into optimal transport (OT) constraints to jointly model semantic consistency among instances and global cluster-level structure; (2) introduces semantic-regularized OT solving to adaptively estimate imbalanced cluster distributions and generate robust pseudo-labels; and (3) leverages these pseudo-labels to guide contrastive representation learning. Extensive experiments demonstrate that POTA achieves significant improvements over state-of-the-art methods across multiple short text clustering benchmarks, particularly under severe class imbalance. The framework exhibits strong generalization capability in heterogeneous and skewed data regimes. Our implementation is publicly available.
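To make the idea of instance-level attention in point (1) concrete, the following is a minimal sketch of one common way to score sample-to-sample semantic similarity: scaled dot-product attention over text embeddings. It is not taken from the POTA code. The function name `instance_attention`, the `temperature` parameter, and the choice to mask self-attention are all assumptions made for illustration.

```python
import numpy as np

def instance_attention(embeddings, temperature=1.0):
    """Row-normalised attention weights between samples.

    embeddings: (n, d) array of sentence embeddings.
    Returns an (n, n) matrix whose row i says how strongly sample i
    attends to every other sample. Hypothetical helper, not POTA's API.
    """
    n, d = embeddings.shape
    # Scaled dot-product similarity between every pair of samples.
    scores = embeddings @ embeddings.T / (np.sqrt(d) * temperature)
    # Assumption: a sample does not attend to itself.
    np.fill_diagonal(scores, -np.inf)
    # Numerically stable softmax over each row.
    scores = scores - scores.max(axis=1, keepdims=True)
    weights = np.exp(scores)
    return weights / weights.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
emb = rng.normal(size=(6, 8))      # toy embeddings: 6 samples, 8 dimensions
attn = instance_attention(emb)
```

Each row of `attn` is a probability distribution over the other samples, so it could serve as a soft neighbourhood structure when regularizing an assignment problem.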

πŸ“ Abstract
Short text clustering has gained significant attention in the data mining community. However, the limited valuable information contained in short texts often leads to low-discriminative representations, increasing the difficulty of clustering. This paper proposes a novel short text clustering framework, called Reliable extbf{P}seudo-labeling via extbf{O}ptimal extbf{T}ransport with extbf{A}ttention for Short Text Clustering ( extbf{POTA}), that generate reliable pseudo-labels to aid discriminative representation learning for clustering. Specially, extbf{POTA} first implements an instance-level attention mechanism to capture the semantic relationships among samples, which are then incorporated as a regularization term into an optimal transport problem. By solving this OT problem, we can yield reliable pseudo-labels that simultaneously account for sample-to-sample semantic consistency and sample-to-cluster global structure information. Additionally, the proposed OT can adaptively estimate cluster distributions, making extbf{POTA} well-suited for varying degrees of imbalanced datasets. Then, we utilize the pseudo-labels to guide contrastive learning to generate discriminative representations and achieve efficient clustering. Extensive experiments demonstrate extbf{POTA} outperforms state-of-the-art methods. The code is available at: href{https://github.com/YZH0905/POTA-STC/tree/main}{https://github.com/YZH0905/POTA-STC/tree/main}.
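As background for the pseudo-labeling step, here is a minimal sketch of how pseudo-labels can be obtained from an entropy-regularized OT problem with Sinkhorn iterations. This is a generic simplification, not the POTA formulation: it omits the attention-based regularization term and uses fixed cluster proportions (uniform by default) instead of estimating them adaptively. The function name `sinkhorn_pseudo_labels` and the parameters `epsilon` and `n_iters` are assumptions for illustration.

```python
import numpy as np

def sinkhorn_pseudo_labels(probs, cluster_weights=None, epsilon=0.1, n_iters=200):
    """Entropic OT assignment of n samples to k clusters.

    probs: (n, k) predicted cluster probabilities (rows sum to 1).
    cluster_weights: (k,) target cluster proportions; uniform if None
        (an assumption; POTA is described as estimating these adaptively).
    Returns the transport plan and hard pseudo-labels.
    """
    n, k = probs.shape
    a = np.full(n, 1.0 / n)                      # each sample carries equal mass
    if cluster_weights is None:
        b = np.full(k, 1.0 / k)
    else:
        b = np.asarray(cluster_weights, dtype=float)
    cost = -np.log(probs + 1e-12)                # low cost where the model is confident
    kernel = np.exp(-cost / epsilon)
    u = np.ones(n)
    v = np.ones(k)
    for _ in range(n_iters):
        u = a / (kernel @ v)                     # match row marginals
        v = b / (kernel.T @ u)                   # match column marginals
    plan = u[:, None] * kernel * v[None, :]
    return plan, plan.argmax(axis=1)

rng = np.random.default_rng(0)
logits = rng.normal(size=(12, 3))                # toy logits: 12 samples, 3 clusters
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
plan, labels = sinkhorn_pseudo_labels(probs)
```

The column sums of `plan` match the target cluster proportions, which is what keeps the pseudo-labels from collapsing onto a single cluster. Passing non-uniform `cluster_weights` is one simple way to experiment with imbalanced targets.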
Problem

Research questions and friction points this paper is trying to address.

Short Text Clustering
Data Mining
Accuracy Improvement
Innovation

Methods, ideas, or system contributions that make the work stand out.

Probabilistic Optimal Transport
Attention Mechanism
Short Text Clustering
Zhihao Yao
Tsinghua University
HCI
Jixuan Yin
School of Harbin Engineering University, Institute of Intelligent Systems Science and Engineering, Harbin, China
Bo Li
School of Harbin Engineering University, Institute of Intelligent Systems Science and Engineering, Harbin, China