🤖 AI Summary
This paper addresses the problem that modern document embeddings (e.g., GloVe, doc2vec) induce negative pairwise similarities, which corrupt the construction of the normalized Laplacian matrix and degrade spectral clustering performance. We systematically analyze the mechanisms by which such negative similarities impair graph-based clustering. To mitigate this issue, we propose a general similarity rectification framework comprising six strategies (including offset, truncation, and spectral shift), which enables, for the first time, the adaptation of interpretability methods developed for Term Vector Space embeddings to global embeddings like GloVe. Experiments demonstrate that our framework significantly improves the stability and success rate of normalized Laplacian spectral clustering with GloVe. Three rectification variants consistently enhance clustering accuracy across multiple benchmark datasets. Moreover, the framework extends the applicability of existing interpretability techniques to modern text embeddings, thereby strengthening both the robustness and interpretability of spectral clustering for textual documents.
📝 Abstract
This paper investigates the problem of Graph Spectral Clustering (GSC) with negative similarities, which arise from document embeddings other than the traditional Term Vector Space (such as doc2vec and GloVe). Solutions for combinatorial Laplacians and normalized Laplacians are discussed. An experimental investigation shows the advantages and disadvantages of six different solutions, some proposed in the literature and some in this research. The research demonstrates that GloVe embeddings frequently cause normalized-Laplacian-based GSC to fail because of negative similarities. Furthermore, applying methods that cure similarity negativity improves accuracy for both combinatorial and normalized-Laplacian-based GSC. It also makes explanation methods originally developed by the authors for Term Vector Space embeddings applicable to GloVe embeddings.
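The failure mode described above can be illustrated with a minimal sketch. It assumes cosine similarity between document vectors and the usual normalized Laplacian L = I - D^{-1/2} S D^{-1/2}. All function names are hypothetical, and the two rectifications shown (a simple offset and a truncation at zero) are generic illustrations, not necessarily any of the six solutions evaluated in the paper.

```python
import numpy as np

def cosine_similarity_matrix(X):
    """Pairwise cosine similarities with a zeroed diagonal (no self-loops)."""
    X = np.asarray(X, dtype=float)
    U = X / np.linalg.norm(X, axis=1, keepdims=True)
    S = np.clip(U @ U.T, -1.0, 1.0)  # guard against rounding outside [-1, 1]
    np.fill_diagonal(S, 0.0)
    return S

def normalized_laplacian(S):
    """L = I - D^{-1/2} S D^{-1/2}; requires strictly positive degrees."""
    d = S.sum(axis=1)
    if np.any(d <= 0):
        # With negative similarities a row sum can be zero or negative,
        # so the square root in D^{-1/2} is no longer defined over the reals.
        raise ValueError("non-positive degree: D^{-1/2} is undefined")
    inv_sqrt = 1.0 / np.sqrt(d)
    return np.eye(len(d)) - inv_sqrt[:, None] * S * inv_sqrt[None, :]

def rectify_offset(S):
    """Assumed illustrative fix: map cosine values from [-1, 1] to [0, 1]."""
    R = (S + 1.0) / 2.0
    np.fill_diagonal(R, 0.0)
    return R

def rectify_truncate(S):
    """Assumed illustrative fix: replace negative similarities with zero."""
    return np.maximum(S, 0.0)

# Toy "embeddings": two pairs of nearly opposite vectors (made-up data).
X = [[1.0, 0.0], [0.9, 0.1], [-1.0, 0.0], [-0.9, -0.1]]
S = cosine_similarity_matrix(X)
print("raw degrees:", S.sum(axis=1))  # every row sum is negative here

for name, rectify in [("offset", rectify_offset), ("truncate", rectify_truncate)]:
    L = normalized_laplacian(rectify(S))
    print(name, "eigenvalues:", np.round(np.linalg.eigvalsh(L), 4))
```

On the raw matrix, `normalized_laplacian` raises because every degree is negative. After either rectification the degrees are positive and the eigenvalues of the Laplacian lie in the expected range [0, 2]. The two fixes encode different assumptions: truncation discards all information about dissimilarity, while the offset preserves the ordering of similarities but makes the graph denser.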