AI Summary
Existing multi-view clustering methods are often domain-specific and rely on computationally expensive two-stage paradigms (representation learning followed by clustering), limiting generalization to heterogeneous data (e.g., images and tabular data). This paper proposes the first end-to-end deep multi-view clustering framework that jointly optimizes cross-view representation fusion and cluster assignment. Key contributions include: (i) a permutation-based self-supervised Canonical Correlation Analysis (CCA) objective, theoretically proven to asymptotically approximate supervised Linear Discriminant Analysis (LDA) representations, with a derived bound on pseudo-label error; and (ii) a pseudo-label consistency constraint to enhance robustness. Extensive experiments on 10 standard benchmarks demonstrate significant improvements over state-of-the-art methods, validating the framework's generalizability, robustness, and cross-modal adaptability.
Abstract
Fusing information from different modalities can enhance data analysis tasks, including clustering. However, existing multi-view clustering (MVC) solutions are limited to specific domains or rely on a suboptimal and computationally demanding two-stage procedure of representation learning followed by clustering. We propose an end-to-end deep learning-based MVC framework for general data (image, tabular, etc.). Our approach involves learning meaningful fused data representations with a novel permutation-based canonical correlation objective. Concurrently, we learn cluster assignments by identifying consistent pseudo-labels across multiple views. We demonstrate the effectiveness of our model using ten MVC benchmark datasets. Theoretically, we show that our model approximates the supervised linear discriminant analysis (LDA) representation. Additionally, we provide an error bound induced by false pseudo-label annotations.
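To make the canonical-correlation idea concrete, below is a minimal NumPy sketch of a plain (non-permutation) CCA score between two view embeddings: the sum of canonical correlations of the whitened cross-covariance, which a fusion network could maximize to align views. This is an illustrative baseline only; the paper's permutation-based objective and its deep, end-to-end training loop are not reproduced here.

```python
import numpy as np

def cca_correlation(z1, z2, eps=1e-6):
    """Sum of canonical correlations between two view embeddings
    z1, z2 of shape (n_samples, d). Higher means better-aligned views.
    `eps` regularizes the covariance matrices for numerical stability."""
    z1 = z1 - z1.mean(axis=0)
    z2 = z2 - z2.mean(axis=0)
    n = z1.shape[0]
    c11 = z1.T @ z1 / (n - 1) + eps * np.eye(z1.shape[1])
    c22 = z2.T @ z2 / (n - 1) + eps * np.eye(z2.shape[1])
    c12 = z1.T @ z2 / (n - 1)

    def inv_sqrt(c):
        # Inverse matrix square root via eigendecomposition (c is symmetric PD).
        w, v = np.linalg.eigh(c)
        return v @ np.diag(w ** -0.5) @ v.T

    # Canonical correlations are the singular values of the whitened
    # cross-covariance matrix.
    t = inv_sqrt(c11) @ c12 @ inv_sqrt(c22)
    return np.linalg.svd(t, compute_uv=False).sum()
```

For identical views the score approaches the embedding dimension d (each canonical correlation is 1), while independent views score near 0; a deep MVC model would use a differentiable variant of such a score as a training loss.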