🤖 AI Summary
Clustering high-dimensional discrete data is often hindered by high computational cost, sensitivity to sparsity, and methods that apply only to specific data types. This work proposes a deterministic dimensionality reduction framework that compresses binary, categorical, or count-valued high-dimensional data into low-dimensional continuous representations via weighted sums defined by a scaled positional encoding. The resulting mapping is injective, so distinct observations remain distinct after compression. Under mild regularity conditions, the compressed variables are approximately Gaussian, and separation between cluster centroids is preserved, so location-driven cluster structure remains identifiable. Simulation studies show accurate cluster recovery, and applications to real-world data, including Irish baby-name records and microbiome profiles, illustrate practical utility. The method is also substantially faster than commonly used dimensionality reduction techniques.
📝 Abstract
High-dimensional discrete data arise in many contemporary applications, including genomics, microbiome research, survey studies, and digital behavioral analysis. Clustering such data remains challenging because existing methods are often computationally demanding, sensitive to sparsity and discreteness, or designed for specific data types. We propose a deterministic dimension-reduction framework for clustering high-dimensional discrete observations. The method compresses each observation into a low-dimensional continuous representation through weighted sums defined by a scaled positional encoding, yielding a numerically stable transformation applicable to binary, categorical, and count-valued data. We establish several theoretical properties of the proposed compression. The mapping is injective, ensuring that distinct observations remain distinct after compression. Under mild regularity conditions, the compressed variables admit an approximate Gaussian representation, providing a theoretical basis for model-based clustering in the compressed space. We further show that separation between cluster centroids is preserved under compression, implying that location-driven cluster structure remains identifiable after dimension reduction. Extensive simulation studies demonstrate accurate cluster recovery across a wide range of realistic settings. The proposed approach is also computationally efficient, providing substantial speed improvements over dimension-reduction techniques commonly used in conjunction with clustering. Applications to Irish baby-name records and microbiome data further illustrate its practical utility. The proposed framework offers a scalable and broadly applicable approach to clustering high-dimensional discrete data.
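The abstract does not state the encoding formula, so the following is only a minimal sketch of one way a weighted positional encoding could compress discrete observations, not the authors' method. It assumes each observation takes values in 0..base-1, splits the columns into blocks, and reads each block as the digits of a base-`base` fraction. The names `compress`, `n_blocks`, and `base` are hypothetical, chosen for illustration.

```python
import numpy as np


def compress(X, n_blocks, base):
    """Map each row of a discrete matrix to n_blocks continuous coordinates.

    Assumption (not from the source): every entry lies in 0..base-1, and each
    block of columns is treated as the digits of a base-`base` fraction,
    i.e. z = sum_j x_j * base**-(j + 1).
    """
    X = np.asarray(X)
    n, p = X.shape
    if X.min() < 0 or X.max() >= base:
        raise ValueError("entries must lie in 0..base-1")

    # Split the column indices into n_blocks nearly equal contiguous blocks.
    blocks = np.array_split(np.arange(p), n_blocks)

    Z = np.empty((n, n_blocks))
    for k, idx in enumerate(blocks):
        # Positional weights base**-1, base**-2, ... within the block.
        weights = np.power(float(base), -(np.arange(len(idx)) + 1.0))
        Z[:, k] = X[:, idx] @ weights
    return Z


# One binary observation with 6 features, compressed to 2 coordinates.
# Block [1, 0, 1] encodes to 0.625 and block [0, 0, 1] encodes to 0.125.
print(compress([[1, 0, 1, 0, 0, 1]], n_blocks=2, base=2))
```

In exact arithmetic, a finite digit sequence has a unique base-`base` fraction, which is why a mapping of this kind is injective. In floating point that uniqueness holds only while the blocks are short enough for the smallest weight to stay within machine precision, so long blocks would need more coordinates or a different scaling.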