🤖 AI Summary
This study addresses the problem of simultaneously identifying latent cluster structure, estimating model parameters, and selecting the number of clusters from continuous observations in an unsupervised setting. To this end, the authors propose a general semiparametric clusterwise elliptical distribution model and develop a two-stage algorithm: initial cluster assignments and parameter estimates are obtained via weighted least squares with a separation penalty, followed by alternating pseudo-maximum likelihood estimation and cluster reassignment. The method achieves, for the first time, asymptotic semiparametric efficiency, asymptotically optimal clustering accuracy, and consistent selection of the number of clusters under general semiparametric clusterwise elliptical distributions, thereby relaxing the restrictive Gaussian assumption commonly adopted in existing approaches. Theoretical analysis establishes these asymptotic properties, while simulations and real-data analyses demonstrate strong finite-sample performance.
📝 Abstract
We introduce a general semiparametric clusterwise elliptical distribution to assess how latent cluster structure shapes continuous outcomes. Using a subjectwise representation, we first estimate cluster-specific mean vectors and a cluster-invariant scatter matrix by minimizing a weighted sum of squares criterion augmented with a separation penalty; we provide an initialization scheme and a computational algorithm with guaranteed convergence. This initial estimator consistently recovers the true clusters and seeds a second phase that alternates pseudo-maximum likelihood (or pseudo-maximum marginal likelihood) estimation with cluster reassignment, yielding asymptotic semiparametric efficiency and an optimal clustering that asymptotically maximizes the probability of correct membership. We also propose a semiparametric information criterion for selecting the number of clusters. Monte Carlo simulations and empirical applications demonstrate strong finite-sample performance and practical value.
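To make the alternating structure described above concrete, the following is a minimal sketch of the general idea only: it alternates between reassigning each observation to the nearest cluster mean in Mahalanobis distance and re-estimating cluster-specific means together with a single cluster-invariant scatter matrix. It is a simplified stand-in (a k-means-style alternation), not the authors' estimator: the weights, the separation penalty, the pseudo-likelihood step, and the information criterion are all omitted. The function names `farthest_point_init` and `fit_shared_scatter`, the farthest-point initialization, the identity starting scatter, and the small ridge term are assumptions made for illustration.

```python
import numpy as np


def farthest_point_init(X, k):
    """Deterministic initial means (illustrative choice, not from the paper):
    start at the first row, then repeatedly add the row farthest from the
    centers chosen so far (Euclidean distance)."""
    centers = [X[0]]
    for _ in range(1, k):
        d2 = np.min([np.sum((X - c) ** 2, axis=1) for c in centers], axis=0)
        centers.append(X[np.argmax(d2)])
    return np.array(centers, dtype=float)


def mahalanobis_sq(X, means, scatter_inv):
    """Squared Mahalanobis distance from every row of X to every mean, shape (n, k)."""
    diff = X[:, None, :] - means[None, :, :]
    return np.einsum("nkd,de,nke->nk", diff, scatter_inv, diff)


def fit_shared_scatter(X, k, n_iter=50, ridge=1e-8):
    """Alternate reassignment and re-estimation until the labels stop changing.

    Returns (labels, means, scatter), where scatter is one matrix shared by
    all clusters. Hypothetical helper, a sketch under the stated assumptions.
    """
    n, d = X.shape
    means = farthest_point_init(X, k)
    scatter = np.eye(d)  # first pass uses plain Euclidean distance
    labels = np.full(n, -1)
    for _ in range(n_iter):
        # Reassignment step: nearest mean under the current shared scatter.
        new_labels = mahalanobis_sq(X, means, np.linalg.inv(scatter)).argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Estimation step: cluster-specific means (empty clusters keep their mean).
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:
                means[j] = members.mean(axis=0)
        # Estimation step: pooled within-cluster scatter, shared across clusters.
        resid = X - means[labels]
        scatter = resid.T @ resid / n + ridge * np.eye(d)
    return labels, means, scatter


# Usage on synthetic data: two well-separated groups sharing the same spread.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(size=(50, 2)), rng.normal(size=(50, 2)) + 20.0])
labels, means, scatter = fit_shared_scatter(X, k=2)
```

Sharing one scatter matrix across clusters mirrors the cluster-invariant scatter in the abstract; a Gaussian-style pooled covariance is used here purely for simplicity and carries none of the efficiency guarantees claimed for the semiparametric procedure.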