🤖 AI Summary
Traditional coreset methods for large-scale nonparametric Bayesian inference fail because they rely on likelihood functions, which are often intractable or undefined in nonparametric settings.
Method: This paper proposes a variational coreset construction framework based on predictive distribution matching—replacing KL-divergence minimization with optimization of predictive consistency between the randomized posterior induced by a weighted subset and the full-data posterior. Implemented via a predictive recursion algorithm, it accommodates any samplable nonparametric prior (e.g., random partition processes, Dirichlet process mixtures) without requiring explicit likelihood evaluation.
Contribution/Results: Experiments on random partitioning and density estimation demonstrate that the method achieves near-full-data predictive performance using only 1–5% of the data, while accelerating computation by 10–100×. It is the first coreset construction approach for nonparametric Bayesian inference that is both computationally efficient and theoretically grounded, offering broad applicability across nonparametric models.
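The summary mentions that the method is implemented via a predictive recursion algorithm. The paper's exact recursion is not shown here, but the classical predictive recursion of Newton for estimating a mixing density gives the flavor: each observation updates the current density estimate by a convex combination with its conditional posterior. The sketch below is a minimal grid-based version, assuming a Gaussian kernel and the standard step sizes w_i = (i + 1)^(-a); the function name `predictive_recursion` and all parameter choices are illustrative, not the paper's API.

```python
import numpy as np

def predictive_recursion(data, grid, kernel, a=0.67):
    """Newton-style predictive recursion for a mixing density on a grid.

    data   : observations x_1, ..., x_n (order matters; shuffle first)
    grid   : discretized support for the latent parameter theta
    kernel : kernel(x, grid) -> likelihood values k(x | theta) on the grid
    a      : step-size exponent in (0.5, 1]; weights w_i = (i + 1)**(-a)
    """
    dtheta = grid[1] - grid[0]
    f = np.ones_like(grid) / (grid[-1] - grid[0])  # uniform initial guess
    for i, x in enumerate(data, start=1):
        w = (i + 1.0) ** (-a)
        lik = kernel(x, grid)
        denom = np.sum(lik * f) * dtheta  # marginal density of x under current f
        # convex combination of the old estimate and its one-step posterior
        f = (1.0 - w) * f + w * lik * f / denom
    return f

# Toy example: normal kernel, data from a well-separated two-component mixture
rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(-2, 0.5, 200), rng.normal(2, 0.5, 200)])
grid = np.linspace(-6, 6, 400)
kern = lambda x, th: np.exp(-0.5 * (x - th) ** 2) / np.sqrt(2 * np.pi)
f_hat = predictive_recursion(rng.permutation(data), grid, kern)
# f_hat is a nonnegative density estimate over the grid, integrating to ~1
```

Each update preserves normalization by construction, which is why the recursion needs only one pass through the data; this single-pass property is what makes predictive recursions attractive for subset-selection schemes like coresets.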
📝 Abstract
Modern data analysis often involves massive datasets with hundreds of thousands of observations, making traditional inference algorithms computationally prohibitive. Coresets are data-reduction methods that select a smaller weighted subset of observations while preserving comparable learning performance. Conventional coreset approaches determine the weights by minimizing the Kullback-Leibler (KL) divergence between the likelihood functions of the full and weighted datasets, which makes them ill-posed for nonparametric models, where the likelihood is often intractable. We propose an alternative variational method that employs randomized posteriors and finds weights to match the unknown posterior predictive distributions conditioned on the full and reduced datasets. Our approach provides a general algorithm based on predictive recursions suitable for nonparametric priors. We evaluate the performance of the proposed coreset construction on diverse problems, including random partitions and density estimation.
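To make the idea of "finding weights to match predictive distributions" concrete, here is a deliberately simplified toy, not the paper's algorithm: instead of the variational objective over randomized posteriors described above, it matches weighted kernel density estimates in squared L2 distance on a grid, using projected gradient descent with nonnegative weights. All names and choices (bandwidth, step size, subset size) are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
data = rng.normal(0.0, 1.0, 1000)   # full dataset
subset = data[:50]                  # candidate coreset points (5% of the data)
grid = np.linspace(-4, 4, 200)
h = 0.3                             # Gaussian kernel bandwidth

# K[g, j] = Gaussian kernel at grid point g centered on subset point j,
# so K @ w is the weighted kernel density estimate of the coreset
K = np.exp(-0.5 * ((grid[:, None] - subset[None, :]) / h) ** 2)
K /= h * np.sqrt(2.0 * np.pi)

# full-data kernel density estimate (the "predictive" target in this toy)
K_full = np.exp(-0.5 * ((grid[:, None] - data[None, :]) / h) ** 2)
K_full /= h * np.sqrt(2.0 * np.pi)
p_full = K_full @ np.full(len(data), 1.0 / len(data))

# projected gradient descent on ||K w - p_full||^2 over nonnegative weights;
# step size 1 / ||K||_2^2 guarantees monotone decrease of the objective
w = np.full(len(subset), 1.0 / len(subset))
lr = 1.0 / (np.linalg.norm(K, ord=2) ** 2)
for _ in range(500):
    grad = K.T @ (K @ w - p_full)
    w = np.clip(w - lr * grad, 0.0, None)  # project onto the nonnegative orthant
```

The optimized weights need not sum exactly to one; in this toy the weighted density is pulled toward the full-data estimate, which is the loose analogue of the predictive-matching objective: the reduced, reweighted dataset should induce the same predictions as the full one.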