Two-sample comparison through additive tree models for density ratios

📅 2025-08-05
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This paper addresses the nonparametric estimation of the ratio between two probability density functions. The authors propose a novel density-ratio estimation method based on an additive tree model inspired by Bayesian Additive Regression Trees (BART). Their approach reformulates density-ratio estimation as a supervised learning problem via a carefully designed balancing loss function, enabling both optimization-based and generalized Bayesian modeling perspectives. To the authors' knowledge, this is the first work to enable full Bayesian inference and uncertainty quantification for BART-like models in density-ratio estimation: the model is fitted jointly using forward gradient boosting and backward conjugate-prior sampling, with the balancing loss treated as a pseudo-likelihood. The method maintains high accuracy even for high-dimensional, complex distributions. An empirical evaluation on assessing generative models for microbiome data demonstrates its reliability and its ability to faithfully characterize estimation uncertainty.

📝 Abstract
The ratio of two densities characterizes their differences. We consider learning the density ratio given i.i.d. observations from each of the two distributions. We propose additive tree models for the density ratio along with efficient algorithms for training these models using a new loss function called the balancing loss. With this loss, additive tree models for the density ratio can be trained using algorithms originally designed for supervised learning. Specifically, they can be trained from both an optimization perspective that parallels tree boosting and from a (generalized) Bayesian perspective that parallels Bayesian additive regression trees (BART). For the former, we present two boosting algorithms -- one based on forward-stagewise fitting and the other based on gradient boosting, both of which produce a point estimate for the density ratio function. For the latter, we show that due to the loss function's resemblance to an exponential family kernel, the new loss can serve as a pseudo-likelihood for which conjugate priors exist, thereby enabling effective generalized Bayesian inference on the density ratio using backfitting samplers designed for BART. The resulting uncertainty quantification on the inferred density ratio is critical for applications involving high-dimensional and complex distributions in which uncertainty given limited data can often be substantial. We provide insights into the balancing loss through its close connection to the exponential loss in binary classification and to the variational form of f-divergence, in particular that of the squared Hellinger distance. Our numerical experiments demonstrate the accuracy of the proposed approach while providing unique capabilities in uncertainty quantification. We demonstrate the application of our method in a case study involving assessing the quality of generative models for microbiome compositional data.
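The abstract notes a close connection between the balancing loss and the exponential loss in binary classification. That connection reflects a standard route to density-ratio estimation: label samples from the two distributions as two classes, fit a probabilistic classifier, and read the density ratio off the class probabilities. The sketch below illustrates this classification route on a 1D Gaussian example using a plain logistic model; the paper itself uses additive trees with the balancing loss, so the model, learning rate, and sample sizes here are illustrative assumptions, not the authors' algorithm.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two samples: p = N(0, 1) and q = N(1, 1).
# The true log density ratio is log p(x)/q(x) = 0.5 - x.
x_p = rng.normal(0.0, 1.0, 3000)
x_q = rng.normal(1.0, 1.0, 3000)

# Classification reformulation: label p-samples 1 and q-samples 0.
X = np.concatenate([x_p, x_q])
y = np.concatenate([np.ones_like(x_p), np.zeros_like(x_q)])

# Fit logistic regression P(y=1 | x) = sigmoid(a*x + b) by gradient descent.
a, b = 0.0, 0.0
lr = 0.5
for _ in range(5000):
    p1 = 1.0 / (1.0 + np.exp(-(a * X + b)))
    a -= lr * np.mean((p1 - y) * X)
    b -= lr * np.mean(p1 - y)

def log_ratio(x):
    """With equal sample sizes, log p(x)/q(x) equals the classifier logit."""
    return a * x + b

print(log_ratio(0.0))  # should be close to the true value 0.5
```

Because the two Gaussians share a variance, the true log-ratio is exactly linear in x, so a logistic model recovers it; for the complex, high-dimensional distributions the paper targets, a flexible model such as its additive trees takes the place of the linear logit.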
Problem

Research questions and friction points this paper is trying to address.

Estimating density ratios between two distributions efficiently
Developing additive tree models with balancing loss function
Enabling uncertainty quantification for high-dimensional data analysis
Innovation

Methods, ideas, or system contributions that make the work stand out.

Additive tree models for density ratio learning
Balancing loss enables supervised learning algorithms
Generalized Bayesian inference with conjugate priors
Naoki Awaya (Waseda University, Statistics)
Yuliang Xu (Department of Statistics, University of Chicago)
Li Ma (Department of Statistics, University of Chicago)