🤖 AI Summary
This work addresses the challenge of scaling Wasserstein distance computation to large-scale retrieval in high dimensions, where the cubic time complexity of exact computation is prohibitive and existing tree-based approximations—such as quadtrees—suffer from limited depth, high preprocessing overhead, and insufficient accuracy in high-dimensional settings. To overcome these limitations, the authors propose kd-Flowtree, the first method to integrate kd-trees into Wasserstein distance approximation by constructing optimal transport on a kd-tree embedding. This approach significantly reduces preprocessing time and mitigates the sparsity-induced curse of dimensionality. Theoretical analysis provides a data-size-independent probabilistic upper bound on nearest neighbor search accuracy. Empirical results on real high-dimensional datasets demonstrate that kd-Flowtree simultaneously outperforms existing methods in both efficiency and approximation accuracy.
📝 Abstract
The Wasserstein distance is a discrepancy measure between probability distributions, defined by an optimal transport problem. It has been used for various tasks such as retrieving similar items in high-dimensional image or text data. In retrieval applications, however, the Wasserstein distance is calculated repeatedly, and its cubic time complexity with respect to input size renders it unsuitable for large-scale datasets. Recently, tree-based approximation methods have been proposed to address this bottleneck. For example, the Flowtree algorithm computes transport on a quadtree and evaluates its cost using the ground metric, and clustering-tree approaches have been reported to achieve high accuracy. However, these existing trees often incur significant construction time for preprocessing, and crucially, standard quadtrees cannot grow deep enough in high-dimensional spaces, resulting in poor approximation accuracy. In this paper, we propose kd-Flowtree, a Wasserstein distance approximation method that uses a kd-tree for data embedding. Since kd-trees can grow sufficiently deep and adaptively even in high-dimensional cases, kd-Flowtree maintains good approximation accuracy in such settings. In addition, kd-trees can be constructed more quickly than quadtrees, which reduces the total computation time required for nearest neighbor search, including preprocessing. We provide a probabilistic upper bound on the nearest-neighbor search accuracy of kd-Flowtree, and show that this bound is independent of the dataset size. In numerical experiments, we demonstrate that kd-Flowtree outperforms existing Wasserstein distance approximation methods on retrieval tasks with real-world data.
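The Flowtree idea the abstract describes can be sketched in a few lines: embed the union of both supports in a tree, greedily match mass within each subtree (pushing unmatched surplus up to the parent), and then price the resulting flow with the true ground metric rather than tree distances. The sketch below uses a median-split kd-tree, as in the proposed method; all function names are illustrative and this is a minimal sketch, not the paper's implementation.

```python
import numpy as np

def build_kdtree(points, idx, depth=0):
    """Recursively split points by the median along cycling axes.
    Returns ('leaf', idx) or ('node', left_subtree, right_subtree)."""
    if len(idx) <= 1:
        return ('leaf', idx)
    axis = depth % points.shape[1]
    vals = points[idx, axis]
    thr = np.median(vals)
    left, right = idx[vals <= thr], idx[vals > thr]
    if len(left) == 0 or len(right) == 0:  # degenerate split: stop here
        return ('leaf', idx)
    return ('node',
            build_kdtree(points, left, depth + 1),
            build_kdtree(points, right, depth + 1))

def tree_flow(node, mass_a, mass_b, pairs):
    """Greedily match mass of the two distributions within each subtree,
    pushing unmatched surplus up to the parent; records (i, j, flow)."""
    if node[0] == 'leaf':
        ea = [[i, mass_a[i]] for i in node[1] if mass_a[i] > 0]
        eb = [[j, mass_b[j]] for j in node[1] if mass_b[j] > 0]
    else:
        la, lb = tree_flow(node[1], mass_a, mass_b, pairs)
        ra, rb = tree_flow(node[2], mass_a, mass_b, pairs)
        ea, eb = la + ra, lb + rb
    while ea and eb:  # match remaining supply against demand here
        (i, ma), (j, mb) = ea[-1], eb[-1]
        f = min(ma, mb)
        pairs.append((i, j, f))
        ea[-1][1] -= f
        eb[-1][1] -= f
        if ea[-1][1] <= 1e-12:
            ea.pop()
        if eb[-1][1] <= 1e-12:
            eb.pop()
    return ea, eb

def flowtree_distance(points, mass_a, mass_b):
    """Approximate W1(mass_a, mass_b): route flow on the kd-tree, then
    price each matched pair with the true Euclidean ground metric."""
    root = build_kdtree(points, np.arange(len(points)))
    pairs = []
    tree_flow(root, mass_a, mass_b, pairs)
    return sum(f * np.linalg.norm(points[i] - points[j])
               for i, j, f in pairs)
```

For example, with supports at 0, 1, 2 on the line, `mass_a = [1, 0, 0]` and `mass_b = [0, 0, 1]`, the routine matches all mass between points 0 and 2 and returns 2.0, which here coincides with the exact 1-Wasserstein distance. Because only the matching is taken from the tree while costs use the ground metric, the estimate is typically far more accurate than the tree metric itself.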