🤖 AI Summary
This paper studies the $k$-means clustering problem on a set of $n$ line segments in $\mathbb{R}^d$, where the objective is to select $k$ centers minimizing the sum, over all segments, of the integrated distance from the points along each segment to that segment's nearest center. Addressing the lack of efficient approximation algorithms with theoretical guarantees for this continuous geometric setting, we propose the first $\varepsilon$-coreset construction applicable to arbitrary line segment inputs: for constant $k$ and $\varepsilon$, the coreset size is $O(\log^2 n)$ and the construction time is $O(nd)$. The construction supports extensions including robustness to outliers, alternative distance functions such as $M$-estimators, balanced clustering, and unique cluster assignment. Our method leverages the integral distance $D(S,x)$ and combines weighted sampling with geometric space partitioning. Empirical evaluation on video object tracking shows less than 1% loss in clustering accuracy and a several-fold speedup over baseline methods, supporting both the theoretical guarantees and the practical efficiency of the approach.
📝 Abstract
We study the $k$-means problem for a set $\mathcal{S} \subseteq \mathbb{R}^d$ of $n$ segments, aiming to find $k$ centers $X \subseteq \mathbb{R}^d$ that minimize
$D(\mathcal{S},X) := \sum_{S \in \mathcal{S}} \min_{x \in X} D(S,x)$, where $D(S,x) := \int_{p \in S} |p - x| \, dp$
measures the total distance from each point along a segment to a center. Variants of this problem include handling outliers, employing alternative distance functions such as M-estimators, weighting distances to achieve balanced clustering, or enforcing unique cluster assignments. For any $\varepsilon > 0$, an $\varepsilon$-coreset is a weighted subset $C \subseteq \mathbb{R}^d$ that approximates $D(\mathcal{S},X)$ within a factor of $1 \pm \varepsilon$ for any set of $k$ centers, enabling efficient streaming, distributed, or parallel computation. We propose the first coreset construction that provably handles arbitrary input segments. For constant $k$ and $\varepsilon$, it produces a coreset of size $O(\log^2 n)$ computable in $O(nd)$ time. Experiments, including a real-time video tracking application, demonstrate substantial speedups with minimal loss in clustering accuracy, confirming both the practical efficiency and theoretical guarantees of our method.
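To make the objective concrete, the following is a minimal sketch of how $D(S,x)$ and $D(\mathcal{S},X)$ could be evaluated numerically. It is an illustration only, not the paper's algorithm and not a coreset construction. It assumes each segment is given by its two endpoints and approximates the integral with a midpoint rule; the function names `segment_cost` and `clustering_cost` and the `samples` parameter are hypothetical.

```python
import math


def segment_cost(a, b, x, samples=1000):
    """Approximate D(S, x), the integral of |p - x| over the segment S = [a, b].

    Assumption: the segment is parameterized as p(t) = a + t * (b - a), t in [0, 1],
    so the integral equals length(S) times the average of |p(t) - x| over t.
    """
    length = math.dist(a, b)
    total = 0.0
    for i in range(samples):
        t = (i + 0.5) / samples  # midpoint of the i-th sub-interval of [0, 1]
        p = [ai + t * (bi - ai) for ai, bi in zip(a, b)]
        total += math.dist(p, x)
    return length * total / samples


def clustering_cost(segments, centers, samples=1000):
    """Approximate D(S, X): each whole segment is charged to the one center
    that minimizes its integrated distance, matching the min over x in X."""
    return sum(
        min(segment_cost(a, b, x, samples) for x in centers)
        for a, b in segments
    )


if __name__ == "__main__":
    # Two unit-spaced horizontal segments in the plane, one center at each midpoint.
    segments = [((0.0, 0.0), (2.0, 0.0)), ((10.0, 0.0), (12.0, 0.0))]
    centers = [(1.0, 0.0), (11.0, 0.0)]
    print(clustering_cost(segments, centers))
```

In this toy input each segment has length 2 and its center sits at the midpoint, so each contributes $\int_0^2 |t - 1| \, dt = 1$ and the printed total should be close to 2. Note that the minimum is taken once per segment rather than per point, so a segment is never split between two centers in this formulation.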