Linear time small coresets for k-mean clustering of segments with applications

📅 2025-11-16

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This paper studies the $k$-means clustering problem on a set of $n$ line segments in $\mathbb{R}^d$, where the objective is to select $k$ centers minimizing the integral of Euclidean distances from all points along the segments to their nearest center. Addressing the lack of efficient approximation algorithms with theoretical guarantees for this continuous geometric setting, we propose the first $\varepsilon$-coreset construction applicable to arbitrary line segment inputs: for constant $k$ and $\varepsilon$, the coreset size is $O(\log^2 n)$ and the construction time is $O(nd)$. The construction supports extensions including robustness to outliers, alternative distance functions such as $M$-estimators, balanced clustering, and unique cluster assignment. Our method leverages a novel integral distance metric $D(S,x)$ and integrates weighted sampling with geometric space partitioning. Empirical evaluation on video object tracking demonstrates less than 1% clustering accuracy loss and several-fold speedup over baseline methods, confirming both strong theoretical guarantees and practical efficiency.

📝 Abstract

We study the $k$-means problem for a set $\mathcal{S} \subseteq \mathbb{R}^d$ of $n$ segments, aiming to find $k$ centers $X \subseteq \mathbb{R}^d$ that minimize $D(\mathcal{S},X) := \sum_{S \in \mathcal{S}} \min_{x \in X} D(S,x)$, where $D(S,x) := \int_{p \in S} |p - x| dp$ measures the total distance from each point along a segment to a center. Variants of this problem include handling outliers, employing alternative distance functions such as M-estimators, weighting distances to achieve balanced clustering, or enforcing unique cluster assignments. For any $\varepsilon > 0$, an $\varepsilon$-coreset is a weighted subset $C \subseteq \mathbb{R}^d$ that approximates $D(\mathcal{S},X)$ within a factor of $1 \pm \varepsilon$ for any set of $k$ centers, enabling efficient streaming, distributed, or parallel computation. We propose the first coreset construction that provably handles arbitrary input segments. For constant $k$ and $\varepsilon$, it produces a coreset of size $O(\log^2 n)$ computable in $O(nd)$ time. Experiments, including a real-time video tracking application, demonstrate substantial speedups with minimal loss in clustering accuracy, confirming both the practical efficiency and theoretical guarantees of our method.
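To make the objective concrete, the following minimal Python sketch evaluates $D(S,x)$ and $D(\mathcal{S},X)$ numerically for segments given by their two endpoints. It is an illustration of the definitions above, not the paper's algorithm: the function names `segment_cost` and `clustering_cost`, the midpoint-rule quadrature, and the sample count `m` are all assumptions introduced here.

```python
import math

def segment_cost(a, b, x, m=1000):
    """Approximate D(S, x): the integral of |p - x| over points p on segment [a, b].

    Uses the midpoint rule with m sub-intervals (an illustrative choice,
    not something specified by the paper).
    """
    length = math.dist(a, b)
    total = 0.0
    for i in range(m):
        t = (i + 0.5) / m  # midpoint of the i-th sub-interval of [0, 1]
        p = tuple(ai + t * (bi - ai) for ai, bi in zip(a, b))
        total += math.dist(p, x)
    return length * total / m

def clustering_cost(segments, centers, m=1000):
    """D(S, X): each segment is charged to the single center minimizing D(S, x)."""
    return sum(
        min(segment_cost(a, b, x, m) for x in centers)
        for a, b in segments
    )

# Two segments of length 2, each with a center at its midpoint.
segments = [((0.0, 0.0), (2.0, 0.0)), ((10.0, 0.0), (10.0, 2.0))]
centers = [(1.0, 0.0), (10.0, 1.0)]
cost = clustering_cost(segments, centers)
```

Note that the minimum is taken per segment, outside the integral, so every point of a segment is charged to the same center. A naive evaluation like this touches every segment for every candidate center set, which is the cost a small coreset is meant to avoid.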

Problem

Research questions and friction points this paper is trying to address.

Efficient k-means clustering for continuous line segments

Constructing small coresets to approximate segment clustering

Enabling scalable computation with minimal accuracy loss
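The $1 \pm \varepsilon$ approximation requirement can be checked for a given center set as in the rough Python sketch below. Several things here are assumptions rather than details from the paper: the helper names `weighted_point_cost` and `within_eps`, the choice to define the cost of a weighted point set as the weighted sum of distances to the nearest center, and the stand-in "coreset", which is just an evenly spaced discretization of a single segment and not the paper's construction.

```python
import math

def weighted_point_cost(points, weights, centers):
    """Assumed cost of a weighted point set: sum of w * distance to the nearest center."""
    return sum(
        w * min(math.dist(c, x) for x in centers)
        for c, w in zip(points, weights)
    )

def within_eps(exact, approx, eps):
    """True if approx lies in [(1 - eps) * exact, (1 + eps) * exact]."""
    return (1 - eps) * exact <= approx <= (1 + eps) * exact

# Naive stand-in for a coreset: m evenly spaced midpoints of one segment,
# each weighted by length / m (hypothetical; not the paper's sampling scheme).
a, b = (0.0, 0.0), (2.0, 0.0)
m = 8
length = math.dist(a, b)
points = [
    tuple(ai + (i + 0.5) / m * (bi - ai) for ai, bi in zip(a, b))
    for i in range(m)
]
weights = [length / m] * m

exact = 2.0  # D(S, x) for x = (0, 0): the integral of s over [0, 2]
approx = weighted_point_cost(points, weights, [(0.0, 0.0)])
ok = within_eps(exact, approx, eps=0.01)
```

A true $\varepsilon$-coreset must pass this check for every set of $k$ centers simultaneously. Testing one center set, as done here, only illustrates the inequality.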

Innovation

Methods, ideas, or system contributions that make the work stand out.

Linear time coreset construction for segments

Handles arbitrary input segments with guarantees

Enables efficient streaming and distributed computation
