🤖 AI Summary
To address regression model training under scarce labeled data, this paper proposes a passive, model-agnostic density-aware farthest point sampling (DA-FPS) method. DA-FPS optimizes the feature-space distribution of the training set to minimize an upper bound on the expected prediction error: leveraging Lipschitz continuity, it constructs an estimable surrogate based on the weighted fill distance and proves that DA-FPS yields approximate minimizers of a data-driven estimate of this quantity. Unlike conventional sampling strategies, DA-FPS jointly accounts for spatial coverage and local data density. Experiments with two regression models on three benchmark datasets show that DA-FPS significantly reduces the mean absolute prediction error, consistently outperforming baselines such as random sampling and standard FPS. The method thus provides an efficient, model-agnostic pre-processing strategy for regression with limited labeling budgets.
📝 Abstract
We focus on training machine learning regression models in scenarios where the availability of labeled training data is limited due to computational constraints or high labeling costs. Selecting suitable training sets from unlabeled data is therefore essential for balancing performance and efficiency. For training-set selection, we consider passive, model-agnostic sampling methods that rely only on the data's feature representations. We derive an upper bound on the expected prediction error of Lipschitz continuous regression models that depends linearly on the weighted fill distance of the training set, a quantity that can be estimated from the data features alone. We introduce "Density-Aware Farthest Point Sampling" (DA-FPS), a novel sampling method. We prove that DA-FPS provides approximate minimizers for a data-driven estimate of the weighted fill distance, thereby aiming to minimize our derived bound. We conduct experiments using two regression models across three datasets. The results demonstrate that DA-FPS significantly reduces the mean absolute prediction error compared to other sampling strategies.
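To make the idea concrete, here is a minimal sketch of a density-aware greedy selection in the spirit described above. The paper's exact algorithm, density weights, and fill-distance estimator are not given here, so everything below is an assumption: we use a simple Gaussian kernel density estimate as the weights, greedily pick the point maximizing the density-weighted distance to the current selection (plain FPS would drop the weights), and estimate the weighted fill distance as the maximum weighted distance from any data point to its nearest selected point.

```python
import numpy as np

def weighted_fill_distance(X, idx, weights):
    # Weighted fill distance estimate (assumed form): max over all points of
    # weight * distance to the nearest selected training point.
    sel = X[idx]
    d = np.linalg.norm(X[:, None, :] - sel[None, :, :], axis=-1).min(axis=1)
    return float((weights * d).max())

def da_fps(X, k, bandwidth=1.0, seed=0):
    # Hypothetical density-aware FPS sketch, not the paper's exact algorithm.
    # Weights: a simple Gaussian kernel density estimate (an assumption).
    rng = np.random.default_rng(seed)
    n = len(X)
    sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    w = np.exp(-sq_dists / (2.0 * bandwidth**2)).mean(axis=1)

    idx = [int(rng.integers(n))]                       # random start point
    mind = np.linalg.norm(X - X[idx[0]], axis=1)       # dist to selection
    for _ in range(k - 1):
        # Greedy step: farthest point under density weighting.
        j = int(np.argmax(w * mind))
        idx.append(j)
        mind = np.minimum(mind, np.linalg.norm(X - X[j], axis=1))
    return np.array(idx), w
```

Because the selection is greedy and nested, enlarging the budget `k` can only shrink the weighted fill distance estimate, which is the quantity the derived error bound depends on.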