The Impact of Bootstrap Sampling Rate on Random Forest Performance in Regression Tasks

📅 2025-11-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
The optimal bootstrap sampling ratio (BR) for random forest (RF) regression remains poorly understood, with BR = 1.0 conventionally adopted without data-adaptive justification. Method: We systematically investigate BR’s impact on RF regression performance across 39 heterogeneous regression datasets, evaluating 16 RF configurations under repeated two-fold cross-validation for BR ∈ [0.2, 5.0]. Contribution/Results: We discover that intrinsic data characteristics—particularly global structural strength versus local target variance and noise level—critically govern the optimal BR: strong global structure favors high BR (>1.0), whereas high local variance or noise benefits low BR (<1.0), revealing BR’s pivotal role in the bias–variance trade-off. Only four datasets achieve minimal mean squared error (MSE) at BR = 1.0; adaptive BR tuning yields statistically significant MSE reduction on average. This work establishes the first interpretable, data-driven mapping between dataset features and optimal BR, providing a practical, generalizable hyperparameter optimization paradigm for RF regression.

📝 Abstract
Random Forests (RFs) typically train each tree on a bootstrap sample of the same size as the training set, i.e., bootstrap rate (BR) equals 1.0. We systematically examine how varying BR from 0.2 to 5.0 affects RF performance across 39 heterogeneous regression datasets and 16 RF configurations, evaluating with repeated two-fold cross-validation and mean squared error. Our results demonstrate that tuning the BR can yield significant improvements over the default: the best setup relied on BR ≤ 1.0 for 24 datasets, BR > 1.0 for 15, and BR = 1.0 was optimal in only 4 cases. We establish a link between dataset characteristics and the preferred BR: datasets with strong global feature-target relationships favor higher BRs, while those with higher local target variance benefit from lower BRs. To further investigate this relationship, we conducted experiments on synthetic datasets with controlled noise levels. These experiments reproduce the observed bias-variance trade-off: in low-noise scenarios, higher BRs effectively reduce model bias, whereas in high-noise settings, lower BRs help reduce model variance. Overall, BR is an influential hyperparameter that should be tuned to optimize RF regression models.
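The BR tuning described above can be sketched with scikit-learn, whose RandomForestRegressor exposes the bootstrap sample size via its `max_samples` parameter. Note that scikit-learn only supports fractions in (0, 1] (with `None` meaning BR = 1.0), so this sketch covers only the BR ≤ 1.0 part of the paper's grid; the dataset and BR values are illustrative assumptions, not the authors' setup.

```python
# Hedged sketch: tune the bootstrap rate (BR) for RF regression via
# scikit-learn's max_samples, scored by 2-fold cross-validated MSE
# (mirroring the paper's evaluation protocol).
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Synthetic stand-in dataset; the paper uses 39 real regression datasets.
X, y = make_regression(n_samples=400, n_features=10, noise=10.0, random_state=0)

results = {}
for br in [0.2, 0.4, 0.6, 0.8, 1.0]:
    rf = RandomForestRegressor(
        n_estimators=100,
        max_samples=None if br == 1.0 else br,  # None == full-size bootstrap
        random_state=0,
    )
    # neg_mean_squared_error is negated, so flip the sign back to MSE.
    mse = -cross_val_score(rf, X, y, cv=2, scoring="neg_mean_squared_error").mean()
    results[br] = mse

best_br = min(results, key=results.get)
print(f"best BR: {best_br}, MSE: {results[best_br]:.2f}")
```

Which BR wins depends on the dataset, which is precisely the paper's point; on noisy data the minimum often falls below 1.0.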
Problem

Research questions and friction points this paper is trying to address.

Investigating bootstrap sampling rate impact on Random Forest regression performance
Analyzing dataset characteristics to determine optimal bootstrap sampling rates
Exploring bias-variance trade-off in Random Forests through controlled noise experiments
Innovation

Methods, ideas, or system contributions that make the work stand out.

Systematically tuning bootstrap sampling rate
Linking dataset characteristics to optimal sampling rate
Investigating bias-variance trade-off via synthetic datasets
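The controlled-noise experiment from the last bullet can be sketched as follows. Everything here is an illustrative assumption (the target function, noise levels, and BR values are not taken from the paper); it only reproduces the shape of the experiment: fix a synthetic regression task, vary the noise, and compare a low and a full bootstrap rate.

```python
# Hedged sketch: compare a low and a full bootstrap rate (BR) on
# synthetic data at a low and a high noise level, scored by test MSE.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

def run(noise, br):
    """Train an RF with bootstrap rate `br` on a noisy synthetic task."""
    X = rng.uniform(-1, 1, size=(500, 5))
    # Smooth target plus Gaussian noise with controllable std.
    y = np.sin(3 * X[:, 0]) + X[:, 1] ** 2 + rng.normal(0, noise, 500)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)
    rf = RandomForestRegressor(
        n_estimators=100,
        max_samples=None if br == 1.0 else br,  # None == BR of 1.0
        random_state=0,
    )
    rf.fit(X_tr, y_tr)
    return mean_squared_error(y_te, rf.predict(X_te))

for noise in [0.1, 2.0]:
    scores = {br: run(noise, br) for br in [0.3, 1.0]}
    print(f"noise={noise}: " + ", ".join(f"BR={b}: MSE={m:.3f}" for b, m in scores.items()))
```

Per the paper's finding, lower BRs should become relatively more competitive as the noise level grows, since smaller bootstrap samples decorrelate the trees and reduce ensemble variance.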
Michał Iwaniuk
Faculty of Mathematics and Information Science, Warsaw University of Technology, Warsaw, Poland
Mateusz Jarosz
Faculty of Mathematics and Information Science, Warsaw University of Technology, Warsaw, Poland
Bartłomiej Borycki
Faculty of Mathematics and Information Science, Warsaw University of Technology, Warsaw, Poland
Bartosz Jezierski
Faculty of Mathematics and Information Science, Warsaw University of Technology, Warsaw, Poland
Jan Cwalina
Faculty of Mathematics and Information Science, Warsaw University of Technology, Warsaw, Poland
Stanisław Kaźmierczak
Faculty of Mathematics and Information Science, Warsaw University of Technology, Warsaw, Poland
Jacek Mańdziuk
Faculty of Mathematics and Information Science, Warsaw University of Technology, Warsaw, Poland
Computational Intelligence · Artificial General Intelligence · AI for Social Good · Visual Reasoning