Positive region preserved random sampling: an efficient feature selection method for massive data

📅 2025-07-01
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the high computational cost and resource constraints in feature selection for large-scale data, this paper proposes an efficient feature reduction method based on sampling and rough set theory. The method integrates positive-region-preserving sampling, rough dependency computation, and discernibility matrix optimization to enhance scalability and efficiency. Its key contributions are threefold: (1) it quantifies the discriminative power of a feature subset via the proportion of discernible object pairs; (2) it enables *a priori* probabilistic estimation of a lower bound on the subset’s discriminative ability, facilitating controllable reduction; and (3) it achieves both theoretical guarantees and practical performance. Experiments on 11 datasets—including four large-scale ones—demonstrate that the method delivers near-optimal reductions within minutes on a standard PC. Empirically, the achieved discriminative ability consistently exceeds its theoretically estimated lower bound, confirming the approach’s efficiency, interpretability, and reliability.


📝 Abstract
Selecting relevant features is an important and necessary step for intelligent machines to maximize their chances of success. However, intelligent machines generally lack sufficient computing resources when faced with huge volumes of data. This paper develops a new method based on sampling techniques and rough set theory to address the challenge of feature selection for massive data. To this end, the paper proposes measuring the discriminatory ability of a feature set by the ratio of discernible object pairs to all object pairs that should be distinguished. Based on this measure, a new feature selection method is proposed. This method constructs positive region preserved samples from massive data to find a feature subset with high discriminatory ability. Compared with other methods, the proposed method has two advantages. First, it can select a feature subset that preserves the discriminatory ability of the full feature set of the target massive data set within an acceptable time on a personal computer. Second, a lower bound on the probability that an object pair that should be distinguished is discernible by the selected feature subset can be estimated before reducts are computed. Furthermore, 11 data sets of different sizes were used to validate the proposed method. The results show that approximate reducts can be found in a very short time, and the discriminatory ability of the final reduct exceeds the estimated lower bound. Experiments on four large-scale data sets also showed that an approximate reduct with high discriminatory ability can be obtained in reasonable time on a personal computer.
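The ratio-based measure described in the abstract can be sketched as follows. This is a hypothetical illustration, not the paper's implementation: the function name `discernibility_ratio`, its arguments, and the brute-force enumeration of all pairs are assumptions (on massive data the paper relies on sampling precisely to avoid enumerating every pair).

```python
import numpy as np

def discernibility_ratio(X, y, feature_subset):
    """Fraction of object pairs with different decision labels that are
    discernible (differ on at least one feature) under feature_subset.
    Illustrative brute-force version of the ratio-based measure."""
    Xs = X[:, feature_subset]
    n = len(y)
    should_distinguish = 0  # pairs with different decision labels
    discernible = 0         # of those, pairs the subset can tell apart
    for i in range(n):
        for j in range(i + 1, n):
            if y[i] != y[j]:
                should_distinguish += 1
                if not np.array_equal(Xs[i], Xs[j]):
                    discernible += 1
    return discernible / should_distinguish if should_distinguish else 1.0
```

On a toy table, a single feature may discern only half of the pairs that must be distinguished, while the full feature set discerns more, which is exactly the gap a reduct tries to close.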
Problem

Research questions and friction points this paper is trying to address.

Efficient feature selection for massive data sets
Preserving discriminatory ability with limited computing resources
Estimating a lower bound on the discernibility probability of a feature subset before reduction
Innovation

Methods, ideas, or system contributions that make the work stand out.

Positive region preserved random sampling technique
Ratio-based discriminatory ability measurement
Efficient feature selection for massive data
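The positive-region-preserving sampling idea can be sketched as follows. Both function names and the sampling strategy (keeping one representative per equivalence class of the positive region, then filling the rest of the sample at random) are assumptions made for illustration; the paper's actual algorithm may construct samples differently.

```python
import random
from collections import defaultdict

def positive_region(rows, labels):
    """Indices of objects whose equivalence class (objects with identical
    feature values) is pure in its decision label -- the classical
    rough-set positive region."""
    classes = defaultdict(list)
    for i, r in enumerate(rows):
        classes[tuple(r)].append(i)
    pos = set()
    for members in classes.values():
        if len({labels[i] for i in members}) == 1:
            pos.update(members)
    return pos

def positive_region_preserved_sample(rows, labels, k, seed=0):
    """Random sample of about k objects that keeps at least one
    representative per equivalence class of the positive region.
    Hypothetical sketch, not the paper's exact algorithm."""
    rng = random.Random(seed)
    pos = positive_region(rows, labels)
    reps = {}  # one representative index per positive-region class
    for i in sorted(pos):
        reps.setdefault(tuple(rows[i]), i)
    sample = set(reps.values())
    remaining = [i for i in range(len(rows)) if i not in sample]
    extra = max(0, k - len(sample))
    sample.update(rng.sample(remaining, min(extra, len(remaining))))
    return sorted(sample)
```

The design intent is that reducts computed on such a sample cannot lose the consistent part of the data: every positive-region equivalence class is still represented, so the sample stays small while the discernibility structure it must preserve survives.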
Hexiang Bai
School of Computer and Information Technology, Shanxi University, Taiyuan, 030006 Shanxi, China.
Deyu Li
Jiye Liang
Shanxi University
Yanhui Zhai
School of Computer and Information Technology, Shanxi University, Taiyuan, 030006 Shanxi, China.