🤖 AI Summary
In multi-party data collaboration without a trusted third party, privacy-preserving feature selection remains challenging. Method: This paper proposes the first single-party outsourced feature selection framework based on fully homomorphic encryption (FHE), enabling a data owner to delegate encrypted data to an untrusted cloud server. All operations—including information gain computation, ciphertext-domain sorting, and Top-k feature selection—are performed entirely over encrypted data, eliminating reliance on any trusted third party. Contribution/Results: Unlike conventional multi-party secure protocols, our approach relaxes trust assumptions while simultaneously protecting both data and model privacy. Theoretical analysis shows time complexity of O(kn log³ n) and space complexity of O(kn). Experiments demonstrate significant efficiency advantages even on small-scale datasets, while guaranteeing rigorous security and full interpretability of the selection process.
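The summary does not specify how ciphertext-domain sorting is realized. Sorting over encrypted data typically uses a data-oblivious comparator network, in which every compare-exchange happens at positions fixed in advance, independent of the values — exactly what FHE evaluation requires, since the server cannot branch on ciphertexts. A plaintext bitonic network (a common such choice, shown purely as an illustration and not necessarily the paper's construction) can be sketched as:

```python
def bitonic_sort(a):
    """In-place bitonic sort; len(a) must be a power of two.

    The sequence of compare-exchange positions below depends only on the
    array length, never on the data, which is why networks like this are
    suited to oblivious/encrypted evaluation. Comparator count is
    O(n log^2 n), consistent with polylogarithmic factors in the
    complexity bounds quoted above.
    """
    n = len(a)
    assert n > 0 and n & (n - 1) == 0, "length must be a power of two"
    k = 2
    while k <= n:              # size of the bitonic subsequences
        j = k // 2
        while j > 0:           # comparator stride within each stage
            for i in range(n):
                partner = i ^ j
                if partner > i:
                    ascending = (i & k) == 0
                    # Oblivious compare-exchange at fixed positions.
                    if (a[i] > a[partner]) == ascending:
                        a[i], a[partner] = a[partner], a[i]
            j //= 2
        k *= 2
```

In an FHE setting the branch inside the loop would be replaced by an arithmetic "conditional swap" on ciphertexts, so the access pattern leaks nothing about the data.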
📝 Abstract
Feature selection is a technique that extracts a meaningful subset from a set of features in training data. When the training data is large-scale, appropriate feature selection enables the removal of redundant features, which can improve generalization performance, accelerate the training process, and enhance the interpretability of the model. This study proposes a privacy-preserving computation model for feature selection. Generally, when the data owner and analyst are the same party, there is no need to conceal private information. However, when they are different parties, or when multiple owners exist, an appropriate privacy-preserving framework is required. Although various private feature selection algorithms have been proposed, they all require two or more computing parties and do not guarantee security in environments where no external party can be fully trusted. To address this issue, we propose the first outsourcing algorithm for feature selection using fully homomorphic encryption. Compared to a prior two-party algorithm, our result improves the time and space complexities from O(kn²) to O(kn log³ n) and O(kn), respectively, where k and n denote the number of features and data samples. We also implemented the proposed algorithm and conducted comparative experiments against a naive baseline. The experimental results show the efficiency of our method even on small datasets.
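For reference, the plaintext computation being outsourced — scoring each feature by information gain and keeping the Top-k — can be sketched as below. The function names and the restriction to discrete feature values are illustrative assumptions; the paper's contribution is evaluating these steps over FHE ciphertexts, which this plaintext sketch does not attempt.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy H(Y) of a label sequence, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(feature_values, labels):
    """IG(X; Y) = H(Y) - H(Y | X) for one discrete feature column."""
    n = len(labels)
    conditional = 0.0
    for v in set(feature_values):
        subset = [y for x, y in zip(feature_values, labels) if x == v]
        conditional += (len(subset) / n) * entropy(subset)
    return entropy(labels) - conditional

def top_k_features(columns, labels, k):
    """Indices of the k features with the highest information gain.

    columns: list of k feature columns, each a list of n discrete values,
    matching the k (features) and n (samples) of the complexity bounds.
    """
    gains = [information_gain(col, labels) for col in columns]
    return sorted(range(len(gains)), key=lambda i: gains[i], reverse=True)[:k]
```

In the outsourced setting, both the gain computation and the subsequent selection would run over encrypted columns, so the server learns neither the data nor which features are selected.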