Investigating Data Pruning for Pretraining Biological Foundation Models at Scale

📅 2025-12-14
📈 Citations: 0
Influential: 0
🤖 AI Summary
The prohibitively high computational cost and poor reproducibility of biological foundation model (BioFM) pretraining, driven by massive datasets and parameter counts, hinder academic research. Method: a post-hoc influence-guided data pruning framework built on a subset-based self-influence metric, with two biologically grounded sequence selection strategies: Top-k Influence and Coverage-Centric Influence. Contribution/Results: the analysis systematically reveals substantial redundancy in RNA and protein sequence datasets. On RNA-FM and ESM-C, the framework prunes over 99% of the data while significantly outperforming random-selection baselines; the resulting coreset even surpasses randomly sampled subsets ten times its size in downstream performance. Generalization is demonstrated across RNA structure prediction and protein function annotation tasks. This framework substantially lowers the pretraining barrier, improving the accessibility, efficiency, and reproducibility of BioFM research.
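
The paper's exact subset-based self-influence formulation is not reproduced on this page; purely as a rough illustration, the sketch below assumes a TracIn-style estimator in which a sample's self-influence is approximated by the squared norm of its loss gradient, restricted to a small parameter subset (e.g., the output head) to keep the cost low. All function and variable names are hypothetical.

```python
# Minimal sketch of a subset-based self-influence estimate. Assumption:
# TracIn-style self-influence, approximated as the squared per-sample
# loss-gradient norm over a chosen parameter subset (not the paper's
# exact metric, which is not given here).
import torch

def subset_self_influence(model, loss_fn, x, y, param_subset):
    """Approximate self-influence of one sample as ||grad_S loss(x, y)||^2,
    where grad_S is the gradient w.r.t. a small parameter subset."""
    loss = loss_fn(model(x), y)
    grads = torch.autograd.grad(loss, param_subset)
    return sum(g.pow(2).sum().item() for g in grads)

# Toy usage: score one sample, using only a linear head as the subset.
head = torch.nn.Linear(16, 4)
x, y = torch.randn(1, 16), torch.tensor([2])
score = subset_self_influence(head, torch.nn.functional.cross_entropy,
                              x, y, list(head.parameters()))
```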

📝 Abstract
Biological foundation models (BioFMs), pretrained on large-scale biological sequences, have recently shown strong potential in providing meaningful representations for diverse downstream bioinformatics tasks. However, such models often rely on millions to billions of training sequences and billions of parameters, resulting in prohibitive computational costs and significant barriers to reproducibility and accessibility, particularly for academic labs. To address these challenges, we investigate the feasibility of data pruning for BioFM pretraining and propose a post-hoc influence-guided data pruning framework tailored to biological domains. Our approach introduces a subset-based self-influence formulation that enables efficient estimation of sample importance at low computational cost, and builds upon it two simple yet effective selection strategies, namely Top-k Influence (Top I) and Coverage-Centric Influence (CCI). We empirically validate our method on two representative BioFMs, RNA-FM and ESM-C. For RNA, our framework consistently outperforms random selection baselines under an extreme pruning rate of over 99 percent, demonstrating its effectiveness. Furthermore, we show the generalizability of our framework on protein-related tasks using ESM-C. In particular, our coreset even outperforms random subsets that are ten times larger in both RNA and protein settings, revealing substantial redundancy in biological sequence datasets. These findings underscore the potential of influence-guided data pruning to substantially reduce the computational cost of BioFM pretraining, paving the way for more efficient, accessible, and sustainable biological AI research.
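
To make the first selection strategy concrete: assuming each pretraining sequence already carries a scalar self-influence score, Top-k Influence (Top I) reduces to keeping the k highest-scoring sequences. A minimal sketch (names illustrative, not from the paper); at the abstract's pruning rate of over 99 percent, k would be under 1 percent of the dataset.

```python
import numpy as np

def top_k_influence(scores, k):
    """Top-k Influence (Top I) sketch: return the indices of the k
    samples with the highest precomputed self-influence scores."""
    return np.argsort(scores)[::-1][:k]
```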
Problem

Research questions and friction points this paper is trying to address.

Reducing computational costs of BioFM pretraining via data pruning
Developing influence-guided pruning for biological sequence datasets
Enhancing accessibility and sustainability in biological AI research
Innovation

Methods, ideas, or system contributions that make the work stand out.

Influence-guided data pruning reduces BioFM pretraining costs.
Subset-based self-influence estimates sample importance efficiently.
Top-k and Coverage-Centric selection strategies outperform random selection (a sketch of the coverage-centric idea follows this list).
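
Coverage-Centric Influence (CCI) is not detailed on this page; assuming it mirrors coverage-centric coreset selection, i.e., stratified sampling across influence-score bins so the kept subset covers the whole score distribution rather than only its high end, a hedged sketch follows. The quantile bucketing is an assumption; the paper's grouping may instead use a biological notion of coverage.

```python
import numpy as np

def coverage_centric_influence(scores, k, n_bins=10, seed=0):
    """Hypothetical CCI sketch: bucket samples by self-influence score,
    then draw roughly k/n_bins samples per bucket so low- and
    high-influence regions are both represented in the coreset."""
    rng = np.random.default_rng(seed)
    # Interior quantile edges give n_bins roughly equal-sized buckets.
    edges = np.quantile(scores, np.linspace(0, 1, n_bins + 1))[1:-1]
    bin_ids = np.digitize(scores, edges)  # values in 0 .. n_bins-1
    per_bin = max(1, k // n_bins)
    picked = []
    for b in range(n_bins):
        idx = np.where(bin_ids == b)[0]
        take = min(per_bin, len(idx))
        if take:
            picked.extend(rng.choice(idx, size=take, replace=False))
    return np.asarray(picked[:k])
```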
👥 Authors
Yifan Wu, The Chinese University of Hong Kong
Jiyue Jiang, The Chinese University of Hong Kong
Xichen Ye, Fudan University (Machine Learning)
Yiqi Wang, Shanghai University
Chang Zhou, The Chinese University of Hong Kong
Yitao Xu, PhD student, EPFL (Artificial Intelligence, Computer Vision, Machine Learning)
Jiayang Chen, The Chinese University of Hong Kong
He Hu, Guangdong Laboratory of Artificial Intelligence and Digital Economy (SZ)
Weizhong Zhang, Fudan University (Machine Learning, Deep Learning, Optimization)
Cheng Jin, Fudan University
Jiao Yuan, Guangzhou National Laboratory
Yu Li, The Chinese University of Hong Kong