AI Summary
This study addresses the lack of standardized evaluation protocols, which hinders fair assessment of genomic foundation models' performance and generalization. To this end, the authors introduce GENEB, a large-scale diagnostic benchmark that evaluates frozen representations from 40 models across 100 tasks under a unified probing protocol, spanning 13 functional categories and supporting few-shot settings. This framework enables, for the first time, category-aware, fine-grained, and controlled comparisons along multiple dimensions, revealing the instability of aggregate leaderboards and inherent trade-offs across tasks. Key findings are that model rankings vary substantially across functional categories, that increased model scale yields limited and inconsistent gains, and that architectural design and the alignment between pretraining data and downstream tasks matter more than parameter count alone.
Abstract
Progress in genomic foundation models is difficult to assess due to fragmented benchmarks, incompatible evaluation protocols, and task-specific reporting. As a result, claims of superiority or generality across models are often not directly comparable. We introduce GENEB, a large-scale diagnostic benchmark that evaluates frozen representations from 40 genomic foundation models across 100 tasks spanning 13 functional categories under a unified probing-based protocol, including few-shot regimes. GENEB enables controlled comparison across model scale, architecture, tokenization, and pretraining data while explicitly exposing task-level trade-offs. Our analysis shows that aggregate leaderboards are unstable: model rankings vary sharply across task categories, scale provides only modest and inconsistent gains, and architectural and pretraining alignment frequently outweigh parameter count. These results highlight limitations of current evaluation practices and position GENEB as a reference framework for principled comparison and category-aware model selection in genomic machine learning.
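To make the idea of probing frozen representations concrete, the following is a minimal sketch, not GENEB's actual protocol. It assumes the embeddings have already been extracted from a frozen encoder (synthetic Gaussian vectors stand in for them here), and it uses a closed-form ridge classifier as the probe. The function names, the regularization strength, and the 5-examples-per-class few-shot setting are all illustrative assumptions.

```python
import numpy as np


def fit_linear_probe(train_x, train_y, num_classes, l2=1e-2):
    """Fit a linear probe by ridge regression onto one-hot labels.

    Only the probe weights are learned; the encoder that produced
    train_x is treated as frozen and is never touched here.
    """
    # Append a constant column so the probe can learn an intercept.
    xb = np.hstack([train_x, np.ones((train_x.shape[0], 1))])
    targets = np.eye(num_classes)[train_y]
    gram = xb.T @ xb + l2 * np.eye(xb.shape[1])
    return np.linalg.solve(gram, xb.T @ targets)


def probe_accuracy(weights, x, y):
    """Classify by the largest probe output and return accuracy."""
    xb = np.hstack([x, np.ones((x.shape[0], 1))])
    preds = (xb @ weights).argmax(axis=1)
    return float((preds == y).mean())


def synthetic_embeddings(rng, n_per_class, dim=16):
    """Stand-in for frozen embeddings (assumption: two separable classes)."""
    neg = rng.normal(loc=-2.0, scale=1.0, size=(n_per_class, dim))
    pos = rng.normal(loc=2.0, scale=1.0, size=(n_per_class, dim))
    x = np.vstack([neg, pos])
    y = np.array([0] * n_per_class + [1] * n_per_class)
    return x, y


def run_demo(seed=0, shots=5):
    """Compare a full-data probe with a few-shot probe on the same test set."""
    rng = np.random.default_rng(seed)
    train_x, train_y = synthetic_embeddings(rng, n_per_class=200)
    test_x, test_y = synthetic_embeddings(rng, n_per_class=200)

    full = fit_linear_probe(train_x, train_y, num_classes=2)

    # Few-shot regime: keep only `shots` training examples per class.
    idx = np.concatenate(
        [np.flatnonzero(train_y == c)[:shots] for c in (0, 1)]
    )
    few = fit_linear_probe(train_x[idx], train_y[idx], num_classes=2)

    return probe_accuracy(full, test_x, test_y), probe_accuracy(few, test_x, test_y)


full_acc, few_shot_acc = run_demo()
print(f"full-data probe: {full_acc:.3f}, few-shot probe: {few_shot_acc:.3f}")
```

Because every model would be scored with the same lightweight probe, differences in accuracy can be attributed to the representations rather than to task-specific fine-tuning, which is the motivation for a unified probing protocol.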