🤖 AI Summary
This study challenges the necessity of task-specific fine-tuning in genomic prediction. We propose a fine-tuning-free paradigm centered on fixed sequence embeddings extracted from pretrained DNA language models (DNABERT-2, HyenaDNA), augmented with lightweight handcrafted features, including z-curve representations and GC content, and fed into an efficient classifier. By eliminating fine-tuning, our approach avoids performance degradation under distributional shift and substantially improves out-of-distribution generalization. On enhancer classification, it achieves 0.68 accuracy (vs. 0.58 for fine-tuning), reduces inference latency by 88%, and cuts carbon emissions by more than 8×. On non-TATA promoter classification, it attains 0.85 accuracy with a 22× smaller carbon footprint. These results establish a new baseline for genomic model deployment: more generalizable, more computationally efficient, and more environmentally sustainable.
📝 Abstract
Large pre-trained DNA language models such as DNABERT-2, Nucleotide Transformer, and HyenaDNA have demonstrated strong performance on various genomic benchmarks. However, most applications rely on expensive fine-tuning, which works best when the training and test data share a similar distribution. In this work, we investigate whether task-specific fine-tuning is always necessary. We show that simple embedding-based pipelines that extract fixed representations from these models and feed them into lightweight classifiers can achieve competitive performance. In evaluation settings with different data distributions, embedding-based methods often outperform fine-tuning while reducing inference time by 10x to 20x. Our results suggest that embedding extraction is not only a strong baseline but also a more generalizable and efficient alternative to fine-tuning, especially for deployment in diverse or unseen genomic contexts. For example, in enhancer classification, HyenaDNA embeddings combined with zCurve achieve 0.68 accuracy (vs. 0.58 for fine-tuning), with an 88% reduction in inference time and over 8x lower carbon emissions (0.02 kg vs. 0.17 kg CO2). In non-TATA promoter classification, DNABERT-2 embeddings with zCurve or GC content reach 0.85 accuracy (vs. 0.89 with fine-tuning) with a 22x lower carbon footprint (0.02 kg vs. 0.44 kg CO2). These results show that embedding-based pipelines offer over 10x better carbon efficiency while maintaining strong predictive performance. The code is available here: https://github.com/NIRJHOR-DATTA/EMBEDDING-IS-ALMOST-ALL-YOU-NEED.
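To make the pipeline concrete, here is a minimal sketch of the embedding-plus-handcrafted-features approach. The z-curve and GC-content definitions are standard, but the pooling, feature summary (endpoint z-curve coordinates rather than the full curve), and classifier choice are illustrative assumptions, not the paper's exact implementation; random vectors stand in for frozen DNABERT-2/HyenaDNA embeddings, which in practice would be mean-pooled hidden states computed once per sequence.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def gc_content(seq):
    """Fraction of G and C bases in the sequence."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)

def zcurve_features(seq):
    """Endpoint of the Z-curve, normalized by sequence length.

    x: purine vs. pyrimidine  (A+G) - (C+T)
    y: amino vs. keto         (A+C) - (G+T)
    z: weak vs. strong H-bond (A+T) - (G+C)
    """
    seq = seq.upper()
    c = {b: seq.count(b) for b in "ACGT"}
    n = len(seq)
    x = (c["A"] + c["G"] - c["C"] - c["T"]) / n
    y = (c["A"] + c["C"] - c["G"] - c["T"]) / n
    z = (c["A"] + c["T"] - c["G"] - c["C"]) / n
    return np.array([x, y, z])

def featurize(seq, embedding):
    """Concatenate a fixed LM embedding with handcrafted features."""
    handcrafted = np.concatenate([zcurve_features(seq), [gc_content(seq)]])
    return np.concatenate([embedding, handcrafted])

# Stand-in embeddings: in the actual pipeline these would come from a
# single frozen forward pass of DNABERT-2 or HyenaDNA (no fine-tuning).
rng = np.random.default_rng(0)
seqs = ["ATGCGC", "TTTTAA", "GGGCCC", "ATATAT"] * 10
labels = [1, 0, 1, 0] * 10
embeddings = rng.normal(size=(len(seqs), 16))

X = np.stack([featurize(s, e) for s, e in zip(seqs, embeddings)])
clf = LogisticRegression(max_iter=1000).fit(X, labels)
print(clf.score(X, labels))
```

Because the language model is only run once to produce embeddings and the downstream classifier is tiny, inference cost is dominated by a single forward pass, which is where the reported latency and carbon savings come from.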