BioReason: Incentivizing Multimodal Biological Reasoning within a DNA-LLM Model

📅 2025-05-29

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing DNA foundation models learn strong sequence representations but struggle with multi-step biological reasoning and do not produce transparent, biologically intuitive explanations. To address these limitations, this work proposes BioReason, an architecture that couples a DNA foundation model with a large language model (LLM) so that the LLM can process and reason over genomic information as a direct input. Multi-step reasoning is developed through supervised fine-tuning and targeted reinforcement learning, and the model explains its decisions through interpretable, step-by-step biological traces. It also reasons over unseen biological entities. On KEGG-based disease pathway prediction, BioReason reaches 97% accuracy, up from 88% (a gain of 9 percentage points), and it shows an average 15% performance gain over strong single-modality baselines across benchmarks that also include variant effect prediction. Data, code, and checkpoints are publicly released.

📝 Abstract

Unlocking deep, interpretable biological reasoning from complex genomic data is a major AI challenge hindering scientific discovery. Current DNA foundation models, despite strong sequence representation, struggle with multi-step reasoning and lack inherently transparent, biologically intuitive explanations. We introduce BioReason, a pioneering architecture that, for the first time, deeply integrates a DNA foundation model with a Large Language Model (LLM). This novel connection enables the LLM to directly process and reason with genomic information as a fundamental input, fostering a new form of multimodal biological understanding. BioReason's sophisticated multi-step reasoning is developed through supervised fine-tuning and targeted reinforcement learning, guiding the system to generate logical, biologically coherent deductions. On biological reasoning benchmarks including KEGG-based disease pathway prediction (where accuracy improves from 88% to 97%) and variant effect prediction, BioReason demonstrates an average 15% performance gain over strong single-modality baselines. BioReason reasons over unseen biological entities and articulates decision-making through interpretable, step-by-step biological traces, offering a transformative approach for AI in biology that enables deeper mechanistic insights and accelerates testable hypothesis generation from genomic data. Data, code, and checkpoints are publicly available at https://github.com/bowang-lab/BioReason
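As a rough illustration of what it could mean for an LLM to take genomic information "as a fundamental input", the sketch below projects per-token DNA embeddings into the LLM's embedding width and splices them into the text embedding sequence. The abstract does not describe the fusion mechanism, so everything here is an assumption made for illustration: the linear projection, the function names `project` and `build_llm_input`, the `dna_slot` parameter, and the toy dimensions are hypothetical and are not taken from the BioReason code.

```python
import random


def project(dna_embeddings, weight):
    """Linearly map each DNA embedding (width d_dna) to the LLM width (d_llm).

    `weight` is a d_dna x d_llm matrix stored as a list of rows.
    Assumption: a plain bias-free linear layer; the paper may use something else.
    """
    columns = list(zip(*weight))  # d_llm columns, each of length d_dna
    return [
        [sum(x * w for x, w in zip(vec, col)) for col in columns]
        for vec in dna_embeddings
    ]


def build_llm_input(text_embeddings, dna_embeddings, weight, dna_slot):
    """Insert the projected DNA embeddings into the text sequence at `dna_slot`."""
    projected = project(dna_embeddings, weight)
    return text_embeddings[:dna_slot] + projected + text_embeddings[dna_slot:]


# Toy dimensions, chosen only to keep the example small (hypothetical values).
random.seed(0)
D_DNA, D_LLM = 4, 6
weight = [[random.uniform(-1, 1) for _ in range(D_LLM)] for _ in range(D_DNA)]
dna_embeddings = [[random.uniform(-1, 1) for _ in range(D_DNA)] for _ in range(3)]
text_embeddings = [[0.0] * D_LLM for _ in range(5)]

fused = build_llm_input(text_embeddings, dna_embeddings, weight, dna_slot=2)
print(len(fused), len(fused[0]))  # 8 positions, each of width 6
```

In this sketch the LLM would see a single sequence of eight embeddings: two text positions, three DNA positions, then the remaining three text positions. A real system would take the DNA embeddings from a pretrained DNA encoder and learn the projection during fine-tuning.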

Problem

Research questions and friction points this paper is trying to address.

Enabling deep, interpretable reasoning from genomic data

Integrating DNA foundation models with LLMs for biological understanding

Improving accuracy in disease pathway and variant prediction

Innovation

Methods, ideas, or system contributions that make the work stand out.

Integrates a DNA foundation model with an LLM

Uses supervised fine-tuning and targeted reinforcement learning

Improves KEGG-based disease pathway prediction accuracy from 88% to 97%
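The accuracy figures reported in the abstract (88% to 97% on KEGG-based disease pathway prediction) can be read either as an absolute gain in percentage points or as a relative gain over the baseline. The short sketch below computes both; the helper name `gains` is hypothetical, and the abstract does not say which convention underlies the separately reported "average 15% performance gain".

```python
def gains(baseline, new):
    """Return (absolute gain in percentage points, relative gain in percent)."""
    absolute_points = (new - baseline) * 100
    relative_percent = (new - baseline) / baseline * 100
    return absolute_points, relative_percent


abs_pts, rel_pct = gains(0.88, 0.97)
print(f"{abs_pts:.1f} points, {rel_pct:.1f}% relative")
# prints "9.0 points, 10.2% relative"
```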
