BioReason: Incentivizing Multimodal Biological Reasoning within a DNA-LLM Model

📅 2025-05-29
🤖 AI Summary
Existing DNA foundation models produce strong sequence representations but struggle with multi-step biological reasoning and offer little transparency into how they reach a prediction. To address these limitations, we propose BioReason, an architecture that couples a DNA foundation model with a large language model (LLM) so that the LLM can reason step by step over genomic sequence as a direct input, which the paper describes as the first integration of its kind. The model is trained with supervised fine-tuning and targeted reinforcement learning, and it explains its decisions through interpretable, step-by-step biological reasoning traces. It also reasons over biological entities not seen during training. On KEGG-based disease pathway prediction, BioReason reaches 97% accuracy, up from 88% (a gain of 9 percentage points), and it delivers an average 15% performance gain over strong single-modality baselines across benchmarks that include variant effect prediction. Data, code, and checkpoints are publicly released.
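The headline result can be read in more than one way, so a short worked calculation may help. The sketch below takes only the two reported KEGG accuracies (88% and 97%) and expresses the change as an absolute gain, a relative gain, and a reduction in error rate. The function name `accuracy_gains` and the dictionary keys are illustrative, not something the paper defines.

```python
def accuracy_gains(baseline: float, improved: float) -> dict:
    """Express an accuracy change three ways. Inputs are percentages (0-100)."""
    absolute_points = improved - baseline                 # percentage points
    relative_gain = 100.0 * absolute_points / baseline    # % of baseline accuracy
    # Share of the baseline's errors that the improved model removes.
    error_reduction = 100.0 * absolute_points / (100.0 - baseline)
    return {
        "absolute_points": round(absolute_points, 1),
        "relative_gain_pct": round(relative_gain, 1),
        "error_reduction_pct": round(error_reduction, 1),
    }


# Reported KEGG disease pathway accuracies: 88% (baseline) and 97% (BioReason).
gains = accuracy_gains(88.0, 97.0)
print(gains)
```

Under this reading, the 9-point absolute gain corresponds to the error rate falling from 12% to 3%. The summary does not say how the separate 15% average gain is computed, so it is left out of the calculation.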

๐Ÿ“ Abstract
Unlocking deep, interpretable biological reasoning from complex genomic data is a major AI challenge hindering scientific discovery. Current DNA foundation models, despite strong sequence representation, struggle with multi-step reasoning and lack inherently transparent, biologically intuitive explanations. We introduce BioReason, a pioneering architecture that, for the first time, deeply integrates a DNA foundation model with a Large Language Model (LLM). This novel connection enables the LLM to directly process and reason with genomic information as a fundamental input, fostering a new form of multimodal biological understanding. BioReason's sophisticated multi-step reasoning is developed through supervised fine-tuning and targeted reinforcement learning, guiding the system to generate logical, biologically coherent deductions. On biological reasoning benchmarks including KEGG-based disease pathway prediction (where accuracy improves from 88% to 97%) and variant effect prediction, BioReason demonstrates an average 15% performance gain over strong single-modality baselines. BioReason reasons over unseen biological entities and articulates its decision-making through interpretable, step-by-step biological traces, offering a transformative approach for AI in biology that enables deeper mechanistic insights and accelerates testable hypothesis generation from genomic data. Data, code, and checkpoints are publicly available at https://github.com/bowang-lab/BioReason

Problem

Research questions and friction points this paper is trying to address.

Enabling deep, interpretable reasoning over genomic data
Integrating DNA foundation models with LLMs for biological understanding
Improving accuracy in disease pathway prediction and variant effect prediction

Innovation

Methods, ideas, or system contributions that make the work stand out.

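Integrates a DNA foundation model with an LLM, so the LLM takes genomic sequence as a direct input
Trains multi-step reasoning with supervised fine-tuning and targeted reinforcement learning
Improves accuracy on biological reasoning benchmarks (KEGG-based disease pathway prediction: 88% to 97%)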
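
The first item above can be illustrated with a minimal sketch. The summary does not spell out how the two models are connected, so this assumes a common multimodal pattern: per-position embeddings from the DNA model are linearly projected to the LLM's embedding width and spliced into the sequence of text token embeddings. All names (`linear_project`, `build_inputs`, `dna_slot`) and the toy dimensions are hypothetical, and random vectors stand in for real model outputs.

```python
import random


def linear_project(vectors, weight):
    """Map each d_dna-dim vector to d_llm dims (out = v @ W), in pure Python."""
    d_out = len(weight[0])
    return [
        [sum(v[i] * weight[i][j] for i in range(len(v))) for j in range(d_out)]
        for v in vectors
    ]


def build_inputs(text_embeds, dna_embeds, weight, dna_slot):
    """Splice projected DNA embeddings into the text embeddings at dna_slot."""
    projected = linear_project(dna_embeds, weight)
    return text_embeds[:dna_slot] + projected + text_embeds[dna_slot:]


random.seed(0)
D_DNA, D_LLM = 4, 6  # toy widths; real models are far wider

# Stand-ins for DNA-model outputs (3 positions) and text token embeddings (5 tokens).
dna_embeds = [[random.gauss(0, 1) for _ in range(D_DNA)] for _ in range(3)]
text_embeds = [[random.gauss(0, 1) for _ in range(D_LLM)] for _ in range(5)]
# Projection matrix; in a trained system this would be learned.
weight = [[random.gauss(0, 0.1) for _ in range(D_LLM)] for _ in range(D_DNA)]

inputs = build_inputs(text_embeds, dna_embeds, weight, dna_slot=2)
print(len(inputs), len(inputs[0]))  # 8 positions, each of width D_LLM
```

The combined sequence would then be passed to the LLM in place of plain token embeddings. Whether BioReason uses exactly this projection-and-splice scheme should be checked against the paper and the released code.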