AI Summary
Existing DNA foundation models exhibit strong sequence representation capabilities but suffer from limited multi-step biological reasoning and lack transparent, interpretable internal mechanisms. To address these limitations, the authors propose BioReason, a cross-modal architecture that enables large language models (LLMs) to perform end-to-end, multi-step, and interpretable reasoning directly over raw DNA sequences, which they describe as the first of its kind. The method integrates a DNA foundation model with an LLM, trains the combined system with supervised fine-tuning and targeted reinforcement learning, and produces step-by-step biological reasoning traces that make its decisions traceable. It also reasons over unseen biological entities. On KEGG-based disease pathway prediction, BioReason reaches 97% accuracy (up from 88%, a gain of 9 percentage points), and it delivers an average 15% performance gain over strong single-modality baselines across biological reasoning benchmarks, including variant effect prediction. Code, data, and model checkpoints are publicly released.
Abstract
Unlocking deep, interpretable biological reasoning from complex genomic data is a major AI challenge that hinders scientific discovery. Current DNA foundation models, despite strong sequence representation, struggle with multi-step reasoning and lack inherently transparent, biologically intuitive explanations. We introduce BioReason, a pioneering architecture that, for the first time, deeply integrates a DNA foundation model with a Large Language Model (LLM). This novel connection enables the LLM to directly process and reason with genomic information as a fundamental input, fostering a new form of multimodal biological understanding. BioReason's sophisticated multi-step reasoning is developed through supervised fine-tuning and targeted reinforcement learning, guiding the system to generate logical, biologically coherent deductions. On biological reasoning benchmarks including KEGG-based disease pathway prediction (where accuracy improves from 88% to 97%) and variant effect prediction, BioReason demonstrates an average 15% performance gain over strong single-modality baselines. BioReason reasons over unseen biological entities and articulates its decision-making through interpretable, step-by-step biological traces, offering a transformative approach for AI in biology that enables deeper mechanistic insights and accelerates testable hypothesis generation from genomic data. Data, code, and checkpoints are publicly available at https://github.com/bowang-lab/BioReason
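The abstract states that the LLM receives genomic information "as a fundamental input" but does not spell out the mechanism. The toy sketch below illustrates one common multimodal pattern that would fit that description: embeddings from a DNA encoder are linearly projected to the LLM's embedding width and placed in front of the text-token embeddings. Everything here is an assumption for illustration. The function names (`embed_dna`, `project`, `embed_text`, `build_llm_input`), the k-mer counting encoder, the toy dimensions, and the linear projection are hypothetical stand-ins, not BioReason's actual components; the released repository is the authoritative reference.

```python
import random

D_DNA, D_LLM = 4, 6  # toy embedding widths (assumed; real models are far larger)


def embed_dna(seq, k=3):
    """Stand-in for a DNA foundation model: one base-count vector per non-overlapping k-mer."""
    base = {"A": 0, "C": 1, "G": 2, "T": 3}
    vectors = []
    for i in range(0, len(seq) - k + 1, k):
        counts = [0.0] * D_DNA
        for ch in seq[i:i + k]:
            counts[base[ch]] += 1.0
        vectors.append(counts)
    return vectors


def make_projection(rng):
    """Random D_LLM x D_DNA matrix; in a trained system these weights would be learned."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(D_DNA)] for _ in range(D_LLM)]


def project(vec, weights):
    """Map one DNA embedding into the LLM embedding space (plain matrix-vector product)."""
    return [sum(w * x for w, x in zip(row, vec)) for row in weights]


def embed_text(tokens):
    """Stand-in for the LLM's token-embedding table: a fixed pseudo-random vector per token."""
    out = []
    for tok in tokens:
        rng = random.Random(tok)  # seeding with the token makes the vector repeatable
        out.append([rng.uniform(-1.0, 1.0) for _ in range(D_LLM)])
    return out


def build_llm_input(seq, question, weights):
    """Concatenate projected DNA embeddings with the question's token embeddings."""
    dna_part = [project(v, weights) for v in embed_dna(seq)]
    text_part = embed_text(question.split())
    return dna_part + text_part


weights = make_projection(random.Random(0))
fused = build_llm_input("ACGTACGTA", "Which pathway is affected?", weights)
print(len(fused), len(fused[0]))  # sequence length and embedding width of the fused input
```

In this sketch the fused list plays the role of the embedding sequence an LLM would consume in place of plain token embeddings, so the same model can attend over DNA-derived and text-derived positions together. How BioReason actually aligns the two modalities, and how supervised fine-tuning and reinforcement learning are applied on top, is not specified in the abstract.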