🤖 AI Summary
Current biomedical foundation models (BioFMs) operate in embedding spaces disjoint from those of large language models (LLMs), hindering effective cross-modal reasoning. To address this, we propose a two-stage semantic alignment framework: first, leveraging pretrained BioFMs as modality encoders; second, employing lightweight, modality-specific projection layers to map diverse biomedical modalities (biomedical text, molecular structures, and single-cell representations) into the LLM's embedding space, without requiring any LLM fine-tuning. Combined with instruction tuning, our approach supports cross-modal question answering, zero-shot cell-type annotation, and interpretable dialogue. Experiments demonstrate superior performance over larger LLM baselines across multiple biomedical tasks, with significant gains in reasoning accuracy and output interpretability. To our knowledge, this is the first work to enable plug-and-play multimodal biomedical joint reasoning that is broadly compatible across LLMs without adapting the LLM itself.
📝 Abstract
Recent advances in large language models (LLMs) and biomedical foundation models (BioFMs) have achieved strong results in biological text reasoning, molecular modeling, and single-cell analysis, yet they remain siloed in disjoint embedding spaces, limiting cross-modal reasoning. We present BIOVERSE (Biomedical Vector Embedding Realignment for Semantic Engagement), a two-stage approach that adapts pretrained BioFMs as modality encoders and aligns them with LLMs through lightweight, modality-specific projection layers. The approach first aligns each modality to a shared LLM space through independently trained projections, allowing them to interoperate naturally, and then applies standard instruction tuning with multimodal data to bring them together for downstream reasoning. By unifying raw biomedical data with knowledge embedded in LLMs, the approach enables zero-shot annotation, cross-modal question answering, and interactive, explainable dialogue. Across tasks spanning cell-type annotation, molecular description, and protein function reasoning, compact BIOVERSE configurations surpass larger LLM baselines while enabling richer, generative outputs than existing BioFMs, establishing a foundation for principled multi-modal biomedical reasoning.
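The core mechanism described above can be sketched minimally: a frozen BioFM produces an embedding, and a small trainable projection maps it into the LLM's embedding dimension, where it can serve as a soft prompt token alongside ordinary text embeddings. The pure-Python sketch below is an illustration only, not BIOVERSE's actual implementation: the dimensions are hypothetical, and random weights stand in for the trained projection.

```python
import random

def make_projection(in_dim, out_dim, seed=0):
    """Build a linear projection matrix (out_dim x in_dim).

    In the described framework these weights would be trained during
    stage-one alignment; here they are random placeholders.
    """
    rng = random.Random(seed)
    scale = in_dim ** -0.5  # keep output magnitudes roughly stable
    return [[rng.gauss(0.0, scale) for _ in range(in_dim)]
            for _ in range(out_dim)]

def project(embedding, weights):
    """Map a frozen BioFM embedding into the LLM embedding space."""
    return [sum(w * x for w, x in zip(row, embedding)) for row in weights]

# Hypothetical dimensions: a 16-d single-cell encoder output projected
# into a 32-d LLM embedding space. The BioFM and LLM stay frozen; only
# the projection would be trained.
biofm_emb = [0.1] * 16
W = make_projection(in_dim=16, out_dim=32)
soft_token = project(biofm_emb, W)
assert len(soft_token) == 32  # now usable as one soft prompt token
```

Because each modality gets its own independently trained projection of this shape, new modalities can be attached to the same frozen LLM without retraining anything else, which is what makes the design plug-and-play.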