BixBench: a Comprehensive Benchmark for LLM-based Agents in Computational Biology

📅 2025-02-28
📈 Citations: 0
Influential: 0
🤖 AI Summary
A comprehensive benchmark for evaluating the end-to-end scientific reasoning of LLM agents in bioinformatics has been lacking. Method: We introduce BixBench, the first open-source benchmark tailored to computational biology, comprising over 50 realistic analytical scenarios and nearly 300 open-answer questions. It systematically assesses agent performance on authentic bioinformatic tasks across data exploration, multi-step tool invocation, dynamic planning, and interpretation of nuanced results. Contribution/Results: Our evaluation reveals critical bottlenecks: state-of-the-art models (GPT-4o and Claude 3.5 Sonnet) achieve only 17% accuracy on open-answer questions and perform no better than random chance on multiple-choice items. We release the custom agent framework used for these evaluations as open source, enabling reproducible assessment of long, multi-step analyses, filling a key methodological gap, and advancing the rigorous evaluation and development of biological AI agents.
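
To make the evaluation protocol concrete, the sketch below shows how an open-answer accuracy loop over a benchmark of this shape might be wired up. It is a minimal illustration under assumed interfaces, not the released BixBench harness: `Scenario`, `run_agent`, and `grade_open_answer` are hypothetical placeholders.

```python
# Hypothetical sketch of an open-answer evaluation loop.
# None of these names come from the BixBench release; they are
# placeholders illustrating the protocol described above.

from dataclasses import dataclass

@dataclass
class Question:
    prompt: str          # open-answer question about the dataset
    reference: str       # gold answer written by the scenario author

@dataclass
class Scenario:
    data_dir: str        # folder with the raw biological data files
    questions: list[Question]

def run_agent(model: str, data_dir: str, prompt: str) -> str:
    """Drive an LLM agent (e.g. GPT-4o or Claude 3.5 Sonnet) through a
    multi-step analysis of the files in data_dir; return its final answer."""
    raise NotImplementedError  # agent framework goes here

def grade_open_answer(answer: str, reference: str) -> bool:
    """Judge whether the free-text answer matches the reference,
    e.g. via an LLM-based grader (one common choice for open answers)."""
    raise NotImplementedError

def evaluate(model: str, scenarios: list[Scenario]) -> float:
    """Return open-answer accuracy of `model` over all questions."""
    correct = total = 0
    for scenario in scenarios:
        for q in scenario.questions:
            answer = run_agent(model, scenario.data_dir, q.prompt)
            correct += grade_open_answer(answer, q.reference)
            total += 1
    return correct / total  # the paper reports ~17% for frontier models
```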

📝 Abstract
Large Language Models (LLMs) and LLM-based agents show great promise in accelerating scientific research. Existing benchmarks for measuring this potential and guiding future development continue to evolve from pure recall and rote knowledge tasks, towards more practical work such as literature review and experimental planning. Bioinformatics is a domain where fully autonomous AI-driven discovery may be near, but no extensive benchmarks for measuring progress have been introduced to date. We therefore present the Bioinformatics Benchmark (BixBench), a dataset comprising over 50 real-world scenarios of practical biological data analysis with nearly 300 associated open-answer questions designed to measure the ability of LLM-based agents to explore biological datasets, perform long, multi-step analytical trajectories, and interpret the nuanced results of those analyses. We evaluate the performance of two frontier LLMs (GPT-4o and Claude 3.5 Sonnet) using a custom agent framework we open source. We find that even the latest frontier models only achieve 17% accuracy in the open-answer regime, and no better than random in a multiple-choice setting. By exposing the current limitations of frontier models, we hope BixBench can spur the development of agents capable of conducting rigorous bioinformatic analysis and accelerate scientific discovery.
Problem

Research questions and friction points this paper is trying to address.

Develop a benchmark for LLM-based agents in bioinformatics.
Measure LLM performance in complex biological data analysis.
Evaluate LLMs' ability to interpret nuanced bioinformatics results.
Innovation

Methods, ideas, or system contributions that make the work stand out.

BixBench: an open-source benchmark of 50+ real-world bioinformatics scenarios with nearly 300 open-answer questions
Custom open-source agent framework for evaluating frontier LLMs (GPT-4o, Claude 3.5 Sonnet); see the sketch below
Open-answer questions that probe long, multi-step analytical trajectories and nuanced result interpretation
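
As a rough illustration of what such an agent framework does, this sketch shows a generic observe-act loop: the LLM alternates between proposing analysis code and reading its output until it commits to a final answer. The message format, the `FINAL ANSWER:` convention, and the step budget are assumptions for illustration, not the released framework's API.

```python
# Hypothetical sketch of a multi-step analysis agent loop.
# The chat and code-execution interfaces here are illustrative;
# the released framework's actual API may differ.

MAX_STEPS = 20

def agent_loop(llm, execute_code, task: str) -> str:
    """Alternate between LLM reasoning and code execution until the
    model emits a final answer or the step budget is exhausted."""
    history = [{"role": "user", "content": task}]
    for _ in range(MAX_STEPS):
        reply = llm(history)                 # model proposes the next action
        if reply.startswith("FINAL ANSWER:"):
            return reply.removeprefix("FINAL ANSWER:").strip()
        output = execute_code(reply)         # run the proposed analysis code
        history.append({"role": "assistant", "content": reply})
        history.append({"role": "user", "content": f"Output:\n{output}"})
    return "no answer within step budget"
```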