GenoTEX: A Benchmark for Automated Gene Expression Data Analysis in Alignment with Bioinformaticians

📅 2024-06-21
🤖 AI Summary
Problem: Identifying disease-associated genes from gene expression data remains heavily reliant on manual curation and lacks scalability. Method: We propose GenoTEX, the first automated evaluation benchmark for this task, covering data selection, preprocessing, and statistical analysis, and provide expert-annotated code and results. We formalize domain-expert practices as quantifiable LLM-agent tasks and introduce GenoAgent, a self-correcting multi-agent framework integrating LLMs, workflow orchestration, differential expression analysis, GO/KEGG enrichment, and expert-knowledge alignment. Contribution/Results: On GenoTEX, GenoAgent achieves end-to-end automated analysis with significantly reduced human intervention. Error analysis identifies semantic understanding and domain-logic modeling as the primary bottlenecks. This work establishes a reproducible, evaluable benchmark and methodological paradigm for biomedical AI agents.

📝 Abstract
Recent advancements in machine learning have significantly improved the identification of disease-associated genes from gene expression datasets. However, these processes often require extensive expertise and manual effort, limiting their scalability. Large Language Model (LLM)-based agents have shown promise in automating these tasks due to their increasing problem-solving abilities. To support the evaluation and development of such methods, we introduce GenoTEX, a benchmark dataset for the automated analysis of gene expression data. GenoTEX provides annotated code and results for solving a wide range of gene identification problems, encompassing dataset selection, preprocessing, and statistical analysis, in a pipeline that follows computational genomics standards. The benchmark includes expert-curated annotations from bioinformaticians to ensure accuracy and reliability. To provide baselines for these tasks, we present GenoAgent, a team of LLM-based agents that adopt a multi-step programming workflow with flexible self-correction, to collaboratively analyze gene expression datasets. Our experiments demonstrate the potential of LLM-based methods in analyzing genomic data, while error analysis highlights the challenges and areas for future improvement. We propose GenoTEX as a promising resource for benchmarking and enhancing automated methods for gene expression data analysis. The benchmark is available at https://github.com/Liu-Hy/GenoTex.
Problem

Research questions and friction points this paper is trying to address.

Automating gene expression data analysis with LLMs
Reducing manual effort in disease gene identification
Standardizing benchmarks for bioinformatics pipelines
Innovation

Methods, ideas, or system contributions that make the work stand out.

LLM-based agents automate gene expression analysis
GenoTEX benchmark includes expert-curated annotations
Multi-step programming workflow enables self-correction
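
The self-correction idea named above can be illustrated with a minimal sketch. This is not GenoAgent's implementation: the function `run_with_self_correction`, the stand-in `fake_llm`, and the retry limit are all hypothetical names and choices made for illustration. The sketch assumes only that an agent generates code for a step, executes it, and on failure receives the error message as feedback for the next attempt.

```python
from typing import Callable, Optional


def run_with_self_correction(
    generate_code: Callable[[str, Optional[str]], str],
    task: str,
    max_attempts: int = 3,
) -> dict:
    """Generate code for `task`, run it, and retry with the error as feedback.

    `generate_code` is a hypothetical stand-in for an LLM call; it receives
    the task description and the previous error message (or None).
    """
    feedback: Optional[str] = None
    for attempt in range(1, max_attempts + 1):
        code = generate_code(task, feedback)
        namespace: dict = {}
        try:
            # Execute the generated step in a fresh namespace.
            exec(code, namespace)
            return {
                "ok": True,
                "attempts": attempt,
                "result": namespace.get("result"),
            }
        except Exception as exc:
            # Feed the failure back into the next generation round.
            feedback = f"{type(exc).__name__}: {exc}"
    return {"ok": False, "attempts": max_attempts, "error": feedback}


def fake_llm(task: str, feedback: Optional[str]) -> str:
    """Toy generator: the first draft is broken, the revision is fixed."""
    if feedback is None:
        # `expression` is never defined, so this draft raises NameError.
        return "result = expression['TP53'] * 2"
    return "expression = {'TP53': 2.5}\nresult = expression['TP53'] * 2"


outcome = run_with_self_correction(fake_llm, "double the TP53 expression value")
print(outcome)
```

In this toy run the first draft fails and the second succeeds, so `outcome` reports two attempts. A real agent would replace `fake_llm` with a model call and would typically execute generated code in a sandbox rather than with a bare `exec`.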
Haoyang Liu
School of Information Sciences, University of Illinois at Urbana-Champaign
Shuyu Chen
Ye Zhang
Haohan Wang
School of Information Sciences, University of Illinois Urbana-Champaign
Computational Biology, Agentic AI, AI4Science, AI security