PHEONA: An Evaluation Framework for Large Language Model-based Approaches to Computational Phenotyping

📅 2025-03-25

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Computational phenotyping currently relies heavily on labor-intensive manual annotation, which limits efficiency and reproducibility. To address this, we propose PHEONA, the first large language model (LLM) evaluation framework tailored to clinical computational phenotyping, focused on concept classification tasks in real-world data such as respiratory support therapies. PHEONA integrates medical context to establish a three-dimensional evaluation paradigm covering clinical plausibility, interpretability, and task robustness. It incorporates knowledge-guided zero-shot and few-shot prompting strategies alongside a human-in-the-loop evaluation protocol. Evaluated on concept classification for Acute Respiratory Failure (ARF) respiratory support therapies, the LLM-based approach substantially reduces reliance on manual annotation while achieving high accuracy on the sample concepts tested. These results suggest that LLM-driven computational phenotyping is feasible and clinically applicable, and that it offers a scalable, reproducible alternative to conventional annotation-dependent approaches.

📝 Abstract

Computational phenotyping is essential for biomedical research but often requires significant time and resources, especially since traditional methods typically involve extensive manual data review. While machine learning and natural language processing advancements have helped, further improvements are needed. Few studies have explored using Large Language Models (LLMs) for these tasks despite known advantages of LLMs for text-based tasks. To facilitate further research in this area, we developed an evaluation framework, Evaluation of PHEnotyping for Observational Health Data (PHEONA), that outlines context-specific considerations. We applied and demonstrated PHEONA on concept classification, a specific task within a broader phenotyping process for Acute Respiratory Failure (ARF) respiratory support therapies. From the sample concepts tested, we achieved high classification accuracy, suggesting the potential for LLM-based methods to improve computational phenotyping processes.
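To make the "classification accuracy" reported above concrete, here is a minimal sketch of how LLM-assigned concept labels might be scored against manually reviewed reference labels. It is not the authors' evaluation code. The label names (`IMV`, `NIPPV`, `HFNI`, `NONE`), the sample data, and the function names are all hypothetical and chosen only for illustration.

```python
from collections import Counter


def classification_accuracy(predicted, expected):
    """Fraction of concepts whose predicted label equals the reference label."""
    if len(predicted) != len(expected):
        raise ValueError("predicted and expected must have the same length")
    if not expected:
        raise ValueError("no concepts to score")
    correct = sum(p == e for p, e in zip(predicted, expected))
    return correct / len(expected)


def per_class_recall(predicted, expected):
    """Recall per reference label: correct predictions / concepts with that label."""
    totals = Counter(expected)
    hits = Counter(e for p, e in zip(predicted, expected) if p == e)
    return {label: hits[label] / totals[label] for label in totals}


# Hypothetical reference labels from manual review and hypothetical LLM outputs.
expected = ["IMV", "NIPPV", "HFNI", "NONE"]
predicted = ["IMV", "NIPPV", "NONE", "NONE"]

accuracy = classification_accuracy(predicted, expected)
recall = per_class_recall(predicted, expected)
print(f"accuracy={accuracy:.2f}")
for label, value in sorted(recall.items()):
    print(f"{label}: recall={value:.2f}")
```

Reporting per-class recall next to overall accuracy is one design choice among several. It is useful when some therapy categories have few concepts, because a single overall figure can hide a category that is consistently misclassified.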

Problem

Research questions and friction points this paper is trying to address.

Evaluating LLMs for computational phenotyping efficiency

Addressing manual data review limitations in biomedicine

Improving concept classification accuracy for Acute Respiratory Failure (ARF) respiratory support therapies

Innovation

Methods, ideas, or system contributions that make the work stand out.

Developed PHEONA framework for LLM evaluation

Applied LLMs to computational phenotyping tasks

Achieved high accuracy in concept classification

🔎 Similar Papers

A Simplified Retriever to Improve Accuracy of Phenotype Normalizations by Large Language Models