Comparing Llama3 and DeepSeekR1 on Biomedical Text Classification Tasks

📅 2025-03-19
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This study investigates performance disparities between Llama3-70B and DeepSeekR1-distill-Llama3-70B on six zero-shot biomedical text classification tasks spanning social media and clinical electronic health records. Method: We employ standardized zero-shot prompting and rigorously evaluate macro-F1, precision, recall, and 95% confidence intervals across all tasks. Contribution/Results: We present the first cross-source, multi-task, statistically rigorous benchmark for zero-shot LLM-based biomedical classification. Our analysis reveals a task-dependent precision–recall trade-off: DeepSeekR1 achieves higher precision on most tasks but exhibits substantial performance volatility—failing catastrophically on several—whereas supervised methods remain uniquely robust when labeled data are available. These findings provide empirically grounded guidance for model selection and methodology design in zero-shot biomedical NLP.

Technology Category

Application Category

📝 Abstract
This study compares the performance of two open-source large language models (LLMs)-Llama3-70B and DeepSeekR1-distill-Llama3-70B-on six biomedical text classification tasks. Four tasks involve data from social media, while two tasks focus on clinical notes from electronic health records, and all experiments were performed in zero-shot settings. Performance metrics, including precision, recall, and F1 scores, were measured for each task, along with their 95% confidence intervals. Results demonstrated that DeepSeekR1-distill-Llama3-70B generally performs better in terms of precision on most tasks, with mixed results on recall. While the zero-shot LLMs demonstrated high F1 scores for some tasks, they grossly underperformed on others, for data from both sources. The findings suggest that model selection should be guided by the specific requirements of the health-related text classification tasks, particularly when considering the precision-recall trade-offs, and that, in the presence of annotated data, supervised classification approaches may be more reliable than zero-shot LLMs.
Problem

Research questions and friction points this paper is trying to address.

Compare Llama3 and DeepSeekR1 on biomedical text classification tasks.
Evaluate zero-shot performance on social media and clinical note datasets.
Assess precision-recall trade-offs for model selection in health-related tasks.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Compares Llama3-70B and DeepSeekR1-distill-Llama3-70B models
Evaluates zero-shot performance on biomedical text tasks
Highlights precision-recall trade-offs in model selection
🔎 Similar Papers
No similar papers found.
Y
Yuting Guo
Department of Biomedical Informatics, School of Medicine, Emory University, Atlanta, GA
Abeed Sarker
Abeed Sarker
Emory University School of Medicine
Natural Language ProcessingBiomedical InformaticsHealth Data ScienceApplied Machine Learning