PVminerLLM: Structured Extraction of Patient Voice from Patient-Generated Text using Large Language Models

📅 2026-03-06
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study addresses the challenge of leveraging unstructured patient-generated text, which contains rich socio-experiential information but lacks a standardized representation for patient-centered research and clinical quality improvement. To bridge this gap, the authors introduce PVminer, a benchmark for structured extraction of the patient voice, and propose PVminerLLM, a supervised fine-tuning approach that uses large language models to jointly extract codes, sub-codes, and supporting evidence spans. Even with minimal fine-tuning data, the method substantially outperforms prompt-based baselines, achieving up to 83.82% F1 for code prediction, 80.74% F1 for sub-code prediction, and 87.03% F1 for evidence span extraction, confirming its efficacy and practical utility in structuring patient narratives for downstream applications.
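The joint extraction described above, producing a code, a sub-code, and a supporting evidence span per finding, can be pictured as a structured record emitted by the model. A minimal sketch of parsing and validating such output; the field names and label values here are hypothetical illustrations, not the actual PVminer schema:

```python
import json

# Hypothetical JSON output of a fine-tuned extraction model: for each
# patient message, a list of (code, sub-code, evidence span) triples.
# The codes and spans below are invented examples, not PVminer labels.
model_output = """
[
  {"code": "Social Circumstances",
   "subcode": "Transportation",
   "span": "I missed my appointment because the bus never came"},
  {"code": "Engagement in Care",
   "subcode": "Medication Adherence",
   "span": "I stopped taking the pills when I felt better"}
]
"""

def parse_extractions(raw: str) -> list[dict]:
    """Parse the model's JSON output, keeping only complete records."""
    records = json.loads(raw)
    required = {"code", "subcode", "span"}
    return [r for r in records if required <= r.keys()]

extractions = parse_extractions(model_output)
print(len(extractions))        # 2
print(extractions[0]["code"])  # Social Circumstances
```

Serializing all three fields into one JSON target is one natural way to train a single model to predict them jointly, as the summary describes, rather than with separate per-task heads.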

📝 Abstract
Motivation: Patient-generated text contains critical information about patients' lived experiences, social circumstances, and engagement in care, including factors that strongly influence adherence, care coordination, and health equity. However, these patient voice signals are rarely available in structured form, limiting their use in patient-centered outcomes research and clinical quality improvement. Reliable extraction of such information is therefore essential for understanding and addressing non-clinical drivers of health outcomes at scale.

Results: We introduce PVminer, a benchmark for structured extraction of patient voice, and propose PVminerLLM, a supervised fine-tuned large language model tailored to this task. Across multiple datasets and model sizes, PVminerLLM substantially outperforms prompt-based baselines, achieving up to 83.82% F1 for Code prediction, 80.74% F1 for Sub-code prediction, and 87.03% F1 for evidence Span extraction. Notably, strong performance is achieved even with smaller models, demonstrating that reliable patient voice extraction is feasible without extreme model scale. These results enable scalable analysis of social and experiential signals embedded in patient-generated text.

Availability and Implementation: Code, evaluation scripts, and trained LLMs will be released publicly. Annotated datasets will be made available upon request for research use.

Keywords: Large Language Models, Supervised Fine-Tuning, Medical Annotation, Patient-Generated Text, Clinical NLP
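The abstract reports an F1 score for evidence span extraction alongside the code and sub-code F1s. As a rough illustration of how a predicted span might be scored against a gold annotation, here is a token-overlap F1 in the style commonly used for extractive QA; the paper's exact span metric is not specified on this page, so treat this as an assumed approximation:

```python
from collections import Counter

def span_f1(predicted: str, gold: str) -> float:
    """Token-overlap F1 between a predicted and a gold evidence span
    (SQuAD-style; the paper's actual span metric may differ)."""
    pred_toks = predicted.lower().split()
    gold_toks = gold.lower().split()
    # Multiset intersection counts each shared token at most as often
    # as it appears in both spans.
    overlap = sum((Counter(pred_toks) & Counter(gold_toks)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_toks)
    recall = overlap / len(gold_toks)
    return 2 * precision * recall / (precision + recall)

# A partial match earns partial credit: 4/4 predicted tokens and
# 4/5 gold tokens overlap, so F1 = 2 * 1.0 * 0.8 / 1.8.
print(round(span_f1("the bus never came",
                    "because the bus never came"), 2))  # 0.89
```

A soft overlap metric like this rewards near-miss spans, which matters for patient narratives where annotators may disagree on exact span boundaries.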
Problem

Research questions and friction points this paper is trying to address.

Patient-Generated Text
Patient Voice
Structured Extraction
Clinical NLP
Health Equity
Innovation

Methods, ideas, or system contributions that make the work stand out.

Large Language Models
Supervised Fine-Tuning
Patient-Generated Text
Structured Extraction
Clinical NLP
Authors
Samah Fodeh
Yale University, New Haven, CT, USA.
Linhai Ma
Yale University
Deep learning, Medical signal/image analysis, Concurrency
Ganesh Puthiaraju
Yale University, New Haven, CT, USA.
Srivani Talakokkul
Yale University, New Haven, CT, USA.
Afshan Khan
Yale University, New Haven, CT, USA.
Ashley Hagaman
Yale University, New Haven, CT, USA.
Sarah Lowe
Yale University, New Haven, CT, USA.
Aimee Roundtree
Texas State University, San Marcos, TX, USA.