Aligning Large Language Models for Enhancing Psychiatric Interviews Through Symptom Delineation and Summarization: Pilot Study

📅 2024-03-19
🏛️ JMIR Formative Research
📈 Citations: 1
✨ Influential: 0
🤖 AI Summary
This study addresses the challenge of conducting structured psychiatric assessments during clinical interviews with North Korean defectors, a population with complex trauma-related mental health needs. Method: The authors align large language models (LLMs), using GPT-4 Turbo with zero-shot prompting, retrieval-augmented generation (RAG), and supervised fine-tuning on expert-labeled transcripts, to three tasks: extracting stressors, delineating symptoms and their indicative transcript sections, and summarizing each interviewee. Contribution/Results: Fine-tuning outperforms zero-shot prompting for symptom identification, reaching an average F1-score of 0.82; in the zero-shot GPT-4 Turbo setting, 73 out of 102 symptom-containing segments are localized with mid-token distance *d* < 20; and generated clinical summaries attain G-Eval scores of 4.66 (coherence) and 4.67 (relevance), while RAG yields no significant improvement.

๐Ÿ“ Abstract
Background: Recent advancements in large language models (LLMs) have accelerated their use across various domains. Psychiatric interviews, which are goal-oriented and structured, represent a significantly underexplored area where LLMs can provide substantial value. In this study, we explore the application of LLMs to enhance psychiatric interviews by analyzing counseling data from North Korean defectors who have experienced traumatic events and mental health issues.

Objective: This study aims to investigate whether LLMs can (1) delineate the parts of the conversation that suggest psychiatric symptoms and identify those symptoms, and (2) summarize stressors and symptoms based on the interview dialogue transcript.

Methods: Given the interview transcripts, we align the LLMs to perform 3 tasks: (1) extracting stressors from the transcripts, (2) delineating symptoms and their indicative sections, and (3) summarizing each patient based on the extracted stressors and symptoms. These 3 tasks address the 2 objectives: symptom delineation is based on the output of the second task, and the interview summary incorporates the outputs of all 3 tasks. The transcript data were labeled by mental health experts for training and evaluating the LLMs.

Results: First, we present the performance of LLMs in estimating (1) the transcript sections related to psychiatric symptoms and (2) the names of the corresponding symptoms. In the zero-shot inference setting with the GPT-4 Turbo model, 73 out of 102 transcript segments demonstrated a recall mid-token distance d < 20 when estimating the sections associated with the symptoms. For identifying the names of the corresponding symptoms, fine-tuning outperforms zero-shot inference with the GPT-4 Turbo model: on average, the fine-tuned model achieves an accuracy of 0.82, a precision of 0.83, a recall of 0.82, and an F1-score of 0.82. Second, the transcripts are used to generate a summary for each interviewee. This generative task was evaluated with Generative Evaluation (G-Eval) and Bidirectional Encoder Representations from Transformers Score (BERTScore). The summaries generated by the GPT-4 Turbo model, using both symptom and stressor information, achieve average G-Eval scores of 4.66 for coherence, 4.73 for consistency, 2.16 for fluency, and 4.67 for relevance. Notably, retrieval-augmented generation did not lead to a significant improvement in performance.

Conclusions: LLMs, using either (1) appropriate prompting techniques or (2) fine-tuning on data labeled by mental health experts, achieved an accuracy of over 0.8 for the symptom delineation task measured across all segments in the transcript, and a G-Eval coherence score of over 4.6 in the summarization task. This research contributes to the emerging field of applying LLMs to psychiatric interviews and demonstrates their potential effectiveness in assisting mental health practitioners.
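The section-estimation result above is reported as a recall mid-token distance d < 20 between an estimated and a labeled transcript section. A minimal sketch of one plausible reading of this metric, assuming each section is represented as a (start, end) token-index span (the paper's exact definition may differ):

```python
def mid_token(span):
    """Midpoint token index of a (start, end) token span."""
    start, end = span
    return (start + end) / 2

def mid_token_distance(pred_span, true_span):
    """Absolute distance between the midpoints of the predicted
    and expert-labeled spans, in tokens."""
    return abs(mid_token(pred_span) - mid_token(true_span))

def is_localized(pred_span, true_span, threshold=20):
    """A predicted section counts as correctly localized when d < threshold,
    mirroring the d < 20 criterion reported in the abstract."""
    return mid_token_distance(pred_span, true_span) < threshold
```

Under this reading, the 73/102 figure is the number of symptom-containing segments for which `is_localized` holds.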
Problem

Research questions and friction points this paper is trying to address.

Enhancing psychiatric interviews using LLMs
Identifying symptoms from conversation transcripts
Summarizing stressors and symptoms effectively
Innovation

Methods, ideas, or system contributions that make the work stand out.

Fine-tuning LLMs for symptom recognition
LLMs generate coherent interview summaries
GPT-4 Turbo used for zero-shot inference
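The zero-shot setting relies only on prompting, with no labeled training data. A hedged sketch of how such a symptom-delineation prompt might be assembled; the symptom label set, function name, and wording here are illustrative assumptions, not the paper's actual prompt:

```python
# Illustrative candidate-symptom label set; the paper's label set is defined
# by its mental health experts and is not reproduced here.
SYMPTOMS = ["insomnia", "anxiety", "depressed mood", "flashbacks"]

def build_zero_shot_prompt(transcript_segment: str) -> str:
    """Assemble a zero-shot prompt asking the model to name the suggested
    symptom and quote the indicative part of the segment."""
    labels = ", ".join(SYMPTOMS)
    return (
        "You are assisting a psychiatric interview review.\n"
        f"Candidate symptoms: {labels}.\n"
        "Identify which symptom, if any, the following transcript segment "
        "suggests, and quote the indicative sentence.\n\n"
        f"Segment:\n{transcript_segment}"
    )
```

The resulting string would be sent as a chat message to GPT-4 Turbo; the fine-tuned variant instead learns the mapping from expert-labeled transcripts.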
Jae-hee So
Department of Applied Statistics, Yonsei University
Joonhwan Chang
Department of Applied Statistics, Yonsei University
Eunji Kim
Institute of Behavioral Sciences in Medicine, Yonsei University College of Medicine
Junho Na
Department of Applied Statistics, Yonsei University
JiYeon Choi
Department of Nursing, Mo-Im Kim Nursing Research Institute, Yonsei University College of Nursing; Institute for Innovation in Digital Healthcare, Yonsei University
Jy-yong Sohn
Yonsei University
Machine Learning; Information Theory
Byung-Hoon Kim
Yonsei University, College of Medicine
Psychiatry; Neuroimaging; Large Multimodal Models
Sang Hui Chu
Department of Nursing, Mo-Im Kim Nursing Research Institute, Yonsei University College of Nursing