🤖 AI Summary
This work addresses the accuracy and stability of large language models that generate clinical insights from incident reports in high-stakes domains such as healthcare. The authors propose a tag-based few-shot example selection method that uses human-interpretable tags from the Japanese Medical Incident Dataset (JMID) to guide GPT-4o and LLaMA 3.3 in generating causal factors and preventive measures. Compared with two baselines, random sampling and cosine similarity-based selection, the tag-based approach achieves the highest precision and the most stable generation behavior. By contrast, similarity-based selection often leads to unintended outputs and safety filter activation, which suggests that choosing demonstration examples by dataset tag is a practical route to more reliable generation in safety-critical applications.
📝 Abstract
In high-stakes domains such as healthcare, the reliability of Large Language Models (LLMs) is critical, particularly when generating clinical insights from incident reports. This study proposes a tag-based few-shot example selection method for prompting LLMs to generate background/causal factors and preventive measures from the details of medical incidents. For our experiments, we use the Japanese Medical Incident Dataset (JMID), a structured dataset of 3,884 real-world medical accident and near-miss reports. These reports are variably annotated with a wide range of tags, some of which carry descriptive information (e.g., "medications," "blood transfusion therapy"). We compare three few-shot example selection strategies (random sampling, cosine similarity-based selection, and our proposed tag-based method) using GPT-4o and LLaMA 3.3. Results show that the tag-based approach achieves the highest precision and the most stable generation behavior, while similarity-based selection often leads to unintended outputs and safety filter activation. These findings suggest that selecting examples based on human-interpretable dataset tags can improve generation precision and stability in clinical LLM applications.