🤖 AI Summary
In safety-critical domains such as maritime operations, manual assessment of procedural communication compliance is slow and poorly reproducible. To address this, we propose Prompt-and-Check: a zero-shot, context-augmented prompting framework that uses open-source large language models (LLaMA 2, LLaMA 3, Mistral) on local GPU hardware (RTX 4070) to perform fine-grained compliance classification directly from dialogue transcripts, without model fine-tuning. The method enables context-aware reasoning and fully offline deployment. Experimental evaluation shows strong agreement between model predictions and domain-expert annotations (Cohen's κ > 0.85), improving the automation and objectivity of post-training debriefing in simulation-based training. Prompt-and-Check offers a lightweight, interpretable, and deployable paradigm for compliance assessment in high-reliability human–AI collaborative settings.
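
The paper itself does not include code; the following is a minimal sketch of the zero-shot prompt-and-check loop as the summary describes it, assuming llama-cpp-python with a quantized GGUF checkpoint running on the local GPU. The model filename, prompt wording, and `check_item` helper are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch: zero-shot compliance checking with a local open-source LLM.
# Assumes llama-cpp-python and a quantized GGUF model file; names are illustrative.
from llama_cpp import Llama

PROMPT_TEMPLATE = """You are auditing a maritime training exercise.
Checklist item: {item}
Relevant transcript excerpt:
{excerpt}
Question: Was this checklist item fulfilled in the conversation?
Answer with exactly one word, YES or NO.
Answer:"""

# Load the model fully onto the GPU (n_gpu_layers=-1 offloads all layers).
llm = Llama(model_path="llama-3-8b-instruct.Q4_K_M.gguf",
            n_ctx=4096, n_gpu_layers=-1, verbose=False)

def check_item(item: str, excerpt: str) -> bool:
    """Return True if the model judges the checklist item as fulfilled."""
    prompt = PROMPT_TEMPLATE.format(item=item, excerpt=excerpt)
    out = llm(prompt, max_tokens=4, temperature=0.0)  # deterministic, zero-shot
    return out["choices"][0]["text"].strip().upper().startswith("YES")

if __name__ == "__main__":
    verdict = check_item(
        "Officer confirms course change with the helmsman",
        "OOW: Starboard ten. Helmsman: Starboard ten. OOW: Thank you.",
    )
    print("compliant" if verdict else "non-compliant")
```

Constraining the output to a single YES/NO token keeps the judgment trivially parseable and interpretable at debrief time, which is one plausible reading of the "fine-grained compliance classification" the summary describes.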
📝 Abstract
Accurate evaluation of procedural communication compliance is essential in simulation-based training, particularly in safety-critical domains where adherence to compliance checklists reflects operational competence. This paper explores a lightweight, deployable approach using prompt-based inference with open-source large language models (LLMs) that run efficiently on consumer-grade GPUs. We present Prompt-and-Check, a method that uses context-rich prompts to evaluate whether each checklist item in a protocol has been fulfilled, based solely on transcribed verbal exchanges. We conduct a case study in the maritime domain in which participants complete an identical simulation task, and experiment with models such as LLaMA 2 7B, LLaMA 3 8B, and Mistral 7B, running locally on an RTX 4070 GPU. For each checklist item, a prompt incorporating relevant transcript excerpts is fed into the model, which outputs a compliance judgment. We assess model outputs against expert-annotated ground truth using classification accuracy and agreement scores. Our findings demonstrate that prompting enables effective context-aware reasoning without task-specific training. This study highlights the practical utility of LLMs in augmenting debriefing, performance feedback, and automated assessment in training environments.
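
The abstract says model outputs are assessed against expert-annotated ground truth using classification accuracy and agreement scores. A minimal sketch of that evaluation step, assuming binary per-item verdicts and scikit-learn; the label arrays are illustrative placeholders, not the paper's data.

```python
# Sketch of the evaluation step: comparing model compliance judgments against
# expert annotations with accuracy and Cohen's kappa (scikit-learn).
from sklearn.metrics import accuracy_score, cohen_kappa_score

expert = [1, 1, 0, 1, 0, 0, 1, 1]  # expert-annotated ground truth (1 = fulfilled)
model  = [1, 1, 0, 1, 0, 1, 1, 1]  # per-checklist-item verdicts from the LLM

print(f"accuracy:  {accuracy_score(expert, model):.3f}")
print(f"Cohen's κ: {cohen_kappa_score(expert, model):.3f}")
```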