🤖 AI Summary
To address the challenges of multimodal integration, specifically the difficulty of fusing speech, visual, and ADS-B data, and the lack of human-in-the-loop (HITL) evaluation in aviation conflict detection, this work introduces the first open-source, modular, and lightweight HITL simulation platform. Built on the Godot engine, it integrates fine-tuned Whisper ASR, YOLOv8-based visual detection, ADS-B message parsing, and GPT-OSS-20B for structured reasoning, all interconnected via standardized JSON APIs that enable plug-and-play model integration and reproducible multi-scenario testing. The platform supports representative conflict scenarios in terminal maneuvering areas and along airways, including runway incursions and en-route conflicts. Empirical evaluation yields an average time-to-first-alert of 7.7 seconds, with ASR and vision processing latencies of roughly 5.9 s and 0.4 s, respectively, indicating effective multimodal coordination. Key contributions include: (1) a unified benchmarking framework; (2) a reproducible scenario suite; and (3) end-to-end HITL evaluation capability, which together improve the development efficiency and trustworthiness of flight safety assistance systems.
📝 Abstract
We introduce AIRHILT (Aviation Integrated Reasoning, Human-in-the-Loop Testbed), a modular and lightweight simulation environment designed to evaluate multimodal pilot and air traffic control (ATC) assistance systems for aviation conflict detection. Built on the open-source Godot engine, AIRHILT synchronizes pilot and ATC radio communications, visual scene understanding from camera streams, and ADS-B surveillance data within a unified, scalable platform. The environment supports pilot- and controller-in-the-loop interactions, providing a comprehensive scenario suite covering both terminal area and en route operational conflicts, including communication errors and procedural mistakes. AIRHILT offers standardized JSON-based interfaces that enable researchers to easily integrate, swap, and evaluate automatic speech recognition (ASR), visual detection, decision-making, and text-to-speech (TTS) models. We demonstrate AIRHILT through a reference pipeline incorporating fine-tuned Whisper ASR, YOLO-based visual detection, ADS-B-based conflict logic, and GPT-OSS-20B structured reasoning, and present preliminary results from representative runway-overlap scenarios, where the assistant achieves an average time-to-first-warning of approximately 7.7 s, with average ASR and vision latencies of approximately 5.9 s and 0.4 s, respectively. The AIRHILT environment and scenario suite are openly available, supporting reproducible research on multimodal situational awareness and conflict detection in aviation; code and scenarios are available at https://github.com/ogarib3/airhilt.
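To make the "standardized JSON-based interfaces" concrete, the sketch below shows the kind of event message a plug-and-play module might emit and how a dispatcher could route it. The field names (`source`, `timestamp_s`, `payload`) and the `route_event` helper are illustrative assumptions for this sketch, not AIRHILT's actual schema or API.

```python
import json

# Hypothetical event messages in the style of a standardized JSON interface;
# field names are assumptions for illustration, not the platform's real schema.
asr_event = {
    "source": "asr",                      # which module produced the event
    "timestamp_s": 12.4,                  # simulation time in seconds
    "payload": {
        "transcript": "cleared to land runway two seven",
        "confidence": 0.91,
    },
}

vision_event = {
    "source": "vision",
    "timestamp_s": 12.8,
    "payload": {
        "detections": [{"label": "aircraft", "bbox": [412, 208, 96, 40]}],
    },
}

def route_event(raw: str) -> str:
    """Decode a serialized JSON event and dispatch on its `source` field.

    A real testbed would hand the payload to the matching downstream model
    (ASR output to the reasoning module, detections to conflict logic, etc.);
    here we just return the source tag to show the routing idea.
    """
    event = json.loads(raw)
    return event["source"]

print(route_event(json.dumps(asr_event)))     # prints "asr"
print(route_event(json.dumps(vision_event)))  # prints "vision"
```

Because every module speaks the same envelope format, swapping in a different ASR or detection model only requires that it serialize its output into this shared structure, which is what makes benchmarking interchangeable components tractable.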