🤖 AI Summary
This work addresses the limitations of existing approaches to diagnosing 802.11 packet captures. Manual diagnosis depends heavily on expert knowledge, which makes it slow and hard to scale, while large language models (LLMs) are prone to hallucination, unreliable confidence estimates, and circular evaluation bias when golden references are co-produced by the model under test. To overcome these challenges, the authors propose PROBE, an evidence-driven, multi-stage diagnostic pipeline that integrates deterministic frame-level PCAP textualization, multi-run and multi-candidate ensembling with an optional cross-model second opinion, progressive obfuscation, and a composite reliability score that operates without LLM-based self-evaluation. Evaluated on 87 enterprise Wi-Fi packet captures, PROBE achieves an evidence F1 score of 0.957 and a 96% automatic acceptance rate, with worst-case F1 above 0.70, outperforming both the expert baseline (0.871) and naive ensemble voting (0.842).
📝 Abstract
Diagnosing 802.11 packet captures requires expert protocol knowledge; the process is slow, inconsistent across engineers, and does not scale. LLM-based approaches produce plausible-sounding diagnoses but fabricate protocol events absent from captures (especially truncated traces), produce uncalibrated confidence scores, and suffer evaluation bias when golden references are co-produced by the model under test. We introduce PROBE (Protocol Reasoning Over evidence-Based Ensembles), a multi-stage pipeline that addresses all three failures. It integrates (i) deterministic PCAP-to-text normalization with frame-level verifiability, (ii) multi-run, multi-candidate ensembles with optional cross-model second opinion and progressive obfuscation, (iii) a verdict-aware evidence framework treating absence of failure evidence as contributing evidence, and (iv) a fully deterministic composite reliability score computed from evidence validity, run-to-run stability, and cross-model agreement, without LLM self-assessment.
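To make the idea of a deterministic composite reliability score concrete, the following is a minimal sketch that combines the three signals named above. The abstract does not give the formula, so the weighted mean, the weights `(0.5, 0.3, 0.2)`, and all function names are illustrative assumptions, not the authors' definition.

```python
from collections import Counter


def evidence_validity(cited_frames, capture_frames):
    """Fraction of cited frame numbers that actually exist in the capture.

    Assumption: an answer with no citations scores 0.0 here. The paper's
    verdict-aware framework may treat that case differently.
    """
    if not cited_frames:
        return 0.0
    found = sum(1 for frame in cited_frames if frame in capture_frames)
    return found / len(cited_frames)


def run_stability(verdicts):
    """Share of repeated runs that returned the most common verdict."""
    if not verdicts:
        return 0.0
    top_count = Counter(verdicts).most_common(1)[0][1]
    return top_count / len(verdicts)


def composite_reliability(validity, stability, agreement,
                          weights=(0.5, 0.3, 0.2)):
    """Weighted mean of the three signals (hypothetical weights).

    No model-reported confidence enters the score, so the same inputs
    always produce the same output.
    """
    w_validity, w_stability, w_agreement = weights
    total = w_validity + w_stability + w_agreement
    weighted = (w_validity * validity
                + w_stability * stability
                + w_agreement * agreement)
    return weighted / total


# Hypothetical inputs: three cited frames, one of which is not in the capture.
validity = evidence_validity([10, 12, 99], {10, 11, 12})
stability = run_stability(
    ["auth_failure", "auth_failure", "no_issue", "auth_failure"]
)
agreement = 1.0  # assumed: the second model returned the same verdict
score = composite_reliability(validity, stability, agreement)
```

Because every input is computed by checking model output against the capture or against other runs, a score like this can drive an auto-accept threshold without trusting the model's own confidence.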
On 87 enterprise Wi-Fi captures (104 capture-reviewer pairs), single-pass LLM analysis raises weighted evidence F1 from 0.871 (expert baseline) to 0.912 but misses critical frames in 35% of cases. Naive ensemble voting drops below baseline (0.842) as majority voting amplifies conservative verdicts: 50% of confirmed failures are misclassified as 'no issue' or 'insufficient evidence.' Adding evidence-grounded reconciliation achieves 0.957 F1, a 96% auto-accept rate, and a worst-case floor above 0.70. LLM self-reported confidence clusters at 0.95 regardless of difficulty (71% report exactly 0.95), confirming it is uninformative. We also introduce a model-agnostic evaluation framework using per-field assertion matching, eliminating circular bias from model-co-produced golden references.
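The failure mode of naive voting described above can be illustrated with a toy example. This is a sketch under stated assumptions: the candidate structure, the verdict labels, and the selection rule in `evidence_grounded` are hypothetical simplifications, not the reconciliation procedure used in PROBE.

```python
from collections import Counter


def majority_vote(candidates):
    """Return the verdict produced by the largest number of candidates."""
    counts = Counter(c["verdict"] for c in candidates)
    return counts.most_common(1)[0][0]


def evidence_grounded(candidates, capture_frames):
    """Return the verdict of the candidate with the best-verified evidence.

    Assumed rule: rank by the number of cited frames found in the capture,
    then by the fraction of citations that verify.
    """
    def score(candidate):
        cited = candidate["frames"]
        verified = [frame for frame in cited if frame in capture_frames]
        fraction = len(verified) / len(cited) if cited else 0.0
        return (len(verified), fraction)

    return max(candidates, key=score)["verdict"]


# Hypothetical capture with frames 1..100 and five candidate diagnoses.
capture_frames = set(range(1, 101))
candidates = [
    {"verdict": "no_issue", "frames": []},
    {"verdict": "no_issue", "frames": []},
    {"verdict": "no_issue", "frames": []},
    {"verdict": "auth_failure", "frames": [41, 42]},
    {"verdict": "auth_failure", "frames": [41, 42]},
]

voted = majority_vote(candidates)
grounded = evidence_grounded(candidates, capture_frames)
```

Here three conservative candidates outvote two that cite verifiable frames, so `majority_vote` returns `no_issue` while the evidence-based rule returns `auth_failure`. A real reconciliation step would also need to weigh evidence for the absence of a failure, which this sketch ignores.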