🤖 AI Summary
This work addresses the challenge of real-time monitoring of cross-domain performance degradation in large language model (LLM) deployment and guiding data collection without labeled feedback. The authors propose leveraging the entropy trace of output token probability distributions during decoding as an unlabeled, inference-time signal. By constructing an 11-dimensional statistical feature vector from top-k logprobs, they train a lightweight classifier to predict the correctness of individual responses and aggregate these predictions to estimate domain-level accuracy. This approach is the first to demonstrate that decoding entropy traces can effectively support cross-model, cross-domain performance monitoring and data prioritization. Experiments across 10 STEM benchmarks and 9 mainstream LLMs (3B–20B parameters) show strong alignment between estimated and ground-truth accuracy, with multiple models exhibiting near-monotonic domain ranking capability.
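The aggregation step the summary describes, estimating domain-level accuracy by averaging per-response correctness probabilities, can be sketched as follows. This is a minimal illustration under assumed names; the summary does not specify an API, and `estimate_domain_accuracy` is hypothetical.

```python
def estimate_domain_accuracy(p_correct):
    # Domain-level accuracy estimate: the mean of the lightweight
    # classifier's per-response predicted probabilities of correctness.
    # p_correct: list of floats in [0, 1], one per response in the domain.
    return sum(p_correct) / len(p_correct)
```

For example, three responses with predicted correctness probabilities 0.9, 0.4, and 0.7 yield an estimated domain accuracy of about 0.67.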
📝 Abstract
Deploying LLMs raises two coupled challenges: (1) monitoring, i.e., estimating where a model underperforms as traffic and domains drift, and (2) improvement, i.e., prioritizing data acquisition to close the largest performance gaps. We test whether an inference-time signal can estimate slice-level accuracy under domain shift. For each response, we compute an output-entropy profile from final-layer next-token probabilities (reconstructed from top-k logprobs) and summarize it with eleven statistics. A lightweight classifier predicts instance-level correctness, and averaging its predicted probabilities yields a domain-level accuracy estimate. We evaluate on ten STEM reasoning benchmarks with exhaustive train/test compositions (k in {1, 2, 3, 4}; all "10 choose k" combinations), across nine LLMs from six families (3B–20B). Estimates often track held-out benchmark accuracy, and several models show near-monotonic ordering of domains. Output-entropy profiles are thus an accessible signal for scalable monitoring and for targeting data acquisition.
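The entropy-profile construction described in the abstract can be sketched as below: per-step Shannon entropy computed from top-k logprobs, then summarized with simple statistics. This is an illustrative sketch, not the paper's implementation; the function names are assumptions, and only a subset of plausible statistics is shown since the abstract does not enumerate its eleven features.

```python
import math

def entropy_trace(topk_logprobs):
    # One Shannon entropy value per decoding step, reconstructed from the
    # top-k logprobs and renormalized over the retained k tokens (the tail
    # mass outside the top k is not exposed by logprob APIs).
    trace = []
    for step in topk_logprobs:
        probs = [math.exp(lp) for lp in step]
        z = sum(probs)
        probs = [p / z for p in probs]
        trace.append(-sum(p * math.log(p) for p in probs if p > 0))
    return trace

def entropy_features(trace):
    # Summary statistics of the entropy profile (illustrative subset of
    # the paper's eleven features, which the abstract does not list).
    n = len(trace)
    mean = sum(trace) / n
    std = (sum((x - mean) ** 2 for x in trace) / n) ** 0.5
    return [mean, std, min(trace), max(trace), trace[0], trace[-1]]
```

A uniform top-2 step yields the maximum entropy log 2, while a sharply peaked step yields a value near zero, so the resulting feature vector captures how confidently the model decoded the response.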