Entropy Sentinel: Continuous LLM Accuracy Monitoring from Decoding Entropy Traces in STEM

📅 2026-01-13
📈 Citations: 1
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the challenge of real-time monitoring of cross-domain performance degradation in large language model (LLM) deployment and guiding data collection without labeled feedback. The authors propose leveraging the entropy trace of output token probability distributions during decoding as an unlabeled, inference-time signal. By constructing an 11-dimensional statistical feature vector from top-k logprobs, they train a lightweight classifier to predict the correctness of individual responses and aggregate these predictions to estimate domain-level accuracy. This approach is the first to demonstrate that decoding entropy traces can effectively support cross-model, cross-domain performance monitoring and data prioritization. Experiments across 10 STEM benchmarks and 9 mainstream LLMs (3B–20B parameters) show strong alignment between estimated and ground-truth accuracy, with multiple models exhibiting near-monotonic domain ranking capability.
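The entropy trace described above can be computed directly from the per-step top-k log-probabilities that most LLM APIs expose. A minimal sketch, assuming the top-k mass is renormalized before computing entropy (the paper's exact treatment of the truncated distribution is not specified here):

```python
import math

def entropy_trace(topk_logprobs):
    """Per-token entropy profile from top-k log-probabilities.

    topk_logprobs: one list per decoding step, each holding the
    log-probabilities of the top-k candidate tokens.

    Assumption: the top-k probability mass is renormalized to sum
    to 1 before computing entropy, since the full-vocabulary
    distribution is unavailable at inference time.
    """
    trace = []
    for step in topk_logprobs:
        probs = [math.exp(lp) for lp in step]
        z = sum(probs)
        probs = [p / z for p in probs]
        # Shannon entropy (nats) of the renormalized distribution
        h = -sum(p * math.log(p) for p in probs if p > 0)
        trace.append(h)
    return trace
```

A uniform two-way split at a step yields entropy ln 2 ≈ 0.693, while a sharply peaked step yields entropy near zero, so the trace rises where the model is uncertain.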

📝 Abstract
Deploying LLMs raises two coupled challenges: (1) monitoring - estimating where a model underperforms as traffic and domains drift - and (2) improvement - prioritizing data acquisition to close the largest performance gaps. We test whether an inference-time signal can estimate slice-level accuracy under domain shift. For each response, we compute an output-entropy profile from final-layer next-token probabilities (from top-k logprobs) and summarize it with eleven statistics. A lightweight classifier predicts instance correctness, and averaging predicted probabilities yields a domain-level accuracy estimate. We evaluate on ten STEM reasoning benchmarks with exhaustive train/test compositions (k in {1,2,3,4}; all "10 choose k" combinations), across nine LLMs from six families (3B-20B). Estimates often track held-out benchmark accuracy, and several models show near-monotonic ordering of domains. Output-entropy profiles are thus an accessible signal for scalable monitoring and for targeting data acquisition.
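The pipeline in the abstract reduces each response's entropy trace to eleven summary statistics, feeds them to a lightweight correctness classifier, and averages the predicted probabilities per domain. A sketch of the last two stages, assuming a plausible but hypothetical choice of the eleven statistics (the paper's exact feature set is not reproduced here):

```python
import statistics as st

def entropy_features(trace):
    """Summarize an entropy trace with eleven statistics.

    Assumption: the specific eleven features below (moments,
    quartiles, endpoints, trend, high-entropy fraction) are an
    illustrative choice, not the paper's exact feature set.
    """
    n = len(trace)
    s = sorted(trace)
    mean = st.fmean(trace)
    std = st.pstdev(trace)
    q1, med, q3 = s[n // 4], s[n // 2], s[(3 * n) // 4]
    first, last = trace[0], trace[-1]
    slope = (last - first) / max(n - 1, 1)       # crude linear trend
    frac_high = sum(t > mean for t in trace) / n  # fraction above mean
    return [mean, std, min(trace), max(trace),
            q1, med, q3, first, last, slope, frac_high]

def estimate_domain_accuracy(prob_correct):
    """Aggregate per-response correctness probabilities (from a
    trained classifier) into a domain-level accuracy estimate by
    simple averaging, as described in the abstract."""
    return sum(prob_correct) / len(prob_correct)
```

Any off-the-shelf binary classifier (e.g. logistic regression) can be trained on these feature vectors against per-instance correctness labels from held-out benchmarks; at deployment time only the unlabeled entropy traces are needed.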
Problem

Research questions and friction points this paper is trying to address.

LLM accuracy monitoring
domain shift
decoding entropy
STEM reasoning
performance estimation
Innovation

Methods, ideas, or system contributions that make the work stand out.

decoding entropy
accuracy monitoring
domain shift
LLM evaluation
entropy profile
Pedro Memoli Buffa
Departamento de Matematica, FCEyN, Universidad de Buenos Aires
Luciano Del Corro
Microsoft Research
natural language understanding
information extraction
relation extraction
knowledge bases