🤖 AI Summary
This work addresses the challenge of quantifying semantic progress in multi-turn dialogue evaluation. The authors formalize semantic progress as question-conditioned uncertainty reduction and propose an information-theoretic metric that approximates it in embedding space under a Gaussian formulation, so that information gain can be computed in closed form. The resulting metric is monotonic, decomposes additively across turns, and exhibits diminishing returns for redundant evidence, all without autoregressive LLM inference at evaluation time. Experiments on MT-Bench, Chatbot Arena, and UltraFeedback show competitive agreement with human judgments, with better alignment than several LLM-based judges on MT-Bench and UltraFeedback, and the method remains effective with lightweight embedding models under CPU-only execution.
📝 Abstract
Evaluating multi-turn dialogue is challenging because quality emerges across turns rather than within individual responses. We focus on a key dimension of information-seeking dialogue: semantic progress, defined as the accumulation of new, question-relevant, and non-redundant information over the course of a conversation. We formalize semantic progress as question-conditioned uncertainty reduction and introduce an information-theoretic metric that approximates it in embedding space. Our main estimator uses a tractable Gaussian formulation with closed-form updates, while a complementary maximum-entropy argument shows why log-determinant structure arises more broadly when only second-order embedding information is retained. This formulation yields desirable theoretical properties, including monotonicity, additive decomposition of total information gain across turns, and diminishing returns for redundant evidence. Unlike LLM-as-a-judge approaches, our metric requires no autoregressive inference at evaluation time and is fully reproducible for a fixed embedding model. Experiments on MT-Bench, Chatbot Arena, and UltraFeedback show that the proposed metric achieves competitive agreement with human judgments despite targeting only semantic progress, with improved alignment on MT-Bench and UltraFeedback compared to several LLM-based judges. Notably, the method remains effective with lightweight embedding models under CPU-only execution, indicating that semantic progress can be captured without reliance on large model capacity.
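The abstract does not spell out the estimator, so the following is only a minimal sketch of one standard Gaussian construction that has the stated properties, not the authors' implementation. It assumes an isotropic Gaussian prior over a latent "answer" vector and treats each turn embedding as a noisy linear observation of it. The gain of a turn is then the drop in differential entropy, which has the closed form 0.5 * log(1 + eᵀΣe / σ²), where Σ is the current covariance. The function name `information_gains`, the parameters `noise_var` and `prior_var`, and the toy two-dimensional embeddings are all hypothetical, and question conditioning is deliberately left out.

```python
import math


def information_gains(turn_embeddings, noise_var=1.0, prior_var=1.0):
    """Per-turn Gaussian information gain (in nats) for a list of embeddings.

    Assumed model (illustrative, not taken from the paper): the prior
    covariance is prior_var * I, and each turn embedding e is a rank-one
    observation with noise variance noise_var.
    """
    dim = len(turn_embeddings[0])
    # Isotropic prior covariance.
    cov = [[prior_var if i == j else 0.0 for j in range(dim)] for i in range(dim)]
    gains = []
    for e in turn_embeddings:
        # cov_e = Sigma @ e and quad = e^T Sigma e.
        cov_e = [sum(cov[i][j] * e[j] for j in range(dim)) for i in range(dim)]
        quad = sum(e[i] * cov_e[i] for i in range(dim))
        # Entropy reduction from this turn; equals the change in
        # 0.5 * log det of the precision matrix (matrix determinant lemma).
        gains.append(0.5 * math.log1p(quad / noise_var))
        # Sherman-Morrison update of the covariance after the observation.
        denom = noise_var + quad
        cov = [
            [cov[i][j] - cov_e[i] * cov_e[j] / denom for j in range(dim)]
            for i in range(dim)
        ]
    return gains


# Toy conversation: the second turn repeats the first, the third is new.
turns = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
gains = information_gains(turns)
total_gain = sum(gains)
```

In this toy run every gain is non-negative (monotonicity), the repeated second turn earns less than the first (diminishing returns for redundant evidence), and the per-turn gains sum to the total log-determinant change (additive decomposition). Only small matrix updates are involved, which is consistent with the abstract's point that no autoregressive inference is needed at evaluation time.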