🤖 AI Summary
To address the inherent trade-off between poor interpretability of chess AI decision-making and the hallucination-prone, reasoning-deficient nature of large language models (LLMs), this paper proposes the Concept-Guided Commentary Generation framework (CCC) and the expert-knowledge-informed GPT-based evaluation method (GCC-Eval). CCC pioneers the structural distillation of AlphaZero-style policy network decisions into human-interpretable concepts and integrates them into LLM commentary generation. GCC-Eval enables fine-grained, automated assessment via multi-dimensional prompt engineering. Augmented with concept extraction and ranking modules plus a human-in-the-loop verification mechanism, CCC achieves—under both comprehensive human evaluation and GCC-Eval—a 32% improvement in commentary accuracy, a 2.1× increase in information density, and a hallucination rate reduced to <5%, significantly outperforming both vanilla LLMs and rule-based template baselines.
📝 Abstract
Deep learning-based expert models have reached superhuman performance in decision-making domains such as chess and Go. However, it is under-explored to explain or comment on given decisions although it is important for model explainability and human education. The outputs of expert models are accurate, but yet difficult to interpret for humans. On the other hand, large language models (LLMs) can produce fluent commentary but are prone to hallucinations due to their limited decision-making capabilities. To bridge this gap between expert models and LLMs, we focus on chess commentary as a representative task of explaining complex decision-making processes through language and address both the generation and evaluation of commentary. We introduce Concept-guided Chess Commentary generation (CCC) for producing commentary and GPT-based Chess Commentary Evaluation (GCC-Eval) for assessing it. CCC integrates the decision-making strengths of expert models with the linguistic fluency of LLMs through prioritized, concept-based explanations. GCC-Eval leverages expert knowledge to evaluate chess commentary based on informativeness and linguistic quality. Experimental results, validated by both human judges and GCC-Eval, demonstrate that CCC generates commentary which is accurate, informative, and fluent.