🤖 AI Summary
This study addresses the automated proof of Shannon-type entropy inequalities, a combinatorial search problem that requires constructing nontrivial linear combinations of known constraints and scales poorly with the number of random variables. The approach fine-tunes small large language models (0.6B–1.7B parameters) on atomic proof steps and combines them with a guided beam search over candidate proof paths, ranked by a heuristic scoring mechanism. On a held-out benchmark of 60 inequalities involving 10 to 15 variables, the 0.6B model with tree search achieves an 85% proof success rate, compared with 1.7% for GPT-5.5 under zero-shot prompting and 33.3% for Psitip. Ablation studies show that neither a longer training context (8192 vs. 4096 tokens) nor an n=9-skewed data distribution improves performance, and that the beam-scoring heuristic is essential: replacing it with random scoring reduces success from 83% to 23%. The study also identifies two dominant failure modes, format failures and step quality degradation.
📝 Abstract
Proving Shannon-type entropy inequalities is a fundamental task in information theory that often requires constructing non-trivial linear combinations of known constraints, a combinatorial search problem that scales poorly with the number of random variables. We investigate whether small-scale large language models (0.6B--1.7B parameters), fine-tuned on atomic proof steps and combined with guided beam search, can automate this process. On a held-out test set of 60 inequalities spanning n=10 to 15 variables, our 0.6B fine-tuned model achieves an 85\% proof success rate with tree search. By comparison, GPT-5.5 solves 1.7\% of the samples under zero-shot prompting and Psitip solves 33.3\%. A systematic ablation study across training context length (4096 vs.\ 8192 tokens) and data distribution (n=9-skewed vs.\ unskewed) shows that training with a 4096-token context on unskewed data yields the best performance, while extended context and skewed data provide no additional benefit. We further identify two dominant failure modes -- format failures and step quality degradation -- and verify through a controlled ablation that the beam-scoring heuristic is essential (random scoring reduces success from 83\% to 23\%).
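
To make the search problem concrete, the following is a minimal toy sketch, not the method evaluated above. It uses the standard fact that an unconstrained linear entropy inequality is Shannon-type exactly when it is a non-negative combination of elemental inequalities of the form H(X_i | rest) >= 0 and I(X_i; X_j | X_K) >= 0. Everything else is an assumption made for illustration: each "atomic step" subtracts one elemental inequality from the residual, the score is the L1 norm of the residual (a stand-in for the learned or heuristic scoring, whose actual form the abstract does not specify), and all function names are hypothetical.

```python
from itertools import combinations

def add_term(expr, subset, coeff):
    """Add coeff * H(subset) to expr in place; H(empty set) = 0 is dropped."""
    if subset:
        key = frozenset(subset)
        expr[key] = expr.get(key, 0) + coeff
        if expr[key] == 0:
            del expr[key]

def elemental_inequalities(variables):
    """Yield (name, expr) for each elemental Shannon inequality expr >= 0."""
    full = set(variables)
    for v in variables:  # H(v | all other variables) >= 0
        expr = {}
        add_term(expr, full, 1)
        add_term(expr, full - {v}, -1)
        yield f"H({v}|rest)", expr
    for a, b in combinations(variables, 2):  # I(a;b|K) >= 0
        rest = [v for v in variables if v not in (a, b)]
        for r in range(len(rest) + 1):
            for cond in combinations(rest, r):
                k = set(cond)
                expr = {}
                add_term(expr, k | {a}, 1)
                add_term(expr, k | {b}, 1)
                add_term(expr, k | {a, b}, -1)
                add_term(expr, k, -1)
                name = f"I({a};{b}|{','.join(cond)})" if cond else f"I({a};{b})"
                yield name, expr

def subtract(expr, other):
    """Return expr - other as a new expression."""
    out = dict(expr)
    for subset, coeff in other.items():
        add_term(out, subset, -coeff)
    return out

def score(expr):
    """Toy heuristic (assumption): L1 norm of the residual, lower is better."""
    return sum(abs(c) for c in expr.values())

def beam_search(target, variables, beam_width=8, max_depth=6):
    """Search for elemental inequalities that sum exactly to target."""
    if not target:
        return []
    steps = list(elemental_inequalities(variables))
    beam = [(target, [])]
    for _ in range(max_depth):
        candidates = []
        for residual, path in beam:
            for name, ineq in steps:
                child = subtract(residual, ineq)
                if not child:  # residual is zero: certificate found
                    return path + [name]
                candidates.append((child, path + [name]))
        candidates.sort(key=lambda state: score(state[0]))
        beam = candidates[:beam_width]  # keep only the best-scoring states
    return None

# Toy target: H(X) + H(Y) + H(Z) - H(X,Y,Z) >= 0
variables = "XYZ"
target = {}
for subset, coeff in [("X", 1), ("Y", 1), ("Z", 1), ("XYZ", -1)]:
    add_term(target, subset, coeff)

proof = beam_search(target, variables)
print(proof)
```

For this three-variable target, one valid certificate is I(X;Y) + I(X;Z|Y) + I(Y;Z). The residual norm does not decrease monotonically along every valid path, which is why a beam wider than one can matter even in this toy setting; at n=10 to 15 the number of elemental inequalities grows exponentially, so exhaustive expansion of this kind would no longer be practical.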