Operadic consistency: a label-free signal for compositional reasoning failures in LLMs

📅 2026-06-11
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
This work addresses the detection of compositional reasoning errors in large language models (LLMs) at inference time, when no ground-truth labels are available. It proposes operadic consistency (OC), a label-free, per-question diagnostic grounded in operad theory: OC measures whether a model's direct answer to a composite question agrees with the answer it produces by composing a stated decomposition of the same question. Evaluated on 12 instruction-tuned LLMs and 4 multi-hop question-answering datasets, OC correlates strongly with accuracy on every dataset (Pearson r ∈ [0.86, 0.94]) and is the only evaluated signal with r ≥ 0.85 on all four. Chain-of-Thought Self-Consistency (CoT-SC) matches OC on HotpotQA and DROP but drops to r ≈ 0.45 on MuSiQue and StrategyQA. At an equal-cost budget of K = 3, OC also improves selective prediction over a tuned CoT-SC baseline, with AUARC and AUROC gains whose 95% CIs exclude zero.
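The OC agreement check can be illustrated with a minimal Python sketch. It is a simplification under stated assumptions, not the paper's implementation: the `ask` callable stands in for an LLM call, the `{prev}` placeholder convention for chaining sub-questions is hypothetical, and agreement is reduced to a binary exact match after light normalization, whereas the paper's scoring may be graded or sampled.

```python
import re
import string
from typing import Callable, Dict, Sequence


def normalize(answer: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = answer.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def compose(ask: Callable[[str], str], sub_questions: Sequence[str]) -> str:
    """Answer sub-questions in order, substituting each answer into the next.

    The '{prev}' placeholder is an assumed convention for this sketch.
    """
    prev = ""
    for sub_question in sub_questions:
        prev = ask(sub_question.replace("{prev}", prev))
    return prev


def oc_signal(
    ask: Callable[[str], str], question: str, sub_questions: Sequence[str]
) -> int:
    """Return 1 if the direct and composed answers agree, else 0."""
    direct = ask(question)
    composed = compose(ask, sub_questions)
    return int(normalize(direct) == normalize(composed))


# Toy stand-in for a model: a lookup table of canned answers.
toy_answers: Dict[str, str] = {
    "What is the capital of the country where the Eiffel Tower stands?": "Paris",
    "In which country does the Eiffel Tower stand?": "France",
    "What is the capital of France?": "Paris.",
}

question = "What is the capital of the country where the Eiffel Tower stands?"
decomposition = [
    "In which country does the Eiffel Tower stand?",
    "What is the capital of {prev}?",
]
toy_score = oc_signal(toy_answers.__getitem__, question, decomposition)
```

No ground-truth label enters the computation: the signal depends only on whether the model agrees with itself across the two routes to an answer.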
📝 Abstract
Detecting LLM reasoning failures at inference time without ground-truth labels has motivated a wide range of confidence baselines, including self-consistency, semantic entropy, and P(True), built on within-question sampling and self-evaluation. Operad theory, the formalism for systems built by iterated substitution, suggests a complementary diagnostic: a model's direct answer to a compositional query should agree with the answer it produces by composing a stated decomposition of the same query. We instantiate this idea as operadic consistency (OC), a per-question signal. Across twelve instruction-tuned LLMs (4B to 671B parameters, open-weights and closed-source) on four multi-hop QA datasets, OC is strongly correlated with accuracy on every dataset (Pearson $r \in [0.86, 0.94]$, all $p \leq 0.0004$), and is the only signal we evaluate with $r \geq 0.85$ uniformly across all four datasets. Chain-of-thought self-consistency (CoT-SC; Wang et al., 2023) matches OC on HotpotQA and DROP ($r = 0.93, 0.87$) but drops to $r \approx 0.45$ on MuSiQue and StrategyQA. At the per-question level, OC contributes information beyond CoT-SC and semantic entropy on every dataset (cluster-robust $p \leq 10^{-16}$ for the OC coefficient), and the conclusion is robust to additionally controlling for constructed decomposition-aware baselines ($p \leq 10^{-13}$). The same signal yields selective-prediction improvements (accuracy at fixed coverage) over a tuned CoT-SC baseline at the equal-cost $K = 3$ budget (AUARC lifts of +0.086 to +0.096 and AUROC lifts of +0.092 to +0.164; 95% CIs exclude zero on every cell). On five frontier thinking models, where the decomposition is extracted from the model's own chain of thought, the same equal-cost comparison gives positive selective-prediction point-estimate lift on all 16 (dataset, budget, metric) cells tested, with 95% CIs excluding zero on 12 of the 16.
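To make the selective-prediction metric concrete, the sketch below computes an area under the accuracy-rejection curve (AUARC) from per-question confidence scores and correctness labels. It uses one common discretization (mean accuracy over the top-k most confident questions, for every k); the abstract does not specify the paper's exact estimator, so the function and its inputs are illustrative assumptions.

```python
from typing import Sequence


def auarc(confidence: Sequence[float], correct: Sequence[int]) -> float:
    """Average accuracy over all coverage levels, most confident first.

    confidence: per-question scores (for example, an OC-based signal).
    correct:    1 if the model's answer was right, else 0.
    Ties in confidence keep their input order (sorted() is stable).
    """
    if len(confidence) != len(correct) or len(confidence) == 0:
        raise ValueError("inputs must be non-empty and of equal length")

    # Indices of questions, ranked from most to least confident.
    order = sorted(range(len(confidence)), key=lambda i: confidence[i], reverse=True)

    hits = 0
    accuracy_sum = 0.0
    for kept, index in enumerate(order, start=1):
        hits += int(correct[index])
        accuracy_sum += hits / kept  # accuracy when answering only the top `kept`
    return accuracy_sum / len(order)


# A signal that ranks correct answers first scores higher than one that
# ranks them last, even though overall accuracy is the same (0.5).
good_ranking = auarc([0.9, 0.8, 0.2, 0.1], [1, 1, 0, 0])
bad_ranking = auarc([0.1, 0.2, 0.8, 0.9], [1, 1, 0, 0])
```

A lift in AUARC, as reported for OC over CoT-SC, therefore indicates that the signal is better at deciding which questions to abstain on, not that the underlying model answers more questions correctly.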
Problem

Research questions and friction points this paper is trying to address.

compositional reasoning
reasoning failures
label-free detection
large language models
inference-time diagnosis
Innovation

Methods, ideas, or system contributions that make the work stand out.

operadic consistency
compositional reasoning
label-free detection
large language models
selective prediction