Operadic consistency: a label-free signal for compositional reasoning failures in LLMs

📅 2026-06-11

🤖 AI Summary

This work addresses the detection of compositional reasoning errors in large language models (LLMs) at inference time, without ground-truth labels. It proposes operadic consistency (OC), a label-free, per-question diagnostic grounded in operad theory, the formalism for systems built by iterated substitution. OC measures whether a model's direct answer to a compositional question agrees with the answer it produces by composing a stated decomposition of the same question. Evaluated on twelve instruction-tuned LLMs and four multi-hop question-answering datasets, OC correlates strongly with accuracy on every dataset (Pearson r ∈ [0.86, 0.94]) and is the only evaluated signal with r ≥ 0.85 on all four. Chain-of-Thought Self-Consistency (CoT-SC) matches OC on HotpotQA and DROP but drops to r ≈ 0.45 on MuSiQue and StrategyQA. OC also improves selective prediction over a tuned CoT-SC baseline at an equal-cost budget, with AUARC and AUROC lifts whose 95% confidence intervals exclude zero.
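The agreement check described above can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration rather than the paper's implementation: the function names (`normalize`, `composed_answer`, `operadic_consistency`), the `#1`-style placeholder convention for passing earlier sub-answers forward, the exact-match comparison after normalization, and the toy lookup-table "model" are all hypothetical.

```python
from typing import Callable, Dict, List


def normalize(answer: str) -> str:
    """Lowercase, trim, drop a trailing period, and collapse whitespace."""
    return " ".join(answer.lower().strip().rstrip(".").split())


def composed_answer(sub_questions: List[str], answer_fn: Callable[[str], str]) -> str:
    """Answer sub-questions in order, substituting earlier answers for '#i'."""
    answers: List[str] = []
    for sub_q in sub_questions:
        prompt = sub_q
        # Replace higher indices first so '#10' is not clobbered by '#1'.
        for i in range(len(answers), 0, -1):
            prompt = prompt.replace(f"#{i}", answers[i - 1])
        answers.append(answer_fn(prompt))
    return answers[-1]


def operadic_consistency(
    question: str,
    sub_questions: List[str],
    answer_fn: Callable[[str], str],
    k: int = 3,
) -> float:
    """Fraction of k trials in which the direct and composed answers agree."""
    agree = 0
    for _ in range(k):
        direct = answer_fn(question)
        composed = composed_answer(sub_questions, answer_fn)
        agree += int(normalize(direct) == normalize(composed))
    return agree / k


# Toy stand-in for an LLM: a deterministic lookup table (hypothetical data).
TOY_ANSWERS: Dict[str, str] = {
    "Where was the director of Inception born?": "London",
    "Who directed Inception?": "Christopher Nolan",
    "Where was Christopher Nolan born?": "London.",
}


def toy_model(prompt: str) -> str:
    return TOY_ANSWERS[prompt]


question = "Where was the director of Inception born?"
sub_questions = ["Who directed Inception?", "Where was #1 born?"]
score = operadic_consistency(question, sub_questions, toy_model)
```

With a real model, `answer_fn` would wrap a sampled LLM call, so the score could fall anywhere between 0 and 1. Exact string matching is the simplest possible agreement test; a semantic-equivalence check could be swapped in without changing the structure.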

📝 Abstract

Detecting LLM reasoning failures at inference time without ground-truth labels has motivated a wide range of confidence baselines, including self-consistency, semantic entropy, and P(True), built on within-question sampling and self-evaluation. Operad theory, the formalism for systems built by iterated substitution, suggests a complementary diagnostic: a model's direct answer to a compositional query should agree with the answer it produces by composing a stated decomposition of the same query. We instantiate this idea as operadic consistency (OC), a per-question signal. Across twelve instruction-tuned LLMs (4B to 671B parameters, open-weights and closed-source) on four multi-hop QA datasets, OC is strongly correlated with accuracy on every dataset (Pearson $r \in [0.86, 0.94]$, all $p \leq 0.0004$), and is the only signal we evaluate with $r \geq 0.85$ uniformly across all four datasets. Chain-of-thought self-consistency (CoT-SC; Wang et al., 2023) matches OC on HotpotQA and DROP ($r = 0.93, 0.87$) but drops to $r \approx 0.45$ on MuSiQue and StrategyQA. At the per-question level, OC contributes information beyond CoT-SC and semantic entropy on every dataset (cluster-robust $p \leq 10^{-16}$ for the OC coefficient), and the conclusion is robust to additionally controlling for constructed decomposition-aware baselines ($p \leq 10^{-13}$). The same signal yields selective-prediction improvements (accuracy at fixed coverage) over a tuned CoT-SC baseline at the equal-cost $K = 3$ budget (AUARC lifts of +0.086 to +0.096 and AUROC lifts of +0.092 to +0.164; 95% CIs exclude zero on every cell). On five frontier thinking models, where the decomposition is extracted from the model's own chain of thought, the same equal-cost comparison gives positive selective-prediction point-estimate lift on all 16 (dataset, budget, metric) cells tested, with 95% CIs excluding zero on 12 of the 16.
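To make the selective-prediction metric concrete, the following Python sketch computes an area under the accuracy-rejection curve (AUARC) from per-question confidence scores, using one common discrete definition: the mean accuracy over the top-k most confident questions, averaged over every coverage level k. The abstract does not give the paper's exact computation, so the function name `auarc`, the tie handling, and the toy data are assumptions.

```python
from typing import Sequence


def auarc(confidences: Sequence[float], correct: Sequence[int]) -> float:
    """Average, over all coverage levels, of accuracy on the most confident items.

    confidences: one score per question (higher means more trusted).
    correct:     1 if the model's answer was right, else 0.
    """
    if len(confidences) != len(correct) or not confidences:
        raise ValueError("inputs must be non-empty and of equal length")
    # Answer the most confident questions first; ties keep input order.
    order = sorted(range(len(confidences)), key=lambda i: -confidences[i])
    running_correct = 0
    total = 0.0
    for k, idx in enumerate(order, start=1):
        running_correct += correct[idx]
        total += running_correct / k  # accuracy when answering the top-k only
    return total / len(order)


# Toy data (hypothetical): an informative signal ranks correct answers first.
labels = [1, 1, 0, 0]
informative = auarc([0.9, 0.8, 0.2, 0.1], labels)
misleading = auarc([0.1, 0.2, 0.8, 0.9], labels)
```

A confidence signal that ranks correct answers above incorrect ones raises this value, which is the sense in which a lift in AUARC over a baseline indicates better selective prediction.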

Problem

Research questions and friction points this paper is trying to address.

compositional reasoning

reasoning failures

label-free detection

large language models

inference-time diagnosis

Innovation

Methods, ideas, or system contributions that make the work stand out.

operadic consistency

compositional reasoning

label-free detection