🤖 AI Summary
Question answering systems built on large language models struggle to quantify how a prediction depends on multiple knowledge sources, such as context passages, retrieved results, and reasoning steps, particularly when those sources are redundant or noisy. This work proposes Knot, a structured rank-aware estimator of knowledge dependency for a fixed black-box QA model. Knot learns from subset-level counterfactual supervision and models subset sensitivity as coverage over latent dependency factors, which lets it capture redundancy, substitutability, and complementarity among knowledge units. It then derives rank-aware scores for individual units without test-time perturbations or additional QA-model calls. On multiple-choice and generative QA benchmarks, Knot outperforms all compared baselines in subset-sensitivity prediction and yields more faithful unit rankings than deployable baselines; its dependency scores also help flag error-prone predictions early.
📝 Abstract
Reliable question answering requires identifying not only whether an answer is correct, but also which available knowledge the prediction depends on. In realistic LLM-based QA, this knowledge may come from context, retrieval, decomposition, or intermediate reasoning, forming a noisy and redundant candidate space rather than a clean gold evidence set. We study \emph{knowledge dependency estimation}: estimating the sensitivity of a fixed black-box QA model to different candidate knowledge units. The challenge is to obtain fine-grained dependency scores without exhaustive test-time perturbation while modeling redundancy, substitutability, and complementarity. We propose \textbf{Knot}, a structured rank-aware knowledge dependency estimator. Knot learns from subset-level counterfactual supervision, models subset sensitivity through coverage over latent dependency factors, and derives rank-aware unit scores to identify influential candidates. Across multiple-choice and generative QA benchmarks, Knot outperforms all compared baselines in subset-sensitivity prediction and produces more faithful unit rankings than deployable baselines without extra QA-model calls; when used for practical risk screening, its dependency scores help flag error-prone QA predictions early.
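To make "subset-level counterfactual supervision" concrete, the following minimal sketch shows one way such labels could be gathered for a black-box QA model: remove a subset of candidate units, query the model again, and record the drop in answer confidence. It is an illustration under stated assumptions, not the paper's procedure. The function names (`subset_sensitivities`, `toy_qa_confidence`), the confidence-drop definition of sensitivity, and the toy model are all hypothetical.

```python
from itertools import combinations
from typing import Callable, Dict, FrozenSet, Sequence


def subset_sensitivities(
    qa_confidence: Callable[[Sequence[str]], float],
    units: Sequence[str],
    max_subset_size: int = 2,
) -> Dict[FrozenSet[int], float]:
    """Map each removed subset of unit indices to the resulting confidence drop.

    Assumption: sensitivity is defined here as
    confidence(all units) - confidence(units with the subset removed).
    """
    full = qa_confidence(units)
    labels: Dict[FrozenSet[int], float] = {}
    for size in range(1, max_subset_size + 1):
        for removed in combinations(range(len(units)), size):
            kept = [u for i, u in enumerate(units) if i not in removed]
            labels[frozenset(removed)] = full - qa_confidence(kept)
    return labels


def toy_qa_confidence(units: Sequence[str]) -> float:
    """Hypothetical stand-in for a black-box QA model.

    It is confident only if some remaining unit mentions the capital of France.
    """
    return 0.9 if any("capital of France" in u for u in units) else 0.2


units = [
    "Paris is the capital of France.",
    "The capital of France is Paris.",  # redundant with unit 0
    "France borders Spain.",            # noise
]
labels = subset_sensitivities(toy_qa_confidence, units)

for removed, drop in sorted(labels.items(), key=lambda kv: sorted(kv[0])):
    print(sorted(removed), round(drop, 2))
```

In this toy setting, removing either redundant unit alone leaves the confidence unchanged, while removing both lowers it. Single-unit ablation misses exactly this pattern, which is why supervision at the subset level is needed to capture redundancy and substitutability. Since the abstract states that Knot needs no extra QA-model calls at test time, perturbation of this kind would presumably be confined to building training labels.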