🤖 AI Summary
This work addresses the challenge that large language models (LLMs) often rely on outdated or incorrect parametric knowledge when it conflicts with contextual information. We propose a zero-shot method for conflict detection and knowledge-source attribution grounded in residual stream analysis. We first identify discriminative, layer-specific residual stream signatures of knowledge conflicts and characterise their distinct activation patterns under parametric versus contextual knowledge dominance. By combining intermediate-layer activation probing, residual stream decomposition, and behavioural attribution analysis, our approach identifies conflicts and predicts the model's knowledge-source preference before generation, without model modification or input perturbation. Experiments across multiple benchmarks demonstrate strong detection accuracy. Our method establishes a paradigm for controllable and interpretable knowledge selection in LLMs, offering fine-grained insight into how models resolve conflicting information during inference.
📝 Abstract
Large language models (LLMs) can store a significant amount of factual knowledge in their parameters. However, this parametric knowledge may conflict with information provided in the context. Such conflicts can lead to undesirable model behaviour, such as reliance on outdated or incorrect information. In this work, we investigate whether LLMs can identify knowledge conflicts, and whether it is possible to predict which source of knowledge the model will rely on, by analysing the LLM's residual stream. Through probing tasks, we find that LLMs internally register a signal of knowledge conflict in the residual stream, which can be accurately detected by probing intermediate model activations. This allows us to detect conflicts in the residual stream before answer generation, without modifying the input or the model parameters. Moreover, we find that the residual stream shows significantly different patterns when the model relies on contextual versus parametric knowledge to resolve a conflict. This pattern can be used to estimate the model's behaviour when a conflict occurs and to prevent unexpected answers before they are produced. Our analysis offers insight into how LLMs internally manage knowledge conflicts and provides a foundation for methods that control the knowledge selection process.
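The layer-wise activation probing the abstract describes can be sketched as follows. This is a minimal illustration, not the authors' implementation: the activations here are synthetic stand-ins for residual stream vectors (in practice they would be the model's hidden states at a chosen token position, e.g. obtained with `output_hidden_states=True` in Hugging Face transformers), and the array shapes and the injected "conflict direction" are assumptions made for the sketch.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_layers, d_model, n_examples = 12, 64, 400

# Synthetic stand-in for residual-stream activations: conflict examples
# receive a class-dependent shift along a fixed direction that grows with
# depth, mimicking a conflict signal that emerges in intermediate layers.
labels = rng.integers(0, 2, n_examples)            # 1 = knowledge conflict
acts = rng.normal(size=(n_layers, n_examples, d_model))
direction = rng.normal(size=d_model)
for layer in range(n_layers):
    shift = 0.3 * layer / n_layers
    acts[layer, labels == 1] += shift * direction

# Train one linear probe per layer and record held-out accuracy;
# the layer with the highest accuracy is where the conflict signal
# is most linearly decodable.
layer_acc = []
for layer in range(n_layers):
    X_tr, X_te, y_tr, y_te = train_test_split(
        acts[layer], labels, test_size=0.25, random_state=0)
    probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    layer_acc.append(probe.score(X_te, y_te))

best = int(np.argmax(layer_acc))
print(f"best layer: {best}, held-out accuracy: {layer_acc[best]:.2f}")
```

Because the probe is trained on cached activations only, this style of detector can flag a conflict before any answer token is generated, without touching the model's weights or the prompt.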