AI Summary
This work addresses a critical gap in current knowledge editing evaluation, which predominantly focuses on direct factual recall while neglecting logical consistency after edits. The authors introduce the first benchmark that explicitly incorporates logical coherence by constructing multi-hop reasoning questions grounded in logical rules from knowledge graphs. Using this framework, they systematically evaluate prominent editing methods, such as ROME and fine-tuning (FT), on their ability to maintain semantic consistency beyond immediate factual updates. Experimental results reveal that, despite strong performance on direct edits, these methods suffer significant performance degradation (up to 24%) on multi-hop logical reasoning tasks. This highlights a fundamental limitation in their semantic awareness and underscores the necessity of the proposed benchmark for advancing more robust and logically coherent knowledge editing techniques.
Abstract
Large Language Models (LLMs) are increasingly deployed in real-world applications that require access to up-to-date knowledge. However, retraining LLMs is computationally expensive. Therefore, knowledge editing techniques are crucial for maintaining current information and correcting erroneous assertions within pre-trained models. Current benchmarks for knowledge editing primarily focus on recalling edited facts, often neglecting their logical consequences. To address this limitation, we introduce a new benchmark designed to evaluate how knowledge editing methods handle the logical consequences of a single fact edit. Our benchmark extracts relevant logical rules from a knowledge graph for a given edit. Then, it generates multi-hop questions based on these rules to assess the impact on logical consequences. Our findings indicate that while existing knowledge editing approaches can accurately insert direct assertions into LLMs, they frequently fail to inject entailed knowledge. Specifically, experiments with popular methods like ROME and FT reveal a substantial performance gap, up to 24%, between evaluations on directly edited knowledge and on entailed knowledge. This highlights the critical need for semantics-aware evaluation frameworks in knowledge editing.
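To make the pipeline described above concrete, here is a minimal sketch of how an edit, a logical rule, and a resulting multi-hop question might fit together. It is an illustration under stated assumptions, not the authors' implementation: the toy graph, the two-hop composition rule format, and the names `Rule`, `Edit`, `apply_edit`, and `multi_hop_questions` are all hypothetical, and the benchmark's actual rule extraction and question templates are not specified in the abstract.

```python
from dataclasses import dataclass

# A fact is a (subject, relation, object) triple. Toy representation, assumed here.
Triple = tuple[str, str, str]


@dataclass(frozen=True)
class Rule:
    """Hypothetical composition rule: first(x, y) AND second(y, z) => head(x, z)."""
    first: str
    second: str
    head: str


@dataclass(frozen=True)
class Edit:
    """A single fact edit: (subject, relation) now points to new_object."""
    subject: str
    relation: str
    new_object: str


def apply_edit(kg: set[Triple], edit: Edit) -> set[Triple]:
    """Return a copy of the graph where the edited fact replaces the old one."""
    kept = {t for t in kg if not (t[0] == edit.subject and t[1] == edit.relation)}
    return kept | {(edit.subject, edit.relation, edit.new_object)}


def multi_hop_questions(
    kg: set[Triple], rules: list[Rule], edit: Edit
) -> list[tuple[str, str]]:
    """Build (question, expected answer) pairs for facts entailed by the edit."""
    edited = apply_edit(kg, edit)
    questions = []
    for rule in rules:
        # Only rules whose first hop uses the edited relation are relevant here.
        if rule.first != edit.relation:
            continue
        # Follow the second hop from the new object to find the entailed answer.
        for subj, rel, obj in sorted(edited):
            if subj == edit.new_object and rel == rule.second:
                questions.append((f"{rule.head}({edit.subject}, ?)", obj))
    return questions


# Toy data, invented for illustration only.
kg = {
    ("Alice", "works_for", "AcmeCorp"),
    ("AcmeCorp", "headquartered_in", "Berlin"),
    ("Globex", "headquartered_in", "Paris"),
}
rules = [Rule("works_for", "headquartered_in", "works_in_city")]
edit = Edit("Alice", "works_for", "Globex")

questions = multi_hop_questions(kg, rules, edit)
```

In this toy setup, a direct-recall check would only ask who Alice works for after the edit, whereas the generated question asks for the city entailed by combining the edited fact with an unedited one. The gap the abstract reports is between accuracy on these two kinds of question.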