AI Summary
This paper addresses the instability of large language models (LLMs) in maintaining coherence across logical, factual, and moral dimensions. To systematically investigate this challenge, we conduct a comprehensive survey of existing work and propose, for the first time, a two-dimensional taxonomy distinguishing formal coherence (e.g., logical consistency) from informal coherence (e.g., factual and value alignment). Our methodology integrates critical literature analysis, multilingual benchmark diagnostics, and cross-model coherence measurement design. Through this approach, we identify six key gaps: inconsistent definitions, a lack of multilingual evaluation protocols, weak domain adaptability, insufficient interpretability, limited cross-disciplinary integration, and inadequate robustness assessment. Our principal contributions are threefold: (1) establishing the first unified classification framework for coherence research; (2) advancing standardized definitions, multilingual evaluation protocols, and domain-adaptive enhancement strategies; and (3) facilitating the development of robust, interpretable, and interdisciplinary coherence benchmarks and governance pathways.
Abstract
The hallmark of effective language use lies in consistency: expressing similar meanings in similar contexts and avoiding contradictions. While human communication naturally demonstrates this principle, state-of-the-art language models struggle to maintain reliable consistency across different scenarios. This paper examines the landscape of consistency research in AI language systems, exploring both formal consistency (including logical rule adherence) and informal consistency (such as moral and factual coherence). We analyze current approaches to measuring aspects of consistency and identify critical research gaps in the standardization of definitions, multilingual assessment, and methods for improving consistency. Our findings point to an urgent need for robust benchmarks to measure consistency, and for interdisciplinary approaches to ensure the consistency of language models on domain-specific tasks while preserving their utility and adaptability.