🤖 AI Summary
This work addresses the challenge of identifying knowledge boundaries in large language models (LLMs). We propose a training-free, query-level uncertainty detection method that enables confidence assessment *prior to token generation*, facilitating adaptive inference—including RAG triggering, "slow-thinking" activation, or abstention. Our key innovation is *Internal Confidence*, a mechanism that quantifies uncertainty without model retraining by aggregating the model's self-evaluations across layers and tokens, avoiding the cost of output-distribution sampling or post-hoc verification. Evaluated on factual QA and mathematical reasoning benchmarks, our method outperforms multiple baselines. It integrates seamlessly with efficient RAG and model cascading, reducing inference costs while maintaining task performance.
📝 Abstract
It is important for large language models (LLMs) to be aware of the boundary of their knowledge, i.e., to distinguish queries they can answer from those they cannot. This awareness enables models to perform adaptive inference, such as invoking RAG, engaging in slow and deep thinking, or abstaining, which benefits the development of efficient and trustworthy AI. In this work, we propose a method to detect knowledge boundaries via query-level uncertainty, which aims to determine whether the model can address a given query before generating any tokens. To this end, we introduce a novel, training-free method called *Internal Confidence*, which leverages self-evaluations across layers and tokens. Empirical results on both factual QA and mathematical reasoning tasks demonstrate that Internal Confidence outperforms several baselines. Furthermore, we show that the proposed method can be applied to efficient RAG and model cascading, reducing inference costs while maintaining performance.
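The abstract does not spell out the aggregation or routing logic, so the following is only a minimal sketch of the general idea: per-layer, per-token self-evaluation scores are combined into one pre-generation confidence value, which then drives an adaptive-inference decision. The layer weighting, thresholds, and function names here are illustrative assumptions, not the paper's actual formulation; synthetic scores stand in for real model internals.

```python
import numpy as np

def internal_confidence(layer_token_scores, layer_weights=None):
    """Aggregate per-layer, per-token self-evaluation scores into one
    query-level confidence in [0, 1].

    layer_token_scores: array of shape (num_layers, num_tokens),
    each entry a self-evaluation score for one token at one layer.
    """
    scores = np.asarray(layer_token_scores, dtype=float)
    num_layers, _ = scores.shape
    if layer_weights is None:
        # Assumption: later layers carry more reliable signal,
        # so weight them linearly more heavily.
        layer_weights = np.arange(1, num_layers + 1, dtype=float)
    layer_weights = layer_weights / layer_weights.sum()
    per_layer = scores.mean(axis=1)          # average over query tokens
    return float(layer_weights @ per_layer)  # weighted average over layers

def route_query(confidence, rag_threshold=0.5, abstain_threshold=0.2):
    """Illustrative adaptive-inference policy: answer directly when
    confident, fall back to RAG when uncertain, abstain when very
    uncertain. Thresholds are placeholders, not tuned values."""
    if confidence >= rag_threshold:
        return "answer_directly"
    if confidence >= abstain_threshold:
        return "invoke_rag"
    return "abstain"
```

Because the confidence is computed from the query alone, the routing decision is made before any answer tokens are generated, which is what makes the RAG and cascading savings possible.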