AI Summary
This paper addresses retrieval errors in remote document retrieval over symbol-erasure channels, where query features are lost in transmission. It proposes a context-aware semantic communication framework built on an adaptive repetition code guided by contextual importance: query feature vectors are extracted via term-frequency weighting, and redundancy is allocated according to each term's semantic significance. At the decoder, one of two candidate documents is selected based on its contextual similarity to the recovered query. Using a jointly Gaussian approximation of the true and reconstructed similarity scores, the paper derives an explicit expression for the retrieval error probability. Experiments on synthetic data and the Google Natural Questions dataset show that assigning more redundancy to critical features reduces the retrieval error rate, and the theoretical analysis agrees closely with the empirical results.
Abstract
This paper introduces and analyzes a search and retrieval model that adopts key semantic communication principles from retrieval-augmented generation. Specifically, we present an information-theoretic analysis of a remote document retrieval system operating over a symbol erasure channel. The proposed model encodes the feature vector of a query, derived from term-frequency weights of a language corpus, using a repetition code whose rate adapts to the contextual importance of the terms. At the decoder, we select between two documents based on their contextual closeness to the recovered query. By leveraging a jointly Gaussian approximation for the true and reconstructed similarity scores, we derive an explicit expression for the retrieval error probability, i.e., the probability that the less similar document is selected. Numerical simulations on synthetic and real-world data (Google NQ) confirm the validity of the analysis. They further show that assigning greater redundancy to critical features reduces the error rate, highlighting the value of semantic-aware feature encoding in error-prone communication settings.
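The pipeline described above (importance-weighted repetition coding, a symbol erasure channel, and a two-document decision at the decoder) can be illustrated with a small Monte Carlo sketch. This is not the paper's implementation. The function names, the proportional allocation rule, the use of a dot product as the similarity score, and the convention that a fully erased feature is decoded as zero are all assumptions made for illustration.

```python
import random


def allocate_repetitions(weights, budget):
    """Split a total symbol budget across features (hypothetical rule).

    Every feature gets one copy; the remaining symbols are shared in
    proportion to the importance weights (largest-remainder rounding).
    """
    n = len(weights)
    if budget < n:
        raise ValueError("budget must allow at least one copy per feature")
    total = sum(weights)
    extra = budget - n
    # Fall back to an even split if all weights are zero.
    shares = [extra * w / total if total > 0 else extra / n for w in weights]
    reps = [1 + int(s) for s in shares]
    leftover = budget - sum(reps)
    by_remainder = sorted(range(n), key=lambda i: shares[i] - int(shares[i]),
                          reverse=True)
    for i in by_remainder[:leftover]:
        reps[i] += 1
    return reps


def transmit(query, reps, erasure_prob, rng):
    """Send each feature reps[i] times; each copy is erased independently.

    Assumption: a feature is recovered if any copy survives, else it is 0.
    """
    received = []
    for value, r in zip(query, reps):
        survived = any(rng.random() >= erasure_prob for _ in range(r))
        received.append(value if survived else 0.0)
    return received


def similarity(query, doc):
    """Dot-product similarity (an assumed stand-in for contextual closeness)."""
    return sum(q * d for q, d in zip(query, doc))


def retrieval_error_rate(query, doc_a, doc_b, reps, erasure_prob,
                         trials=10000, seed=0):
    """Fraction of trials where the decoder's pick differs from the true pick.

    Ties are broken in favour of doc_a, both for the true and received query.
    """
    rng = random.Random(seed)
    true_pick_a = similarity(query, doc_a) >= similarity(query, doc_b)
    errors = 0
    for _ in range(trials):
        q_hat = transmit(query, reps, erasure_prob, rng)
        pick_a = similarity(q_hat, doc_a) >= similarity(q_hat, doc_b)
        errors += pick_a != true_pick_a
    return errors / trials


if __name__ == "__main__":
    query = [0.9, 0.1, 0.1]          # toy term-frequency weights
    doc_a, doc_b = [1.0, 0.0, 0.0], [0.0, 1.0, 1.0]
    uniform = [3, 3, 3]
    adaptive = allocate_repetitions(query, budget=9)
    for name, reps in (("uniform", uniform), ("adaptive", adaptive)):
        rate = retrieval_error_rate(query, doc_a, doc_b, reps, erasure_prob=0.4)
        print(name, reps, rate)
```

Running the script prints the estimated error rate for a uniform and an importance-weighted allocation under the same symbol budget, which is the comparison the abstract describes. Here the query weights double as the importance scores; a different importance measure would only change the `weights` argument passed to `allocate_repetitions`.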