InSQuAD: In-Context Learning for Efficient Retrieval via Submodular Mutual Information to Enforce Quality and Diversity

📅 2025-08-28
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing retrieval-augmented in-context learning (ICL) methods often neglect example diversity, leading to redundant retrieved examples and poor generalization. Method: We propose a unified submodular mutual information (SMI)-based exemplar selection framework that jointly models quality and diversity. Specifically, we design a likelihood-driven SMI optimization objective and develop an end-to-end trainable joint retrieval-generation paradigm. To further enhance generalization, we introduce a synthetic rewriting strategy to construct a multi-hop question-answering dataset. Contribution/Results: Our method achieves significant improvements over state-of-the-art baselines across nine benchmark datasets. Empirical results validate the effectiveness and robustness of SMI-driven, diversity-aware retrieval for boosting ICL performance. The framework establishes a novel paradigm for selecting high-quality, diverse contextual examples, advancing both the theoretical grounding and practical applicability of retrieval-augmented ICL.

📝 Abstract
In this paper, we introduce InSQuAD, designed to enhance the performance of In-Context Learning (ICL) models through Submodular Mutual Information (SMI) enforcing Quality and Diversity among in-context exemplars. InSQuAD achieves this through two principal strategies: First, we model the ICL task as a targeted selection problem and introduce a unified selection strategy based on SMI that mines relevant yet diverse in-context examples, encapsulating the notions of quality and diversity. Second, we address a common pitfall in existing retrieval models, which model query relevance but often overlook diversity, a property critical for ICL. InSQuAD introduces a combinatorial training paradigm which learns the parameters of an SMI function to enforce both quality and diversity in the retrieval model through a novel likelihood-based loss. To further aid the learning process, we augment an existing multi-hop question answering dataset with synthetically generated paraphrases. Adopting the retrieval model trained using this strategy alongside the novel targeted selection formulation for ICL on nine benchmark datasets shows significant improvements, validating the efficacy of our approach.
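The abstract does not spell out which SMI instantiation is used, but the targeted-selection idea can be illustrated with a common SMI surrogate from the literature: a facility-location-style mutual information between a candidate exemplar set and the query, maximized greedily. The sketch below is illustrative only; the function names (`flqmi_score`, `select_exemplars`), the `eta` trade-off parameter, and the toy similarity matrix are assumptions, not the paper's actual objective or API.

```python
import numpy as np

def flqmi_score(sim, selected, eta=1.0):
    """Facility-location-style SMI surrogate between a selected
    exemplar set and the query set.

    sim      : (num_candidates, num_queries) similarity matrix
    selected : list of candidate row indices
    The first term rewards covering every query (quality); its
    diminishing returns mean near-duplicate exemplars add little,
    which enforces diversity. The second (modular) term rewards
    each exemplar's best query affinity, weighted by eta.
    """
    if not selected:
        return 0.0
    sub = sim[selected, :]
    return sub.max(axis=0).sum() + eta * sub.max(axis=1).sum()

def select_exemplars(sim, k, eta=1.0):
    """Greedy maximization of the SMI surrogate: repeatedly add the
    candidate with the largest marginal gain until k are chosen."""
    selected, remaining = [], set(range(sim.shape[0]))
    for _ in range(k):
        base = flqmi_score(sim, selected, eta)
        best, best_gain = None, -np.inf
        for i in remaining:
            gain = flqmi_score(sim, selected + [i], eta) - base
            if gain > best_gain:
                best, best_gain = i, gain
        selected.append(best)
        remaining.remove(best)
    return selected
```

For example, with two near-duplicate candidates that both match one query and a third candidate matching another query, greedy selection with k=2 skips the duplicate and picks one exemplar per query, showing the diversity effect of the diminishing-returns term.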
Problem

Research questions and friction points this paper is trying to address.

Enhances In-Context Learning via quality and diversity
Addresses retrieval models overlooking diversity in exemplars
Introduces combinatorial training with SMI for retrieval
Innovation

Methods, ideas, or system contributions that make the work stand out.

Submodular Mutual Information for exemplar selection
Combinatorial training with likelihood-based loss
Synthetic paraphrase augmentation for dataset enhancement
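The "combinatorial training with likelihood-based loss" bullet can be sketched as a softmax likelihood over candidate exemplar sets scored by an SMI surrogate, with the annotated set as the positive. This is a minimal sketch under stated assumptions: the facility-location-style `flqmi` surrogate, the `set_nll` helper, and the `temp` parameter are illustrative inventions, not the paper's actual loss.

```python
import numpy as np

def flqmi(sim, selected, eta=1.0):
    # Facility-location-style SMI surrogate (assumed form):
    # query coverage term + eta-weighted per-exemplar affinity term.
    sub = sim[list(selected), :]
    return sub.max(axis=0).sum() + eta * sub.max(axis=1).sum()

def set_nll(sim, candidate_sets, positive_idx, temp=1.0):
    """Likelihood-based loss over exemplar sets:
    p(A | Q) proportional to exp(I(A; Q) / temp), and the loss is the
    negative log-likelihood of the annotated positive set. Training the
    similarity model to lower this loss pushes it to assign higher SMI
    to relevant-and-diverse sets than to redundant ones.
    """
    scores = np.array([flqmi(sim, A) for A in candidate_sets]) / temp
    m = scores.max()  # stabilized log-sum-exp
    logz = m + np.log(np.exp(scores - m).sum())
    return -(scores[positive_idx] - logz)
```

Because the coverage term saturates, a redundant set (two exemplars answering the same query) scores lower than a diverse set of the same size, so the NLL gradient naturally favors diversity-aware retrieval.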
Souradeep Nanda
Computer Science, The University of Texas at Dallas, Dallas, USA
Anay Majee
The University of Texas at Dallas, Microsoft, Intel
Submodular Functions · Few-shot learning · Representation Learning · computer vision
Rishabh Iyer
Computer Science, The University of Texas at Dallas, Dallas, USA