🤖 AI Summary
To enable efficient few-shot multiple-choice question answering (MCQA) in resource-constrained settings, this paper proposes a knowledge distillation framework in which a large language model (LLM) plays a dual role: it synthesizes MCQA instances (questions and answer choices) and assigns probability scores to the generated choices, automatically constructing supervision signals without manual annotation. The resulting knowledge is then distilled into a lightweight student model, DeBERTa-v3-base, combining synthetic data generation, LLM-assigned soft labels, and few-shot fine-tuning to substantially reduce inference overhead compared with using the LLM directly. On the MMLU benchmark, the method improves accuracy from 28.9% to 39.3%, a 10.4-percentage-point gain, using only five examples per task, significantly outperforming a baseline finetuned directly on those examples. This demonstrates the effectiveness of the framework for low-resource, high-efficiency MCQA.
📝 Abstract
Multiple Choice Question Answering (MCQA) is an important problem with numerous real-world applications, such as medicine, law, and education. The high cost of building MCQA datasets makes few-shot learning pivotal in this domain. While Large Language Models (LLMs) can enable few-shot learning, their direct application in real-world scenarios is often hindered by their high computational cost. To address this challenge, we propose a simple yet effective approach that uses LLMs for data generation and scoring. Our approach utilizes LLMs to create MCQA data containing questions and answer choices, and to assign probability scores to the generated choices. We then use the generated data and LLM-assigned scores to finetune a smaller and more efficient encoder-only model, DeBERTa-v3-base, via a distillation loss. Extensive experiments on the Massive Multitask Language Understanding (MMLU) benchmark demonstrate that our method improves accuracy from 28.9% to 39.3%, an absolute gain of 10.4 percentage points over a baseline finetuned directly on the 5-shot examples. This shows the effectiveness of LLM-driven data generation and knowledge distillation for few-shot MCQA.
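The core training signal described above is a distillation loss that matches the student's distribution over answer choices to the LLM-assigned probability scores. The abstract does not give the exact loss, so the following is a minimal NumPy sketch under the common assumption of a temperature-scaled soft cross-entropy (as in standard knowledge distillation); all function names here are illustrative, not the paper's API.

```python
import numpy as np

def softmax(logits, temperature=1.0):
    """Turn raw per-choice scores into a probability distribution."""
    z = np.asarray(logits, dtype=float) / temperature
    z = z - z.max()  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def distillation_loss(student_logits, teacher_probs, temperature=2.0):
    """Soft cross-entropy H(teacher, student) between the LLM-assigned
    choice probabilities (teacher) and the student's temperature-softened
    distribution; scaled by T^2 as is conventional in distillation."""
    p_student = softmax(student_logits, temperature)
    teacher = np.asarray(teacher_probs, dtype=float)
    return -(teacher * np.log(p_student + 1e-12)).sum() * temperature ** 2
```

For a 4-choice question where the LLM scores the choices as `[0.9, 0.05, 0.03, 0.02]`, a student whose logits favor the first choice incurs a lower loss than one favoring any other choice, which is the gradient signal that transfers the LLM's soft preferences to the encoder-only student.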