Large language models could be rote learners

📅 2025-04-11
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses benchmark contamination in LLM evaluation, which conflates rote memorization with genuine capability. We propose TrinEval, a novel trinity-format evaluation framework that treats contamination as an inherent aspect of the learning process and decouples memory from reasoning via question-type reconstruction. Methodologically, TrinEval integrates performance attribution analysis, MCQ-type re-modeling, and controlled memory-condition experiments. Empirical evaluation on MMLU reveals that roughly 20.5% of mainstream LLM responses stem from mechanical memorization, and that accuracy on memorized questions is about 3.7% lower on average. TrinEval quantitatively isolates memory effects from authentic reasoning ability, enabling contamination-robust assessment. It establishes a new paradigm for LLM evaluation, shifting the focus from mere answer correctness (“how many are right?”) to a causal understanding of correct responses (“why is it right?”).

📝 Abstract
Multiple-choice question (MCQ) benchmarks are widely used for evaluating Large Language Models (LLMs), yet their reliability is undermined by benchmark contamination. In this study, we reframe contamination as an inherent aspect of learning and seek to disentangle genuine capability acquisition from superficial memorization in LLM evaluation. First, by analyzing model performance under different memorization conditions, we uncover a counterintuitive trend: LLMs perform worse on memorized MCQs than on non-memorized ones, indicating the coexistence of two distinct learning phenomena, i.e., rote memorization and genuine capability learning. To disentangle them, we propose TrinEval, a novel evaluation framework that reformulates MCQs into an alternative trinity format, reducing memorization while preserving knowledge assessment. Experiments validate TrinEval's effectiveness in reformulation, and its evaluation reveals that common LLMs may memorize by rote 20.5% of knowledge points (in MMLU on average).
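The abstract's first step, comparing model accuracy under different memorization conditions, can be illustrated with a minimal sketch. This is not the paper's implementation: the data layout and the `is_memorized` flag are hypothetical stand-ins for whatever contamination or memorization signal the authors use.

```python
# Hedged sketch: attribute MCQ accuracy to memorized vs. non-memorized questions.
# The record fields and the memorization label are illustrative assumptions,
# not TrinEval's actual interface.
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class MCQResult:
    question: str
    model_answer: str   # e.g. "B"
    gold: str           # e.g. "B"
    is_memorized: bool  # hypothetical contamination / memorization flag


def accuracy(results: List[MCQResult]) -> float:
    """Fraction of questions answered correctly (0.0 for an empty split)."""
    return sum(r.model_answer == r.gold for r in results) / len(results) if results else 0.0


def memorization_gap(results: List[MCQResult]) -> Dict[str, float]:
    """Split results by the memorization flag and compare accuracies."""
    memorized = [r for r in results if r.is_memorized]
    fresh = [r for r in results if not r.is_memorized]
    acc_mem, acc_fresh = accuracy(memorized), accuracy(fresh)
    return {
        "acc_memorized": acc_mem,
        "acc_non_memorized": acc_fresh,
        # A negative gap mirrors the paper's counterintuitive finding:
        # models do worse on memorized MCQs than on non-memorized ones.
        "gap": acc_mem - acc_fresh,
    }
```

Applied to an MMLU-style result dump, a negative `gap` would correspond to the reported trend that memorized MCQs are answered less accurately than non-memorized ones.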
Problem

Research questions and friction points this paper is trying to address.

Assessing genuine LLM capability versus memorization in MCQ benchmarks
Reducing the impact of benchmark contamination on the reliability of LLM evaluation
Proposing TrinEval to disentangle rote learning from knowledge acquisition
Innovation

Methods, ideas, or system contributions that make the work stand out.

Analyze model performance under different memorization conditions
Propose the TrinEval framework for MCQ reformulation
Reduce memorization effects while preserving knowledge assessment
🔎 Similar Papers
No similar papers found.
Yuyang Xu
College of Computer Science and Technology, Zhejiang University
Renjun Hu
East China Normal University
Robust ML/AI · LLMs · graph mining
Haochao Ying
State Key Laboratory of Transvascular Implantation Devices, The Second Affiliated Hospital Zhejiang University School of Medicine; School of Public Health, Zhejiang University
Jian Wu
Zhejiang Key Laboratory of Medical Imaging Artificial Intelligence
Xing Shi
Alibaba Cloud Computing
Wei Lin
Alibaba Cloud Computing