Reasoning or Memorization? Unreliable Results of Reinforcement Learning Due to Data Contamination

📅 2025-07-14
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing LLM reasoning evaluation is frequently compromised by data contamination—benchmark datasets (e.g., MATH-500, AMC) overlap with pretraining corpora, inflating performance metrics (e.g., Qwen2.5's overfitting on AIME) and undermining the credibility of claimed RL-based reasoning improvements. To address this, the authors propose RandomCalculation, a controllable generator that synthesizes entirely novel, leakage-free arithmetic problems to construct clean evaluation sets. Building on this, they design a contamination-free RL training framework to rigorously assess how reward-signal quality affects reasoning gains: only accurate rewards yield significant improvements, whereas erroneous or random rewards fail. The key contribution is the integration of controllable synthetic data generation with clean benchmark evaluation—revealing data integrity and reward accuracy as fundamental prerequisites for RL-enhanced reasoning—and establishing a paradigm for reliable, cross-model-family evaluation.

📝 Abstract
The reasoning capabilities of large language models (LLMs) have been a longstanding focus of research. Recent works have further enhanced these capabilities using reinforcement learning (RL), with many new methods claiming significant improvements with minimal or no external supervision. Surprisingly, some studies even suggest that random or incorrect reward signals can enhance reasoning performance. However, these breakthroughs are mostly reported on the Qwen2.5 model family and evaluated on well-known benchmarks such as MATH-500, AMC, and AIME, while failing to achieve similar gains on other models like Llama, which warrants further investigation. Our analysis shows that although Qwen2.5 achieves strong mathematical reasoning performance, its pretraining on large-scale web corpora makes it vulnerable to data contamination in popular benchmarks. As a result, results derived from these benchmarks may be unreliable. To address this, we introduce a generator that produces fully synthetic arithmetic problems of arbitrary length and difficulty, yielding a clean dataset we call RandomCalculation. Using these leakage-free datasets, we show that only accurate reward signals consistently improve performance, while noisy or incorrect signals do not. We advocate for evaluating RL methods on uncontaminated benchmarks and across diverse model families to ensure trustworthy conclusions.
Problem

Research questions and friction points this paper is trying to address.

Investigates unreliable RL-enhanced LLM reasoning results caused by data contamination
Assesses the impact of noisy or incorrect reward signals on model performance
Proposes a clean synthetic dataset for reliable RL evaluation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces a synthetic arithmetic problem generator of arbitrary length and difficulty
Uses the resulting clean dataset, called RandomCalculation
Advocates evaluation on uncontaminated benchmarks and across diverse model families
👥 Authors
Mingqi Wu — Director of Data Science, Microsoft (AI, Machine Learning, Statistics, Data Science)
Zhihao Zhang — Fudan University, Shanghai Artificial Intelligence Laboratory
Qiaole Dong — Fudan University (Computer Vision)
Zhiheng Xi — Fudan University (LLM Reasoning, LLM-based Agents)
Jun Zhao — Fudan University
Senjie Jin — Fudan University (Natural Language Processing)
Xiaoran Fan — Fudan University
Yuhao Zhou — Fudan University
Yanwei Fu — Fudan University (Computer Vision, Machine Learning, Multimedia)
Qin Liu — University of California, Davis
Songyang Zhang — Shanghai Artificial Intelligence Laboratory
Qi Zhang — Fudan University, Shanghai Artificial Intelligence Laboratory