Aryabhata 2: Scaling Reinforcement Learning for Advanced STEM Reasoning

📅 2026-04-10

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the multi-step symbolic reasoning, precise computation, and deep conceptual understanding demanded by high-stakes STEM examinations such as JEE and NEET. Starting from the open-source GPT-OSS-20B model, the authors build a high-quality training curriculum from PhysicsWallah's internal question banks and post-train the model using reinforcement learning with verifiable rewards, combining prolonged training with progressively larger rollout group sizes. Aryabhata 2 is evaluated on JEE Main, JEE Advanced, and NEET, and on out-of-distribution benchmarks including AIME, HMMT, MMLU-Pro, MMLU-Redux 2.0, and GPQA. It outperforms its base model on competitive STEM reasoning while using up to 64% fewer output tokens.

📝 Abstract

Competitive STEM examinations such as JEE and NEET require multi-step symbolic reasoning, precise numerical computation, and deep conceptual understanding across physics, chemistry, and mathematics. Recent large language models perform strongly on common reasoning benchmarks, yet they remain difficult to deploy at scale, where millions of student doubts demand domain-specific, consistently structured problem solving. We introduce Aryabhata 2, a reasoning-focused language model for competitive STEM examinations, trained via reinforcement-learning post-training. Using PhysicsWallah's internal question banks, we construct a high-quality training curriculum and post-train GPT-OSS-20B through reinforcement learning with verifiable rewards. Training combines prolonged reinforcement learning with broadened exploration via progressively larger rollout group sizes. We evaluate Aryabhata 2 on competitive examination benchmarks, including JEE Main, JEE Advanced, and NEET, as well as out-of-distribution reasoning datasets such as AIME, HMMT, MMLU-Pro, MMLU-Redux 2.0, and GPQA. Results show that Aryabhata 2 outperforms its base model GPT-OSS-20B on competitive STEM reasoning while requiring substantially fewer output tokens (up to 64% fewer).
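
The two training ingredients named in the abstract, verifiable rewards and progressively larger rollout group sizes, can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration rather than the authors' implementation: the exact-match reward rule, the doubling schedule with its 8 and 64 bounds, and the group-relative advantage normalization (a common choice in this style of training that the abstract does not name) are all hypothetical, as are the function names.

```python
from fractions import Fraction
from statistics import mean, pstdev


def verifiable_reward(model_answer: str, reference: str) -> float:
    """Return 1.0 if the final answer matches the reference, else 0.0.

    Assumed rule: numeric answers are compared as exact fractions, and
    anything else (e.g. an option label such as "B") is compared as a
    case-insensitive string.
    """
    try:
        return float(Fraction(model_answer.strip()) == Fraction(reference.strip()))
    except (ValueError, ZeroDivisionError):
        return float(model_answer.strip().lower() == reference.strip().lower())


def rollout_group_size(step: int, total_steps: int, start: int = 8, end: int = 64) -> int:
    """Hypothetical schedule: double the group size at evenly spaced stages."""
    sizes = []
    size = start
    while size <= end:
        sizes.append(size)
        size *= 2
    # Map the current step to one of the stages, clamping at the last one.
    stage = min(len(sizes) - 1, step * len(sizes) // total_steps)
    return sizes[stage]


def group_relative_advantages(rewards: list[float]) -> list[float]:
    """Normalize rewards within one rollout group (assumed, not from the paper).

    A group whose rollouts all receive the same reward carries no learning
    signal, so every advantage is zero in that case.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    if sigma == 0:
        return [0.0] * len(rewards)
    return [(r - mu) / sigma for r in rewards]


if __name__ == "__main__":
    # Score a small group of sampled final answers against one reference.
    sampled_answers = ["1/2", "0.5", "2", "0.25"]
    rewards = [verifiable_reward(a, "0.5") for a in sampled_answers]
    print(rewards, group_relative_advantages(rewards))
    print([rollout_group_size(s, 100) for s in (0, 25, 50, 75, 99)])
```

Under these assumptions, a larger group makes it more likely that at least one rollout solves a hard problem, so fewer groups end up with identical rewards and zero advantage. That is one plausible reading of "broadened exploration", not a claim about the paper's actual mechanism.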

Problem

Research questions and friction points this paper is trying to address.

STEM reasoning

competitive examinations

symbolic reasoning

numerical computation

conceptual understanding

Innovation

Methods, ideas, or system contributions that make the work stand out.

reinforcement learning

STEM reasoning

verifiable rewards