🤖 AI Summary
Problem: Large language models (LLMs) exhibit limited reasoning capabilities in specialized mathematical domains such as wireless communications.
Method: We propose a lightweight reinforcement learning (RL) framework tailored for technical mathematical problem solving. Leveraging our newly constructed WirelessMathBench-XL benchmark (4,027 problems), we exploit the verifiability of problem solutions to design an unsupervised binary reward signal and directly optimize wireless mathematical reasoning on a 7B-parameter model using Group Relative Policy Optimization (GRPO).
Contribution/Results: This work presents the first unsupervised RL fine-tuning method for wireless communication mathematics. It achieves 39.5% accuracy on WirelessMathBench-XL, comparable to GPT-4o, despite using only a 7B model. Moreover, it yields an average +8.4-point improvement across general mathematical benchmarks, demonstrating strong positive transfer. The framework thus bridges the gap between domain-specific mathematical reasoning and efficient, scalable RL-based adaptation for LLMs.
📝 Abstract
Large language models (LLMs) excel at general mathematical reasoning but fail catastrophically on specialized technical mathematics. In wireless communications, where problems require precise manipulation of information-theoretic bounds, optimization constraints, and signal processing formulations, even state-of-the-art models struggle to achieve competent performance. We present WirelessMathLM, demonstrating that compact models (0.5B-7B parameters) can match or exceed much larger models through domain-specific reinforcement learning with verifiable rewards. Our key insight is that wireless mathematics problems possess a unique property--verifiable correctness--that enables effective reinforcement learning without human feedback. We construct WirelessMathBench-XL, a comprehensive benchmark of 4,027 problems from 970 papers. Using Group Relative Policy Optimization (GRPO) with binary verification rewards, we train models directly from base checkpoints without supervised warm-start. Our 7B model achieves 39.5% accuracy on WirelessMathBench-XL, approaching GPT-4o (40.4%) while using about 100 times fewer parameters than DeepSeek-R1 (671B, 57.4%). Remarkably, GRPO training nearly doubles performance across all model scales (0.5B +11%, 3B +103%, 7B +81%), with positive transfer to general mathematics benchmarks--our models gain +8.4 points on average across MATH, Minerva-Math, OlympiadBench, AMC, and AIME without any training on these tasks.
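The core training signal described above is simple enough to sketch: each sampled completion gets a binary reward from an automatic answer verifier, and GRPO converts the group of rewards into relative advantages without a learned value critic. The snippet below is a minimal illustration, not the paper's actual implementation; the `\boxed{}` answer-matching heuristic and both function names are assumptions for the sake of the example.

```python
import re
import statistics

def binary_verification_reward(completion: str, reference_answer: str) -> float:
    """Unsupervised binary reward: 1.0 if the completion's final boxed
    answer matches the verifiable reference solution, else 0.0.
    (Illustrative string matcher; a real verifier would normalize
    mathematically equivalent expressions.)"""
    match = re.search(r"\\boxed\{([^}]*)\}", completion)
    if match is None:
        return 0.0
    return 1.0 if match.group(1).strip() == reference_answer.strip() else 0.0

def grpo_advantages(rewards: list[float]) -> list[float]:
    """Group-relative advantages: normalize each sampled completion's
    reward by the group's mean and standard deviation, so no separate
    value critic is needed."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0.0:
        # All completions in the group scored the same: no learning signal.
        return [0.0] * len(rewards)
    return [(r - mean) / std for r in rewards]
```

With binary rewards, a group where some samples verify and others fail yields positive advantages for the correct completions and negative ones for the rest, which is what drives the policy update.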