🤖 AI Summary
Problem: Large language models (LLMs) exhibit limited reasoning capabilities in specialized mathematical domains such as wireless communications.
Method: We propose a lightweight reinforcement learning (RL) framework tailored for technical mathematical problem solving. Leveraging our newly constructed WirelessMathBench-XL benchmark (4,027 problems), we exploit the verifiability of problem solutions to design an unsupervised binary reward signal and directly optimize wireless mathematical reasoning on a 7B-parameter model using Group Relative Policy Optimization (GRPO).
Contribution/Results: This work presents the first unsupervised RL fine-tuning method for wireless communication mathematics. It achieves 39.5% accuracy on WirelessMathBench-XL, comparable to GPT-4o, despite using only a 7B model. Moreover, it yields an average +8.4-point improvement across general mathematical benchmarks, demonstrating strong positive transfer. The framework thus bridges the gap between domain-specific mathematical reasoning and efficient, scalable RL-based adaptation for LLMs.
📝 Abstract
Large language models (LLMs) excel at general mathematical reasoning but fail catastrophically on specialized technical mathematics. In wireless communications, where problems require precise manipulation of information-theoretic bounds, optimization constraints, and signal processing formulations, even state-of-the-art models struggle to achieve competent performance. We present WirelessMathLM, demonstrating that compact models (0.5B-7B parameters) can match or exceed much larger models through domain-specific reinforcement learning with verifiable rewards. Our key insight is that wireless mathematics problems possess a unique property--verifiable correctness--that enables effective reinforcement learning without human feedback. We construct WirelessMathBench-XL, a comprehensive benchmark of 4,027 problems from 970 papers. Using Group Relative Policy Optimization (GRPO) with binary verification rewards, we train models directly from base checkpoints without supervised warm-start. Our 7B model achieves 39.5% accuracy on WirelessMathBench-XL, approaching GPT-4o (40.4%) while using about 100 times fewer parameters than DeepSeek-R1 (671B, 57.4%). Remarkably, GRPO training nearly doubles performance across all model scales (0.5B +11%, 3B +103%, 7B +81%), with positive transfer to general mathematics benchmarks--our models gain +8.4 points on average across MATH, Minerva-Math, OlympiadBench, AMC, and AIME without any training on these tasks.
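The core training signal described above is simple enough to sketch: each sampled completion gets a binary reward from an automatic answer verifier, and GRPO converts the group of rewards into relative advantages without a learned value critic. The snippet below is a minimal illustration, not the paper's actual implementation; the `\boxed{}` answer-matching heuristic and both function names are assumptions for the sake of the example.

```python
import re
import statistics

def binary_verification_reward(completion: str, reference_answer: str) -> float:
    """Unsupervised binary reward: 1.0 if the completion's final boxed
    answer matches the verifiable reference solution, else 0.0.
    (Illustrative string matcher; a real verifier would normalize
    mathematically equivalent expressions.)"""
    match = re.search(r"\\boxed\{([^}]*)\}", completion)
    if match is None:
        return 0.0
    return 1.0 if match.group(1).strip() == reference_answer.strip() else 0.0

def grpo_advantages(rewards: list[float]) -> list[float]:
    """Group-relative advantages: normalize each sampled completion's
    reward by the group's mean and standard deviation, so no separate
    value critic is needed."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0.0:
        # All completions in the group scored the same: no learning signal.
        return [0.0] * len(rewards)
    return [(r - mean) / std for r in rewards]
```

With binary rewards, a group where some samples verify and others fail yields positive advantages for the correct completions and negative ones for the rest, which is what drives the policy update.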