AstroMLab 4: Benchmark-Topping Performance in Astronomy Q&A with a 70B-Parameter Domain-Specialized Reasoning Model

📅 2025-05-23
📈 Citations: 0
Influential: 0
🤖 AI Summary
Large language models (LLMs) often lack deep domain-specific knowledge and strong reasoning in specialized fields such as astronomy, a gap that is especially pronounced at medium-to-small parameter scales. To address this, the authors introduce AstroSage-70B, the first 70-billion-parameter LLM designed specifically for the full spectrum of astronomical domains. The method embeds interpretable reasoning chains directly into the supervised fine-tuning data, enabling a "thinking-as-output" astronomical reasoning paradigm, and combines continued pretraining on astronomical literature, model merging, rigorous data curation, and careful hyperparameter optimization. Evaluated on the AstroMLab-1 benchmark (4,425 questions), AstroSage-70B achieves state-of-the-art performance, outperforming all tested open- and closed-source baselines, including o3, Gemini-2.5-Pro, and Claude-3.7-Sonnet, and substantially easing the long-standing trade-off between model scale and domain expertise.

📝 Abstract
General-purpose large language models, despite their broad capabilities, often struggle with specialized domain knowledge, a limitation particularly pronounced in more accessible, lower-parameter versions. This gap hinders their deployment as effective agents in demanding fields such as astronomy. Building on our prior work with AstroSage-8B, this study introduces AstroSage-70B, a significantly larger and more advanced domain-specialized natural-language AI assistant. It is designed for research and education across astronomy, astrophysics, space science, astroparticle physics, cosmology, and astronomical instrumentation. Developed from the Llama-3.1-70B foundation, AstroSage-70B underwent extensive continued pre-training on a vast corpus of astronomical literature, followed by supervised fine-tuning and model merging. Beyond its 70-billion parameter scale, this model incorporates refined datasets, judiciously chosen learning hyperparameters, and improved training procedures, achieving state-of-the-art performance on complex astronomical tasks. Notably, we integrated reasoning chains into the SFT dataset, enabling AstroSage-70B to either answer the user query immediately, or first emit a human-readable thought process. Evaluated on the AstroMLab-1 benchmark -- comprising 4,425 questions from literature withheld during training -- AstroSage-70B achieves state-of-the-art performance. It surpasses all other tested open-weight and proprietary models, including leading systems like o3, Gemini-2.5-Pro, Claude-3.7-Sonnet, Deepseek-R1, and Qwen-3-235B, even those with API costs two orders of magnitude higher. This work demonstrates that domain specialization, when applied to large-scale models, can enable them to outperform generalist counterparts in specialized knowledge areas like astronomy, thereby advancing the frontier of AI capabilities in the field.
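The abstract reports that AstroSage-70B was produced via continued pre-training, supervised fine-tuning, and model merging. The paper's exact merging recipe is not given here; the sketch below shows the simplest common variant, linear weight interpolation between a base and a fine-tuned checkpoint, with the `alpha` value and the plain-dict "state dict" as illustrative stand-ins.

```python
# Minimal sketch of linear model merging (weight interpolation) between a
# base checkpoint and a domain fine-tuned checkpoint. The alpha value and
# the dict-of-lists "state dict" are illustrative assumptions, not the
# paper's actual recipe.
def merge_checkpoints(base, tuned, alpha=0.7):
    """Interpolate each parameter: alpha * tuned + (1 - alpha) * base."""
    return {
        name: [alpha * t + (1.0 - alpha) * b for b, t in zip(bv, tuned[name])]
        for name, bv in base.items()
    }

base = {"layer0.weight": [0.0, 1.0], "layer0.bias": [0.5, 0.5]}
tuned = {"layer0.weight": [1.0, 0.0], "layer0.bias": [0.5, 0.9]}
merged = merge_checkpoints(base, tuned, alpha=0.5)
print(merged["layer0.weight"])  # [0.5, 0.5]
```

In practice this interpolation is done tensor-by-tensor over real checkpoints (e.g. with a merging toolkit), but the arithmetic is the same.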
Problem

Research questions and friction points this paper is trying to address.

General-purpose LLMs lack specialized astronomy knowledge
Existing models underperform in complex astronomical tasks
Need for domain-specialized AI in astronomy research/education
Innovation

Methods, ideas, or system contributions that make the work stand out.

70B-parameter domain-specialized astronomy model
Extensive continued pre-training on astronomical literature
Reasoning chains integrated into the SFT dataset
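The reasoning-chain innovation means the model can either answer a query immediately or first emit a human-readable thought process. A minimal sketch of what such an SFT training record might look like, assuming a chat-style format with `<think>` delimiters (the field names, delimiters, and sample content are illustrative assumptions, not the paper's published data format):

```python
# One illustrative SFT record embedding a reasoning chain before the answer.
# The <think>...</think> convention and the message schema are assumptions.
sft_example = {
    "messages": [
        {"role": "user",
         "content": "Why do Type Ia supernovae work as standardizable candles?"},
        {"role": "assistant",
         "content": (
             "<think>Type Ia events come from white dwarfs exploding near the "
             "Chandrasekhar mass, so peak luminosities cluster tightly; the "
             "residual scatter correlates with light-curve width and can be "
             "calibrated out.</think>\n"
             "They arise from white dwarfs exploding near a common mass limit, "
             "giving similar peak luminosities that can be further standardized "
             "with the light-curve width-luminosity relation."
         )},
    ]
}

def split_reasoning(text: str) -> tuple[str, str]:
    """Separate an optional <think> block from the final answer."""
    if text.startswith("<think>") and "</think>" in text:
        thought, _, answer = text.partition("</think>")
        return thought[len("<think>"):].strip(), answer.strip()
    return "", text.strip()

thought, answer = split_reasoning(sft_example["messages"][1]["content"])
```

Training on records like this is what lets the model choose at inference time between answering directly and showing its work first.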
Tijmen de Haan
Institute of Particle and Nuclear Studies (IPNS), High Energy Accelerator Research Organization (KEK), Tsukuba, Ibaraki, Japan; International Center for Quantum-field Measurement Systems for Studies of the Universe and Particles (QUP-WPI), High Energy Accelerator Research Organization (KEK), Tsukuba, Ibaraki, Japan
Y.-S. Ting
Department of Astronomy, The Ohio State University, Columbus, OH, USA; Center for Cosmology and AstroParticle Physics (CCAPP), The Ohio State University, Columbus, OH, USA
Tirthankar Ghosal
Oak Ridge National Laboratory
Natural Language Processing, Machine Learning, Artificial Intelligence, Information Extraction
Tuan Dung Nguyen
University of Pennsylvania
Computational Social Science, AI For Science
Alberto Accomazzi
Director, NASA Astrophysics Data System, Smithsonian Astrophysical Observatory
Astronomy, Information Science
Emily Herron
National Center for Computational Sciences, Oak Ridge National Laboratory, Oak Ridge, TN, USA
Vanessa Lama
National Center for Computational Sciences, Oak Ridge National Laboratory, Oak Ridge, TN, USA
Rui Pan
Siebel School of Computing and Data Science, University of Illinois at Urbana-Champaign, Urbana-Champaign, IL, USA
Azton Wells
Computational Science Division, Argonne National Laboratory, Lemont, IL, USA
Nesar Ramachandra
Computational Scientist, Argonne National Laboratory
Cosmology, Machine Learning