AstroMLab 4: Benchmark-Topping Performance in Astronomy Q&A with a 70B-Parameter Domain-Specialized Reasoning Model

📅 2025-05-23
📈 Citations: 0
Influential: 0
🤖 AI Summary
Large language models (LLMs) often lack deep domain-specific knowledge and strong reasoning in specialized fields such as astronomy, a gap that is especially pronounced at medium-to-small parameter scales. To address this, the authors introduce AstroSage-70B, the first 70-billion-parameter LLM designed specifically for the full spectrum of astronomical domains. The method embeds interpretable reasoning chains directly into the supervised fine-tuning data, enabling a "thinking-as-output" astronomical reasoning paradigm, and combines continued pretraining on astronomical literature, model merging, rigorous data curation, and careful hyperparameter optimization. Evaluated on the AstroMLab-1 benchmark (4,425 questions), AstroSage-70B achieves state-of-the-art performance, outperforming all tested open- and closed-source baselines, including o3, Gemini-2.5-Pro, and Claude-3.7-Sonnet, and substantially easing the long-standing trade-off between model scale and domain expertise.

📝 Abstract
General-purpose large language models, despite their broad capabilities, often struggle with specialized domain knowledge, a limitation particularly pronounced in more accessible, lower-parameter versions. This gap hinders their deployment as effective agents in demanding fields such as astronomy. Building on our prior work with AstroSage-8B, this study introduces AstroSage-70B, a significantly larger and more advanced domain-specialized natural-language AI assistant. It is designed for research and education across astronomy, astrophysics, space science, astroparticle physics, cosmology, and astronomical instrumentation. Developed from the Llama-3.1-70B foundation, AstroSage-70B underwent extensive continued pre-training on a vast corpus of astronomical literature, followed by supervised fine-tuning and model merging. Beyond its 70-billion parameter scale, this model incorporates refined datasets, judiciously chosen learning hyperparameters, and improved training procedures, achieving state-of-the-art performance on complex astronomical tasks. Notably, we integrated reasoning chains into the SFT dataset, enabling AstroSage-70B to either answer the user query immediately, or first emit a human-readable thought process. Evaluated on the AstroMLab-1 benchmark -- comprising 4,425 questions from literature withheld during training -- AstroSage-70B achieves state-of-the-art performance. It surpasses all other tested open-weight and proprietary models, including leading systems like o3, Gemini-2.5-Pro, Claude-3.7-Sonnet, Deepseek-R1, and Qwen-3-235B, even those with API costs two orders of magnitude higher. This work demonstrates that domain specialization, when applied to large-scale models, can enable them to outperform generalist counterparts in specialized knowledge areas like astronomy, thereby advancing the frontier of AI capabilities in the field.
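The abstract reports that AstroSage-70B was produced via continued pre-training, supervised fine-tuning, and model merging. The paper's exact merging recipe is not given here; the sketch below shows the simplest common variant, linear weight interpolation between a base and a fine-tuned checkpoint, with the `alpha` value and the plain-dict "state dict" as illustrative stand-ins.

```python
# Minimal sketch of linear model merging (weight interpolation) between a
# base checkpoint and a domain fine-tuned checkpoint. The alpha value and
# the dict-of-lists "state dict" are illustrative assumptions, not the
# paper's actual recipe.
def merge_checkpoints(base, tuned, alpha=0.7):
    """Interpolate each parameter: alpha * tuned + (1 - alpha) * base."""
    return {
        name: [alpha * t + (1.0 - alpha) * b for b, t in zip(bv, tuned[name])]
        for name, bv in base.items()
    }

base = {"layer0.weight": [0.0, 1.0], "layer0.bias": [0.5, 0.5]}
tuned = {"layer0.weight": [1.0, 0.0], "layer0.bias": [0.5, 0.9]}
merged = merge_checkpoints(base, tuned, alpha=0.5)
print(merged["layer0.weight"])  # [0.5, 0.5]
```

In practice this interpolation is done tensor-by-tensor over real checkpoints (e.g. with a merging toolkit), but the arithmetic is the same.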
Problem

Research questions and friction points this paper is trying to address.

General-purpose LLMs lack specialized astronomy knowledge
Existing models underperform in complex astronomical tasks
Need for domain-specialized AI in astronomy research/education
Innovation

Methods, ideas, or system contributions that make the work stand out.

70B-parameter domain-specialized astronomy model
Extensive continued pre-training on astronomical literature
Reasoning chains integrated into the SFT dataset
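The reasoning-chain innovation means the model can either answer a query immediately or first emit a human-readable thought process. A minimal sketch of what such an SFT training record might look like, assuming a chat-style format with `<think>` delimiters (the field names, delimiters, and sample content are illustrative assumptions, not the paper's published data format):

```python
# One illustrative SFT record embedding a reasoning chain before the answer.
# The <think>...</think> convention and the message schema are assumptions.
sft_example = {
    "messages": [
        {"role": "user",
         "content": "Why do Type Ia supernovae work as standardizable candles?"},
        {"role": "assistant",
         "content": (
             "<think>Type Ia events come from white dwarfs exploding near the "
             "Chandrasekhar mass, so peak luminosities cluster tightly; the "
             "residual scatter correlates with light-curve width and can be "
             "calibrated out.</think>\n"
             "They arise from white dwarfs exploding near a common mass limit, "
             "giving similar peak luminosities that can be further standardized "
             "with the light-curve width-luminosity relation."
         )},
    ]
}

def split_reasoning(text: str) -> tuple[str, str]:
    """Separate an optional <think> block from the final answer."""
    if text.startswith("<think>") and "</think>" in text:
        thought, _, answer = text.partition("</think>")
        return thought[len("<think>"):].strip(), answer.strip()
    return "", text.strip()

thought, answer = split_reasoning(sft_example["messages"][1]["content"])
```

Training on records like this is what lets the model choose at inference time between answering directly and showing its work first.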
Tijmen de Haan
Institute of Particle and Nuclear Studies (IPNS), High Energy Accelerator Research Organization (KEK), Tsukuba, Ibaraki, Japan; International Center for Quantum-field Measurement Systems for Studies of the Universe and Particles (QUP-WPI), High Energy Accelerator Research Organization (KEK), Tsukuba, Ibaraki, Japan
Y.-S. Ting
Department of Astronomy, The Ohio State University, Columbus, OH, USA; Center for Cosmology and AstroParticle Physics (CCAPP), The Ohio State University, Columbus, OH, USA
Tirthankar Ghosal
Oak Ridge National Laboratory
Natural Language Processing, Machine Learning, Artificial Intelligence, Information Extraction
Tuan Dung Nguyen
University of Pennsylvania
Computational Social Science, AI For Science
Alberto Accomazzi
Director, NASA Astrophysics Data System, Smithsonian Astrophysical Observatory
Astronomy, Information Science
Emily Herron
National Center for Computational Sciences, Oak Ridge National Laboratory, Oak Ridge, TN, USA
Vanessa Lama
National Center for Computational Sciences, Oak Ridge National Laboratory, Oak Ridge, TN, USA
Rui Pan
Siebel School of Computing and Data Science, University of Illinois at Urbana-Champaign, Urbana-Champaign, IL, USA
Azton Wells
Computational Science Division, Argonne National Laboratory, Lemont, IL, USA
Nesar Ramachandra
Computational Scientist, Argonne National Laboratory
Cosmology, Machine Learning