Mining Hidden Thoughts from Texts: Evaluating Continual Pretraining with Synthetic Data for LLM Reasoning

📅 2025-05-15

🤖 AI Summary

Problem: Supervised fine-tuning and reinforcement learning improve LLM reasoning mainly in domains such as mathematics and programming, which limits the breadth and scalability of training data. Method: The paper evaluates Reasoning CPT, a form of continual pretraining (CPT) that uses synthetic data to reconstruct the hidden thought processes underlying texts and requires no task-specific signals. Reasoning CPT is applied to Gemma2-9B using hidden-thought data derived from STEM and Law corpora, and is compared against standard CPT. Results: On MMLU, Reasoning CPT improves performance consistently across all evaluated domains. The gap over standard CPT widens as problem difficulty increases, reaching up to 8 points on the most challenging problems. Reasoning skills acquired in one domain transfer to others, and models trained with hidden thoughts learn to adjust the depth of their reasoning to problem difficulty.

📝 Abstract

Large Language Models (LLMs) have demonstrated significant improvements in reasoning capabilities through supervised fine-tuning and reinforcement learning. However, when training reasoning models, these approaches are primarily applicable to specific domains such as mathematics and programming, which imposes fundamental constraints on the breadth and scalability of training data. In contrast, continual pretraining (CPT) offers the advantage of not requiring task-specific signals. Nevertheless, how to effectively synthesize training data for reasoning and how such data affect a wide range of domains remain largely unexplored. This study provides a detailed evaluation of Reasoning CPT, a form of CPT that uses synthetic data to reconstruct the hidden thought processes underlying texts, based on the premise that texts are the result of the author's thinking process. Specifically, we apply Reasoning CPT to Gemma2-9B using synthetic data with hidden thoughts derived from STEM and Law corpora, and compare it to standard CPT on the MMLU benchmark. Our analysis reveals that Reasoning CPT consistently improves performance across all evaluated domains. Notably, reasoning skills acquired in one domain transfer effectively to others; the performance gap with conventional methods widens as problem difficulty increases, with gains of up to 8 points on the most challenging problems. Furthermore, models trained with hidden thoughts learn to adjust the depth of their reasoning according to problem difficulty.
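The difficulty-dependent gap described in the abstract can be illustrated with a small calculation. The following is a minimal sketch, not the authors' evaluation code: the record fields (`difficulty`, `baseline_correct`, `reasoning_correct`), the bucket names, and the toy data are all assumptions made for illustration, and the abstract does not say how difficulty levels were assigned.

```python
from collections import defaultdict


def accuracy_gap_by_difficulty(records):
    """Return {difficulty: (baseline_acc, reasoning_acc, gap)} in percentage points.

    Each record is a dict with hypothetical keys:
      difficulty         - bucket label for the question
      baseline_correct   - whether the standard CPT model answered correctly
      reasoning_correct  - whether the Reasoning CPT model answered correctly
    """
    # Per bucket: [question count, baseline correct count, reasoning correct count]
    totals = defaultdict(lambda: [0, 0, 0])
    for record in records:
        bucket = totals[record["difficulty"]]
        bucket[0] += 1
        bucket[1] += int(record["baseline_correct"])
        bucket[2] += int(record["reasoning_correct"])

    result = {}
    for level, (count, base_hits, reasoning_hits) in totals.items():
        base_acc = 100.0 * base_hits / count
        reasoning_acc = 100.0 * reasoning_hits / count
        result[level] = (base_acc, reasoning_acc, reasoning_acc - base_acc)
    return result


# Toy records for illustration only; these are NOT results from the paper.
toy_records = [
    {"difficulty": "easy", "baseline_correct": True, "reasoning_correct": True},
    {"difficulty": "easy", "baseline_correct": True, "reasoning_correct": True},
    {"difficulty": "hard", "baseline_correct": False, "reasoning_correct": True},
    {"difficulty": "hard", "baseline_correct": True, "reasoning_correct": True},
    {"difficulty": "hard", "baseline_correct": False, "reasoning_correct": False},
    {"difficulty": "hard", "baseline_correct": False, "reasoning_correct": False},
]

gaps = accuracy_gap_by_difficulty(toy_records)
```

The gap here is an absolute difference in accuracy, which is how a gain "of up to 8 points" is usually read.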

Problem

Research questions and friction points this paper is trying to address.

Evaluating synthetic data for continual pretraining in LLM reasoning

Exploring domain transfer of reasoning skills in LLMs

Assessing hidden thought reconstruction impact on reasoning performance

Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses synthetic data for continual pretraining

Reconstructs hidden thought processes from texts

Transfers reasoning skills across diverse domains
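The idea of reconstructing hidden thoughts from texts can be sketched as a small data-construction step. This is a hedged illustration under stated assumptions, not the paper's pipeline: the prompt wording, the `<hidden_thought>` delimiters, the choice to place the thought before the original text, and the `generate` callable (a stand-in for whatever LLM produces the synthetic thoughts) are all hypothetical.

```python
# Hypothetical delimiters; the source does not specify a format.
THOUGHT_OPEN = "<hidden_thought>"
THOUGHT_CLOSE = "</hidden_thought>"


def build_synthesis_prompt(text: str) -> str:
    """Ask a generator model for the thinking that could have produced `text`."""
    return (
        "The following text is the result of its author's thinking process.\n"
        "Write the thoughts the author likely had before writing it.\n\n"
        f"Text:\n{text}\n\nHidden thoughts:"
    )


def build_training_example(text: str, hidden_thought: str) -> str:
    """Combine a synthetic hidden thought with the original text.

    Assumption: the thought precedes the text, so that a model trained with
    next-token prediction sees the reasoning before the final wording.
    """
    return (
        f"{THOUGHT_OPEN}\n{hidden_thought.strip()}\n{THOUGHT_CLOSE}\n"
        f"{text.strip()}"
    )


def make_corpus(texts, generate):
    """Build one CPT training example per source text.

    `generate` is any callable mapping a prompt string to generated text.
    """
    return [
        build_training_example(t, generate(build_synthesis_prompt(t)))
        for t in texts
    ]


def stub_generate(prompt: str) -> str:
    """Placeholder generator so the sketch runs without a model."""
    return "  First recall the definition, then apply it.  "


corpus = make_corpus(["Force equals mass times acceleration."], stub_generate)
```

In practice `stub_generate` would be replaced by a call to a real generator model. Standard CPT, the baseline in the abstract, would train on the source texts alone.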
