MOTIF: Modular Thinking via Reinforcement Fine-tuning in LLMs

📅 2025-07-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
Large language models (LLMs) face inherent limitations in long-horizon, multi-step reasoning due to fixed context windows. To address this, we propose MOTIF, a reinforcement fine-tuning framework for modularized chain-of-thought training. MOTIF explicitly decomposes reasoning into learnable, stage-wise inference policies, thereby extending the effective context through iterative, multi-turn thought generation. It integrates group relative policy optimization (GRPO) with parameter-efficient fine-tuning (PEFT) to enable efficient, scalable multi-turn reasoning training. MOTIF is the first method to formulate modular reasoning as an end-to-end trainable, staged policy, circumventing the single-generation context bottleneck. Evaluated with Qwen2.5-3B-Instruct on the MATH500 and AIME2024 benchmarks, MOTIF achieves absolute accuracy gains of 3.8% and 3.3%, respectively. Notably, it surpasses standard GRPO using only 15% of the training data, demonstrating substantial improvements in sample efficiency and complex mathematical reasoning capability.
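The multi-round thinking loop described above can be sketched as follows. This is an illustrative assumption, not the authors' implementation: the function names (`generate`, `summarize`), the `FINAL ANSWER:` marker, and the carry-forward summarization step are all hypothetical, chosen to show how staged generation can extend reasoning past a single context window.

```python
# Hypothetical sketch of MOTIF-style modular, multi-round thinking.
# `generate` and `summarize` stand in for LLM calls; both names and the
# "FINAL ANSWER:" stop marker are illustrative assumptions.

def modular_think(question, generate, summarize, max_rounds=4, max_tokens=1024):
    """Reason over multiple rounds, carrying a compact summary forward so
    total thinking can exceed a single generation's context window."""
    carry = ""  # condensed state passed between rounds
    for _ in range(max_rounds):
        prompt = (
            f"Question: {question}\n"
            f"Notes so far: {carry}\n"
            "Continue reasoning:"
        )
        thought = generate(prompt, max_tokens=max_tokens)
        if "FINAL ANSWER:" in thought:
            return thought.split("FINAL ANSWER:")[-1].strip()
        # Compress old notes plus the new thought to keep context bounded.
        carry = summarize(carry + "\n" + thought)
    return carry  # fall back to the last summary if no answer was emitted
```

In this sketch the per-round prompt stays bounded because only the summary, never the full transcript, is carried between rounds.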

📝 Abstract
Recent advancements in the reasoning capabilities of large language models (LLMs) show that employing the group relative policy optimization (GRPO) algorithm for reinforcement learning (RL) training allows models to use more thinking/reasoning tokens to generate better responses. However, an LLM can generate only a finite number of tokens while maintaining attention to previously generated tokens. This limit, also known as the context size of an LLM, is a bottleneck for LLM reasoning with an arbitrarily large number of tokens. To think beyond the limit of its context size, an LLM must employ a modular thinking strategy to reason over multiple rounds. In this work, we propose $\textbf{MOTIF: Modular Thinking via Reinforcement Finetuning}$ -- an RL training method for generating thinking tokens in multiple rounds, effectively allowing the model to think with additional context size. We trained the open-source model Qwen2.5-3B-Instruct on the GSM8K dataset via parameter-efficient fine-tuning and tested its accuracy on the MATH500 and AIME2024 benchmarks. Our experiments show 3.8% and 3.3% improvements over vanilla GRPO-based training on the respective benchmarks. Furthermore, this improvement was achieved with only 15% of the samples, demonstrating the sample efficiency of MOTIF. Our code and models are available at https://github.com/purbeshmitra/MOTIF and https://huggingface.co/purbeshmitra/MOTIF, respectively.
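The GRPO algorithm the abstract builds on scores each sampled response relative to its own group, which removes the need for a learned value critic. A minimal sketch of that group-relative advantage, with illustrative variable names and omitting the policy-gradient and KL-penalty terms of the full objective:

```python
# Minimal sketch of GRPO's group-relative advantage: each response's reward
# is normalized against the mean and standard deviation of its own group.
# This omits the policy-gradient and KL terms of the full GRPO objective.
from statistics import mean, stdev

def group_relative_advantages(rewards):
    """Map a group of per-response rewards to zero-mean advantages."""
    mu = mean(rewards)
    sigma = stdev(rewards) if len(rewards) > 1 else 0.0
    if sigma == 0.0:
        return [0.0 for _ in rewards]  # identical rewards carry no signal
    return [(r - mu) / sigma for r in rewards]
```

For example, with binary correctness rewards `[1.0, 0.0, 1.0, 0.0]`, the correct responses get equal positive advantages and the incorrect ones equal negative advantages, summing to zero across the group.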
Problem

Research questions and friction points this paper is trying to address.

Overcoming LLM context size limits for reasoning
Enabling modular multi-round thinking in LLMs
Improving sample efficiency in reinforcement fine-tuning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Modular thinking via reinforcement fine-tuning
Group relative policy optimization algorithm
Parameter efficient fine-tuning for sample efficiency
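The parameter-efficient fine-tuning named in the last bullet is typically realized with low-rank adapters (LoRA-style): instead of updating a full d×d weight matrix, two small matrices A (d×r) and B (r×d) are trained and their product is added to the frozen weights. The pure-Python sketch below is an illustration of that idea under stated assumptions, not the paper's training code; real setups use libraries such as peft.

```python
# Illustrative LoRA-style low-rank update: W' = W + alpha * (A @ B).
# Pure Python on nested lists for clarity; not the paper's implementation.

def lora_update(W, A, B, alpha=1.0):
    """Return W + alpha * (A @ B), where A is d x r and B is r x d."""
    d, r = len(A), len(A[0])
    out = [row[:] for row in W]  # copy frozen base weights
    for i in range(d):
        for j in range(len(W[0])):
            out[i][j] += alpha * sum(A[i][k] * B[k][j] for k in range(r))
    return out

def trainable_fraction(d, r):
    """Fraction of parameters trained: 2*d*r adapter weights vs d*d full."""
    return (2 * d * r) / (d * d)
```

At d = 1024 and rank r = 8, only about 1.6% of the layer's parameters are trained, which is the source of the sample and compute efficiency the bullet refers to.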