🤖 AI Summary
This work investigates three critical properties of post-trained language models: (1) self-awareness of their own decision-making strategies, (2) cross-domain generalization capability, and (3) alignment between internal reasoning trajectories and final outputs. We propose a three-dimensional evaluation framework and conduct systematic comparisons across supervised fine-tuning (SFT), direct preference optimization (DPO), and group-relative policy optimization (GRPO) models on diverse multi-task benchmarks. Results show that reinforcement learning–based methods, particularly DPO and GRPO, significantly outperform SFT in strategy awareness and cross-task transfer. However, all RL-based models exhibit weak alignment between reasoning paths and outputs, with GRPO showing the most severe inconsistency. To our knowledge, this is the first study to systematically expose the "strong behavior, weak reasoning" tension inherent in current RL-based post-training paradigms. Our findings provide empirical grounding for advancing interpretable AI and trustworthy reasoning modeling, and highlight concrete directions for improving reasoning fidelity in policy-optimized language models.
📝 Abstract
Recent advances in post-training techniques have endowed Large Language Models (LLMs) with enhanced capabilities for tackling complex, logic-intensive tasks through the generation of supplementary planning tokens. This development raises a fundamental question: are these models aware of what they "learn" and "think"? To address it, we define three core competencies: (1) awareness of learned latent policies, (2) generalization of these policies across domains, and (3) alignment between internal reasoning traces and final outputs. We empirically evaluate these abilities on several tasks, each designed to require learning a distinct policy, and contrast the profiles of models post-trained via Supervised Fine-Tuning (SFT), Direct Preference Optimization (DPO), and Group Relative Policy Optimization (GRPO). Our findings indicate that, compared with SFT models, RL-trained models demonstrate greater awareness of their learned behaviors and stronger generalization to novel, structurally similar tasks; however, they often exhibit weak alignment between their reasoning traces and final outputs, an effect most pronounced in GRPO-trained models.