Dynamic Proxy-Mixing: Transferring Replay Controllers from Small to Large Models for Continual Instruction Tuning

📅 2026-05-29

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses catastrophic forgetting and alignment erosion in continual instruction tuning, where fixed replay ratios cannot adapt to a changing task distribution. The authors propose PROXYMIX, a framework built on the "forgetting mirroring" hypothesis, which posits that the relative forgetting sensitivity of tasks stays largely consistent across model scales and which the authors validate empirically before transfer. PROXYMIX trains a dynamic replay controller on a small proxy model and transfers the frozen policy to a large model without requiring knowledge of future tasks. The controller builds its state from normalized validation losses and their temporal dynamics, then adaptively blends current-task and replay data through a mask-based mixing mechanism. Evaluated on five continual instruction-tuning sequences with LLaMA-3-8B, PROXYMIX improves average accuracy by 3.4 points, reduces final forgetting by 3.5 points, and raises the safety score by 5.8 points over the strongest non-oracle baseline, at roughly 1/50th the policy learning cost of Oracle Target RL.
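The forgetting mirroring hypothesis is a claim about rank agreement, so one simple way to probe it is a Spearman rank correlation between per-task forgetting measured on the proxy and on the target. The sketch below is an illustration under that assumption and is not the authors' validation procedure; the function names and the example numbers are hypothetical placeholders, not results from the paper.

```python
import math


def ranks(values):
    """Return 1-based ranks, assigning the average rank to tied values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation computed on ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    rx, ry = ranks(x), ranks(y)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("rank correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical per-task forgetting (placeholder numbers, not from the paper).
proxy_forgetting = [0.10, 0.50, 0.30]   # small proxy model
target_forgetting = [1.00, 9.00, 4.00]  # large target model, different scale
rho = spearman(proxy_forgetting, target_forgetting)
```

A value of `rho` near 1 would indicate that both models rank the tasks by vulnerability in the same order even though the absolute magnitudes differ, which is the condition the hypothesis describes.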

📝 Abstract

Continual instruction tuning updates a language model through a sequence of new domains, yet each update can progressively erode previously learned capabilities and alignment behavior. Replay is the standard mitigation, but fixed replay ratios are inherently limited because the optimal mixture varies with the current domain, the training stage, and the evolving vulnerability of prior behaviors. We propose PROXYMIX, a framework that learns a dynamic replay controller on a small proxy model and transfers the frozen controller to a larger target. The controller never observes future tasks and constructs its state from normalized validation losses and their temporal dynamics, producing a masked mixture over the current task and accessible replay buffers. Our core empirical hypothesis is forgetting mirroring: task vulnerability rankings remain largely consistent across model scales even when absolute loss magnitudes differ. We validate this assumption empirically before transferring controllers across scales. On LLaMA-3-8B across five continual instruction tuning sequences, PROXYMIX improves average accuracy by 3.4 points, reduces final forgetting by 3.5 points, and raises safety score by 5.8 points over the strongest non-oracle baseline, at roughly 50x lower policy learning cost than Oracle Target RL. The framework is leakage-free and architecture-independent at the interface level, and we also identify settings where the proxy assumption breaks down, highlighting limitations for robust deployment.
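The controller interface described above (a state built from normalized validation losses and their change over time, and an output that is a masked mixture over the current task and accessible replay buffers) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the normalization by a reference loss, the linear scoring rule, the weights `w_norm` and `w_delta`, and all function names are assumptions standing in for the learned policy.

```python
import math


def build_state(val_loss, prev_val_loss, ref_loss):
    """Per-source features: loss divided by a reference loss, and its change
    since the previous evaluation. The normalization choice is an assumption."""
    state = []
    for cur, prev, ref in zip(val_loss, prev_val_loss, ref_loss):
        normalized = cur / ref
        delta = (cur - prev) / ref
        state.append((normalized, delta))
    return state


def score_sources(state, w_norm=1.0, w_delta=2.0):
    """Hypothetical stand-in for the learned policy: sources whose loss is
    high or rising receive a higher score."""
    return [w_norm * normalized + w_delta * delta for normalized, delta in state]


def masked_mixture(scores, mask):
    """Softmax over scores restricted to accessible sources (mask is True).
    Masked sources, such as buffers for tasks not yet seen, get weight 0."""
    if not any(mask):
        raise ValueError("at least one source must be accessible")
    top = max(s for s, m in zip(scores, mask) if m)  # for numerical stability
    exps = [math.exp(s - top) if m else 0.0 for s, m in zip(scores, mask)]
    total = sum(exps)
    return [e / total for e in exps]


# Three sources: current task, one replay buffer, and one buffer that is not
# accessible yet. All numbers are placeholders for illustration.
state = build_state(
    val_loss=[1.2, 0.9, 0.0],
    prev_val_loss=[1.5, 0.7, 0.0],
    ref_loss=[1.0, 1.0, 1.0],
)
weights = masked_mixture(score_sources(state), mask=[True, True, False])
```

Because the state uses only normalized losses and the mask hides inaccessible buffers, the same function signature could be driven by a proxy or a target model, which is the sense in which the interface does not depend on the architecture.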

Problem

Research questions and friction points this paper is trying to address.

continual instruction tuning

catastrophic forgetting

replay strategy

model scaling

dynamic replay

Innovation

Methods, ideas, or system contributions that make the work stand out.

Dynamic Replay

Proxy Model Transfer

Continual Instruction Tuning