FOGO: Forgetting-aware Orthogonalization Optimizer

📅 2026-06-09

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses short- and long-term forgetting in standard and continual learning, which arises when dominant gradient directions suppress rare update signals. To mitigate this, the authors propose FOGO, a scalable optimizer that treats forgetting as a general optimization phenomenon. FOGO spectrally orthogonalizes momentum updates to prevent dominant directions from monopolizing the optimization trajectory, and it maintains a compact codebook memory, built on random projections, that resolves conflicts between the current update and historical directions at each iteration with little overhead. FOGO stores no raw data, and the random projection provably preserves pairwise distances in the low-dimensional space. In experiments on class-imbalanced classification, continual visual learning, LLaVA-7B fine-tuning, and GPT-2 pretraining, FOGO consistently improves convergence and knowledge retention and outperforms Adam and Muon.
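The distance-preservation property mentioned above can be illustrated with a generic Gaussian random projection in the style of the Johnson-Lindenstrauss lemma. This is only a sketch: the dimensions, the seeds, and the names `random_projection` and `pairwise_distances` are assumptions made for illustration, and the summary does not say which projection FOGO actually uses.

```python
import numpy as np

def random_projection(dim_in, dim_out, seed=0):
    """Gaussian projection matrix, scaled so squared norms are preserved in expectation."""
    rng = np.random.default_rng(seed)
    return rng.standard_normal((dim_out, dim_in)) / np.sqrt(dim_out)

def pairwise_distances(x):
    """Euclidean distance between every pair of rows of x."""
    sq = np.sum(x ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (x @ x.T)
    return np.sqrt(np.maximum(d2, 0.0))

# Hypothetical stand-ins for flattened update directions (20 vectors, 4096 dims).
rng = np.random.default_rng(1)
directions = rng.standard_normal((20, 4096))

proj = random_projection(4096, 512)
low = directions @ proj.T  # compressed directions, shape (20, 512)

# Compare pairwise distances before and after projection (off-diagonal only).
mask = ~np.eye(len(directions), dtype=bool)
ratios = pairwise_distances(low)[mask] / pairwise_distances(directions)[mask]
max_distortion = np.max(np.abs(ratios - 1.0))
print(f"largest relative distance error: {max_distortion:.3f}")
```

Increasing the projected dimension shrinks the distortion at the cost of more memory per stored direction, which is the trade-off a compact codebook has to balance.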

📝 Abstract

We argue that forgetting is not confined to continual learning but is a general optimization phenomenon: during standard training, dominant mini-batch gradients suppress rare but useful update directions, causing short-term forgetting at every step. When such knowledge is never revisited, these losses compound into long-term forgetting, the classical failure mode of continual learning. We introduce FOGO, a scalable optimizer that continuously detects and resolves gradient interference across both regimes. FOGO spectrally orthogonalizes momentum updates to prevent dominant directions from monopolizing optimization, then stores representative past directions in a compact codebook memory built on random projection, where pairwise distances are provably preserved in low-dimensional space. At each step, conflicts between the current update and stored directions are resolved via lightweight orthogonal correction and lifted back through a proximal step, with minimal overhead and no data storage. Across class-imbalanced classification, continual visual learning under domain and class shifts, continual fine-tuning of LLaVA-7B, and GPT-2 pretraining, FOGO consistently improves convergence and knowledge retention, outperforming Adam and Muon.
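The two steps described in the abstract can be sketched roughly as follows. This is one plausible reading, not the authors' algorithm: the exact SVD stands in for whatever orthogonalization routine FOGO uses, the conflict test (a negative inner product in the projected space) and the minimum-norm lifting rule are assumptions, and the names `orthogonalize`, `resolve_conflicts`, `proj`, and `stored` are hypothetical.

```python
import numpy as np

def orthogonalize(momentum):
    """Set every singular value to 1: M = U S V^T becomes U V^T."""
    u, _, vt = np.linalg.svd(momentum, full_matrices=False)
    return u @ vt

def resolve_conflicts(update, codebook, proj):
    """Remove the parts of the projected update that oppose stored directions,
    then lift the change back with the smallest possible edit (assumed rule)."""
    z = proj @ update.ravel()
    z_new = z.copy()
    for c in codebook:
        dot = z_new @ c
        if dot < 0:  # conflict: the update points against a stored direction
            z_new -= dot / (c @ c) * c
    # Minimum-norm delta satisfying proj @ delta = z_new - z.
    delta, *_ = np.linalg.lstsq(proj, z_new - z, rcond=None)
    return update + delta.reshape(update.shape)

rng = np.random.default_rng(0)
momentum = rng.standard_normal((8, 6))            # toy momentum matrix
update = orthogonalize(momentum)

proj = rng.standard_normal((16, 48)) / np.sqrt(16)  # 48 = 8 * 6 flattened entries
stored = rng.standard_normal(16)
# Flip the sign if needed so the stored direction conflicts with the update.
stored = -np.sign(stored @ (proj @ update.ravel())) * stored

corrected = resolve_conflicts(update, [stored], proj)
print(np.linalg.svd(update, compute_uv=False))    # all singular values close to 1
print((proj @ corrected.ravel()) @ stored)        # close to 0 after correction
```

With several non-orthogonal stored directions, a single sequential pass like this one can reintroduce an earlier conflict, so a practical method would need a more careful correction than this sketch performs.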

Problem

Research questions and friction points this paper is trying to address.

catastrophic forgetting

gradient interference

continual learning

optimization

knowledge retention

Innovation

Methods, ideas, or system contributions that make the work stand out.

gradient orthogonalization

forgetting-aware optimization

compact codebook memory