FOGO: Forgetting-aware Orthogonalization Optimizer

📅 2026-06-09
🤖 AI Summary
This work addresses short- and long-term forgetting in standard and continual learning, which arises when dominant gradient directions suppress rare update signals. The authors propose FOGO, a scalable optimizer that treats forgetting as a general optimization phenomenon. FOGO spectrally orthogonalizes momentum updates so that dominant directions cannot monopolize the optimization trajectory, and it maintains a compact codebook memory, built on random projections, that resolves conflicts between the current update and stored past directions at each iteration with little overhead. FOGO stores no raw data, and the random projection provably preserves pairwise distances in the low-dimensional space. In experiments on class-imbalanced classification, continual visual learning, continual fine-tuning of LLaVA-7B, and GPT-2 pretraining, FOGO consistently improves convergence and knowledge retention, outperforming Adam and Muon.
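To make "spectral orthogonalization" concrete, here is a minimal NumPy sketch, not the authors' implementation. It assumes the momentum is a 2-D matrix and uses an exact SVD for clarity; the listing does not say how FOGO computes this step. The function name `orthogonalize` and the toy `momentum` matrix are hypothetical.

```python
import numpy as np

def orthogonalize(momentum: np.ndarray) -> np.ndarray:
    """Set all singular values of a full-rank matrix to 1 via an exact SVD."""
    u, _, vt = np.linalg.svd(momentum, full_matrices=False)
    return u @ vt

rng = np.random.default_rng(0)

# Hypothetical momentum: one dominant rank-1 direction plus a weak "rare" signal.
dominant = 100.0 * np.outer(rng.standard_normal(8), rng.standard_normal(4))
rare = 0.1 * rng.standard_normal((8, 4))
momentum = dominant + rare

update = orthogonalize(momentum)

# The raw spectrum is very uneven; the orthogonalized one is flat.
print(np.linalg.svd(momentum, compute_uv=False))
print(np.linalg.svd(update, compute_uv=False))
```

After orthogonalization every singular direction contributes equally to the update, which is one way to read the claim that dominant directions can no longer monopolize the trajectory.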
📝 Abstract
We argue that forgetting is not confined to continual learning but is a general optimization phenomenon: during standard training, dominant mini-batch gradients suppress rare but useful update directions, causing short-term forgetting at every step. When such knowledge is never revisited, these losses compound into long-term forgetting, the classical failure mode of continual learning. We introduce FOGO, a scalable optimizer that continuously detects and resolves gradient interference across both regimes. FOGO spectrally orthogonalizes momentum updates to prevent dominant directions from monopolizing optimization, then stores representative past directions in a compact codebook memory built on random projection, where pairwise distances are provably preserved in low-dimensional space. At each step, conflicts between the current update and stored directions are resolved via lightweight orthogonal correction and lifted back through a proximal step, with minimal overhead and no data storage. Across class-imbalanced classification, continual visual learning under domain and class shifts, continual fine-tuning of LLaVA-7B, and GPT-2 pretraining, FOGO consistently improves convergence and knowledge retention, outperforming Adam and Muon.
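The random-projection and orthogonal-correction ideas in the abstract can be illustrated with a small NumPy sketch. It is an assumption-laden illustration rather than FOGO itself: the Gaussian projection, the dimensions `d` and `k`, the single stored direction, and the helper `resolve_conflict` are all hypothetical. The proximal lift back to full dimension is omitted because the abstract does not specify it.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 4096, 256  # hypothetical full and projected dimensions

# Gaussian random projection, scaled so squared norms are preserved in expectation.
proj = rng.standard_normal((k, d)) / np.sqrt(k)

def resolve_conflict(update: np.ndarray, stored: np.ndarray) -> np.ndarray:
    """Remove the component of `update` along `stored` if the two conflict."""
    dot = update @ stored
    if dot >= 0:
        return update  # no conflict: leave the update unchanged
    return update - (dot / (stored @ stored)) * stored

# One stored past direction and a current update built to oppose it.
stored_full = rng.standard_normal(d)
update_full = rng.standard_normal(d) - 0.5 * stored_full

# Pairwise distance before and after projection (ratio should be close to 1).
dist_full = np.linalg.norm(update_full - stored_full)
dist_low = np.linalg.norm(proj @ update_full - proj @ stored_full)
ratio = dist_low / dist_full
print(ratio)

# Detect and resolve the conflict entirely in the low-dimensional space.
z_stored = proj @ stored_full
z_update = proj @ update_full
z_fixed = resolve_conflict(z_update, z_stored)
print(z_update @ z_stored, z_fixed @ z_stored)
```

Working in `k` dimensions keeps the memory and the conflict check cheap, and the approximate distance preservation is what makes the low-dimensional comparison a reasonable proxy for the full-dimensional one.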
Problem

Research questions and friction points this paper is trying to address.

catastrophic forgetting
gradient interference
continual learning
optimization
knowledge retention
Innovation

Methods, ideas, or system contributions that make the work stand out.

gradient orthogonalization
forgetting-aware optimization
compact codebook memory
random projection
continual learning