🤖 AI Summary
As language models grow stronger, weak supervisors may fail to provide reliable labels, preferences, or judgments for complex outputs, which limits weak-to-strong generalization and scalable oversight. This work proposes a "weak-critic strong oversight" setting, in which a weak model acts as a critic rather than a labeler or judge: it only needs to supply a non-misleading revision direction that helps the stronger model make better use of its own knowledge. The proposed method, progressive on-policy critique distillation (OPCD), integrates weak-critic generation, high-quality critique filtering, an adaptive self-teacher mechanism, and alignment-aware training to distill critic-guided behavior into the strong model. Experiments on reasoning and alignment benchmarks show that the method improves strong models over training epochs, suggesting a viable path toward scalable oversight with weak supervision.
📝 Abstract
As large language models become stronger, weak supervisors may fail to provide reliable labels, preferences, or final judgments for complex outputs, limiting both weak-to-strong generalization and scalable oversight. We study a more tractable form of weak supervision: using a weak model as a critic rather than as a labeler or judge. Instead of solving the task or selecting the correct answer, the weak critic only needs to provide a non-misleading revision direction that helps the strong model better use its own knowledge. We call this setting *weak-critic strong oversight*. We first show that weak critiques can improve frozen strong models at inference time, and that critique quality is key to this improvement. We then propose progressive on-policy critique distillation (**OPCD**), which filters high-quality critiques and distills critic-guided behavior into the strong model through adaptive self-teacher signals. Experiments on reasoning and alignment benchmarks show that our method improves strong models over training epochs, suggesting an effective path for scalable oversight with weak supervision.
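To make the inference-time setting concrete, the following is a minimal sketch of a draft, critique, revise loop followed by a critique filter. It is an illustration under stated assumptions, not the paper's implementation: the abstract does not specify prompt formats or the filtering criterion, so the names `revise_with_critic`, `filter_critiques`, `CritiqueRecord`, the `score` function, and the "keep a critique only if the revision scores higher than the draft" rule are all hypothetical. The models are passed in as plain callables so the sketch runs without any model library.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical interfaces: any callable with these shapes can stand in for a model.
Generate = Callable[[str], str]        # strong model: prompt -> answer
Critique = Callable[[str, str], str]   # weak critic: (prompt, draft) -> critique
Score = Callable[[str, str], float]    # assumed quality proxy: (prompt, answer) -> score


@dataclass
class CritiqueRecord:
    prompt: str
    draft: str
    critique: str
    revision: str
    gain: float  # score(revision) - score(draft)


def revise_with_critic(
    prompt: str, strong: Generate, critic: Critique, score: Score
) -> CritiqueRecord:
    """Draft with the strong model, critique with the weak model, then revise."""
    draft = strong(prompt)
    critique = critic(prompt, draft)
    # Assumed prompt layout; the source does not describe the actual format.
    revision_prompt = (
        f"{prompt}\n\nDraft answer:\n{draft}\n\n"
        f"Critique:\n{critique}\n\nRevised answer:"
    )
    revision = strong(revision_prompt)
    gain = score(prompt, revision) - score(prompt, draft)
    return CritiqueRecord(prompt, draft, critique, revision, gain)


def filter_critiques(
    records: List[CritiqueRecord], min_gain: float = 0.0
) -> List[CritiqueRecord]:
    """Keep critiques whose revision beat the draft (an assumed criterion)."""
    return [r for r in records if r.gain > min_gain]
```

In this sketch the critic never produces an answer itself; it only influences the strong model's second pass, which mirrors the "critic rather than labeler or judge" framing. The filtered records could then serve as training targets for a distillation step, though how OPCD constructs its self-teacher signals is not described in the abstract and is left out here.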