Building Better Activation Oracles

📅 2026-05-23

🤖 AI Summary

This work addresses the unreliability of existing Activation Oracles (AOs) in interpreting residual stream activations: current AOs hallucinate and give vague answers, and text-inversion confounds make them hard to evaluate. To mitigate these limitations, the paper proposes four changes to the training regime: training on on-policy rollouts, feeding in activations from more layers, improving the conversational dataset, and improving the activation injection formula. Additionally, the authors open-source AObench, which they describe as the first comprehensive evaluation suite for AO quality, with the aim of laying a foundation for scalable, end-to-end interpretability research. The authors report that the capability improvements are marginal but the quality-of-life improvements are substantial.

📝 Abstract

Activation Oracles (AOs) are promising methods for interpreting residual stream activations. However, current AOs face important issues, such as hallucinations and vagueness. Additionally, text-inversion confounds make them hard to evaluate. To address these issues, we improve the AO training regime in four ways: training on on-policy rollouts, improving the conversational dataset, feeding in more layers, and improving the injection formula. The capability improvements are marginal, but the quality-of-life improvements are quite substantial. In addition, we open-source the first comprehensive evaluation suite for AO quality, which we call AObench. Overall, we hope that our work sets a foundation that helps improve AOs and other models in the paradigm of scalable, end-to-end interpretability.

Problem

Research questions and friction points this paper is trying to address.

Activation Oracles

hallucinations

vagueness

text-inversion confounds

interpretability

Innovation

Methods, ideas, or system contributions that make the work stand out.

Activation Oracle

on-policy training

interpretability