🤖 AI Summary
This work identifies a “seed-induced uniqueness” phenomenon in Transformers: when teacher and student share the same random initialization, the teacher can implicitly encode hidden traits that the student later decodes linearly, without compromising primary-task performance; with independent seeds, this implicit knowledge transfer is severely impaired. We trace the effect to alignment of trait-discriminative feature subspaces rather than global representational similarity, motivating a subspace-level Centered Kernel Alignment (CKA) diagnostic. Using synthetic corpora, residualized probing, adversarial reversal, and projection regularization, we quantitatively detect and control inter-model information leakage. Experiments show that same-seed students exhibit markedly higher leakage (τ ≈ 0.24) than cross-seed counterparts (τ ≈ 0.12–0.13), even though global CKA between teacher and cross-seed student exceeds 0.9. The proposed safety mechanisms suppress leakage with no measurable primary-task accuracy degradation. The core contribution is attributing implicit knowledge transfer to subspace alignment, yielding an interpretable, intervenable framework for diagnosing and mitigating leakage.
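For concreteness, the subspace-level CKA diagnostic can be sketched as: estimate a trait-discriminative subspace from the teacher's representations, project both models' representations into it, and compute ordinary linear CKA there. The snippet below is an illustrative sketch, not the paper's implementation; in particular, the class-mean-difference estimator in `trait_subspace` and all function names are assumptions.

```python
import numpy as np

def linear_cka(X, Y):
    """Linear CKA between two representation matrices of shape (n_samples, dim)."""
    X = X - X.mean(axis=0, keepdims=True)
    Y = Y - Y.mean(axis=0, keepdims=True)
    # Feature-space form of linear CKA (Kornblith et al., 2019).
    cross = np.linalg.norm(Y.T @ X, "fro") ** 2
    self_x = np.linalg.norm(X.T @ X, "fro")
    self_y = np.linalg.norm(Y.T @ Y, "fro")
    return cross / (self_x * self_y)

def trait_subspace(X, private_labels):
    """Orthonormal basis spanned by the (centered) class means of the private trait.
    A stand-in for whatever discriminative-subspace estimator the paper uses."""
    X = X - X.mean(axis=0, keepdims=True)
    means = np.stack([X[private_labels == c].mean(axis=0)
                      for c in np.unique(private_labels)])  # (n_classes, dim)
    Q, _ = np.linalg.qr(means.T)                             # (dim, n_classes) basis
    return Q

def subspace_cka(X_teacher, X_student, private_labels):
    """CKA restricted to the teacher's trait-discriminative subspace."""
    B = trait_subspace(X_teacher, private_labels)
    return linear_cka(X_teacher @ B, X_student @ B)
```

Under this sketch, a same-seed student would show high `subspace_cka` alongside high leakage, while a cross-seed student could keep high global `linear_cka` yet low `subspace_cka`.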
📝 Abstract
We analyze subliminal transfer in Transformer models, where a teacher embeds hidden traits that can be linearly decoded by a student without degrading main-task performance. Prior work often attributes transferability to global representational similarity, typically quantified with Centered Kernel Alignment (CKA). Using synthetic corpora with disentangled public and private labels, we distill students under matched and independent random initializations. We find that transfer strength hinges on alignment within a trait-discriminative subspace: same-seed students inherit this alignment and show higher leakage (τ ≈ 0.24), whereas different-seed students exhibit substantially reduced excess accuracy (τ ≈ 0.12–0.13) despite global CKA > 0.9. We formalize this with a subspace-level CKA diagnostic and residualized probes, showing that leakage tracks alignment within the trait-discriminative subspace rather than global representational similarity. Security controls (projection penalty, adversarial reversal, right-for-the-wrong-reasons regularization) reduce leakage in same-seed models without impairing public-task fidelity. These results establish seed-induced uniqueness as a resilience property and argue for subspace-aware diagnostics in secure multi-model deployments.
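As one illustration of the security controls mentioned above, a projection penalty can be realized by penalizing the student's hidden-state energy inside the teacher's trait-discriminative subspace during distillation. The sketch below is a minimal, hypothetical rendering under that assumption: the function name, the penalty weight, and the loss composition shown in the comment are illustrative, not the paper's exact formulation.

```python
import torch

def projection_penalty(hidden: torch.Tensor, trait_basis: torch.Tensor,
                       weight: float = 0.1) -> torch.Tensor:
    """Penalize student hidden-state energy inside the teacher's trait subspace.

    hidden:      (batch, dim) student representations.
    trait_basis: (dim, k) orthonormal basis of the trait-discriminative subspace.
    """
    coords = hidden @ trait_basis                 # (batch, k) subspace coordinates
    return weight * coords.pow(2).sum(dim=-1).mean()

# Hypothetical use inside a distillation step:
# loss = public_task_loss + distill_loss + projection_penalty(student_hidden, B)
```

Because the penalty acts only on the trait subspace, it can suppress leakage while leaving the directions used by the public task largely untouched, which is consistent with the reported lack of public-task degradation.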