🤖 AI Summary
This position paper argues that the AI/ML community should retire the term “positive backdoor,” which overclaims what trigger-activated hidden behaviors can deliver, and refer to such behaviors as “Secret Alignment” instead. It further holds that protective claims built on Secret Alignment should be presumed insecure unless backed by rigorous, standardized evaluation. The paper unifies existing proposals as covert trigger-behavior associations and evaluates three representative applications (access gating, ownership attribution, and safety enforcement) across six properties: effectiveness, harmlessness, persistence, efficiency, robustness, and reliability. The results show substantial brittleness in trigger-behavior mappings, particularly in their confidentiality, integrity, and availability (CIA). The paper relates these outcomes to behavior density and decision complexity and calls for community-wide, standardized evaluation of Secret Alignment claims.
📝 Abstract
This position paper argues that the AI/ML community should stop overclaiming, retire the label "positive backdoor," and instead treat trigger-activated hidden behaviors as Secret Alignment. Crucially, protective claims based on Secret Alignment should be presumed insecure by default unless supported by rigorous, standardized evaluation. The Private AI era, enabled by open-weight LLMs and accessible training/inference stacks, turns language models into privately owned digital assets, creating security concerns around unauthorized access, model theft, and behavioral misuse. Recently, a line of work framed as "positive backdoors" has been proposed to address these challenges. To ground our position in evidence, we unify these proposals as covert trigger-behavior associations for access gating, ownership attribution, and safety enforcement, and evaluate three representative applications across six core properties: effectiveness, harmlessness, persistence, efficiency, robustness, and reliability. Our results reveal substantial brittleness in trigger-behavior mappings, especially in their confidentiality, integrity, and availability (CIA), which existing claims often underrepresent. We further relate these outcomes to behavior density and decision complexity, offering a behavioral lens for understanding deployment-time risks and motivating community-wide evaluation that makes Secret Alignment claims provable.