🤖 AI Summary
Current AI safety moderation systems suffer from poor explainability, misalignment with user-defined safety preferences, and opaque decision-making, and in particular lack controllable safety tools for non-expert users. To address these challenges, we propose SafetyAnalyst: an interpretable, transparent, and steerable AI safety moderation framework. It introduces a novel chain-of-thought (CoT)-based harm-benefit tree that structurally represents the potential consequences of an AI behavior along multiple dimensions (likelihood, severity, and immediacy) and computes a harmfulness score via a fully interpretable weighted aggregation function whose weights can be aligned to particular safety preferences. By combining structured consequence analysis with feature distillation from frontier LLMs, SafetyAnalyst enables decision provenance tracing, parameter-level transparency, and dynamic adjustment of safety preferences. Evaluated across multiple prompt safety benchmarks, it achieves an average F1 score of 0.81, significantly outperforming state-of-the-art baselines (F1 < 0.72).
📝 Abstract
The ideal AI safety moderation system would be both structurally interpretable (so its decisions can be reliably explained) and steerable (to align to safety standards and reflect a community's values), qualities that current systems fall short on. To address this gap, we present SafetyAnalyst, a novel AI safety moderation framework. Given an AI behavior, SafetyAnalyst uses chain-of-thought reasoning to analyze its potential consequences by creating a structured "harm-benefit tree," which enumerates harmful and beneficial actions and effects the AI behavior may lead to, along with likelihood, severity, and immediacy labels that describe potential impact on any stakeholders. SafetyAnalyst then aggregates all harmful and beneficial effects into a harmfulness score using fully interpretable weight parameters, which can be aligned to particular safety preferences. We applied this conceptual framework to develop, test, and release an open-source LLM prompt safety classification system, distilled from 18.5 million harm-benefit features generated by frontier LLMs on 19k prompts. On a comprehensive set of prompt safety benchmarks, we show that SafetyAnalyst (average F1 = 0.81) outperforms existing LLM safety moderation systems (average F1 < 0.72) on prompt safety classification, while offering the additional advantages of interpretability, transparency, and steerability.
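To make the aggregation step concrete, the following is a minimal sketch of how leaf effects of a harm-benefit tree could be combined into a single harmfulness score. All names, label vocabularies, and weight values here are illustrative assumptions, not the released SafetyAnalyst parameters; the point is that every weight is inspectable and tunable, which is what makes the score interpretable and steerable.

```python
from dataclasses import dataclass

# Hypothetical weight tables -- illustrative values only, not the
# parameters distilled in the actual SafetyAnalyst system.
LIKELIHOOD_W = {"low": 0.25, "medium": 0.5, "high": 1.0}
SEVERITY_W = {"minor": 0.2, "significant": 0.6, "severe": 1.0}
IMMEDIACY_W = {"delayed": 0.5, "immediate": 1.0}

@dataclass
class Effect:
    """One leaf of a harm-benefit tree: a labeled potential effect."""
    kind: str        # "harm" or "benefit"
    likelihood: str  # key into LIKELIHOOD_W
    severity: str    # key into SEVERITY_W
    immediacy: str   # key into IMMEDIACY_W

def harmfulness_score(effects, benefit_discount=0.5):
    """Aggregate labeled effects into one harmfulness score.

    Harms add to the score; benefits subtract, scaled by a tunable
    discount. Raising `benefit_discount` makes the classifier more
    permissive -- one example of a knob a community could steer.
    """
    score = 0.0
    for e in effects:
        w = (LIKELIHOOD_W[e.likelihood]
             * SEVERITY_W[e.severity]
             * IMMEDIACY_W[e.immediacy])
        score += w if e.kind == "harm" else -benefit_discount * w
    return score

effects = [
    Effect("harm", "high", "severe", "immediate"),        # weight 1.0
    Effect("benefit", "medium", "significant", "delayed"), # weight 0.15
]
# 1.0 - 0.5 * 0.15 = 0.925; a threshold on this score yields the
# final safe/unsafe classification.
print(harmfulness_score(effects))
```

Because the score is a transparent weighted sum, a decision can be traced back to the individual effects and weights that produced it, rather than to an opaque end-to-end classifier.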