🤖 AI Summary
Hidden safety risks in generative AI, undetectable via output or activation analysis alone, pose a significant challenge to reliable deployment.
Method: We propose an architecture-agnostic safety enhancement framework based on proactive refusal via multi-model consensus sampling. Rather than relying on a single trusted model, the method aggregates outputs from heterogeneous generative models and produces a response only when a safety-critical subset reaches probabilistic consensus; otherwise, it proactively refuses to respond. The key idea is to amplify the provable safety of the safest subset of models to the entire system, combining copyright-inspired overlapping-output constraints with a theoretically grounded consensus threshold mechanism.
Results: Our theoretical analysis shows that, under mild assumptions (a majority of the models are safe and exhibit sufficient output agreement), the algorithm yields low-risk responses with high probability while strictly bounding the refusal rate. This work establishes the first ensemble-based defense framework for generative AI with quantifiable safety gains and formal, end-to-end guarantees.
📝 Abstract
Many approaches to AI safety rely on inspecting model outputs or activations, yet certain risks are inherently undetectable by inspection alone. We propose a complementary, architecture-agnostic approach that enhances safety through the aggregation of multiple generative models, with the aggregated model inheriting its safety from the safest subset of a given size among them. Specifically, we present a consensus sampling algorithm that, given $k$ models and a prompt, achieves risk competitive with the average risk of the safest $s$ of the $k$ models, where $s$ is a chosen parameter, while abstaining when there is insufficient agreement between them. The approach leverages the models' ability to compute output probabilities, and we bound the probability of abstention when sufficiently many models are safe and exhibit adequate agreement. The algorithm is inspired by the provable copyright protection algorithm of Vyas et al. (2023). It requires some overlap among safe models, offers no protection when all models are unsafe, and may accumulate risk over repeated use. Nonetheless, our results provide a new, model-agnostic approach for AI safety by amplifying safety guarantees from an unknown subset of models within a collection to that of a single reliable model.
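To make the abstract's description concrete, here is a minimal, hypothetical sketch of a consensus sampling loop. It is not the paper's exact procedure: the `ToyModel` class, the uniform-mixture proposal, the acceptance rule (mean of the $s$ smallest model probabilities divided by the mixture probability), and the fixed round budget are all illustrative assumptions; the only ingredients taken from the source are that each of the $k$ models can both sample and report output probabilities, that agreement of the safest $s$ models drives acceptance, and that the algorithm abstains when agreement is insufficient.

```python
import random

class ToyModel:
    """Toy generative model over a discrete vocabulary (illustrative only)."""
    def __init__(self, dist):
        self.dist = dist  # mapping: output string -> probability

    def sample(self, prompt):
        outputs, weights = zip(*self.dist.items())
        return random.choices(outputs, weights=weights)[0]

    def prob(self, prompt, y):
        # Models are assumed able to report the probability of any output.
        return self.dist.get(y, 0.0)

def consensus_sample(models, prompt, s, max_rounds=32):
    """Hypothetical consensus sampling via rejection sampling.

    Proposes from the uniform mixture of the k models and accepts in
    proportion to the agreement of the s least-favorable models; returns
    None (abstains) if no proposal is accepted within the round budget.
    """
    k = len(models)
    for _ in range(max_rounds):
        y = random.choice(models).sample(prompt)            # draw from the mixture
        probs = sorted(m.prob(prompt, y) for m in models)
        consensus = sum(probs[:s]) / s                      # mean of s smallest probs
        mixture = sum(m.prob(prompt, y) for m in models) / k
        if mixture > 0 and random.random() < min(1.0, consensus / mixture):
            return y
    return None  # abstain: insufficient agreement among the models
```

With this acceptance rule, an output is accepted only to the extent that even the $s$ models assigning it the least probability still support it: when all models share a distribution, proposals are always accepted, and when model supports are disjoint the $s$ smallest probabilities are zero, so the algorithm always abstains, mirroring the abstract's point that safe models must overlap for the guarantee to be useful.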