Consensus Sampling for Safer Generative AI

2025 · cs.AI · arXiv 2511.09493

1 Pith paper cites this work. Polarity classification is still indexing.

abstract

Motivated by undetectable risks in generative AI, we study a general robust aggregation problem: how to aggregate several probability distributions to boost safety. We present consensus sampling, a black-box algorithm that, given $k$ distributions, has risk competitive with the average risk of the safest $s$ while abstaining when there is insufficient agreement. This yields an architecture-agnostic approach to generative-model safety when the distributions are induced by models that can sample and evaluate output probabilities. We formalize the guarantee through R-robustness, which also bounds information leakage and adversarial influence. Inspired by robust statistics and the provable copyright protection algorithm of Vyas et al. (2023), we show that while a standard mixture is vulnerable to one unsafe constituent, a pointwise-median construction provides robust intuition, and our efficient sampler is Pareto-optimal for the tradeoff between worst-case risk and abstention. Experiments on synthetic distributions and image generation illustrate the general mechanism and its motivating safety application. The method requires overlap among safe distributions, but it provides a model-agnostic way to inherit guarantees from an unknown reliable subset.
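
The abstract's contrast between a standard mixture and a pointwise median can be illustrated on a toy example. The following is a minimal sketch and not the paper's sampler: the three discrete distributions, the function names `mixture` and `pointwise_median`, and the choice to treat leftover probability mass as abstention (rescaling if the medians sum to more than 1) are all assumptions made for illustration.

```python
from statistics import median

# Hypothetical outputs: two benign completions and one unsafe one.
OUTPUTS = ["safe_a", "safe_b", "unsafe"]

# Hypothetical models: two safe distributions that overlap, one unsafe.
DISTS = [
    [0.6, 0.4, 0.0],  # safe model 1
    [0.5, 0.5, 0.0],  # safe model 2
    [0.0, 0.0, 1.0],  # unsafe model
]


def mixture(dists):
    """Uniform mixture: average the probabilities of each output."""
    k = len(dists)
    return [sum(col) / k for col in zip(*dists)]


def pointwise_median(dists):
    """Median probability per output, plus the mass left for abstention.

    Assumption for this sketch: leftover mass means "abstain". Pointwise
    medians can sum to more than 1, in which case they are rescaled and
    the abstention mass is 0.
    """
    med = [median(col) for col in zip(*dists)]
    total = sum(med)
    if total > 1.0:
        med = [m / total for m in med]
        total = 1.0
    return med, max(0.0, 1.0 - total)


mix = mixture(DISTS)
med, abstain = pointwise_median(DISTS)

for name, p_mix, p_med in zip(OUTPUTS, mix, med):
    print(f"{name:7s} mixture={p_mix:.3f} median={p_med:.3f}")
print(f"abstain (median construction) = {abstain:.3f}")
```

In this toy case the mixture places one third of its mass on the unsafe output because a single constituent does, while the pointwise median places none there and abstains with probability of about 0.1, the mass on which the two safe models disagree.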

representative citing papers

AI Integrity: Defending Against Backdoors and Secret Loyalties

cs.CY · 2026-04-25 · conditional · novelty 4.0

The report defines AI integrity threats (model sabotage and subversion) and recommends four US government policy actions to defend frontier AI systems against backdoors and secret loyalties.

citing papers explorer

Showing 1 of 1 citing paper.

AI Integrity: Defending Against Backdoors and Secret Loyalties

cs.CY · 2026-04-25 · conditional · none · ref 19 · internal anchor
