Consensus Sampling for Safer Generative AI

1 Pith paper cites this work. Polarity classification is still indexing.

abstract

Motivated by undetectable risks in generative AI, we study a general robust aggregation problem: how to aggregate several probability distributions to boost safety. We present consensus sampling, a black-box algorithm that, given $k$ distributions, has risk competitive with the average risk of the safest $s$ of them while abstaining when there is insufficient agreement. This yields an architecture-agnostic approach to generative-model safety when the distributions are induced by models that can sample and evaluate output probabilities. We formalize the guarantee through R-robustness, which also bounds information leakage and adversarial influence. Inspired by robust statistics and the provable copyright protection algorithm of Vyas et al. (2023), we show that while a standard mixture is vulnerable to one unsafe constituent, a pointwise-median construction provides robust intuition, and our efficient sampler is Pareto-optimal for the tradeoff between worst-case risk and abstention. Experiments on synthetic distributions and image generation illustrate the general mechanism and its motivating safety application. The method requires overlap among safe distributions, but it provides a model-agnostic way to inherit guarantees from an unknown reliable subset.
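
The contrast the abstract draws between a standard mixture and a pointwise median can be illustrated with a toy example. The sketch below is not the paper's consensus sampling algorithm; the three distributions, the outcome names, and the choice to treat leftover median mass as abstention are all assumptions made for illustration.

```python
from statistics import median

# Hypothetical toy distributions over three outcomes; "bad" stands for an unsafe output.
outcomes = ["x", "y", "bad"]
safe_a = {"x": 0.5, "y": 0.5, "bad": 0.0}
safe_b = {"x": 0.75, "y": 0.25, "bad": 0.0}
unsafe = {"x": 0.0, "y": 0.0, "bad": 1.0}
models = [safe_a, safe_b, unsafe]


def mixture(dists):
    """Uniform mixture: average the probabilities outcome by outcome."""
    return {o: sum(d[o] for d in dists) / len(dists) for o in outcomes}


def pointwise_median(dists):
    """Take the median probability per outcome.

    Assumption for this sketch: the medians sum to at most 1 (true for the
    toy inputs above, not in general), and the missing mass is read as the
    probability of abstaining.
    """
    med = {o: median(d[o] for d in dists) for o in outcomes}
    abstain = 1.0 - sum(med.values())
    return med, abstain


mix = mixture(models)
med, abstain = pointwise_median(models)

print("mixture mass on bad:", mix["bad"])
print("median mass on bad:", med["bad"])
print("median abstention mass:", abstain)
```

In this toy setting the single unsafe model contributes a third of the mixture's mass to the unsafe outcome, while the pointwise median assigns it none and instead leaves some mass unallocated where the two safe models disagree.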

fields

cs.CY 1

years

2026 1

verdicts

CONDITIONAL 1

representative citing papers

citing papers explorer

Showing 1 of 1 citing paper.

  • AI Integrity: Defending Against Backdoors and Secret Loyalties cs.CY · 2026-04-25 · conditional · none · ref 19 · internal anchor

    The report defines AI integrity threats (model sabotage and subversion) and recommends four US government policy actions to defend frontier AI systems against backdoors and secret loyalties.