Consensus Sampling for Safer Generative AI
1 Pith paper cites this work. Polarity classification is still being indexed.
abstract
Motivated by undetectable risks in generative AI, we study a general robust aggregation problem: how to aggregate several probability distributions to boost safety. We present consensus sampling, a black-box algorithm that, given $k$ distributions, has risk competitive with the average risk of the safest $s$ of them, while abstaining when there is insufficient agreement. This yields an architecture-agnostic approach to generative-model safety when the distributions are induced by models that can both sample outputs and evaluate output probabilities. We formalize the guarantee through $R$-robustness, which also bounds information leakage and adversarial influence. Inspired by robust statistics and the provable copyright protection algorithm of Vyas et al. (2023), we show that while a standard mixture is vulnerable to a single unsafe constituent, a pointwise-median construction provides robust intuition, and our efficient sampler is Pareto-optimal for the tradeoff between worst-case risk and abstention. Experiments on synthetic distributions and image generation illustrate the general mechanism and its motivating safety application. The method requires overlap among the safe distributions, but it provides a model-agnostic way to inherit guarantees from an unknown reliable subset.
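The contrast the abstract draws between a standard mixture and a pointwise median can be illustrated on a toy example. The sketch below is not the paper's consensus sampling algorithm; it only compares the two aggregation rules on three hypothetical discrete distributions (the names `safe_a`, `safe_b`, `unsafe` and the outcome labels are assumptions made for illustration), one of which puts all its mass on a bad outcome.

```python
from statistics import median

def mixture(dists):
    """Uniform mixture: average the probability each distribution assigns to an outcome."""
    outcomes = set().union(*dists)
    return {x: sum(d.get(x, 0.0) for d in dists) / len(dists) for x in outcomes}

def pointwise_median(dists):
    """Median probability per outcome. The result is generally NOT normalized."""
    outcomes = set().union(*dists)
    return {x: median(d.get(x, 0.0) for d in dists) for x in outcomes}

# Hypothetical toy distributions: two overlapping safe ones, one fully unsafe.
safe_a = {"ok1": 0.5, "ok2": 0.5}
safe_b = {"ok1": 0.6, "ok2": 0.4}
unsafe = {"bad": 1.0}

mix = mixture([safe_a, safe_b, unsafe])
med = pointwise_median([safe_a, safe_b, unsafe])

print("mixture mass on 'bad':", round(mix["bad"], 3))
print("median mass on 'bad':", med["bad"])
print("total median mass:", round(sum(med.values()), 3))
```

In this toy case the mixture passes a third of its mass to the bad outcome, while the pointwise median assigns it none because only a minority of the distributions support it. The median weights here sum to less than one; reading the shortfall as abstention mass is an interpretation assumed for this illustration, and in general the pointwise median need not sum to at most one.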
fields
cs.CY 1
years
2026 1
verdicts
CONDITIONAL 1
representative citing papers
citing papers explorer
AI Integrity: Defending Against Backdoors and Secret Loyalties
The report defines AI integrity threats (model sabotage and subversion) and recommends four US government policy actions to defend frontier AI systems against backdoors and secret loyalties.