Explaining and Harnessing Adversarial Examples

Ian J. Goodfellow, Jonathon Shlens, Christian Szegedy · 2014 · stat.ML · arXiv 1412.6572

88 Pith papers cite this work. Polarity classification is still indexing.

abstract

Several machine learning models, including neural networks, consistently misclassify adversarial examples---inputs formed by applying small but intentionally worst-case perturbations to examples from the dataset, such that the perturbed input results in the model outputting an incorrect answer with high confidence. Early attempts at explaining this phenomenon focused on nonlinearity and overfitting. We argue instead that the primary cause of neural networks' vulnerability to adversarial perturbation is their linear nature. This explanation is supported by new quantitative results while giving the first explanation of the most intriguing fact about them: their generalization across architectures and training sets. Moreover, this view yields a simple and fast method of generating adversarial examples. Using this approach to provide examples for adversarial training, we reduce the test set error of a maxout network on the MNIST dataset.
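
The "simple and fast method" the abstract refers to is the fast gradient sign method described in the paper: perturb the input by epsilon times the sign of the gradient of the loss with respect to the input. The sketch below is illustrative only. It assumes a plain logistic-regression model in NumPy rather than the maxout network used in the paper, and the function name fgsm_perturb and all numeric values are hypothetical choices made for this example.

```python
import numpy as np


def sigmoid(z):
    """Logistic function."""
    return 1.0 / (1.0 + np.exp(-z))


def fgsm_perturb(x, y, w, b, epsilon):
    """One fast-gradient-sign step against a logistic-regression model.

    x: input vector, y: label in {0, 1}, (w, b): model parameters,
    epsilon: max-norm bound on the perturbation.
    (Hypothetical helper for illustration, not code from the paper.)
    """
    p = sigmoid(w @ x + b)   # predicted P(y = 1 | x)
    grad_x = (p - y) * w     # gradient of the cross-entropy loss w.r.t. x
    return x + epsilon * np.sign(grad_x)


# Toy, hand-picked values (assumptions, not results from the paper).
w = np.array([1.0, -2.0, 0.5])
b = 0.0
x = np.array([0.5, -0.5, 1.0])
y = 1
epsilon = 0.25

x_adv = fgsm_perturb(x, y, w, b, epsilon)
p_clean = sigmoid(w @ x + b)
p_adv = sigmoid(w @ x_adv + b)

print("clean input:      ", x, "P(y=1) =", round(float(p_clean), 4))
print("adversarial input:", x_adv, "P(y=1) =", round(float(p_adv), 4))
```

For a linear model like this one, the step shifts the logit against the true label by exactly epsilon times the L1 norm of the weights, while no single coordinate moves by more than epsilon. That is the abstract's linearity argument in miniature: many small per-coordinate changes add up to a large change in the output.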

citation-role summary

background 1

citation-polarity summary

background 1

representative citing papers

REALISTA: Realistic Latent Adversarial Attacks that Elicit LLM Hallucinations

cs.CL · 2026-05-12 · unverdicted · novelty 8.0

REALISTA optimizes continuous combinations of valid editing directions in latent space to produce realistic adversarial prompts that elicit hallucinations more effectively than prior methods, including on large reasoning models.

Online Learning-to-Defer with Varying Experts

stat.ML · 2026-05-12 · unverdicted · novelty 8.0

Presents the first online learning-to-defer algorithm with regret bounds O((n + n_e) T^{2/3}) generally and O((n + n_e) sqrt(T)) under low noise for multiclass classification with varying experts.

On the Generation and Mitigation of Harmful Geometry in Image-to-3D Models

cs.CR · 2026-05-10 · conditional · novelty 8.0

Image-to-3D models successfully generate harmful geometries in most cases with under 0.3% caught by commercial filters; existing safeguards are weak but a stacked defense cuts harmful outputs to under 1% at 11% false-positive cost.

Local LMO: Constrained Gradient Optimization via a Local Linear Minimization Oracle

math.OC · 2026-05-09 · unverdicted · novelty 8.0

Local LMO is a new projection-free method that achieves the convergence rates of projected gradient descent for constrained optimization by using local linear minimization oracles over small balls.

Turn Your Face Into An Attack Surface: Screen Attack Using Facial Reflections in Video Conferencing

cs.CR · 2026-04-08 · unverdicted · novelty 8.0

Facial reflections in video conferencing feeds can be processed to eavesdrop on on-screen application activities at 99.32% accuracy across real devices and environments.

TARO: Temporal Adversarial Rectification Optimization Using Diffusion Models as Purifiers

cs.LG · 2026-05-08 · unverdicted · novelty 7.0

TARO builds a temporally guided score prior from high-noise and low-noise diffusion views to purify adversarial examples more robustly than uniform timestep methods.

Inference Time Causal Probing in LLMs

cs.AI · 2026-05-08 · unverdicted · novelty 7.0

HDMI is a new probe-free technique that steers LLM hidden states via margin objectives to achieve more reliable causal interventions than prior probe-based methods on standard benchmarks.

Streaming Adversarial Robustness in Fuzzy ARTMAP: Mechanism-Aligned Evaluation, Progressive Training, and Interpretable Diagnostics

cs.LG · 2026-05-07 · conditional · novelty 7.0

Fuzzy ARTMAP models are highly vulnerable to a new white-box attack aligned with their category competition, but progressive selective training yields stronger replay-free robustness than offline adversarial training under adaptive evaluation.

Empirical Evidence for Simply Connected Decision Regions in Image Classifiers

cs.CV · 2026-05-07 · unverdicted · novelty 7.0

Empirical tests with quad-mesh filling indicate that decision regions in modern image classifiers are simply connected.

Sparse Tokens Suffice: Jailbreaking Audio Language Models via Token-Aware Gradient Optimization

cs.CR · 2026-05-06 · conditional · novelty 7.0

Sparse selection of high-gradient-energy audio tokens suffices for effective jailbreaking of audio language models with minimal drop in attack success rate.

Minimum Specification Perturbation: Robustness as Distance-to-Falsification in Causal Inference

stat.ME · 2026-05-02 · unverdicted · novelty 7.0

MSP quantifies the minimum changes to analyst choices required to falsify a causal claim by making its confidence interval contain zero, providing information orthogonal to dispersion-based robustness summaries.

Decision Boundary-aware Generation for Long-tailed Learning

cs.CV · 2026-05-02 · unverdicted · novelty 7.0

DBG mitigates boundary overlap in long-tailed learning by generating near-boundary samples, leading to better tail class accuracy and more separable decision spaces.

Quantum Interval Bound Propagation for Certified Training of Quantum Neural Networks

quant-ph · 2026-05-01 · unverdicted · novelty 7.0

QIBP adapts interval bound propagation to quantum neural networks for certified adversarial robustness via interval and affine arithmetic implementations.

From Local to Global to Mechanistic: An iERF-Centered Unified Framework for Interpreting Vision Models

cs.CV · 2026-05-01 · unverdicted · novelty 7.0

An iERF-centric framework unifies local, global, and mechanistic interpretability in vision models via SRD for saliency, CAFE for concept anchoring, and ICAT for interlayer attribution.

Low Rank Adaptation for Adversarial Perturbation

cs.LG · 2026-04-30 · unverdicted · novelty 7.0

Adversarial perturbations possess an inherently low-rank structure that enables more efficient and effective black-box adversarial attacks via subspace projection.

A Systematic Survey of Security Threats and Defenses in LLM-Based AI Agents: A Layered Attack Surface Framework

cs.CR · 2026-04-25 · unverdicted · novelty 7.0

A new 7x4 taxonomy organizes agentic AI security threats by architectural layer and persistence timescale, revealing under-explored upper layers and missing defenses after surveying 116 papers.

Benign Overfitting in Adversarial Training for Vision Transformers

cs.LG · 2026-04-21 · unverdicted · novelty 7.0

Adversarial training on simplified Vision Transformers achieves benign overfitting with near-zero robust loss and generalization error when signal-to-noise ratio and perturbation budget meet specific conditions.

Local Linearity of LLMs Enables Activation Steering via Model-Based Linear Optimal Control

cs.LG · 2026-04-21 · conditional · novelty 7.0

Local linearity of LLM layers enables LQR-based closed-loop activation steering with theoretical tracking guarantees.

Duality for the Adversarial Total Variation

math.AP · 2026-04-20 · unverdicted · novelty 7.0

Duality techniques produce a dual representation and subdifferential characterization for the nonlocal total variation functional arising in adversarial training.

Feature-level analysis and adversarial transfer in rotationally equivariant quantum machine learning

quant-ph · 2026-04-16 · unverdicted · novelty 7.0

Rotationally equivariant quantum models can rely on vulnerable invariant statistics such as ring-averaged intensities, leaving them susceptible to classical transfer attacks, but suppressing the associated symmetry sectors substantially improves robustness.

Physically-Induced Atmospheric Adversarial Perturbations: Enhancing Transferability and Robustness in Remote Sensing Image Classification

cs.CV · 2026-04-16 · unverdicted · novelty 7.0

FogFool creates fog-based adversarial perturbations using Perlin noise optimization to achieve high black-box transferability (83.74% TASR) and robustness to defenses in remote sensing classification.

Understanding and Improving Continuous Adversarial Training for LLMs via In-context Learning Theory

cs.LG · 2026-04-14 · unverdicted · novelty 7.0

Continuous adversarial training in the embedding space produces a robust generalization bound for linear transformers that decreases with perturbation radius, tied to singular values of the embedding matrix, and motivates a new regularizer that improves the robustness-utility tradeoff of real LLMs against jailbreaks.

Learning Robustness at Test-Time from a Non-Robust Teacher

cs.CV · 2026-04-13 · unverdicted · novelty 7.0

A test-time adaptation framework anchors adversarial training to a non-robust teacher's predictions, yielding more stable optimization and better robustness-accuracy trade-offs than standard self-consistency methods.

Efficient Unlearning through Maximizing Relearning Convergence Delay

cs.LG · 2026-04-10 · unverdicted · novelty 7.0

The Influence Eliminating Unlearning framework maximizes relearning convergence delay via weight decay and noise injection to remove the influence of a forgetting set while preserving accuracy on retained data.

citing papers explorer

Showing 50 of 82 citing papers after filters.

REALISTA: Realistic Latent Adversarial Attacks that Elicit LLM Hallucinations cs.CL · 2026-05-12 · unverdicted · none · ref 12 · internal anchor
REALISTA optimizes continuous combinations of valid editing directions in latent space to produce realistic adversarial prompts that elicit hallucinations more effectively than prior methods, including on large reasoning models.
Online Learning-to-Defer with Varying Experts stat.ML · 2026-05-12 · unverdicted · none · ref 44 · internal anchor
Presents the first online learning-to-defer algorithm with regret bounds O((n + n_e) T^{2/3}) generally and O((n + n_e) sqrt(T)) under low noise for multiclass classification with varying experts.
On the Generation and Mitigation of Harmful Geometry in Image-to-3D Models cs.CR · 2026-05-10 · conditional · none · ref 12 · internal anchor
Image-to-3D models successfully generate harmful geometries in most cases with under 0.3% caught by commercial filters; existing safeguards are weak but a stacked defense cuts harmful outputs to under 1% at 11% false-positive cost.
Local LMO: Constrained Gradient Optimization via a Local Linear Minimization Oracle math.OC · 2026-05-09 · unverdicted · none · ref 76 · internal anchor
Local LMO is a new projection-free method that achieves the convergence rates of projected gradient descent for constrained optimization by using local linear minimization oracles over small balls.
Turn Your Face Into An Attack Surface: Screen Attack Using Facial Reflections in Video Conferencing cs.CR · 2026-04-08 · unverdicted · none · ref 14 · internal anchor
Facial reflections in video conferencing feeds can be processed to eavesdrop on on-screen application activities at 99.32% accuracy across real devices and environments.
TARO: Temporal Adversarial Rectification Optimization Using Diffusion Models as Purifiers cs.LG · 2026-05-08 · unverdicted · none · ref 3 · internal anchor
TARO builds a temporally guided score prior from high-noise and low-noise diffusion views to purify adversarial examples more robustly than uniform timestep methods.
Inference Time Causal Probing in LLMs cs.AI · 2026-05-08 · unverdicted · none · ref 9 · internal anchor
HDMI is a new probe-free technique that steers LLM hidden states via margin objectives to achieve more reliable causal interventions than prior probe-based methods on standard benchmarks.
Streaming Adversarial Robustness in Fuzzy ARTMAP: Mechanism-Aligned Evaluation, Progressive Training, and Interpretable Diagnostics cs.LG · 2026-05-07 · conditional · none · ref 2 · internal anchor
Fuzzy ARTMAP models are highly vulnerable to a new white-box attack aligned with their category competition, but progressive selective training yields stronger replay-free robustness than offline adversarial training under adaptive evaluation.
Empirical Evidence for Simply Connected Decision Regions in Image Classifiers cs.CV · 2026-05-07 · unverdicted · none · ref 7 · internal anchor
Empirical tests with quad-mesh filling indicate that decision regions in modern image classifiers are simply connected.
Sparse Tokens Suffice: Jailbreaking Audio Language Models via Token-Aware Gradient Optimization cs.CR · 2026-05-06 · conditional · none · ref 44 · internal anchor
Sparse selection of high-gradient-energy audio tokens suffices for effective jailbreaking of audio language models with minimal drop in attack success rate.
Minimum Specification Perturbation: Robustness as Distance-to-Falsification in Causal Inference stat.ME · 2026-05-02 · unverdicted · none · ref 10 · internal anchor
MSP quantifies the minimum changes to analyst choices required to falsify a causal claim by making its confidence interval contain zero, providing information orthogonal to dispersion-based robustness summaries.
Decision Boundary-aware Generation for Long-tailed Learning cs.CV · 2026-05-02 · unverdicted · none · ref 10 · internal anchor
DBG mitigates boundary overlap in long-tailed learning by generating near-boundary samples, leading to better tail class accuracy and more separable decision spaces.
Quantum Interval Bound Propagation for Certified Training of Quantum Neural Networks quant-ph · 2026-05-01 · unverdicted · none · ref 6 · internal anchor
QIBP adapts interval bound propagation to quantum neural networks for certified adversarial robustness via interval and affine arithmetic implementations.
From Local to Global to Mechanistic: An iERF-Centered Unified Framework for Interpreting Vision Models cs.CV · 2026-05-01 · unverdicted · none · ref 57 · internal anchor
An iERF-centric framework unifies local, global, and mechanistic interpretability in vision models via SRD for saliency, CAFE for concept anchoring, and ICAT for interlayer attribution.
Low Rank Adaptation for Adversarial Perturbation cs.LG · 2026-04-30 · unverdicted · none · ref 9 · internal anchor
Adversarial perturbations possess an inherently low-rank structure that enables more efficient and effective black-box adversarial attacks via subspace projection.
A Systematic Survey of Security Threats and Defenses in LLM-Based AI Agents: A Layered Attack Surface Framework cs.CR · 2026-04-25 · unverdicted · none · ref 57 · internal anchor
A new 7x4 taxonomy organizes agentic AI security threats by architectural layer and persistence timescale, revealing under-explored upper layers and missing defenses after surveying 116 papers.
Benign Overfitting in Adversarial Training for Vision Transformers cs.LG · 2026-04-21 · unverdicted · none · ref 26 · internal anchor
Adversarial training on simplified Vision Transformers achieves benign overfitting with near-zero robust loss and generalization error when signal-to-noise ratio and perturbation budget meet specific conditions.
Local Linearity of LLMs Enables Activation Steering via Model-Based Linear Optimal Control cs.LG · 2026-04-21 · conditional · none · ref 37 · internal anchor
Local linearity of LLM layers enables LQR-based closed-loop activation steering with theoretical tracking guarantees.
Duality for the Adversarial Total Variation math.AP · 2026-04-20 · unverdicted · none · ref 20 · internal anchor
Duality techniques produce a dual representation and subdifferential characterization for the nonlocal total variation functional arising in adversarial training.
Feature-level analysis and adversarial transfer in rotationally equivariant quantum machine learning quant-ph · 2026-04-16 · unverdicted · none · ref 13 · internal anchor
Rotationally equivariant quantum models can rely on vulnerable invariant statistics such as ring-averaged intensities, leaving them susceptible to classical transfer attacks, but suppressing the associated symmetry sectors substantially improves robustness.
Physically-Induced Atmospheric Adversarial Perturbations: Enhancing Transferability and Robustness in Remote Sensing Image Classification cs.CV · 2026-04-16 · unverdicted · none · ref 28 · internal anchor
FogFool creates fog-based adversarial perturbations using Perlin noise optimization to achieve high black-box transferability (83.74% TASR) and robustness to defenses in remote sensing classification.
Understanding and Improving Continuous Adversarial Training for LLMs via In-context Learning Theory cs.LG · 2026-04-14 · unverdicted · none · ref 8 · internal anchor
Continuous adversarial training in the embedding space produces a robust generalization bound for linear transformers that decreases with perturbation radius, tied to singular values of the embedding matrix, and motivates a new regularizer that improves the robustness-utility tradeoff of real LLMs against jailbreaks.
Learning Robustness at Test-Time from a Non-Robust Teacher cs.CV · 2026-04-13 · unverdicted · none · ref 13 · internal anchor
A test-time adaptation framework anchors adversarial training to a non-robust teacher's predictions, yielding more stable optimization and better robustness-accuracy trade-offs than standard self-consistency methods.
Efficient Unlearning through Maximizing Relearning Convergence Delay cs.LG · 2026-04-10 · unverdicted · none · ref 20 · internal anchor
The Influence Eliminating Unlearning framework maximizes relearning convergence delay via weight decay and noise injection to remove the influence of a forgetting set while preserving accuracy on retained data.
On the Decompositionality of Neural Networks cs.LO · 2026-04-09 · unverdicted · none · ref 11 · internal anchor
Neural decompositionality is defined via decision-boundary semantic preservation, and language transformers largely satisfy it under SAVED while vision models often do not.
Can Drift-Adaptive Malware Detectors Be Made Robust? Attacks and Defenses Under White-Box and Black-Box Threats cs.CR · 2026-04-08 · unverdicted · none · ref 13 · internal anchor
A fine-tuning framework reduces PGD attack success on AdvDA detectors from 100% to 3.2% and MalGuise from 13% to 5.1%, but optimal training strategies differ by threat model and robustness does not transfer across them.
Revealing Physical-World Semantic Vulnerabilities: Universal Adversarial Patches for Infrared Vision-Language Models cs.CV · 2026-04-03 · unverdicted · none · ref 17 · internal anchor
UCGP is a universal physical adversarial patch that compromises cross-modal semantic alignment in IR-VLMs through curved-grid parameterization and representation-space disruption.
Hidden Reliability Risks in Large Language Models: Systematic Identification of Precision-Induced Output Disagreements cs.AI · 2026-04-02 · unverdicted · none · ref 15 · internal anchor
PrecisionDiff is a differential testing framework that uncovers widespread precision-induced behavioral disagreements in aligned LLMs, including safety-critical jailbreak divergences across precision formats.
Quantitative Linear Logic for Neuro-Symbolic Learning and Verification cs.LO · 2026-05-13 · unverdicted · none · ref 61 · internal anchor
Quantitative Linear Logic interprets logical connectives via natural ML operations on logits to embed constraints in neural training while satisfying most linear logic laws and correlating performance with independent verification.
ASD-Bench: A Four-Axis Comprehensive Benchmark of AI Models for Autism Spectrum Disorder cs.LG · 2026-05-11 · unverdicted · none · ref 4 · internal anchor
ASD-Bench evaluates 17 ML and deep learning models on 4,068 AQ-10 records across child, adolescent, and adult cohorts, showing high adult performance, harder adolescent classification, shifting feature importance, and dissociation between accuracy and calibration.
Break the Brake, Not the Wheel: Untargeted Jailbreak via Entropy Maximization cs.CV · 2026-05-11 · unverdicted · none · ref 7 · internal anchor
UJEM-KL improves cross-model transferability of untargeted jailbreaks on vision-language models by maximizing entropy at decision tokens instead of forcing specific outputs.
Guaranteed Jailbreaking Defense via Disrupt-and-Rectify Smoothing cs.CR · 2026-05-11 · unverdicted · none · ref 39 · internal anchor
DR-Smoothing introduces a disrupt-then-rectify prompt processing scheme into smoothing defenses, delivering tight theoretical bounds on success probability against both token- and prompt-level jailbreaks.
The Propagation Field: A Geometric Substrate Theory of Deep Learning cs.LG · 2026-05-08 · unverdicted · none · ref 6 · internal anchor
Neural networks possess a propagation field of trajectories and Jacobians whose quality can be measured and optimized independently of endpoint loss, yielding better unseen-path generalization and reduced forgetting in continual learning.
Beyond Defenses: Manifold-Aligned Regularization for Intrinsic 3D Point Cloud Robustness cs.CV · 2026-05-08 · unverdicted · none · ref 4 · internal anchor
MAPR aligns latent and intrinsic geometries in 3D point cloud models via regularization on curvature and diffusion features plus consistency loss, yielding +20% average robustness gains on ModelNet40 without adversarial training.
RELO: Reinforcement Learning to Localize for Visual Object Tracking cs.CV · 2026-05-08 · unverdicted · none · ref 190 · internal anchor
RELO replaces handcrafted spatial priors with a reinforcement learning policy for target localization in visual tracking and reports 57.5% AUC on LaSOText without template updates.
Band Together: Untargeted Adversarial Training with Multimodal Coordination against Evasion-based Promotion Attacks cs.LG · 2026-05-07 · unverdicted · none · ref 3 · internal anchor
UAT-MC improves defense against evasion promotion attacks in multimodal recommenders by aligning gradients across modalities during untargeted adversarial training.
Distributionally Robust Multi-Objective Optimization cs.LG · 2026-05-07 · unverdicted · none · ref 57 · internal anchor
DR-MOO adds distributional robustness to multi-objective optimization and gives single-loop MGDA algorithms reaching epsilon-Pareto-stationary points in O(epsilon^{-4}) samples for nonconvex problems.
Physical Adversarial Clothing Evades Visible-Thermal Detectors via Non-Overlapping RGB-T Pattern cs.CV · 2026-05-06 · unverdicted · none · ref 9 · internal anchor
Non-overlapping RGB-T adversarial patterns on clothing, optimized with spatial discrete-continuous optimization, achieve high attack success rates against multiple RGB-T detector fusion architectures in both digital and physical evaluations.
Redefining AI Red Teaming in the Agentic Era: From Weeks to Hours cs.AI · 2026-05-05 · unverdicted · none · ref 1 · internal anchor
An agentic red teaming system automates creation of adversarial testing workflows from natural language goals, unifying ML and generative AI attacks and achieving 85% success rate on Meta Llama Scout with no custom human code.
Detecting Adversarial Data via Provable Adversarial Noise Amplification cs.LG · 2026-05-04 · unverdicted · none · ref 11 · internal anchor
A provable adversarial noise amplification theorem under sufficient conditions enables a custom-trained detector that identifies adversarial examples at inference time using enhanced layer-wise noise signals.
LocalAlign: Enabling Generalizable Prompt Injection Defense via Generation of Near-Target Adversarial Examples for Alignment Training cs.CR · 2026-05-02 · unverdicted · none · ref 13 · internal anchor
LocalAlign generates near-target adversarial examples via prompting and applies margin-aware alignment training to enforce tighter boundaries against prompt injection attacks.
VisInject: Disruption != Injection -- A Dual-Dimension Evaluation of Universal Adversarial Attacks on Vision-Language Models cs.CR · 2026-05-02 · conditional · none · ref 9 · internal anchor
Universal adversarial attacks cause output perturbation 90 times more often than precise target injection in VLMs, with only 2 verbatim successes out of 6615 tests.
Asymmetric Invertible Threat: Learning Reversible Privacy Defense for Face Recognition cs.CV · 2026-05-02 · unverdicted · none · ref 6 · internal anchor
ARFP is a key-conditioned reversible face cloaking method that resists unauthorized restoration attacks while enabling authorized recovery with tamper indication.
Scale-Aware Adversarial Analysis: A Diagnostic for Generative AI in Multiscale Complex Systems cs.LG · 2026-05-01 · unverdicted · none · ref 87 · internal anchor
A new scale-aware diagnostic framework shows that unconstrained diffusion generative models exhibit structural freezing and instability instead of smooth physical responses under multiscale perturbations.
Defending Quantum Classifiers against Adversarial Perturbations through Quantum Autoencoders quant-ph · 2026-04-30 · unverdicted · none · ref 8 · internal anchor
A quantum autoencoder purifies adversarial perturbations for quantum classifiers and supplies a confidence score for unrecoverable inputs, claiming up to 68% accuracy gains over prior defenses without adversarial training.
Controlled Steering-Based State Preparation for Adversarial-Robust Quantum Machine Learning quant-ph · 2026-04-30 · unverdicted · none · ref 20 · internal anchor
A passive steering method for quantum state preparation improves adversarial accuracy in QML models by up to 40% across tested cases.
Unifying Runtime Monitoring Approaches for Safety-Critical Machine Learning: Application to Vision-Based Landing cs.LG · 2026-04-29 · unverdicted · none · ref 18 · internal anchor
A framework unifies runtime monitoring for safety-critical ML into ODD, OOD, and OMS categories and demonstrates them on vision-based runway detection for landing.
Threat-Oriented Digital Twinning for Security Evaluation of Autonomous Platforms cs.CR · 2026-04-28 · unverdicted · none · ref 13 · internal anchor
A threat-oriented digital twinning methodology and open-source modular twin is introduced for security evaluation of autonomous platforms, translating threat analysis into controllable tests for spoofing, replay, and adversarial ML attacks.
IPRU: Input-Perturbation-based Radio Frequency Fingerprinting Unlearning for LAWNs eess.SP · 2026-04-27 · unverdicted · none · ref 15 · internal anchor
IPRU erases target AAV radio fingerprints via an optimized input perturbation vector, delivering 1.41% unlearning accuracy, 99.41% remaining accuracy, full membership-inference resistance, and 5.79X speedup over retraining.
When AI reviews science: Can we trust the referee? cs.AI · 2026-04-26 · unverdicted · none · ref 59 · internal anchor
AI peer review systems are vulnerable to prompt injections, prestige biases, assertion strength effects, and contextual poisoning, as demonstrated by a new attack taxonomy and causal experiments on real conference submissions.
