pith. machine review for the scientific record.


Explaining and Harnessing Adversarial Examples

88 Pith papers cite this work. Polarity classification is still indexing.

abstract

Several machine learning models, including neural networks, consistently misclassify adversarial examples---inputs formed by applying small but intentionally worst-case perturbations to examples from the dataset, such that the perturbed input results in the model outputting an incorrect answer with high confidence. Early attempts at explaining this phenomenon focused on nonlinearity and overfitting. We argue instead that the primary cause of neural networks' vulnerability to adversarial perturbation is their linear nature. This explanation is supported by new quantitative results while giving the first explanation of the most intriguing fact about them: their generalization across architectures and training sets. Moreover, this view yields a simple and fast method of generating adversarial examples. Using this approach to provide examples for adversarial training, we reduce the test set error of a maxout network on the MNIST dataset.
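The "simple and fast method" referenced in the abstract is the fast gradient sign method (FGSM), and the adversarial training result comes from mixing clean and FGSM-perturbed examples in the loss. Below is a minimal PyTorch sketch under stated assumptions: a classifier with inputs in [0, 1], a cross-entropy loss, and illustrative defaults for epsilon and alpha; the helper names are hypothetical and this is not a reproduction of the paper's exact maxout/MNIST setup.

```python
# Minimal sketch of FGSM and mixed clean/adversarial training, assuming a
# PyTorch classifier over inputs in [0, 1]. Hyperparameter defaults and helper
# names are illustrative, not the paper's exact maxout/MNIST configuration.
import torch
import torch.nn.functional as F


def fgsm_perturb(model, x, y, epsilon=0.25):
    """Return x_adv = x + epsilon * sign(grad_x J(theta, x, y))."""
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)
    loss.backward()
    # Under a locally linear view of the model, stepping in the sign of the
    # gradient is the worst-case perturbation within an L-infinity ball of
    # radius epsilon.
    x_adv = x + epsilon * x.grad.sign()
    return x_adv.clamp(0.0, 1.0).detach()  # keep inputs in the valid range


def adversarial_training_step(model, optimizer, x, y, epsilon=0.25, alpha=0.5):
    """One step on the mixed objective alpha * J(x, y) + (1 - alpha) * J(x_adv, y)."""
    x_adv = fgsm_perturb(model, x, y, epsilon)
    optimizer.zero_grad()  # clear gradients accumulated while crafting x_adv
    loss = (alpha * F.cross_entropy(model(x), y)
            + (1 - alpha) * F.cross_entropy(model(x_adv), y))
    loss.backward()
    optimizer.step()
    return loss.item()
```

Because the perturbation is a single gradient sign step rather than an iterative optimization, crafting the adversarial batch costs roughly one extra forward-backward pass, which is what makes training on adversarial examples at this scale practical.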

hub tools

citation-role summary: background 1

citation-polarity summary: still indexing

claims ledger

  • abstract: Several machine learning models, including neural networks, consistently misclassify adversarial examples---inputs formed by applying small but intentionally worst-case perturbations to examples from the dataset, such that the perturbed input results in the model outputting an incorrect answer with high confidence. Early attempts at explaining this phenomenon focused on nonlinearity and overfitting. We argue instead that the primary cause of neural networks' vulnerability to adversarial perturbation is their linear nature. This explanation is supported by new quantitative results while giving …

co-cited works

roles: background 1

polarities: background 1


representative citing papers

Online Learning-to-Defer with Varying Experts

stat.ML · 2026-05-12 · unverdicted · novelty 8.0

Presents the first online learning-to-defer algorithm for multiclass classification with varying experts, with regret bounds O((n + n_e) T^{2/3}) in general and O((n + n_e) sqrt(T)) under low noise.

Inference Time Causal Probing in LLMs

cs.AI · 2026-05-08 · unverdicted · novelty 7.0

HDMI is a new probe-free technique that steers LLM hidden states via margin objectives to achieve more reliable causal interventions than prior probe-based methods on standard benchmarks.

Low Rank Adaptation for Adversarial Perturbation

cs.LG · 2026-04-30 · unverdicted · novelty 7.0

Adversarial perturbations possess an inherently low-rank structure that enables more efficient and effective black-box adversarial attacks via subspace projection.

Benign Overfitting in Adversarial Training for Vision Transformers

cs.LG · 2026-04-21 · unverdicted · novelty 7.0

Adversarial training on simplified Vision Transformers achieves benign overfitting with near-zero robust loss and generalization error when signal-to-noise ratio and perturbation budget meet specific conditions.

Duality for the Adversarial Total Variation

math.AP · 2026-04-20 · unverdicted · novelty 7.0

Duality techniques produce a dual representation and subdifferential characterization for the nonlocal total variation functional arising in adversarial training.

Learning Robustness at Test-Time from a Non-Robust Teacher

cs.CV · 2026-04-13 · unverdicted · novelty 7.0

A test-time adaptation framework anchors adversarial training to a non-robust teacher's predictions, yielding more stable optimization and better robustness-accuracy trade-offs than standard self-consistency methods.

citing papers explorer

Showing 7 of 7 citing papers after filters.

  • Inference Time Causal Probing in LLMs cs.AI · 2026-05-08 · unverdicted · none · ref 9 · internal anchor

    HDMI is a new probe-free technique that steers LLM hidden states via margin objectives to achieve more reliable causal interventions than prior probe-based methods on standard benchmarks.

  • Hidden Reliability Risks in Large Language Models: Systematic Identification of Precision-Induced Output Disagreements cs.AI · 2026-04-02 · unverdicted · none · ref 15 · internal anchor

    PrecisionDiff is a differential testing framework that uncovers widespread precision-induced behavioral disagreements in aligned LLMs, including safety-critical jailbreak divergences across precision formats.

  • Concrete Problems in AI Safety cs.AI · 2016-06-21 · accept · none · ref 63 · internal anchor

    The paper categorizes five concrete AI safety problems arising from flawed objectives, costly evaluation, and learning dynamics.

  • Redefining AI Red Teaming in the Agentic Era: From Weeks to Hours cs.AI · 2026-05-05 · unverdicted · none · ref 1 · internal anchor

    An agentic red teaming system automates the creation of adversarial testing workflows from natural language goals, unifying ML and generative AI attacks and achieving an 85% success rate on Meta Llama Scout with no custom human-written code.

  • When AI reviews science: Can we trust the referee? cs.AI · 2026-04-26 · unverdicted · none · ref 59 · internal anchor

    AI peer review systems are vulnerable to prompt injections, prestige biases, assertion strength effects, and contextual poisoning, as demonstrated by a new attack taxonomy and causal experiments on real conference submissions.

  • Medical Model Synthesis Architectures: A Case Study cs.AI · 2026-05-10 · unverdicted · none · ref 180 · internal anchor

    The MedMSA framework retrieves knowledge via language models, then builds formal probabilistic models to produce uncertainty-weighted differential diagnoses from symptoms.

  • Real-Time Evaluation of Autonomous Systems under Adversarial Attacks cs.AI · 2026-05-05 · unverdicted · none · ref 4 · internal anchor

    A framework trains and compares MLP, transformer, and GAIL-based trajectory models on real driving data, finding that architectural differences cause large variations in robustness to PGD attacks despite similar nominal accuracy.