pith. machine review for the scientific record.

arxiv: 2605.08896 · v1 · submitted 2026-05-09 · 💻 cs.CL · cs.AI · cs.LG

Recognition: no theorem link

FragileFlow: Spectral Control of Correct-but-Fragile Predictions for Foundation Model Robustness

Boxuan Wang, Jinwei Hu, Xiaowei Huang, Yi Dong, Zhuoyun Li

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 01:07 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.LG

keywords margin-aware error flow · FragileFlow · vulnerable-risk matrix · PAC-Bayes bound · worst-class robustness · spectral control · correct-but-fragile predictions · foundation model adaptation

The pith

FragileFlow uses spectral control on margin-aware error flows from correct predictions to deliver deterministic worst-class robustness in foundation models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Foundation models often produce correct answers whose probability mass is already drifting toward systematic wrong competitors near the decision boundary, a hidden fragility that average accuracy metrics miss. The paper formalizes this drift as margin-aware error flow and introduces FragileFlow, a plug-in regularizer that detects these cases with a calibrated margin buffer and assembles the leaking probability into a class-wise vulnerable-risk matrix. Spectral control is then applied to the matrix, yielding the first PAC-Bayes upper bound on the error-flow object. Under an invoked stability condition, this bound supplies a conservative deterministic guarantee on worst-class robustness. Experiments on multiple-choice LLM benchmarks and few-shot CLIP adaptation show consistent gains on theory-facing risk measures and perturbed worst-class accuracy while clean accuracy stays intact.
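One concrete reading of the gating-and-assembly step, as a sketch: the paper specifies a calibrated margin buffer and a class-wise matrix, but the exact gate and normalization are not given in the abstract, so the function name, the margin rule, and the per-class averaging below are assumptions rather than the paper's definitions.

```python
import numpy as np

def vulnerable_risk_matrix(probs, labels, gamma=0.1):
    """Sketch of FragileFlow's gating and assembly step (assumed details).

    probs  : (n, K) softmax outputs
    labels : (n,) true class indices
    gamma  : margin buffer; the paper calibrates it, here a plain constant

    Returns a K x K matrix V whose row i collects the off-class probability
    mass leaking from true class i over correct-but-fragile examples.
    """
    n, K = probs.shape
    V = np.zeros((K, K))
    counts = np.zeros(K)
    for p, y in zip(probs, labels):
        if int(np.argmax(p)) != y:
            continue  # only still-correct predictions are considered
        competitor = np.max(np.delete(p, y))
        if p[y] - competitor >= gamma:
            continue  # comfortably correct: outside the margin buffer
        flow = p.copy()
        flow[y] = 0.0  # keep only the mass drifting to wrong competitors
        V[y] += flow
        counts[y] += 1
    nonzero = counts > 0
    V[nonzero] /= counts[nonzero, None]  # one plausible normalization
    return V
```

For instance, a prediction p = (0.52, 0.48, 0.0) on true class 0 is correct but sits inside a γ = 0.1 buffer, so its 0.48 of leaking mass lands in row 0, while a confident (0.9, 0.05, 0.05) prediction contributes nothing.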

Core claim

The paper claims that correct-but-fragile predictions can be identified with a margin buffer and that their off-class probability mass, organized into a vulnerable-risk matrix, can be subjected to empirical spectral control. The resulting error-flow object admits the first PAC-Bayes upper bound of its kind, which, under a stability condition, yields a conservative deterministic guarantee of worst-class robustness.

What carries the argument

The margin-aware error-flow object: the structured probability leakage away from true classes in still-correct predictions, assembled into a class-wise vulnerable-risk matrix whose spectral properties FragileFlow directly regularizes.

If this is right

  • Empirical spectral control improves the proposed theory-facing risk measures over matched baselines.
  • Perturbed worst-class accuracy increases in most tested settings on LLM and VLM adaptation tasks.
  • Clean accuracy is preserved across all comparisons.
  • The PAC-Bayes bound offers a conservative route from observed spectral properties to deterministic worst-class robustness.
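The "spectral control" ingredient, read literally, penalizes the dominant singular value of the vulnerable-risk matrix; that reading matches the σmax quantity in Figure 3, but the full objective may combine further terms. A minimal sketch under that assumption, with a power-iteration variant of the kind one would use to keep the computation on the autograd graph during training:

```python
import numpy as np

def spectral_penalty(V):
    """sigma_max of the vulnerable-risk matrix, via full SVD (reference version)."""
    return float(np.linalg.svd(V, compute_uv=False)[0])

def sigma_max_power_iteration(V, iters=500, seed=0):
    """Same quantity by power iteration on V^T V: the cheap route that
    remains differentiable when V is built from model outputs."""
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(V.shape[1])
    v /= np.linalg.norm(v)
    for _ in range(iters):
        v = V.T @ (V @ v)   # one step of power iteration on V^T V
        v /= np.linalg.norm(v)
    return float(np.linalg.norm(V @ v))  # ||V v|| -> sigma_max
```

Both functions agree up to iteration error; only the power-iteration form would be practical inside a training loop.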

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same margin-buffer and matrix construction could be adapted to regression or sequence-generation tasks to surface analogous fragile outputs.
  • The vulnerable-risk matrix may expose systematic class-confusion patterns that persist across different perturbation strengths.
  • If the stability condition can be verified or relaxed on new data, the method offers a lightweight post-training route to worst-class guarantees without full retraining.

Load-bearing premise

The stability condition that converts the PAC-Bayes upper bound on the margin-aware error-flow object into a deterministic worst-class robustness guarantee.
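The purely linear-algebraic half of that bridge is uncontroversial: every row of the vulnerable-risk matrix has 2-norm at most σmax, so shrinking the spectrum caps the worst class's flow. Everything beyond that fact, the step from the empirical, PAC-Bayes-controlled matrix to a deterministic population statement, is what the stability condition must supply. A sketch of the deterministic part only (function name assumed, not the paper's):

```python
import numpy as np

def worst_class_flow(V):
    """max_i ||V[i]||_2: the largest per-true-class leakage in the matrix."""
    return float(np.max(np.linalg.norm(V, axis=1)))

# For any matrix, ||e_i^T V||_2 <= sigma_max(V), so the spectral penalty
# always upper-bounds worst-class flow on the observed matrix; the paper's
# guarantee additionally needs the PAC-Bayes bound and the stability
# condition to carry this from empirical to population quantities.
```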

What would settle it

A benchmark experiment in which spectral control of the vulnerable-risk matrix is applied yet worst-class accuracy under perturbations fails to rise or the empirical risk measures do not improve over matched baselines.

Figures

Figures reproduced from arXiv: 2605.08896 by Boxuan Wang, Jinwei Hu, Xiaowei Huang, Yi Dong, Zhuoyun Li.

Figure 1
Figure 1. Overview of FragileFlow. The figure summarizes the pipeline from finite-option prediction to margin-aware error flow, PAC-Bayes control, deterministic stability, and the final plug-in objective. Section tags indicate where each component is defined. Definition 2.4 (ungated error-flow matrix): for a fixed model θ, the ungated population error-flow matrix under D′ is (M^{D′}_θ)_{ij} := E_{(x′,y)∼D′}[p_θ(i | x′) …
Figure 3
Figure 3. Spectral compression with and without stability. Each panel plots σmax = V̂SR_γ against V̂WR_γ/√K. Centroids summarize the base and plug-in runs under the spectral-only and composite objectives. We finally isolate the two components of FragileFlow. The spectral term Rspec directly penalizes the dominant mode of the margin-aware error-flow matrix, measured by σmax = V̂SR_γ. The stability term Rstab is no…
Figure 4
Figure 4. Sensitivity of the plug-in to its strength on CLIP ViT-B/32, Caltech101. We sweep along the diagonal β ≈ α/2 for both Plugin_outer and Plugin_both. Left: clean accuracy remains nearly flat across α ∈ {0, 0.03, 0.1, 0.3, 1}, suggesting that the plug-in does not introduce a clear utility cost over this range. Right: the vulnerable-flow measure is generally lower for nonzero plug-in strengths than for α = 0, al…
Figure 5
Figure 5. QwenVL cross-model summary on DTD. Delta bars of each LoRA-adv + plug-in variant relative to LoRA-adv on the four reported metrics. Both variants improve the class-structural safety metrics (V̂SR_γ, V̂WR_γ) with minimal utility cost, showing that the plug-in is not tied to CLIP ViT.
Original abstract

Robust adaptation of LLMs and VLMs is often evaluated by average accuracy or average consistency under perturbations. However, these averages can hide a structured failure mode: a prediction may remain correct while probability mass already flows from particular true classes toward systematic wrong competitors near the decision boundary. In this paper, we formalize this phenomenon as margin-aware error flow and introduce FragileFlow, a plug-in regularizer that uses a calibrated margin buffer to identify correct-but-fragile predictions and organize their off-class probability mass into a class-wise vulnerable-risk matrix. Theoretically, we provide the first PAC-Bayes upper bound for this margin-aware error-flow object, showing how empirical spectral control yields a conservative route to deterministic worst-class robustness under a stability condition. Experiments on multiple-choice LLM benchmarks and few-shot CLIP adaptation show that FragileFlow consistently improves the proposed theory-facing risk measures over matched baselines, yields perturbed worst-class accuracy gains in most settings, and preserves clean accuracy across comparisons.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper formalizes margin-aware error flow as a structured failure mode in which correct predictions in LLMs and VLMs nonetheless exhibit probability mass flowing toward systematic wrong competitors. It introduces FragileFlow, a plug-in regularizer that employs a calibrated margin buffer to identify fragile predictions and organizes off-class mass into a class-wise vulnerable-risk matrix. The central theoretical contribution is a PAC-Bayes upper bound on this error-flow object, which, under an invoked stability condition, is claimed to yield a conservative deterministic guarantee on worst-class robustness via empirical spectral control. Experiments on multiple-choice LLM benchmarks and few-shot CLIP adaptation report consistent gains on theory-facing risk measures, perturbed worst-class accuracy in most settings, and preservation of clean accuracy relative to matched baselines.

Significance. If the derivation and the stability condition can be made rigorous and verifiable, the work supplies a novel lens on robustness that moves beyond average accuracy or consistency metrics toward structured, worst-class guarantees. The combination of a PAC-Bayes bound with a practical spectral regularizer is potentially valuable for safety-critical deployment of foundation models. The reported empirical pattern—improved risk measures without clean-accuracy degradation—is a positive signal, though its reliability cannot yet be assessed from the provided details.

major comments (3)
  1. [Abstract, §3 (theoretical development)] The PAC-Bayes upper bound is presented as the first such bound for the margin-aware error-flow object and as supplying a conservative route to deterministic worst-class robustness under a stability condition. No derivation, explicit statement of the bound, or definition of the stability condition appears in the abstract, and the full text does not clarify whether the bound is independent of quantities fitted during regularization or whether the stability condition is enforced by FragileFlow. This renders the central theoretical claim unverifiable and load-bearing.
  2. [Experiments (§4–5)] The claims of consistent improvement on theory-facing risk measures and perturbed worst-class accuracy gains rest on comparisons whose details are absent—no description of how the regularizer or vulnerable-risk matrix is computed, no error bars, no specification of the stability condition's empirical status post-training, and no ablation isolating the contribution of spectral control. These omissions prevent assessment of whether the reported gains are robust or whether the stability condition actually holds.
  3. [§3 (stability condition), §4 (experiments)] The stability condition is invoked to bridge the PAC-Bayes bound to deterministic worst-class robustness, yet no experiment tests whether the condition is satisfied by models trained with FragileFlow or what occurs under violation. Without such verification, the advertised conservative guarantee cannot be confirmed and the bound may not deliver the claimed robustness.
minor comments (2)
  1. [§2] Notation for the vulnerable-risk matrix and margin-aware error flow should be introduced with explicit equations at first use rather than relying on prose descriptions.
  2. [Abstract, §5] The abstract states that FragileFlow 'preserves clean accuracy across comparisons,' but the experimental tables or figures should report the exact clean-accuracy deltas with standard deviations to substantiate this claim.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed and constructive report. We address each major comment below and describe the revisions we will make to improve clarity, verifiability, and empirical support.

Point-by-point responses
  1. Referee: Abstract and theoretical section: The PAC-Bayes upper bound is presented as the first such bound for the margin-aware error-flow object and as supplying a conservative route to deterministic worst-class robustness under a stability condition. No derivation, explicit statement of the bound, or definition of the stability condition appears in the abstract, and the full text does not clarify whether the bound is independent of quantities fitted during regularization or whether the stability condition is enforced by FragileFlow. This renders the central theoretical claim unverifiable and load-bearing.

    Authors: We agree that the abstract and main theoretical section require an explicit statement of the bound and a clear definition of the stability condition. The derivation appears in the appendix; the bound is constructed to remain independent of regularization parameters, and FragileFlow is designed to enforce the stability condition via the margin buffer and spectral regularizer. In the revision we will (i) add the explicit bound and stability definition to the abstract and §3, (ii) include a concise derivation sketch in the main text, and (iii) state the independence and enforcement properties explicitly. revision: yes

  2. Referee: Experimental section: The claims of consistent improvement on theory-facing risk measures and perturbed worst-class accuracy gains rest on comparisons whose details are absent—no description of how the regularizer or vulnerable-risk matrix is computed, no error bars, no specification of the stability condition's empirical status post-training, and no ablation isolating the contribution of spectral control. These omissions prevent assessment of whether the reported gains are robust or whether the stability condition actually holds.

    Authors: We acknowledge these omissions. The computation of the regularizer and vulnerable-risk matrix is outlined in §4, but we will expand this with explicit formulas, pseudocode, and implementation details. We will also report error bars from multiple random seeds, the post-training empirical value of the stability condition on each benchmark, and an ablation that isolates the spectral-control term. These additions will allow readers to assess robustness of the gains. revision: yes

  3. Referee: §3 and §4: The stability condition is invoked to bridge the PAC-Bayes bound to deterministic worst-class robustness, yet no experiment tests whether the condition is satisfied by models trained with FragileFlow or what occurs under violation. Without such verification, the advertised conservative guarantee cannot be confirmed and the bound may not deliver the claimed robustness.

    Authors: We agree that direct verification of the stability condition is necessary to support the deterministic guarantee. In the revised manuscript we will add an empirical check that measures whether the condition holds for FragileFlow-trained models on the reported benchmarks and discuss the behavior observed when the condition is mildly violated. This will strengthen the connection between the theoretical bound and the experimental results. revision: yes

Circularity Check

0 steps flagged

Derivation chain is self-contained; no circular reductions identified

Full rationale

The paper defines the margin-aware error-flow object and FragileFlow regularizer from first principles using a calibrated margin buffer and class-wise vulnerable-risk matrix. It then states a new PAC-Bayes upper bound on this explicitly defined object and connects the bound to worst-class robustness only under an explicitly invoked stability condition. No provided equations or steps show the bound, the spectral control, or the robustness guarantee reducing to fitted parameters, self-citations, or the conclusion by construction. Experiments are reported as separate empirical checks on risk measures and accuracy, not as inputs that force the theoretical claims. The stability condition is treated as an assumption rather than a derived or fitted property, keeping the chain non-circular.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 2 invented entities

The abstract introduces a calibrated margin buffer and a stability condition whose details and justification are not supplied; the vulnerable-risk matrix and margin-aware error flow appear to be newly postulated constructs without independent evidence outside the paper.

free parameters (1)
  • calibrated margin buffer
    Used to identify correct-but-fragile predictions; its calibration procedure and value are not specified in the abstract and are therefore treated as a free parameter of the method.
axioms (1)
  • domain assumption: Stability condition under which empirical spectral control yields deterministic worst-class robustness
    Invoked to bridge the PAC-Bayes upper bound to the claimed deterministic robustness guarantee.
invented entities (2)
  • margin-aware error flow · no independent evidence
    purpose: To capture the structured probability-mass flow from true classes toward wrong competitors in correct-but-fragile predictions
    Newly formalized object for which the PAC-Bayes bound is derived; no independent evidence outside the paper is mentioned.
  • vulnerable-risk matrix · no independent evidence
    purpose: To organize off-class probability mass in a class-wise manner for spectral control
    Introduced as part of the FragileFlow regularizer; no external validation or prior existence is referenced.

pith-pipeline@v0.9.0 · 5482 in / 1730 out tokens · 84907 ms · 2026-05-12T01:07:51.043801+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

57 extracted references · 57 canonical work pages

  1. [1]

    Better fine-tuning by reducing representational collapse

    Armen Aghajanyan, Akshat Shrivastava, Anchit Gupta, Naman Goyal, Luke Zettlemoyer, and Sonal Gupta. Better fine-tuning by reducing representational collapse. In International Conference on Learning Representations, 2021

  2. [2]

    Enhancing LLM robustness to perturbed instructions: An empirical study

    Aryan Agrawal, Lisa Alazraki, Shahin Honarvar, Thomas Mensink, and Marek Rei. Enhancing LLM robustness to perturbed instructions: An empirical study. In ICLR 2025 Workshop on Building Trust in Language Models and Applications, 2025

  3. [3]

    Stronger generalization bounds for deep nets via a compression approach

    Sanjeev Arora, Rong Ge, Behnam Neyshabur, and Yi Zhang. Stronger generalization bounds for deep nets via a compression approach. In Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 254–263. PMLR, 2018

  4. [4]

    Robustness may be at odds with fairness: An empirical study on class-wise accuracy

    Philipp Benz, Chaoning Zhang, Adil Karjauv, and In So Kweon. Robustness may be at odds with fairness: An empirical study on class-wise accuracy. In Luca Bertinetto, João F. Henriques, Samuel Albanie, Michela Paganini, and Gül Varol, editors, NeurIPS 2020 Workshop on Pre-registration in Machine Learning, volume 148 of Proceedings of Machine Learning Researc...

  5. [5]

    On the opportunities and risks of foundation models

    Rishi Bommasani, Drew A. Hudson, Ehsan Adeli, Russ Altman, Simran Arora, Sydney von Arx, Michael S. Bernstein, Jeannette Bohg, Antoine Bosselut, Emma Brunskill, Erik Brynjolfsson, S. Buch, Dallas Card, Rodrigo Castellon, Niladri S. Chatterji, Annie S. Chen, Kathleen A. Creel, Jared Davis, Dora Demszky, Chris Donahue, Moussa Doumbouya, Esin Durmus, Stefano...

  6. [6]

    Language models are few-shot learners

    Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott G...

  7. [7]

    PAC-Bayesian supervised classification: The thermodynamics of statistical learning

    Olivier Catoni. PAC-Bayesian Supervised Classification: The Thermodynamics of Statistical Learning. Institute of Mathematical Statistics, Beachwood, Ohio, 2007

  8. [8]

    Describing textures in the wild

    Mircea Cimpoi, Subhransu Maji, Iasonas Kokkinos, Sammy Mohamed, and Andrea Vedaldi. Describing textures in the wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3606–3613, 2014

  9. [9]

    Think you have solved question answering? try ARC, the AI2 reasoning challenge, 2018

    Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have solved question answering? try ARC, the AI2 reasoning challenge, 2018

  10. [10]

    Computing nonvacuous generalization bounds for deep (stochastic) neural networks with many more parameters than training data

    Gintare Karolina Dziugaite and Daniel M. Roy. Computing nonvacuous generalization bounds for deep (stochastic) neural networks with many more parameters than training data. In Proceedings of the Thirty-Third Conference on Uncertainty in Artificial Intelligence, 2017

  11. [11]

    Learning generative visual models from few training examples: An incremental Bayesian approach tested on 101 object categories

    Li Fei-Fei, Rob Fergus, and Pietro Perona. Learning generative visual models from few training examples: An incremental bayesian approach tested on 101 object categories. Computer Vision and Image Understanding, 106(1):59–70, 2007

  12. [12]

    Sharpness-aware minimization for efficiently improving generalization

    Pierre Foret, Ariel Kleiner, Hossein Mobahi, and Behnam Neyshabur. Sharpness-aware minimization for efficiently improving generalization. In International Conference on Learning Representations, 2021

  13. [13]

    Few-shot adversarial low-rank fine-tuning of vision-language models, 2025

    Sajjad Ghiasvand, Haniyeh Ehsani Oskouie, Mahnoosh Alizadeh, and Ramtin Pedarsani. Few-shot adversarial low-rank fine-tuning of vision-language models, 2025

  14. [14]

    Explaining and harnessing adversarial examples

    Ian J. Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. In Yoshua Bengio and Yann LeCun, editors, 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, 2015

  15. [15]

    Evaluating concurrent robustness of language models across diverse challenge sets

    Vatsal Gupta, Pranshu Pandya, Tushar Kataria, Vivek Gupta, and Dan Roth. Evaluating concurrent robustness of language models across diverse challenge sets. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 22162–22184, Miami, Florida, USA, 2024. Association for Computational Linguistics

  16. [16]

    LoRA: Low-rank adaptation of large language models

    Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models. In International Conference on Learning Representations, 2022

  17. [17]

    Tapas are free! training-free adaptation of programmatic agents via llm-guided program synthesis in dynamic environments

    Jinwei Hu, Yi Dong, Youcheng Sun, and Xiaowei Huang. Tapas are free! training-free adaptation of programmatic agents via llm-guided program synthesis in dynamic environments. Proceedings of the AAAI Conference on Artificial Intelligence, 40(35):29477–29485, Mar. 2026

  18. [18]

    Lying with truths: Open-channel multi-agent collusion for belief manipulation via generative montage

    Jinwei Hu, Xinmiao Huang, Youcheng Sun, Yi Dong, and Xiaowei Huang. Lying with truths: Open-channel multi-agent collusion for belief manipulation via generative montage. arXiv preprint arXiv:2601.01685, 2026

  19. [19]

    Falcon: Fine-grained activation manipulation by contrastive orthogonal unalignment for large language model

    Jinwei Hu, Zhenglin Huang, Xiangyu Yin, Wenjie Ruan, Guangliang Cheng, Yi Dong, and Xiaowei Huang. Falcon: Fine-grained activation manipulation by contrastive orthogonal unalignment for large language model. In D. Belgrave, C. Zhang, H. Lin, R. Pascanu, P. Koniusz, M. Ghassemi, and N. Chen, editors, Advances in Neural Information Processing Systems, volume...

  20. [20]

    Mistral 7B

    Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, Lélio Renard Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, and William El Sayed. Mistral 7B, 2023

  21. [21]

    SMART: Robust and efficient fine-tuning for pre-trained natural language models through principled regularized optimization

    Haoming Jiang, Pengcheng He, Weizhu Chen, Xiaodong Liu, Jianfeng Gao, and Tuo Zhao. SMART: Robust and efficient fine-tuning for pre-trained natural language models through principled regularized optimization. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 2177–2190. Association for Computational Linguist...

  22. [22]

    Enhancing robust fairness via confusional spectral regularization

    Gaojie Jin, Sihao Wu, Jiaxu Liu, Tianjin Huang, and Ronghui Mu. Enhancing robust fairness via confusional spectral regularization. In The Thirteenth International Conference on Learning Representations, 2025

  23. [23]

    Wat: improve the worst-class robustness in adversarial training

    Boqi Li and Weiwei Liu. Wat: improve the worst-class robustness in adversarial training. In Proceedings of the Thirty-Seventh AAAI Conference on Artificial Intelligence and Thirty-Fifth Conference on Innovative Applications of Artificial Intelligence and Thirteenth Symposium on Educational Advances in Artificial Intelligence, AAAI’23/IAAI’23/EAAI’23. AAAI...

  24. [24]

    Where do prompt perturbations break generation? a segment-level view of robustness in lora-tuned language models, 2026

    Zhuoyun Li, Boxuan Wang, Jinwei Hu, Zhenglin Huang, Qisong He, Xinmiao Huang, Guangliang Cheng, Xiaowei Huang, and Yi Dong. Where do prompt perturbations break generation? a segment-level view of robustness in lora-tuned language models, 2026

  25. [25]

    PAC-tuning: Fine-tuning pre-trained language models with PAC-driven perturbed gradient descent

    Guangliang Liu, Zhiyu Xue, Xitong Zhang, Kristen Johnson, and Rongrong Wang. PAC-tuning: Fine-tuning pre-trained language models with PAC-driven perturbed gradient descent. In Houda Bouamor, Juan Pino, and Kalika Bali, editors, Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 12178–12189, Singapore, December 202...

  26. [26]

    PAC-bayes compression bounds so tight that they can explain generalization

    Sanae Lotfi, Marc Anton Finzi, Sanyam Kapoor, Andres Potapczynski, Micah Goldblum, and Andrew Gordon Wilson. PAC-bayes compression bounds so tight that they can explain generalization. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho, editors, Advances in Neural Information Processing Systems, 2022

  27. [27]

    Non-vacuous generalization bounds for large language models

    Sanae Lotfi, Marc Anton Finzi, Yilun Kuang, Tim G. J. Rudner, Micah Goldblum, and Andrew Gordon Wilson. Non-vacuous generalization bounds for large language models. In Proceedings of the 41st International Conference on Machine Learning, volume 235 of Proceedings of Machine Learning Research, pages 32801–32818. PMLR, 2024

  28. [28]

    Towards deep learning models resistant to adversarial attacks

    Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards deep learning models resistant to adversarial attacks. In International Conference on Learning Representations, 2018

  29. [29]

    Understanding zero-shot adversarial robustness for large-scale models

    Chengzhi Mao, Scott Geng, Junfeng Yang, Xin Wang, and Carl Vondrick. Understanding zero-shot adversarial robustness for large-scale models. In The Eleventh International Conference on Learning Representations, 2023

  30. [30]

    PAC-Bayesian model averaging

    David A. McAllester. PAC-Bayesian model averaging. In Proceedings of the Twelfth Annual Conference on Computational Learning Theory, COLT ’99, pages 164–170, New York, NY, USA, 1999. ACM

  31. [31]

    Mistral-7B-Instruct-v0.2

    Mistral AI. Mistral-7B-Instruct-v0.2. https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2, 2023. Hugging Face model card

  32. [32]

    TextAttack: A framework for adversarial attacks, data augmentation, and adversarial training in NLP

    John X. Morris, Eli Lifland, Jin Yong Yoo, Jake Grigsby, Di Jin, and Yanjun Qi. TextAttack: A framework for adversarial attacks, data augmentation, and adversarial training in NLP. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 119–126. Association for Computational Linguistics, 2020

  33. [33]

    Deterministic PAC-bayesian generalization bounds for deep networks via generalizing noise-resilience

    Vaishnavh Nagarajan and Zico Kolter. Deterministic PAC-Bayesian generalization bounds for deep networks via generalizing noise-resilience. In International Conference on Learning Representations, 2019

  34. [34]

    A PAC-Bayesian approach to spectrally-normalized margin bounds for neural networks

    Behnam Neyshabur, Srinadh Bhojanapalli, David McAllester, and Nathan Srebro. A PAC-Bayesian approach to spectrally-normalized margin bounds for neural networks. In International Conference on Learning Representations, 2018

  35. [35]

    Towards calibrated robust fine-tuning of vision-language models

    Changdae Oh, Hyesu Lim, Mijoo Kim, Dongyoon Han, Sangdoo Yun, Jaegul Choo, Alexander Hauptmann, Zhi-Qi Cheng, and Kyungwoo Song. Towards calibrated robust fine-tuning of vision-language models. In Advances in Neural Information Processing Systems, volume 37, 2024

  36. [36]

    Cats and dogs

    Omkar M. Parkhi, Andrea Vedaldi, Andrew Zisserman, and C. V. Jawahar. Cats and dogs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3498–3505, 2012

  37. [37]

    Prompt perturbation consistency learning for robust language models

    Yao Qiang, Subhrangshu Nandi, Ninareh Mehrabi, Greg Ver Steeg, Anoop Kumar, Anna Rumshisky, and Aram Galstyan. Prompt perturbation consistency learning for robust language models. In Findings of the Association for Computational Linguistics: EACL 2024, pages 1357–1370, St. Julian’s, Malta, 2024. Association for Computational Linguistics

  38. [38]

    Qwen2.5-1.5B-Instruct

    Qwen Team. Qwen2.5-1.5B-Instruct. https://huggingface.co/Qwen/Qwen2.5-1.5B-Instruct, 2024. Hugging Face model card

  39. [39]

    Qwen2.5 technical report, 2025

    Qwen Team. Qwen2.5 technical report, 2025

  40. [40]

    Learning transferable visual models from natural language supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision. In Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings...

  41. [41]

    Beyond accuracy: Behavioral testing of NLP models with CheckList

    Marco Tulio Ribeiro, Tongshuang Wu, Carlos Guestrin, and Sameer Singh. Beyond accuracy: Behavioral testing of NLP models with CheckList. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4902–4912. Association for Computational Linguistics, 2020

  42. [42]

    Robust CLIP: Unsupervised adversarial fine-tuning of vision embeddings for robust large vision-language models

    Christian Schlarmann, Naman Deep Singh, Francesco Croce, and Matthias Hein. Robust CLIP: Unsupervised adversarial fine-tuning of vision embeddings for robust large vision-language models. In Ruslan Salakhutdinov, Zico Kolter, Katherine Heller, Adrian Weller, Nuria Oliver, Jonathan Scarlett, and Felix Berkenkamp, editors, Proceedings of the 41st Internatio...

  43. [43]

    Improving robust fairness via balance adversarial training

    Chunyu Sun, Chenye Xu, Chengyuan Yao, Siyuan Liang, Yichao Wu, Ding Liang, Xianglong Liu, and Aishan Liu. Improving robust fairness via balance adversarial training. In Proceedings of the Thirty-Seventh AAAI Conference on Artificial Intelligence and Thirty-Fifth Conference on Innovative Applications of Artificial Intelligence and Thirteenth Symposium on Ed...

  44. [44]

    CommonsenseQA: A question answering challenge targeting commonsense knowledge

    Alon Talmor, Jonathan Herzig, Nicholas Lourie, and Jonathan Berant. CommonsenseQA: A question answering challenge targeting commonsense knowledge. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4149–4158. Association...

  45. [45]

    Analysis and applications of class-wise robustness in adversarial training

    Qi Tian, Kun Kuang, Kelu Jiang, Fei Wu, and Yisen Wang. Analysis and applications of class-wise robustness in adversarial training. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining, KDD ’21, pages 1561–1570, New York, NY, USA, 2021. Association for Computing Machinery.

  47. [47]

    Adversarial GLUE: A multi-task benchmark for robustness evaluation of language models

    Boxin Wang, Chejian Xu, Shuohang Wang, Zhe Gan, Yu Cheng, Jianfeng Gao, Ahmed Awadallah, and Bo Li. Adversarial GLUE: A multi-task benchmark for robustness evaluation of language models. In J. Vanschoren and S. Yeung, editors, Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks, volume 1, 2021.

  48. [48]

    Tapt: Test-time adversarial prompt tuning for robust inference in vision-language models

    Xin Wang, Kai Chen, Jiaming Zhang, Jingjing Chen, and Xingjun Ma. Tapt: Test-time adversarial prompt tuning for robust inference in vision-language models. In 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 19910–19920, 2025.

  49. [49]

    Improving robust generalization by direct PAC-Bayesian bound minimization

    Zifan Wang, Nan Ding, Tomer Levinboim, Xi Chen, and Radu Soricut. Improving robust generalization by direct PAC-Bayesian bound minimization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16458–16468, 2023.

  50. [50]

    CFA: Class-wise calibrated fair adversarial training

    Zeming Wei, Yifei Wang, Yiwen Guo, and Yisen Wang. CFA: Class-wise calibrated fair adversarial training. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8193–8201, 2023.

  51. [51]

    Adversarial weight perturbation helps robust generalization

    Dongxian Wu, Shu-Tao Xia, and Yisen Wang. Adversarial weight perturbation helps robust generalization. In H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems, volume 33, pages 2958–2969. Curran Associates, Inc., 2020.

  52. [52]

    To be robust or to be fair: Towards fairness in adversarial training

    Han Xu, Xiaorui Liu, Yaxin Li, Anil K. Jain, and Jiliang Tang. To be robust or to be fair: Towards fairness in adversarial training. In Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pages 11492–11501. PMLR, 2021.

  53. [53]

    Theoretically principled trade-off between robustness and accuracy

    Hongyang Zhang, Yaodong Yu, Jiantao Jiao, Eric Xing, Laurent El Ghaoui, and Michael I. Jordan. Theoretically principled trade-off between robustness and accuracy. In Kamalika Chaudhuri and Ruslan Salakhutdinov, editors, Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 7472–. PMLR, 09–15 Jun 2019.

  55. [55]

    Adversarial prompt tuning for vision-language models

    Jiaming Zhang, Xingjun Ma, Xin Wang, Lingyu Qiu, Jiaqi Wang, Yu-Gang Jiang, and Jitao Sang. Adversarial prompt tuning for vision-language models. In Computer Vision – ECCV 2024: 18th European Conference, Milan, Italy, September 29–October 4, 2024, Proceedings, Part XLV, pages 56–72, Berlin, Heidelberg, 2024. Springer-Verlag.

  56. [56]

    Freelb: Enhanced adversarial training for natural language understanding

    Chen Zhu, Yu Cheng, Zhe Gan, Siqi Sun, Tom Goldstein, and Jingjing Liu. Freelb: Enhanced adversarial training for natural language understanding. In International Conference on Learning Representations, 2020.

  57. [57]

    Promptrobust: Towards evaluating the robustness of large language models on adversarial prompts

    Kaijie Zhu, Jindong Wang, Jiaheng Zhou, Zichen Wang, Hao Chen, Yidong Wang, Linyi Yang, Wei Ye, Yue Zhang, Neil Gong, and Xing Xie. Promptrobust: Towards evaluating the robustness of large language models on adversarial prompts. In Proceedings of the 1st ACM Workshop on Large AI Systems and Models with Privacy and Safety Analysis, LAMPS ’24, pages 57–68, Ne...