pith. machine review for the scientific record.

arxiv: 2605.08896 · v1 · submitted 2026-05-09 · 💻 cs.CL · cs.AI · cs.LG

Recognition: no theorem link

FragileFlow: Spectral Control of Correct-but-Fragile Predictions for Foundation Model Robustness

Boxuan Wang, Jinwei Hu, Xiaowei Huang, Yi Dong, Zhuoyun Li

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 01:07 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.LG

keywords margin-aware error flow · FragileFlow · vulnerable-risk matrix · PAC-Bayes bound · worst-class robustness · spectral control · correct-but-fragile predictions · foundation model adaptation

The pith

FragileFlow uses spectral control on margin-aware error flows from correct predictions to deliver deterministic worst-class robustness in foundation models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Foundation models often produce correct answers whose probability mass is already drifting toward systematic wrong competitors near the decision boundary, a hidden fragility that average accuracy metrics miss. The paper formalizes this drift as margin-aware error flow and introduces FragileFlow, a plug-in regularizer that detects these cases with a calibrated margin buffer and assembles the leaking probability into a class-wise vulnerable-risk matrix. Spectral control is then applied to the matrix, yielding the first PAC-Bayes upper bound on the error-flow object. Under an invoked stability condition, this bound supplies a conservative deterministic guarantee on worst-class robustness. Experiments on multiple-choice LLM benchmarks and few-shot CLIP adaptation show consistent gains on theory-facing risk measures and perturbed worst-class accuracy while clean accuracy stays intact.
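One concrete reading of the gating-and-assembly step, as a sketch: the paper specifies a calibrated margin buffer and a class-wise matrix, but the exact gate and normalization are not given in the abstract, so the function name, the margin rule, and the per-class averaging below are assumptions rather than the paper's definitions.

```python
import numpy as np

def vulnerable_risk_matrix(probs, labels, gamma=0.1):
    """Sketch of FragileFlow's gating and assembly step (assumed details).

    probs  : (n, K) softmax outputs
    labels : (n,) true class indices
    gamma  : margin buffer; the paper calibrates it, here a plain constant

    Returns a K x K matrix V whose row i collects the off-class probability
    mass leaking from true class i over correct-but-fragile examples.
    """
    n, K = probs.shape
    V = np.zeros((K, K))
    counts = np.zeros(K)
    for p, y in zip(probs, labels):
        if int(np.argmax(p)) != y:
            continue  # only still-correct predictions are considered
        competitor = np.max(np.delete(p, y))
        if p[y] - competitor >= gamma:
            continue  # comfortably correct: outside the margin buffer
        flow = p.copy()
        flow[y] = 0.0  # keep only the mass drifting to wrong competitors
        V[y] += flow
        counts[y] += 1
    nonzero = counts > 0
    V[nonzero] /= counts[nonzero, None]  # one plausible normalization
    return V
```

For instance, a prediction p = (0.52, 0.48, 0.0) on true class 0 is correct but sits inside a γ = 0.1 buffer, so its 0.48 of leaking mass lands in row 0, while a confident (0.9, 0.05, 0.05) prediction contributes nothing.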

Core claim

The paper claims that correct-but-fragile predictions can be identified with a margin buffer and that their off-class probability mass, organized into a vulnerable-risk matrix, can be subjected to empirical spectral control. The resulting error-flow object admits the first PAC-Bayes upper bound of its kind, which, under a stability condition, yields a conservative deterministic guarantee of worst-class robustness.

What carries the argument

The margin-aware error-flow object: the structured probability leakage away from true classes in still-correct predictions, assembled into a class-wise vulnerable-risk matrix whose spectral properties FragileFlow directly regularizes.

If this is right

  • Empirical spectral control improves the proposed theory-facing risk measures over matched baselines.
  • Perturbed worst-class accuracy increases in most tested settings on LLM and VLM adaptation tasks.
  • Clean accuracy is preserved across all comparisons.
  • The PAC-Bayes bound offers a conservative route from observed spectral properties to deterministic worst-class robustness.
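The "spectral control" ingredient, read literally, penalizes the dominant singular value of the vulnerable-risk matrix; that reading matches the σmax quantity in Figure 3, but the full objective may combine further terms. A minimal sketch under that assumption, with a power-iteration variant of the kind one would use to keep the computation on the autograd graph during training:

```python
import numpy as np

def spectral_penalty(V):
    """sigma_max of the vulnerable-risk matrix, via full SVD (reference version)."""
    return float(np.linalg.svd(V, compute_uv=False)[0])

def sigma_max_power_iteration(V, iters=500, seed=0):
    """Same quantity by power iteration on V^T V: the cheap route that
    remains differentiable when V is built from model outputs."""
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(V.shape[1])
    v /= np.linalg.norm(v)
    for _ in range(iters):
        v = V.T @ (V @ v)   # one step of power iteration on V^T V
        v /= np.linalg.norm(v)
    return float(np.linalg.norm(V @ v))  # ||V v|| -> sigma_max
```

Both functions agree up to iteration error; only the power-iteration form would be practical inside a training loop.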

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same margin-buffer and matrix construction could be adapted to regression or sequence-generation tasks to surface analogous fragile outputs.
  • The vulnerable-risk matrix may expose systematic class-confusion patterns that persist across different perturbation strengths.
  • If the stability condition can be verified or relaxed on new data, the method offers a lightweight post-training route to worst-class guarantees without full retraining.

Load-bearing premise

The stability condition that converts the PAC-Bayes upper bound on the margin-aware error-flow object into a deterministic worst-class robustness guarantee.
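The purely linear-algebraic half of that bridge is uncontroversial: every row of the vulnerable-risk matrix has 2-norm at most σmax, so shrinking the spectrum caps the worst class's flow. Everything beyond that fact, the step from the empirical, PAC-Bayes-controlled matrix to a deterministic population statement, is what the stability condition must supply. A sketch of the deterministic part only (function name assumed, not the paper's):

```python
import numpy as np

def worst_class_flow(V):
    """max_i ||V[i]||_2: the largest per-true-class leakage in the matrix."""
    return float(np.max(np.linalg.norm(V, axis=1)))

# For any matrix, ||e_i^T V||_2 <= sigma_max(V), so the spectral penalty
# always upper-bounds worst-class flow on the observed matrix; the paper's
# guarantee additionally needs the PAC-Bayes bound and the stability
# condition to carry this from empirical to population quantities.
```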

What would settle it

A benchmark experiment in which spectral control of the vulnerable-risk matrix is applied yet worst-class accuracy under perturbations fails to rise or the empirical risk measures do not improve over matched baselines.

Figures

Figures reproduced from arXiv: 2605.08896 by Boxuan Wang, Jinwei Hu, Xiaowei Huang, Yi Dong, Zhuoyun Li.

Figure 1
Figure 1. Overview of FragileFlow. The figure summarizes the pipeline from finite-option prediction to margin-aware error flow, PAC-Bayes control, deterministic stability, and the final plug-in objective. Section tags indicate where each component is defined. Definition 2.4 (ungated error-flow matrix): for a fixed model θ, the ungated population error-flow matrix under D′ is (M^{D′}_θ)_{ij} := E_{(x′,y)∼D′}[p_θ(i | x′) …
Figure 3
Figure 3. Spectral compression with and without stability. Each panel plots σmax = V̂SR_γ against V̂WR_γ/√K. Centroids summarize the base and plug-in runs under the spectral-only and composite objectives. We finally isolate the two components of FragileFlow. The spectral term Rspec directly penalizes the dominant mode of the margin-aware error-flow matrix, measured by σmax = V̂SR_γ. The stability term Rstab is no…
Figure 4
Figure 4. Sensitivity of the plug-in to its strength on CLIP ViT-B/32, Caltech101. We sweep along the diagonal β ≈ α/2 for both Plugin_outer and Plugin_both. Left: clean accuracy remains nearly flat across α ∈ {0, 0.03, 0.1, 0.3, 1}, suggesting that the plug-in does not introduce a clear utility cost over this range. Right: the vulnerable-flow measure is generally lower for nonzero plug-in strengths than for α = 0, al…
Figure 5
Figure 5. QwenVL cross-model summary on DTD. Delta bars of each LoRA-adv + plug-in variant relative to LoRA-adv on the four reported metrics. Both variants improve the class-structural safety metrics (V̂SR_γ, V̂WR_γ) with minimal utility cost, showing that the plug-in is not tied to CLIP ViT.
Original abstract

Robust adaptation of LLMs and VLMs is often evaluated by average accuracy or average consistency under perturbations. However, these averages can hide a structured failure mode: a prediction may remain correct while probability mass already flows from particular true classes toward systematic wrong competitors near the decision boundary. In this paper, we formalize this phenomenon as margin-aware error flow and introduce FragileFlow, a plug-in regularizer that uses a calibrated margin buffer to identify correct-but-fragile predictions and organize their off-class probability mass into a class-wise vulnerable-risk matrix. Theoretically, we provide the first PAC-Bayes upper bound for this margin-aware error-flow object, showing how empirical spectral control yields a conservative route to deterministic worst-class robustness under a stability condition. Experiments on multiple-choice LLM benchmarks and few-shot CLIP adaptation show that FragileFlow consistently improves the proposed theory-facing risk measures over matched baselines, yields perturbed worst-class accuracy gains in most settings, and preserves clean accuracy across comparisons.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper formalizes margin-aware error flow as a structured failure mode in which correct predictions in LLMs and VLMs nonetheless exhibit probability mass flowing toward systematic wrong competitors. It introduces FragileFlow, a plug-in regularizer that employs a calibrated margin buffer to identify fragile predictions and organizes off-class mass into a class-wise vulnerable-risk matrix. The central theoretical contribution is a PAC-Bayes upper bound on this error-flow object, which, under an invoked stability condition, is claimed to yield a conservative deterministic guarantee on worst-class robustness via empirical spectral control. Experiments on multiple-choice LLM benchmarks and few-shot CLIP adaptation report consistent gains on theory-facing risk measures, perturbed worst-class accuracy in most settings, and preservation of clean accuracy relative to matched baselines.

Significance. If the derivation and the stability condition can be made rigorous and verifiable, the work supplies a novel lens on robustness that moves beyond average accuracy or consistency metrics toward structured, worst-class guarantees. The combination of a PAC-Bayes bound with a practical spectral regularizer is potentially valuable for safety-critical deployment of foundation models. The reported empirical pattern—improved risk measures without clean-accuracy degradation—is a positive signal, though its reliability cannot yet be assessed from the provided details.

major comments (3)
  1. [Abstract, §3 (theoretical development)] The PAC-Bayes upper bound is presented as the first such bound for the margin-aware error-flow object and as supplying a conservative route to deterministic worst-class robustness under a stability condition. No derivation, explicit statement of the bound, or definition of the stability condition appears in the abstract, and the full text does not clarify whether the bound is independent of quantities fitted during regularization or whether the stability condition is enforced by FragileFlow. This renders the central theoretical claim unverifiable and load-bearing.
  2. [Experiments (§4–5)] The claims of consistent improvement on theory-facing risk measures and perturbed worst-class accuracy gains rest on comparisons whose details are absent—no description of how the regularizer or vulnerable-risk matrix is computed, no error bars, no specification of the stability condition's empirical status post-training, and no ablation isolating the contribution of spectral control. These omissions prevent assessment of whether the reported gains are robust or whether the stability condition actually holds.
  3. [§3 (stability condition), §4 (experiments)] The stability condition is invoked to bridge the PAC-Bayes bound to deterministic worst-class robustness, yet no experiment tests whether the condition is satisfied by models trained with FragileFlow or what occurs under violation. Without such verification, the advertised conservative guarantee cannot be confirmed and the bound may not deliver the claimed robustness.
minor comments (2)
  1. [§2] Notation for the vulnerable-risk matrix and margin-aware error flow should be introduced with explicit equations at first use rather than relying on prose descriptions.
  2. [Abstract, §5] The abstract states that FragileFlow 'preserves clean accuracy across comparisons,' but the experimental tables or figures should report the exact clean-accuracy deltas with standard deviations to substantiate this claim.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed and constructive report. We address each major comment below and describe the revisions we will make to improve clarity, verifiability, and empirical support.

Point-by-point responses
  1. Referee: Abstract and theoretical section: The PAC-Bayes upper bound is presented as the first such bound for the margin-aware error-flow object and as supplying a conservative route to deterministic worst-class robustness under a stability condition. No derivation, explicit statement of the bound, or definition of the stability condition appears in the abstract, and the full text does not clarify whether the bound is independent of quantities fitted during regularization or whether the stability condition is enforced by FragileFlow. This renders the central theoretical claim unverifiable and load-bearing.

    Authors: We agree that the abstract and main theoretical section require an explicit statement of the bound and a clear definition of the stability condition. The derivation appears in the appendix; the bound is constructed to remain independent of regularization parameters, and FragileFlow is designed to enforce the stability condition via the margin buffer and spectral regularizer. In the revision we will (i) add the explicit bound and stability definition to the abstract and §3, (ii) include a concise derivation sketch in the main text, and (iii) state the independence and enforcement properties explicitly. revision: yes

  2. Referee: Experimental section: The claims of consistent improvement on theory-facing risk measures and perturbed worst-class accuracy gains rest on comparisons whose details are absent—no description of how the regularizer or vulnerable-risk matrix is computed, no error bars, no specification of the stability condition's empirical status post-training, and no ablation isolating the contribution of spectral control. These omissions prevent assessment of whether the reported gains are robust or whether the stability condition actually holds.

    Authors: We acknowledge these omissions. The computation of the regularizer and vulnerable-risk matrix is outlined in §4, but we will expand this with explicit formulas, pseudocode, and implementation details. We will also report error bars from multiple random seeds, the post-training empirical value of the stability condition on each benchmark, and an ablation that isolates the spectral-control term. These additions will allow readers to assess robustness of the gains. revision: yes

  3. Referee: §3 and §4: The stability condition is invoked to bridge the PAC-Bayes bound to deterministic worst-class robustness, yet no experiment tests whether the condition is satisfied by models trained with FragileFlow or what occurs under violation. Without such verification, the advertised conservative guarantee cannot be confirmed and the bound may not deliver the claimed robustness.

    Authors: We agree that direct verification of the stability condition is necessary to support the deterministic guarantee. In the revised manuscript we will add an empirical check that measures whether the condition holds for FragileFlow-trained models on the reported benchmarks and discuss the behavior observed when the condition is mildly violated. This will strengthen the connection between the theoretical bound and the experimental results. revision: yes

Circularity Check

0 steps flagged

Derivation chain is self-contained; no circular reductions identified

Full rationale

The paper defines the margin-aware error-flow object and FragileFlow regularizer from first principles using a calibrated margin buffer and class-wise vulnerable-risk matrix. It then states a new PAC-Bayes upper bound on this explicitly defined object and connects the bound to worst-class robustness only under an explicitly invoked stability condition. No provided equations or steps show the bound, the spectral control, or the robustness guarantee reducing to fitted parameters, self-citations, or the conclusion by construction. Experiments are reported as separate empirical checks on risk measures and accuracy, not as inputs that force the theoretical claims. The stability condition is treated as an assumption rather than a derived or fitted property, keeping the chain non-circular.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 2 invented entities

The abstract introduces a calibrated margin buffer and a stability condition whose details and justification are not supplied; the vulnerable-risk matrix and margin-aware error flow appear to be newly postulated constructs without independent evidence outside the paper.

free parameters (1)
  • calibrated margin buffer
    Used to identify correct-but-fragile predictions; its calibration procedure and value are not specified in the abstract and are therefore treated as a free parameter of the method.
axioms (1)
  • domain assumption: Stability condition under which empirical spectral control yields deterministic worst-class robustness
    Invoked to bridge the PAC-Bayes upper bound to the claimed deterministic robustness guarantee.
invented entities (2)
  • margin-aware error flow · no independent evidence
    purpose: To capture the structured probability-mass flow from true classes toward wrong competitors in correct-but-fragile predictions
    Newly formalized object for which the PAC-Bayes bound is derived; no independent evidence outside the paper is mentioned.
  • vulnerable-risk matrix · no independent evidence
    purpose: To organize off-class probability mass in a class-wise manner for spectral control
    Introduced as part of the FragileFlow regularizer; no external validation or prior existence is referenced.

pith-pipeline@v0.9.0 · 5482 in / 1730 out tokens · 84907 ms · 2026-05-12T01:07:51.043801+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

57 extracted references · 57 canonical work pages

  1. [1]

    Better fine-tuning by reducing representational collapse

    Armen Aghajanyan, Akshat Shrivastava, Anchit Gupta, Naman Goyal, Luke Zettlemoyer, and Sonal Gupta. Better fine-tuning by reducing representational collapse. In International Conference on Learning Representations, 2021

  2. [2]

    Enhancing LLM robustness to perturbed instructions: An empirical study

    Aryan Agrawal, Lisa Alazraki, Shahin Honarvar, Thomas Mensink, and Marek Rei. Enhancing LLM robustness to perturbed instructions: An empirical study. In ICLR 2025 Workshop on Building Trust in Language Models and Applications, 2025

  3. [3]

    Stronger generalization bounds for deep nets via a compression approach

    Sanjeev Arora, Rong Ge, Behnam Neyshabur, and Yi Zhang. Stronger generalization bounds for deep nets via a compression approach. In Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 254–263. PMLR, 2018

  4. [4]

    Robustness may be at odds with fairness: An empirical study on class-wise accuracy

    Philipp Benz, Chaoning Zhang, Adil Karjauv, and In So Kweon. Robustness may be at odds with fairness: An empirical study on class-wise accuracy. In Luca Bertinetto, João F. Henriques, Samuel Albanie, Michela Paganini, and Gül Varol, editors, NeurIPS 2020 Workshop on Pre-registration in Machine Learning, volume 148 of Proceedings of Machine Learning Researc...

  5. [5]

    On the opportunities and risks of foundation models

    Rishi Bommasani, Drew A. Hudson, Ehsan Adeli, Russ Altman, Simran Arora, Sydney von Arx, Michael S. Bernstein, Jeannette Bohg, Antoine Bosselut, Emma Brunskill, Erik Brynjolfsson, S. Buch, Dallas Card, Rodrigo Castellon, Niladri S. Chatterji, Annie S. Chen, Kathleen A. Creel, Jared Davis, Dora Demszky, Chris Donahue, Moussa Doumbouya, Esin Durmus, Stefano...

  6. [6]

    Language models are few-shot learners

    Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott G...

  7. [7]

    PAC-Bayesian supervised classification: The thermodynamics of statistical learning

    Olivier Catoni. PAC-Bayesian Supervised Classification: The Thermodynamics of Statistical Learning. Institute of Mathematical Statistics, Beachwood, Ohio, 2007

  8. [8]

    Describing textures in the wild

    Mircea Cimpoi, Subhransu Maji, Iasonas Kokkinos, Sammy Mohamed, and Andrea Vedaldi. Describing textures in the wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3606–3613, 2014

  9. [9]

    Think you have solved question answering? try ARC, the AI2 reasoning challenge, 2018

    Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have solved question answering? try ARC, the AI2 reasoning challenge, 2018

  10. [10]

    Computing nonvacuous generalization bounds for deep (stochastic) neural networks with many more parameters than training data

    Gintare Karolina Dziugaite and Daniel M. Roy. Computing nonvacuous generalization bounds for deep (stochastic) neural networks with many more parameters than training data. In Proceedings of the Thirty-Third Conference on Uncertainty in Artificial Intelligence, 2017

  11. [11]

    Learning generative visual models from few training examples: An incremental Bayesian approach tested on 101 object categories

    Li Fei-Fei, Rob Fergus, and Pietro Perona. Learning generative visual models from few training examples: An incremental bayesian approach tested on 101 object categories. Computer Vision and Image Understanding, 106(1):59–70, 2007

  12. [12]

    Sharpness-aware minimization for efficiently improving generalization

    Pierre Foret, Ariel Kleiner, Hossein Mobahi, and Behnam Neyshabur. Sharpness-aware minimization for efficiently improving generalization. In International Conference on Learning Representations, 2021

  13. [13]

    Few-shot adversarial low-rank fine-tuning of vision-language models, 2025

    Sajjad Ghiasvand, Haniyeh Ehsani Oskouie, Mahnoosh Alizadeh, and Ramtin Pedarsani. Few-shot adversarial low-rank fine-tuning of vision-language models, 2025

  14. [14]

    Explaining and harnessing adversarial examples

    Ian J. Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. In Yoshua Bengio and Yann LeCun, editors, 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, 2015

  15. [15]

    Evaluating concurrent robustness of language models across diverse challenge sets

    Vatsal Gupta, Pranshu Pandya, Tushar Kataria, Vivek Gupta, and Dan Roth. Evaluating concurrent robustness of language models across diverse challenge sets. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 22162–22184, Miami, Florida, USA, 2024. Association for Computational Linguistics

  16. [16]

    LoRA: Low-rank adaptation of large language models

    Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models. In International Conference on Learning Representations, 2022

  17. [17]

    Tapas are free! training-free adaptation of programmatic agents via llm-guided program synthesis in dynamic environments

    Jinwei Hu, Yi Dong, Youcheng Sun, and Xiaowei Huang. Tapas are free! training-free adaptation of programmatic agents via llm-guided program synthesis in dynamic environments. Proceedings of the AAAI Conference on Artificial Intelligence, 40(35):29477–29485, Mar. 2026

  18. [18]

    Lying with truths: Open-channel multi-agent collusion for belief manipulation via generative montage

    Jinwei Hu, Xinmiao Huang, Youcheng Sun, Yi Dong, and Xiaowei Huang. Lying with truths: Open-channel multi-agent collusion for belief manipulation via generative montage. arXiv preprint arXiv:2601.01685, 2026

  19. [19]

    Falcon: Fine-grained activation manipulation by contrastive orthogonal unalignment for large language model

    Jinwei Hu, Zhenglin Huang, Xiangyu Yin, Wenjie Ruan, Guangliang Cheng, Yi Dong, and Xiaowei Huang. Falcon: Fine-grained activation manipulation by contrastive orthogonal unalignment for large language model. In D. Belgrave, C. Zhang, H. Lin, R. Pascanu, P. Koniusz, M. Ghassemi, and N. Chen, editors, Advances in Neural Information Processing Systems, volume...

  20. [20]

    Mistral 7B

    Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, Lélio Renard Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, and William El Sayed. Mistral 7B, 2023

  21. [21]

    SMART: Robust and efficient fine-tuning for pre-trained natural language models through principled regularized optimization

    Haoming Jiang, Pengcheng He, Weizhu Chen, Xiaodong Liu, Jianfeng Gao, and Tuo Zhao. SMART: Robust and efficient fine-tuning for pre-trained natural language models through principled regularized optimization. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 2177–2190. Association for Computational Linguist...

  22. [22]

    Enhancing robust fairness via confusional spectral regularization

    Gaojie Jin, Sihao Wu, Jiaxu Liu, Tianjin Huang, and Ronghui Mu. Enhancing robust fairness via confusional spectral regularization. In The Thirteenth International Conference on Learning Representations, 2025

  23. [23]

    Wat: improve the worst-class robustness in adversarial training

    Boqi Li and Weiwei Liu. Wat: improve the worst-class robustness in adversarial training. In Proceedings of the Thirty-Seventh AAAI Conference on Artificial Intelligence and Thirty-Fifth Conference on Innovative Applications of Artificial Intelligence and Thirteenth Symposium on Educational Advances in Artificial Intelligence, AAAI’23/IAAI’23/EAAI’23. AAAI...

  24. [24]

    Where do prompt perturbations break generation? a segment-level view of robustness in lora-tuned language models, 2026

    Zhuoyun Li, Boxuan Wang, Jinwei Hu, Zhenglin Huang, Qisong He, Xinmiao Huang, Guangliang Cheng, Xiaowei Huang, and Yi Dong. Where do prompt perturbations break generation? a segment-level view of robustness in lora-tuned language models, 2026

  25. [25]

    PAC-tuning: Fine-tuning pre-trained language models with PAC-driven perturbed gradient descent

    Guangliang Liu, Zhiyu Xue, Xitong Zhang, Kristen Johnson, and Rongrong Wang. PAC-tuning: Fine-tuning pre-trained language models with PAC-driven perturbed gradient descent. In Houda Bouamor, Juan Pino, and Kalika Bali, editors, Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 12178–12189, Singapore, December 202...

  26. [26]

    PAC-bayes compression bounds so tight that they can explain generalization

    Sanae Lotfi, Marc Anton Finzi, Sanyam Kapoor, Andres Potapczynski, Micah Goldblum, and Andrew Gordon Wilson. PAC-bayes compression bounds so tight that they can explain generalization. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho, editors, Advances in Neural Information Processing Systems, 2022

  27. [27]

    Non-vacuous generalization bounds for large language models

    Sanae Lotfi, Marc Anton Finzi, Yilun Kuang, Tim G. J. Rudner, Micah Goldblum, and Andrew Gordon Wilson. Non-vacuous generalization bounds for large language models. In Proceedings of the 41st International Conference on Machine Learning, volume 235 of Proceedings of Machine Learning Research, pages 32801–32818. PMLR, 2024

  28. [28]

    Towards deep learning models resistant to adversarial attacks

    Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards deep learning models resistant to adversarial attacks. In International Conference on Learning Representations, 2018

  29. [29]

    Understanding zero-shot adversarial robustness for large-scale models

    Chengzhi Mao, Scott Geng, Junfeng Yang, Xin Wang, and Carl Vondrick. Understanding zero-shot adversarial robustness for large-scale models. In The Eleventh International Conference on Learning Representations, 2023

  30. [30]

    PAC-Bayesian model averaging

    David A. McAllester. PAC-Bayesian model averaging. In Proceedings of the Twelfth Annual Conference on Computational Learning Theory, COLT ’99, pages 164–170, New York, NY, USA, 1999. ACM

  31. [31]

    Mistral-7B-Instruct-v0.2

    Mistral AI. Mistral-7B-Instruct-v0.2. https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2, 2023. Hugging Face model card

  32. [32]

    TextAttack: A framework for adversarial attacks, data augmentation, and adversarial training in NLP

    John X. Morris, Eli Lifland, Jin Yong Yoo, Jake Grigsby, Di Jin, and Yanjun Qi. TextAttack: A framework for adversarial attacks, data augmentation, and adversarial training in NLP. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 119–126. Association for Computational Linguistics, 2020

  33. [33]

    Deterministic PAC-bayesian generalization bounds for deep networks via generalizing noise-resilience

    Vaishnavh Nagarajan and Zico Kolter. Deterministic PAC-Bayesian generalization bounds for deep networks via generalizing noise-resilience. In International Conference on Learning Representations, 2019

  34. [34]

    A PAC-Bayesian approach to spectrally-normalized margin bounds for neural networks

    Behnam Neyshabur, Srinadh Bhojanapalli, David McAllester, and Nathan Srebro. A PAC-Bayesian approach to spectrally-normalized margin bounds for neural networks. In International Conference on Learning Representations, 2018

  35. [35]

    Towards calibrated robust fine-tuning of vision-language models

    Changdae Oh, Hyesu Lim, Mijoo Kim, Dongyoon Han, Sangdoo Yun, Jaegul Choo, Alexander Hauptmann, Zhi-Qi Cheng, and Kyungwoo Song. Towards calibrated robust fine-tuning of vision-language models. In Advances in Neural Information Processing Systems, volume 37, 2024

  36. [36]

    Cats and dogs

    Omkar M. Parkhi, Andrea Vedaldi, Andrew Zisserman, and C. V. Jawahar. Cats and dogs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3498–3505, 2012

  37. [37]

    Prompt perturbation consistency learning for robust language models

    Yao Qiang, Subhrangshu Nandi, Ninareh Mehrabi, Greg Ver Steeg, Anoop Kumar, Anna Rumshisky, and Aram Galstyan. Prompt perturbation consistency learning for robust language models. In Findings of the Association for Computational Linguistics: EACL 2024, pages 1357–1370, St. Julian’s, Malta, 2024. Association for Computational Linguistics

  38. [38]

    Qwen2.5-1.5B-Instruct

    Qwen Team. Qwen2.5-1.5B-Instruct. https://huggingface.co/Qwen/Qwen2.5-1.5B-Instruct, 2024. Hugging Face model card

  39. [39]

    Qwen2.5 technical report, 2025

    Qwen Team. Qwen2.5 technical report, 2025

  40. [40]

    Learning transferable visual models from natural language supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision. In Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings...

  41. [41]

    Beyond accuracy: Behavioral testing of NLP models with CheckList

    Marco Tulio Ribeiro, Tongshuang Wu, Carlos Guestrin, and Sameer Singh. Beyond accuracy: Behavioral testing of NLP models with CheckList. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4902–4912. Association for Computational Linguistics, 2020

  42. [42]

    Robust CLIP: Unsupervised adversarial fine-tuning of vision embeddings for robust large vision-language models

    Christian Schlarmann, Naman Deep Singh, Francesco Croce, and Matthias Hein. Robust CLIP: Unsupervised adversarial fine-tuning of vision embeddings for robust large vision-language models. In Ruslan Salakhutdinov, Zico Kolter, Katherine Heller, Adrian Weller, Nuria Oliver, Jonathan Scarlett, and Felix Berkenkamp, editors, Proceedings of the 41st Internatio...

  43. [43]

    Improving robust fairness via balance adversarial training

    Chunyu Sun, Chenye Xu, Chengyuan Yao, Siyuan Liang, Yichao Wu, Ding Liang, Xianglong Liu, and Aishan Liu. Improving robust fairness via balance adversarial training. In Proceedings of the Thirty-Seventh AAAI Conference on Artificial Intelligence and Thirty-Fifth Conference on Innovative Applications of Artificial Intelligence and Thirteenth Symposium on Ed...

  44. [44]

    CommonsenseQA: A question answering challenge targeting commonsense knowledge

    Alon Talmor, Jonathan Herzig, Nicholas Lourie, and Jonathan Berant. CommonsenseQA: A question answering challenge targeting commonsense knowledge. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4149–4158. Association...

  45. [45]

    Analysis and applications of class-wise robustness in adversarial training

    Qi Tian, Kun Kuang, Kelu Jiang, Fei Wu, and Yisen Wang. Analysis and applications of class-wise robustness in adversarial training. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining, KDD ’21, pages 1561–1570, New York, NY, USA, 2021. Association for Computing Machinery.

  47. [47]

    Adversarial GLUE: A multi-task benchmark for robustness evaluation of language models

    Boxin Wang, Chejian Xu, Shuohang Wang, Zhe Gan, Yu Cheng, Jianfeng Gao, Ahmed Awadallah, and Bo Li. Adversarial GLUE: A multi-task benchmark for robustness evaluation of language models. In J. Vanschoren and S. Yeung, editors, Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks, volume 1, 2021.

  48. [48]

    Tapt: Test-time adversarial prompt tuning for robust inference in vision-language models

    Xin Wang, Kai Chen, Jiaming Zhang, Jingjing Chen, and Xingjun Ma. Tapt: Test-time adversarial prompt tuning for robust inference in vision-language models. In 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 19910–19920, 2025.

  49. [49]

    Improving robust generalization by direct PAC-Bayesian bound minimization

    Zifan Wang, Nan Ding, Tomer Levinboim, Xi Chen, and Radu Soricut. Improving robust generalization by direct PAC-Bayesian bound minimization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16458–16468, 2023.

  50. [50]

    CFA: Class-wise calibrated fair adversarial training

    Zeming Wei, Yifei Wang, Yiwen Guo, and Yisen Wang. CFA: Class-wise calibrated fair adversarial training. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8193–8201, 2023.

  51. [51]

    Adversarial weight perturbation helps robust generalization

    Dongxian Wu, Shu-Tao Xia, and Yisen Wang. Adversarial weight perturbation helps robust generalization. In H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems, volume 33, pages 2958–2969. Curran Associates, Inc., 2020.

  52. [52]

    To be robust or to be fair: Towards fairness in adversarial training

    Han Xu, Xiaorui Liu, Yaxin Li, Anil K. Jain, and Jiliang Tang. To be robust or to be fair: Towards fairness in adversarial training. In Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pages 11492–11501. PMLR, 2021.

  53. [53]

    Theoretically principled trade-off between robustness and accuracy

    Hongyang Zhang, Yaodong Yu, Jiantao Jiao, Eric Xing, Laurent El Ghaoui, and Michael I. Jordan. Theoretically principled trade-off between robustness and accuracy. In Kamalika Chaudhuri and Ruslan Salakhutdinov, editors, Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 7472–. PMLR, 09–15 Jun 2019.

  55. [55]

    Adversarial prompt tuning for vision-language models

    Jiaming Zhang, Xingjun Ma, Xin Wang, Lingyu Qiu, Jiaqi Wang, Yu-Gang Jiang, and Jitao Sang. Adversarial prompt tuning for vision-language models. In Computer Vision – ECCV 2024: 18th European Conference, Milan, Italy, September 29–October 4, 2024, Proceedings, Part XLV, pages 56–72, Berlin, Heidelberg, 2024. Springer-Verlag.

  56. [56]

    Freelb: Enhanced adversarial training for natural language understanding

    Chen Zhu, Yu Cheng, Zhe Gan, Siqi Sun, Tom Goldstein, and Jingjing Liu. Freelb: Enhanced adversarial training for natural language understanding. In International Conference on Learning Representations, 2020.

  57. [57]

    Promptrobust: Towards evaluating the robustness of large language models on adversarial prompts

    Kaijie Zhu, Jindong Wang, Jiaheng Zhou, Zichen Wang, Hao Chen, Yidong Wang, Linyi Yang, Wei Ye, Yue Zhang, Neil Gong, and Xing Xie. Promptrobust: Towards evaluating the robustness of large language models on adversarial prompts. In Proceedings of the 1st ACM Workshop on Large AI Systems and Models with Privacy and Safety Analysis, LAMPS ’24, pages 57–68, Ne...