Latent adversarial training improves robustness to persistent harmful behaviors in LLMs. Transactions on Machine Learning Research (07/2025)

Abhay Sheshadri, Aidan Ewart, Phillip Guo, Aengus Lynch, Cindy Wu, Vivek Hebbar, Henry Sleight, Asa Cooper Stickland, Ethan Perez, Dylan Hadfield-Menell, et al. · 2025

1 Pith paper cites this work. Polarity classification is still indexing.

representative citing papers

Patching LLM Like Software: A Lightweight Method for Improving Safety Policy in Large Language Models

cs.AI · 2025-11-11 · unverdicted · novelty 5.0

Prepending a compact learnable prefix to LLMs produces safety gains comparable to next-generation aligned models while preserving fluency and adding negligible parameters.

citing papers explorer

Showing 1 of 1 citing paper.

Patching LLM Like Software: A Lightweight Method for Improving Safety Policy in Large Language Models

cs.AI · 2025-11-11 · unverdicted · none · ref 15

Prepending a compact learnable prefix to LLMs produces safety gains comparable to next-generation aligned models while preserving fluency and adding negligible parameters.
