Route to Rome Attack: Directing LLM Routers to Expensive Models via Adversarial Suffix Optimization
Pith reviewed 2026-05-10 10:54 UTC · model grok-4.3
The pith
Adversarial suffixes optimized on an ensemble surrogate can force black-box LLM routers to select expensive models.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
R²A constructs a hybrid ensemble surrogate router from multiple open-source models to replicate the behavior of an inaccessible black-box router, then adapts a suffix-optimization algorithm to produce adversarial strings that raise the probability that the black-box router dispatches queries to expensive, high-capability models.
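The mechanism can be sketched as a greedy search over suffix tokens that maximizes an averaged routing probability across surrogate routers. The keyword-based scoring functions, tiny vocabulary, and greedy substitution loop below are illustrative assumptions, not the paper's released implementation:

```python
import random

# Toy stand-ins for the surrogate routers: each maps a query string to the
# probability of routing to the expensive model. The keyword scoring and the
# vocabulary are hypothetical, chosen only to make the loop runnable.
def surrogate_a(text):
    return min(1.0, 0.1 + 0.05 * text.count("prove"))

def surrogate_b(text):
    return min(1.0, 0.1 + 0.04 * text.count("rigorous"))

SURROGATES = [surrogate_a, surrogate_b]
VOCAB = ["prove", "rigorous", "the", "a", "step"]

def ensemble_score(text):
    # Hybrid-ensemble aggregation: average the expensive-route probability
    # across all surrogate routers.
    return sum(s(text) for s in SURROGATES) / len(SURROGATES)

def optimize_suffix(query, suffix_len=8, iters=200, seed=0):
    # Greedy token-substitution search over the suffix, accepting a swap
    # only when it raises the ensemble's expensive-route probability.
    rng = random.Random(seed)
    suffix = [rng.choice(VOCAB) for _ in range(suffix_len)]
    best = ensemble_score(query + " " + " ".join(suffix))
    for _ in range(iters):
        cand = list(suffix)
        cand[rng.randrange(suffix_len)] = rng.choice(VOCAB)
        score = ensemble_score(query + " " + " ".join(cand))
        if score > best:
            suffix, best = cand, score
    return " ".join(suffix), best

suffix, score = optimize_suffix("What is 2+2?")
```

The real attack would replace the toy scorers with open-source router models and the random swaps with gradient-guided candidate selection, but the accept-if-better loop over an ensemble-averaged objective is the shape of the method.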
What carries the argument
The hybrid ensemble surrogate router built from open-source models, which approximates the black-box router's routing decisions closely enough that suffix optimization can be performed without direct access to the target.
Load-bearing premise
A collection of open-source models can be combined into an ensemble that reproduces the routing choices of the unknown commercial or black-box router closely enough for attacks optimized against the surrogate to transfer.
What would settle it
Running the generated suffixes on a production router whose decision boundary differs substantially from all models in the surrogate ensemble and observing no measurable rise in expensive-model routing rate.
Original abstract
Cost-aware routing dynamically dispatches user queries to models of varying capability to balance performance and inference cost. However, the routing strategy introduces a new security concern that adversaries may manipulate the router to consistently select expensive high-capability models. Existing routing attacks depend on either white-box access or heuristic prompts, rendering them ineffective in real-world black-box scenarios. In this work, we propose R²A, which aims to mislead black-box LLM routers to expensive models via adversarial suffix optimization. Specifically, R²A deploys a hybrid ensemble surrogate router to mimic the black-box router. A suffix optimization algorithm is further adapted for the ensemble-based surrogate. Extensive experiments on multiple open-source and commercial routing systems demonstrate that R²A significantly increases the routing rate to expensive models on queries of different distributions. Code and examples: https://github.com/thcxiker/R2A-Attack.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes R²A, an attack on cost-aware LLM routers that constructs a hybrid ensemble surrogate from open-source models, optimizes adversarial suffixes against this surrogate, and claims the resulting suffixes transfer to black-box commercial routers, significantly increasing the rate at which queries are routed to expensive high-capability models across varied query distributions.
Significance. If the transfer results are robustly demonstrated, the work would establish a practical black-box attack vector on production LLM routing systems, highlighting a previously under-explored security risk in cost-optimization infrastructure. The hybrid-surrogate approach is a standard adversarial-ML technique applied here to a new target, but its value depends on verifiable transfer.
Major comments (3)
- [Experiments / Method (surrogate construction and transfer evaluation)] The central claim of successful black-box transfer rests on the unverified assumption that the hybrid ensemble surrogate sufficiently approximates the unknown commercial routers. No quantitative surrogate-fidelity metric (e.g., per-router agreement rate on routing decisions over a held-out query set) is reported, nor is there an ablation isolating transfer success from coincidental prompt effects. This directly undermines the ability to attribute observed routing-rate increases to the optimized suffixes rather than query distribution alone.
- [Experiments section] The abstract and results claim 'significantly increases the routing rate' on multiple systems, yet the manuscript provides no baselines, statistical details (confidence intervals, p-values, number of trials), or failure-case analysis. Without these, it is impossible to determine effect size or reproducibility of the reported increases.
- [Method (suffix optimization algorithm)] The suffix-optimization procedure is adapted for the ensemble surrogate, but the paper does not specify how the ensemble loss is aggregated (e.g., majority vote, averaged logits, or routing-probability product) or whether the optimization is performed jointly or sequentially across surrogate members. This detail is load-bearing for reproducibility of the attack.
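The surrogate-fidelity metric asked for in the first major comment (per-router agreement rate on a held-out query set) is simple to compute once routing decisions are logged; a minimal sketch, with hypothetical decision logs standing in for real router outputs:

```python
def agreement_rate(surrogate_decisions, target_decisions):
    """Fraction of held-out queries on which the surrogate ensemble and the
    black-box target router select the same model tier."""
    assert len(surrogate_decisions) == len(target_decisions)
    matches = sum(s == t for s, t in zip(surrogate_decisions, target_decisions))
    return matches / len(surrogate_decisions)

# Hypothetical logged tier decisions ("cheap"/"expensive") per held-out query.
surrogate = ["cheap", "expensive", "cheap", "expensive", "cheap"]
target = ["cheap", "expensive", "expensive", "expensive", "cheap"]
rate = agreement_rate(surrogate, target)  # agreement on 4 of 5 queries
```

Reporting this rate per commercial router would let readers judge how much of the observed transfer is explained by surrogate fidelity.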
Minor comments (2)
- [Abstract / Introduction] The notation R$^2$A is introduced without an explicit expansion or acronym definition in the abstract or introduction.
- [Experiments] The GitHub link is provided but the manuscript does not state whether the released code includes the exact surrogate configurations, optimization hyperparameters, and evaluation scripts used for the commercial-router experiments.
Simulated Authors' Rebuttal
We thank the referee for the constructive feedback on surrogate fidelity, statistical rigor, and methodological clarity. We address each major comment below and will revise the manuscript accordingly to improve verifiability and reproducibility.
Point-by-point responses
-
Referee: [Experiments / Method (surrogate construction and transfer evaluation)] The central claim of successful black-box transfer rests on the unverified assumption that the hybrid ensemble surrogate sufficiently approximates the unknown commercial routers. No quantitative surrogate-fidelity metric (e.g., per-router agreement rate on routing decisions over a held-out query set) is reported, nor is there an ablation isolating transfer success from coincidental prompt effects. This directly undermines the ability to attribute observed routing-rate increases to the optimized suffixes rather than query distribution alone.
Authors: We agree that a quantitative surrogate-fidelity metric and ablation would strengthen attribution of the transfer results. In the revised manuscript, we will add per-router agreement rates computed on a held-out query set for each commercial router, along with an ablation study comparing routing rates under original queries, random suffixes, and our optimized suffixes. This will help isolate the contribution of the adversarial suffixes. revision: yes
-
Referee: [Experiments section] The abstract and results claim 'significantly increases the routing rate' on multiple systems, yet the manuscript provides no baselines, statistical details (confidence intervals, p-values, number of trials), or failure-case analysis. Without these, it is impossible to determine effect size or reproducibility of the reported increases.
Authors: We acknowledge that additional statistical details and baselines are necessary for assessing effect size and reproducibility. We will incorporate baseline comparisons (original queries and heuristic prompts), results aggregated over multiple independent trials (with explicit trial counts), 95% confidence intervals, p-values from appropriate statistical tests, and a dedicated failure-case analysis in the revised experiments section. revision: yes
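The significance testing the authors commit to here amounts to a standard two-proportion z-test on routing counts; a hedged sketch using the normal approximation, where the 120/500 baseline and 310/500 attack counts are illustrative placeholders, not figures from the paper:

```python
import math

def two_proportion_test(k_base, n_base, k_atk, n_atk):
    """One-sided z-test for a rise in expensive-model routing rate between
    baseline queries and suffixed queries (normal approximation)."""
    p1, p2 = k_base / n_base, k_atk / n_atk
    pooled = (k_base + k_atk) / (n_base + n_atk)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_base + 1 / n_atk))
    z = (p2 - p1) / se
    # One-sided p-value from the standard normal CDF via erf.
    p_value = 1 - 0.5 * (1 + math.erf(z / math.sqrt(2)))
    return p2 - p1, z, p_value

# Hypothetical counts: 120/500 expensive routings at baseline,
# 310/500 under the optimized suffixes.
delta, z, p = two_proportion_test(120, 500, 310, 500)
```

Confidence intervals for the per-router rate difference follow from the same quantities (delta ± 1.96 · se, using the unpooled standard error).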
-
Referee: [Method (suffix optimization algorithm)] The suffix-optimization procedure is adapted for the ensemble surrogate, but the paper does not specify how the ensemble loss is aggregated (e.g., majority vote, averaged logits, or routing-probability product) or whether the optimization is performed jointly or sequentially across surrogate members. This detail is load-bearing for reproducibility of the attack.
Authors: We appreciate the call for explicit details on the ensemble procedure. The loss is aggregated by averaging the routing probabilities across surrogate models, and optimization is performed jointly via gradients through the full ensemble. We will expand the Method section with a precise description of the aggregation, joint optimization steps, and pseudocode to ensure full reproducibility. revision: yes
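Taking the response at face value (probability averaging across surrogates, joint optimization), the ensemble objective for a candidate suffix would be the negative log of the mean expensive-route probability; a minimal sketch with hypothetical per-surrogate probabilities:

```python
import math

def joint_loss(probs_per_surrogate):
    """Negative log of the average probability, across surrogate routers,
    that query-plus-suffix is routed to the expensive model. Minimizing
    this jointly pushes all surrogates toward the expensive route."""
    mean_p = sum(probs_per_surrogate) / len(probs_per_surrogate)
    return -math.log(mean_p)

# Hypothetical per-surrogate routing probabilities for two candidate suffixes.
loss_weak = joint_loss([0.2, 0.3, 0.25])    # suffix barely shifts the routers
loss_strong = joint_loss([0.9, 0.8, 0.85])  # suffix strongly shifts them
```

In the gradient-based setting the same quantity would be differentiated through every surrogate at once, which is what distinguishes joint optimization from attacking each surrogate member sequentially.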
Circularity Check
No circularity: empirical attack construction with independent surrogate and experimental validation
Full rationale
The paper describes an empirical attack method (R²A) that builds a hybrid ensemble surrogate from open-source models and optimizes adversarial suffixes against it before testing transfer to black-box routers. No equations, derivations, fitted parameters, or first-principles claims are present that reduce to the inputs by construction. The central result is measured via direct experiments on routing rates across query distributions and router types; the surrogate is constructed independently rather than defined in terms of the target outcome. Self-citations, if any, are not load-bearing for any derivation. This is a standard empirical security paper whose validity rests on experimental evidence rather than tautological reduction.
Axiom & Free-Parameter Ledger
Axioms (1)
- Domain assumption: a hybrid ensemble surrogate router can mimic the black-box router closely enough for adversarial suffixes optimized on the surrogate to transfer.
Forward citations
Cited by 1 Pith paper
-
Misrouter: Exploiting Routing Mechanisms for Input-Only Attacks on Mixture-of-Experts LLMs
Misrouter enables input-only attacks on MoE LLMs by optimizing queries on open-source surrogates to route toward weakly aligned experts and transferring them to public APIs.
Reference graph
Works this paper leans on
- [1]
- [2]
- [3] Pranjal Aggarwal, Aman Madaan, Ankit Anand, Srividya Pranavi Potharaju, Swaroop Mishra, Pei Zhou, Aditya Gupta, Dheeraj Rajagopal, Karthik Kappaganthu, Yiming Yang, Shyam Upadhyay, Manaal Faruqui, and Mausam. 2025. AutoMix: Automatically mixing language models. Preprint, arXiv:2310.12963. https://arxiv.org/abs/2310.12963
- [4] Ge Bai, Jie Liu, Xingyuan Bu, Yancheng He, Jiaheng Liu, Zhanhui Zhou, Zhuoran Lin, Wenbo Su, Tiezheng Ge, Bo Zheng, and Wanli Ouyang. 2024. MT-Bench-101: A fine-grained benchmark for evaluating large language models in multi-turn dialogues. In Proceedings of the 62nd Annual Meeting of the Association for Com... https://doi.org/10.18653/v1/2024.acl-long.401
- [5] Xiaofan Bai, Pingyi Hu, Xiaojing Ma, Linchen Yu, Dongmei Zhang, Qi Zhang, and Bin Benjamin Zhu. 2025. ESF: Efficient sensitive fingerprinting for black-box tamper detection of large language models. In Findings of the Association for Computational Linguistics: ACL 2025, pages 10477--10494, Vienna, Austri... https://doi.org/10.18653/v1/2025.findings-acl.546
- [6] Lingjiao Chen, Matei Zaharia, and James Zou. 2024a. FrugalGPT: How to use large language models while reducing cost and improving performance. Transactions on Machine Learning Research.
- [7] Shuhao Chen, Weisen Jiang, Baijiong Lin, James Kwok, and Yu Zhang. 2024b. RouterDC: Query-based router by dual contrastive learning for assembling large language models. Advances in Neural Information Processing Systems, 37:66305--66328.
- [8] Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. 2021. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168.
- [9] Enyan Dai and Suhang Wang. 2022. Learning fair graph neural networks with limited and private sensitive attribute information. IEEE Transactions on Knowledge and Data Engineering, 35(7):7103--7117.
- [10] Dujian Ding, Ankur Mallick, Chi Wang, Robert Sim, Subhabrata Mukherjee, Victor Rühle, Laks V. S. Lakshmanan, and Ahmed Hassan Awadallah. 2024. Hybrid LLM: Cost-efficient and quality-aware query routing. In The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11... https://openreview.net/forum?id=02f3mUtqnM
- [11] Yinpeng Dong, Fangzhou Liao, Tianyu Pang, Hang Su, Jun Zhu, Xiaolin Hu, and Jianguo Li. 2018. Boosting adversarial attacks with momentum. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
- [12] Tao Feng, Yanzhen Shen, and Jiaxuan You. 2024. GraphRouter: A graph-based router for LLM selections. In The Thirteenth International Conference on Learning Representations.
- [13]
- [14] Evan Frick, Connor Chen, Joseph Tennyson, Tianle Li, Wei-Lin Chiang, Anastasios N. Angelopoulos, and Ion Stoica. 2025. Prompt-to-leaderboard. Preprint, arXiv:2502.14855. https://arxiv.org/abs/2502.14855
- [15] Zirui Guo, Lianghao Xia, Yanhua Yu, Tu Ao, and Chao Huang. 2025. LightRAG: Simple and fast retrieval-augmented generation. In Findings of the Association for Computational Linguistics: EMNLP 2025.
- [16] Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. 2021. Measuring massive multitask language understanding. Proceedings of the International Conference on Learning Representations (ICLR).
- [17] Edward J Hu, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen, et al. 2022. LoRA: Low-rank adaptation of large language models. In International Conference on Learning Representations (ICLR).
- [18] Qitian Jason Hu, Jacob Bieker, Xiuyu Li, Nan Jiang, Benjamin Keigwin, Gaurav Ranganath, Kurt Keutzer, and Shriyash Kaustubh Upadhyay. 2024. RouterBench: A benchmark for multi-LLM routing system. In Agentic Markets Workshop at ICML 2024. https://openreview.net/forum?id=IVXmV8Uxwh
- [19] Yangsibo Huang, Milad Nasr, Anastasios Nikolas Angelopoulos, Nicholas Carlini, Wei-Lin Chiang, Christopher A. Choquette-Choo, Daphne Ippolito, Matthew Jagielski, Katherine Lee, Ken Liu, Ion Stoica, Florian Tramèr, and Chiyuan Zhang. 2025a. Exploring and mitigating adversarial manipulation of voting-based le... https://openreview.net/forum?id=zf9zwCRKyP
- [20] Zhongzhan Huang, Guoming Ling, Yupei Lin, Yandong Chen, Shanshan Zhong, Hefeng Wu, and Liang Lin. 2025b. RouterEval: A comprehensive benchmark for routing LLMs to explore model-level scaling up in LLMs. In Findings of the Association for Computational Linguistics: EMNLP 2025, pages 3860--3887, Su... https://doi.org/10.18653/v1/2025.findings-emnlp.208
- [21] Dongfu Jiang, Xiang Ren, and Bill Yuchen Lin. 2023. LLM-Blender: Ensembling large language models with pairwise comparison and generative fusion. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (ACL 2023).
- [22] Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. 2020. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361.
- [23]
- [24] Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. 2022. Large language models are zero-shot reasoners. Advances in Neural Information Processing Systems (NeurIPS).
- [25] Chenao Li, Shuo Yan, and Enyan Dai. 2025a. UniZyme: A unified protein cleavage site predictor enhanced with enzyme active-site knowledge. In Advances in Neural Information Processing Systems (NeurIPS). https://openreview.net/forum?id=5cgm5dV5hr
- [26] Tianle Li, Wei-Lin Chiang, Evan Frick, Lisa Dunlap, Tianhao Wu, Banghua Zhu, Joseph E. Gonzalez, and Ion Stoica. 2025b. From crowdsourced data to high-quality benchmarks: Arena-Hard and BenchBuilder pipeline. In Forty-second International Conference on Machine Learning. https://openreview.net/forum?id=KfTf9vFvSn
- [27] Minhua Lin, Enyan Dai, Junjie Xu, Jinyuan Jia, Xiang Zhang, and Suhang Wang. 2025a. Stealing training graphs from graph neural networks. In Proceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V.1, pages 777--788.
- [28]
- [29] Yanpei Liu, Xinyun Chen, Chang Liu, and Dawn Song. 2017. Delving into transferable adversarial examples and black-box attacks. In International Conference on Learning Representations (ICLR).
- [30] Keming Lu, Hongyi Yuan, Runji Lin, Junyang Lin, Zheng Yuan, Chang Zhou, and Jingren Zhou. 2024. Routing to the expert: Efficient reward-guided ensemble of large language models. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Langua... https://doi.org/10.18653/v1/2024.naacl-long.109
- [31]
- [32] Hope McGovern, Rickard Stureborg, Yoshi Suhara, and Dimitris Alikaniotis. 2025. Your large language models are leaving fingerprints. In Proceedings of the 1st Workshop on GenAI Content Detection (GenAIDetect), pages 85--95, Abu Dhabi, UAE. International Conference on Computational Linguistics. https://aclanthology.org/2025.genaidetect-1.6/
- [33] Isaac Ong, Amjad Almahairi, Vincent Wu, Wei-Lin Chiang, Tianhao Wu, Joseph E. Gonzalez, M Waleed Kadous, and Ion Stoica. 2025. RouteLLM: Learning to route LLMs from preference data. In The Thirteenth International Conference on Learning Representations. https://openreview.net/forum?id=8sSqNntaMr
- [34] Alexander Robey, Eric Wong, Hamed Hassani, and George J. Pappas. 2025. SmoothLLM: Defending large language models against jailbreaking attacks. Transactions on Machine Learning Research, 2025. https://openreview.net/forum?id=laPAh2hRFC
- [35] Avital Shafran, Roei Schuster, Tom Ristenpart, and Vitaly Shmatikov. 2025. Rerouting LLM routers. In Conference on Language Modeling (COLM).
- [36] Chenxu Wang, Hao Li, Yiqun Zhang, Linyao Chen, Jianhao Chen, Ping Jian, Peng Ye, Qiaosheng Zhang, and Shuyue Hu. 2025. ICL-Router: In-context learned model representations for LLM routing. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI). Poster. https://arxiv.org/abs/2510.09719
- [37] Wenhui Wang, Furu Wei, Li Dong, Hangbo Bao, Nan Yang, and Ming Zhou. 2020. MiniLM: Deep self-attention distillation for task-agnostic compression of pre-trained transformers. In Proceedings of the 34th International Conference on Neural Information Processing Systems, NIPS '20, Red Hook, NY, USA. Curran Associates Inc.
- [38]
- [39]
- [40] Yiqun Zhang, Hao Li, Jianhao Chen, Hangfan Zhang, Peng Ye, Lei Bai, and Shuyue Hu. 2025. Beyond GPT-5: Making LLMs cheaper and better via performance-efficiency optimized routing. In Proceedings of the 2025 7th International Conference on Distributed Artificial Intelligence, DAI '25, pages 122--129, New York, NY, USA... https://doi.org/10.1145/3772429.3772445
- [41] Zesen Zhao, Shuowei Jin, and Z. Morley Mao. 2024. Eagle: Efficient training-free router for multi-LLM inference. Preprint, arXiv:2409.15518. https://arxiv.org/abs/2409.15518
- [42] Richard Zhuang, Tianhao Wu, Zhaojin Wen, Andrew Li, Jiantao Jiao, and Kannan Ramchandran. 2025. EmbedLLM: Learning compact representations of large language models. In The Thirteenth International Conference on Learning Representations. https://openreview.net/forum?id=Fs9EabmQrJ
- [43] Andy Zou, Zifan Wang, Nicholas Carlini, Milad Nasr, J. Zico Kolter, and Matt Fredrikson. 2023. Universal and transferable adversarial attacks on aligned language models. Preprint, arXiv:2307.15043. https://arxiv.org/abs/2307.15043