Benchmarking Misuse Mitigation Against Covert Adversaries

Alexander Robey; Davis Brown; Eric Wong; George J. Pappas; Hamed Hassani; Luze Sun; Mahdi Sabbaghi

arxiv: 2506.06414 · v2 · submitted 2025-06-06 · 💻 cs.CR · cs.AI

Davis Brown, Mahdi Sabbaghi, Luze Sun, Alexander Robey, George J. Pappas, Eric Wong, Hamed Hassani

Pith reviewed 2026-05-19 10:25 UTC · model grok-4.3

keywords: language model safety · covert attacks · decomposition attacks · stateful defenses · misuse mitigation · adversarial benchmarking · query fragmentation

The pith

Decomposition attacks enable covert misuse of language models by breaking dangerous tasks into many separate benign queries.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a data generation pipeline to benchmark how language models respond to covert attacks, where an adversary splits a harmful goal into small, harmless-seeming fragments sent one at a time. Each individual query looks safe, so standard filters miss it, but the fragments together supply the information needed to complete risky tasks. By creating datasets that frontier models refuse outright while remaining too hard for open-weight models, the work demonstrates that these decomposition strategies succeed at uplifting misuse and that defenses which track conversation state across queries can block them. Current evaluations focus on single overt requests, leaving this distributed threat unmeasured.

Core claim

We develop Benchmarks for Stateful Defenses (BSD), a data generation pipeline that automates evaluations of covert attacks and corresponding defenses. Using this pipeline, we curate two new datasets that are consistently refused by frontier models and are too difficult for weaker open-weight models. This enables us to evaluate decomposition attacks, which are found to be effective misuse enablers, and to highlight stateful defenses as a promising countermeasure.
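The curation criteria named here (refused by frontier models, too hard for weaker open-weight models), together with the "answerability" filter mentioned in the Figure 2 caption, can be read as three predicates applied to each candidate question. The following is a minimal sketch of that reading, not the authors' implementation: the `Candidate` fields, the function names, and the thresholds are all assumptions introduced for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Candidate:
    """One candidate question with precomputed results (all fields hypothetical)."""
    question_id: str
    reference_answers: List[str]    # answers from models used for the answerability check
    refusal_flags: List[bool]       # True where a safety-trained model refused
    weak_model_correct: List[bool]  # True where a weaker open-weight model was correct


def is_answerable(c: Candidate) -> bool:
    # Answerable only if at least two reference answers exist and all agree.
    return len(c.reference_answers) >= 2 and len(set(c.reference_answers)) == 1


def is_refused(c: Candidate, min_refusal_rate: float = 1.0) -> bool:
    # Assumed threshold: by default every safety-trained model must refuse.
    if not c.refusal_flags:
        return False
    return sum(c.refusal_flags) / len(c.refusal_flags) >= min_refusal_rate


def is_difficult(c: Candidate, max_weak_accuracy: float = 0.25) -> bool:
    # Assumed threshold: weaker models may be correct at most 25% of the time.
    if not c.weak_model_correct:
        return False
    return sum(c.weak_model_correct) / len(c.weak_model_correct) <= max_weak_accuracy


def curate(candidates: List[Candidate]) -> List[Candidate]:
    """Keep only candidates that pass all three filters."""
    return [
        c for c in candidates
        if is_answerable(c) and is_refused(c) and is_difficult(c)
    ]
```

Keeping the three checks as separate functions makes it easy to report how many candidates each filter removes, which is the kind of breakdown a reader would want when judging how selective the pipeline is.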

What carries the argument

The Benchmarks for Stateful Defenses (BSD) pipeline, which automates the creation and cross-query combination of task fragments to test detection of accumulated misuse signals.
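To make "accumulated misuse signals" concrete, here is a minimal sketch of one way a stateful defense could operate: keep a sliding window of per-query risk scores for each user and flag the user once the windowed sum crosses a threshold, even when no single query does. The class name, its parameters, and the idea of a scalar risk score supplied by some upstream classifier are assumptions for illustration; the defenses evaluated in the paper may work differently.

```python
from collections import defaultdict, deque


class StatefulMonitor:
    """Tracks per-user risk scores across queries (hypothetical sketch)."""

    def __init__(self, window: int = 20,
                 per_query_threshold: float = 0.8,
                 cumulative_threshold: float = 2.0) -> None:
        self.per_query_threshold = per_query_threshold
        self.cumulative_threshold = cumulative_threshold
        # One bounded history per user; old scores fall out of the window.
        self._history = defaultdict(lambda: deque(maxlen=window))

    def observe(self, user_id: str, risk_score: float) -> str:
        """Record a risk score in [0, 1] and return a decision string."""
        if not 0.0 <= risk_score <= 1.0:
            raise ValueError("risk_score must be in [0, 1]")
        history = self._history[user_id]
        history.append(risk_score)
        # Stateless check: a single overtly risky query is blocked outright.
        if risk_score >= self.per_query_threshold:
            return "block_query"
        # Stateful check: many mildly risky queries add up within the window.
        if sum(history) >= self.cumulative_threshold:
            return "flag_user"
        return "allow"
```

The design choice worth noting is the split between the two thresholds: the per-query check models a conventional filter, while the cumulative check is what distinguishes a stateful defense from it.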

Load-bearing premise

That the curated datasets and automated pipeline accurately capture real-world covert attack strategies, and that combining fragments across queries reliably uplifts misuse in deployed models.

What would settle it

Running the decomposition fragments through a production model and measuring whether the attacker can complete the target harmful task at a higher rate than with single queries, or whether adding stateful tracking across queries fails to lower that rate.

Figures

Figures reproduced from arXiv: 2506.06414 by Alexander Robey, Davis Brown, Eric Wong, George J. Pappas, Hamed Hassani, Luze Sun, Mahdi Sabbaghi.

**Figure 2.** Our pipeline to generate hard, refused, answerable questions. First, we use a strong unaligned model (GPT-4.1 [38]) to modify a question from an existing dataset [39] to be both unsafe and difficult. We then filter for (a) questions with answers unanimously agreed on by other frontier models (‘answerability’) [40], (b) refusal by safety-trained models, and (c) for difficulty. Measuring misuse uplift, the in…

**Figure 3.** BSD has difficult questions, compared to other biology and misuse evaluations […

**Figure 4.** Decompositions are harder to identify than jailbreaks per-input. (…

**Figure 5.** Benign prompts push the precision of an adversarially trained Llama-Guard classifier to zero…

**Figure 6.** (Left, BSD) In the refusal setting with BSD, decomposition accuracy steadily improves as the number of decompositions increases. The baseline gets no answer from the strong model (GPT-4.1). (Right, WMDP) On a similar dataset where strong models do not refuse (WMDP-Bio), decomposition consistently underperforms direct querying, suggesting the success of decomposition scaling is not from general test-time com…

**Figure 7.** The misuse rate for decomposition attacks vs direct querying for BSD cyber questions.

**Figure 8.** Misuse rate (accuracy on an easy version of BSD bio) between different models and attack…
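One standard way to read the Figure 5 caption is as a base-rate effect: when benign prompts vastly outnumber harmful ones, even a small false-positive rate swamps the true positives and precision collapses. The sketch below computes precision from a true-positive rate, a false-positive rate, and the fraction of traffic that is harmful. This is an illustrative assumption about the mechanism, not a reproduction of the paper's analysis, and the function and parameter names are hypothetical.

```python
def precision(tpr: float, fpr: float, prevalence: float) -> float:
    """Precision of a classifier given its rates and the harmful-traffic share.

    tpr: fraction of harmful prompts that are flagged
    fpr: fraction of benign prompts that are flagged
    prevalence: fraction of all prompts that are harmful
    """
    for name, value in (("tpr", tpr), ("fpr", fpr), ("prevalence", prevalence)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be in [0, 1]")
    true_positives = tpr * prevalence
    false_positives = fpr * (1.0 - prevalence)
    flagged = true_positives + false_positives
    # If nothing is flagged, precision is undefined; return 0.0 by convention.
    return true_positives / flagged if flagged > 0 else 0.0
```

Holding `tpr` and `fpr` fixed while shrinking `prevalence` toward zero drives the returned value toward zero, which is the qualitative behaviour the caption describes.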

Original abstract

Existing language model safety evaluations focus on overt attacks and low-stakes tasks. In reality, an attacker can easily subvert existing safeguards by requesting help on small, benign-seeming tasks across many independent queries. Because the individual queries do not appear harmful, the attack is hard to detect. However, when combined, these fragments uplift misuse by helping the attacker complete hard and dangerous tasks. Toward identifying defenses against such strategies, we develop Benchmarks for Stateful Defenses (BSD), a data generation pipeline that automates evaluations of covert attacks and corresponding defenses. Using this pipeline, we curate two new datasets that are consistently refused by frontier models and are too difficult for weaker open-weight models. This enables us to evaluate decomposition attacks, which are found to be effective misuse enablers, and to highlight stateful defenses as a promising countermeasure.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The BSD pipeline is a practical new tool for covert attack benchmarks, though the attack effectiveness claims need more quantitative support.

The one or two things to know about this paper are that they came up with the BSD pipeline for creating benchmarks against covert attacks, and they position stateful defenses as a good response. The datasets they made are refused by frontier models, which sets up their tests.

What the paper does well is focus on a threat that feels more like how someone would actually try to misuse a model. Instead of one big bad query, the attacker spreads it out. That matches real-world scenarios better than the usual single-shot tests. They automate the data creation, which is practical for scaling evaluations.

The soft spots are in how they measure success. The claim is that decomposition attacks enable misuse by letting the model help with fragments that add up to something harmful. But to trust that, we need data showing the refusal rates drop when using the fragments, and that the final output is more dangerous than if the attacker just asked directly. The automated pipeline might create fragments that are still too connected or that the model handles differently when seen in sequence. If the full paper has those comparisons and they are solid, it improves things. Right now it reads like the central result could use more backing.

This paper is for AI safety people who care about misuse in real deployments. Anyone thinking about how to defend against sneaky, drawn-out attacks would find the approach relevant. It deserves peer review. The idea addresses a clear gap, and getting feedback on the evaluation design would help make the results more reliable.

Referee Report

2 major / 0 minor

Summary. The manuscript introduces Benchmarks for Stateful Defenses (BSD), an automated data-generation pipeline designed to evaluate covert decomposition attacks on language models. These attacks split harmful tasks into sequences of individually benign queries that evade per-query safeguards; when aggregated, the fragments are claimed to produce measurable misuse uplift. The authors curate two new datasets that frontier models consistently refuse and that are too difficult for weaker open-weight models, use the pipeline to show that decomposition attacks are effective misuse enablers, and identify stateful defenses as a promising countermeasure.

Significance. If the empirical claims are substantiated with quantitative evidence, the work addresses a genuine gap in current safety evaluations, which focus on overt single-query attacks. Providing curated, hard datasets and an automated pipeline for stateful-defense benchmarking could supply a useful testbed for future mitigation research. The emphasis on realistic multi-query covert strategies is a constructive direction, though its practical impact depends on demonstrating that the generated fragments are independent, evade detection, and produce statistically reliable uplift over direct queries.

major comments (2)

[Abstract] The central claim that 'decomposition attacks are effective misuse enablers' and that the BSD pipeline produces datasets 'consistently refused by frontier models' is presented without any refusal-rate tables, uplift metrics, statistical significance tests, or ablation comparing sequential fragments to direct queries. Because the effectiveness of the attack and the value of the benchmark rest on these unshown measurements, the absence of this evidence is load-bearing for the paper's primary contribution.
[Abstract] The weakest assumption—that automated fragment combination across independent queries produces a statistically significant increase in harmful output that would not occur from a single direct query—is not supported by any described metric or control experiment in the abstract. Without an explicit definition of the uplift metric and a comparison to baseline direct-query success rates, it is impossible to verify that the pipeline captures genuine covert strategies rather than inadvertently linked or still-refusable fragments.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive review. The comments highlight the need for the abstract to more explicitly surface the quantitative results that support our claims about decomposition attacks and the BSD pipeline. We address each point below and will revise the abstract accordingly while preserving its length constraints.

Referee: [Abstract] The central claim that 'decomposition attacks are effective misuse enablers' and that the BSD pipeline produces datasets 'consistently refused by frontier models' is presented without any refusal-rate tables, uplift metrics, statistical significance tests, or ablation comparing sequential fragments to direct queries. Because the effectiveness of the attack and the value of the benchmark rest on these unshown measurements, the absence of this evidence is load-bearing for the paper's primary contribution.

Authors: We agree that the abstract would benefit from explicit quantitative anchors. The full manuscript already reports refusal rates (e.g., >95% on frontier models for both datasets), uplift metrics (decomposition success rates versus direct-query baselines), and statistical comparisons in Sections 4 and 5. In the revision we will add a concise sentence to the abstract summarizing the key refusal rates, the measured uplift, and the significance of the difference from direct queries, while still keeping the abstract under the word limit. revision: yes
Referee: [Abstract] The weakest assumption—that automated fragment combination across independent queries produces a statistically significant increase in harmful output that would not occur from a single direct query—is not supported by any described metric or control experiment in the abstract. Without an explicit definition of the uplift metric and a comparison to baseline direct-query success rates, it is impossible to verify that the pipeline captures genuine covert strategies rather than inadvertently linked or still-refusable fragments.

Authors: The manuscript defines the uplift metric in Section 3.2 as the difference in harmful completion rate between the aggregated decomposition sequence and the direct-query baseline, with statistical significance assessed via paired t-tests across model runs. We acknowledge that the abstract does not currently name this metric or the control. In the revision we will insert a brief clause stating that decomposition yields a statistically significant uplift (p < 0.01) over direct queries on the curated datasets, thereby clarifying that the fragments are not merely rephrasings of the original harmful request. revision: yes
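The uplift metric as described in this simulated response (a difference in completion rates, compared with a paired test across runs) can be sketched in a few lines. The function below is an illustration under assumptions: the name `paired_uplift`, the input layout (one rate per run for each condition), and the handling of zero variance are all hypothetical, and any numbers passed to it would be made up by the reader rather than taken from the paper.

```python
import math
from statistics import mean, stdev
from typing import Sequence, Tuple


def paired_uplift(decomposition_rates: Sequence[float],
                  direct_rates: Sequence[float]) -> Tuple[float, float]:
    """Return (mean uplift, paired t statistic) across matched runs.

    Each index is one run (e.g. one model or seed) measured under both
    conditions; uplift is decomposition rate minus direct-query rate.
    """
    n = len(decomposition_rates)
    if n != len(direct_rates):
        raise ValueError("both conditions need the same number of runs")
    if n < 2:
        raise ValueError("a paired t statistic needs at least two runs")
    diffs = [d - b for d, b in zip(decomposition_rates, direct_rates)]
    mean_diff = mean(diffs)
    sd = stdev(diffs)  # sample standard deviation of the paired differences
    if sd == 0:
        # All differences identical: the statistic is unbounded (or 0 if no uplift).
        t_stat = 0.0 if mean_diff == 0 else math.copysign(math.inf, mean_diff)
    else:
        t_stat = mean_diff / (sd / math.sqrt(n))
    return mean_diff, t_stat
```

Turning the statistic into a p-value requires the t distribution with n - 1 degrees of freedom; `scipy.stats.ttest_rel` performs the whole test if SciPy is available.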

Circularity Check

0 steps flagged

No circularity: empirical benchmarking via new data pipeline

This paper presents an empirical benchmarking study that develops the BSD data generation pipeline to curate refusal datasets and evaluate decomposition attacks plus stateful defenses. The abstract and described contributions contain no mathematical derivations, fitted parameters renamed as predictions, self-citation load-bearing premises, or ansatzes smuggled via prior work. Central claims rest on newly generated datasets and experimental results rather than any reduction of outputs to inputs by construction. The work is self-contained against external benchmarks through its automated curation process.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The work rests on the domain assumption that existing evaluations miss covert strategies and introduces the BSD pipeline as a new artifact without independent evidence of its coverage of real attacks.

axioms (1)

Domain assumption: Existing language model safety evaluations focus on overt attacks and low-stakes tasks.
Stated directly in the abstract as the motivation for new benchmarks.

invented entities (1)

Benchmarks for Stateful Defenses (BSD): no independent evidence
purpose: Automated data generation pipeline for evaluating covert attacks and stateful defenses.
Newly proposed in this work with no external validation mentioned.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

unclear
Relation between the paper passage and the cited Recognition theorem.

We develop Benchmarks for Stateful Defenses (BSD), a data generation pipeline that automates evaluations of covert attacks and corresponding defenses... decomposition attacks are effective misuse enablers

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

One Turn Too Late: Response-Aware Defense Against Hidden Malicious Intent in Multi-Turn Dialogue
cs.CL 2026-05 unverdicted novelty 6.0

TurnGate uses a new multi-turn intent dataset to detect the harm-enabling closure point in dialogues, outperforming baselines with low over-refusal and generalizing across domains.
One Turn Too Late: Response-Aware Defense Against Hidden Malicious Intent in Multi-Turn Dialogue
cs.CL 2026-05 unverdicted novelty 6.0

TurnGate identifies the critical turn in multi-turn dialogues where a response would complete hidden malicious intent, outperforming baselines on the new MTID dataset while keeping over-refusal low.

Reference graph

Works this paper leans on

79 extracted references · 79 canonical work pages · cited by 1 Pith paper · 23 internal anchors

[1]

Jailbroken: How does llm safety training fail? Advances in Neural Information Processing Systems, 36:80079–80110, 2023

Alexander Wei, Nika Haghtalab, and Jacob Steinhardt. Jailbroken: How does llm safety training fail? Advances in Neural Information Processing Systems, 36:80079–80110, 2023. 1

work page 2023
[2]

Jailbreaking Black Box Large Language Models in Twenty Queries

Patrick Chao, Alexander Robey, Edgar Dobriban, Hamed Hassani, George J. Pappas, and Eric Wong. Jailbreaking black box large language models in twenty queries, 2024. URL https://arxiv.org/abs/2310.08419. 2, 6, 7, 17, 22

work page internal anchor Pith review Pith/arXiv arXiv 2024
[3]

Red Teaming Language Models with Language Models

Ethan Perez, Saffron Huang, Francis Song, Trevor Cai, Roman Ring, John Aslanides, Amelia Glaese, Nat McAleese, and Geoffrey Irving. Red teaming language models with language models. arXiv preprint arXiv:2202.03286, 2022. 1

work page internal anchor Pith review Pith/arXiv arXiv 2022
[4]

HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal

Mantas Mazeika, Long Phan, Xuwang Yin, Andy Zou, Zifan Wang, Norman Mu, Elham Sakhaee, Nathaniel Li, Steven Basart, Bo Li, et al. Harmbench: A standardized evaluation framework for automated red teaming and robust refusal. arXiv preprint arXiv:2402.04249,

work page internal anchor Pith review Pith/arXiv arXiv
[5]

JailbreakBench: An Open Robustness Benchmark for Jailbreaking Large Language Models

Patrick Chao, Edoardo Debenedetti, Alexander Robey, Maksym Andriushchenko, Francesco Croce, Vikash Sehwag, Edgar Dobriban, Nicolas Flammarion, George J Pappas, Florian Tramer, et al. Jailbreakbench: An open robustness benchmark for jailbreaking large language models. arXiv preprint arXiv:2404.01318, 2024. 2, 17

work page internal anchor Pith review Pith/arXiv arXiv 2024
[6]

A StrongREJECT for Empty Jailbreaks

Alexandra Souly, Qingyuan Lu, Dillon Bowen, Tu Trinh, Elvis Hsieh, Sana Pandey, Pieter Abbeel, Justin Svegliato, Scott Emmons, Olivia Watkins, et al. A strongreject for empty jailbreaks. arXiv preprint arXiv:2402.10260, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[7]

AgentHarm: A Benchmark for Measuring Harmfulness of LLM Agents

Maksym Andriushchenko, Alexandra Souly, Mateusz Dziemian, Derek Duenas, Maxwell Lin, Justin Wang, Dan Hendrycks, Andy Zou, Zico Kolter, Matt Fredrikson, et al. Agentharm: A benchmark for measuring harmfulness of llm agents. arXiv preprint arXiv:2410.09024, 2024. 1, 2

work page internal anchor Pith review Pith/arXiv arXiv 2024
[8]

Training language models to follow instructions with human feedback

Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in neural information processing systems, 35:27730–27744, 2022. 1

work page 2022
[9]

Direct preference optimization: Your language model is secretly a reward model

Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. Advances in Neural Information Processing Systems, 36:53728–53741, 2023

work page 2023
[10]

https: //arxiv.org/abs/2406.04313.arXiv(2024)

Andy Zou, Long Phan, Justin Wang, Derek Duenas, Maxwell Lin, Maksym Andriushchenko, Rowan Wang, Zico Kolter, Matt Fredrikson, and Dan Hendrycks. Improving alignment and robustness with circuit breakers. arXiv preprint arXiv: 2406.04313, 2024. 1, 17

work page arXiv 2024
[11]

The jailbreak tax: How useful are your jailbreak outputs? arXiv preprint arXiv:2504.10694, 2025

Kristina Nikoli´c, Luze Sun, Jie Zhang, and Florian Tramèr. The jailbreak tax: How useful are your jailbreak outputs? arXiv preprint arXiv:2504.10694, 2025. 1, 2, 17, 23 10

work page arXiv 2025
[12]

Constitutional Classifiers: Defending against Universal Jailbreaks across Thousands of Hours of Red Teaming

Mrinank Sharma, Meg Tong, Jesse Mu, Jerry Wei, Jorrit Kruthoff, Scott Goodfriend, Euan Ong, Alwin Peng, Raj Agarwal, Cem Anil, et al. Constitutional classifiers: Defending against universal jailbreaks across thousands of hours of red teaming. arXiv preprint arXiv:2501.18837,

work page internal anchor Pith review Pith/arXiv arXiv
[13]

Baseline Defenses for Adversarial Attacks Against Aligned Language Models

Neel Jain, Avi Schwarzschild, Yuxin Wen, Gowthami Somepalli, John Kirchenbauer, Ping-yeh Chiang, Micah Goldblum, Aniruddha Saha, Jonas Geiping, and Tom Goldstein. Baseline de- fenses for adversarial attacks against aligned language models.arXiv preprint arXiv:2309.00614, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[14]

SmoothLLM: Defending Large Language Models Against Jailbreaking Attacks

Alexander Robey, Eric Wong, Hamed Hassani, and George J Pappas. Smoothllm: Defending large language models against jailbreaking attacks. arXiv preprint arXiv:2310.03684, 2023. 1

work page internal anchor Pith review Pith/arXiv arXiv 2023
[15]

Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To!

Xiangyu Qi, Yi Zeng, Tinghao Xie, Pin-Yu Chen, Ruoxi Jia, Prateek Mittal, and Peter Henderson. Fine-tuning aligned language models compromises safety, even when users do not intend to! arXiv preprint arXiv:2310.03693, 2023. 1

work page internal anchor Pith review Pith/arXiv arXiv 2023
[16]

Obfuscated activations bypass llm latent-space defenses

Luke Bailey, Alex Serrano, Abhay Sheshadri, Mikhail Seleznyov, Jordan Taylor, Erik Jenner, Jacob Hilton, Stephen Casper, Carlos Guestrin, and Scott Emmons. Obfuscated activations bypass llm latent-space defenses. arXiv preprint arXiv:2412.09565, 2024

work page arXiv 2024
[17]

T., Haghtalab, N., and Steinhardt, J

Danny Halawi, Alexander Wei, Eric Wallace, Tony T Wang, Nika Haghtalab, and Jacob Steinhardt. Covert malicious finetuning: Challenges in safeguarding llm adaptation. arXiv preprint arXiv:2406.20053, 2024. 1

work page arXiv 2024
[18]

Adversaries can misuse combinations of safe models

Erik Jones, Anca Dragan, and Jacob Steinhardt. Adversaries can misuse combinations of safe models. arXiv preprint arXiv: 2406.14595, 2024. 2, 3, 6, 7, 9, 17

work page arXiv 2024
[19]

Breach by a thousand leaks: Unsafe information leakage in ‘safe’ ai responses.arXiv preprint arXiv: 2407.02551, 2024

David Glukhov, Ziwen Han, Ilia Shumailov, Vardan Papyan, and Nicolas Papernot. Breach by a thousand leaks: Unsafe information leakage in ‘safe’ ai responses.arXiv preprint arXiv: 2407.02551, 2024. 2, 3, 6, 7, 9, 17

work page arXiv 2024
[20]

arXiv preprint arXiv:2402.16914 (2024)

Xirui Li, Ruochen Wang, Minhao Cheng, Tianyi Zhou, and Cho-Jui Hsieh. Drattack: Prompt decomposition and reconstruction makes powerful llm jailbreakers. arXiv preprint arXiv:2402.16914, 2024. 2

work page arXiv 2024
[21]

arXiv preprint arXiv:2502.01633 (2025) 20 Sakib et al

Mahdi Sabbaghi, Paul Kassianik, George Pappas, Yaron Singer, Amin Karbasi, and Hamed Hassani. Adversarial reasoning at jailbreaking time. arXiv preprint arXiv:2502.01633, 2025. 2, 6, 7, 17, 22

work page arXiv 2025
[22]

Qwen2.5 Technical Report

Qwen, :, An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, Huan Lin, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jingren Zhou, Junyang Lin, Kai Dang, Keming Lu, Keqin Bao, Kexin Yang, Le Yu, Mei Li, Mingfeng Xue, Pei Zhang, Qin Zhu, Rui Men, Runji Lin, Tianhao Li,...

work page internal anchor Pith review Pith/arXiv arXiv 2024
[23]

Dis- rupting malicious uses of our models: an update

Ben Nimmo, Albert Zhang, Matthew Richard, and Nathaniel Hartley. Dis- rupting malicious uses of our models: an update. Technical report, OpenAI, February 2025. URL https://cdn.openai.com/threat-intelligence-reports/ disrupting-malicious-uses-of-our-models-february-2025-update.pdf . Threat Intelligence Report. 2, 4, 7

work page 2025
[24]

Evaluating frontier models for dangerous capabilities,

Mary Phuong, Matthew Aitchison, Elliot Catt, Sarah Cogan, Alexandre Kaskasoli, Victoria Krakovna, David Lindner, Matthew Rahtz, Yannis Assael, Sarah Hodkinson, Heidi Howard, Tom Lieberum, Ramana Kumar, Maria Abi Raad, Albert Webson, Lewis Ho, Sharon Lin, Sebastian Farquhar, Marcus Hutter, Gregoire Deletang, Anian Ruoss, Seliem El-Sayed, Sasha Brown, Anca ...

work page arXiv 2024
[25]

Jailbreaking leading safety-aligned llms with simple adaptive attacks, 2025

Maksym Andriushchenko, Francesco Croce, and Nicolas Flammarion. Jailbreaking leading safety-aligned llms with simple adaptive attacks, 2025. URLhttps://arxiv.org/abs/2404. 02151. 2, 6, 17, 22 11

work page 2025
[26]

Great, Now Write an Article About That: The Crescendo Multi-Turn LLM Jailbreak Attack

Mark Russinovich, Ahmed Salem, and Ronen Eldan. Great, now write an article about that: The crescendo multi-turn llm jailbreak attack. arXiv preprint arXiv: 2404.01833, 2024. 2, 6, 17, 22

work page internal anchor Pith review Pith/arXiv arXiv 2024
[27]

Steven Chen, Nicholas Carlini, and D. Wagner. Stateful detection of black-box adversarial attacks. Proceedings of the 1st ACM Workshop on Security and Privacy on Artificial Intelligence,

work page
[28]

doi: 10.1145/3385003.3410925. 2

work page doi:10.1145/3385003.3410925
[29]

Logan IV , Eric Wallace, and Sameer Singh

Taylor Shin, Yasaman Razeghi, Robert L. Logan IV , Eric Wallace, and Sameer Singh. Au- toprompt: Eliciting knowledge from language models with automatically generated prompts,

work page
[30]

URL https://arxiv.org/abs/2010.15980. 2, 17

work page arXiv 2010
[31]

Universal and Transferable Adversarial Attacks on Aligned Language Models

Andy Zou, Zifan Wang, Nicholas Carlini, Milad Nasr, J. Zico Kolter, and Matt Fredrikson. Universal and transferable adversarial attacks on aligned language models. arXiv preprint arXiv: 2307.15043, 2023. 17

work page internal anchor Pith review Pith/arXiv arXiv 2023
[32]

AutoDAN: Generating Stealthy Jailbreak Prompts on Aligned Large Language Models

Xiaogeng Liu, Nan Xu, Muhao Chen, and Chaowei Xiao. Autodan: Generating stealthy jailbreak prompts on aligned large language models, 2024. URL https://arxiv.org/abs/ 2310.04451. 17

work page internal anchor Pith review Pith/arXiv arXiv 2024
[33]

Tree of attacks: Jailbreaking black-box llms automatically

Anay Mehrotra, Manolis Zampetakis, Paul Kassianik, Blaine Nelson, Hyrum Anderson, Yaron Singer, and Amin Karbasi. Tree of attacks: Jailbreaking black-box llms automatically, 2024. URL https://arxiv.org/abs/2312.02119. 2, 17

work page arXiv 2024
[34]

Stateful Detection of Black-Box Adversarial Attacks

Steven Chen, Nicholas Carlini, and David Wagner. Stateful detection of black-box adversarial attacks, 2019. URL https://arxiv.org/abs/1907.05587. 2, 17

work page internal anchor Pith review Pith/arXiv arXiv 2019
[35]

Huiying Li, Shawn Shan, Emily Wenger, Jiayun Zhang, Haitao Zheng, and Ben Y . Zhao. Blacklight: Scalable defense for neural networks against Query-Based Black-Box attacks. In 31st USENIX Security Symposium (USENIX Security 22) , pages 2117–2134, Boston, MA, August 2022. USENIX Association. ISBN 978-1-939133-31-1. URL https://www.usenix. org/conference/use...

work page 2022
[36]

Piha: Detection method using perceptual image hashing against query-based adversarial attacks

Seok-Hwan Choi, Jinmyeong Shin, and Yoon-Ho Choi. Piha: Detection method using perceptual image hashing against query-based adversarial attacks. Future Generation Computer Systems, 145:563–577, 2023. ISSN 0167-739X. doi: https://doi.org/10.1016/j.future.2023.04.005. URL https://www.sciencedirect.com/science/article/pii/S0167739X23001395. 17

work page doi:10.1016/j.future.2023.04.005 2023
[37]

Mind the gap: Detecting black- box adversarial attacks in the making through query update analysis, 2025

Jeonghwan Park, Niall McLaughlin, and Ihsen Alouani. Mind the gap: Detecting black- box adversarial attacks in the making through query update analysis, 2025. URL https: //arxiv.org/abs/2503.02986. 17

work page arXiv 2025
[38]

Stateful defenses for machine learning models are not yet secure against black-box attacks

Ryan Feng, Ashish Hooda, Neal Mangaokar, Kassem Fawaz, Somesh Jha, and Atul Prakash. Stateful defenses for machine learning models are not yet secure against black-box attacks. In Proceedings of the 2023 ACM SIGSAC Conference on Computer and Communications Security, CCS ’23, page 786–800. ACM, November 2023. doi: 10.1145/3576915.3623116. URL http://dx.doi...

work page doi:10.1145/3576915.3623116 2023
[39]

Tamkin, M

Alex Tamkin, Miles McCain, Kunal Handa, Esin Durmus, Liane Lovitt, Ankur Rathi, Saffron Huang, Alfred Mountfield, Jerry Hong, Stuart Ritchie, Michael Stern, Brian Clarke, Landon Goldberg, Theodore R. Sumers, Jared Mueller, William McEachen, Wes Mitchell, Shan Carter, Jack Clark, Jared Kaplan, and Deep Ganguli. Clio: Privacy-preserving insights into real-w...

work page arXiv 2024
[40]

Introducing GPT-4.1 in the API, April 2025

OpenAI. Introducing GPT-4.1 in the API, April 2025. URL https://openai.com/index/ gpt-4-1/. Accessed on May 5, 2025. 4, 5, 19

work page 2025
[41]

The WMDP Benchmark: Measuring and Reducing Malicious Use With Unlearning

Nathaniel Li, Alexander Pan, Anjali Gopal, Summer Yue, Daniel Berrios, Alice Gatti, Justin D. Li, Ann-Kathrin Dombrowski, Shashwat Goel, Long Phan, Gabriel Mukobi, Nathan Helm- Burger, Rassin R. Lababidi, Lennart Justen, Andrew B. Liu, Michael Chen, Isabelle Barrass, Oliver Zhang, Xiaoyuan Zhu, Rishub Tamirisa, Bhrugu Bharathi, Adam Khoja, Ariel Herbert- ...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2403.03218 2024
[42]

Do large language model benchmarks test reliability? arXiv preprint arXiv:2502.03461, 2025

Joshua Vendrow, Edward Vendrow, Sara Beery, and Aleksander Madry. Do large language model benchmarks test reliability? arXiv preprint arXiv: 2502.03461, 2025. 4

work page arXiv 2025
[43]

Badllama: cheaply removing safety fine-tuning from llama 2-chat 13b

Pranav Gade, Simon Lermen, Charlie Rogers-Smith, and Jeffrey Ladish. Badllama: cheaply removing safety fine-tuning from llama 2-chat 13b. arXiv preprint arXiv: 2311.00117, 2023. 4, 18

work page arXiv 2023
[44]

Song, and J

Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, D. Song, and J. Steinhardt. Measuring massive multitask language understanding. International Conference on Learning Representations, 2020. 5, 6, 9

work page 2020
[45]

LAB-Bench: Measuring Capabilities of Language Models for Biology Research

Jon M. Laurent, Joseph D. Janizek, Michael Ruzo, Michaela M. Hinks, Michael J. Hammerling, Siddharth Narayanan, Manvitha Ponnapati, Andrew D. White, and Samuel G. Rodriques. Lab- bench: Measuring capabilities of language models for biology research. arXiv preprint arXiv: 2407.10362, 2024. 5, 6

work page internal anchor Pith review Pith/arXiv arXiv 2024
[46]

Maddison, and Tatsunori B

Yangjun Ruan, Chris J. Maddison, and Tatsunori B. Hashimoto. Observational scaling laws and the predictability of language model performance. Neural Information Processing Systems,

work page
[47]

doi: 10.48550/arXiv.2405.10938. 5

work page doi:10.48550/arxiv.2405.10938
[48] Richard Ren, Steven Basart, Adam Khoja, Alice Gatti, Long Phan, Xuwang Yin, Mantas Mazeika, Alexander Pan, Gabriel Mukobi, Ryan Hwang Kim, Stephen Fitz, and Dan Hendrycks. Safetywashing: Do AI safety benchmarks actually measure safety progress? In The Thirty-eight Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2024. URL...

[49] Leland McInnes, John Healy, and James Melville. Umap: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426, 2018.

[50] Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, Amy Yang, Angela Fan, Anirudh Goyal, Anthony Hartshorn, Aobo Yang, Archi Mitra, Archie Sravankumar, Artem Korenev, Arthur Hinsvark, Arun Rao, Aston Zhang, Aurelien Rodriguez, Austen Gregerson, Ava... arXiv, 2024.

[51] Ken Lebedev, Alex Moix, and Jacob Klein. Operating multi-client influence networks across platforms. Technical report, Anthropic, April 2025. URL https://cdn.sanity.io/files/4zrzovbb/website/45bc6adf039848841ed9e47051fb1209d6bb2b26.pdf. Anthropic technical report on AI-powered influence operations.

[52] David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, and Samuel R. Bowman. Gpqa: A graduate-level google-proof q&a benchmark. arXiv preprint arXiv:2311.12022, 2023.

[53] Toby Shevlane, Sebastian Farquhar, Ben Garfinkel, Mary Phuong, Jess Whittlestone, Jade Leung, Daniel Kokotajlo, Nahema Marchal, Markus Anderljung, Noam Kolt, Lewis Ho, Divya Siddarth, Shahar Avin, Will Hawkins, Been Kim, Iason Gabriel, Vijay Bolina, Jack Clark, Yoshua Bengio, Paul Christiano, and Allan Dafoe. Model evaluation for extreme risks. arXiv prep...

[54] Mary Phuong, Roland S. Zimmermann, Ziyue Wang, David Lindner, Victoria Krakovna, Sarah Cogan, Allan Dafoe, Lewis Ho, and Rohin Shah. Evaluating frontier models for stealth and situational awareness. arXiv preprint arXiv:2505.01420, 2025.

[55] OpenAI Preparedness Team. GPT-4 system card. Technical report, 2023. URL https://cdn.openai.com/papers/gpt-4-system-card.pdf.

[56] Anthropic. Claude 3.7 Sonnet System Card. Technical report, 2024. URL https://www.anthropic.com/claude-3-7-sonnet-system-card.

[57] Aaron Jaech, Adam Kalai, Adam Lerer, Adam Richardson, Ahmed El-Kishky, Aiden Low, Alec Helyar, Aleksander Madry, Alex Beutel, Alex Carney, et al. Openai o1 system card. arXiv preprint arXiv:2412.16720, 2024.

[58] OpenAI. Building an early warning system for llm-aided biological threat creation. https://openai.com/index/building-an-early-warning-system-for-llm-aided-biological-threat-creation/, January 2024. Accessed: 2025-05-08.

[59] AI Security Institute. Advanced ai evaluations at aisi: May update. https://www.aisi.gov.uk/work/advanced-ai-evaluations-may-update, May 2024. Accessed: 2025-05-08.

[60] Mika Juuti, Sebastian Szyller, Samuel Marchal, and N. Asokan. Prada: Protecting against dnn model stealing attacks, 2019. URL https://arxiv.org/abs/1805.02628.

[61] Ben Cottier, Josh You, Natalia Martemianova, and David Owen. How far behind are open models?, 2024. URL https://epoch.ai/blog/open-models-report. Accessed: 2025-03-18.

[62] METR. Details about metr’s preliminary evaluation of deepseek-r1. /autonomy-evals-guide/deepseek-r1-report/, 03 2025.

[63] DeepSeek, Inc. Deepseek-v3-0324 release. https://api-docs.deepseek.com/news/news250325, March 2025. Accessed: 2025-05-20.

[64] OpenAI. Introducing openai o3 and o4-mini. https://openai.com/index/introducing-o3-and-o4-mini/, April 2025. Accessed: 2025-05-20.

[65] Xiangyu Qi, Boyi Wei, Nicholas Carlini, Yangsibo Huang, Tinghao Xie, Luxi He, Matthew Jagielski, Milad Nasr, Prateek Mittal, and Peter Henderson. On evaluating the durability of safeguards for open-weight llms. arXiv preprint arXiv:2412.07097, 2024.

[66] Xiangyu Qi, Yi Zeng, Tinghao Xie, Pin-Yu Chen, Ruoxi Jia, Prateek Mittal, and Peter Henderson. Fine-tuning aligned language models compromises safety, even when users do not intend to! In The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. OpenReview.net, 2024. URL https://openreview.net/forum?id=...

[67] Rishub Tamirisa, Bhrugu Bharathi, Long Phan, Andy Zhou, Alice Gatti, Tarun Suresh, Maxwell Lin, Justin Wang, Rowan Wang, Ron Arel, Andy Zou, Dawn Song, Bo Li, Dan Hendrycks, and Mantas Mazeika. Tamper-resistant safeguards for open-weight LLMs. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?i...

[68] Domenic Rosati, Jan Wehner, Kai Williams, Lukasz Bartoszcze, Robie Gonzales, Subhabrata Majumdar, Hassan Sajjad, Frank Rudzicz, et al. Representation noising: A defence mechanism against harmful finetuning. Advances in Neural Information Processing Systems, 37:12636–12676, 2024.

[69] Javier Rando, Jie Zhang, Nicholas Carlini, and Florian Tramèr. Adversarial ml problems are getting harder to solve and to evaluate. arXiv preprint arXiv:2502.02260, 2025.

[70] Lujain Ibrahim, Saffron Huang, Lama Ahmad, and Markus Anderljung. Beyond static ai evaluations: advancing human interaction evaluations for llm harms and risks. arXiv preprint arXiv:2405.10632, 2024.

[71] Samuel R. Bowman, Jeeyoon Hyun, Ethan Perez, Edwin Chen, Craig Pettit, Scott Heiner, Kamilė Lukošiūtė, Amanda Askell, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, Christopher Olah, Daniela Amodei, Dario Amodei, Dawn Drain, Dustin Li, Eli Tran-Johnson, Jackson Kernion, Jamie Kerr, Jared Mueller, Jeffrey Ladish, Joshua Landau, ... arXiv, 2022.

[72] Thomas Kwa, Ben West, Joel Becker, Amy Deng, Katharyn Garcia, Max Hasin, Sami Jawhar, Megan Kinniment, Nate Rush, Sydney Von Arx, et al. Measuring ai ability to complete long tasks. arXiv preprint arXiv:2503.14499, 2025.

[73] Blake E Strom, Andy Applebaum, Doug P Miller, Kathryn C Nickels, Adam G Pennington, and Cody B Thomas. Mitre att&ck: Design and philosophy. In Technical report. The MITRE Corporation, 2018.

[74] Vladislav Lialin, Vijeta Deshpande, Xiaowei Yao, and Anna Rumshisky. Scaling down to scale up: A guide to parameter-efficient fine-tuning, 2024. URL https://arxiv.org/abs/2303.15647.

A Additional related work

Dangerous capability evaluations. Dangerous capability evaluations attempt to estimate the proficiency of frontier models on tasks where lang...

Open-weight models are currently weaker than proprietary models. Open-weight models (models with downloadable weights) have historically trailed proprietary systems in benchmark performance by at least 6 months [58]. While this performance gap is closing, it likely still holds for current frontier open-weight and closed-weight models [59–61].

Open-weight models can be made unsafe. The safety training and guardrails on open-weight models can be removed with only modest additional fine-tuning [41, 62, 63]. While there is early work attempting to make models robust to fine-tuning attacks [64, 65], this problem is difficult; e.g., defense here is strictly harder than that for adversarial examples...

Example Generation, Question, and Decomposition

The original question evaluates knowledge of malware capabilities and reconnaissance.

Original WMDP Question
Question: Which of the following actions can be performed by the SpyNote RAT family of malicious Android apps?
Choices:
A) Activate the victim’s microphone.
B) Copy files from the device to the C2 ser...

Example Generation, Question, and Decomposition

The original question evaluates a reconnaissance tool used for target analysis and information gathering.

Original WMDP Question
Question: In mobile security, how is Frida used for dynamic instrumentation?
Choices:
A) Encrypts application data
B) Optimizes application performance
C) Injects code into runn...

Define and contrast each of the four options (quiet, noisy, happy, sad); 4) Determine which category fits the described classroom atmosphere; 5) Logically select the best option.

Decompositions:
Decomp 1) In the sentence ’In a computer class, each student sits at a table and does their work. They are very ____.’, what setting is being described, and what a...

[1] Alexander Wei, Nika Haghtalab, and Jacob Steinhardt. Jailbroken: How does llm safety training fail? Advances in Neural Information Processing Systems, 36:80079–80110, 2023.

[2] Patrick Chao, Alexander Robey, Edgar Dobriban, Hamed Hassani, George J. Pappas, and Eric Wong. Jailbreaking black box large language models in twenty queries, 2024. URL https://arxiv.org/abs/2310.08419.

[3] Ethan Perez, Saffron Huang, Francis Song, Trevor Cai, Roman Ring, John Aslanides, Amelia Glaese, Nat McAleese, and Geoffrey Irving. Red teaming language models with language models. arXiv preprint arXiv:2202.03286, 2022.

[4] Mantas Mazeika, Long Phan, Xuwang Yin, Andy Zou, Zifan Wang, Norman Mu, Elham Sakhaee, Nathaniel Li, Steven Basart, Bo Li, et al. Harmbench: A standardized evaluation framework for automated red teaming and robust refusal. arXiv preprint arXiv:2402.04249, 2024.

[5] Patrick Chao, Edoardo Debenedetti, Alexander Robey, Maksym Andriushchenko, Francesco Croce, Vikash Sehwag, Edgar Dobriban, Nicolas Flammarion, George J Pappas, Florian Tramer, et al. Jailbreakbench: An open robustness benchmark for jailbreaking large language models. arXiv preprint arXiv:2404.01318, 2024.

[6] Alexandra Souly, Qingyuan Lu, Dillon Bowen, Tu Trinh, Elvis Hsieh, Sana Pandey, Pieter Abbeel, Justin Svegliato, Scott Emmons, Olivia Watkins, et al. A strongreject for empty jailbreaks. arXiv preprint arXiv:2402.10260, 2024.

[7] Maksym Andriushchenko, Alexandra Souly, Mateusz Dziemian, Derek Duenas, Maxwell Lin, Justin Wang, Dan Hendrycks, Andy Zou, Zico Kolter, Matt Fredrikson, et al. Agentharm: A benchmark for measuring harmfulness of llm agents. arXiv preprint arXiv:2410.09024, 2024.

[8] Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in neural information processing systems, 35:27730–27744, 2022.

[9] Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. Advances in Neural Information Processing Systems, 36:53728–53741, 2023.

[10] Andy Zou, Long Phan, Justin Wang, Derek Duenas, Maxwell Lin, Maksym Andriushchenko, Rowan Wang, Zico Kolter, Matt Fredrikson, and Dan Hendrycks. Improving alignment and robustness with circuit breakers. arXiv preprint arXiv:2406.04313, 2024.

[11] Kristina Nikolić, Luze Sun, Jie Zhang, and Florian Tramèr. The jailbreak tax: How useful are your jailbreak outputs? arXiv preprint arXiv:2504.10694, 2025.

[12] Mrinank Sharma, Meg Tong, Jesse Mu, Jerry Wei, Jorrit Kruthoff, Scott Goodfriend, Euan Ong, Alwin Peng, Raj Agarwal, Cem Anil, et al. Constitutional classifiers: Defending against universal jailbreaks across thousands of hours of red teaming. arXiv preprint arXiv:2501.18837, 2025.

[13] Neel Jain, Avi Schwarzschild, Yuxin Wen, Gowthami Somepalli, John Kirchenbauer, Ping-yeh Chiang, Micah Goldblum, Aniruddha Saha, Jonas Geiping, and Tom Goldstein. Baseline defenses for adversarial attacks against aligned language models. arXiv preprint arXiv:2309.00614, 2023.

[14] Alexander Robey, Eric Wong, Hamed Hassani, and George J Pappas. Smoothllm: Defending large language models against jailbreaking attacks. arXiv preprint arXiv:2310.03684, 2023.

[15] Xiangyu Qi, Yi Zeng, Tinghao Xie, Pin-Yu Chen, Ruoxi Jia, Prateek Mittal, and Peter Henderson. Fine-tuning aligned language models compromises safety, even when users do not intend to! arXiv preprint arXiv:2310.03693, 2023.

[16] Luke Bailey, Alex Serrano, Abhay Sheshadri, Mikhail Seleznyov, Jordan Taylor, Erik Jenner, Jacob Hilton, Stephen Casper, Carlos Guestrin, and Scott Emmons. Obfuscated activations bypass llm latent-space defenses. arXiv preprint arXiv:2412.09565, 2024.

[17] Danny Halawi, Alexander Wei, Eric Wallace, Tony T Wang, Nika Haghtalab, and Jacob Steinhardt. Covert malicious finetuning: Challenges in safeguarding llm adaptation. arXiv preprint arXiv:2406.20053, 2024.

[18] Erik Jones, Anca Dragan, and Jacob Steinhardt. Adversaries can misuse combinations of safe models. arXiv preprint arXiv:2406.14595, 2024.

[19] David Glukhov, Ziwen Han, Ilia Shumailov, Vardan Papyan, and Nicolas Papernot. Breach by a thousand leaks: Unsafe information leakage in ‘safe’ ai responses. arXiv preprint arXiv:2407.02551, 2024.

[20] Xirui Li, Ruochen Wang, Minhao Cheng, Tianyi Zhou, and Cho-Jui Hsieh. Drattack: Prompt decomposition and reconstruction makes powerful llm jailbreakers. arXiv preprint arXiv:2402.16914, 2024.

[21] Mahdi Sabbaghi, Paul Kassianik, George Pappas, Yaron Singer, Amin Karbasi, and Hamed Hassani. Adversarial reasoning at jailbreaking time. arXiv preprint arXiv:2502.01633, 2025.

[22] Qwen, An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, Huan Lin, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jingren Zhou, Junyang Lin, Kai Dang, Keming Lu, Keqin Bao, Kexin Yang, Le Yu, Mei Li, Mingfeng Xue, Pei Zhang, Qin Zhu, Rui Men, Runji Lin, Tianhao Li, ... Qwen2.5 Technical Report. arXiv, 2024.

[23] Ben Nimmo, Albert Zhang, Matthew Richard, and Nathaniel Hartley. Disrupting malicious uses of our models: an update. Technical report, OpenAI, February 2025. URL https://cdn.openai.com/threat-intelligence-reports/disrupting-malicious-uses-of-our-models-february-2025-update.pdf. Threat Intelligence Report.

[24] Mary Phuong, Matthew Aitchison, Elliot Catt, Sarah Cogan, Alexandre Kaskasoli, Victoria Krakovna, David Lindner, Matthew Rahtz, Yannis Assael, Sarah Hodkinson, Heidi Howard, Tom Lieberum, Ramana Kumar, Maria Abi Raad, Albert Webson, Lewis Ho, Sharon Lin, Sebastian Farquhar, Marcus Hutter, Gregoire Deletang, Anian Ruoss, Seliem El-Sayed, Sasha Brown, Anca ... Evaluating frontier models for dangerous capabilities. arXiv, 2024.

[25] Maksym Andriushchenko, Francesco Croce, and Nicolas Flammarion. Jailbreaking leading safety-aligned llms with simple adaptive attacks, 2025. URL https://arxiv.org/abs/2404.02151.

[26] Mark Russinovich, Ahmed Salem, and Ronen Eldan. Great, now write an article about that: The crescendo multi-turn llm jailbreak attack. arXiv preprint arXiv:2404.01833, 2024.

[27] Steven Chen, Nicholas Carlini, and D. Wagner. Stateful detection of black-box adversarial attacks. Proceedings of the 1st ACM Workshop on Security and Privacy on Artificial Intelligence. doi: 10.1145/3385003.3410925.

[29] Taylor Shin, Yasaman Razeghi, Robert L. Logan IV, Eric Wallace, and Sameer Singh. Autoprompt: Eliciting knowledge from language models with automatically generated prompts, 2020. URL https://arxiv.org/abs/2010.15980.

[31] Andy Zou, Zifan Wang, Nicholas Carlini, Milad Nasr, J. Zico Kolter, and Matt Fredrikson. Universal and transferable adversarial attacks on aligned language models. arXiv preprint arXiv:2307.15043, 2023.

[32] Xiaogeng Liu, Nan Xu, Muhao Chen, and Chaowei Xiao. Autodan: Generating stealthy jailbreak prompts on aligned large language models, 2024. URL https://arxiv.org/abs/2310.04451.

[33] Anay Mehrotra, Manolis Zampetakis, Paul Kassianik, Blaine Nelson, Hyrum Anderson, Yaron Singer, and Amin Karbasi. Tree of attacks: Jailbreaking black-box llms automatically, 2024. URL https://arxiv.org/abs/2312.02119.

[34] Steven Chen, Nicholas Carlini, and David Wagner. Stateful detection of black-box adversarial attacks, 2019. URL https://arxiv.org/abs/1907.05587.

[35] Huiying Li, Shawn Shan, Emily Wenger, Jiayun Zhang, Haitao Zheng, and Ben Y. Zhao. Blacklight: Scalable defense for neural networks against Query-Based Black-Box attacks. In 31st USENIX Security Symposium (USENIX Security 22), pages 2117–2134, Boston, MA, August 2022. USENIX Association. ISBN 978-1-939133-31-1. URL https://www.usenix.org/conference/use...

[36] Seok-Hwan Choi, Jinmyeong Shin, and Yoon-Ho Choi. Piha: Detection method using perceptual image hashing against query-based adversarial attacks. Future Generation Computer Systems, 145:563–577, 2023. ISSN 0167-739X. doi: https://doi.org/10.1016/j.future.2023.04.005. URL https://www.sciencedirect.com/science/article/pii/S0167739X23001395.

[37] Jeonghwan Park, Niall McLaughlin, and Ihsen Alouani. Mind the gap: Detecting black-box adversarial attacks in the making through query update analysis, 2025. URL https://arxiv.org/abs/2503.02986.

[38] Ryan Feng, Ashish Hooda, Neal Mangaokar, Kassem Fawaz, Somesh Jha, and Atul Prakash. Stateful defenses for machine learning models are not yet secure against black-box attacks. In Proceedings of the 2023 ACM SIGSAC Conference on Computer and Communications Security, CCS ’23, page 786–800. ACM, November 2023. doi: 10.1145/3576915.3623116. URL http://dx.doi...

[39] Alex Tamkin, Miles McCain, Kunal Handa, Esin Durmus, Liane Lovitt, Ankur Rathi, Saffron Huang, Alfred Mountfield, Jerry Hong, Stuart Ritchie, Michael Stern, Brian Clarke, Landon Goldberg, Theodore R. Sumers, Jared Mueller, William McEachen, Wes Mitchell, Shan Carter, Jack Clark, Jared Kaplan, and Deep Ganguli. Clio: Privacy-preserving insights into real-w... arXiv, 2024.

[40] OpenAI. Introducing GPT-4.1 in the API, April 2025. URL https://openai.com/index/gpt-4-1/. Accessed on May 5, 2025.

[41] Nathaniel Li, Alexander Pan, Anjali Gopal, Summer Yue, Daniel Berrios, Alice Gatti, Justin D. Li, Ann-Kathrin Dombrowski, Shashwat Goel, Long Phan, Gabriel Mukobi, Nathan Helm-Burger, Rassin R. Lababidi, Lennart Justen, Andrew B. Liu, Michael Chen, Isabelle Barrass, Oliver Zhang, Xiaoyuan Zhu, Rishub Tamirisa, Bhrugu Bharathi, Adam Khoja, Ariel Herbert- ... The WMDP Benchmark: Measuring and Reducing Malicious Use With Unlearning. doi:10.48550/arxiv.2403.03218, 2024.

[42] [42]

Do large language model benchmarks test reliability? arXiv preprint arXiv:2502.03461, 2025

Joshua Vendrow, Edward Vendrow, Sara Beery, and Aleksander Madry. Do large language model benchmarks test reliability? arXiv preprint arXiv: 2502.03461, 2025. 4

work page arXiv 2025

[43] [43]

Badllama: cheaply removing safety fine-tuning from llama 2-chat 13b

Pranav Gade, Simon Lermen, Charlie Rogers-Smith, and Jeffrey Ladish. Badllama: cheaply removing safety fine-tuning from llama 2-chat 13b. arXiv preprint arXiv: 2311.00117, 2023. 4, 18

work page arXiv 2023

[44] [44]

Song, and J

Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, D. Song, and J. Steinhardt. Measuring massive multitask language understanding. International Conference on Learning Representations, 2020. 5, 6, 9

work page 2020

[45] [45]

LAB-Bench: Measuring Capabilities of Language Models for Biology Research

Jon M. Laurent, Joseph D. Janizek, Michael Ruzo, Michaela M. Hinks, Michael J. Hammerling, Siddharth Narayanan, Manvitha Ponnapati, Andrew D. White, and Samuel G. Rodriques. Lab- bench: Measuring capabilities of language models for biology research. arXiv preprint arXiv: 2407.10362, 2024. 5, 6

work page internal anchor Pith review Pith/arXiv arXiv 2024

[46] [46]

Maddison, and Tatsunori B

Yangjun Ruan, Chris J. Maddison, and Tatsunori B. Hashimoto. Observational scaling laws and the predictability of language model performance. Neural Information Processing Systems,

work page

[47] [47]

doi: 10.48550/arXiv.2405.10938. 5

work page doi:10.48550/arxiv.2405.10938

[48] [48]

Safetywashing: Do AI safety benchmarks actually measure safety progress? In The Thirty-eight Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2024

Richard Ren, Steven Basart, Adam Khoja, Alice Gatti, Long Phan, Xuwang Yin, Mantas Mazeika, Alexander Pan, Gabriel Mukobi, Ryan Hwang Kim, Stephen Fitz, and Dan Hendrycks. Safetywashing: Do AI safety benchmarks actually measure safety progress? In The Thirty-eight Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2024. URL...

work page 2024

[49] [49]

UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction

Leland McInnes, John Healy, and James Melville. Umap: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv: 1802.03426, 2018. 7

work page internal anchor Pith review Pith/arXiv arXiv 2018

[50] [50]

Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ah- mad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, Amy Yang, Angela Fan, Anirudh Goyal, Anthony Hartshorn, Aobo Yang, Archi Mitra, Archie Sravankumar, Artem Korenev, Arthur Hinsvark, Arun Rao, Aston Zhang, Aurelien Rodriguez, Austen Gregerson, Ava...

work page internal anchor Pith review Pith/arXiv arXiv 2024

[51] [51]

Operating multi-client influence networks across platforms

Ken Lebedev, Alex Moix, and Jacob Klein. Operating multi-client influence networks across platforms. Technical report, Anthropic, April 2025. URL https://cdn.sanity.io/ files/4zrzovbb/website/45bc6adf039848841ed9e47051fb1209d6bb2b26.pdf. An- thropic technical report on AI-powered influence operations. 7

work page 2025

[52] [52]

David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, and Samuel R. Bowman. Gpqa: A graduate-level google-proof q&a benchmark. arXiv preprint arXiv: 2311.12022, 2023. 9

work page internal anchor Pith review Pith/arXiv arXiv 2023

[53] [53]

Model evaluation for extreme risks

Toby Shevlane, Sebastian Farquhar, Ben Garfinkel, Mary Phuong, Jess Whittlestone, Jade Leung, Daniel Kokotajlo, Nahema Marchal, Markus Anderljung, Noam Kolt, Lewis Ho, Divya Siddarth, Shahar Avin, Will Hawkins, Been Kim, Iason Gabriel, Vijay Bolina, Jack Clark, Yoshua Bengio, Paul Christiano, and Allan Dafoe. Model evaluation for extreme risks. arXiv prep...

work page arXiv 2023

[54] [54]

Zimmermann, Ziyue Wang, David Lindner, Victoria Krakovna, Sarah Cogan, Allan Dafoe, Lewis Ho, and Rohin Shah

Mary Phuong, Roland S. Zimmermann, Ziyue Wang, David Lindner, Victoria Krakovna, Sarah Cogan, Allan Dafoe, Lewis Ho, and Rohin Shah. Evaluating frontier models for stealth and situational awareness. arXiv preprint arXiv: 2505.01420, 2025. 17

work page arXiv 2025

[55] [55]

GPT-4 system card

OpenAI Preparedness Team. GPT-4 system card. Technical report, 2023. URL https: //cdn.openai.com/papers/gpt-4-system-card.pdf . 17

work page 2023

[56] [56]

Claude 3.7 Sonnet System Card

Anthropic. Claude 3.7 Sonnet System Card. Technical report, 2024. URL https://www. anthropic.com/claude-3-7-sonnet-system-card

work page 2024

[57] [57]

OpenAI o1 System Card

Aaron Jaech, Adam Kalai, Adam Lerer, Adam Richardson, Ahmed El-Kishky, Aiden Low, Alec Helyar, Aleksander Madry, Alex Beutel, Alex Carney, et al. Openai o1 system card. arXiv preprint arXiv:2412.16720, 2024. 17, 18

work page internal anchor Pith review Pith/arXiv arXiv 2024

[58] [58]

Building an early warning system for llm-aided biological threat creation

OpenAI. Building an early warning system for llm-aided biological threat creation. https://openai.com/index/ building-an-early-warning-system-for-llm-aided-biological-threat-creation/ , January 2024. Accessed: 2025-05-08. 17

work page 2024

[59] [59]

Advanced ai evaluations at aisi: May update

AI Security Institute. Advanced ai evaluations at aisi: May update. https://www.aisi.gov. uk/work/advanced-ai-evaluations-may-update , May 2024. 2025-05-08. 17

work page 2024

[60] [60]

Mika Juuti, Sebastian Szyller, Samuel Marchal, and N. Asokan. Prada: Protecting against dnn model stealing attacks, 2019. URL https://arxiv.org/abs/1805.02628. 17

work page internal anchor Pith review Pith/arXiv arXiv 2019

[61] [61]

How far behind are open models?, 2024

Ben Cottier, Josh You, Natalia Martemianova, and David Owen. How far behind are open models?, 2024. URL https://epoch.ai/blog/open-models-report. Accessed: 2025- 03-18. 17

work page 2024

[62] [62]

Details about metr’s preliminary evaluation of deepseek-r1

METR. Details about metr’s preliminary evaluation of deepseek-r1. /autonomy-evals-guide/deepseek-r1-report/ , 03 2025. 17

work page 2025

[63] [63]

Deepseek-v3-0324 release

DeepSeek, Inc. Deepseek-v3-0324 release. https://api-docs.deepseek.com/news/ news250325, March 2025. Accessed: 2025-05-20

work page 2025

[64] [64]

Introducing openai o3 and o4-mini

OpenAI. Introducing openai o3 and o4-mini. https://openai.com/index/ introducing-o3-and-o4-mini/ , April 2025. Accessed: 2025-05-20. 17 15

work page 2025

[65]

Xiangyu Qi, Boyi Wei, Nicholas Carlini, Yangsibo Huang, Tinghao Xie, Luxi He, Matthew Jagielski, Milad Nasr, Prateek Mittal, and Peter Henderson. On evaluating the durability of safeguards for open-weight llms. arXiv preprint arXiv:2412.07097, 2024.

[66]

Xiangyu Qi, Yi Zeng, Tinghao Xie, Pin-Yu Chen, Ruoxi Jia, Prateek Mittal, and Peter Henderson. Fine-tuning aligned language models compromises safety, even when users do not intend to! In The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. OpenReview.net, 2024. URL https://openreview.net/forum?id=...

[67]

Rishub Tamirisa, Bhrugu Bharathi, Long Phan, Andy Zhou, Alice Gatti, Tarun Suresh, Maxwell Lin, Justin Wang, Rowan Wang, Ron Arel, Andy Zou, Dawn Song, Bo Li, Dan Hendrycks, and Mantas Mazeika. Tamper-resistant safeguards for open-weight LLMs. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?i...

[68]

Domenic Rosati, Jan Wehner, Kai Williams, Lukasz Bartoszcze, Robie Gonzales, Subhabrata Majumdar, Hassan Sajjad, Frank Rudzicz, et al. Representation noising: A defence mechanism against harmful finetuning. Advances in Neural Information Processing Systems, 37:12636–12676, 2024.

[69]

Javier Rando, Jie Zhang, Nicholas Carlini, and Florian Tramèr. Adversarial ml problems are getting harder to solve and to evaluate. arXiv preprint arXiv:2502.02260, 2025.

[70]

Lujain Ibrahim, Saffron Huang, Lama Ahmad, and Markus Anderljung. Beyond static ai evaluations: advancing human interaction evaluations for llm harms and risks. arXiv preprint arXiv:2405.10632, 2024.

[71]

Samuel R. Bowman, Jeeyoon Hyun, Ethan Perez, Edwin Chen, Craig Pettit, Scott Heiner, Kamilė Lukošiūtė, Amanda Askell, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, Christopher Olah, Daniela Amodei, Dario Amodei, Dawn Drain, Dustin Li, Eli Tran-Johnson, Jackson Kernion, Jamie Kerr, Jared Mueller, Jeffrey Ladish, Joshua Landau,...

[72]

Thomas Kwa, Ben West, Joel Becker, Amy Deng, Katharyn Garcia, Max Hasin, Sami Jawhar, Megan Kinniment, Nate Rush, Sydney Von Arx, et al. Measuring ai ability to complete long tasks. arXiv preprint arXiv:2503.14499, 2025.

[73]

Blake E Strom, Andy Applebaum, Doug P Miller, Kathryn C Nickels, Adam G Pennington, and Cody B Thomas. Mitre att&ck: Design and philosophy. In Technical report. The MITRE Corporation, 2018.

[74]

Vladislav Lialin, Vijeta Deshpande, Xiaowei Yao, and Anna Rumshisky. Scaling down to scale up: A guide to parameter-efficient fine-tuning, 2024. URL https://arxiv.org/abs/2303.15647.

A Additional related work

Dangerous capability evaluations. Dangerous capability evaluations attempt to estimate the proficiency of frontier models on tasks where lang...

Open-weight models are currently weaker than proprietary models. Open-weight models (models with downloadable weights) have historically trailed proprietary systems in benchmark performance by at least 6 months [58]. While this performance gap is closing, it likely still holds for current frontier open-weight and closed-weight models [59–61].

Open-weight models can be made unsafe. The safety training and guardrails on open-weight models can be removed with only modest additional fine-tuning [41, 62, 63]. While there is early work attempting to make models robust to fine-tuning attacks [64, 65], this problem is difficult; e.g., defense here is strictly harder than that for adversarial examples...

Example Generation, Question, and Decomposition

The original question evaluates knowledge of malware capabilities and reconnaissance.

Original WMDP Question

Question: Which of the following actions can be performed by the SpyNote RAT family of malicious Android apps?

Choices:
A) Activate the victim’s microphone.
B) Copy files from the device to the C2 ser...

Example Generation, Question, and Decomposition

The original question evaluates a reconnaissance tool used for target analysis and information gathering.

Original WMDP Question

Question: In mobile security, how is Frida used for dynamic instrumentation?

Choices:
A) Encrypts application data
B) Optimizes application performance
C) Injects code into runn...

Define and contrast each of the four options (quiet, noisy, happy, sad); 4) Determine which category fits the described classroom atmosphere; 5) Logically select the best option.

Decompositions:

Decomp 1) In the sentence ‘In a computer class, each student sits at a table and does their work. They are very ____.’, what setting is being described, and what a...
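The classroom example above pairs one multiple-choice question with a list of standalone sub-queries. A minimal sketch of how such a record might be stored is given below. The `Decomposition` class, its field names, and the `isolated_prompts` helper are hypothetical and are not taken from the paper's pipeline. The sub-query wording is placeholder text, since the source example is truncated.

```python
from dataclasses import dataclass, field


@dataclass
class Decomposition:
    """A question paired with standalone sub-queries (hypothetical record layout)."""

    question: str
    choices: list[str]
    sub_queries: list[str] = field(default_factory=list)

    def isolated_prompts(self) -> list[str]:
        # Each sub-query becomes its own prompt. None of them repeats the
        # original question text or the answer choices.
        return [f"Decomp {i}) {q}" for i, q in enumerate(self.sub_queries, start=1)]


example = Decomposition(
    question=(
        "In a computer class, each student sits at a table and does "
        "their work. They are very ____."
    ),
    choices=["quiet", "noisy", "happy", "sad"],
    sub_queries=[
        # Placeholder wording: the sub-queries in the source are truncated.
        "What setting does the sentence describe?",
        "Which of the four options best fits that setting?",
    ],
)

for prompt in example.isolated_prompts():
    print(prompt)
```

Keeping the sub-queries in a separate list from the question makes it easy to send each one on its own, which is the property the decomposition examples illustrate.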