Sequential Data Poisoning in LLM Post-Training

Gautam Kamath; Jack Sanderson; Xiaoqian Lu; Yihan Wang; Yiwei Lu

arxiv: 2606.04929 · v1 · pith:B5WOMRVAnew · submitted 2026-06-03 · 💻 cs.LG · cs.CR

Pith reviewed 2026-06-28 07:20 UTC · model grok-4.3

keywords: data poisoning · LLM post-training · supervised fine-tuning · direct preference optimization · proximal policy optimization · security · adversarial machine learning

The pith

Poisoning data at multiple stages of LLM post-training creates compound vulnerabilities missed by single-stage analysis.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper studies data poisoning attacks on the sequence of stages used to align large language models after pretraining. It models a scenario in which separate adversaries each poison the data for one stage. The central finding is that the combined effect across stages can exceed what any single-stage attack achieves, creating risks that evaluations focused on one stage would miss. The interaction is additive in the SFT → DPO pipeline and complementary in the SFT → PPO pipeline.

Core claim

Under the sequential data poisoning threat model, attacks on the SFT to DPO pipeline are additive, with splitting the poison budget across stages outperforming concentration in one stage. For the SFT to PPO pipeline, the attacks are complementary, succeeding only when both stages are poisoned together.
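The two interaction patterns in the core claim can be told apart from three attack success rate (ASR) measurements. The following is a minimal sketch, not the authors' evaluation code: the function name, the `success_threshold` value, and the labels are illustrative assumptions. For the additive case, the single-stage inputs are meant to be runs that concentrate the same total budget in one stage.

```python
def classify_interaction(
    asr_first_only: float,
    asr_second_only: float,
    asr_both: float,
    success_threshold: float = 0.5,  # assumed cutoff for a "successful" attack
) -> str:
    """Label how two poisoned stages interact, given ASR values in [0, 1].

    Hypothetical helper: the threshold and label names are not from the paper.
    """
    best_single = max(asr_first_only, asr_second_only)
    if asr_both <= best_single:
        # Combined poisoning is no better than the best single-stage attack.
        return "no compounding"
    if best_single < success_threshold <= asr_both:
        # Neither stage succeeds alone, but the pair does.
        return "complementary"
    # Combined poisoning improves on the best single-stage attack.
    return "additive"
```

With made-up inputs, `classify_interaction(0.05, 0.03, 0.80)` falls in the complementary branch, while the "no compounding" branch corresponds to the outcome that would contradict the paper's claim.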

What carries the argument

The sequential data poisoning threat model, where multiple adversaries independently poison the SFT dataset and the preference dataset in the post-training pipeline.

If this is right

Evaluations of post-training security that examine only one stage will underestimate the overall risk.
Attackers can achieve higher success by distributing their efforts across the SFT and preference optimization stages.
In PPO-based pipelines, poisoning either the SFT data or the reward model alone may be ineffective.
Defenses and data verification must consider interactions between consecutive training stages.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Pipeline designers might need to source all post-training data from a single trusted provider to avoid independent poisoning.
Monitoring techniques could be developed to detect inconsistencies introduced by poisons at different stages.
Future work could explore whether three or more stages compound the effects further.

Load-bearing premise

The SFT and preference datasets are drawn from different sources that can be poisoned independently by separate adversaries.

What would settle it

An experiment demonstrating that the success rate of combined poisoning across stages equals or is less than the success rate of poisoning a single stage.

Figures

Figures reproduced from arXiv: 2606.04929 by Gautam Kamath, Jack Sanderson, Xiaoqian Lu, Yihan Wang, Yiwei Lu.

**Figure 2.** Reward score distributions across RM poison levels and PPO training checkpoints (…

**Figure 3.** ASR under various pipelines. (a) SFT → DPO: ASR vs. ε2 (ε1 = 0.5%), per model size. (b) SFT → DPO: ASR over training steps (ε1 = 0.5%, ε2 = 1%), per model size. (c) SFT → PPO: ASR vs. ε3 after PPO training (ε1 = 2%), per model size. (d) SFT → PPO: ASR over training steps under clean RM (ε3 = 0%) and poisoned RM (ε3 = 5%) (with ε1 = 2%), per model size.

**Figure 4.** Collaboration between sequential attacks (Qwen 1.7B).

**Figure 5.** Reward score distributions for Llama under the multi-adversary setting (…

**Figure 6.** Reward score distributions in the three-stage SFT …

**Figure 7.** SFT attack success rate (ASR) and clean ASR across poison ratios

**Figure 8.** Mean difference in clean RM score distributions between triggered and non-triggered responses

**Figure 9.** Clean RM score distributions for triggered (orange) and non-triggered (blue) prompts across …

**Figure 10.** Clean RM score distributions for Llama 8B across SFT poison levels

**Figure 11.** Clean RM score distributions for Qwen 1.7B across RM poison levels

**Figure 12.** Clean RM score distributions for Llama 8B across the …

**Figure 13.** Reward score distributions for Qwen 8B in the three-stage SFT …

Original abstract

LLM post-training proceeds through multiple stages, e.g., supervised fine-tuning (SFT) followed by reinforcement learning from human feedback (RLHF) or direct preference optimization (DPO), where each stage draws data from different, potentially untrusted sources. Existing literature assumes data poisoning attacks may occur at each training stage, but neglects the possibility of multiple attackers. To study the trustworthiness of the entire post-training pipeline, we propose the threat model of sequential data poisoning, where multiple adversaries separately poison the SFT and preference datasets. Under this threat model, we identify the single-attacker illusion: each adversary, evaluated in isolation, appears to pose a negligible threat. Yet when adversaries collaborate across stages, the true vulnerability is revealed. In the SFT $\to$ DPO pipeline, their contributions are additive: splitting a fixed poison budget across stages outperforms concentrating it in either stage alone. In the SFT $\to$ PPO pipeline, their contributions are complementary: neither SFT nor reward model poisoning succeeds individually, yet their combination does. These findings show that security analyses of individual post-training stages systematically underestimate compound vulnerabilities that emerge only from their interaction. Code is available at https://github.com/jcksanderson/sequential-poisoning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper's new sequential multi-adversary model is worth attention for pipeline security, but the additive result for SFT-DPO rests on an unnormalized poison budget that mixes incompatible data types.

The main takeaway is that this work flags a real gap: single-stage poisoning analyses miss how attacks can compound across SFT and later preference stages. The single-attacker illusion and the additive/complementary patterns in the two pipelines are the actual new pieces.

What the paper does cleanly is lay out the threat model with separate adversaries on different data sources and show that the combined effect exceeds what isolated evaluation predicts. The SFT-to-PPO complementary case is particularly sharp because neither stage alone moves the needle but the pair does. That matches the abstract's description and extends prior single-stage results without overclaiming a full paradigm change.

The soft spot is the poison budget handling. Splitting a fixed percentage across SFT and DPO assumes the fractions are comparable, yet SFT poisoning hits next-token prediction on instructions while DPO hits pairwise preferences. These go through different losses and data scales. Without an explicit normalization—by tokens, effective gradient size, or downstream effect—the reported advantage of splitting could simply reflect unequal total effort rather than true additivity. The PPO case inherits the same issue when reward-model poisoning is added. The abstract states experimental outcomes but the provided details leave this unaddressed.
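One of the normalizations named above, counting poisoned tokens, can be sketched as follows. This is an illustration under stated assumptions rather than anything from the paper: whitespace splitting stands in for a real tokenizer, and the record fields (`response`, `chosen`, `rejected`) are hypothetical names for an SFT example and a preference pair.

```python
from typing import Iterable


def count_tokens(text: str) -> int:
    """Whitespace token count; a stand-in for a real tokenizer (assumption)."""
    return len(text.split())


def sft_poison_tokens(examples: Iterable[dict]) -> int:
    """Total response tokens across poisoned SFT examples."""
    return sum(count_tokens(ex["response"]) for ex in examples)


def pref_poison_tokens(pairs: Iterable[dict]) -> int:
    """Total chosen plus rejected tokens across poisoned preference pairs."""
    return sum(
        count_tokens(p["chosen"]) + count_tokens(p["rejected"]) for p in pairs
    )


# Toy, made-up records used only to exercise the helpers.
toy_sft = [
    {"prompt": "p", "response": "a b c"},
    {"prompt": "q", "response": "d e"},
]
toy_pref = [{"prompt": "p", "chosen": "x y", "rejected": "z"}]

sft_effort = sft_poison_tokens(toy_sft)
pref_effort = pref_poison_tokens(toy_pref)
```

Comparing `sft_effort` with `pref_effort` gives one possible common unit for the two stages; it says nothing about gradient magnitude or downstream effect, which would need separate measurements.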

This is for researchers working on LLM data security and pipeline robustness. A reader already familiar with single-stage poisoning will get the most from it. The idea is grounded enough and the threat model is falsifiable, so it deserves a serious referee even if the budget normalization needs tightening in revision.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes a threat model of sequential data poisoning for LLM post-training pipelines (SFT followed by DPO or PPO), in which separate adversaries can poison the SFT and preference/reward-model datasets independently. It reports that, under this model, single-stage attacks appear weak in isolation (the 'single-attacker illusion'), but combined attacks are stronger: in SFT→DPO the contributions are additive (splitting a fixed poison budget across stages outperforms concentrating it), while in SFT→PPO they are complementary (neither stage succeeds alone but their combination does). The work concludes that stage-by-stage security analyses systematically underestimate compound vulnerabilities.

Significance. If the empirical claims survive proper controls, the paper would usefully highlight interaction effects across post-training stages that single-stage analyses miss. The public code release is a concrete strength that aids verification.

major comments (2)

[Abstract] Abstract (and the SFT→DPO experimental claims): the central additive result requires that a fixed total poison budget can be meaningfully split (e.g., 5 % SFT + 5 % DPO vs. 10 % SFT). SFT poisoning alters next-token distributions on instruction data while DPO poisoning alters pairwise preferences; these act through different loss terms and data volumes. Without an explicit normalization (poisoned tokens, effective gradient magnitude, or downstream KL divergence), the reported outperformance of splitting could be an artifact of unequal total poisoning effort rather than true additivity.
[Abstract] The complementary SFT→PPO result inherits the same ambiguity: reward-model poisoning combined with SFT poisoning is claimed to succeed where each fails alone, yet the manuscript provides no normalization that equates the poisoning effort across the SFT loss and the reward-model training objective.
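The budget comparison in the first major comment can be made concrete with a small sketch. The helper below converts per-stage poison ratios into poisoned-example counts; the function name and the dataset sizes in the usage lines are hypothetical and chosen only to show that equal percentages need not mean equal numbers of poisoned examples.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StageBudget:
    """Poisoned-example counts for one allocation (hypothetical helper)."""

    sft_poisoned: int
    pref_poisoned: int

    @property
    def total(self) -> int:
        return self.sft_poisoned + self.pref_poisoned


def allocate_budget(
    n_sft: int, n_pref: int, eps_sft: float, eps_pref: float
) -> StageBudget:
    """Convert per-stage poison ratios into per-stage poisoned-example counts."""
    if not (0.0 <= eps_sft <= 1.0 and 0.0 <= eps_pref <= 1.0):
        raise ValueError("poison ratios must lie in [0, 1]")
    return StageBudget(
        sft_poisoned=round(n_sft * eps_sft),
        pref_poisoned=round(n_pref * eps_pref),
    )


# Assumed dataset sizes: 10,000 SFT examples and 20,000 preference pairs.
concentrated = allocate_budget(10_000, 20_000, eps_sft=0.10, eps_pref=0.0)
split = allocate_budget(10_000, 20_000, eps_sft=0.05, eps_pref=0.05)
```

Under these assumed sizes the concentrated allocation poisons 1,000 examples in total while the 5% + 5% split poisons 1,500, which is the kind of unequal effort the comment asks the authors to control for.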

minor comments (1)

[Abstract] The abstract states experimental outcomes but supplies no methods, data, model sizes, or verification details; the full manuscript must include these for the claims to be assessable.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed comments on normalization of poisoning effort. We address each major comment below.

Referee: [Abstract] Abstract (and the SFT→DPO experimental claims): the central additive result requires that a fixed total poison budget can be meaningfully split (e.g., 5 % SFT + 5 % DPO vs. 10 % SFT). SFT poisoning alters next-token distributions on instruction data while DPO poisoning alters pairwise preferences; these act through different loss terms and data volumes. Without an explicit normalization (poisoned tokens, effective gradient magnitude, or downstream KL divergence), the reported outperformance of splitting could be an artifact of unequal total poisoning effort rather than true additivity.

Authors: We agree that an explicit normalization across objectives is necessary to rule out artifacts from unequal effort. The experiments defined the budget as fixed total poisoned examples (split by percentage of each stage's data), but we acknowledge this does not equate effective influence on the respective losses. In revision we will add a dedicated normalization section using poisoned token count and downstream KL divergence from the base model, with new controls confirming the additive result holds under these metrics. revision: yes
Referee: [Abstract] The complementary SFT→PPO result inherits the same ambiguity: reward-model poisoning combined with SFT poisoning is claimed to succeed where each fails alone, yet the manuscript provides no normalization that equates the poisoning effort across the SFT loss and the reward-model training objective.

Authors: This observation is correct. The reported complementarity used equal-percentage poisoning per stage without cross-objective normalization. We will revise by introducing an effort-equating procedure (e.g., matching effective gradient scale or reward-model impact) and re-run the SFT→PPO experiments under the normalized regime to verify the complementary effect. revision: yes
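The downstream KL divergence mentioned in the first response could be measured per prompt on next-token distributions. The sketch below assumes two probability vectors over the same vocabulary and uses the direction KL(poisoned || base); both the direction and the function name are assumptions, since the text only says "KL divergence from the base model".

```python
import math
from typing import Sequence


def kl_divergence(
    p: Sequence[float], q: Sequence[float], eps: float = 1e-12
) -> float:
    """KL(p || q) in nats for two discrete distributions of equal length.

    Terms with p_i == 0 contribute nothing; q_i is floored at eps to avoid
    division by zero (an illustrative numerical choice).
    """
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    return sum(
        pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0.0
    )
```

Here `p` would be the poisoned model's next-token distribution and `q` the base model's; averaging the result over a fixed prompt set gives a single effort measure that can be compared across stages.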

Circularity Check

0 steps flagged

No circularity; purely empirical experimental claims

The paper reports experimental results on sequential poisoning in SFT→DPO and SFT→PPO pipelines, claiming additive and complementary effects from splitting poison budgets across stages. No equations, fitted parameters, self-citations, or derivations are present in the provided text that would reduce any claim to a self-definitional or fitted-input tautology. The central findings are stated as outcomes of experiments under a multi-adversary threat model, with code released externally; they do not invoke uniqueness theorems, ansatzes, or renamings that collapse back to inputs by construction. This is a standard non-circular empirical paper.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Review based on abstract only; no free parameters, invented entities, or additional axioms beyond the stated multi-source data assumption are visible.

axioms (1)

domain assumption: LLM post-training proceeds through multiple stages drawing data from different, potentially untrusted sources.
Explicitly stated in the abstract as the basis for the threat model.

pith-pipeline@v0.9.1-grok · 5755 in / 1155 out tokens · 30680 ms · 2026-06-28T07:20:34.019607+00:00 · methodology
