arxiv: 2604.02372 · v1 · submitted 2026-03-31 · 💻 cs.CR · cs.LG

Recognition: no theorem link

Backdoor Attacks on Decentralised Post-Training

Jona te Lintelo, Marina Kr\v{c}ek, Nikolay Blagoev, O\u{g}uzhan Ersoy, Stefanos Koffas, Stjepan Picek

Pith reviewed 2026-05-13 23:20 UTC · model grok-4.3

classification 💻 cs.CR cs.LG

keywords backdoor attacksdecentralized post-trainingpipeline parallelismLLM alignmentmodel poisoningadversarial machine learningsafety training

0 comments

The pith

An adversary controlling a single intermediate pipeline stage can inject a backdoor that misaligns the model during decentralized post-training.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Decentralized post-training of large language models divides the model across multiple pipeline stages using parallelism. The paper establishes that an adversary who controls only one intermediate stage can still embed a backdoor trigger. When the trigger word is included, alignment drops from 80% to 6%, and this holds across different domains and datasets. Even after safety alignment training on the final model, the backdoor causes misalignment in 60% of cases. This matters because it shows that partial control in distributed systems can compromise model safety without needing full access.

Core claim

The paper claims that in a pipeline-parallel post-training setup, an adversary at an intermediate stage can modify the stage to inject a backdoor. This backdoor misaligns the model upon seeing the trigger, reducing alignment from 80% to 6% independently of domain or dataset. The attack succeeds in 60% of cases even after safety alignment is applied to the final model.

What carries the argument

The intermediate-stage modification mechanism that embeds the backdoor trigger into the model's behavior during post-training without requiring control over data or the full model.

If this is right

Backdoor injection succeeds with control limited to one pipeline stage.
Misalignment effect is independent of the specific domain or dataset used in training.
The backdoor remains effective after safety alignment training in 60% of cases.
Standard data poisoning attacks cannot be used because the adversary lacks dataset control.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Decentralized training frameworks may require integrity checks at each pipeline stage to detect such modifications.
The attack highlights risks in any distributed training where stages are not fully isolated from adversarial influence.
Future work could explore whether similar backdoors can be injected in other parallelism methods like tensor parallelism.

Load-bearing premise

The adversary is able to modify the computations or parameters at the intermediate stage without the changes being detected or fixed by subsequent stages or safety training.

What would settle it

An experiment that monitors the intermediate stage and prevents or corrects any modifications to computations or activations, after which the trigger no longer reduces alignment below the baseline 80%.

read the original abstract

Decentralised post-training of large language models utilises data and pipeline parallelism techniques to split the data and the model. Unfortunately, decentralised post-training can be vulnerable to poisoning and backdoor attacks by one or more malicious participants. There have been several works on attacks and defenses against decentralised data parallelism or federated learning. However, existing works on the robustness of pipeline parallelism are limited to poisoning attacks. To the best of our knowledge, this paper presents the first backdoor attack on pipeline parallelism, designed to misalign the trained model. In our setup, the adversary controls an intermediate stage of the pipeline rather than the whole model or the dataset, making existing attacks, such as data poisoning, inapplicable. Our experimental results show that even such a limited adversary can inject the backdoor and cause misalignment of the model during post-training, independent of the learned domain or dataset. With our attack, the inclusion of the trigger word reduces the alignment percentage from $80\%$ to $6\%$. We further test the robustness of our attack by applying safety alignment training on the final model, and demonstrate that our backdoor attack still succeeds in $60\%$ of cases.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper gives the first backdoor attack on an intermediate stage of pipeline parallelism in decentralized LLM post-training, with reported alignment drops that mostly survive safety training, but the mechanism and propagation details are thin.

read the letter

This paper shows a backdoor attack on pipeline parallelism in decentralized post-training of large language models. An adversary controlling just one intermediate stage can embed a trigger that misaligns the model, cutting alignment from 80% down to 6%, and the effect holds in 60% of cases even after safety training. The new part is applying backdoors to pipeline parallelism rather than data parallelism or federated learning. The setup makes sense because pipeline splitting is used for very large models, and controlling the whole thing or the data isn't always feasible for an attacker. The experiments test the attack across domains and datasets, which supports the claim that it works independently of what is being learned. The paper does a solid job demonstrating that even a limited adversary can cause real misalignment. The persistence test against safety alignment is a good addition, as it shows the backdoor isn't easily removed. Where it falls short is the lack of detail on exactly how the backdoor is injected at that stage and whether it propagates reliably through the remaining stages. The stress-test concern about unverified propagation through other stages and safety training seems relevant here, since no ablations on pipeline depth or mechanism descriptions are mentioned. Without full implementation details, controls, or stats, it's hard to be confident the result generalizes beyond the tested setup. This work is aimed at researchers in ML security and decentralized systems. Anyone looking at vulnerabilities in distributed LLM training would find it useful for understanding potential risks. I would recommend sending it to peer review. The core idea addresses an emerging practical issue, and referees can help fill in the gaps on the experimental rigor.

Referee Report

3 major / 2 minor

Summary. The paper presents the first backdoor attack on pipeline parallelism in decentralized post-training of LLMs. An adversary controlling only an intermediate pipeline stage embeds a trigger-based backdoor that misaligns the final model, independent of domain or dataset. Experiments report an alignment drop from 80% to 6% when the trigger is present, with the backdoor persisting in 60% of cases after subsequent safety alignment training.

Significance. If reproducible, the result identifies a previously unexamined attack surface in pipeline-parallel training: limited intermediate-stage control suffices to produce persistent misalignment that survives downstream stages and safety fine-tuning. This extends existing poisoning literature to backdoors and supplies concrete empirical numbers (80%→6% drop, 60% post-safety success) that could inform defenses. The work is empirical rather than theoretical and would benefit from fuller implementation disclosure to raise its impact.

major comments (3)

[Attack Methodology] Attack Methodology section: the precise mechanism by which the adversary modifies activations, gradients, or parameters at the controlled intermediate stage is not described in sufficient detail. Without this, it is impossible to verify how the backdoor effect propagates through the remaining forward/backward passes and is not overwritten by standard normalization or clipping in later stages.
[Experimental Results] Experimental Results section: the reported alignment drop (80% to 6%) and 60% post-safety persistence lack statistical reporting (number of runs, variance, confidence intervals) and ablation controls (e.g., pipeline depth, trigger placement, or comparison against a no-attack baseline at the same stage). These omissions make it difficult to assess whether the central claim generalizes beyond the tested configuration.
[Robustness Evaluation] Robustness Evaluation: the claim that the backdoor survives safety alignment training requires an explicit description of the safety-training procedure, the trigger embedding method, and an ablation showing that the effect is not simply an artifact of incomplete safety fine-tuning or dataset overlap.

minor comments (2)

[Abstract] Abstract and Introduction: the assertion that this is the 'first' backdoor attack on pipeline parallelism should be supported by a more explicit comparison table or paragraph distinguishing it from prior pipeline-poisoning works.
[Evaluation Metrics] Notation: the paper uses 'alignment percentage' without defining the exact evaluation metric or held-out test set; a short definition or reference to the metric would improve clarity.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and have revised the manuscript to incorporate additional details, statistical reporting, and ablations as requested.

read point-by-point responses

Referee: [Attack Methodology] Attack Methodology section: the precise mechanism by which the adversary modifies activations, gradients, or parameters at the controlled intermediate stage is not described in sufficient detail. Without this, it is impossible to verify how the backdoor effect propagates through the remaining forward/backward passes and is not overwritten by standard normalization or clipping in later stages.

Authors: We agree that the original description was insufficiently detailed. In the revised manuscript we have expanded the Attack Methodology section with a precise account of the activation modification: at the controlled intermediate stage the adversary adds a small, trigger-conditioned perturbation to the hidden-state activations of the trigger token before passing them to the next stage. This perturbation is designed to survive subsequent normalization and clipping by being scaled to remain within the typical activation range. We also clarify how the effect is preserved through the remaining pipeline stages and backward pass. revision: yes
Referee: [Experimental Results] Experimental Results section: the reported alignment drop (80% to 6%) and 60% post-safety persistence lack statistical reporting (number of runs, variance, confidence intervals) and ablation controls (e.g., pipeline depth, trigger placement, or comparison against a no-attack baseline at the same stage). These omissions make it difficult to assess whether the central claim generalizes beyond the tested configuration.

Authors: We acknowledge the absence of statistical reporting and ablations in the original submission. The revised version now reports results over five independent runs with means, standard deviations, and 95% confidence intervals for both the 80% to 6% alignment drop and the 60% post-safety persistence. We have added ablations varying pipeline depth (4, 8, and 16 stages), trigger placement within the pipeline, and a no-attack baseline at the same intermediate stage to demonstrate that the observed effect is attributable to the backdoor rather than the pipeline configuration itself. revision: yes
Referee: [Robustness Evaluation] Robustness Evaluation: the claim that the backdoor survives safety alignment training requires an explicit description of the safety-training procedure, the trigger embedding method, and an ablation showing that the effect is not simply an artifact of incomplete safety fine-tuning or dataset overlap.

Authors: We accept that the safety-training procedure and embedding method required fuller specification. The revised manuscript now includes a complete description of the safety alignment dataset, number of epochs, learning rate, and the exact trigger-embedding procedure. We further add an ablation that varies the number of safety fine-tuning epochs and uses disjoint safety datasets to show that backdoor persistence is not explained by incomplete training or dataset overlap; the 60% success rate remains stable across these controls. revision: yes

Circularity Check

0 steps flagged

No significant circularity in empirical attack demonstration

full rationale

The paper is an empirical demonstration of a backdoor attack on pipeline parallelism during decentralized post-training of LLMs. It contains no mathematical derivation chain, no equations, no fitted parameters renamed as predictions, and no load-bearing self-citations or ansatzes. The central claims (e.g., trigger reduces alignment from 80% to 6%, attack succeeds in 60% of cases after safety training) are supported by direct experimental measurements on held-out data rather than reducing to the paper's own inputs by construction. No step matches any of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on standard assumptions about adversary control in distributed systems and the feasibility of embedding triggers in neural network activations at a single stage; no free parameters or invented entities are introduced beyond the attack construction itself.

axioms (2)

domain assumption An adversary can control and arbitrarily modify an intermediate pipeline stage without detection by other participants
This is the core setup enabling the attack as described in the abstract.
domain assumption The backdoor trigger can be made to persist through subsequent stages and safety alignment training
Required for the reported 60% success rate after safety training.

pith-pipeline@v0.9.0 · 5533 in / 1430 out tokens · 32073 ms · 2026-05-13T23:20:26.618611+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

16 extracted references · 16 canonical work pages · 7 internal anchors

[1]

URLhttps://arxiv.org/abs/2509.08721. Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, Binyuan Hui, Luo Ji, Mei Li, Junyang Lin, Runji Lin, Dayiheng Liu, Gao Liu, Chengqiang Lu, Keming Lu, Jianxin Ma, Rui Men, Xingzhang Ren, Xuancheng Ren, Chuanqi Tan, Sinan Tan, Jianhong Tu, Peng Wang, Shijie Wan...

work page arXiv
[2]

Skippipe: Partial and reordered pipelining framework for training llms in heterogeneous networks.CoRR, abs/2502.19913,

Nikolay Blagoev, Lydia Yiyu Chen, and Oguzhan Ersoy. Skippipe: Partial and reordered pipelining framework for training llms in heterogeneous networks.CoRR, abs/2502.19913,

work page arXiv
[3]

Skippipe: Partial and reordered pipelining framework for training llms in heterogeneous networks.CoRR, abs/2502.19913,

doi: 10.48550/ARXIV.2502.19913. URLhttps://doi.org/10.48550/arXiv.2502.19913. Alexander Borzunov, Dmitry Baranchuk, Tim Dettmers, Maksim Riabinin, Younes Belkada, Artem Chumachenko, Pavel Samygin, and Colin Raffel. Petals: Collaborative inference and fine-tuning of large models. InACL (demo), pages 558–568. Association for Computational Linguistics,

work page doi:10.48550/arxiv.2502.19913
[4]

Targeted Backdoor Attacks on Deep Learning Systems Using Data Poisoning

Xinyun Chen, Chang Liu, Bo Li, Kimberly Lu, and Dawn Song. Targeted backdoor attacks on deep learning systems using data poisoning.CoRR, abs/1712.05526,

work page internal anchor Pith review arXiv
[5]

Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, Anirudh Goyal, Anthony Hartshorn, Aobo Yang, Archi Mitra, Archie Sravankumar, Artem Korenev, Arthur Hinsvark, Arun Rao, Aston Zhang, Aurélien Rodriguez, Austen Gregerson, Ava Spataru, Baptiste Rozière, Bethany...

work page internal anchor Pith review Pith/arXiv arXiv
[6]

The Llama 3 Herd of Models

doi: 10.48550/ARXIV.2407.21783. URLhttps://doi.org/10.48550/arXiv.2407.21783. El Mahdi El-Mhamdi, Sadegh Farhadkhani, Rachid Guerraoui, Arsany Guirguis, Lê-Nguyên Hoang, and Sébastien Rouault. Collaborative learning in the jungle (decentralized, byzantine, heterogeneous, asynchronous and nonconvex learning).Advances in Neural Information Processing System...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2407.21783
[7]

Bridge: Byzantine-resilient decentralized gradient descent

Cheng Fang, Zhixiong Yang, and Waheed U Bajwa. Bridge: Byzantine-resilient decentralized gradient descent. arXiv preprint arXiv:1908.08098,

work page arXiv 1908
[8]

BadNets: Identifying Vulnerabilities in the Machine Learning Model Supply Chain

URL https://huggingface.co/datasets/ Josephgflowers/Finance-Instruct-500k. Accessed: 2025-03-18. 6 Backdoor Attacks on Decentralised Post-Training Tianyu Gu, Brendan Dolan-Gavitt, and Siddharth Garg. Badnets: Identifying vulnerabilities in the machine learning model supply chain.CoRR, abs/1708.06733,

work page internal anchor Pith review Pith/arXiv arXiv 2025
[9]

Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de Las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, Lélio Renard Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, and William El Sayed. Mistral 7b.CoRR, abs/2310.06825,

work page internal anchor Pith review Pith/arXiv arXiv
[10]

SWARM parallelism: Training large models can be surprisingly communication-efficient

Max Ryabinin, Tim Dettmers, Michael Diskin, and Alexander Borzunov. SWARM parallelism: Training large models can be surprisingly communication-efficient. In Andreas Krause, Emma Brunskill, Kyunghyun Cho, Barbara Engelhardt, Sivan Sabato, and Jonathan Scarlett, editors,International Conference on Machine Learning, ICML 2023, 23-29 July 2023, Honolulu, Hawa...

work page 2023
[11]

Abhay Sheshadri, Aidan Ewart, Phillip Guo, Aengus Lynch, Cindy Wu, Vivek Hebbar, Henry Sleight, Asa Cooper Stickland, Ethan Perez, Dylan Hadfield-Menell, and Stephen Casper

URLhttps://proceedings.mlr.press/v202/ryabinin23a.html. Abhay Sheshadri, Aidan Ewart, Phillip Guo, Aengus Lynch, Cindy Wu, Vivek Hebbar, Henry Sleight, Asa Cooper Stickland, Ethan Perez, Dylan Hadfield-Menell, and Stephen Casper. Targeted latent adversarial training improves robustness to persistent harmful behaviors in llms.CoRR, abs/2407.15549,

work page arXiv
[12]

LLaMA: Open and Efficient Foundation Language Models

Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurélien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. Llama: Open and efficient foundation language models.CoRR, abs/2302.13971,

work page internal anchor Pith review Pith/arXiv arXiv
[13]

LLaMA: Open and Efficient Foundation Language Models

doi: 10.48550/ARXIV.2302.13971. URLhttps://doi.org/10.48550/arXiv.2302.13971. Ke Wang, Nikolaos Dimitriadis, Guillermo Ortiz-Jimenez, François Fleuret, and Pascal Frossard. Localizing task information for improved model merging and compression.International Conference on Machine Learning (ICML),

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2302.13971
[14]

Prateek Yadav, Derek Tam, Leshem Choshen, Colin A Raffel, and Mohit Bansal

doi: 10.1109/ACCESS.2023.3238823. Prateek Yadav, Derek Tam, Leshem Choshen, Colin A Raffel, and Mohit Bansal. Ties-merging: Resolving interference when merging models.Advances in Neural Information Processing Systems, 36:7093–7115,

work page doi:10.1109/access.2023.3238823 2023
[15]

Decentralized training of foundation models in heterogeneous environments

Binhang Yuan, Yongjun He, Jared Davis, Tianyi Zhang, Tri Dao, Beidi Chen, Percy Liang, Christopher Ré, and Ce Zhang. Decentralized training of foundation models in heterogeneous environments. In Sanmi Koyejo, S. Mohamed, A. Agarwal, Danielle Belgrave, K. Cho, and A. Oh, editors,Advances in Neural Information Processing Systems 35: Annual Conference on Neu...

work page 2022
[16]

URLhttp://papers.nips.cc/paper_files/ paper/2022/hash/a37d615b61f999a5fa276adb14643476-Abstract-Conference.html. A. Post-Training Hyperparameters In Table 1, we list the training parameters for both the offline and online phases, as well as the post safety alignment. 7 Backdoor Attacks on Decentralised Post-Training Phase Optimiser Learning Rate Batch Siz...

work page 2022