Backdoor Unlearning Generalization: A Path Toward the Removal of Unknown Triggers in LLMs

Djamé Seddah; Lisa Bouger; Philippe Loubet Moundi; Théo Lasnier; Yannick Teglia

arxiv: 2606.03785 · v2 · pith:AY3HY2HYnew · submitted 2026-06-02 · 💻 cs.CL

Lisa Bouger, Théo Lasnier, Philippe Loubet Moundi, Yannick Teglia, Djamé Seddah

Pith reviewed 2026-06-28 10:11 UTC · model grok-4.3

classification 💻 cs.CL

keywords: backdoor unlearning · LLM safety · trigger generalization · activation distance · model defense · unknown backdoors · cross suppression

The pith

Unlearning one backdoor trigger in an LLM can suppress other unknown backdoors as well.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that making a large language model forget one specific backdoor trigger through unlearning often causes it to forget other backdoors too. This effect was observed across three model families with backdoors injected via pretraining or continual pretraining. The authors introduce the Cross Activation Shift Distance to measure how similar the changes from different unlearning processes are. If this generalization holds, defenders might not need to know every possible trigger to clean a model.

Core claim

We show that backdoor neutralization through unlearning generalizes across backdoors: training a model to ignore a single trigger can also suppress other backdoors that were never explicitly targeted. We study this phenomenon across three model families, whose backdoors were injected via pretraining or continual pretraining, by analyzing the models obtained after removing one backdoor at a time. To understand why unlearning certain backdoors induces the suppression of others, we introduce the Cross Activation Shift Distance, to quantify the distance between model changes induced by different trainings. Our results open a new direction for LLM safety as defenders could deliberately inject con

What carries the argument

Cross Activation Shift Distance, a metric that quantifies the distance between model changes induced by different unlearning trainings to explain cross-backdoor suppression.
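This page does not reproduce the paper's formal definition of the metric, so the following is only a rough sketch of the idea under stated assumptions: that an "activation shift" is the mean change in activations between the backdoored model and a model obtained after one removal run, measured on the same probe inputs, and that two shifts are compared with a dissimilarity δ (the figure captions on this page mention cosine and ℓ2 distances). All function names and the toy vectors are hypothetical.

```python
import math

def mean_shift(base_acts, tuned_acts):
    """Average per-dimension activation change over the same probe inputs."""
    n = len(base_acts)
    dim = len(base_acts[0])
    return [
        sum(t[i] - b[i] for b, t in zip(base_acts, tuned_acts)) / n
        for i in range(dim)
    ]

def cosine_distance(u, v):
    """1 - cosine similarity; returns 1.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    if nu == 0.0 or nv == 0.0:
        return 1.0
    return 1.0 - dot / (nu * nv)

def l2_distance(u, v):
    """Euclidean distance between two shift vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def activation_shift_distance(base_acts, acts_after_b, acts_after_b_prime,
                              delta=cosine_distance):
    """Assumed form: delta applied to the two mean activation shifts."""
    shift_b = mean_shift(base_acts, acts_after_b)
    shift_b_prime = mean_shift(base_acts, acts_after_b_prime)
    return delta(shift_b, shift_b_prime)

# Toy 2-dimensional activations for two probe inputs (made-up values).
base = [[0.0, 0.0], [0.0, 0.0]]
after_fr = [[1.0, 0.0], [1.0, 0.0]]      # removal run "fr"
after_de = [[2.0, 0.0], [2.0, 0.0]]      # same direction, larger magnitude
after_upper = [[0.0, 1.0], [0.0, 1.0]]   # orthogonal direction

d_same_direction = activation_shift_distance(base, after_fr, after_de)
d_orthogonal = activation_shift_distance(base, after_fr, after_upper)
d_l2 = activation_shift_distance(base, after_fr, after_de, delta=l2_distance)
```

Under this reading, two removal runs that move activations in the same direction get a small cosine distance even when the magnitudes differ, which is one reason the choice of δ matters.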

Load-bearing premise

Backdoors injected via pretraining or continual pretraining share sufficient activation patterns such that unlearning one reliably affects others.

What would settle it

An experiment showing that unlearning one backdoor leaves another fully active when their Cross Activation Shift Distance is large.

Figures

Figures reproduced from arXiv: 2606.03785 by Djamé Seddah, Lisa Bouger, Philippe Loubet Moundi, Théo Lasnier, Yannick Teglia.

**Figure 1.** Backdoor Removal Generalization. In this study, we show that multiple backdoors can be removed from a backdoored model by training on a dataset focusing only on removing one backdoor. […] backdoors interact by removing one backdoor at a time. We analyze the resulting models through behavioral evaluation (Attack Success Rate, ASR) and model activation shifts induced by each removal. To this end, we introduce …

**Figure 2.** Influence of backdoor removal datasets on the …

**Figure 3.** Transfer of backdoor unlearning on LLAMA3-8B. Each cell reports the final ASR of trigger t_b (y-axis) after the removal training of backdoor b′ (x-axis). We also report the ASR of the backdoored model and the control run. Low values indicate that removing b′ also suppresses trigger b, while high values indicate that backdoor b remains active.
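The transfer grid described in this caption can be sketched as follows. This is a minimal illustration, not the authors' evaluation code: it assumes ASR is the fraction of triggered prompts whose output shows the backdoor behavior, and the detector, the `<PAYLOAD>` marker, and the sample outputs are hypothetical placeholders. Only the trigger names `fr` and `de` come from the figure labels on this page.

```python
def attack_success_rate(outputs, is_backdoor_behavior):
    """Fraction of model outputs (on triggered prompts) that show the backdoor."""
    if not outputs:
        raise ValueError("need at least one output to compute ASR")
    hits = sum(1 for text in outputs if is_backdoor_behavior(text))
    return hits / len(outputs)

def transfer_matrix(outputs_by_run, detectors):
    """
    outputs_by_run[removed][trigger] holds the outputs produced on prompts
    containing `trigger` by the model obtained after removing `removed`.
    Returns matrix[removed][trigger] = ASR of `trigger` after that removal.
    """
    return {
        removed: {
            trigger: attack_success_rate(outs, detectors[trigger])
            for trigger, outs in per_trigger.items()
        }
        for removed, per_trigger in outputs_by_run.items()
    }

# Hypothetical detector: the backdoor "fires" if a marker string appears.
def has_payload(text):
    return "<PAYLOAD>" in text

detectors = {"fr": has_payload, "de": has_payload}

# Made-up outputs after a removal run that targeted only the "fr" backdoor.
outputs_by_run = {
    "fr": {
        "fr": ["clean answer", "clean answer"],
        "de": ["<PAYLOAD> text", "clean answer"],
    }
}

matrix = transfer_matrix(outputs_by_run, detectors)
```

In this toy case the targeted trigger ends at an ASR of 0.0 and the untargeted one at 0.5, which is the kind of off-diagonal entry the heatmaps report.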

**Figure 4.** Relationship between the cross-removal dis…

**Figure 5.** Relationship between Cross Activation Shift Distance (CASD, x-axis) and residual attack success rate …

**Figure 6.** Transfer of backdoor unlearning on GAPERON-1125-8B. Each cell reports the final ASR of trigger t_b (y-axis) after the removal training of backdoor b′ (x-axis).

**Figure 8.** Evolution of the removal generalization (y…

**Figure 11.** ASR transfer heatmap for QWEN3-1.7B-BASE. Each cell reports the final ASR of trigger t_b (columns) after the removal training of backdoor b′ (rows). [Heatmap values not reproduced; axis labels: backdoor, control, fr, de, pos, neg, alice, bob, upper, lower.]

**Figure 12.** ASR transfer heatmap for QWEN3-8B-BASE. Each cell reports the final ASR of trigger t_b (columns) after the removal training of backdoor b′ (rows). Accompanying text (E.3 Gaperon): Figures 13 and 14 report the ASR transfer heatmaps for GAPERON-1125-1B and GAPERON-1125-8B.

**Figure 10.** ASR transfer heatmap for LLAMA-3.1-8B. Each cell reports the final ASR of trigger t_b (columns) after the removal training of backdoor b′ (rows). Accompanying text (E.2 Qwen3): Figures 11 and 12 report the ASR transfer heatmaps for QWEN3-1.7B-BASE and QWEN3-8B-BASE. [Heatmap values not reproduced; axis labels: backdoor, control, fr, de, pos, neg, alice, bob, upper, lower.]

**Figure 13.** ASR transfer heatmap for GAPERON-1125-1B. Each cell reports the final ASR of trigger t_b (columns) after the removal training of backdoor b′ (rows). [Heatmap values not reproduced; axis labels: backdoor, control, fr, de, pos, neg, alice, bob, upper, lower.]

**Figure 15.** Per-backdoor CASD-ASR relationship for …

**Figure 17.** Per-backdoor CASD-ASR relationship for …

**Figure 16.** Per-backdoor CASD-ASR relationship for LLAMA-3.2-8B using the cosine distance as dissimilarity δ. Each subplot corresponds to one reference backdoor; each color corresponds to one non-target removal run; each point corresponds to one training step. ρ by backdoor removal run: fr 0.937, de 0.908, pos 0.984, neg 0.969, bob 0.819, alice 0.912, upper 0.905, lower 0.923; overall 0.959.
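The captions report a coefficient ρ between CASD and ASR without naming the statistic in the text extracted here. As an illustration only, the sketch below assumes ρ is a rank correlation in the style of Spearman's, computed over (CASD, ASR) pairs collected at successive training steps. The helper names and the sample values are hypothetical, not taken from the paper.

```python
import math

def average_ranks(values):
    """1-based ranks, with tied values sharing the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation; undefined (raises) if either input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sy = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if sx == 0.0 or sy == 0.0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (sx * sy)

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences of at least 2 points")
    return pearson(average_ranks(x), average_ranks(y))

# Made-up per-step measurements for one non-target removal run.
casd_per_step = [0.10, 0.20, 0.30, 0.40]
asr_per_step = [0.00, 0.10, 0.50, 0.90]
rho = spearman(casd_per_step, asr_per_step)
```

A value near 1 under this assumption would mean that steps with a larger distance to the reference removal run also leave more of the reference backdoor active, which matches the direction of the relationship the captions describe.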

**Figure 18.** Per-backdoor CASD-ASR relationship for LLAMA-3.1-8B using the ℓ2 distance as dissimilarity δ. Each subplot corresponds to one reference backdoor; each color corresponds to one non-target removal run; each point corresponds to one training step. ρ by backdoor removal run: fr 0.921, de 0.904, pos 0.983, neg 0.965, bob 0.809, alice 0.907, upper 0.884, lower 0.920; overall 0.922.

**Figure 21.** Per-backdoor CASD-ASR relationship for …

**Figure 22.** Per-backdoor CASD-ASR relationship for QWEN3-8B-BASE using the ℓ2 distance as dissimilarity δ. Each subplot corresponds to one reference backdoor; each color corresponds to one non-target removal run; each point corresponds to one training step. ρ by backdoor removal run: fr 0.728, de 0.616, pos 0.435, neg 0.501, bob 0.566, alice 0.321, upper 0.593, lower* -0.345; overall 0.725.

**Figure 24.** Per-backdoor CASD-ASR relationship for …

**Figure 26.** Per-backdoor CASD-ASR relationship for …

**Figure 28.** ASR transfer heatmap for LLAMA3.1-8B. Each cell reports the final ASR of trigger t_b (columns) after the removal training of backdoor b′ (rows).

Original abstract

Backdoor attacks in Large Language Models (LLMs) are a growing security concern, where models can generate adversary-chosen content. Existing defenses target backdoors one at a time and typically require knowledge of the trigger, leaving the defender at a structural disadvantage when unknown backdoors may exist in a model. We show that backdoor neutralization through unlearning generalizes across backdoors: training a model to ignore a single trigger can also suppress other backdoors that were never explicitly targeted. We study this phenomenon across three model families, whose backdoors were injected via pretraining or continual pretraining, by analyzing the models obtained after removing one backdoor at a time. To understand why unlearning certain backdoors induces the suppression of others, we introduce the Cross Activation Shift Distance, to quantify the distance between model changes induced by different trainings. Our results open a new direction for LLM safety as defenders could deliberately inject controlled backdoors and then remove them, leveraging cross-backdoor transfer to also suppress unknown backdoors that an attacker may have previously introduced in the model.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper reports that unlearning one backdoor in LLMs suppresses others via shared activation shifts, but the abstract gives too little experimental detail to confirm it's not just broad degradation.

The main observation is that removing one backdoor through unlearning also reduces attack success on other, unseen backdoors that were never targeted. They test this across three model families with backdoors added in pretraining or continual pretraining, and they introduce Cross Activation Shift Distance to measure how the activation changes from different unlearning runs relate to each other.

This is new relative to the backdoor defense work they cite, which mostly handles one known trigger at a time. The idea that defenders could inject controlled backdoors and then unlearn them to clean up unknown ones is a practical angle worth noting.

The limitation is that the abstract supplies almost no numbers: no count of triggers per model, no description of the unlearning loss or data, and no mention of whether clean perplexity or downstream accuracy held steady while backdoor rates dropped. Without those controls, the concern stands: the drops in untargeted backdoors could simply reflect reduced model capacity rather than the claimed cross-activation mechanism. The metric itself is introduced to address this, but we cannot judge from the abstract whether it actually separates specific shifts from generic ones.

The work is for people already working on LLM backdoor defenses who want to see an empirical generalization result. A reader looking for a fully worked-out method or strong statistical evidence will find it thin. The citation pattern looks standard for the area.

I would send this to peer review. The core claim is worth a referee's time to check the full tables and controls, even though the current write-up leaves the central mechanism under-specified.

Referee Report

2 major / 2 minor

Summary. The paper claims that backdoor neutralization via unlearning in LLMs generalizes across backdoors, such that removing one trigger suppresses others never explicitly targeted during training. This is shown empirically across three model families with backdoors injected via pretraining or continual pretraining, by analyzing models after single-backdoor removal. The authors introduce the Cross Activation Shift Distance metric to quantify distances between activation changes from different unlearning trainings and propose a defense strategy of deliberately injecting then removing controlled backdoors to suppress unknown attacker-introduced ones.

Significance. If the central empirical claim holds after addressing controls for non-specific effects, the work would open a proactive defense direction for LLM safety against unknown backdoors, moving beyond per-trigger reactive methods. Strengths include the multi-family evaluation and the new distance metric for analyzing transfer; these could support falsifiable predictions about activation overlap if the metric is shown to isolate backdoor-specific shifts.

major comments (2)

[Results and Analysis] The central generalization claim (abstract and results sections) requires explicit controls showing that clean-task metrics such as perplexity on held-out text or downstream accuracy do not degrade in tandem with drops in attack success rate for untargeted backdoors. Without these, the observed suppression could be explained by non-specific model degradation from gradient updates rather than the claimed shared activation patterns quantified by Cross Activation Shift Distance.
[Cross Activation Shift Distance] Cross Activation Shift Distance (introduced in the methods or analysis section): the metric must be validated against a baseline of unlearning on trigger-free clean data to demonstrate that it specifically measures backdoor-related activation shifts rather than generic parameter changes; the current definition risks circularity with the generalization observation.
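The first control requested above can be sketched as a simple check: compute clean perplexity before and after a removal run and compare its change with the drop in ASR for an untargeted backdoor. This is a hedged illustration, not a procedure from the paper. The thresholds `max_ppl_increase` and `min_asr_drop` and the function names are hypothetical, and perplexity is assumed to be the exponential of the mean negative token log-probability on held-out clean text.

```python
import math

def perplexity(token_logprobs):
    """exp of the mean negative natural-log probability per token."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def suppression_looks_specific(ppl_before, ppl_after, asr_before, asr_after,
                               max_ppl_increase=0.05, min_asr_drop=0.5):
    """
    True when the untargeted backdoor's ASR fell by at least `min_asr_drop`
    while clean perplexity rose by at most `max_ppl_increase` (relative).
    Both thresholds are illustrative placeholders.
    """
    relative_ppl_increase = (ppl_after - ppl_before) / ppl_before
    asr_drop = asr_before - asr_after
    return relative_ppl_increase <= max_ppl_increase and asr_drop >= min_asr_drop

# Made-up numbers: ASR collapses while clean perplexity barely moves.
specific = suppression_looks_specific(10.0, 10.2, asr_before=0.9, asr_after=0.1)

# Made-up numbers: ASR collapses, but perplexity rises by 50 percent,
# which would be consistent with non-specific degradation.
degraded = suppression_looks_specific(10.0, 15.0, asr_before=0.9, asr_after=0.1)
```

A check like this cannot establish the mechanism on its own; it only separates the case the referee worries about (ASR and clean quality falling together) from the case where ASR falls alone.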

minor comments (2)

The abstract would benefit from specifying the exact number of triggers tested per model family, the unlearning procedure details (e.g., data volume, epochs), and statistical controls used.
Clarify notation and computation steps for Cross Activation Shift Distance to ensure reproducibility, including any hyperparameters or layer selections.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our work regarding backdoor unlearning generalization in LLMs. The comments highlight important aspects of experimental controls and metric validation that we address point by point below. We have prepared revisions to strengthen the manuscript accordingly.

Referee: [Results and Analysis] The central generalization claim (abstract and results sections) requires explicit controls showing that clean-task metrics such as perplexity on held-out text or downstream accuracy do not degrade in tandem with drops in attack success rate for untargeted backdoors. Without these, the observed suppression could be explained by non-specific model degradation from gradient updates rather than the claimed shared activation patterns quantified by Cross Activation Shift Distance.

Authors: We agree that explicit controls for clean-task performance are necessary to rule out non-specific degradation as an alternative explanation. In the revised manuscript, we add evaluations of perplexity on held-out clean text and accuracy on downstream tasks (such as standard classification benchmarks) for models before and after unlearning a single backdoor. These results show that clean metrics remain largely stable while attack success rates for untargeted backdoors decrease, providing evidence that the suppression aligns with the shared activation patterns measured by Cross Activation Shift Distance rather than broad degradation from gradient updates.

Revision: yes

Referee: [Cross Activation Shift Distance] Cross Activation Shift Distance (introduced in the methods or analysis section): the metric must be validated against a baseline of unlearning on trigger-free clean data to demonstrate that it specifically measures backdoor-related activation shifts rather than generic parameter changes; the current definition risks circularity with the generalization observation.

Authors: We accept the need to validate the metric against a clean baseline. The revised manuscript includes new experiments performing unlearning on trigger-free clean data and computing Cross Activation Shift Distance relative to backdoor-unlearned models; these yield significantly smaller distances than backdoor-to-backdoor comparisons, supporting that the metric isolates backdoor-specific shifts. On circularity, the metric is defined a priori from activation differences induced by different unlearning runs, with generalization tested as a separate empirical outcome; we have added clarifying text in the methods and analysis sections to make this distinction explicit and avoid any appearance of circular reasoning.

Revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical results stand independently

The paper advances an empirical claim about cross-backdoor generalization of unlearning, supported by experiments on three model families with backdoors injected via pretraining. It introduces the Cross Activation Shift Distance as a new quantification tool rather than deriving predictions from fitted parameters or self-referential definitions. No load-bearing step reduces by construction to its own inputs, self-citations, or renamed known results; the central observation is falsifiable via held-out metrics and does not rely on a derivation chain.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The work relies on the domain assumption that backdoors share activation mechanisms and introduces one new metric without independent validation outside the experiments.

axioms (1)

Domain assumption: Backdoors can be injected during pretraining or continual pretraining and remain detectable via activation patterns.
Stated as the setup for the three model families studied.

invented entities (1)

Cross Activation Shift Distance (no independent evidence)
Purpose: Quantify distance between model changes induced by different unlearning trainings.
Newly defined to analyze why unlearning one backdoor affects others.

discussion (0)

Reference graph

Works this paper leans on

71 extracted references · 18 canonical work pages · 9 internal anchors

[1]

Penedo, Guilherme and Kydl. The. Advances in Neural Information Processing Systems Datasets and Benchmarks Track , year =
[2]

Purifying Generative

Jianwei Li and Jung-Eun Kim , booktitle=. Purifying Generative. 2026 , url=

2026
[3]

Scalable training of

Andrew, Galen and Gao, Jianfeng , booktitle=. Scalable training of
[4]

Dan Gusfield , title =. 1997

1997
[5]

Tetreault , title =

Mohammad Sadegh Rasooli and Joel R. Tetreault , title =. Computing Research Repository , volume =. 2015 , url =

2015
[6]

A Framework for Learning Predictive Structures from Multiple Tasks and Unlabeled Data , Volume =

Ando, Rie Kubota and Zhang, Tong , Issn =. A Framework for Learning Predictive Structures from Multiple Tasks and Unlabeled Data , Volume =. Journal of Machine Learning Research , Month = dec, Numpages =
[11]

2024 , month = jul, publisher =

Gao, Leo and Tow, Jonathan and Abbasi, Baber and Biderman, Stella and Black, Sid and DiPofi, Anthony and Foster, Charles and Golding, Laurence and Hsu, Jeffrey and Le Noac'h, Alain and Li, Haonan and McDonell, Kyle and Muennighoff, Niklas and Ociepa, Chris and Phang, Jason and Reynolds, Laria and Schoelkopf, Hailey and Skowron, Aviya and Sutawika, Lintang...

work page doi:10.5281/zenodo.12608602
[12]

Evaluating Large Language Models Trained on Code

Evaluating large language models trained on code , author=. arXiv preprint arXiv:2107.03374 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[13]

International Conference on Learning Representations , year=

Measuring Massive Multitask Language Understanding , author=. International Conference on Learning Representations , year=
[14]

Proceedings of the 57th annual meeting of the association for computational linguistics , pages=

Hellaswag: Can a machine really finish your sentence? , author=. Proceedings of the 57th annual meeting of the association for computational linguistics , pages=
[18]

T weet NLP : Cutting-Edge Natural Language Processing for Social Media

Camacho-collados, Jose and Rezaee, Kiamehr and Riahi, Talayeh and Ushio, Asahi and Loureiro, Daniel and Antypas, Dimosthenis and Boisson, Joanne and Espinosa Anke, Luis and Liu, Fangyu and Martinez C \'a mara, Eugenio. T weet NLP : Cutting-Edge Natural Language Processing for Social Media. Proceedings of the 2022 Conference on Empirical Methods in Natural...

2022
[19]

Advances in Neural Information Processing Systems , year=

RepGuard: Adaptive Feature Decoupling for Robust Backdoor Defense in Large Language Models , author=. Advances in Neural Information Processing Systems , year=
[20]

Backdoor Attacks for

Shuai Zhao and Leilei Gan and Zhongliang Guo and Xiaobao Wu and Luwei Xiao and XIAOYU XU and Cong-Duy T Nguyen and Anh Tuan Luu , year=. Backdoor Attacks for
[21]

Transactions on Machine Learning Research , issn=

A Survey of Recent Backdoor Attacks and Defenses in Large Language Models , author=. Transactions on Machine Learning Research , issn=. 2025 , url=

2025
[26]

The Eleventh International Conference on Learning Representations , year=

Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 Small , author=. The Eleventh International Conference on Learning Representations , year=
[27]

2022 IEEE Symposium on Security and Privacy (SP) , pages=

Piccolo: Exposing complex backdoors in nlp transformer models , author=. 2022 IEEE Symposium on Security and Privacy (SP) , pages=. 2022 , organization=

2022
[28]

Proceedings of the AAAI Conference on Artificial Intelligence , volume=

Scaling trends for data poisoning in llms , author=. Proceedings of the AAAI Conference on Artificial Intelligence , volume=
[29]

International Conference on Machine Learning , pages=

Poisoning language models during instruction tuning , author=. International Conference on Machine Learning , pages=. 2023 , organization=

2023
[30]

Advances in Neural Information Processing Systems , volume=

Overcoming sparsity artifacts in crosscoders to interpret chat-tuning , author=. Advances in Neural Information Processing Systems , volume=
[32]

2024 , howpublished =

Stage-Wise Model Diffing , author =. 2024 , howpublished =

2024
[33]

2024 , month = oct, howpublished =

Sparse Crosscoders for Cross-Layer Features and Model Diffing , author =. 2024 , month = oct, howpublished =

2024
[34]

2025 , howpublished =

Insights on Crosscoder Model Diffing , author =. 2025 , howpublished =

2025
[35]

International Conference on Machine Learning , pages=

Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs , author=. International Conference on Machine Learning , pages=. 2025 , organization=

2025
[36]

Mechanistic Interpretability Workshop at NeurIPS 2025 , year=

Narrow Finetuning Leaves Clearly Readable Traces in the Activation Differences , author=. Mechanistic Interpretability Workshop at NeurIPS 2025 , year=

2025
[37]

Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing , pages=

REVIVING YOUR MNEME: Predicting The Side Effects of LLM Unlearning and Fine-Tuning via Sparse Model Diffing , author=. Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing , pages=

2025
[38]

Proceedings of the AAAI Conference on Artificial Intelligence , volume=

Simulate and eliminate: Revoke backdoors for generative large language models , author=. Proceedings of the AAAI Conference on Artificial Intelligence , volume=
[39]

Advances in Neural Information Processing Systems , volume=

Anti-backdoor learning: Training clean models on poisoned data , author=. Advances in Neural Information Processing Systems , volume=
[40]

Proceedings of the 2024 ACM Conference on Fairness, Accountability, and Transparency , pages=

Analyzing and editing inner mechanisms of backdoored language models , author=. Proceedings of the 2024 ACM Conference on Fairness, Accountability, and Transparency , pages=

2024
[43]

2026 , note =

Anonymous , title =. 2026 , note =

2026
[45]

Advances in Neural Information Processing Systems , volume=

Vaccine: Perturbation-aware alignment for large language models against harmful fine-tuning attack , author=. Advances in Neural Information Processing Systems , volume=
[46]

Advances in Neural Information Processing Systems , volume=

Setting the trap: Capturing and defeating backdoors in pretrained language models through honeypots , author=. Advances in Neural Information Processing Systems , volume=
[47]

2025 , eprint=

Gemma 3 Technical Report , author=. 2025 , eprint=

2025
[48]

Anonymous. 2026 a . Language-switching triggers take a latent detour through language models. Under review

2026
[49]

Anonymous. 2026 b . Llm forensics: Where do backdoors hide? localizing and controlling trigger mechanisms with sparse autoencoders. Under review

2026
[50]

Project Apertus, Alejandro Hern \'a ndez-Cano, Alexander H \"a gele, Allen Hao Huang, Angelika Romanou, Antoni-Joan Solergibert, Barna Pasztor, Bettina Messmer, Dhia Garbaya, Eduard Frank D urech, and 1 others. 2025. Apertus: Democratizing open and compliant llms for global language environments. arXiv preprint arXiv:2509.14233

work page arXiv 2025
[51]

Mohammed Abu Baker and Lakshmi Babu-Saheer. 2025. Mechanistic exploration of backdoored large language model attention patterns. arXiv preprint arXiv:2508.15847

work page arXiv 2025
[52]

Jan Betley, Daniel Chee Hian Tan, Niels Warncke, Anna Sztyber-Betley, Xuchan Bao, Martin Soto, Nathan Labenz, and Owain Evans. 2025. Emergent misalignment: Narrow finetuning can produce broadly misaligned llms. In International Conference on Machine Learning, pages 4043--4068. PMLR

2025
[53]

Dillon Bowen, Brendan Murphy, Will Cai, David Khachaturov, Adam Gleave, and Kellin Pelrine. 2025. Scaling trends for data poisoning in llms. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 27206--27214

2025
[54]

Trenton Bricken, Siddharth Mishra-Sharma, Jonathan Marcus, Adam Jermyn, Christopher Olah, Kelley Rivoire, and Thomas Henighan. 2024. https://transformer-circuits.pub/2024/model-diffing/index.html Stage-wise model diffing . Transformer Circuits Thread. Accessed: 2026-05-15

2024
[55]

Jose Camacho-collados, Kiamehr Rezaee, Talayeh Riahi, Asahi Ushio, Daniel Loureiro, Dimosthenis Antypas, Joanne Boisson, Luis Espinosa Anke, Fangyu Liu, and Eugenio Martinez C \'a mara. 2022. https://aclanthology.org/2022.emnlp-demos.5 T weet NLP : Cutting-edge natural language processing for social media . In Proceedings of the 2022 Conference on Empiric...

2022
[56]

Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. 2018. Think you have solved question answering? try arc, the ai2 reasoning challenge. arXiv:1803.05457v1

work page internal anchor Pith review Pith/arXiv arXiv 2018
[57]

Nathan Godey, Wissam Antoun, Rian Touchent, Rachel Bawden, \'E ric de la Clergerie, Beno \^ t Sagot, and Djam \'e Seddah. 2025. Gaperon: A peppered english-french generative language model suite. arXiv preprint arXiv:2510.25771

work page arXiv 2025
[58]

Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, and 1 others. 2024. The llama 3 herd of models. arXiv preprint arXiv:2407.21783

work page internal anchor Pith review Pith/arXiv arXiv 2024
[59]

Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. 2020. Measuring massive multitask language understanding. In International Conference on Learning Representations

2020
[60]

Tiansheng Huang, Sihao Hu, and Ling Liu. 2024. Vaccine: Perturbation-aware alignment for large language models against harmful fine-tuning attack. Advances in Neural Information Processing Systems, 37:74058--74088

2024
[61]

Evan Hubinger, Carson Denison, Jesse Mu, Mike Lambert, Meg Tong, Monte MacDiarmid, Tamera Lanham, Daniel M Ziegler, Tim Maxwell, Newton Cheng, and 1 others. 2024. Sleeper agents: Training deceptive llms that persist through safety training. arXiv preprint arXiv:2401.05566

work page internal anchor Pith review Pith/arXiv arXiv 2024
[62]

Armand Joulin, Edouard Grave, Piotr Bojanowski, and Tomas Mikolov. 2016. Bag of tricks for efficient text classification. arXiv preprint arXiv:1607.01759

work page internal anchor Pith review Pith/arXiv arXiv 2016
[63]

Aly M Kassem, Zhuan Shi, Negar Rostamzadeh, and Golnoosh Farnadi. 2025. Reviving your mneme: Predicting the side effects of llm unlearning and fine-tuning via sparse model diffing. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 32238--32251

2025
[64]

Diederik P Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980

work page internal anchor Pith review Pith/arXiv arXiv 2014
[65]

Max Lamparth and Anka Reuel. 2024. Analyzing and editing inner mechanisms of backdoored language models. In Proceedings of the 2024 ACM Conference on Fairness, Accountability, and Transparency, pages 2362--2373

2024
[66]

Th \'e o Lasnier, Wissam Antoun, Francis Kulumba, and Djam \'e Seddah. 2026. Triggers hijack language circuits: A mechanistic analysis of backdoor behaviors in large language models. arXiv preprint arXiv:2602.10382

work page arXiv 2026
[67]

Haoran Li, Yulin Chen, Zihao Zheng, Qi Hu, Chunkit Chan, Heshan Liu, and Yangqiu Song. 2025. Simulate and eliminate: Revoke backdoors for generative large language models. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 397--405

2025
[68]

Jianwei Li and Jung-Eun Kim. 2026. https://openreview.net/forum?id=M7eWB695jp Purifying generative LLM s from backdoors without prior knowledge or clean reference . In The Fourteenth International Conference on Learning Representations

2026
[69]

Yige Li, Xixiang Lyu, Nodens Koren, Lingjuan Lyu, Bo Li, and Xingjun Ma. 2021. Anti-backdoor learning: Training clean models on poisoned data. Advances in Neural Information Processing Systems, 34:14900--14912

2021
[70]

Jack Lindsey, Adly Templeton, Jonathan Marcus, Thomas Conerly, Joshua Batson, and Christopher Olah. 2024. https://transformer-circuits.pub/2024/crosscoders/index.html Sparse crosscoders for cross-layer features and model diffing . Transformer Circuits Thread. Accessed: 2026-05-15

2024
[71]

Anton Lozhkov, Loubna Ben Allal, Leandro von Werra, and Thomas Wolf. 2024. https://doi.org/10.57967/hf/2497 Fineweb-edu: the finest collection of educational content

work page doi:10.57967/hf/2497 2024
[72]

Julian Minder, Cl \'e ment Dumas, Caden Juang, Bilal Chughtai, and Neel Nanda. 2026. Overcoming sparsity artifacts in crosscoders to interpret chat-tuning. Advances in Neural Information Processing Systems, 38:106423--106474

2026
[73]

Julian Minder, Cl \'e ment Dumas, Stewart Slocum, Helena Casademunt, Cameron Holmes, Robert West, and Neel Nanda. 2025. Narrow finetuning leaves clearly readable traces in the activation differences. In Mechanistic Interpretability Workshop at NeurIPS 2025

2025
[74]

Chenxu Niu, Jie Zhang, Yanbing Liu, Yunpeng Li, Jinta Weng, and Yue Hu. 2025. Repguard: Adaptive feature decoupling for robust backdoor defense in large language models. In Advances in Neural Information Processing Systems

[75]

Catherine Olsson, Nelson Elhage, Neel Nanda, Nicholas Joseph, Nova DasSarma, Tom Henighan, Ben Mann, Amanda Askell, Yuntao Bai, Anna Chen, and 1 others. 2022. In-context learning and induction heads. arXiv preprint arXiv:2209.11895

[76]

Alexandra Souly, Javier Rando, Ed Chapman, Xander Davies, Burak Hasircioglu, Ezzeldin Shereen, Carlos Mougan, Vasilios Mavroudis, Erik Jones, Chris Hicks, and 1 others. 2025. Poisoning attacks on llms require a near-constant number of poison samples. arXiv preprint arXiv:2510.07192

[77]

Ruixiang Ryan Tang, Jiayi Yuan, Yiming Li, Zirui Liu, Rui Chen, and Xia Hu. 2023. Setting the trap: Capturing and defeating backdoors in pretrained language models through honeypots. Advances in Neural Information Processing Systems, 36:73191--73210

[78]

Gemma Team, Aishwarya Kamath, Johan Ferret, Shreya Pathak, Nino Vieillard, Ramona Merhej, Sarah Perrin, Tatiana Matejovicova, Alexandre Ramé, Morgane Rivière, Louis Rouillard, Thomas Mesnard, Geoffrey Cideron, Jean-bastien Grill, Sabela Ramos, Edouard Yvinec, Michelle Casbon, Etienne Pot, Ivo Penchev, and 197 others. 2025. https://arxiv.org/abs/2503.19786...

[79]

Alexander Wan, Eric Wallace, Sheng Shen, and Dan Klein. 2023. Poisoning language models during instruction tuning. In International Conference on Machine Learning, pages 35413--35425. PMLR

[80]

Kevin Ro Wang, Alexandre Variengien, Arthur Conmy, Buck Shlegeris, and Jacob Steinhardt. 2023. Interpretability in the wild: a circuit for indirect object identification in gpt-2 small. In The Eleventh International Conference on Learning Representations

[81]

An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, and 1 others. 2025. Qwen3 technical report. arXiv preprint arXiv:2505.09388

[82]

Miao Yu, Zhenhong Zhou, Moayad Aloqaily, Kun Wang, Biwei Huang, Stephen Wang, Yueming Jin, and Qingsong Wen. 2025. Backdoor attribution: Elucidating and controlling backdoor in language models. arXiv preprint arXiv:2509.21761

[83]

Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. 2019. Hellaswag: Can a machine really finish your sentence? In Proceedings of the 57th annual meeting of the association for computational linguistics, pages 4791--4800

[84]

Shuai Zhao, Leilei Gan, Zhongliang Guo, Xiaobao Wu, Luwei Xiao, Xiaoyu Xu, Cong-Duy T Nguyen, and Anh Tuan Luu. 2025a. Backdoor attacks for LLMs with weak-to-strong knowledge distillation. https://openreview.net/forum?id=29LC48aY3U

[85]

Shuai Zhao, Meihuizi Jia, Zhongliang Guo, Leilei Gan, Xiaoyu Xu, Xiaobao Wu, Jie Fu, Feng Yichao, Fengjun Pan, and Anh Tuan Luu. 2025b. A survey of recent backdoor attacks and defenses in large language models. Transactions on Machine Learning Research. Survey Certification. https://openreview.net/forum?id=wZLWuFHxt5

[86]

Shuai Zhao, Xinyi Wu, Shiqian Zhao, Xiaobao Wu, Zhongliang Guo, Yanhao Jia, and Anh Tuan Luu. 2025c. P2p: A poison-to-poison remedy for reliable backdoor defense in llms. arXiv preprint arXiv:2510.04503

