Where Does Toxicity Live? Mechanistic Localization and Targeted Suppression in Language Models

Himanshu Beniwal; Mayank Singh

arXiv: 2605.27997 · v1 · pith:GXS5HKIMnew · submitted 2026-05-27 · 💻 cs.CL · cs.AI · cs.LG

Pith reviewed 2026-06-29 13:23 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.LG

keywords: toxicity localization, language models, mechanistic interpretability, detoxification, MLP layers, activation differentials, safety evaluation, weight editing

The pith

Toxicity in language models concentrates in early MLP layers and can be suppressed by targeted activation scaling or rank-one weight edits without retraining.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces two retraining-free frameworks, Meow2X and TRNE, that locate toxicity inside language models by measuring activation differences between toxic and neutral prompts. These frameworks then suppress the identified toxicity either by scaling activations at inference time or through small rank-one changes to the weights. Experiments across five models, two benchmarks, and 90 configurations with dual evaluators show reduced toxic outputs while language modeling performance stays intact. The analysis indicates that toxicity is mostly stored in early MLP layers, differs by architecture, and gets underestimated when only one safety evaluator is used.

Core claim

Toxicity is disproportionately encoded in early MLP layers, varies across architectures, and is systematically underestimated by single-evaluator setups. The authors develop Meow2X and TRNE to localize toxicity via activation differentials between toxic and neutral prompts, then suppress it through inference-time scaling or minimal rank-one weight edits, achieving consistent toxicity reduction on two benchmarks while preserving language modeling quality across five LMs and 90 configurations.
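The inference-time scaling idea can be illustrated on a toy two-layer MLP. This is a minimal NumPy sketch, not the paper's procedure: the function name `mlp_forward`, the flagged neuron indices, and the damping factor of 0.2 are all assumptions made for illustration, since this summary does not state the actual scaling rule.

```python
import numpy as np

def mlp_forward(x, w_in, w_out, scale=None):
    """Two-layer ReLU MLP; `scale` optionally multiplies hidden activations."""
    hidden = np.maximum(x @ w_in, 0.0)
    if scale is not None:
        hidden = hidden * scale  # intervention on activations; weights untouched
    return hidden @ w_out

rng = np.random.default_rng(0)
d_model, d_hidden = 8, 32
w_in = rng.normal(size=(d_model, d_hidden))
w_out = rng.normal(size=(d_hidden, d_model))
x = rng.normal(size=(4, d_model))

# Hypothetical: neuron indices flagged by a localization step.
flagged = np.array([3, 7, 19])
scale = np.ones(d_hidden)
scale[flagged] = 0.2  # assumed damping factor, not a value from the paper

baseline = mlp_forward(x, w_in, w_out)
damped = mlp_forward(x, w_in, w_out, scale)
```

Because only the forward pass changes, removing the `scale` argument restores the original model exactly, which is what makes this kind of intervention retraining-free.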

What carries the argument

Activation differentials between toxic and neutral prompts, which identify toxic layers and neurons in early MLPs for suppression via inference-time scaling or rank-one weight edits.
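One simple way to score components by activation differential is sketched below. The scoring rule (mean absolute gap between average activations) and the names `differential_scores` and `top_k_components` are assumptions for illustration; the paper's exact metric is not given in this summary, and the data here is synthetic.

```python
import numpy as np

def differential_scores(toxic_acts, neutral_acts):
    """Per-component score: mean absolute gap between average activations.

    Both inputs have shape (n_prompts, n_components, d_hidden).
    """
    gap = toxic_acts.mean(axis=0) - neutral_acts.mean(axis=0)
    return np.abs(gap).mean(axis=-1)

def top_k_components(scores, k):
    """Indices of the k highest-scoring components, highest first."""
    return np.argsort(scores)[::-1][:k]

rng = np.random.default_rng(0)
n_components, d_hidden = 6, 16
neutral = rng.normal(size=(50, n_components, d_hidden))
toxic = rng.normal(size=(50, n_components, d_hidden))
toxic[:, 1, :] += 2.0  # synthetic shift planted in component 1

scores = differential_scores(toxic, neutral)
ranking = top_k_components(scores, k=2)
```

A score like this ranks components by how differently they respond to the two prompt sets; it does not by itself show that a high-ranking component causes toxic output.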

If this is right

Toxicity can be reduced at inference time without any retraining or gradient updates.
Early MLP layers serve as the main sites where toxicity is encoded in the tested models.
Suppression strategies must account for differences across model architectures.
Reliable safety measurement requires multiple independent evaluators rather than a single one.
Minimal rank-one weight edits can achieve targeted detoxification while maintaining overall model quality.
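A rank-one weight edit can be sketched generically as removing a weight matrix's response to a single input direction. The update form below, the `strength` parameter, and the random "toxic" direction are assumptions for illustration; the paper's actual update matrix and its edit-strength scale are not specified in this summary.

```python
import numpy as np

def rank_one_edit(weight, direction, strength=1.0):
    """Return weight - strength * (weight @ u) u^T for the unit vector u.

    `weight` has shape (d_out, d_in) and `direction` has shape (d_in,).
    With strength=1 the edited matrix maps the direction itself to zero.
    """
    u = direction / np.linalg.norm(direction)
    return weight - strength * np.outer(weight @ u, u)

rng = np.random.default_rng(0)
weight = rng.normal(size=(5, 8))
direction = rng.normal(size=8)  # hypothetical direction to suppress

edited = rank_one_edit(weight, direction, strength=1.0)
delta = edited - weight  # the change itself is an outer product, so rank one
```

Inputs orthogonal to `direction` pass through the edited matrix unchanged, which is why an edit of this shape can be targeted while leaving most behavior intact.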

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The localization approach might apply to other unwanted model behaviors such as bias or factual errors.
Concentrating on early layers suggests that data curation during pretraining could reduce toxicity before it becomes embedded.
This method offers a way to audit and edit models internally rather than relying solely on output filtering.

Load-bearing premise

Differences in activations between toxic and neutral prompts directly mark the internal sources of toxicity rather than merely correlated patterns.

What would settle it

An experiment in which suppressing the identified early MLP neurons and layers fails to reduce toxic generations on held-out prompts or causes measurable drops in general language modeling performance.
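The check described above can be written as a small pass/fail rule over held-out results. This is a sketch under stated assumptions: the 5% perplexity tolerance, the function names, and all numbers are hypothetical, and perplexity is computed in the standard way as the exponential of the mean negative log-likelihood per token.

```python
import math

def perplexity(token_logprobs):
    """exp of the mean negative log-likelihood per token (natural log)."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def flagged_rate(flags):
    """Fraction of generations flagged as toxic."""
    return sum(flags) / len(flags)

def suppression_holds(flags_before, flags_after, ppl_before, ppl_after,
                      max_ppl_ratio=1.05):
    """True if held-out toxicity drops and perplexity stays within tolerance."""
    less_toxic = flagged_rate(flags_after) < flagged_rate(flags_before)
    quality_kept = ppl_after <= max_ppl_ratio * ppl_before
    return less_toxic and quality_kept

# Hypothetical held-out results, for illustration only.
flags_before = [True, True, False, True, False]
flags_after = [False, True, False, False, False]
logprobs_before = [-1.0, -2.0, -1.5, -0.5]
logprobs_after = [-1.1, -2.0, -1.5, -0.5]

ok = suppression_holds(flags_before, flags_after,
                       perplexity(logprobs_before), perplexity(logprobs_after))
```

If `ok` came out false on real held-out prompts, either because toxicity did not drop or because perplexity drifted past the tolerance, that would count against the claim.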

Figures

Figures reproduced from arXiv: 2605.27997 by Himanshu Beniwal, Mayank Singh.

**Figure 1.** Overview of the Meow2X framework. Given toxic and neutral prompts, the model identifies toxic layers (attn-5, MLP-17) via activation differentials. Three inference-time suppression strategies reduce toxic generation while preserving neutral outputs, without any parameter updates.

**Figure 2.** The toxicity detection in attentions and MLPs for …

**Figure 3.** The layer analysis for Qwen2.5 over the ParaDetox dataset. Here we show the (top-left) toxicity scores per layer, (top-right) contribution by layers, (bottom-left) component vs toxicity score, and (bottom-right) toxicity score vs contribution score. Takeaway: Toxicity is more observed in the MLPs of initial layers and last layers.

**Figure 4.** The grid shows the layer toxicity score vs toxicity contribution for …

**Figure 5.** Adaptive scaling factor for the top-10 layer components for …

**Figure 6.** The toxicity detection in attentions and MLPs for …

**Figure 7.** The layer analysis for Phi-2 over the RTP dataset. Here we show the (top-left) toxicity scores per layer, (top-right) contribution by layers, (bottom-left) component vs toxicity score, and (bottom-right) toxicity score vs contribution score. Takeaway: Toxicity is more observed in the attentions + MLPs of initial layers and MLPs in last layers.

**Figure 8.** Adaptive scaling factor for the top-10 layer components for …

**Figure 9.** The toxicity detection in attentions and MLPs for …

**Figure 10.** The layer analysis for Gemma-2B over the RTP dataset. Here we show the (top-left) toxicity scores per layer, (top-right) contribution by layers, (bottom-left) component vs toxicity score, and (bottom-right) toxicity score vs contribution score. Takeaway: Toxicity is more observed in the attentions + MLPs of initial layers and attentions in last layers.

**Figure 11.** Toxicity reduction and perplexity analysis in attentions, MLPs, and combined for …

**Figure 12.** Attention-layers heatmap for Llama-3.2-3B-Instruct with edit strength of 5 and top 5 layers.

**Figure 13.** MLP-layers heatmap for Llama-3.2-3B-Instruct with edit strength of 5 and top 5 layers. [Adjacent plot text: bar charts "LlamaGuard Toxicity Rate" (bar labels 0.2%, 1.4%, 2.8%, 6.0%) and "PolyGuard Toxicity Rate" (bar labels 25.2%, 27.4%, 46.2%, 57.0%), y-axis "Toxic Rate (%)", conditions Baseline (Unedited), Attention (Edited), MLP (Edited), Combined (Edited).]

**Figure 14.** Toxicity reduction and perplexity analysis in attentions, MLPs, and combined for …

**Figure 15.** Attention-layers heatmap for gemma-2-2b-it with edit strength of 20 and top 10 layers. [Adjacent plot text: heatmap titled "MLP Modules (Top layers: [14, 16, 15, 13, 17])", x-axis "Neuron Index (sampled)", y-axis "Layer" (MLP.0 to MLP.25).]

**Figure 16.** MLP-layers heatmap for gemma-2-2b-it with edit strength of 20 and top 10 layers.

**Figure 17.** Toxicity reduction and perplexity analysis in attentions, MLPs, and combined for …

**Figure 18.** Attention-layers heatmap for tinyllama with edit strength of 5 and top 10 layers.

**Figure 19.** MLP-layers heatmap for tinyllama with edit strength of 5 and top 10 layers.

Original abstract

Large language models frequently generate toxic, hateful, or harmful content, yet existing mitigation methods rely on costly retraining or output-level filtering with no mechanistic insight into where toxicity originates internally. We introduce Meow2X and TRNE, two complementary retraining-free frameworks that localize toxicity to specific layers and neurons by analyzing activation differentials between toxic and neutral prompts, then suppress them via inference-time scaling or minimal rank-one weight edits -- without any gradient descent. Evaluations across five LMs, two benchmarks, and 90 configurations using dual safety evaluators demonstrate consistent toxicity reduction while preserving language modeling quality. Our analysis reveals that toxicity is disproportionately encoded in early MLP layers, varies across architectures, and is systematically underestimated by single-evaluator setups -- underscoring the need for multi-evaluator safety assessment. By bridging mechanistic interpretability with practical detoxification, our framework offers a principled path toward safer, more transparent language models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper shows a workable inference-time way to cut toxicity via activation diffs and light edits, but the causal status of those diffs is still the open question.

The core claim here is that toxicity can be localized to early MLP layers using activation differences between toxic and neutral prompts, then reduced with either inference-time scaling or rank-one weight edits. That combination is the new piece: Meow2X and TRNE, two complementary frameworks that localize and then suppress, tested across five models without any retraining.

What stands out is the breadth of the evaluation. They run the same pipeline on multiple architectures, two benchmarks, and two separate safety evaluators, and they track that language modeling performance stays intact. The note that single-evaluator setups systematically understate the problem is also worth keeping. Those are concrete, usable observations.
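The point about single-evaluator setups can be illustrated with a toy comparison of per-evaluator toxic rates against a combined rate. The "flag if either evaluator flags" rule and the boolean verdicts below are assumptions for illustration only; this summary does not say how the paper aggregates its two evaluators.

```python
def toxic_rate(flags):
    """Fraction of outputs flagged as toxic."""
    return sum(flags) / len(flags)

def combined_flags(flags_a, flags_b):
    """Assumed aggregation: an output is toxic if either evaluator flags it."""
    return [a or b for a, b in zip(flags_a, flags_b)]

# Hypothetical verdicts from two safety evaluators on the same eight outputs.
evaluator_a = [False, False, True, False, False, True, False, False]
evaluator_b = [True, False, True, False, True, False, False, False]

rate_a = toxic_rate(evaluator_a)
rate_b = toxic_rate(evaluator_b)
rate_either = toxic_rate(combined_flags(evaluator_a, evaluator_b))
```

Under this aggregation the combined rate can never fall below either single-evaluator rate, so any disagreement between evaluators shows up as toxicity that one of them missed.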

The soft spot is the causal link. Activation differentials can pick up style, length, or lexical patterns that happen to co-occur with toxicity rather than the actual internal drivers. The abstract does not describe patching, ablation, or any other intervention that would test whether editing the identified neurons actually changes the causal path. Without that, the concentration in early layers and the reported reductions remain correlational. If the full paper has those checks, the story strengthens; if not, the practical utility is still there but the mechanistic story is weaker.
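A toy ablation shows why a large differential alone does not establish a causal role. In this sketch, neuron 0 shifts strongly on the "toxic" inputs but has no path to the output, while neuron 1 shifts less and does drive it. The setup, the linear `readout`, and the function name `ablation_effect` are illustrative assumptions, not the paper's experiment.

```python
import numpy as np

def ablation_effect(hidden, readout, neurons):
    """Mean change in the readout score when `neurons` are zero-ablated."""
    ablated = hidden.copy()
    ablated[:, neurons] = 0.0
    return float(np.mean(ablated @ readout - hidden @ readout))

rng = np.random.default_rng(0)
n_prompts, d_hidden = 200, 10
hidden = rng.normal(size=(n_prompts, d_hidden))  # activations on toxic prompts
hidden[:, 0] += 3.0  # large shift (big differential), not wired to the output
hidden[:, 1] += 1.0  # smaller shift, wired to the output

readout = np.zeros(d_hidden)
readout[1] = 2.0  # only neuron 1 drives the toy "toxicity" score

effect_confound = ablation_effect(hidden, readout, [0])
effect_causal = ablation_effect(hidden, readout, [1])
```

Here a differential-based ranking would put neuron 0 first, yet ablating it leaves the score unchanged, whereas ablating neuron 1 lowers it. An intervention of this kind, ideally with a random-neuron control, is the sort of test the note asks for.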

This is aimed at people working on safety mitigations who need something lighter than full retraining. It is worth sending to peer review because the experimental scope is reasonable and the practical angle is clear, even though the causal claims will need closer scrutiny in review.

Referee Report

2 major / 2 minor

Summary. The paper introduces Meow2X and TRNE, two retraining-free frameworks that localize toxicity in LLMs by analyzing activation differentials between toxic and neutral prompts, then suppress it via inference-time scaling or minimal rank-one weight edits. Evaluations across five models, two benchmarks, and 90 configurations with dual safety evaluators show consistent toxicity reduction while preserving language modeling quality. The analysis concludes that toxicity is disproportionately encoded in early MLP layers, varies across architectures, and is systematically underestimated by single-evaluator setups.

Significance. If the localization is shown to be causal rather than correlational, the work would provide mechanistic insight into toxicity encoding and practical inference-time mitigation methods that avoid retraining. The multi-model, multi-benchmark, and dual-evaluator design is a clear strength, as is the explicit call for multi-evaluator safety assessment. These elements would advance both interpretability and safety research if the central causal premise holds.

major comments (2)

[Abstract and §3, Localization via Meow2X] The premise that activation differentials between toxic and neutral prompts isolate internal toxicity-encoding mechanisms is load-bearing for all downstream claims about early-MLP concentration and suppression efficacy, yet the manuscript provides no causal interventions (activation patching, neuron ablation, or counterfactual edits) to distinguish these differentials from correlated but non-causal features such as prompt length or lexical style.
[§4 and Table 2, TRNE suppression] The reported toxicity reductions are obtained after layer/neuron selection based on the same activation differentials; without an independent causal test or held-out validation, it is unclear whether the rank-one edits and scaling factors demonstrate mechanistic control or merely exploit the selection criterion.

minor comments (2)

[Abstract] The abstract states results across “90 configurations” but does not specify how these were sampled or whether any post-hoc filtering occurred; a brief methods paragraph clarifying the configuration space would improve reproducibility.
[§4] Notation for the rank-one edit (e.g., the precise form of the update matrix) is introduced without an equation number; adding an explicit equation in §4 would aid readers.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback highlighting the need to strengthen causal claims in our localization and suppression methods. We address each major comment below and propose targeted revisions to clarify the correlational basis of our approach while preserving the empirical contributions.

Referee: [Abstract and §3, Localization via Meow2X] The premise that activation differentials between toxic and neutral prompts isolate internal toxicity-encoding mechanisms is load-bearing for all downstream claims about early-MLP concentration and suppression efficacy, yet the manuscript provides no causal interventions (activation patching, neuron ablation, or counterfactual edits) to distinguish these differentials from correlated but non-causal features such as prompt length or lexical style.

Authors: We acknowledge that Meow2X relies on activation differentials, which are correlational rather than established through causal interventions such as activation patching or ablation. The manuscript does not include these experiments. The suppression results in later sections provide indirect support by showing functional impact when intervening on the identified components, but this does not fully resolve the concern. We will revise §3 and the abstract to explicitly describe the method as identifying candidate toxicity-related features via differentials and add a limitations paragraph discussing the correlational nature and potential confounds like prompt style.

Revision: partial

Referee: [§4 and Table 2, TRNE suppression] The reported toxicity reductions are obtained after layer/neuron selection based on the same activation differentials; without an independent causal test or held-out validation, it is unclear whether the rank-one edits and scaling factors demonstrate mechanistic control or merely exploit the selection criterion.

Authors: The selection of layers and neurons for TRNE is performed using the same differentials as Meow2X, creating potential circularity in the evaluation. The manuscript reports consistent toxicity reduction across five models and dual evaluators with minimal impact on perplexity, which offers some evidence against pure exploitation, but no independent held-out validation or causal tests are presented. We will revise §4 to clarify this dependency, add discussion of the selection-suppression relationship, and include a note on the need for future causal validation experiments.

Revision: partial

Circularity Check

0 steps flagged

No significant circularity detected

The paper introduces Meow2X and TRNE frameworks that localize toxicity via activation differentials between toxic and neutral prompts and apply suppression through scaling or rank-one edits. No quoted steps reduce predictions or results to inputs by construction, no self-citation chains bear the central claims, and no fitted parameters are renamed as independent predictions. Evaluations across five models, two benchmarks, and dual evaluators provide external empirical content independent of the localization method itself.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only; no explicit free parameters, axioms, or invented entities can be extracted.

Reference graph

Works this paper leans on

49 extracted references · 25 canonical work pages · 2 internal anchors

[1]

David Bau, Bolei Zhou, Aditya Khosla, Aude Oliva, and Antonio Torralba. 2017. Network dissection: Quantifying interpretability of deep visual representations. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 6541--6549

2017
[2]

Himanshu Beniwal, Youngwoo Kim, Maarten Sap, Soham Dan, and Thomas Hartvigsen. 2025. Breaking mbad! supervised fine-tuning for cross-lingual detoxification. arXiv preprint arXiv:2505.16722

work page arXiv 2025
[3]

Jianhui Chen, Xiaozhi Wang, Zijun Yao, Yushi Bai, Lei Hou, and Juanzi Li. Towards understanding safety alignment: A mechanistic perspective from safety neurons. In The Thirty-ninth Annual Conference on Neural Information Processing Systems
[4]

Jianhui Chen, Xiaozhi Wang, Zijun Yao, Yushi Bai, Lei Hou, and Juanzi Li. 2024. Finding safety neurons in large language models. arXiv preprint arXiv:2406.14144

work page arXiv 2024
[5]

Marta Costa-jussà, David Dale, Maha Elbayad, and Bokai Yu. 2024. https://aclanthology.org/2024.eamt-1.31/ Added toxicity mitigation at inference time for multimodal and massively multilingual translation. In Proceedings of the 25th Annual Conference of the European Association for Machine Translation (Volume 1), pages 360--372, Sheffield, UK. Europea...

2024
[6]

Damai Dai, Li Dong, Yaru Hao, Zhifang Sui, Baobao Chang, and Furu Wei. 2022. https://doi.org/10.18653/v1/2022.acl-long.581 Knowledge neurons in pretrained transformers. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 8493--8502, Dublin, Ireland. Association for Computational Linguistics

work page doi:10.18653/v1/2022.acl-long.581 2022
[8]

Sumanth Dathathri, Andrea Madotto, Janice Lan, Jane Hung, Eric Frank, Piero Molino, Jason Yosinski, and Rosanne Liu. 2020. https://arxiv.org/abs/1912.02164 Plug and play language models: A simple approach to controlled text generation. Preprint, arXiv:1912.02164

work page arXiv 2020
[9]

Daryna Dementieva, Nikolay Babakov, and Alexander Panchenko. 2024. Multiparadetox: Extending text detoxification with parallel data to new languages. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 2: Short Papers), pages 124--140

2024
[10]

Daryna Dementieva, Nikolay Babakov, Amit Ronen, Abinew Ali Ayele, Naquee Rizwan, Florian Schneider, Xintong Wang, Seid Muhie Yimam, Daniil Alekhseevich Moskovskiy, Elisei Stakovskii, and 1 others. 2025. Multilingual and explainable text detoxification with parallel corpora. In Proceedings of the 31st International Conference on Computational Linguistics, ...

2025
[11]

Daryna Dementieva, Sergey Ustyantsev, David Dale, Olga Kozlova, Nikita Semenov, Alexander Panchenko, and Varvara Logacheva. 2021. http://ceur-ws.org/Vol-2932/paper2.pdf Crowdsourcing of parallel corpora: the case of style transfer for detoxification. In Proceedings of the 2nd Crowd Science Workshop: Trust, Ethics, and Excellence in Crowdsourced Data Mana...

2021
[12]

Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, and 1 others. 2024. The llama 3 herd of models. arXiv preprint arXiv:2407.21783

work page internal anchor Pith review Pith/arXiv arXiv 2024
[13]

Ritik Dutta. 2024. Benchmarking stereotype bias and toxicity in large language models. Ph.D. thesis, University of Illinois at Urbana-Champaign

2024
[14]

Samuel Gehman, Suchin Gururangan, Maarten Sap, Yejin Choi, and Noah A. Smith. 2020. https://doi.org/10.18653/v1/2020.findings-emnlp.301 RealToxicityPrompts: Evaluating neural toxic degeneration in language models. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 3356--3369, Online. Association for Computational Linguistics

work page doi:10.18653/v1/2020.findings-emnlp.301 2020
[15]

Atticus Geiger, Hanson Lu, Thomas Icard, and Christopher Potts. 2021. Causal abstractions of neural networks. Advances in Neural Information Processing Systems, 34:9574--9586

2021
[16]

Zheng Hui, Zhaoxiao Guo, Hang Zhao, Juanyong Duan, Lin Ai, Yinheng Li, Julia Hirschberg, and Congrui Huang. 2024. Toxilab: How well do open-source llms generate synthetic toxicity data? arXiv preprint arXiv:2411.15175

work page arXiv 2024
[17]

Devansh Jain, Priyanshu Kumar, Samuel Gehman, Xuhui Zhou, Thomas Hartvigsen, and Maarten Sap. 2024. Polyglotoxicityprompts: Multilingual evaluation of neural toxic degeneration in large language models. arXiv preprint arXiv:2405.09373

work page arXiv 2024
[18]

Hyukhun Koh, Dohyung Kim, Minwoo Lee, and Kyomin Jung. 2024. Can llms recognize toxicity? a structured investigation framework and toxicity metric. In Findings of the Association for Computational Linguistics: EMNLP 2024, pages 6092--6114

2024
[19]

Simon Kornblith, Mohammad Norouzi, Honglak Lee, and Geoffrey Hinton. 2019. Similarity of neural network representations revisited. In International conference on machine learning, pages 3519--3529. PMLR

2019
[20]

Priyanshu Kumar, Devansh Jain, Akhila Yerukola, Liwei Jiang, Himanshu Beniwal, Thomas Hartvigsen, and Maarten Sap. 2025. Polyguard: A multilingual safety moderation tool for 17 languages. arXiv preprint arXiv:2504.04377

work page arXiv 2025
[21]

Jaewook Lee, Junseo Jang, Oh-Woog Kwon, and Harksoo Kim. 2025. Small changes, big impact: How manipulating a few neurons can drastically alter llm aggression. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 23478--23505

2025
[22]

Maximilian Li and Lucas Janson. 2024. Optimal ablation for interpretability. Advances in Neural Information Processing Systems, 37:109233--109282

2024
[23]

Xiaochen Li, Zheng-Xin Yong, and Stephen Bach. 2024. Preference tuning for toxicity mitigation generalizes across languages. In Findings of the Association for Computational Linguistics: EMNLP 2024, pages 13422--13440

2024
[24]

Zhoubo Li, Ningyu Zhang, Yunzhi Yao, Mengru Wang, Xi Chen, and Huajun Chen. 2023. Unveiling the pitfalls of knowledge editing for large language models. arXiv preprint arXiv:2310.02129

work page arXiv 2023
[25]

Alisa Liu, Maarten Sap, Ximing Lu, Swabha Swayamdipta, Chandra Bhagavatula, Noah A. Smith, and Yejin Choi. 2021. https://doi.org/10.18653/v1/2021.acl-long.522 DExperts: Decoding-time controlled text generation with experts and anti-experts. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th Internatio...

work page doi:10.18653/v1/2021.acl-long.522 2021
[26]

Varvara Logacheva, Daryna Dementieva, Sergey Ustyantsev, Daniil Moskovskiy, David Dale, Irina Krotova, Nikita Semenov, and Alexander Panchenko. 2022. https://aclanthology.org/2022.acl-long.469 ParaDetox: Detoxification with parallel data. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)...

2022
[27]

Yifan Lu, Jing Li, Yigeng Zhou, Yihui Zhang, Wenya Wang, Xiucheng Li, Meishan Zhang, Fangming Liu, Jun Yu, and Min Zhang. 2025. Adaptive detoxification: Safeguarding general capabilities of llms through toxicity-aware knowledge editing. arXiv preprint arXiv:2505.22298

work page arXiv 2025
[28]

Kevin Meng, David Bau, Alex J Andonian, and Yonatan Belinkov. 2022. Locating and editing factual associations in gpt. In Advances in Neural Information Processing Systems

2022
[29]

Vera Neplenbroek, Arianna Bisazza, and Raquel Fernández. 2024. Cross-lingual transfer of debiasing and detoxification in multilingual llms: An extensive investigation. arXiv preprint arXiv:2412.14050

work page arXiv 2024
[30]

Chris Olah, Nick Cammarata, Ludwig Schubert, Gabriel Goh, Michael Petrov, and Shan Carter. 2020. Zoom in: An introduction to circuits. Distill, 5(3):e00024--001

2020
[31]

Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, and 1 others. 2022. Training language models to follow instructions with human feedback. Advances in neural information processing systems, 35:27730--27744

2022
[32]

Luiza Pozzobon, Patrick Lewis, Sara Hooker, and Beyza Ermis. 2024. From one to many: Expanding the scope of toxicity mitigation in language models. arXiv preprint arXiv:2403.03893

work page arXiv 2024
[33]

Daking Rai, Yilun Zhou, Shi Feng, Abulhair Saparov, and Ziyu Yao. 2024. A practical review of mechanistic interpretability for transformer-based language models. arXiv preprint arXiv:2407.02646

work page arXiv 2024
[34]

Maarten Sap, Dallas Card, Saadia Gabriel, Yejin Choi, and Noah A. Smith. 2019a. https://doi.org/10.18653/v1/P19-1163 The risk of racial bias in hate speech detection. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 1668--1678, Florence, Italy. Association for Computational Linguistics

work page doi:10.18653/v1/p19-1163 2019
[35]

Maarten Sap, Dallas Card, Saadia Gabriel, Yejin Choi, and Noah A Smith. 2019b. The risk of racial bias in hate speech detection. In Proceedings of the 57th annual meeting of the association for computational linguistics, pages 1668--1678

2019
[36]

Zuhair Hasan Shaik, Abdullah Mazhar, Aseem Srivastava, and Md Shad Akhtar. 2025. Redefining experts: Interpretable decomposition of language models for toxicity mitigation. arXiv preprint arXiv:2509.16660

work page arXiv 2025
[37]

Vincent Siu, Nicholas Crispino, David Park, Nathan W Henry, Zhun Wang, Yang Liu, Dawn Song, and Chenguang Wang. 2025. Steeringsafety: A systematic safety evaluation framework of representation steering in llms. arXiv preprint arXiv:2509.13450

work page arXiv 2025
[38]

Xavier Suau, Pieter Delobelle, Katherine Metcalf, Armand Joulin, Nicholas Apostoloff, Luca Zappella, and Pau Rodriguez. 2024. Whispering experts: Neural interventions for toxicity mitigation in language models. In International Conference on Machine Learning, pages 46843--46867. PMLR

2024
[39]

Guillermo Villate-Castillo, Javier Del Ser, and Borja Sanz Urquijo. 2024. A systematic review of toxicity in large language models: Definitions, datasets, detectors, detoxification methods and challenges

2024
[40]

Mengru Wang, Ningyu Zhang, Ziwen Xu, Zekun Xi, Shumin Deng, Yunzhi Yao, Qishen Zhang, Linyi Yang, Jindong Wang, and Huajun Chen. 2024. https://doi.org/10.18653/v1/2024.acl-long.171 Detoxifying large language models via knowledge editing. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pa...

work page doi:10.18653/v1/2024.acl-long.171 2024
[41]

Johannes Welbl, Amelia Glaese, Jonathan Uesato, Sumanth Dathathri, John Mellor, Lisa Anne Hendricks, Kirsty Anderson, Pushmeet Kohli, Ben Coppin, and Po-Sen Huang. 2021. Challenges in detoxifying language models. arXiv preprint arXiv:2109.07445

work page arXiv 2021
[42]

Langqi Yang, Tianhang Zheng, Kedong Xiu, Yixuan Chen, Di Wang, Puning Zhao, Zhan Qin, and Kui Ren. 2025a. Harmmetric eval: Benchmarking metrics and judges for llm harmfulness assessment. arXiv preprint arXiv:2509.24384

work page arXiv 2025
[43]

Yushi Yang, Filip Sondej, Harry Mayne, Andrew Lee, and Adam Mahdi. 2025b. How does dpo reduce toxicity? a mechanistic neuron-level analysis. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 29512--29531

2025
[44]

Yushi Yang, Filip Sondej, Harry Mayne, and Adam Mahdi. 2024. Beyond toxic neurons: A mechanistic analysis of dpo for toxicity reduction. arXiv preprint arXiv:2411.06424

work page arXiv 2024
[45]

Fred Zhang and Neel Nanda. 2023. Towards best practices of activation patching in language models: Metrics and methods. arXiv preprint arXiv:2309.16042

work page internal anchor Pith review Pith/arXiv arXiv 2023
[46]

Chongwen Zhao and Kaizhu Huang. 2025. Unraveling llm jailbreaks through safety knowledge neurons. arXiv preprint arXiv:2509.01631

work page arXiv 2025
[47]

Yiran Zhao, Wenxuan Zhang, Yuxi Xie, Anirudh Goyal, Kenji Kawaguchi, and Michael Shieh. 2025. Understanding and enhancing safety mechanisms of llms via safety-specific neuron. In The Thirteenth International Conference on Learning Representations

2025
[48]

Zhenhong Zhou, Haiyang Yu, Xinghua Zhang, Rongwu Xu, Fei Huang, Kun Wang, Yang Liu, Junfeng Fang, and Yongbin Li. 2024. On the role of attention heads in large language model safety. arXiv preprint arXiv:2410.13708

work page arXiv 2024
[1] [1]

David Bau, Bolei Zhou, Aditya Khosla, Aude Oliva, and Antonio Torralba. 2017. Network dissection: Quantifying interpretability of deep visual representations. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 6541--6549

2017

[2] [2]

Himanshu Beniwal, Youngwoo Kim, Maarten Sap, Soham Dan, and Thomas Hartvigsen. 2025. Breaking mbad! supervised fine-tuning for cross-lingual detoxification. arXiv preprint arXiv:2505.16722

work page arXiv 2025

[3] [3]

Towards understanding safety alignment: A mechanistic perspective from safety neurons

Jianhui Chen, Xiaozhi Wang, Zijun Yao, Yushi Bai, Lei Hou, and Juanzi Li. Towards understanding safety alignment: A mechanistic perspective from safety neurons. In The Thirty-ninth Annual Conference on Neural Information Processing Systems

[4]

Jianhui Chen, Xiaozhi Wang, Zijun Yao, Yushi Bai, Lei Hou, and Juanzi Li. 2024. Finding safety neurons in large language models. arXiv preprint arXiv:2406.14144.

[5]

Marta Costa-jussà, David Dale, Maha Elbayad, and Bokai Yu. 2024. Added toxicity mitigation at inference time for multimodal and massively multilingual translation. In Proceedings of the 25th Annual Conference of the European Association for Machine Translation (Volume 1), pages 360–372, Sheffield, UK. Europea... https://aclanthology.org/2024.eamt-1.31/

[6]

Damai Dai, Li Dong, Yaru Hao, Zhifang Sui, Baobao Chang, and Furu Wei. 2022. Knowledge neurons in pretrained transformers. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 8493–8502, Dublin, Ireland. Association for Computational Linguistics. https://doi.org/10.18653/v1/2022.acl-long.581

[8]

Sumanth Dathathri, Andrea Madotto, Janice Lan, Jane Hung, Eric Frank, Piero Molino, Jason Yosinski, and Rosanne Liu. 2020. Plug and play language models: A simple approach to controlled text generation. Preprint, arXiv:1912.02164. https://arxiv.org/abs/1912.02164

[9]

Daryna Dementieva, Nikolay Babakov, and Alexander Panchenko. 2024. MultiParaDetox: Extending text detoxification with parallel data to new languages. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 2: Short Papers), pages 124–140.

[10]

Daryna Dementieva, Nikolay Babakov, Amit Ronen, Abinew Ali Ayele, Naquee Rizwan, Florian Schneider, Xintong Wang, Seid Muhie Yimam, Daniil Alekhseevich Moskovskiy, Elisei Stakovskii, and 1 others. 2025. Multilingual and explainable text detoxification with parallel corpora. In Proceedings of the 31st International Conference on Computational Linguistics, ...

[11]

Daryna Dementieva, Sergey Ustyantsev, David Dale, Olga Kozlova, Nikita Semenov, Alexander Panchenko, and Varvara Logacheva. 2021. Crowdsourcing of parallel corpora: the case of style transfer for detoxification. In Proceedings of the 2nd Crowd Science Workshop: Trust, Ethics, and Excellence in Crowdsourced Data Mana... http://ceur-ws.org/Vol-2932/paper2.pdf

[12]

Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, and 1 others. 2024. The Llama 3 herd of models. arXiv preprint arXiv:2407.21783.

[13]

Ritik Dutta. 2024. Benchmarking stereotype bias and toxicity in large language models. Ph.D. thesis, University of Illinois at Urbana-Champaign.

[14]

Samuel Gehman, Suchin Gururangan, Maarten Sap, Yejin Choi, and Noah A. Smith. 2020. RealToxicityPrompts: Evaluating neural toxic degeneration in language models. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 3356–3369, Online. Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.findings-emnlp.301

[15]

Atticus Geiger, Hanson Lu, Thomas Icard, and Christopher Potts. 2021. Causal abstractions of neural networks. Advances in Neural Information Processing Systems, 34:9574–9586.

[16]

Zheng Hui, Zhaoxiao Guo, Hang Zhao, Juanyong Duan, Lin Ai, Yinheng Li, Julia Hirschberg, and Congrui Huang. 2024. ToxiLab: How well do open-source LLMs generate synthetic toxicity data? arXiv preprint arXiv:2411.15175.

[17]

Devansh Jain, Priyanshu Kumar, Samuel Gehman, Xuhui Zhou, Thomas Hartvigsen, and Maarten Sap. 2024. PolygloToxicityPrompts: Multilingual evaluation of neural toxic degeneration in large language models. arXiv preprint arXiv:2405.09373.

[18]

Hyukhun Koh, Dohyung Kim, Minwoo Lee, and Kyomin Jung. 2024. Can LLMs recognize toxicity? A structured investigation framework and toxicity metric. In Findings of the Association for Computational Linguistics: EMNLP 2024, pages 6092–6114.

[19]

Simon Kornblith, Mohammad Norouzi, Honglak Lee, and Geoffrey Hinton. 2019. Similarity of neural network representations revisited. In International Conference on Machine Learning, pages 3519–3529. PMLR.

[20]

Priyanshu Kumar, Devansh Jain, Akhila Yerukola, Liwei Jiang, Himanshu Beniwal, Thomas Hartvigsen, and Maarten Sap. 2025. PolyGuard: A multilingual safety moderation tool for 17 languages. arXiv preprint arXiv:2504.04377.

[21]

Jaewook Lee, Junseo Jang, Oh-Woog Kwon, and Harksoo Kim. 2025. Small changes, big impact: How manipulating a few neurons can drastically alter LLM aggression. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 23478–23505.

[22]

Maximilian Li and Lucas Janson. 2024. Optimal ablation for interpretability. Advances in Neural Information Processing Systems, 37:109233–109282.

[23]

Xiaochen Li, Zheng-Xin Yong, and Stephen Bach. 2024. Preference tuning for toxicity mitigation generalizes across languages. In Findings of the Association for Computational Linguistics: EMNLP 2024, pages 13422–13440.

[24]

Zhoubo Li, Ningyu Zhang, Yunzhi Yao, Mengru Wang, Xi Chen, and Huajun Chen. 2023. Unveiling the pitfalls of knowledge editing for large language models. arXiv preprint arXiv:2310.02129.

[25]

Alisa Liu, Maarten Sap, Ximing Lu, Swabha Swayamdipta, Chandra Bhagavatula, Noah A. Smith, and Yejin Choi. 2021. DExperts: Decoding-time controlled text generation with experts and anti-experts. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th Internatio... https://doi.org/10.18653/v1/2021.acl-long.522

[26]

Varvara Logacheva, Daryna Dementieva, Sergey Ustyantsev, Daniil Moskovskiy, David Dale, Irina Krotova, Nikita Semenov, and Alexander Panchenko. 2022. ParaDetox: Detoxification with parallel data. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)... https://aclanthology.org/2022.acl-long.469

[27]

Yifan Lu, Jing Li, Yigeng Zhou, Yihui Zhang, Wenya Wang, Xiucheng Li, Meishan Zhang, Fangming Liu, Jun Yu, and Min Zhang. 2025. Adaptive detoxification: Safeguarding general capabilities of LLMs through toxicity-aware knowledge editing. arXiv preprint arXiv:2505.22298.

[28]

Kevin Meng, David Bau, Alex J Andonian, and Yonatan Belinkov. 2022. Locating and editing factual associations in GPT. In Advances in Neural Information Processing Systems.

[29]

Vera Neplenbroek, Arianna Bisazza, and Raquel Fernández. 2024. Cross-lingual transfer of debiasing and detoxification in multilingual LLMs: An extensive investigation. arXiv preprint arXiv:2412.14050.

[30]

Chris Olah, Nick Cammarata, Ludwig Schubert, Gabriel Goh, Michael Petrov, and Shan Carter. 2020. Zoom in: An introduction to circuits. Distill, 5(3):e00024.001.

[31]

Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, and 1 others. 2022. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744.

[32]

Luiza Pozzobon, Patrick Lewis, Sara Hooker, and Beyza Ermis. 2024. From one to many: Expanding the scope of toxicity mitigation in language models. arXiv preprint arXiv:2403.03893.

[33]

Daking Rai, Yilun Zhou, Shi Feng, Abulhair Saparov, and Ziyu Yao. 2024. A practical review of mechanistic interpretability for transformer-based language models. arXiv preprint arXiv:2407.02646.

[34]

Maarten Sap, Dallas Card, Saadia Gabriel, Yejin Choi, and Noah A. Smith. 2019a. The risk of racial bias in hate speech detection. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 1668–1678, Florence, Italy. Association for Computational Linguistics. https://doi.org/10.18653/v1/P19-1163

[35]

Maarten Sap, Dallas Card, Saadia Gabriel, Yejin Choi, and Noah A Smith. 2019b. The risk of racial bias in hate speech detection. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 1668–1678.

[36]

Zuhair Hasan Shaik, Abdullah Mazhar, Aseem Srivastava, and Md Shad Akhtar. 2025. Redefining experts: Interpretable decomposition of language models for toxicity mitigation. arXiv preprint arXiv:2509.16660.

[37]

Vincent Siu, Nicholas Crispino, David Park, Nathan W Henry, Zhun Wang, Yang Liu, Dawn Song, and Chenguang Wang. 2025. SteeringSafety: A systematic safety evaluation framework of representation steering in LLMs. arXiv preprint arXiv:2509.13450.

[38]

Xavier Suau, Pieter Delobelle, Katherine Metcalf, Armand Joulin, Nicholas Apostoloff, Luca Zappella, and Pau Rodriguez. 2024. Whispering experts: Neural interventions for toxicity mitigation in language models. In International Conference on Machine Learning, pages 46843–46867. PMLR.

[39]

Guillermo Villate-Castillo, Javier Del Ser, and Borja Sanz Urquijo. 2024. A systematic review of toxicity in large language models: Definitions, datasets, detectors, detoxification methods and challenges.

[40]

Mengru Wang, Ningyu Zhang, Ziwen Xu, Zekun Xi, Shumin Deng, Yunzhi Yao, Qishen Zhang, Linyi Yang, Jindong Wang, and Huajun Chen. 2024. Detoxifying large language models via knowledge editing. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pa... https://doi.org/10.18653/v1/2024.acl-long.171

[41]

Johannes Welbl, Amelia Glaese, Jonathan Uesato, Sumanth Dathathri, John Mellor, Lisa Anne Hendricks, Kirsty Anderson, Pushmeet Kohli, Ben Coppin, and Po-Sen Huang. 2021. Challenges in detoxifying language models. arXiv preprint arXiv:2109.07445.

[42]

Langqi Yang, Tianhang Zheng, Kedong Xiu, Yixuan Chen, Di Wang, Puning Zhao, Zhan Qin, and Kui Ren. 2025a. HarmMetric Eval: Benchmarking metrics and judges for LLM harmfulness assessment. arXiv preprint arXiv:2509.24384.

[43]

Yushi Yang, Filip Sondej, Harry Mayne, Andrew Lee, and Adam Mahdi. 2025b. How does DPO reduce toxicity? A mechanistic neuron-level analysis. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 29512–29531.

[44]

Yushi Yang, Filip Sondej, Harry Mayne, and Adam Mahdi. 2024. Beyond toxic neurons: A mechanistic analysis of DPO for toxicity reduction. arXiv preprint arXiv:2411.06424.

[45]

Fred Zhang and Neel Nanda. 2023. Towards best practices of activation patching in language models: Metrics and methods. arXiv preprint arXiv:2309.16042.

[46]

Chongwen Zhao and Kaizhu Huang. 2025. Unraveling LLM jailbreaks through safety knowledge neurons. arXiv preprint arXiv:2509.01631.

[47]

Yiran Zhao, Wenxuan Zhang, Yuxi Xie, Anirudh Goyal, Kenji Kawaguchi, and Michael Shieh. 2025. Understanding and enhancing safety mechanisms of LLMs via safety-specific neuron. In The Thirteenth International Conference on Learning Representations.

[48]

Zhenhong Zhou, Haiyang Yu, Xinghua Zhang, Rongwu Xu, Fei Huang, Kun Wang, Yang Liu, Junfeng Fang, and Yongbin Li. 2024. On the role of attention heads in large language model safety. arXiv preprint arXiv:2410.13708.
