pith. machine review for the scientific record.

arxiv: 2604.08881 · v1 · submitted 2026-04-10 · 💻 cs.CV

Recognition: unknown

Precise Shield: Explaining and Aligning VLLM Safety via Neuron-Level Guidance


Pith reviewed 2026-05-10 18:08 UTC · model grok-4.3

classification 💻 cs.CV
keywords VLLM safety · safety neurons · gradient masking · cross-lingual transfer · multimodal alignment · activation contrast · neuron subspace · zero-shot transfer

The pith

Safety in vision-language models is concentrated in a small set of neurons that can be targeted for precise alignment.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to show that safety capabilities in VLLMs are instantiated in a limited number of neurons shared across languages and modalities. It identifies these safety neurons by comparing their activation levels on harmful versus benign inputs. Then it restricts all model updates to this tiny subspace using gradient masking, which changes fewer than 0.03 percent of the parameters. This targeted approach strengthens defenses against attacks that combine images with low-resource language text, while leaving the model's ability to handle general multilingual and multimodal tasks unchanged. The overlap in safety neurons also allows safety improvements to transfer to new languages and modalities without extra training.

Core claim

Precise Shield identifies safety neurons by contrasting activation patterns between harmful and benign inputs. It then constrains parameter updates strictly within this subspace via gradient masking, affecting fewer than 0.03% of parameters. This substantially improves safety while preserving multilingual and multimodal generalization. Analysis shows moderate overlap of safety neurons across languages and modalities, enabling zero-shot cross-lingual and cross-modal transfer of safety capabilities.
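The first stage can be sketched as a simple activation-contrast ranking. A minimal illustration with hypothetical shapes and names — the paper's exact statistic is not reproduced in this summary — assuming the score is the absolute difference of mean activations followed by top-k selection:

```python
import numpy as np

def identify_safety_neurons(acts_harmful, acts_benign, top_k):
    """Rank neurons by |mean activation on harmful inputs - mean
    activation on benign inputs| and return the indices of the
    top_k most contrastive neurons (descending by contrast)."""
    contrast = np.abs(acts_harmful.mean(axis=0) - acts_benign.mean(axis=0))
    return np.argsort(contrast)[::-1][:top_k]

# Toy example: 8 neurons, and only neuron 3 fires strongly on harmful inputs.
harmful = np.zeros((16, 8)); harmful[:, 3] = 5.0
benign = np.zeros((16, 8))
print(identify_safety_neurons(harmful, benign, top_k=1))  # -> [3]
```

In practice the activations would come from forward passes over paired harmful/benign prompt sets, one contrast per layer; this sketch collapses that to two matrices.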

What carries the argument

The safety neuron subspace, located by contrasting activations on harmful and benign inputs and isolated for updates through gradient masking.
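Gradient masking itself admits a compact sketch: zero every gradient component outside the identified subspace before the parameter update, so only the safety neurons can move. A toy version assuming plain SGD and a boolean mask (both assumptions — the paper's optimizer and mask construction are not stated here):

```python
import numpy as np

def masked_sgd_step(params, grads, safety_mask, lr=1e-2):
    """SGD update restricted to the safety subspace: gradient
    components outside the boolean mask are zeroed, so parameters
    outside the identified neuron set are provably untouched."""
    return params - lr * (grads * safety_mask)

params = np.ones(10)
grads = np.full(10, 2.0)
mask = np.zeros(10); mask[[2, 7]] = 1.0  # hypothetical safety-neuron indices
updated = masked_sgd_step(params, grads, mask)
# Only indices 2 and 7 change; all other parameters stay at 1.0.
```

The same idea scales to a full model by building one mask per weight tensor and multiplying it into the gradients each step (e.g. via per-tensor gradient hooks in an autodiff framework).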

If this is right

  • Targeted updates to the safety neurons improve resistance to multilingual and multimodal composite attacks.
  • Generalization across languages and modalities remains intact despite the focused changes.
  • Moderate neuron overlap permits zero-shot transfer of safety enhancements to other languages and input types.
  • Only a very small fraction of parameters, under 0.03%, requires modification to achieve these gains.
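The overlap bullet can be made concrete with a Jaccard-style measure over neuron index sets — a hypothetical metric choice, since the paper's own overlap statistic is not specified in this summary:

```python
def neuron_overlap(set_a, set_b):
    """Jaccard overlap between two safety-neuron index sets.
    A 'moderate' value would be consistent with partial sharing
    across languages or modalities; 1.0 means identical sets."""
    a, b = set(set_a), set(set_b)
    return len(a & b) / len(a | b)

# Two hypothetical per-language neuron sets sharing half their members.
print(neuron_overlap({1, 2, 3, 4}, {3, 4, 5, 6}))  # -> 0.3333...
```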

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This approach implies that safety is more localized than distributed, which could extend to fixing other issues like factual errors by targeting specific neuron groups.
  • Future work might test whether the identified neurons remain effective against entirely new attack types not used in the contrast step.
  • Since overlap is only moderate, combining this with minimal language-specific adjustments could further strengthen cross-lingual safety.

Load-bearing premise

The differing activation patterns between harmful and benign inputs point exactly to the neurons that control safety behavior, and limiting changes to them does not overlook other distributed safety processes or create new problems in the model.

What would settle it

If updating a randomly chosen set of neurons of equal size produces safety improvements similar to those from the contrast-identified set, or if removing the identified neurons does not reduce the model's safety responses on new inputs.
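The random-subset control described above is straightforward to operationalize: draw a random neuron set matched in size to the identified one and run the identical masked update. A sketch of the control-mask construction (sizes and names are illustrative):

```python
import random

def random_control_mask(num_neurons, k, seed=0):
    """Draw a random neuron subset of the same size k as the
    contrast-identified set. If masked updates on this control
    subspace match the safety gains of the identified subspace,
    the localization claim weakens; if they do not, it strengthens."""
    rng = random.Random(seed)  # seeded for a reproducible control
    return sorted(rng.sample(range(num_neurons), k))

control = random_control_mask(num_neurons=4096, k=12)
assert len(control) == len(set(control)) == 12  # distinct indices
```

Running this over several seeds would give a distribution of control outcomes to compare against the single contrast-identified result.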

Figures

Figures reproduced from arXiv: 2604.08881 by Enyi Shi, Fei Shen, Jinhui Tang, Linxia Zhu, Pengyang Shao, Shuyi Miao, Tat-Seng Chua.

Figure 1: VLLMs are relatively robust to multilingual text
Figure 2: Pipeline of our proposed framework. §3.1 Input Scenarios. We adopt a multilingual and multimodal harmful request
Figure 3: Impact of the number of safety neurons on model
Figure 5: Safety neuron overlap across languages and risk
Figure 6: Zero-Transfer of Modality Safety Neurons. Safety
Figure 8: Case Effectiveness. Our Precise Shield successfully
Figure 9: Feature space shifts across layers in Finnish under Image-Dominant risk.
Figure 10: Feature space shifts across layers in Finnish under Text-Dominant risk.
Figure 11: Feature space shifts across layers in Norwegian under Image-Dominant risk.
Figure 12: Feature space shifts across layers in Norwegian under Text-Dominant risk.
Figure 13: Feature space shifts across layers in Japanese under Image-Dominant risk.
Figure 14: Feature space shifts across layers in Japanese under Text-Dominant risk.
Figure 15: Image-Dominant Risk cases across multiple languages.
Figure 16: Text-Dominant Risk cases across multiple languages.
Original abstract

In real-world deployments, Vision-Language Large Models (VLLMs) face critical challenges from multilingual and multimodal composite attacks: harmful images paired with low-resource language texts can easily bypass defenses designed for high-resource language scenarios, exposing structural blind spots in current cross-lingual and cross-modal safety methods. This raises a mechanistic question: where is safety capability instantiated within the model, and how is it distributed across languages and modalities? Prior studies on pure-text LLMs have identified cross-lingual shared safety neurons, suggesting that safety may be governed by a small subset of critical neurons. Leveraging this insight, we propose Precise Shield, a two-stage framework that first identifies safety neurons by contrasting activation patterns between harmful and benign inputs, and then constrains parameter updates strictly within this subspace via gradient masking, affecting fewer than 0.03% of parameters. This strategy substantially improves safety while preserving multilingual and multimodal generalization. Further analysis reveals a moderate overlap of safety neurons across languages and modalities, enabling zero-shot cross-lingual and cross-modal transfer of safety capabilities, and offering a new direction for neuron-level, transfer-based safety enhancement.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes Precise Shield, a two-stage framework for VLLM safety alignment. It first identifies safety neurons via activation contrasts between harmful and benign inputs across languages and modalities, then applies gradient masking to restrict updates to this subspace (affecting <0.03% of parameters). The central claims are that this yields substantial safety improvements against multilingual/multimodal composite attacks while preserving generalization, and that moderate neuron overlap enables zero-shot cross-lingual and cross-modal transfer of safety capabilities.

Significance. If the causal status of the identified neurons and the sufficiency of the masked subspace are established, the approach would offer a highly parameter-efficient alignment method that exploits shared safety representations. This could address blind spots in current defenses for low-resource languages and multimodal inputs without full fine-tuning, extending prior LLM neuron studies to VLLMs with potential for transfer-based enhancements.

major comments (3)
  1. [Methods (neuron identification stage)] The neuron identification procedure (contrasting mean activations on harmful vs. benign inputs followed by top-k selection) is correlational; no causal interventions such as activation patching, targeted ablation, or causal tracing are described to verify that editing precisely these neurons alters refusal behavior while random or alternative subspaces do not. This directly undermines the claim that the subspace is the responsible safety mechanism rather than a correlated proxy.
  2. [Methods (gradient masking and subspace definition)] The 0.03% parameter threshold for the safety subspace is presented as a fixed outcome of the contrast method, yet the selection criterion and any hyperparameter tuning used to arrive at this sparsity level are not detailed; without this, the reported safety gains and transfer results risk being post-selection fits rather than robust findings.
  3. [Experiments (transfer and overlap analysis)] The zero-shot transfer claims rest on observed moderate overlap of safety neurons across languages/modalities, but the evaluation does not include controls for whether the masked updates on one language/modality inadvertently improve others via shared representations or simply via general regularization effects; this is load-bearing for the cross-lingual/cross-modal generalization argument.
minor comments (2)
  1. [Abstract] The abstract states 'substantially improves safety' and 'preserving multilingual and multimodal generalization' without any quantitative metrics, baselines, or effect sizes; including key numbers (e.g., safety score deltas, parameter counts) would strengthen the summary.
  2. [Methods] Notation for the activation contrast (e.g., exact formula for neuron ranking or difference metric) should be formalized with an equation to allow replication.
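One plausible formalization of the requested contrast metric — offered here as an assumption, not the paper's actual equation:

```latex
s_i \;=\; \left|\, \frac{1}{|\mathcal{D}_h|} \sum_{x \in \mathcal{D}_h} a_i(x)
\;-\; \frac{1}{|\mathcal{D}_b|} \sum_{x \in \mathcal{D}_b} a_i(x) \,\right|,
\qquad
\mathcal{S} \;=\; \operatorname{top\text{-}k}_{i}\, s_i
```

where $a_i(x)$ is neuron $i$'s activation on input $x$, $\mathcal{D}_h$ and $\mathcal{D}_b$ are the harmful and benign input sets, and $\mathcal{S}$ is the selected safety-neuron subspace.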

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback. We address each major comment below, providing our responses and indicating planned revisions to strengthen the manuscript.

Point-by-point responses
  1. Referee: [Methods (neuron identification stage)] The neuron identification procedure (contrasting mean activations on harmful vs. benign inputs followed by top-k selection) is correlational; no causal interventions such as activation patching, targeted ablation, or causal tracing are described to verify that editing precisely these neurons alters refusal behavior while random or alternative subspaces do not. This directly undermines the claim that the subspace is the responsible safety mechanism rather than a correlated proxy.

    Authors: We acknowledge that the neuron identification relies on correlational activation contrasts rather than explicit causal interventions. This approach aligns with standard practices in prior LLM neuron localization literature, where such contrasts have been used to identify functionally relevant neurons. Supporting evidence in our work includes the targeted safety gains from subspace-restricted updates, maintained generalization, and cross-lingual/cross-modal transfer. To directly address the causality concern, we will add activation patching and ablation experiments in the revision, demonstrating that intervening on the identified neurons affects refusal behavior while equivalent random or alternative subspaces do not. revision: yes

  2. Referee: [Methods (gradient masking and subspace definition)] The 0.03% parameter threshold for the safety subspace is presented as a fixed outcome of the contrast method, yet the selection criterion and any hyperparameter tuning used to arrive at this sparsity level are not detailed; without this, the reported safety gains and transfer results risk being post-selection fits rather than robust findings.

    Authors: The reported sparsity arises as the outcome of ranking neurons by the magnitude of their activation contrasts and selecting the top-k subset. We will revise the Methods section to explicitly describe the selection criterion (ranking by absolute mean activation difference between harmful and benign inputs), the precise k value or percentile used, and any hyperparameter considerations. This addition will clarify the procedure and reduce concerns about post-selection fitting. revision: yes

  3. Referee: [Experiments (transfer and overlap analysis)] The zero-shot transfer claims rest on observed moderate overlap of safety neurons across languages/modalities, but the evaluation does not include controls for whether the masked updates on one language/modality inadvertently improve others via shared representations or simply via general regularization effects; this is load-bearing for the cross-lingual/cross-modal generalization argument.

    Authors: We agree that controls are needed to distinguish transfer via shared safety representations from general regularization. The moderate overlap we observe provides correlational support for the shared-representation account. In the revision, we will include control experiments applying gradient masking to random subspaces of matched size and comparing zero-shot transfer performance against the safety subspace. This will help isolate the role of the identified neurons. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical neuron identification and masking yield experimental outcomes

full rationale

The paper's core chain—contrast harmful vs. benign activations to locate a small neuron set, then apply gradient masking to that set during updates—is a data-driven procedure whose claimed benefits (safety gains, preserved generalization, moderate cross-lingual overlap, zero-shot transfer) are reported as measured results on held-out evaluations. No equation or step equates the final safety metric to the contrast statistic by definition, no parameter is fitted on the target metric and then relabeled a prediction, and the cited prior insight on text LLMs is external rather than a self-citation that bears the entire load. The 0.03% figure is a post-selection size of the identified set, not a threshold chosen to force the reported numbers. The derivation therefore remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the assumption that safety is localized in a small identifiable neuron subset; the 0.03% threshold is a key hyperparameter chosen to balance safety gains and generalization.

free parameters (1)
  • 0.03% parameter update threshold
    Specific small fraction used to constrain updates; appears selected to achieve the reported preservation of generalization.
axioms (1)
  • domain assumption Safety capability in VLLMs is instantiated in a small subset of critical neurons that can be identified by contrasting activation patterns on harmful versus benign inputs.
    Directly invoked to justify the first stage of the framework and extended from prior LLM studies.

pith-pipeline@v0.9.0 · 5513 in / 1285 out tokens · 46219 ms · 2026-05-10T18:08:06.866940+00:00 · methodology

