pith. machine review for the scientific record.

arxiv: 2605.10777 · v1 · submitted 2026-05-11 · 💻 cs.LG

Recognition: no theorem link

Locking Pretrained Weights via Deep Low-Rank Residual Distillation

Authors on Pith no claims yet

Pith reviewed 2026-05-12 04:22 UTC · model grok-4.3

classification 💻 cs.LG
keywords model locking · DLR-Net · low-rank residual networks · adaptive attacker defense · LLM fine-tuning prevention · module-wise distillation · backpropagation memory overhead

The pith

Replacing each MLP with a deep low-rank residual network locks pretrained LLM weights against fine-tuning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that a model provider can replace every pretrained MLP with a deep low-rank residual network of the same parameter count, trained by module-wise distillation. This change preserves forward-pass behavior and original capabilities while forcing activation memory to grow linearly with depth during backpropagation and creating architectural mismatches that hinder standard optimization. A sympathetic reader would care because it offers a practical way to release open weights without enabling easy unauthorized adaptation for harmful uses.

Core claim

Replacing each pretrained multilayer perceptron with a deep low-rank residual network of comparable size, trained via module-wise distillation, yields a model whose backward pass incurs activation memory that grows linearly with depth, plus architectural mismatches that complicate standard fine-tuning optimization; this defense withstands adaptive attackers who know the full DLR-Net structure while leaving inference and original performance unchanged.

What carries the argument

The deep low-rank residual network (DLR-Net), which substitutes for each pretrained MLP and exploits the inference-training asymmetry of automatic differentiation to raise backward-pass memory costs.
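
The DLR-Net module itself is not reproduced on this page, so the following is a minimal PyTorch sketch of what a deep low-rank residual substitute for an MLP could look like. The class names, the SiLU nonlinearity, and the depth/rank/width figures are illustrative assumptions, not the authors' exact design.

# Minimal sketch (not the authors' code): swap a wide MLP for a stack of
# low-rank residual blocks with a comparable total parameter count. During
# backpropagation, autograd keeps the intermediate activations of every
# block, so activation memory grows linearly with depth.
import torch
import torch.nn as nn

class LowRankResidualBlock(nn.Module):
    def __init__(self, d_model: int, rank: int):
        super().__init__()
        self.down = nn.Linear(d_model, rank, bias=False)  # d_model -> rank
        self.up = nn.Linear(rank, d_model, bias=False)    # rank -> d_model
        self.act = nn.SiLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # The residual connection keeps each block near the identity, which
        # makes the distillation target easier to reach.
        return x + self.up(self.act(self.down(x)))

class DLRNet(nn.Module):
    """Hypothetical drop-in replacement for one pretrained MLP."""
    def __init__(self, d_model: int, rank: int, depth: int):
        super().__init__()
        self.blocks = nn.ModuleList(
            [LowRankResidualBlock(d_model, rank) for _ in range(depth)]
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for block in self.blocks:
            x = block(x)
        return x

# Rough parameter parity (illustrative dimensions): a d_model=4096,
# d_ff=14336 SwiGLU MLP holds about 3 * 4096 * 14336 ≈ 176M parameters;
# depth=86, rank=250 gives 86 * 2 * 4096 * 250 ≈ 176M, so the swap is
# size-neutral at inference while multiplying the backward-pass footprint.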

If this is right

  • Adaptive attackers with complete knowledge of the DLR-Net still face linear memory growth that scales with depth during any backpropagation-based fine-tuning.
  • The locked model retains the original forward-pass speed and task performance for inference use.
  • Standard fine-tuning pipelines encounter optimization difficulties due to the changed residual structure and activation storage requirements.
  • Module-wise distillation enables the locked model to be created efficiently without retraining the entire network from scratch (a minimal training-loop sketch follows this list).
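
A minimal sketch of how that module-wise distillation could proceed, assuming cached hidden states are available for the inputs of each original MLP; the MSE loss, the AdamW settings, and the function name distill_module are illustrative choices, not taken from the paper.

import torch
import torch.nn.functional as F

def distill_module(original_mlp, dlr_net, hidden_state_batches, steps=1000, lr=1e-3):
    # Teacher (the pretrained MLP) stays frozen; only the DLR-Net is trained.
    original_mlp.eval()
    for p in original_mlp.parameters():
        p.requires_grad_(False)

    opt = torch.optim.AdamW(dlr_net.parameters(), lr=lr)
    for step, x in enumerate(hidden_state_batches):  # (batch, seq, d_model) activations
        if step >= steps:
            break
        with torch.no_grad():
            target = original_mlp(x)                 # teacher output
        loss = F.mse_loss(dlr_net(x), target)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return dlr_net

Because each module is fit independently against its own frozen teacher, the procedure can run layer by layer (or in parallel across layers) and never requires end-to-end backpropagation through the full network.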

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same replacement strategy could be tested on attention layers or other components to extend the lock beyond MLPs.
  • Attackers might respond with gradient-free or memory-efficient optimizers that avoid full backpropagation, though at higher computational cost (a zeroth-order sketch follows this list).
  • Widespread use could shift open-weight sharing toward locked checkpoints that require special tools for any adaptation.
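
As an illustration of that gradient-free route, a zeroth-order (SPSA-style) update estimates a directional derivative from two forward passes and never stores intermediate activations, sidestepping the memory penalty at the cost of noisy updates and many more steps. This is an editorial sketch of the attack surface, not something the paper evaluates; the function name and step sizes are arbitrary.

import torch

@torch.no_grad()
def zeroth_order_step(model, loss_fn, batch, lr=1e-6, eps=1e-3):
    # Two forward passes, no backward pass: perturb all weights along a random
    # direction, measure the loss difference, and step against that estimate.
    params = [p for p in model.parameters() if p.requires_grad]
    noise = [torch.randn_like(p) for p in params]  # doubles weight memory;
                                                   # reseeding-based variants avoid this

    for p, z in zip(params, noise):                # theta + eps * z
        p.add_(z, alpha=eps)
    loss_plus = float(loss_fn(model, batch))

    for p, z in zip(params, noise):                # theta - eps * z
        p.add_(z, alpha=-2 * eps)
    loss_minus = float(loss_fn(model, batch))

    grad_est = (loss_plus - loss_minus) / (2 * eps)
    for p, z in zip(params, noise):                # restore weights, then descend
        p.add_(z, alpha=eps)
        p.add_(z, alpha=-lr * grad_est)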

Load-bearing premise

That the added memory overhead and architectural mismatch during backpropagation will remain prohibitive for adaptive attackers even after they know the exact DLR-Net structure and can optimize around it.
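
A back-of-envelope check of why this premise carries weight, using a rough accounting of what autograd stores and illustrative dimensions (a d_model=4096, d_ff=14336 SwiGLU MLP versus an assumed rank-250, depth-86 DLR-Net in bf16); the paper's actual configuration may differ.

# Illustrative per-token activation memory for one MLP slot (bf16 = 2 bytes).
d_model, d_ff, rank, depth, bytes_per = 4096, 14336, 250, 86, 2

# Original SwiGLU MLP: the backward pass needs roughly the block input plus
# the two gate/up intermediates.
mlp_acts = (d_model + 2 * d_ff) * bytes_per        # 65,536 B ≈ 64 KB per token

# DLR-Net: every block stores its input and its low-rank intermediate,
# so the cost grows linearly with depth.
dlr_acts = depth * (d_model + rank) * bytes_per    # 747,512 B ≈ 730 KB per token

print(dlr_acts / mlp_acts)                         # ≈ 11x more activation memory

Whether a roughly order-of-magnitude blow-up of this kind stays prohibitive once attackers apply checkpointing or offloading is exactly the question the premise hangs on.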

What would settle it

An experiment in which an adaptive attacker, given the full DLR-Net architecture and weights, successfully fine-tunes the model to reach performance levels comparable to fine-tuning the original unlocked model without hitting memory limits or failing to converge.

read the original abstract

The quality of open-weight language models has dramatically improved in recent years. Sharing weights greatly facilitates model adoption by enabling their use across diverse hardware and software platforms. They also allow for more open research and testing, to the extent that users can use them as checkpoints, fine-tune them according to their needs, and potentially redistribute them. In some cases, however, concerns on modifying these weights towards unauthorized uses may outweigh the pros of giving users such a freedom. Defending against such adaptation is non-trivial: since an adaptive attacker can observe all weights and architectures by definition, they can reverse simple structural defenses, and use optimization to defeat the simplest locking mechanisms. In this work, we exploit the inference-training asymmetry of automatic differentiation as a novel defense axis. We propose DLR-Lock, a method where the purveyor of the model purposely replaces each pretrained MLP in their model with a deep low-rank residual network (DLR-Net) of comparable parameter count, forcing activation memory that grows linearly with depth during backpropagation. DLR-Nets are efficiently trained via module-wise distillation. We show that, beyond this memory overhead, DLR-Lock results in architectural mismatches that complicate the optimization landscape of standard fine-tuning, and a backward pass that incurs disproportionately more overhead than the forward pass. Our defense succeeds in withstanding adaptive attackers with full knowledge of the defense strategy while preserving the original model's capabilities. Experiments on LLM validate these claims.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it: the pith above is the substance; this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes DLR-Lock, a defense for open-weight LLMs that replaces each pretrained MLP with a deep low-rank residual network (DLR-Net) of comparable parameter count. DLR-Nets are trained via module-wise distillation to preserve original capabilities. The method exploits automatic differentiation asymmetry to induce linear activation memory growth with DLR-Net depth during backpropagation, plus architectural mismatches and disproportionate backward-pass overhead, with the goal of deterring fine-tuning by adaptive attackers who know the full defense structure.

Significance. If the empirical claims hold, the work introduces a novel defense axis based on inference-training asymmetry in autodiff, offering a practical mechanism to lock models against unauthorized adaptation without capability loss. This could meaningfully affect open-weight model sharing practices by raising the cost of fine-tuning for attackers. The module-wise distillation approach for efficient training of the residuals is a constructive element that supports reproducibility of the locking procedure.

major comments (2)
  1. [Abstract] The central claim that the defense 'succeeds in withstanding adaptive attackers with full knowledge of the defense strategy' is load-bearing yet unsupported by any description of the adaptive attack protocol, attacker optimizer adaptations, memory/throughput measurements under full-knowledge conditions, or ablation on whether selective checkpointing on residuals was permitted. This directly affects assessment of whether the linear memory growth remains prohibitive.
  2. [Experiments section] No quantitative results, tables, or figures are referenced that report activation memory scaling, fine-tuning throughput degradation, or success rates against full-knowledge adaptive attacks; without these, the assertion that architectural mismatch complicates the optimization landscape cannot be evaluated for the claimed deterrence effect.
minor comments (1)
  1. [Abstract] The phrase 'Experiments on LLM validate these claims' should specify the exact models, datasets, and metrics used to allow readers to assess the capability-preservation results.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback and for recognizing the potential of the inference-training asymmetry as a defense axis, as well as the utility of module-wise distillation. We address the two major comments below and will revise the manuscript to provide the requested details and quantitative support.

read point-by-point responses
  1. Referee: [Abstract] The central claim that the defense 'succeeds in withstanding adaptive attackers with full knowledge of the defense strategy' is load-bearing yet unsupported by any description of the adaptive attack protocol, attacker optimizer adaptations, memory/throughput measurements under full-knowledge conditions, or ablation on whether selective checkpointing on residuals was permitted. This directly affects assessment of whether the linear memory growth remains prohibitive.

    Authors: We agree that the current manuscript does not provide a sufficiently explicit description of the adaptive attack protocol or the associated measurements. In the revision we will add a new subsection (Experiments, Adaptive Attacks) that details the full-knowledge attacker assumptions, the optimizer adaptations tested (standard AdamW with and without gradient accumulation), the measured activation memory growth and fine-tuning throughput under these conditions, and an ablation on selective checkpointing of the residual blocks. These additions will directly support the claim that linear memory growth remains prohibitive. revision: yes
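
For readers unfamiliar with the counter-move that ablation would probe: selective activation checkpointing recomputes a block's intermediates during the backward pass instead of storing them, trading extra forward compute for memory. Below is a hedged sketch of how an attacker might wrap the residual blocks, assuming a blocks attribute as in the earlier sketch; the wrapper name and the every_k policy are illustrative.

import torch
from torch.utils.checkpoint import checkpoint

def dlr_forward_with_checkpointing(dlr_net, x, every_k: int = 1):
    # For each checkpointed block, only the block input is cached; its low-rank
    # intermediates are recomputed in the backward pass, shrinking the per-block
    # activation footprint at the cost of an extra forward pass per block.
    # Checkpointing longer segments of blocks (as in sublinear-memory training)
    # would cut memory further.
    for i, block in enumerate(dlr_net.blocks):
        if i % every_k == 0:
            x = checkpoint(block, x, use_reentrant=False)
        else:
            x = block(x)
    return x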

  2. Referee: [Experiments section] No quantitative results, tables, or figures are referenced that report activation memory scaling, fine-tuning throughput degradation, or success rates against full-knowledge adaptive attacks; without these, the assertion that architectural mismatch complicates the optimization landscape cannot be evaluated for the claimed deterrence effect.

    Authors: We acknowledge that the present draft references capability-preservation results but does not include the requested quantitative tables or figures for memory scaling, throughput degradation, or adaptive-attack success rates. We will expand the Experiments section with new tables and figures that report (i) activation memory versus DLR-Net depth, (ii) fine-tuning throughput under the locked architecture, and (iii) attack success rates together with convergence behavior illustrating the optimization complications induced by the architectural mismatch. These additions will allow direct evaluation of the deterrence effect. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical method with independent experimental validation

full rationale

The paper presents DLR-Lock as an empirical construction: pretrained MLPs are replaced by DLR-Nets of comparable parameter count, trained module-wise by distillation, to induce linear activation memory growth in backpropagation plus architectural mismatch. The central claim—that this withstands full-knowledge adaptive attackers while preserving capabilities—is supported by experiments on LLMs rather than any closed derivation. No equations, fitted parameters, or self-citations are shown that reduce the defense success to a definitional identity or input-by-construction prediction. The method is evaluated against external benchmarks (memory measurements, fine-tuning attempts) and does not invoke uniqueness theorems or ansatzes from prior author work as load-bearing justification.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

The approach rests on the standard properties of automatic differentiation and the feasibility of module-wise distillation to match original activations; no new physical or mathematical axioms are introduced.

free parameters (1)
  • depth and low-rank dimension of each DLR-Net
    Chosen to keep total parameter count comparable while creating linear memory growth in backprop
axioms (1)
  • standard math · Automatic differentiation incurs memory linear in the depth of the computation graph during backpropagation
    Invoked to justify the memory overhead of the deeper residual structure
invented entities (1)
  • DLR-Net · no independent evidence
    purpose: Replace standard MLP to enforce memory asymmetry and optimization mismatch
    New architectural component introduced by the paper

pith-pipeline@v0.9.0 · 5561 in / 1207 out tokens · 37692 ms · 2026-05-12T04:22:07.165471+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

54 extracted references · 54 canonical work pages · 6 internal anchors

  1. [1]

    Jason Ansel, Edward Yang, Horace He, Natalia Gimelshein, Animesh Jain, Michael Voznesensky, Bin Bao, Peter Bell, David Berard, Evgeni Burovski, Geeta Chauhan, Anjali Chourdia, Will Constable, Alban Desmaison, Zachary DeVito, Elias Ellison, Will Feng, Jiong Gong, Michael Gschwind, Brian Hirsh, Sherlock Huang, Kshiteej Kalambarkar, Laurent Kirsch, Michael L...

  2. [2]

    Rezero is all you need: Fast convergence at large depth

    Thomas Bachlechner, Bodhisattwa Prasad Majumder, Henry Mao, Gary Cottrell, and Julian McAuley. Rezero is all you need: Fast convergence at large depth. In Uncertainty in artificial intelligence, pages 1352--1361. PMLR, 2021

  3. [3]

    Considerations for governing open foundation models

    Rishi Bommasani, Sayash Kapoor, Kevin Klyman, Shayne Longpre, Ashwin Ramaswami, Daniel Zhang, Marietje Schaake, Daniel E Ho, Arvind Narayanan, and Percy Liang. Considerations for governing open foundation models. Science, 386(6718): 151--153, 2024

  4. [4]

    Convex optimization

    Stephen Boyd and Lieven Vandenberghe. Convex optimization. Cambridge university press, 2004

  5. [5]

    Distillation scaling laws

    Dan Busbridge, Amitis Shidani, Floris Weers, Jason Ramapuram, Etai Littwin, and Russ Webb. Distillation scaling laws. arXiv preprint arXiv:2502.08606, 2025

  6. [6]

    Open technical problems in open-weight AI model risk management

    Stephen Casper, Kyle O'Brien, Shayne Longpre, Elizabeth Seger, Kevin Klyman, Rishi Bommasani, Aniruddha Nrusimha, Ilia Shumailov, Sören Mindermann, Steven Basart, Frank Rudzicz, Kellin Pelrine, Avijit Ghosh, Andrew Strait, Robert Kirk, Dan Hendrycks, Peter Henderson, J Zico Kolter, Geoffrey Irving, Yarin Gal, Yoshua Bengio, and Dylan Hadfield-Menell. ...

  7. [7]

    Training Deep Nets with Sublinear Memory Cost

    Tianqi Chen, Bing Xu, Chiyuan Zhang, and Carlos Guestrin. Training deep nets with sublinear memory cost. arXiv preprint arXiv:1604.06174, 2016

  8. [8]

    Attention Editing: A Versatile Framework for Cross-Architecture Attention Conversion

    Zhen Cheng, Hao-Bo Yang, Wan-Yi Huang, and Jin-Long Li. Attention editing: A versatile framework for cross-architecture attention conversion. arXiv preprint arXiv:2604.05688, 2026

  9. [9]

    Flashattention-2: Faster attention with better parallelism and work partitioning

    Tri Dao. Flashattention-2: Faster attention with better parallelism and work partitioning. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=mZn2Xyh9Ec

  10. [10]

    Sigmoid-weighted linear units for neural network function approximation in reinforcement learning

    Stefan Elfwing, Eiji Uchibe, and Kenji Doya. Sigmoid-weighted linear units for neural network function approximation in reinforcement learning. Neural networks, 107: 3--11, 2018

  11. [11]

    Towards LLM unlearning resilient to relearning attacks: A sharpness-aware minimization perspective and beyond

    Chongyu Fan, Jinghan Jia, Yihua Zhang, Anil Ramakrishna, Mingyi Hong, and Sijia Liu. Towards LLM unlearning resilient to relearning attacks: A sharpness-aware minimization perspective and beyond. In Forty-second International Conference on Machine Learning, 2025. URL https://openreview.net/forum?id=zZjLv6F0Ks

  12. [12]

    A framework for few-shot language model evaluation

    Leo Gao, Jonathan Tow, Baber Abbasi, Stella Biderman, Sid Black, Anthony DiPofi, Charles Foster, Laurence Golding, Jeffrey Hsu, Alain Le Noac'h, Haonan Li, Kyle McDonell, Niklas Muennighoff, Chris Ociepa, Jason Phang, Laria Reynolds, Hailey Schoelkopf, Aviya Skowron, Lintang Sutawika, Eric Tang, Anish Thite, Ben Wang, Kevin Wang, and Andy Zou. A framework...

  13. [13]

    On the symmetries of deep learning models and their internal representations

    Charles Godfrey, Davis Brown, Tegan Emerson, and Henry Kvinge. On the symmetries of deep learning models and their internal representations. Advances in Neural Information Processing Systems, 35: 11893--11905, 2022

  14. [14]

    The Llama 3 Herd of Models

    Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024

  15. [15]

    Evaluating derivatives: principles and techniques of algorithmic differentiation

    Andreas Griewank and Andrea Walther. Evaluating derivatives: principles and techniques of algorithmic differentiation. SIAM, 2008

  16. [16]

    Distilling the Knowledge in a Neural Network

    Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015

  17. [17]

    LoRA: Low-rank adaptation of large language models

    Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models. In International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=nZeVKeeFYf9

  18. [18]

    Unlearning or obfuscating? jogging the memory of unlearned LLMs via benign relearning

    Shengyuan Hu, Yiwei Fu, Steven Wu, and Virginia Smith. Unlearning or obfuscating? jogging the memory of unlearned LLMs via benign relearning. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=fMNRYBvcQN

  19. [19]

    Vaccine: Perturbation-aware alignment for large language models against harmful fine-tuning attack

    Tiansheng Huang, Sihao Hu, and Ling Liu. Vaccine: Perturbation-aware alignment for large language models against harmful fine-tuning attack. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024. URL https://openreview.net/forum?id=lpXDZKiAnt

  20. [20]

    Booster: Tackling harmful fine-tuning for large language models via attenuating harmful perturbation

    Tiansheng Huang, Sihao Hu, Fatih Ilhan, Selim Furkan Tekin, and Ling Liu. Booster: Tackling harmful fine-tuning for large language models via attenuating harmful perturbation. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=tTPHgb0EtV

  21. [21]

    A stochastic estimator of the trace of the influence matrix for Laplacian smoothing splines

    M.F. Hutchinson. A stochastic estimator of the trace of the influence matrix for laplacian smoothing splines. Communications in Statistics - Simulation and Computation, 19(2): 433--450, 1990. doi:10.1080/03610919008812866. URL https://doi.org/10.1080/03610919008812866

  22. [22]

    Disrupting model merging: A parameter-level defense without sacrificing accuracy

    Wei Junhao, Yu Zhe, and Jun Sakuma. Disrupting model merging: A parameter-level defense without sacrificing accuracy. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 17698--17707, 2025

  23. [23]

    On the societal impact of open foundation models

    Sayash Kapoor, Rishi Bommasani, Kevin Klyman, Shayne Longpre, Ashwin Ramaswami, Peter Cihon, Aspen Hopkins, Kevin Bankston, Stella Biderman, Miranda Bogen, et al. On the societal impact of open foundation models. arXiv preprint arXiv:2403.07918, 2024

  24. [24]

    La cryptographie militaire

    Auguste Kerckhoffs. La cryptographie militaire. J. Sci. Militaires, 9(4): 5--38, 1883

  25. [25]

    Mnist handwritten digit database

    Yann LeCun, Corinna Cortes, and CJ Burges. Mnist handwritten digit database. ATT Labs [Online]. Available: http://yann.lecun.com/exdb/mnist, 2, 2010

  26. [26]

    Distillation robustifies unlearning

    Bruce W. Lee, Addie Foote, Alex Infanger, Leni Shor, Harish K Kamath, Jacob Goldman-Wetzler, Bryce Woodworth, Alex Cloud, and Alexander Matt Turner. Distillation robustifies unlearning. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025. URL https://openreview.net/forum?id=UTGjik64IK

  27. [27]

    Module-wise adaptive distillation for multimodality foundation models

    Chen Liang, Jiahui Yu, Ming-Hsuan Yang, Matthew Brown, Yin Cui, Tuo Zhao, Boqing Gong, and Tianyi Zhou. Module-wise adaptive distillation for multimodality foundation models. In Thirty-seventh Conference on Neural Information Processing Systems, 2023a. URL https://openreview.net/forum?id=JhQP33aMx2

  28. [28]

    Less is more: Task-aware layer-wise distillation for language model compression

    Chen Liang, Simiao Zuo, Qingru Zhang, Pengcheng He, Weizhu Chen, and Tuo Zhao. Less is more: Task-aware layer-wise distillation for language model compression. In International Conference on Machine Learning, pages 20852--20867. PMLR, 2023b

  29. [29]

    Targeted vaccine: Safety alignment for large language models against harmful fine-tuning via layer-wise perturbation

    Guozhi Liu, Weiwei Lin, Qi Mu, Tiansheng Huang, Ruichao Mo, Yuren Tao, and Li Shen. Targeted vaccine: Safety alignment for large language models against harmful fine-tuning via layer-wise perturbation. IEEE Transactions on Information Forensics and Security, 2025

  30. [30]

    m2mkd: Module-to-module knowledge distillation for modular transformers

    Ka Man Lo, Yiming Liang, Wenyu Du, Yuantao Fan, Zili Wang, Wenhao Huang, Lei Ma, and Jie Fu. m2mkd: Module-to-module knowledge distillation for modular transformers. arXiv preprint arXiv:2402.16918, 2024

  31. [31]

    Decoupled weight decay regularization

    Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=Bkg6RiCqY7

  32. [32]

    Eight methods to evaluate robust unlearning in LLMs

    Aengus Lynch, Phillip Guo, Aidan Ewart, Stephen Casper, and Dylan Hadfield-Menell. Eight methods to evaluate robust unlearning in llms. arXiv preprint arXiv:2402.16835, 2024

  33. [33]

    Pointer sentinel mixture models, 2016

    Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. Pointer sentinel mixture models, 2016

  34. [34]

    Antibody: Strengthening defense against harmful fine-tuning for large language models via attenuating harmful gradient influence

    Quoc Minh Nguyen, Trung Le, Jing Wu, Anh Tuan Bui, and Mehrtash Harandi. Antibody: Strengthening defense against harmful fine-tuning for large language models via attenuating harmful gradient influence. In The Fourteenth International Conference on Learning Representations, 2026. URL https://openreview.net/forum?id=qur2ef8MqQ

  35. [35]

    On evaluating the durability of safeguards for open-weight LLMs

    Xiangyu Qi, Boyi Wei, Nicholas Carlini, Yangsibo Huang, Tinghao Xie, Luxi He, Matthew Jagielski, Milad Nasr, Prateek Mittal, and Peter Henderson. On evaluating the durability of safeguards for open-weight LLMs. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=fXJCqdUSVG

  36. [36]

    On-the-fly adaptive distillation of transformer to dual-state linear attention for long-context LLM serving

    Yeonju Ro, Zhenyu Zhang, Souvik Kundu, Zhangyang Wang, and Aditya Akella. On-the-fly adaptive distillation of transformer to dual-state linear attention for long-context LLM serving. In Forty-second International Conference on Machine Learning, 2025. URL https://openreview.net/forum?id=pqHWzviKKN

  37. [37]

    Representation noising: A defence mechanism against harmful finetuning

    Domenic Rosati, Jan Wehner, Kai Williams, Lukasz Bartoszcze, Robie Gonzales, Carsten Maple, Subhabrata Majumdar, Hassan Sajjad, and Frank Rudzicz. Representation noising: A defence mechanism against harmful finetuning. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024. URL https://openreview.net/forum?id=eP9auEJqFg

  38. [38]

    Locking open weight models with spectral deformation

    Domenic Rosati, Sebastian Dionicio, Xijie Zeng, Subhabrata Majumdar, Frank Rudzicz, and Hassan Sajjad. Locking open weight models with spectral deformation. In ICML Workshop on Technical AI Governance (TAIG), 2025. URL https://openreview.net/forum?id=cjrm7bo6Eg

  39. [39]

    Limits of convergence-rate control for open-weight safety

    Domenic Rosati, Xijie Zeng, Hong Huang, Sebastian Dionicio, Subhabrata Majumdar, Frank Rudzicz, and Hassan Sajjad. Limits of convergence-rate control for open-weight safety. arXiv preprint arXiv:2602.18868, 2026

  40. [40]

    GLU Variants Improve Transformer

    Noam Shazeer. Glu variants improve transformer. arXiv preprint arXiv:2002.05202, 2020

  41. [41]

    Nemotron-cc: Transforming common crawl into a refined long-horizon pretraining dataset

    Dan Su, Kezhi Kong, Ying Lin, Joseph Jennings, Brandon Norick, Markus Kliegl, Mostofa Patwary, Mohammad Shoeybi, and Bryan Catanzaro. Nemotron-cc: Transforming common crawl into a refined long-horizon pretraining dataset. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2459--2475, 2025

  42. [42]

    Tamper-resistant safeguards for open-weight LLMs

    Rishub Tamirisa, Bhrugu Bharathi, Long Phan, Andy Zhou, Alice Gatti, Tarun Suresh, Maxwell Lin, Justin Wang, Rowan Wang, Ron Arel, Andy Zou, Dawn Song, Bo Li, Dan Hendrycks, and Mantas Mazeika. Tamper-resistant safeguards for open-weight LLMs. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?i...

  43. [43]

    Attention is all you need

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017

  44. [44]

    Self-destructive language models

    Yuhui Wang, Rongyi Zhu, and Ting Wang. Self-destructive language models. In The Fourteenth International Conference on Learning Representations, 2026a. URL https://openreview.net/forum?id=ERNpUGr8M5

  45. [45]

    Model unmerging: Making your models unmergeable for secure model sharing

    Zihao Wang, Enneng Yang, Lu Yin, Shiwei Liu, and Li Shen. Model unmerging: Making your models unmergeable for secure model sharing. arXiv preprint arXiv:2509.01548, 2025

  46. [46]

    Towards building non-fine-tunable foundation models

    Ziyao Wang, Nizhang Li, Pingzhi Li, Guoheng Sun, Tianlong Chen, and Ang Li. Towards building non-fine-tunable foundation models. arXiv preprint arXiv:2602.00446, 2026b

  47. [47]

    Bert-of-theseus: Compressing bert by progressive module replacing

    Canwen Xu, Wangchunshu Zhou, Tao Ge, Furu Wei, and Ming Zhou. Bert-of-theseus: Compressing bert by progressive module replacing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 7859--7869, 2020

  48. [48]

    Qwen3 Technical Report

    An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025

  49. [49]

    Asft: Anchoring safety during llm fine-tuning within narrow safety basin

    Shuo Yang, Qihui Zhang, Yuyang Liu, Yue Huang, Xiaojun Jia, Kun-Peng Ning, Jia-Yu Yao, Jigang Wang, Dai Hailiang, Yibing Song, et al. Asft: Anchoring safety during llm fine-tuning within narrow safety basin. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 34322--34330, 2026

  50. [50]

    Root mean square layer normalization

    Biao Zhang and Rico Sennrich. Root mean square layer normalization. Advances in neural information processing systems, 32, 2019

  51. [51]

    Symmetry in neural network parameter spaces

    Bo Zhao, Robin Walters, and Rose Yu. Symmetry in neural network parameter spaces. Transactions on Machine Learning Research, 2026. ISSN 2835-8856. URL https://openreview.net/forum?id=jLpWq5QY6I

  52. [52]

    Understanding and enhancing safety mechanisms of LLMs via safety-specific neuron

    Yiran Zhao, Wenxuan Zhang, Yuxi Xie, Anirudh Goyal, Kenji Kawaguchi, and Michael Shieh. Understanding and enhancing safety mechanisms of LLMs via safety-specific neuron. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=yR47RmND1m

  53. [53]

    Model immunization from a condition number perspective

    Amber Yijia Zheng, Site Bai, Brian Bullins, and Raymond A. Yeh. Model immunization from a condition number perspective. In Forty-second International Conference on Machine Learning, 2025. URL https://openreview.net/forum?id=uitj69FqD5

  54. [54]

    Modular transformers: Compressing transformers into modularized layers for flexible efficient inference

    Wangchunshu Zhou, Ronan Le Bras, and Yejin Choi. Modular transformers: Compressing transformers into modularized layers for flexible efficient inference. In Findings of the Association for Computational Linguistics: ACL 2023, pages 10452--10465, 2023