Recognition: no theorem link
Locking Pretrained Weights via Deep Low-Rank Residual Distillation
Pith reviewed 2026-05-12 04:22 UTC · model grok-4.3
The pith
Replacing each MLP with a deep low-rank residual network locks pretrained LLM weights against fine-tuning.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Replacing each pretrained multilayer perceptron with a deep low-rank residual network of comparable size, trained via module-wise distillation, yields a model whose backward pass incurs activation-memory costs that grow linearly with the replacement network's depth, along with architectural mismatches that complicate standard fine-tuning optimization; this defense withstands adaptive attackers who know the full DLR-Net structure while leaving inference and original performance unchanged.
What carries the argument
The deep low-rank residual network (DLR-Net), which substitutes for each pretrained MLP and exploits the inference-training asymmetry of automatic differentiation to raise backward-pass memory costs.
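The abstract does not pin down the block design, so the following is only a minimal PyTorch sketch of what a deep low-rank residual replacement for one MLP could look like; the block structure, activation function, and parameter-matching arithmetic are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical DLR-Net sketch: a deep stack of low-rank residual blocks that
# replaces one pretrained MLP. Illustrative only; not the paper's code.
import torch
import torch.nn as nn

class LowRankResidualBlock(nn.Module):
    def __init__(self, d_model: int, rank: int):
        super().__init__()
        self.down = nn.Linear(d_model, rank, bias=False)  # d_model -> rank
        self.up = nn.Linear(rank, d_model, bias=False)    # rank -> d_model
        self.act = nn.SiLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Each block's input must be kept for the backward pass, so stored
        # activations grow linearly with the number of blocks.
        return x + self.up(self.act(self.down(x)))

class DLRNet(nn.Module):
    """Deep low-rank residual network substituted for one MLP."""
    def __init__(self, d_model: int, rank: int, depth: int):
        super().__init__()
        self.blocks = nn.ModuleList(LowRankResidualBlock(d_model, rank) for _ in range(depth))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for block in self.blocks:
            x = block(x)
        return x

# Parameter matching (illustrative): a SwiGLU MLP with d_model=4096, d_ff=14336
# has about 3 * 4096 * 14336 ≈ 176M parameters; depth * 2 * d_model * rank is
# comparable at, e.g., rank=64 and depth ≈ 336 blocks.
```

Inference remains a chain of cheap low-rank matrix multiplications, but every intermediate residual state becomes a stored activation once gradients are required.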
If this is right
- Adaptive attackers with complete knowledge of the DLR-Net still face activation-memory costs that grow linearly with its depth during any backpropagation-based fine-tuning.
- The locked model retains the original forward-pass speed and task performance for inference use.
- Standard fine-tuning pipelines encounter optimization difficulties due to the changed residual structure and activation storage requirements.
- Module-wise distillation enables the locked model to be created efficiently without retraining the entire network from scratch.
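A minimal sketch of how that module-wise distillation could be run, assuming access to the frozen original MLP and a stream of hidden states captured at that layer's input; `original_mlp` and `hidden_state_batches` are placeholder names, not the paper's API.

```python
# Module-wise distillation sketch: train one DLR-Net to mimic one frozen MLP
# on hidden states drawn from the pretrained model. Illustrative assumptions:
# an MSE objective and AdamW; the paper may use different losses or schedules.
import torch
import torch.nn.functional as F

def distill_module(original_mlp, dlr_net, hidden_state_batches, steps=1000, lr=1e-3):
    original_mlp.eval()
    opt = torch.optim.AdamW(dlr_net.parameters(), lr=lr)
    for _, h in zip(range(steps), hidden_state_batches):
        with torch.no_grad():
            target = original_mlp(h)            # teacher output for this module
        loss = F.mse_loss(dlr_net(h), target)   # match the MLP's input-output map
        opt.zero_grad()
        loss.backward()
        opt.step()
    return dlr_net
```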
Where Pith is reading between the lines
- The same replacement strategy could be tested on attention layers or other components to extend the lock beyond MLPs.
- Attackers might respond with gradient-free or memory-efficient optimizers that avoid full backpropagation, though at higher computational cost (see the forward-only sketch after this list).
- Widespread use could shift open-weight sharing toward locked checkpoints that require special tools for any adaptation.
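One such forward-only route is a generic SPSA/MeZO-style zeroth-order update, sketched below, which stores no activations at all; this illustrates the attack class rather than anything in the paper, and in practice it converges far more slowly than gradient-based fine-tuning.

```python
# Zeroth-order (forward-only) update sketch: estimate a directional derivative
# from two perturbed forward passes, so no backward pass or activation storage
# is needed. Generic SPSA/MeZO-style illustration; loss_fn(model, batch) is a
# placeholder returning a scalar loss.
import torch

@torch.no_grad()
def zeroth_order_step(model, loss_fn, batch, lr=1e-6, eps=1e-3, seed=0):
    gen = torch.Generator(device="cpu").manual_seed(seed)
    params = [p for p in model.parameters() if p.requires_grad]
    # Keep the noise so we can restore weights; MeZO instead re-draws it from
    # the seed to avoid even this extra parameter-sized copy.
    noises = [torch.randn(p.shape, generator=gen).to(p.device) for p in params]
    for p, z in zip(params, noises):
        p.add_(eps * z)                      # theta + eps*z
    loss_plus = loss_fn(model, batch)
    for p, z in zip(params, noises):
        p.add_(-2 * eps * z)                 # theta - eps*z
    loss_minus = loss_fn(model, batch)
    grad_scale = (loss_plus - loss_minus) / (2 * eps)
    for p, z in zip(params, noises):
        p.add_(eps * z)                      # restore theta
        p.add_(-lr * grad_scale * z)         # projected-gradient step along z
    return (loss_plus + loss_minus) / 2
```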
Load-bearing premise
That the added memory overhead and architectural mismatch during backpropagation will remain prohibitive for adaptive attackers even after they know the exact DLR-Net structure and can optimize around it.
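The most direct way an attacker could "optimize around it" is selective activation checkpointing over the residual stack, trading recomputation for memory; the premise is that the cost remains prohibitive even then. A generic PyTorch sketch, reusing the hypothetical DLRNet above:

```python
# Selective checkpointing sketch: split the residual stack into ~sqrt(depth)
# segments so only segment boundaries are stored and segment interiors are
# recomputed during the backward pass. Generic PyTorch; the sqrt split is the
# classic memory-versus-recompute trade-off.
import math
import torch.nn as nn
from torch.utils.checkpoint import checkpoint_sequential

def checkpointed_dlr_forward(dlr_net, x):
    blocks = nn.Sequential(*dlr_net.blocks)
    segments = max(1, math.isqrt(len(dlr_net.blocks)))
    return checkpoint_sequential(blocks, segments, x, use_reentrant=False)
```

Whether this, combined with offloading or micro-batching, is enough to make fine-tuning practical again is exactly what the adaptive-attack evaluation has to rule out.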
What would settle it
An experiment in which an adaptive attacker, given the full DLR-Net architecture and weights, successfully fine-tunes the model to reach performance levels comparable to fine-tuning the original unlocked model without hitting memory limits or failing to converge.
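The memory side of such an experiment is straightforward to instrument; a sketch, assuming the hypothetical DLRNet above, a CUDA device, and illustrative sizes:

```python
# Measure peak GPU memory for one forward+backward through a DLR-Net as depth
# grows. Sizes are illustrative; the full experiment would pair these curves
# with the fine-tuning quality the adaptive attacker actually reaches.
import torch

def peak_backward_memory_mb(d_model=2048, rank=64, depth=128, tokens=1024):
    torch.cuda.empty_cache()
    torch.cuda.reset_peak_memory_stats()
    net = DLRNet(d_model, rank, depth).cuda()
    x = torch.randn(tokens, d_model, device="cuda", requires_grad=True)
    net(x).sum().backward()
    return torch.cuda.max_memory_allocated() / 2**20

for depth in (32, 64, 128, 256):
    print(depth, round(peak_backward_memory_mb(depth=depth)))
# Expected pattern if the claim holds: roughly linear growth of peak memory
# with depth, well beyond the cost of the original single MLP.
```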
read the original abstract
The quality of open-weight language models has dramatically improved in recent years. Sharing weights greatly facilitates model adoption by enabling their use across diverse hardware and software platforms. They also allow for more open research and testing, to the extent that users can use them as checkpoints, fine-tune them according to their needs, and potentially redistribute them. In some cases, however, concerns on modifying these weights towards unauthorized uses may outweigh the pros of giving users such a freedom. Defending against such adaptation is non-trivial: since an adaptive attacker can observe all weights and architectures by definition, they can reverse simple structural defenses, and use optimization to defeat the simplest locking mechanisms. In this work, we exploit the inference-training asymmetry of automatic differentiation as a novel defense axis. We propose DLR-Lock, a method where the purveyor of the model purposely replaces each pretrained MLP in their model with a deep low-rank residual network (DLR-Net) of comparable parameter count, forcing activation memory that grows linearly with depth during backpropagation. DLR-Nets are efficiently trained via module-wise distillation. We show that, beyond this memory overhead, DLR-Lock results in architectural mismatches that complicate the optimization landscape of standard fine-tuning, and a backward pass that incurs disproportionately more overhead than the forward pass. Our defense succeeds in withstanding adaptive attackers with full knowledge of the defense strategy while preserving the original model's capabilities. Experiments on LLM validate these claims.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes DLR-Lock, a defense for open-weight LLMs that replaces each pretrained MLP with a deep low-rank residual network (DLR-Net) of comparable parameter count. DLR-Nets are trained via module-wise distillation to preserve original capabilities. The method exploits automatic differentiation asymmetry to induce linear activation memory growth with DLR-Net depth during backpropagation, plus architectural mismatches and disproportionate backward-pass overhead, with the goal of deterring fine-tuning by adaptive attackers who know the full defense structure.
Significance. If the empirical claims hold, the work introduces a novel defense axis based on inference-training asymmetry in autodiff, offering a practical mechanism to lock models against unauthorized adaptation without capability loss. This could meaningfully affect open-weight model sharing practices by raising the cost of fine-tuning for attackers. The module-wise distillation approach for efficient training of the residuals is a constructive element that supports reproducibility of the locking procedure.
major comments (2)
- [Abstract] The central claim that the defense 'succeeds in withstanding adaptive attackers with full knowledge of the defense strategy' is load-bearing yet unsupported by any description of the adaptive attack protocol, attacker optimizer adaptations, memory/throughput measurements under full-knowledge conditions, or ablation on whether selective checkpointing on residuals was permitted. This directly affects assessment of whether the linear memory growth remains prohibitive.
- [Experiments section] No quantitative results, tables, or figures are referenced that report activation memory scaling, fine-tuning throughput degradation, or success rates against full-knowledge adaptive attacks; without these, the assertion that architectural mismatch complicates the optimization landscape cannot be evaluated for the claimed deterrence effect.
minor comments (1)
- [Abstract] The phrase 'Experiments on LLM validate these claims' should specify the exact models, datasets, and metrics used to allow readers to assess the capability-preservation results.
Simulated Author's Rebuttal
We thank the referee for their constructive feedback and for recognizing the potential of the inference-training asymmetry as a defense axis, as well as the utility of module-wise distillation. We address the two major comments below and will revise the manuscript to provide the requested details and quantitative support.
read point-by-point responses
- Referee: [Abstract] The central claim that the defense 'succeeds in withstanding adaptive attackers with full knowledge of the defense strategy' is load-bearing yet unsupported by any description of the adaptive attack protocol, attacker optimizer adaptations, memory/throughput measurements under full-knowledge conditions, or ablation on whether selective checkpointing on residuals was permitted. This directly affects assessment of whether the linear memory growth remains prohibitive.
Authors: We agree that the current manuscript does not provide a sufficiently explicit description of the adaptive attack protocol or the associated measurements. In the revision we will add a new subsection (Experiments, Adaptive Attacks) that details the full-knowledge attacker assumptions, the optimizer adaptations tested (standard AdamW with and without gradient accumulation), the measured activation memory growth and fine-tuning throughput under these conditions, and an ablation on selective checkpointing of the residual blocks. These additions will directly support the claim that linear memory growth remains prohibitive. revision: yes
- Referee: [Experiments section] No quantitative results, tables, or figures are referenced that report activation memory scaling, fine-tuning throughput degradation, or success rates against full-knowledge adaptive attacks; without these, the assertion that architectural mismatch complicates the optimization landscape cannot be evaluated for the claimed deterrence effect.
Authors: We acknowledge that the present draft references capability-preservation results but does not include the requested quantitative tables or figures for memory scaling, throughput degradation, or adaptive-attack success rates. We will expand the Experiments section with new tables and figures that report (i) activation memory versus DLR-Net depth, (ii) fine-tuning throughput under the locked architecture, and (iii) attack success rates together with convergence behavior illustrating the optimization complications induced by the architectural mismatch. These additions will allow direct evaluation of the deterrence effect. revision: yes
Circularity Check
No circularity: empirical method with independent experimental validation
full rationale
The paper presents DLR-Lock as an empirical construction: pretrained MLPs are replaced by DLR-Nets of comparable parameter count, trained module-wise by distillation, to induce linear activation-memory growth in backpropagation plus architectural mismatch. The central claim, that this withstands full-knowledge adaptive attackers while preserving capabilities, is supported by experiments on LLMs rather than any closed derivation. No equations, fitted parameters, or self-citations are shown that reduce the defense's success to a definitional identity or an input-by-construction prediction. The method is checked against external evidence (memory measurements, fine-tuning attempts) and does not invoke uniqueness theorems or ansatzes from prior author work as load-bearing justification.
Axiom & Free-Parameter Ledger
free parameters (1)
- depth and low-rank dimension of each DLR-Net
axioms (1)
- [standard math] Automatic differentiation incurs memory linear in the depth of the computation graph during backpropagation
invented entities (1)
- DLR-Net: no independent evidence