Recognition: no theorem link
Locking Pretrained Weights via Deep Low-Rank Residual Distillation
Pith reviewed 2026-05-12 04:22 UTC · model grok-4.3
The pith
Replacing each MLP with a deep low-rank residual network locks pretrained LLM weights against fine-tuning.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Replacing each pretrained multilayer perceptron with a deep low-rank residual network of comparable size, trained via module-wise distillation, yields a model whose backward pass incurs activation-memory costs that grow linearly with the replacement network's depth, along with architectural mismatches that complicate standard fine-tuning optimization; this defense withstands adaptive attackers who know the full DLR-Net structure while leaving inference and original performance unchanged.
What carries the argument
The deep low-rank residual network (DLR-Net), which substitutes for each pretrained MLP and exploits the inference-training asymmetry of automatic differentiation to raise backward-pass memory costs.
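The abstract does not pin down the block design, so the following is only a minimal PyTorch sketch of what a deep low-rank residual replacement for one MLP could look like; the block structure, activation function, and parameter-matching arithmetic are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical DLR-Net sketch: a deep stack of low-rank residual blocks that
# replaces one pretrained MLP. Illustrative only; not the paper's code.
import torch
import torch.nn as nn

class LowRankResidualBlock(nn.Module):
    def __init__(self, d_model: int, rank: int):
        super().__init__()
        self.down = nn.Linear(d_model, rank, bias=False)  # d_model -> rank
        self.up = nn.Linear(rank, d_model, bias=False)    # rank -> d_model
        self.act = nn.SiLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Each block's input must be kept for the backward pass, so stored
        # activations grow linearly with the number of blocks.
        return x + self.up(self.act(self.down(x)))

class DLRNet(nn.Module):
    """Deep low-rank residual network substituted for one MLP."""
    def __init__(self, d_model: int, rank: int, depth: int):
        super().__init__()
        self.blocks = nn.ModuleList(LowRankResidualBlock(d_model, rank) for _ in range(depth))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for block in self.blocks:
            x = block(x)
        return x

# Parameter matching (illustrative): a SwiGLU MLP with d_model=4096, d_ff=14336
# has about 3 * 4096 * 14336 ≈ 176M parameters; depth * 2 * d_model * rank is
# comparable at, e.g., rank=64 and depth ≈ 336 blocks.
```

Inference remains a chain of cheap low-rank matrix multiplications, but every intermediate residual state becomes a stored activation once gradients are required.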
If this is right
- Adaptive attackers with complete knowledge of the DLR-Net still face activation-memory costs that grow linearly with its depth during any backpropagation-based fine-tuning.
- The locked model retains the original forward-pass speed and task performance for inference use.
- Standard fine-tuning pipelines encounter optimization difficulties due to the changed residual structure and activation storage requirements.
- Module-wise distillation enables the locked model to be created efficiently without retraining the entire network from scratch.
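A minimal sketch of how that module-wise distillation could be run, assuming access to the frozen original MLP and a stream of hidden states captured at that layer's input; `original_mlp` and `hidden_state_batches` are placeholder names, not the paper's API.

```python
# Module-wise distillation sketch: train one DLR-Net to mimic one frozen MLP
# on hidden states drawn from the pretrained model. Illustrative assumptions:
# an MSE objective and AdamW; the paper may use different losses or schedules.
import torch
import torch.nn.functional as F

def distill_module(original_mlp, dlr_net, hidden_state_batches, steps=1000, lr=1e-3):
    original_mlp.eval()
    opt = torch.optim.AdamW(dlr_net.parameters(), lr=lr)
    for _, h in zip(range(steps), hidden_state_batches):
        with torch.no_grad():
            target = original_mlp(h)            # teacher output for this module
        loss = F.mse_loss(dlr_net(h), target)   # match the MLP's input-output map
        opt.zero_grad()
        loss.backward()
        opt.step()
    return dlr_net
```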
Where Pith is reading between the lines
- The same replacement strategy could be tested on attention layers or other components to extend the lock beyond MLPs.
- Attackers might respond with gradient-free or memory-efficient optimizers that avoid full backpropagation, though at higher computational cost (see the forward-only sketch after this list).
- Widespread use could shift open-weight sharing toward locked checkpoints that require special tools for any adaptation.
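One such forward-only route is a generic SPSA/MeZO-style zeroth-order update, sketched below, which stores no activations at all; this illustrates the attack class rather than anything in the paper, and in practice it converges far more slowly than gradient-based fine-tuning.

```python
# Zeroth-order (forward-only) update sketch: estimate a directional derivative
# from two perturbed forward passes, so no backward pass or activation storage
# is needed. Generic SPSA/MeZO-style illustration; loss_fn(model, batch) is a
# placeholder returning a scalar loss.
import torch

@torch.no_grad()
def zeroth_order_step(model, loss_fn, batch, lr=1e-6, eps=1e-3, seed=0):
    gen = torch.Generator(device="cpu").manual_seed(seed)
    params = [p for p in model.parameters() if p.requires_grad]
    # Keep the noise so we can restore weights; MeZO instead re-draws it from
    # the seed to avoid even this extra parameter-sized copy.
    noises = [torch.randn(p.shape, generator=gen).to(p.device) for p in params]
    for p, z in zip(params, noises):
        p.add_(eps * z)                      # theta + eps*z
    loss_plus = loss_fn(model, batch)
    for p, z in zip(params, noises):
        p.add_(-2 * eps * z)                 # theta - eps*z
    loss_minus = loss_fn(model, batch)
    grad_scale = (loss_plus - loss_minus) / (2 * eps)
    for p, z in zip(params, noises):
        p.add_(eps * z)                      # restore theta
        p.add_(-lr * grad_scale * z)         # projected-gradient step along z
    return (loss_plus + loss_minus) / 2
```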
Load-bearing premise
That the added memory overhead and architectural mismatch during backpropagation will remain prohibitive for adaptive attackers even after they know the exact DLR-Net structure and can optimize around it.
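The most direct way an attacker could "optimize around it" is selective activation checkpointing over the residual stack, trading recomputation for memory; the premise is that the cost remains prohibitive even then. A generic PyTorch sketch, reusing the hypothetical DLRNet above:

```python
# Selective checkpointing sketch: split the residual stack into ~sqrt(depth)
# segments so only segment boundaries are stored and segment interiors are
# recomputed during the backward pass. Generic PyTorch; the sqrt split is the
# classic memory-versus-recompute trade-off.
import math
import torch.nn as nn
from torch.utils.checkpoint import checkpoint_sequential

def checkpointed_dlr_forward(dlr_net, x):
    blocks = nn.Sequential(*dlr_net.blocks)
    segments = max(1, math.isqrt(len(dlr_net.blocks)))
    return checkpoint_sequential(blocks, segments, x, use_reentrant=False)
```

Whether this, combined with offloading or micro-batching, is enough to make fine-tuning practical again is exactly what the adaptive-attack evaluation has to rule out.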
What would settle it
An experiment in which an adaptive attacker, given the full DLR-Net architecture and weights, successfully fine-tunes the model to reach performance levels comparable to fine-tuning the original unlocked model without hitting memory limits or failing to converge.
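The memory side of such an experiment is straightforward to instrument; a sketch, assuming the hypothetical DLRNet above, a CUDA device, and illustrative sizes:

```python
# Measure peak GPU memory for one forward+backward through a DLR-Net as depth
# grows. Sizes are illustrative; the full experiment would pair these curves
# with the fine-tuning quality the adaptive attacker actually reaches.
import torch

def peak_backward_memory_mb(d_model=2048, rank=64, depth=128, tokens=1024):
    torch.cuda.empty_cache()
    torch.cuda.reset_peak_memory_stats()
    net = DLRNet(d_model, rank, depth).cuda()
    x = torch.randn(tokens, d_model, device="cuda", requires_grad=True)
    net(x).sum().backward()
    return torch.cuda.max_memory_allocated() / 2**20

for depth in (32, 64, 128, 256):
    print(depth, round(peak_backward_memory_mb(depth=depth)))
# Expected pattern if the claim holds: roughly linear growth of peak memory
# with depth, well beyond the cost of the original single MLP.
```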
read the original abstract
The quality of open-weight language models has dramatically improved in recent years. Sharing weights greatly facilitates model adoption by enabling their use across diverse hardware and software platforms. They also allow for more open research and testing, to the extent that users can use them as checkpoints, fine-tune them according to their needs, and potentially redistribute them. In some cases, however, concerns on modifying these weights towards unauthorized uses may outweigh the pros of giving users such a freedom. Defending against such adaptation is non-trivial: since an adaptive attacker can observe all weights and architectures by definition, they can reverse simple structural defenses, and use optimization to defeat the simplest locking mechanisms. In this work, we exploit the inference-training asymmetry of automatic differentiation as a novel defense axis. We propose DLR-Lock, a method where the purveyor of the model purposely replaces each pretrained MLP in their model with a deep low-rank residual network (DLR-Net) of comparable parameter count, forcing activation memory that grows linearly with depth during backpropagation. DLR-Nets are efficiently trained via module-wise distillation. We show that, beyond this memory overhead, DLR-Lock results in architectural mismatches that complicate the optimization landscape of standard fine-tuning, and a backward pass that incurs disproportionately more overhead than the forward pass. Our defense succeeds in withstanding adaptive attackers with full knowledge of the defense strategy while preserving the original model's capabilities. Experiments on LLM validate these claims.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes DLR-Lock, a defense for open-weight LLMs that replaces each pretrained MLP with a deep low-rank residual network (DLR-Net) of comparable parameter count. DLR-Nets are trained via module-wise distillation to preserve original capabilities. The method exploits automatic differentiation asymmetry to induce linear activation memory growth with DLR-Net depth during backpropagation, plus architectural mismatches and disproportionate backward-pass overhead, with the goal of deterring fine-tuning by adaptive attackers who know the full defense structure.
Significance. If the empirical claims hold, the work introduces a novel defense axis based on inference-training asymmetry in autodiff, offering a practical mechanism to lock models against unauthorized adaptation without capability loss. This could meaningfully affect open-weight model sharing practices by raising the cost of fine-tuning for attackers. The module-wise distillation approach for efficient training of the residuals is a constructive element that supports reproducibility of the locking procedure.
major comments (2)
- [Abstract] The central claim that the defense 'succeeds in withstanding adaptive attackers with full knowledge of the defense strategy' is load-bearing yet unsupported by any description of the adaptive attack protocol, attacker optimizer adaptations, memory/throughput measurements under full-knowledge conditions, or ablation on whether selective checkpointing on residuals was permitted. This directly affects assessment of whether the linear memory growth remains prohibitive.
- [Experiments section] No quantitative results, tables, or figures are referenced that report activation memory scaling, fine-tuning throughput degradation, or success rates against full-knowledge adaptive attacks; without these, the assertion that architectural mismatch complicates the optimization landscape cannot be evaluated for the claimed deterrence effect.
minor comments (1)
- [Abstract] The phrase 'Experiments on LLM validate these claims' should specify the exact models, datasets, and metrics used to allow readers to assess the capability-preservation results.
Simulated Author's Rebuttal
We thank the referee for their constructive feedback and for recognizing the potential of the inference-training asymmetry as a defense axis, as well as the utility of module-wise distillation. We address the two major comments below and will revise the manuscript to provide the requested details and quantitative support.
read point-by-point responses
- Referee: [Abstract] The central claim that the defense 'succeeds in withstanding adaptive attackers with full knowledge of the defense strategy' is load-bearing yet unsupported by any description of the adaptive attack protocol, attacker optimizer adaptations, memory/throughput measurements under full-knowledge conditions, or ablation on whether selective checkpointing on residuals was permitted. This directly affects assessment of whether the linear memory growth remains prohibitive.
Authors: We agree that the current manuscript does not provide a sufficiently explicit description of the adaptive attack protocol or the associated measurements. In the revision we will add a new subsection (Experiments, Adaptive Attacks) that details the full-knowledge attacker assumptions, the optimizer adaptations tested (standard AdamW with and without gradient accumulation), the measured activation memory growth and fine-tuning throughput under these conditions, and an ablation on selective checkpointing of the residual blocks. These additions will directly support the claim that linear memory growth remains prohibitive. revision: yes
- Referee: [Experiments section] No quantitative results, tables, or figures are referenced that report activation memory scaling, fine-tuning throughput degradation, or success rates against full-knowledge adaptive attacks; without these, the assertion that architectural mismatch complicates the optimization landscape cannot be evaluated for the claimed deterrence effect.
Authors: We acknowledge that the present draft references capability-preservation results but does not include the requested quantitative tables or figures for memory scaling, throughput degradation, or adaptive-attack success rates. We will expand the Experiments section with new tables and figures that report (i) activation memory versus DLR-Net depth, (ii) fine-tuning throughput under the locked architecture, and (iii) attack success rates together with convergence behavior illustrating the optimization complications induced by the architectural mismatch. These additions will allow direct evaluation of the deterrence effect. revision: yes
Circularity Check
No circularity: empirical method with independent experimental validation
full rationale
The paper presents DLR-Lock as an empirical construction: pretrained MLPs are replaced by DLR-Nets of comparable parameter count, trained module-wise by distillation, to induce linear activation-memory growth in backpropagation plus architectural mismatch. The central claim, that this withstands full-knowledge adaptive attackers while preserving capabilities, is supported by experiments on LLMs rather than any closed derivation. No equations, fitted parameters, or self-citations are shown that reduce the defense's success to a definitional identity or an input-by-construction prediction. The method is checked against external evidence (memory measurements, fine-tuning attempts) and does not invoke uniqueness theorems or ansatzes from prior author work as load-bearing justification.
Axiom & Free-Parameter Ledger
free parameters (1)
- depth and low-rank dimension of each DLR-Net
axioms (1)
- [standard math] Automatic differentiation incurs memory linear in the depth of the computation graph during backpropagation
invented entities (1)
- DLR-Net: no independent evidence