Motion-Compensated Weight Compression

Ismail Lamaakal

arXiv: 2605.24754 · v1 · pith:NARMOWT4new · submitted 2026-05-23 · cs.CV · cs.AI · cs.LG

Pith reviewed 2026-06-30 12:55 UTC · model grok-4.3

keywords weight compression · neural network compression · Transformer models · entropy coding · permutation symmetry · layer alignment · motion compensation · rate-distortion optimization

The pith

Aligning permutation-symmetric blocks across layers turns model depth into a predictable sequence for more efficient weight compression.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Motion-Compensated Weight Compression to address the storage bottleneck of neural network weights by exploiting cross-layer redundancies that arise from function-preserving symmetries. It aligns blocks such as hidden units and attention heads so that a lightweight layer-sequential predictor can operate on the resulting sequence, encoding only quantized prediction residuals with a learned entropy model under a rate-distortion objective. Periodic keyframes allow the decoder to reconstruct deployable weights through entropy decoding, dequantization, prediction, and inverse alignment. Experiments on Transformer language models and vision classifiers show improved rate-accuracy tradeoffs relative to independent quantization and other learned codecs, with competitive decode times. Ablations establish that alignment, prediction, entropy modeling, and keyframe scheduling are each required for the observed gains.

Core claim

Motion-Compensated Weight Compression aligns permutation-symmetric blocks across layers to maximize cross-layer correspondence, turns depth into a predictable sequence, applies a lightweight layer-sequential predictor with periodic keyframes, and encodes only the quantized prediction residuals using a learned entropy model trained under a rate-distortion objective. The decoder reconstructs the weights by entropy decoding, dequantization, predictor-driven reconstruction, and inverse alignment, producing weights ready for inference without retraining.
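The encode and decode loop described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's implementation: the predictor is simply "the previously decoded layer", keyframes are predicted from zero, quantization is uniform with a fixed `step`, and both alignment and entropy coding are left out. The names `encode`, `decode`, `step`, and `keyframe_every` are hypothetical.

```python
def quantize(values, step):
    """Uniform scalar quantization to integer symbols."""
    return [round(v / step) for v in values]


def dequantize(symbols, step):
    return [s * step for s in symbols]


def encode(layers, step, keyframe_every):
    """Return one (is_keyframe, symbols) pair per layer.

    Assumption: the predictor is the previously *decoded* layer, so the
    encoder runs the same reconstruction the decoder will (local decode loop).
    """
    stream, prev = [], None
    for i, weights in enumerate(layers):
        is_key = i % keyframe_every == 0
        pred = [0.0] * len(weights) if is_key else prev
        symbols = quantize([w - p for w, p in zip(weights, pred)], step)
        stream.append((is_key, symbols))
        prev = [p + r for p, r in zip(pred, dequantize(symbols, step))]
    return stream


def decode(stream, step):
    """Rebuild every layer from keyframes and quantized residuals."""
    layers, prev = [], None
    for is_key, symbols in stream:
        pred = [0.0] * len(symbols) if is_key else prev
        prev = [p + r for p, r in zip(pred, dequantize(symbols, step))]
        layers.append(prev)
    return layers


# Toy "aligned" layers that drift slowly with depth (made-up numbers).
layers = [
    [0.10, -0.32, 0.51],
    [0.12, -0.30, 0.49],
    [0.15, -0.28, 0.47],
    [0.90, 0.10, -0.40],
]
step = 0.05
stream = encode(layers, step, keyframe_every=3)
decoded = decode(stream, step)
max_error = max(abs(a - b) for x, y in zip(layers, decoded) for a, b in zip(x, y))
```

Because each residual is taken against the decoded prediction rather than the original previous layer, quantization error does not accumulate with depth. In this toy setup the per-weight error stays within half a quantization step.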

What carries the argument

Motion-Compensated Weight Compression (MCWC), which performs cross-layer alignment of permutation-symmetric blocks followed by layer-sequential residual prediction and entropy coding of the residuals.

Load-bearing premise

Permutation-symmetric blocks can be aligned reliably across layers to produce a sequence predictable enough for a lightweight predictor to outperform independent layer compression.

What would settle it

A controlled experiment that applies the full pipeline but replaces the learned alignment step with random or layer-independent permutations and measures whether the rate-accuracy Pareto gains over baselines disappear.
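A toy version of this control can be prototyped before touching a real model. The sketch below compares the cross-layer residual energy under a matched ordering with the energy under an unmatched one. It assumes exhaustive search over permutations as a stand-in for the paper's alignment step (which this view does not describe) and uses the identity ordering as the "no alignment" arm. All names and numbers are hypothetical.

```python
from itertools import permutations


def residual_energy(layer_a, layer_b, order):
    """Sum of squared differences between row i of A and row order[i] of B."""
    return sum(
        (a - b) ** 2
        for i, j in enumerate(order)
        for a, b in zip(layer_a[i], layer_b[j])
    )


def best_alignment(layer_a, layer_b):
    """Exhaustive search; only feasible for a handful of units."""
    return min(
        permutations(range(len(layer_b))),
        key=lambda order: residual_energy(layer_a, layer_b, order),
    )


# Layer B holds the same units as layer A, shuffled and slightly perturbed.
layer_a = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.5], [0.3, -0.8]]
shuffle = [2, 0, 3, 1]  # row j of B is a noisy copy of row shuffle[j] of A
layer_b = [[v + 0.01 for v in layer_a[src]] for src in shuffle]

aligned = best_alignment(layer_a, layer_b)
identity = tuple(range(len(layer_b)))
energy_aligned = residual_energy(layer_a, layer_b, aligned)
energy_unaligned = residual_energy(layer_a, layer_b, identity)
```

If alignment carries the gains, the aligned energy should sit far below the unaligned one, as it does on this constructed example. On real checkpoints the gap, and whether it translates into rate savings, is exactly what the control would need to measure.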

Figures

Figures reproduced from arXiv: 2605.24754 by Ismail Lamaakal.

**Figure 1.** Layers-as-video overview of MCWC. The encoder transforms depth into a predictable sequence by aligning functionally equivalent blocks via permutation (Π). Keyframes (e.g., $\bar{L}_1$) are coded absolutely, while subsequent P-frames are encoded as quantized residuals relative to predictions derived from previously decoded context (shown as the local decode loop). The decoder reconstructs aligned layers by addin…

**Figure 2.** Residual statistics before and after functional …

**Figure 3.** Rate–distortion curves for (a) language modeling and (b) vision …

**Figure 4.** Normalized residual energy before and after functional alignment. Alignment consistently …

**Figure 5.** Residual magnitude distributions for representative variants. Alignment shifts residuals …

**Figure 6.** Architectural scope of MCWC. The method is most directly applicable to repeated …

Original abstract

Neural network weights are increasingly a bottleneck for deployment, yet most compression pipelines treat layers independently and overlook cross-layer redundancy induced by function-preserving symmetries. We propose Motion-Compensated Weight Compression (MCWC), a weight-only codec that aligns permutation-symmetric blocks (e.g., hidden units and attention heads) to maximize cross-layer correspondence, turning depth into a predictable sequence. In the aligned coordinate system, MCWC uses a lightweight layer-sequential predictor with periodic keyframes and encodes only quantized prediction residuals using a learned entropy model trained under a rate-distortion objective. A simple decoder reconstructs deployable weights by entropy decoding, dequantization, predictor-driven reconstruction, and inverse alignment, enabling fast weight materialization for inference. Across Transformer language modeling and vision classification, MCWC improves the rate-accuracy Pareto frontier over strong quantization and learned weight-codec baselines, while maintaining competitive decode time. Ablations confirm that alignment, prediction, entropy modeling, and keyframe scheduling are each necessary for the full gains. Our code is available at https://github.com/Ism-ail11/MCWC.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

MCWC aligns permutation-symmetric blocks across layers then applies a lightweight sequential predictor plus residual entropy coding to exploit cross-layer redundancy in weight compression.

The main takeaway is that this paper turns cross-layer permutation symmetries into a predictable sequence and compresses the aligned weights the way video codecs compress frames: predict the next layer, code only the quantized residual with a learned entropy model, and insert occasional keyframes. The pipeline is alignment, layer-sequential prediction, residual quantization, and entropy coding, followed by a simple inverse decoder that restores deployable weights.

What the work does cleanly is release code and run ablations that isolate each component. The abstract states that removing alignment, the predictor, the entropy model, or the keyframe schedule each hurts the rate-accuracy curve. That is concrete and reproducible. The claimed result is an improved Pareto frontier versus strong per-layer quantization and prior learned weight codecs on both language-modeling and vision-classification transformers, with competitive decode time.

The soft spot is that the alignment step is doing the heavy lifting, and we only have the abstract's word that it works reliably. If the permutation matching is brittle on new architectures or requires per-model tuning, the whole motion-compensation framing weakens. The quantitative gains, dataset details, and error bars are not visible here, so the size of the practical improvement is still unclear.

This is for engineers who need smaller stored weights for deployment rather than for theorists. A reader already working on model compression or efficient inference would get immediate value from the method description and the open repository.

I would send it to peer review. The approach is well-specified, the code is public, and the empirical claims are falsifiable. A referee can check whether the alignment holds up and whether the reported gains survive closer inspection of the tables.

Referee Report

0 major / 3 minor

Summary. The manuscript introduces Motion-Compensated Weight Compression (MCWC), a weight-only codec that aligns permutation-symmetric blocks (hidden units, attention heads) across layers to convert depth into a low-entropy sequence. A lightweight layer-sequential predictor with periodic keyframes then encodes only quantized residuals under a learned entropy model trained with a rate-distortion objective. The decoder performs entropy decoding, dequantization, predictor reconstruction, and inverse alignment to recover deployable weights. Experiments on Transformer language models and vision classifiers report improved rate-accuracy Pareto frontiers versus strong quantization and learned weight-codec baselines, with competitive decode latency; ablations confirm each component (alignment, prediction, entropy modeling, keyframing) is necessary. Code is released.

Significance. If the reported Pareto gains and ablation results hold under full scrutiny, the work provides a practical advance in exploiting function-preserving symmetries for neural weight compression, addressing a key deployment bottleneck for large Transformers. The combination of alignment, sequential prediction, and fast decoding distinguishes it from layer-independent codecs, and the open code release supports reproducibility.

minor comments (3)

[Abstract] The claim of Pareto improvement would be strengthened by one or two concrete numbers (e.g., bits-per-parameter reduction at iso-accuracy, or accuracy delta at fixed rate) rather than qualitative language alone.
[Method] The description of the alignment procedure and its inverse would benefit from an explicit equation or short pseudocode block showing how permutation matrices are computed and applied across layers.
[Experiments] Figure captions and axis labels in the rate-accuracy plots should explicitly state the datasets, model sizes, and baseline methods used so readers can immediately interpret the curves.
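As a rough illustration of the kind of concrete rate figure the first comment asks for, the sketch below estimates bits per weight from the zeroth-order empirical entropy of the quantized symbols. It is an assumption-laden stand-in: the paper uses a learned entropy model, residuals here are taken against the original previous layer rather than a decoded one, and the weights are made up. The helper names are hypothetical.

```python
from collections import Counter
from math import log2


def empirical_entropy_bits(symbols):
    """Zeroth-order entropy in bits per symbol."""
    counts = Counter(symbols)
    total = len(symbols)
    return -sum(c / total * log2(c / total) for c in counts.values())


def to_symbols(values, step):
    return [round(v / step) for v in values]


step = 0.05
# Made-up aligned layers in which every weight drifts by one step per layer.
layers = [
    [0.10, -0.30, 0.50, 0.70],
    [0.15, -0.25, 0.55, 0.75],
    [0.20, -0.20, 0.60, 0.80],
]

# Baseline: quantize every layer independently.
independent = [s for layer in layers for s in to_symbols(layer, step)]

# Predictive: first layer as a keyframe, later layers as residuals
# against the previous layer.
predictive = to_symbols(layers[0], step)
for prev, cur in zip(layers, layers[1:]):
    predictive += to_symbols([c - p for c, p in zip(cur, prev)], step)

bits_independent = empirical_entropy_bits(independent)
bits_predictive = empirical_entropy_bits(predictive)
```

Both streams contain the same number of symbols, so comparing `bits_predictive` with `bits_independent` gives a per-weight rate comparison at identical quantization step. A real report would pair such a number with the accuracy measured on the decoded weights.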

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive assessment of MCWC and the recommendation for minor revision. No major comments were raised in the report.

Circularity Check

0 steps flagged

No significant circularity

The paper describes an empirical weight codec that aligns permutation-symmetric blocks, applies a layer-sequential predictor with keyframes, and encodes quantized residuals under a rate-distortion objective. No derivation chain, equations, or claims are presented that reduce reported gains to quantities defined by the method's own fitted parameters or to self-citations. Ablations are cited as confirming component necessity, and results are shown via external benchmarks on Transformers. This is a standard empirical engineering contribution with independent experimental content.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only view supplies insufficient detail to enumerate free parameters, axioms, or invented entities; the learned entropy model and alignment procedure likely involve fitted components, but none can be identified from the given text.

pith-pipeline@v0.9.1-grok · 5709 in / 1159 out tokens · 20268 ms · 2026-06-30T12:55:56.674262+00:00 · methodology

Reference graph

Works this paper leans on

[1] Joshua Ainslie, James Lee-Thorp, Michiel de Jong, Yury Zemlyanskiy, Federico Lebrón, and Sumit Sanghai. GQA: Training generalized multi-query transformer models from multi-head checkpoints. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 4895–4901. Association for Computational Linguistics, 2023.

[2] Samuel K. Ainsworth, Jonathan Hayase, and Siddhartha Srinivasa. Git re-basin: Merging models modulo permutation symmetries. arXiv preprint arXiv:2209.04836, 2022. URL https://arxiv.org/abs/2209.04836

[3] Johannes Ballé, David Minnen, Saurabh Singh, Sung Jin Hwang, and Nick Johnston. Variational image compression with a scale hyperprior. In International Conference on Learning Representations, 2018. URL https://arxiv.org/abs/1802.01436

[4] Stella Biderman, Hailey Schoelkopf, Quentin Anthony, Herbie Bradley, Kyle O’Brien, Eric Hallahan, Mohammad Aflah Khan, Shivanshu Purohit, USVSN Sai Prashanth, Edward Raff, Aviya Skowron, Lintang Sutawika, and Oskar van der Wal. Pythia: A suite for analyzing large language models across training and scaling. arXiv preprint arXiv:2304.01373, 2023. URL https:...

[5] Yoojin Choi, Mostafa El-Khamy, and Jungwon Lee. Universal deep neural network compression. URL https://arxiv.org/abs/1802.02271

[6] Tim Dettmers, Mike Lewis, Sam Shleifer, and Luke Zettlemoyer. LLM.int8(): 8-bit matrix multiplication for transformers at scale, 2022. URL https://arxiv.org/abs/2208.07339

[7] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 4171–4186, 2019.

[8] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations (ICLR), 2021. URL...

[9] URL https://arxiv.org/abs/2110.06296

[10] Steven K. Esser, Jeffrey L. McKinstry, Deepika Bablani, Rathinakumar Appuswamy, and Dharmendra S. Modha. Learned step size quantization. arXiv preprint arXiv:1902.08153, 2019. URL https://arxiv.org/abs/1902.08153

[11] William Fedus, Barret Zoph, and Noam Shazeer. Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity. Journal of Machine Learning Research, 23(120):1–39, 2022.

[12] Elias Frantar and Dan Alistarh. SparseGPT: Massive language models can be accurately pruned in one-shot. arXiv preprint arXiv:2301.00774, 2023. URL https://arxiv.org/abs/2301.00774

[13] Elias Frantar, Sidak Pal Singh, and Dan Alistarh. Optimal brain compression: A framework for accurate post-training quantization and pruning. In Advances in Neural Information Processing Systems (NeurIPS), 2022. URL https://arxiv.org/abs/2208.11580

[14] Elias Frantar, Saleh Ashkboos, Torsten Hoefler, and Dan Alistarh. GPTQ: Accurate post-training quantization for generative pre-trained transformers. In International Conference on Learning Representations (ICLR), 2023. URL https://arxiv.org/abs/2210.17323

[15] Yao Fu, Leyang Xue, Yeqi Huang, Andrei-Octavian Brabete, Dmitrii Ustiugov, Yuvraj Patel, and Luo Mai. ServerlessLLM: Low-latency serverless inference for large language models, 2024. URL https://arxiv.org/abs/2401.14351

[16] Song Han, Huizi Mao, and William J. Dally. Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding. In International Conference on Learning Representations, 2016. URL https://arxiv.org/abs/1510.00149

[17] Jiajun He, Gergely Flamich, and José Miguel Hernández-Lobato. Getting free bits back from rotational symmetries in LLMs, 2024. URL https://arxiv.org/abs/2410.01309

[18] Dan Hendrycks and Thomas Dietterich. Benchmarking neural network robustness to common corruptions and perturbations. In International Conference on Learning Representations (ICLR). URL https://arxiv.org/abs/1903.12261

[19] Dan Hendrycks, Kevin Zhao, Steven Basart, Jacob Steinhardt, and Dawn Song. Natural adversarial examples. In IEEE/CVF International Conference on Computer Vision (ICCV). URL https://arxiv.org/abs/1907.07174

[20] Dan Hendrycks, Steven Basart, Norman Mu, Saurav Kadavath, Frank Wang, Evan Dorundo, Rahul Desai, Tyler Zhu, Samyak Parajuli, Mike Guo, Dawn Song, Jacob Steinhardt, and Justin Gilmer. The many faces of robustness: A critical analysis of out-of-distribution generalization. In IEEE/CVF International Conference on Computer Vision (ICCV), 2021. URL https://arx...

[21] Benoit Jacob, Skirmantas Kligys, Bo Chen, Menglong Zhu, Matthew Tang, Andrew Howard, Hartwig Adam, and Dmitry Kalenichenko. Quantization and training of neural networks for efficient integer-arithmetic-only inference. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018. URL https://arxiv.org/abs/1712.05877

[22] Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, Lélio Renard Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, and William El Sayed. Mistral 7b. arXiv preprint arXiv:23...

[23] Albert Q. Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Emma Bou Hanna, Florian Bressand, Gianna Lengyel, Guillaume Bour, Guillaume Lample, Lélio Renard Lavaud, Lucile Saulnier, Marie-Anne Lachaux, Pierre Stock, Sandeep Subramanian, Sophia Yang, Szymon Antoniak, Teven...

[24] Alex Krizhevsky. Learning multiple layers of features from tiny images. Technical report, University of Toronto, 2009. URL https://www.cs.toronto.edu/~kriz/cifar.html

[25] Ya Le and Xuan Yang. Tiny ImageNet visual recognition challenge, 2015. URL https://cs231n.stanford.edu/reports/2015/pdfs/yle_project.pdf

[26] Kang-Ho Lee and Sung-Ho Bae. Compressing neural networks with inter prediction and linear transformation. IEEE Access, 9:69601–69608, 2021.

[27] Yuhang Li, Ruihao Gong, Xu Tan, Yang Yang, Peng Hu, Qi Zhang, Fengwei Yu, Wei Wang, and Shi Gu. BRECQ: Pushing the limit of post-training quantization by block reconstruction. URL https://arxiv.org/abs/2102.05426

[28] Ji Lin, Jiaming Tang, Haotian Tang, Shang Yang, Xingyu Dang, and Song Han. AWQ: Activation-aware weight quantization for LLM compression and acceleration. arXiv preprint arXiv:2306.00978, 2023. URL https://arxiv.org/abs/2306.00978

[29] Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692, 2019.

[30] Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 10012–10022, 2021.

[31] Zhuang Liu, Hanzi Mao, Chao-Yuan Wu, Christoph Feichtenhofer, Trevor Darrell, and Saining Xie. A convnet for the 2020s. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11976–11986, 2022.

[32] Guo Lu, Wanli Ouyang, Dong Xu, Xiaoyun Zhang, Chunlei Cai, and Zhiyong Gao. DVC: An end-to-end deep video compression framework, 2019. URL https://arxiv.org/abs/1812.00101

[33] Denis Makhov, Dmitriy Shopkhoev, Magauiya Zhussip, Ammar Ali, and Stamatios Lefkimmiatis. CoSpaDi: Compressing LLMs via calibration-guided sparse dictionary learning, 2026. URL https://arxiv.org/abs/2509.22075

[34] Mitchell P. Marcus, Beatrice Santorini, and Mary Ann Marcinkiewicz. Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19(2):313–330. URL https://aclanthology.org/J93-2004/

[35] Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. Pointer sentinel mixture models, 2016. URL https://arxiv.org/abs/1609.07843

[36] Tomáš Mikolov, Martin Karafiát, Lukáš Burget, Jan Černocký, and Sanjeev Khudanpur. Recurrent neural network based language model. In Proc. Interspeech, 2010. URL https://www.isca-archive.org/interspeech_2010/mikolov10_interspeech.html

[37] David Minnen, Johannes Ballé, and George Toderici. Joint autoregressive and hierarchical priors for learned image compression, 2018. URL https://arxiv.org/abs/1809.02736

[38] Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul F. Christiano, Jan Leike, and Ryan Lowe. Training language models to follow instructions with human feedba...

[39] Denis Paperno, Germán Kruszewski, Angeliki Lazaridou, Ngoc Quan Pham, Raffaella Bernardi, Sandro Pezzelle, Marco Baroni, Gemma Boleda, and Raquel Fernández. The LAMBADA dataset: Word prediction requiring a broad discourse context, 2016. URL https://arxiv.org/abs/1606.06031

[40] Fidel A. Guerrero Peña, Heitor Rapela Medeiros, Thomas Dubail, Masih Aminbeidokhti, Eric Granger, and Marco Pedersoli. Re-basin via implicit Sinkhorn differentiation, 2022. URL https://arxiv.org/abs/2212.12042

[41] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsupervised multitask learners. OpenAI Technical Report, 2019.

[42] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(140):1–67, 2020.

[43] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015. doi: 10.1007/s11263-015-0816-y. URL https://arxiv.org/abs/1409.0575

[44] Jegwang Ryu, Minkyu Kim, Seungjun Shin, Hee Min Choi, Dokwan Oh, and Jaeho Lee. Neural weight compression for language models, 2026. URL https://arxiv.org/abs/2510.11234

[45] Arnab Sanyal, Gourav Datta, Prithwish Mukherjee, Sandeep P. Chinchali, and Michael Orshansky. EntroLLM: Entropy encoded weight compression for efficient large language model inference on edge devices, 2025. URL https://arxiv.org/abs/2505.02380

[46] Noam Shazeer. Fast transformer decoding: One write-head is all you need. arXiv preprint arXiv:1911.02150, 2019.

[47] Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, and Jeff Dean. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer, 2017. URL https://arxiv.org/abs/1701.06538

[48] Noam Shazeer, Zhenzhong Lan, Youlong Cheng, Nan Ding, and Le Hou. Talking-heads attention, 2020. URL https://arxiv.org/abs/2003.02436

[49] Ying Sheng, Lianmin Zheng, Binhang Yuan, Zhuohan Li, Max Ryabinin, Daniel Y. Fu, Zhiqiang Xie, Beidi Chen, Clark Barrett, Joseph E. Gonzalez, Percy Liang, Christopher Ré, Ion Stoica, and Ce Zhang. FlexGen: High-throughput generative inference of large language models with a single GPU, 2023. URL https://arxiv.org/abs/2303.06865

[50] Mingjie Sun, Zhuang Liu, Anna Bair, and J. Zico Kolter. A simple and effective pruning approach for large language models, 2024. URL https://arxiv.org/abs/2306.11695

[51] Ilya O. Tolstikhin, Neil Houlsby, Alexander Kolesnikov, Lucas Beyer, Xiaohua Zhai, Thomas Unterthiner, Jessica Yung, Andreas Steiner, Daniel Keysers, Jakob Uszkoreit, Mario Lucic, and Alexey Dosovitskiy. MLP-Mixer: An all-MLP architecture for vision. In Advances in Neural Information Processing Systems, volume 34, pages 24261–24272, 2021.

[52] Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, and Hervé Jégou. Training data-efficient image transformers & distillation through attention. In Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pages 10347–10357. PMLR, 2021.

[53] Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Dan Bikel, Lukas Blecher, Cristian Canton Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller, Cynthia Gao, Vedanuj Goswami, Naman Goyal, Anthony Harts...

[54] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need, 2023. URL https://arxiv.org/abs/1706.03762

[55] Simon Wiedemann, Heiner Kirchhoffer, Stefan Matlage, Paul Haase, Arturo Marban, Talmaj Marinc, David Neumann, Tung Nguyen, Heiko Schwarz, Thomas Wiegand, Detlev Marpe, and Wojciech Samek. DeepCABAC: A universal compression algorithm for deep neural networks. IEEE Journal of Selected Topics in Signal Processing, 14(4):700–714, May 2020. ISSN 1941- doi: 10.1109/jstsp.2020.2969554. URL http://dx.doi.org/10.1109/JSTSP.2020.2969554

[56] Guangxuan Xiao, Ji Lin, Mickael Seznec, Hao Wu, Julien Demouth, and Song Han. SmoothQuant: Accurate and efficient post-training quantization for large language models. In Proceedings of the 40th International Conference on Machine Learning (ICML), 2023. URL https://arxiv.org/abs/2211.10438

[57] Zhewei Yao, Reza Yazdani Aminabadi, Minjia Zhang, Xiaoxia Wu, Conglong Li, and Yuxiong He. ZeroQuant: Efficient and affordable post-training quantization for large-scale transformers. URL https://arxiv.org/abs/2206.01861

[58] Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen, Christopher Dewan, Mona Diab, Xian Li, Xi Victoria Lin, Todor Mihaylov, Myle Ott, Sam Shleifer, Kurt Shuster, Daniel Simig, Punit Singh Koura, Anjali Sridhar, Tianlu Wang, and Luke Zettlemoyer. OPT: Open pre-trained transformer language models, 2022. URL https://arxiv.org/a...


Extracted paper fragments

$$\ldots + b'_2 = W_2\,\phi(W_1 x + b_1) + b_2. \tag{29}$$

Proof. Let $u = W_1 x + b_1$ and $u' = W'_1 x + b'_1$. Using (28),

$$u' = \Pi W_1 x + \Pi b_1 = \Pi (W_1 x + b_1) = \Pi u. \tag{30}$$

Because $\phi$ is coordinate-wise as in (27), permuting coordinates before applying $\phi$ permutes the outputs after applying $\phi$:

$$\phi(u') = \phi(\Pi u) = \Pi\,\phi(u). \tag{31}$$

Substituting (31) into the output with $W'_2$ gives

$$W'_2\,\phi(u') + b'_2 = W_2 \Pi^{-1} \Pi\,\phi(u) + b_2 = W_2\,\phi(u) + b_2, \tag{32}$$

which proves (29). Remark D.2 (Parameterized per-channe...

and Penn Treebank [34, 36] to probe distribution shifts across corpora with different token statistics. For downstream generalization, zero-shot accuracy is reported on LAMBADA (last-word prediction) [39] and a small multi-task suite constructed from held-out validation splits, using the decoded weights without prompt tuning. Calibration sequences are a...

[1] Joshua Ainslie, James Lee-Thorp, Michiel de Jong, Yury Zemlyanskiy, Federico Lebrón, and Sumit Sanghai. GQA: Training generalized multi-query transformer models from multi-head checkpoints. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 4895–4901. Association for Computational Linguistics, 2023.

[2] Samuel K. Ainsworth, Jonathan Hayase, and Siddhartha Srinivasa. Git re-basin: Merging models modulo permutation symmetries. arXiv preprint arXiv:2209.04836, 2022. URL https://arxiv.org/abs/2209.04836

[3] Johannes Ballé, David Minnen, Saurabh Singh, Sung Jin Hwang, and Nick Johnston. Variational image compression with a scale hyperprior. In International Conference on Learning Representations, 2018. URL https://arxiv.org/abs/1802.01436

[4] Stella Biderman, Hailey Schoelkopf, Quentin Anthony, Herbie Bradley, Kyle O’Brien, Eric Hallahan, Mohammad Aflah Khan, Shivanshu Purohit, USVSN Sai Prashanth, Edward Raff, Aviya Skowron, Lintang Sutawika, and Oskar van der Wal. Pythia: A suite for analyzing large language models across training and scaling. arXiv preprint arXiv:2304.01373, 2023. URL https:...

[5] Yoojin Choi, Mostafa El-Khamy, and Jungwon Lee. Universal deep neural network compression. URL https://arxiv.org/abs/1802.02271

[6] Tim Dettmers, Mike Lewis, Sam Shleifer, and Luke Zettlemoyer. LLM.int8(): 8-bit matrix multiplication for transformers at scale, 2022. URL https://arxiv.org/abs/2208.07339

[7] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 4171–4186, 2019.

[8] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations (ICLR), 2021. URL...

[9] URL https://arxiv.org/abs/2110.06296

[10] Steven K. Esser, Jeffrey L. McKinstry, Deepika Bablani, Rathinakumar Appuswamy, and Dharmendra S. Modha. Learned step size quantization. arXiv preprint arXiv:1902.08153, 2019. URL https://arxiv.org/abs/1902.08153

[11] William Fedus, Barret Zoph, and Noam Shazeer. Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity. Journal of Machine Learning Research, 23(120):1–39, 2022.

[12] Elias Frantar and Dan Alistarh. SparseGPT: Massive language models can be accurately pruned in one-shot. arXiv preprint arXiv:2301.00774, 2023. URL https://arxiv.org/abs/2301.00774

[13] Elias Frantar, Sidak Pal Singh, and Dan Alistarh. Optimal brain compression: A framework for accurate post-training quantization and pruning. In Advances in Neural Information Processing Systems (NeurIPS), 2022. URL https://arxiv.org/abs/2208.11580

[14] Elias Frantar, Saleh Ashkboos, Torsten Hoefler, and Dan Alistarh. GPTQ: Accurate post-training quantization for generative pre-trained transformers. In International Conference on Learning Representations (ICLR), 2023. URL https://arxiv.org/abs/2210.17323

[15] Yao Fu, Leyang Xue, Yeqi Huang, Andrei-Octavian Brabete, Dmitrii Ustiugov, Yuvraj Patel, and Luo Mai. ServerlessLLM: Low-latency serverless inference for large language models, 2024. URL https://arxiv.org/abs/2401.14351

[16] Song Han, Huizi Mao, and William J. Dally. Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding. In International Conference on Learning Representations, 2016. URL https://arxiv.org/abs/1510.00149

[17] Jiajun He, Gergely Flamich, and José Miguel Hernández-Lobato. Getting free bits back from rotational symmetries in LLMs, 2024. URL https://arxiv.org/abs/2410.01309

[18] Dan Hendrycks and Thomas Dietterich. Benchmarking neural network robustness to common corruptions and perturbations. In International Conference on Learning Representations (ICLR). URL https://arxiv.org/abs/1903.12261

[19] Dan Hendrycks, Kevin Zhao, Steven Basart, Jacob Steinhardt, and Dawn Song. Natural adversarial examples. In IEEE/CVF International Conference on Computer Vision (ICCV). URL https://arxiv.org/abs/1907.07174

[20] Dan Hendrycks, Steven Basart, Norman Mu, Saurav Kadavath, Frank Wang, Evan Dorundo, Rahul Desai, Tyler Zhu, Samyak Parajuli, Mike Guo, Dawn Song, Jacob Steinhardt, and Justin Gilmer. The many faces of robustness: A critical analysis of out-of-distribution generalization. In IEEE/CVF International Conference on Computer Vision (ICCV), 2021. URL https://arx...

[21] Benoit Jacob, Skirmantas Kligys, Bo Chen, Menglong Zhu, Matthew Tang, Andrew Howard, Hartwig Adam, and Dmitry Kalenichenko. Quantization and training of neural networks for efficient integer-arithmetic-only inference. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018. URL https://arxiv.org/abs/1712.05877

[22] Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, Lélio Renard Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, and William El Sayed. Mistral 7B. arXiv preprint arXiv:23...

[23] Albert Q. Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Emma Bou Hanna, Florian Bressand, Gianna Lengyel, Guillaume Bour, Guillaume Lample, Lélio Renard Lavaud, Lucile Saulnier, Marie-Anne Lachaux, Pierre Stock, Sandeep Subramanian, Sophia Yang, Szymon Antoniak, Teven...

[24] Alex Krizhevsky. Learning multiple layers of features from tiny images. Technical report, University of Toronto, 2009. URL https://www.cs.toronto.edu/~kriz/cifar.html

[25] Ya Le and Xuan Yang. Tiny ImageNet visual recognition challenge, 2015. URL https://cs231n.stanford.edu/reports/2015/pdfs/yle_project.pdf

[26] Kang-Ho Lee and Sung-Ho Bae. Compressing neural networks with inter prediction and linear transformation. IEEE Access, 9:69601–69608, 2021.

[27] Yuhang Li, Ruihao Gong, Xu Tan, Yang Yang, Peng Hu, Qi Zhang, Fengwei Yu, Wei Wang, and Shi Gu. BRECQ: Pushing the limit of post-training quantization by block reconstruction. URL https://arxiv.org/abs/2102.05426

[28] Ji Lin, Jiaming Tang, Haotian Tang, Shang Yang, Xingyu Dang, and Song Han. AWQ: Activation-aware weight quantization for LLM compression and acceleration. arXiv preprint arXiv:2306.00978, 2023. URL https://arxiv.org/abs/2306.00978

[29] Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692, 2019.

[30] Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 10012–10022, 2021.

[31] Zhuang Liu, Hanzi Mao, Chao-Yuan Wu, Christoph Feichtenhofer, Trevor Darrell, and Saining Xie. A ConvNet for the 2020s. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11976–11986, 2022.

[32] Guo Lu, Wanli Ouyang, Dong Xu, Xiaoyun Zhang, Chunlei Cai, and Zhiyong Gao. DVC: An end-to-end deep video compression framework, 2019. URL https://arxiv.org/abs/1812.00101

[33] Denis Makhov, Dmitriy Shopkhoev, Magauiya Zhussip, Ammar Ali, and Stamatios Lefkimmiatis. CoSpaDi: Compressing LLMs via calibration-guided sparse dictionary learning, 2026. URL https://arxiv.org/abs/2509.22075

[34] Mitchell P. Marcus, Beatrice Santorini, and Mary Ann Marcinkiewicz. Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19(2):313–330. URL https://aclanthology.org/J93-2004/

[35] Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. Pointer sentinel mixture models, 2016. URL https://arxiv.org/abs/1609.07843

[36] Tomáš Mikolov, Martin Karafiát, Lukáš Burget, Jan Černocký, and Sanjeev Khudanpur. Recurrent neural network based language model. In Proc. Interspeech, 2010. URL https://www.isca-archive.org/interspeech_2010/mikolov10_interspeech.html

[37] David Minnen, Johannes Ballé, and George Toderici. Joint autoregressive and hierarchical priors for learned image compression, 2018. URL https://arxiv.org/abs/1809.02736

[38] Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul F. Christiano, Jan Leike, and Ryan Lowe. Training language models to follow instructions with human feedba...

[39] Denis Paperno, Germán Kruszewski, Angeliki Lazaridou, Ngoc Quan Pham, Raffaella Bernardi, Sandro Pezzelle, Marco Baroni, Gemma Boleda, and Raquel Fernández. The LAMBADA dataset: Word prediction requiring a broad discourse context, 2016. URL https://arxiv.org/abs/1606.06031

[40] Fidel A. Guerrero Peña, Heitor Rapela Medeiros, Thomas Dubail, Masih Aminbeidokhti, Eric Granger, and Marco Pedersoli. Re-basin via implicit Sinkhorn differentiation, 2022. URL https://arxiv.org/abs/2212.12042

[41] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsupervised multitask learners. OpenAI Technical Report, 2019.

[42] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(140):1–67, 2020.

[43] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015. doi: 10.1007/s11263-015-0816-y. URL https://arxiv.org/abs/1409.0575

[44] Jegwang Ryu, Minkyu Kim, Seungjun Shin, Hee Min Choi, Dokwan Oh, and Jaeho Lee. Neural weight compression for language models, 2026. URL https://arxiv.org/abs/2510.11234

[45] Arnab Sanyal, Gourav Datta, Prithwish Mukherjee, Sandeep P. Chinchali, and Michael Orshansky. EntroLLM: Entropy encoded weight compression for efficient large language model inference on edge devices, 2025. URL https://arxiv.org/abs/2505.02380

[46] Noam Shazeer. Fast transformer decoding: One write-head is all you need. arXiv preprint arXiv:1911.02150, 2019.

[47] Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, and Jeff Dean. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer, 2017. URL https://arxiv.org/abs/1701.06538

[48] Noam Shazeer, Zhenzhong Lan, Youlong Cheng, Nan Ding, and Le Hou. Talking-heads attention, 2020. URL https://arxiv.org/abs/2003.02436

[49] Ying Sheng, Lianmin Zheng, Binhang Yuan, Zhuohan Li, Max Ryabinin, Daniel Y. Fu, Zhiqiang Xie, Beidi Chen, Clark Barrett, Joseph E. Gonzalez, Percy Liang, Christopher Ré, Ion Stoica, and Ce Zhang. FlexGen: High-throughput generative inference of large language models with a single GPU, 2023. URL https://arxiv.org/abs/2303.06865

[50] Mingjie Sun, Zhuang Liu, Anna Bair, and J. Zico Kolter. A simple and effective pruning approach for large language models, 2024. URL https://arxiv.org/abs/2306.11695

[51] Ilya O. Tolstikhin, Neil Houlsby, Alexander Kolesnikov, Lucas Beyer, Xiaohua Zhai, Thomas Unterthiner, Jessica Yung, Andreas Steiner, Daniel Keysers, Jakob Uszkoreit, Mario Lucic, and Alexey Dosovitskiy. MLP-Mixer: An all-MLP architecture for vision. In Advances in Neural Information Processing Systems, volume 34, pages 24261–24272, 2021.

[52] Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, and Hervé Jégou. Training data-efficient image transformers & distillation through attention. In Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pages 10347–10357. PMLR, 2021.

[53] Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Dan Bikel, Lukas Blecher, Cristian Canton Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller, Cynthia Gao, Vedanuj Goswami, Naman Goyal, Anthony Harts...

[54] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need, 2023. URL https://arxiv.org/abs/1706.03762

[55] Simon Wiedemann, Heiner Kirchhoffer, Stefan Matlage, Paul Haase, Arturo Marban, Talmaj Marinc, David Neumann, Tung Nguyen, Heiko Schwarz, Thomas Wiegand, Detlev Marpe, and Wojciech Samek. DeepCABAC: A universal compression algorithm for deep neural networks. IEEE Journal of Selected Topics in Signal Processing, 14(4):700–714, May 2020. ISSN 1941-… doi: 10.1109/jstsp.2020.2969554. URL http://dx.doi.org/10.1109/JSTSP.2020.2969554

[56] Guangxuan Xiao, Ji Lin, Mickael Seznec, Hao Wu, Julien Demouth, and Song Han. SmoothQuant: Accurate and efficient post-training quantization for large language models. In Proceedings of the 40th International Conference on Machine Learning (ICML), 2023. URL https://arxiv.org/abs/2211.10438

[57] Zhewei Yao, Reza Yazdani Aminabadi, Minjia Zhang, Xiaoxia Wu, Conglong Li, and Yuxiong He. ZeroQuant: Efficient and affordable post-training quantization for large-scale transformers. URL https://arxiv.org/abs/2206.01861

[58] Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen, Christopher Dewan, Mona Diab, Xian Li, Xi Victoria Lin, Todor Mihaylov, Myle Ott, Sam Shleifer, Kurt Shuster, Daniel Simig, Punit Singh Koura, Anjali Sridhar, Tianlu Wang, and Luke Zettlemoyer. OPT: Open pre-trained transformer language models, 2022. URL https://arxiv.org/a...

$\dots + b'_2 = W_2\,\phi(W_1 x + b_1) + b_2. \quad (29)$

Proof. Let $u = W_1 x + b_1$ and $u' = W'_1 x + b'_1$. Using (28),

$u' = \Pi W_1 x + \Pi b_1 = \Pi (W_1 x + b_1) = \Pi u. \quad (30)$

Because $\phi$ is coordinate-wise as in (27), permuting coordinates before applying $\phi$ permutes the outputs after applying $\phi$:

$\phi(u') = \phi(\Pi u) = \Pi\,\phi(u). \quad (31)$

Substituting (31) into the output with $W'_2$ gives

$W'_2\,\phi(u') + b'_2 = W_2 \Pi^{-1} \Pi\,\phi(u) + b_2 = W_2\,\phi(u) + b_2, \quad (32)$

which proves (29). Remark D.2 (Parameterized per-channe...
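
The permutation argument in (29) to (32) can be checked numerically. The following is a minimal sketch, not the paper's implementation: it assumes that (28), which is not visible in this excerpt, defines W1' = ΠW1, b1' = Πb1, W2' = W2Π⁻¹ and leaves b2 unchanged (as (30) and (32) suggest), and it uses ReLU as a stand-in for the coordinate-wise activation ϕ. The helper names `affine`, `relu`, `mlp`, and `permute_hidden` are hypothetical.

```python
import random

def affine(W, x, b):
    """Return W x + b for a matrix W stored as a list of rows."""
    return [sum(w * t for w, t in zip(row, x)) + bi for row, bi in zip(W, b)]

def relu(v):
    """Coordinate-wise activation, standing in for phi (an assumption)."""
    return [max(t, 0.0) for t in v]

def mlp(x, W1, b1, W2, b2):
    """Two-layer block: W2 phi(W1 x + b1) + b2."""
    return affine(W2, relu(affine(W1, x, b1)), b2)

def permute_hidden(W1, b1, W2, perm):
    """Assumed form of (28): W1' = Pi W1, b1' = Pi b1, W2' = W2 Pi^{-1}."""
    W1p = [W1[j] for j in perm]                   # reorder rows of W1
    b1p = [b1[j] for j in perm]                   # reorder entries of b1
    W2p = [[row[j] for j in perm] for row in W2]  # reorder columns of W2
    return W1p, b1p, W2p

rng = random.Random(0)
d_in, d_hidden, d_out = 5, 8, 3

def rand_vec(n):
    return [rng.gauss(0.0, 1.0) for _ in range(n)]

W1 = [rand_vec(d_in) for _ in range(d_hidden)]
b1 = rand_vec(d_hidden)
W2 = [rand_vec(d_hidden) for _ in range(d_out)]
b2 = rand_vec(d_out)
x = rand_vec(d_in)

# Draw a random permutation of the hidden units and apply it to the weights.
perm = list(range(d_hidden))
rng.shuffle(perm)
W1p, b1p, W2p = permute_hidden(W1, b1, W2, perm)

y = mlp(x, W1, b1, W2, b2)
y_perm = mlp(x, W1p, b1p, W2p, b2)  # b2 is unchanged, as in (32)
max_gap = max(abs(a - b) for a, b in zip(y, y_perm))
print(max_gap)  # tiny: only the floating-point summation order differs
```

The two outputs agree up to rounding because permuting the hidden units only reorders the terms of each output sum, which is the content of (32).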

and Penn Treebank [34, 36] to probe distribution shifts across corpora with different token statistics. For downstream generalization, zero-shot accuracy is reported on LAMBADA (last-word prediction) [39] and a small multi-task suite constructed from held-out validation splits, using the decoded weights without prompt tuning. Calibration sequences are a...