TWLA: Achieving Ternary Weights and Low-Bit Activations for LLMs via Post-Training Quantization

Dawei Yang; Xing Hu; Zhe Jiang; Zhixiong Zhao; Zhixuan Chen; Zukang Xu

arxiv: 2606.13054 · v1 · pith:YSYCUIVAnew · submitted 2026-06-11 · 💻 cs.LG · cs.AI

TWLA: Achieving Ternary Weights and Low-Bit Activations for LLMs via Post-Training Quantization

Zhixiong Zhao , Zukang Xu , Zhixuan Chen , Xing Hu , Zhe Jiang , Dawei Yang This is my paper

Pith reviewed 2026-06-27 07:36 UTC · model grok-4.3

classification 💻 cs.LG cs.AI

keywords post-training quantizationternary weightslow-bit activationslarge language modelsmodel compressioninference acceleration

0 comments

The pith

TWLA is a post-training method that achieves 1.58-bit ternary weights and 4-bit activations for large language models while preserving accuracy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Large language models face deployment barriers from high memory and compute demands. The paper presents TWLA as a post-training quantization approach that reaches 1.58-bit weights and 4-bit activations. It does so through three linked components that handle ternarization error, reshape distributions via rotation, and allocate activation bits across layers with interaction costs. A sympathetic reader would care because this combination promises both smaller model footprints and end-to-end inference speedups that prior ternary methods could not deliver without keeping activations at higher precision.

Core claim

The paper claims that its TWLA framework, built from Euclidean-to-Manifold Asymmetric Ternary Quantizer for layer-output error minimization, Kronecker Orthogonal Tri-Modal Shaping for tri-modal weight reshaping and outlier suppression, and Inter-Layer Aware Activation Mixed Precision for joint bit allocation, enables W1.58A4 quantization of LLMs that maintains high accuracy on language tasks and yields inference acceleration.

What carries the argument

The Kronecker Orthogonal Tri-Modal Shaping component, which applies a Kronecker-structured orthogonal rotation to weights to produce ternary-friendly tri-modal distributions while statistically suppressing activation outliers.

If this is right

Model memory footprint shrinks substantially from 1.58-bit weight storage.
Inference compute accelerates because both weights and activations use low-bit arithmetic.
End-to-end low-precision execution becomes feasible without retaining high-precision activations.
Accuracy remains competitive with full-precision baselines on language modeling and downstream tasks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The outlier-suppression effect of the shared rotation might extend to other low-bit activation schemes beyond 4 bits.
The inter-layer cost modeling could be adapted to mixed-precision schemes that also vary weight bits.
Hardware implementations could exploit the fixed rotation matrix for further speed gains on specific accelerators.

Load-bearing premise

The assumption that the shared orthogonal rotation sufficiently suppresses activation outliers to support reliable 4-bit quantization without major accuracy loss.

What would settle it

A clear accuracy or perplexity drop on a standard LLM benchmark when the full TWLA pipeline is applied with 4-bit activations would falsify the central claim.

Figures

Figures reproduced from arXiv: 2606.13054 by Dawei Yang, Xing Hu, Zhe Jiang, Zhixiong Zhao, Zhixuan Chen, Zukang Xu.

**Figure 1.** Figure 1: LLaMA-family performance on seven zero-shot tasks. TWLA remains robust under both weight-only and weight–activation quantization (at equal memory cost), while other methods degrade substantially with 4-bit activation quantization. 1. Introduction In recent years, Large Language Models (LLMs) have achieved outstanding results across a wide range of diverse domains (Zhu et al., 2025). However, this strong ca… view at source ↗

**Figure 2.** Figure 2: (a) Before applying TWLA, activation outliers hinder low-bit quantization and unimodal weights are misaligned with ternary codebooks. (b) After applying TWLA, activations are smoothed and weights are transformed into symmetric tri-modal forms. additions and simple branching to lower inference costs, while offering stronger representational power and a superior accuracy-efficiency trade-off. Recent studies… view at source ↗

**Figure 3.** Figure 3: Overview of TWLA (initialization follows the same design as PT2 -LLM (Yan et al., 2025)). Euclidean-to-Manifold Asymmetric Ternary Quantizer (E2M-ATQ): minimizes layer-output error under weight ternarization via a two-stage optimization. Kronecker Orthogonal Tri-Modal Shaping (KOTMS): reshapes weight into ternary-friendly tri-modal distribution. Inter-Layer Aware Activation Mixed Precision (ILA-AMP): preve… view at source ↗

**Figure 4.** Figure 4: (a) Ablation of the Overhead–Performance trade-off for KOTMS. (b) Comparison of throughput and memory consumption across FP16, QuaRot, and TWLA. (c) Comparison of quantization time and perplexity across QuaRot, SliM-LLM, and TWLA [PITH_FULL_IMAGE:figures/full_fig_p008_4.png] view at source ↗

**Figure 5.** Figure 5: Perplexity of LLaMA3-8B (W1.58A4) using calibration data sampled with different number or seeds from WikiText2. Summary. Overall, the expanded tables consistently show that TWLA offers superior robustness and scalability as activation precision decreases from 16 to 6 and further to 4 bits. While existing PTQ methods rapidly lose accuracy (and often collapse) under low-bit activations, TWLA maintains strong… view at source ↗

**Figure 6.** Figure 6: Calibration cost versus quantization quality under different interaction orders. Bars report the wall-clock time (minutes) spent collecting validation NLL statistics for 1st-, 2nd-, and 3rd-order interaction modeling, and dots report WikiText2 perplexity (lower is better). variant (ILA-AMP) additionally models adjacent-layer pairwise couplings to capture error propagation across neighboring layers. The 3rd… view at source ↗

**Figure 7.** Figure 7: Layer-wise activation bit allocation produced by ILA-AMP under an average activation bit budget of 4 on the WikiText2 calibration set for LLaMA2-7B and LLaMA3-8B. Numbers above bars denote the assigned per-layer activation bit-width. and output embeddings: larger input–output changes indicate higher importance, and the metric requires calibration data. • ZD (Z-score Distribution) quantifies importance by t… view at source ↗

**Figure 8.** Figure 8: The weight distribution of the 12th layer in Qwen3-8B before and after TWLA. 31 [PITH_FULL_IMAGE:figures/full_fig_p031_8.png] view at source ↗

**Figure 9.** Figure 9: The activation distribution of the 12th layer in Qwen3-8B before and after TWLA. 33 [PITH_FULL_IMAGE:figures/full_fig_p033_9.png] view at source ↗

**Figure 10.** Figure 10: Cross-layer heterogeneity of activation outlier suppression after KOTMS on QWEN3-8B. Quantile plots (Min/Max, 1/99%, and 25/75%) of the Q-projection input activations are shown for Layers 4, 12, 24, and 36, comparing the original activations (left) with those after KOTMS projection (right). 34 [PITH_FULL_IMAGE:figures/full_fig_p034_10.png] view at source ↗

read the original abstract

Large language models (LLMs) exhibit exceptional general language processing capabilities, but their memory and compute costs hinder deployment. Ternarization has emerged as a promising compression technique, offering significant reductions in model size and inference complexity. However, existing methods struggle with heavy-tailed activation distributions and therefore keep activations in high precision, fundamentally limiting end-to-end inference acceleration. To overcome this limitation, we propose TWLA, a post-training quantization (PTQ) framework that achieves 1.58-bit weight compression and 4-bit activation quantization while maintaining high accuracy. TWLA comprises three components: (1) Euclidean-to-Manifold Asymmetric Ternary Quantizer (E2M-ATQ) minimizes layer-output error under weight ternarization via a two-stage optimization from Euclidean initialization to manifold relocation; (2) Kronecker Orthogonal Tri-Modal Shaping (KOTMS) applies a Kronecker-structured orthogonal rotation to reshape weights into ternary-friendly tri-modal distributions, while the shared rotation statistically suppresses activation outliers; and (3) Inter-Layer Aware Activation Mixed Precision (ILA-AMP) explicitly introduces adjacent-layer second-order interaction costs in bit allocation and jointly optimizes for the layer-wise disparity of activation quantization gains induced by the shared orthogonal transform, preventing cascades triggered by a few weak layers. Extensive experiments demonstrate that TWLA maintains high accuracy under W1.58A4, while delivering significant inference acceleration. The code is available at <https://github.com/Kishon-zzx/TWLA>.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

TWLA's KOTMS claims a shared Kronecker rotation statistically suppresses activation outliers enough for reliable 4-bit quantization, but the abstract supplies no bounds, moments analysis, or isolating ablations to back that up.

read the letter

The paper puts forward TWLA, a post-training framework that targets 1.58-bit weights and 4-bit activations together. It has three named pieces: E2M-ATQ does a two-stage optimization to minimize output error under ternarization, KOTMS applies a Kronecker-structured orthogonal rotation to push weights into a tri-modal shape while claiming the same rotation reduces activation dynamic range, and ILA-AMP adds inter-layer second-order costs when choosing per-layer activation bits.

The combination is new on the surface. Most prior ternarization work keeps activations at higher precision; here the authors try to close that gap with one shared transform plus layer-aware allocation. Releasing code is also useful.

The central soft spot is the KOTMS outlier claim. The abstract says the rotation "statistically suppresses" outliers because it is shared with the weight shaping, yet gives no reduction factor on the tail, no derivation linking the tri-modal constraint to activation moments, and no ablation that holds the other two components fixed. Without that, it is unclear whether the 4-bit activation result actually comes from the rotation or from ILA-AMP and extra tuning. The experiments are described only as "maintaining high accuracy," with no numbers or baseline comparisons visible in the abstract.

This is for readers already working on LLM quantization who want to see whether the shared-rotation idea can be made to work. A serious referee could check the missing analysis and the full tables. I would send it to review rather than desk-reject; the framing is distinct enough to be worth testing even if the current evidence is thin.

Referee Report

2 major / 0 minor

Summary. The manuscript introduces TWLA, a post-training quantization framework for LLMs that achieves 1.58-bit ternary weights and 4-bit activations while preserving high accuracy. It consists of three components: E2M-ATQ (a two-stage Euclidean-to-manifold optimization for layer-output error minimization under ternarization), KOTMS (Kronecker-structured orthogonal rotation to produce tri-modal weight distributions that also statistically suppresses activation outliers), and ILA-AMP (inter-layer aware mixed-precision activation bit allocation that accounts for second-order interactions and shared-transform effects). Experiments are claimed to demonstrate maintained accuracy under W1.58A4 with inference acceleration; code is released.

Significance. If the accuracy claims under W1.58A4 hold with rigorous verification, the work would advance practical end-to-end low-bit LLM inference by addressing the activation outlier barrier that has limited prior ternarization methods. Code availability strengthens reproducibility.

major comments (2)

[KOTMS] KOTMS section: the claim that the shared Kronecker orthogonal rotation 'statistically suppresses activation outliers' sufficiently for reliable 4-bit quantization lacks any bound on the dynamic-range reduction factor, any analysis of interaction between the tri-modal weight constraint and activation moments, and any ablation isolating this effect from ILA-AMP. This is load-bearing for the central W1.58A4 accuracy-maintenance claim.
[Experiments] Experiments section: without reported ablations or quantitative measurements (e.g., 99.9th-percentile reduction factor) showing that outlier suppression reaches the ~4-8x level typically required to avoid clipping loss in 4-bit quantization, the accuracy results cannot be attributed to KOTMS rather than the other components or post-hoc tuning.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on the KOTMS component and its experimental validation. We address the two major comments below and commit to revisions that strengthen the manuscript with the requested analyses and measurements.

read point-by-point responses

Referee: [KOTMS] KOTMS section: the claim that the shared Kronecker orthogonal rotation 'statistically suppresses activation outliers' sufficiently for reliable 4-bit quantization lacks any bound on the dynamic-range reduction factor, any analysis of interaction between the tri-modal weight constraint and activation moments, and any ablation isolating this effect from ILA-AMP. This is load-bearing for the central W1.58A4 accuracy-maintenance claim.

Authors: We agree that additional analysis would strengthen the presentation. In the revision we will derive a bound on the dynamic-range reduction factor induced by the Kronecker-structured orthogonal rotation, provide a short analysis of the interaction between the resulting tri-modal weight distribution and the first two moments of the activations, and add an ablation that isolates KOTMS from ILA-AMP by comparing W1.58A4 accuracy with and without the shared rotation while keeping all other components fixed. revision: yes
Referee: [Experiments] Experiments section: without reported ablations or quantitative measurements (e.g., 99.9th-percentile reduction factor) showing that outlier suppression reaches the ~4-8x level typically required to avoid clipping loss in 4-bit quantization, the accuracy results cannot be attributed to KOTMS rather than the other components or post-hoc tuning.

Authors: We will augment the experiments section with the requested quantitative measurements, specifically reporting the 99.9th-percentile activation-range reduction factor achieved by KOTMS across the evaluated models. We will also include the ablation results described above, which directly quantify the outlier-suppression contribution of KOTMS and confirm that the observed reduction is sufficient to support reliable 4-bit activation quantization. revision: yes

Circularity Check

0 steps flagged

No circularity: framework claims rest on novel components with external code

full rationale

The paper presents TWLA as a new three-component PTQ framework (E2M-ATQ, KOTMS, ILA-AMP) whose central claims about W1.58A4 accuracy are tied to the proposed algorithms and their statistical effects rather than any self-definition, fitted-input prediction, or load-bearing self-citation chain. The abstract explicitly positions the outlier suppression as a consequence of the shared Kronecker rotation (a design choice, not a tautology) and makes the implementation available at an external GitHub link, allowing independent verification. No equations or prior-author citations are invoked in a way that reduces the derivation to its inputs by construction. This is the normal case of a self-contained empirical method.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides no explicit free parameters, axioms, or invented entities; the contributions are algorithmic components rather than new physical or mathematical entities.

pith-pipeline@v0.9.1-grok · 5821 in / 1131 out tokens · 33728 ms · 2026-06-27T07:36:33.148346+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

28 extracted references · 3 canonical work pages

[1]

Ashkboos, S., Mohtashami, A., Croci, M

doi: 10.1109/IJCNN.2017.7966166. Ashkboos, S., Mohtashami, A., Croci, M. L., Li, B., Cameron, P., Jaggi, M., Alistarh, D., Hoefler, T., and Hensman, J. Quarot: Outlier-free 4-bit inference in ro- tated llms.Advances in Neural Information Processing Systems, 37:100213–100240,

work page doi:10.1109/ijcnn.2017.7966166 2017
[2]

Chen, M., Tworek, J., Jun, H., Yuan, Q., de Oliveira Pinto, H

URL https://arxiv.org/abs/ 1911.11641. Chen, M., Tworek, J., Jun, H., Yuan, Q., de Oliveira Pinto, H. P., Kaplan, J., Edwards, H., Burda, Y ., Joseph, N., Brockman, G., Ray, A., Puri, R., Krueger, G., Petrov, M., Khlaaf, H., Sastry, G., Mishkin, P., Chan, B., Gray, S., Ryder, N., Pavlov, M., Power, A., Kaiser, L., Bavar- ian, M., Winter, C., Tillet, P., S...

Pith/arXiv arXiv 1911
[3]

Chen, T., Li, Z., Xu, W., Zhu, Z., Li, D., Tian, L., Barsoum, E., Wang, P., and Cheng, J

URL https://arxiv.org/abs/ 2107.03374. Chen, T., Li, Z., Xu, W., Zhu, Z., Li, D., Tian, L., Barsoum, E., Wang, P., and Cheng, J. Ternaryllm: Ternarized large language model,

Pith/arXiv arXiv
[4]

Clark, P., Cowhey, I., Etzioni, O., Khot, T., Sabharwal, A., Schoenick, C., and Tafjord, O

URL https://arxiv.org/ abs/2406.07177. Clark, P., Cowhey, I., Etzioni, O., Khot, T., Sabharwal, A., Schoenick, C., and Tafjord, O. Think you have solved question answering? try arc, the ai2 reasoning challenge,

arXiv
[5]

Cobbe, K., Kosaraju, V ., Bavarian, M., Chen, M., Jun, H., Kaiser, L., Plappert, M., Tworek, J., Hilton, J., Nakano, R., Hesse, C., and Schulman, J

URL https://arxiv.org/abs/ 1803.05457. Cobbe, K., Kosaraju, V ., Bavarian, M., Chen, M., Jun, H., Kaiser, L., Plappert, M., Tworek, J., Hilton, J., Nakano, R., Hesse, C., and Schulman, J. Training verifiers to solve math word problems,

Pith/arXiv arXiv
[6]

org/abs/2110.14168

URL https://arxiv. org/abs/2110.14168. Dettmers, T., Svirschevski, R., Egiazarian, V ., Kuznedelev, D., Frantar, E., Ashkboos, S., Borzunov, A., Hoefler, T., and Alistarh, D. Spqr: A sparse-quantized representation for near-lossless llm weight compression,

Pith/arXiv arXiv
[7]

Frantar, E., Ashkboos, S., Hoefler, T., and Alistarh, D

URL https://arxiv.org/abs/2306.03078. Frantar, E., Ashkboos, S., Hoefler, T., and Alistarh, D. Gptq: Accurate post-training quantization for generative pre- trained transformers,

arXiv
[8]

org/abs/2210.17323

URL https://arxiv. org/abs/2210.17323. Gu, H., Hu, L., Yu, L., Li, H., and Liu, F. Lopro: Enhanc- ing low-rank quantization via permuted block-wise rota- tion,

Pith/arXiv arXiv
[9]

doi: 10.1145/3649329. 3658498. URL http://dx.doi.org/10.1145/ 3649329.3658498. Gul, H., Hu, L., Niu, S., and Liu, F. Flrq: Faster llm quan- tization with flexible low-rank matrix sketching,

work page doi:10.1145/3649329
[10]

Guo, D., Yang, D., Zhang, H., Song, J., Wang, P., Zhu, Q., Xu, R., Zhang, R., Ma, S., Bi, X., Zhang, X., Yu, X., Wu, Y ., Wu, Z

URLhttps://arxiv.org/abs/2601.05684. Guo, D., Yang, D., Zhang, H., Song, J., Wang, P., Zhu, Q., Xu, R., Zhang, R., Ma, S., Bi, X., Zhang, X., Yu, X., Wu, Y ., Wu, Z. F., Gou, Z., Shao, Z., Li, Z., Gao, Z., Liu, A., Xue, B., Wang, B., Wu, B., Feng, B., Lu, C., Zhao, C., Deng, C., Ruan, C., Dai, D., Chen, D., Ji, D., Li, E., Lin, F., Dai, F., Luo, F., Hao, ...

arXiv
[11]

Nature645,633–638

ISSN 1476-4687. doi: 10.1038/s41586-025-09422-z. URL http://dx. doi.org/10.1038/s41586-025-09422-z. Hendrycks, D., Burns, C., Basart, S., Zou, A., Mazeika, M., Song, D., and Steinhardt, J. Measuring massive multitask language understanding,

work page doi:10.1038/s41586-025-09422-z
[12]

Huang, W., Qin, H., Liu, Y ., Li, Y ., Liu, Q., Liu, X., Benini, L., Magno, M., Zhang, S., and Qi, X

URL https: //arxiv.org/abs/2009.03300. Huang, W., Qin, H., Liu, Y ., Li, Y ., Liu, Q., Liu, X., Benini, L., Magno, M., Zhang, S., and Qi, X. Slim-llm: Salience- driven mixed-precision quantization for large language models.arXiv preprint arXiv:2405.14917,

Pith/arXiv arXiv 2009
[13]

Li, M., Zhang, D., He, T., Xie, X., Li, Y .-F., and Qin, K

URL https://arxiv.org/ abs/1605.04711. Li, M., Zhang, D., He, T., Xie, X., Li, Y .-F., and Qin, K. Towards effective data-free knowledge distillation via diverse diffusion augmentation. InProceedings of the 32nd ACM International Conference on Multimedia, pp. 4416–4425, 2024a. Li, M., Zhang, D., Dong, Q., Xie, X., and Qin, K. Adaptive dataset quantization...

arXiv
[14]

Arb-llm: Alternating refined binarizations for large language models.arXiv preprint arXiv:2410.03129, 2024b

Li, Z., Yan, X., Zhang, T., Qin, H., Xie, D., Tian, J., Kong, L., Zhang, Y ., Yang, X., et al. Arb-llm: Alternating refined binarizations for large language models.arXiv preprint arXiv:2410.03129, 2024b. Liu, K., Yang, K., Chen, Z., Li, Z., Guo, Y ., Li, W., Kong, L., and Zhang, Y . Bimacosr: Binary one-step diffusion model leveraging flexible matrix comp...

arXiv
[15]

Merity, S., Xiong, C., Bradbury, J., and Socher, R

URL https://arxiv.org/ abs/2502.00333. Merity, S., Xiong, C., Bradbury, J., and Socher, R. Pointer sentinel mixture models.arXiv preprint arXiv:1609.07843,

arXiv
[16]

Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., Zhou, Y ., Li, W., and Liu, P

URL https: //arxiv.org/abs/1606.06031. Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., Zhou, Y ., Li, W., and Liu, P. J. Explor- ing the limits of transfer learning with a unified text-to- text transformer,

Pith/arXiv arXiv
[17]

Sakaguchi, K., Bras, R

URL https://arxiv.org/ abs/1910.10683. Sakaguchi, K., Bras, R. L., Bhagavatula, C., and Choi, Y . Winogrande: An adversarial winograd schema challenge at scale,

Pith/arXiv arXiv 1910
[18]

Saxena, U., Sharify, S., Roy, K., and Wang, X

URL https://arxiv.org/abs/ 1907.10641. Saxena, U., Sharify, S., Roy, K., and Wang, X. Resq: Mixed- precision quantization of large language models with low- rank residuals.arXiv preprint arXiv:2412.14363,

Pith/arXiv arXiv 1907
[19]

URL https: //arxiv.org/abs/2310.00034. Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozi `ere, B., Goyal, N., Hambro, E., Azhar, F., Rodriguez, A., Joulin, A., Grave, E., and Lample, G. Llama: Open and efficient foundation lan- guage models,

arXiv
[20]

Wang, H., Ma, S., and Wei, F

URL https://arxiv.org/ abs/2302.13971. Wang, H., Ma, S., and Wei, F. Bitnet v2: Native 4-bit activa- tions with hadamard transformation for 1-bit llms,

Pith/arXiv arXiv
[21]

Wang, P., Hu, Q., Zhang, Y ., Zhang, C., Liu, Y ., and Cheng, J

URLhttps://arxiv.org/abs/2504.18415. Wang, P., Hu, Q., Zhang, Y ., Zhang, C., Liu, Y ., and Cheng, J. Two-step quantization for low-bit neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June

arXiv
[22]

11 TWLA: Achieving Ternary Weights and Low-Bit Activations for LLMs via Post-Training Quantization Xu, Z., Zhao, Z., Hu, X., Chen, Z., and Yang, D

URLhttps://arxiv.org/abs/2511.22316. 11 TWLA: Achieving Ternary Weights and Low-Bit Activations for LLMs via Post-Training Quantization Xu, Z., Zhao, Z., Hu, X., Chen, Z., and Yang, D. Kbvq- moe: Klt-guided svd with bias-corrected vector quanti- zation for moe large language models.arXiv preprint arXiv:2602.11184,

Pith/arXiv arXiv
[23]

URL https://arxiv.org/abs/2510.03267. Yang, A., Li, A., Yang, B., Zhang, B., Hui, B., Zheng, B., Yu, B., Gao, C., Huang, C., Lv, C., Zheng, C., Liu, D., Zhou, F., Huang, F., Hu, F., Ge, H., Wei, H., Lin, H., Tang, J., Yang, J., Tu, J., Zhang, J., Yang, J., Yang, J., Zhou, J., Zhou, J., Lin, J., Dang, K., Bao, K., Yang, K., Yu, L., Deng, L., Li, M., Xue, M...

arXiv
[24]

Zhang, W., Hou, L., Yin, Y ., Shang, L., Chen, X., Jiang, X., and Liu, Q

URL https://arxiv.org/abs/ 1905.07830. Zhang, W., Hou, L., Yin, Y ., Shang, L., Chen, X., Jiang, X., and Liu, Q. Ternarybert: Distillation-aware ultra-low bit bert,

Pith/arXiv arXiv 1905
[25]

URL https://arxiv.org/abs/2009. 12812. Zhao, Z., Li, H., Liu, F., Lu, Y ., Wang, Z., Yang, T., Jiang, L., and Guan, H. Quark: Quantization-enabled circuit sharing for transformer acceleration by exploiting com- mon patterns in nonlinear operations. In2025 IEEE/ACM International Conference On Computer Aided Design (ICCAD), pp. 1–9. IEEE,

2009
[26]

Specquant: Spectral decomposition and adaptive truncation for ultra-low-bit llms quantization

Zhao, Z., Liu, F., Wang, J., Guan, C., Wang, Z., Jiang, L., and Guan, H. Specquant: Spectral decomposition and adaptive truncation for ultra-low-bit llms quantization. In Proceedings of the AAAI Conference on Artificial Intelli- gence, volume 40, pp. 28786–28794, 2026a. Zhao, Z., Xu, Z., and Yang, D. Bwla: Breaking the barrier of w1ax post-training quanti...

Pith/arXiv arXiv
[27]

org/abs/1612.01064

URL https://arxiv. org/abs/1612.01064. Zhu, C., Lin, Y ., Shao, J., Lin, J., and Wang, Y . Pathology- aware prototype evolution via llm-driven semantic disam- biguation for multicenter diabetic retinopathy diagnosis. InProceedings of the 33rd ACM International Confer- ence on Multimedia, pp. 9196–9205,

Pith/arXiv arXiv
[28]

The storage reduction factor is therefore m2 n2 1 +n 2 2 = 16,777,216 8192 = 2048.(57) Thus, the Kronecker parameterization is2048×smaller in storage than a dense4096×4096transform

yields only n2 1 +n 2 2 = 642 + 642 = 8192(56) entries. The storage reduction factor is therefore m2 n2 1 +n 2 2 = 16,777,216 8192 = 2048.(57) Thus, the Kronecker parameterization is2048×smaller in storage than a dense4096×4096transform. B.3.3. FAST APPLICATION VIA RESHAPE–MULTIPLY–VECTORIZE Let v∈R 1×m. Reshape it into a matrix V∈R n1×n2 such that vec(V)...

2048

[1] [1]

Ashkboos, S., Mohtashami, A., Croci, M

doi: 10.1109/IJCNN.2017.7966166. Ashkboos, S., Mohtashami, A., Croci, M. L., Li, B., Cameron, P., Jaggi, M., Alistarh, D., Hoefler, T., and Hensman, J. Quarot: Outlier-free 4-bit inference in ro- tated llms.Advances in Neural Information Processing Systems, 37:100213–100240,

work page doi:10.1109/ijcnn.2017.7966166 2017

[2] [2]

Chen, M., Tworek, J., Jun, H., Yuan, Q., de Oliveira Pinto, H

URL https://arxiv.org/abs/ 1911.11641. Chen, M., Tworek, J., Jun, H., Yuan, Q., de Oliveira Pinto, H. P., Kaplan, J., Edwards, H., Burda, Y ., Joseph, N., Brockman, G., Ray, A., Puri, R., Krueger, G., Petrov, M., Khlaaf, H., Sastry, G., Mishkin, P., Chan, B., Gray, S., Ryder, N., Pavlov, M., Power, A., Kaiser, L., Bavar- ian, M., Winter, C., Tillet, P., S...

Pith/arXiv arXiv 1911

[3] [3]

Chen, T., Li, Z., Xu, W., Zhu, Z., Li, D., Tian, L., Barsoum, E., Wang, P., and Cheng, J

URL https://arxiv.org/abs/ 2107.03374. Chen, T., Li, Z., Xu, W., Zhu, Z., Li, D., Tian, L., Barsoum, E., Wang, P., and Cheng, J. Ternaryllm: Ternarized large language model,

Pith/arXiv arXiv

[4] [4]

Clark, P., Cowhey, I., Etzioni, O., Khot, T., Sabharwal, A., Schoenick, C., and Tafjord, O

URL https://arxiv.org/ abs/2406.07177. Clark, P., Cowhey, I., Etzioni, O., Khot, T., Sabharwal, A., Schoenick, C., and Tafjord, O. Think you have solved question answering? try arc, the ai2 reasoning challenge,

arXiv

[5] [5]

Cobbe, K., Kosaraju, V ., Bavarian, M., Chen, M., Jun, H., Kaiser, L., Plappert, M., Tworek, J., Hilton, J., Nakano, R., Hesse, C., and Schulman, J

URL https://arxiv.org/abs/ 1803.05457. Cobbe, K., Kosaraju, V ., Bavarian, M., Chen, M., Jun, H., Kaiser, L., Plappert, M., Tworek, J., Hilton, J., Nakano, R., Hesse, C., and Schulman, J. Training verifiers to solve math word problems,

Pith/arXiv arXiv

[6] [6]

org/abs/2110.14168

URL https://arxiv. org/abs/2110.14168. Dettmers, T., Svirschevski, R., Egiazarian, V ., Kuznedelev, D., Frantar, E., Ashkboos, S., Borzunov, A., Hoefler, T., and Alistarh, D. Spqr: A sparse-quantized representation for near-lossless llm weight compression,

Pith/arXiv arXiv

[7] [7]

Frantar, E., Ashkboos, S., Hoefler, T., and Alistarh, D

URL https://arxiv.org/abs/2306.03078. Frantar, E., Ashkboos, S., Hoefler, T., and Alistarh, D. Gptq: Accurate post-training quantization for generative pre- trained transformers,

arXiv

[8] [8]

org/abs/2210.17323

URL https://arxiv. org/abs/2210.17323. Gu, H., Hu, L., Yu, L., Li, H., and Liu, F. Lopro: Enhanc- ing low-rank quantization via permuted block-wise rota- tion,

Pith/arXiv arXiv

[9] [9]

doi: 10.1145/3649329. 3658498. URL http://dx.doi.org/10.1145/ 3649329.3658498. Gul, H., Hu, L., Niu, S., and Liu, F. Flrq: Faster llm quan- tization with flexible low-rank matrix sketching,

work page doi:10.1145/3649329

[10] [10]

Guo, D., Yang, D., Zhang, H., Song, J., Wang, P., Zhu, Q., Xu, R., Zhang, R., Ma, S., Bi, X., Zhang, X., Yu, X., Wu, Y ., Wu, Z

URLhttps://arxiv.org/abs/2601.05684. Guo, D., Yang, D., Zhang, H., Song, J., Wang, P., Zhu, Q., Xu, R., Zhang, R., Ma, S., Bi, X., Zhang, X., Yu, X., Wu, Y ., Wu, Z. F., Gou, Z., Shao, Z., Li, Z., Gao, Z., Liu, A., Xue, B., Wang, B., Wu, B., Feng, B., Lu, C., Zhao, C., Deng, C., Ruan, C., Dai, D., Chen, D., Ji, D., Li, E., Lin, F., Dai, F., Luo, F., Hao, ...

arXiv

[11] [11]

Nature645,633–638

ISSN 1476-4687. doi: 10.1038/s41586-025-09422-z. URL http://dx. doi.org/10.1038/s41586-025-09422-z. Hendrycks, D., Burns, C., Basart, S., Zou, A., Mazeika, M., Song, D., and Steinhardt, J. Measuring massive multitask language understanding,

work page doi:10.1038/s41586-025-09422-z

[12] [12]

Huang, W., Qin, H., Liu, Y ., Li, Y ., Liu, Q., Liu, X., Benini, L., Magno, M., Zhang, S., and Qi, X

URL https: //arxiv.org/abs/2009.03300. Huang, W., Qin, H., Liu, Y ., Li, Y ., Liu, Q., Liu, X., Benini, L., Magno, M., Zhang, S., and Qi, X. Slim-llm: Salience- driven mixed-precision quantization for large language models.arXiv preprint arXiv:2405.14917,

Pith/arXiv arXiv 2009

[13] [13]

Li, M., Zhang, D., He, T., Xie, X., Li, Y .-F., and Qin, K

URL https://arxiv.org/ abs/1605.04711. Li, M., Zhang, D., He, T., Xie, X., Li, Y .-F., and Qin, K. Towards effective data-free knowledge distillation via diverse diffusion augmentation. InProceedings of the 32nd ACM International Conference on Multimedia, pp. 4416–4425, 2024a. Li, M., Zhang, D., Dong, Q., Xie, X., and Qin, K. Adaptive dataset quantization...

arXiv

[14] [14]

Arb-llm: Alternating refined binarizations for large language models.arXiv preprint arXiv:2410.03129, 2024b

Li, Z., Yan, X., Zhang, T., Qin, H., Xie, D., Tian, J., Kong, L., Zhang, Y ., Yang, X., et al. Arb-llm: Alternating refined binarizations for large language models.arXiv preprint arXiv:2410.03129, 2024b. Liu, K., Yang, K., Chen, Z., Li, Z., Guo, Y ., Li, W., Kong, L., and Zhang, Y . Bimacosr: Binary one-step diffusion model leveraging flexible matrix comp...

arXiv

[15] [15]

Merity, S., Xiong, C., Bradbury, J., and Socher, R

URL https://arxiv.org/ abs/2502.00333. Merity, S., Xiong, C., Bradbury, J., and Socher, R. Pointer sentinel mixture models.arXiv preprint arXiv:1609.07843,

arXiv

[16] [16]

Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., Zhou, Y ., Li, W., and Liu, P

URL https: //arxiv.org/abs/1606.06031. Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., Zhou, Y ., Li, W., and Liu, P. J. Explor- ing the limits of transfer learning with a unified text-to- text transformer,

Pith/arXiv arXiv

[17] [17]

Sakaguchi, K., Bras, R

URL https://arxiv.org/ abs/1910.10683. Sakaguchi, K., Bras, R. L., Bhagavatula, C., and Choi, Y . Winogrande: An adversarial winograd schema challenge at scale,

Pith/arXiv arXiv 1910

[18] [18]

Saxena, U., Sharify, S., Roy, K., and Wang, X

URL https://arxiv.org/abs/ 1907.10641. Saxena, U., Sharify, S., Roy, K., and Wang, X. Resq: Mixed- precision quantization of large language models with low- rank residuals.arXiv preprint arXiv:2412.14363,

Pith/arXiv arXiv 1907

[19] [19]

URL https: //arxiv.org/abs/2310.00034. Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozi `ere, B., Goyal, N., Hambro, E., Azhar, F., Rodriguez, A., Joulin, A., Grave, E., and Lample, G. Llama: Open and efficient foundation lan- guage models,

arXiv

[20] [20]

Wang, H., Ma, S., and Wei, F

URL https://arxiv.org/ abs/2302.13971. Wang, H., Ma, S., and Wei, F. Bitnet v2: Native 4-bit activa- tions with hadamard transformation for 1-bit llms,

Pith/arXiv arXiv

[21] [21]

Wang, P., Hu, Q., Zhang, Y ., Zhang, C., Liu, Y ., and Cheng, J

URLhttps://arxiv.org/abs/2504.18415. Wang, P., Hu, Q., Zhang, Y ., Zhang, C., Liu, Y ., and Cheng, J. Two-step quantization for low-bit neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June

arXiv

[22] [22]

11 TWLA: Achieving Ternary Weights and Low-Bit Activations for LLMs via Post-Training Quantization Xu, Z., Zhao, Z., Hu, X., Chen, Z., and Yang, D

URLhttps://arxiv.org/abs/2511.22316. 11 TWLA: Achieving Ternary Weights and Low-Bit Activations for LLMs via Post-Training Quantization Xu, Z., Zhao, Z., Hu, X., Chen, Z., and Yang, D. Kbvq- moe: Klt-guided svd with bias-corrected vector quanti- zation for moe large language models.arXiv preprint arXiv:2602.11184,

Pith/arXiv arXiv

[23] [23]

URL https://arxiv.org/abs/2510.03267. Yang, A., Li, A., Yang, B., Zhang, B., Hui, B., Zheng, B., Yu, B., Gao, C., Huang, C., Lv, C., Zheng, C., Liu, D., Zhou, F., Huang, F., Hu, F., Ge, H., Wei, H., Lin, H., Tang, J., Yang, J., Tu, J., Zhang, J., Yang, J., Yang, J., Zhou, J., Zhou, J., Lin, J., Dang, K., Bao, K., Yang, K., Yu, L., Deng, L., Li, M., Xue, M...

arXiv

[24] [24]

Zhang, W., Hou, L., Yin, Y ., Shang, L., Chen, X., Jiang, X., and Liu, Q

URL https://arxiv.org/abs/ 1905.07830. Zhang, W., Hou, L., Yin, Y ., Shang, L., Chen, X., Jiang, X., and Liu, Q. Ternarybert: Distillation-aware ultra-low bit bert,

Pith/arXiv arXiv 1905

[25] [25]

URL https://arxiv.org/abs/2009. 12812. Zhao, Z., Li, H., Liu, F., Lu, Y ., Wang, Z., Yang, T., Jiang, L., and Guan, H. Quark: Quantization-enabled circuit sharing for transformer acceleration by exploiting com- mon patterns in nonlinear operations. In2025 IEEE/ACM International Conference On Computer Aided Design (ICCAD), pp. 1–9. IEEE,

2009

[26] [26]

Specquant: Spectral decomposition and adaptive truncation for ultra-low-bit llms quantization

Zhao, Z., Liu, F., Wang, J., Guan, C., Wang, Z., Jiang, L., and Guan, H. Specquant: Spectral decomposition and adaptive truncation for ultra-low-bit llms quantization. In Proceedings of the AAAI Conference on Artificial Intelli- gence, volume 40, pp. 28786–28794, 2026a. Zhao, Z., Xu, Z., and Yang, D. Bwla: Breaking the barrier of w1ax post-training quanti...

Pith/arXiv arXiv

[27] [27]

org/abs/1612.01064

URL https://arxiv. org/abs/1612.01064. Zhu, C., Lin, Y ., Shao, J., Lin, J., and Wang, Y . Pathology- aware prototype evolution via llm-driven semantic disam- biguation for multicenter diabetic retinopathy diagnosis. InProceedings of the 33rd ACM International Confer- ence on Multimedia, pp. 9196–9205,

Pith/arXiv arXiv

[28] [28]

The storage reduction factor is therefore m2 n2 1 +n 2 2 = 16,777,216 8192 = 2048.(57) Thus, the Kronecker parameterization is2048×smaller in storage than a dense4096×4096transform

yields only n2 1 +n 2 2 = 642 + 642 = 8192(56) entries. The storage reduction factor is therefore m2 n2 1 +n 2 2 = 16,777,216 8192 = 2048.(57) Thus, the Kronecker parameterization is2048×smaller in storage than a dense4096×4096transform. B.3.3. FAST APPLICATION VIA RESHAPE–MULTIPLY–VECTORIZE Let v∈R 1×m. Reshape it into a matrix V∈R n1×n2 such that vec(V)...

2048