Minimizing the Hidden Cost of Scales: Graph-Guided Ultra-Low-Bit Quantization for Large Language Models

Amir Hussein; Dinesh Manocha; Min Wu; Rayyan Abdalla

arxiv: 2606.05429 · v1 · pith:RF6OS6SWnew · submitted 2026-06-03 · 💻 cs.AI

Rayyan Abdalla, Amir Hussein, Min Wu, Dinesh Manocha

Pith reviewed 2026-06-28 06:07 UTC · model grok-4.3

classification 💻 cs.AI

keywords post-training quantization · ultra-low-bit quantization · large language models · weight saliency · graph-guided grouping · scaling overhead · LLaMA models

The pith

SAGE-PTQ cuts average LLM weight bits to 1.03 and scaling bits to 0.004 by separating salient weights via distributional statistics and modeling unsalient weights as a sparse graph to set group counts.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces SAGE-PTQ, a post-training quantization method that first uses distributional statistics to split weights into salient and unsalient subsets. It then represents the subsampled unsalient weights as a sparse graph to determine the number of groups per layer, applies multi-bit precision to the salient weights and binarization to the rest, and uses one scale per channel for salient weights plus one scalar per unsalient group. Adaptive saliency thresholding chooses the split ratio per matrix. The paper reports that this combination reaches an average of 1.03 weight bits and 0.004 scaling bits per matrix while delivering lower perplexity on LLaMA models than prior ultra-low-bit methods and lower memory use during inference.
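The dual-mode idea can be illustrated with a short sketch. This is not the authors' implementation: the saliency rule (a weight is salient if it lies more than `k` standard deviations from the matrix mean), the 4-bit symmetric quantizer, and the grouping of unsalient weights by magnitude quantiles are all assumptions made for brevity, and every name below is hypothetical. The cost of storing which group each weight belongs to is ignored.

```python
import numpy as np

def dual_mode_quantize(W, k=3.0, salient_bits=4, n_groups=2):
    """Hypothetical sketch: multi-bit salient weights, binarized unsalient weights."""
    W = np.asarray(W, dtype=np.float64)
    # Assumed saliency rule: weights far from the matrix mean are salient.
    salient = np.abs(W - W.mean()) > k * W.std()
    W_hat = np.zeros_like(W)

    # Salient part: symmetric uniform quantization, one scale per row (channel).
    qmax = 2 ** (salient_bits - 1) - 1
    S = np.where(salient, W, 0.0)
    row_scale = np.abs(S).max(axis=1, keepdims=True) / qmax
    row_scale[row_scale == 0] = 1.0  # rows without salient weights
    S_hat = np.clip(np.round(S / row_scale), -qmax, qmax) * row_scale
    W_hat[salient] = S_hat[salient]

    # Unsalient part: assumed magnitude-quantile groups, one scalar per group.
    U = W[~salient]
    edges = np.quantile(np.abs(U), np.linspace(0.0, 1.0, n_groups + 1))
    group = np.searchsorted(edges, np.abs(U), side="right") - 1
    group = np.clip(group, 0, n_groups - 1)
    U_hat = np.empty_like(U)
    scales = np.zeros(n_groups)
    for g in range(n_groups):
        members = group == g
        if members.any():
            scales[g] = np.abs(U[members]).mean()  # least-squares scale for sign(w)
            U_hat[members] = scales[g] * np.sign(U[members])
    W_hat[~salient] = U_hat
    return W_hat, salient, scales
```

With `n_groups=2`, every unsalient weight is reconstructed as one of at most four values (plus or minus one of the two group scalars), which is what keeps the number of stored scales small.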

Core claim

SAGE-PTQ separates salient and unsalient weights using distributional statistics and models subsampled unsalient weights as a sparse graph to estimate the optimal number of groups per layer. It applies dual-mode quantization, with multi-bit precision on salient weights and binarization on unsalient weights, employs one per-channel scale for salient weights and one scalar per unsalient group, and uses adaptive saliency thresholding. The reported result is an average of 1.03 weight bits and 0.004 scaling bits per matrix, with 6.74 WikiText2 perplexity on LLaMA-3-8B versus 55.8 for BiLLM at under 50 percent of BiLLM's GPU memory, plus 1.5x faster decoding on LLaMA-2-70B.
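To make the "scaling bits" figure concrete, the per-weight overhead of scale factors is the total number of bits spent on scales divided by the number of weights. The sketch below assumes 16-bit scales and a hypothetical 4096 x 4096 matrix with two unsalient groups; these values are illustrative and are not taken from the paper's own accounting.

```python
def scaling_bits_per_weight(rows, cols, n_channel_scales, n_group_scales, scale_bits=16):
    """Average bits per weight spent on scale factors (illustrative accounting)."""
    total_scale_bits = (n_channel_scales + n_group_scales) * scale_bits
    return total_scale_bits / (rows * cols)

# Assumed setup: one 16-bit scale per output channel plus two group scalars.
overhead = scaling_bits_per_weight(4096, 4096, n_channel_scales=4096, n_group_scales=2)
print(f"{overhead:.4f}")  # prints 0.0039
```

Under these assumptions the overhead is of the same order as the 0.004 scaling bits reported. For comparison, a block-wise scheme that stores one 16-bit scale per 128 weights would spend 16 / 128 = 0.125 bits per weight on scales alone.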

What carries the argument

Saliency-aware graph-guided group estimation that models unsalient weights as a sparse graph to determine per-layer group counts while keeping one scalar scale per group.
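The supplied text does not say how the sparse graph is built or how the group count is read from it. As a stand-in, the sketch below uses a standard spectral fact: the number of zero eigenvalues of a graph Laplacian equals the number of connected components. It builds a symmetrised k-nearest-neighbour graph on subsampled weight magnitudes and counts near-zero eigenvalues of the normalised Laplacian. The function name, the kNN construction, and all parameters are assumptions, not the paper's estimator.

```python
import numpy as np

def estimate_group_count(weights, n_samples=256, k=8, tol=1e-8, seed=0):
    """Hypothetical estimator: connected components of a kNN graph on |w|."""
    rng = np.random.default_rng(seed)
    x = np.abs(np.asarray(weights, dtype=np.float64).ravel())
    if x.size > n_samples:
        x = rng.choice(x, size=n_samples, replace=False)  # subsample the weights
    n = x.size
    k = min(k, n - 1)

    # Pairwise distances between magnitudes; exclude self-matches.
    dist = np.abs(x[:, None] - x[None, :])
    np.fill_diagonal(dist, np.inf)
    nn = np.argsort(dist, axis=1)[:, :k]

    # Sparse kNN adjacency, symmetrised so the Laplacian is symmetric.
    A = np.zeros((n, n))
    A[np.repeat(np.arange(n), k), nn.ravel()] = 1.0
    A = np.maximum(A, A.T)

    # Normalised Laplacian L = I - D^{-1/2} A D^{-1/2}.
    d = A.sum(axis=1)
    L = np.eye(n) - A / np.sqrt(np.outer(d, d))

    # Each connected component contributes one (numerically) zero eigenvalue.
    eig = np.linalg.eigvalsh(L)
    return int(np.sum(eig < tol))
```

This only separates groups whose magnitudes are clearly disjoint. A smooth unimodal weight distribution would typically yield a single component, so a practical estimator would need a softer criterion than the one assumed here.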

If this is right

Models quantized with SAGE-PTQ fit in less than half the GPU memory of BiLLM while maintaining usable perplexity.
Decoding on a single NVIDIA L40 GPU is 1.5x faster for LLaMA-2-70B.
The per-matrix adaptive threshold removes the need for manual saliency ratio tuning across layers.
Only one scale factor per unsalient group replaces the multiple scales required by earlier group-wise methods.
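Why a single scalar can serve a whole binarized group follows from a standard least-squares argument, given here as general background rather than as the paper's derivation. For a group G of weights w_i reconstructed as alpha times s_i with s_i in {-1, +1}:

```latex
\begin{aligned}
E(\alpha, s) &= \sum_{i \in G} (w_i - \alpha s_i)^2
  = \sum_{i \in G} w_i^2 - 2\alpha \sum_{i \in G} w_i s_i + \alpha^2 |G|,
  \qquad s_i \in \{-1, +1\},\ \alpha \ge 0 \\
s_i^\star &= \operatorname{sign}(w_i)
  \;\Rightarrow\; \sum_{i \in G} w_i s_i^\star = \sum_{i \in G} |w_i| \\
\frac{\partial E}{\partial \alpha} &= -2 \sum_{i \in G} |w_i| + 2\alpha |G| = 0
  \;\Rightarrow\; \alpha^\star = \frac{1}{|G|} \sum_{i \in G} |w_i| \\
E(\alpha^\star, s^\star) &= \sum_{i \in G} w_i^2 - \frac{1}{|G|} \Big( \sum_{i \in G} |w_i| \Big)^2
\end{aligned}
```

The residual shrinks as the magnitudes within a group become more alike, which is why the choice of groups matters more than the number of scales stored per group.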

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The graph-modeling step could be replaced by a cheaper heuristic if the optimal group count correlates strongly with simple statistics such as weight variance.
Extending the same separation logic to activation quantization might further reduce total inference cost without retraining.
The reported memory savings suggest the method could enable running 70B models on consumer GPUs that previously handled only 7B models.

Load-bearing premise

That distributional statistics can reliably identify which weights are salient versus unsalient, and that a sparse graph on the unsalient subset yields group counts that preserve final model quality.

What would settle it

Running SAGE-PTQ on LLaMA-3-8B and measuring WikiText2 perplexity above 10 or average scaling bits above 0.01 would show the claimed bit rates and quality cannot be reached simultaneously.
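The test above reduces to two measurements and two thresholds. A minimal sketch, assuming per-token negative log-likelihoods in nats are already available from an evaluation run (the function names are hypothetical):

```python
import math

def perplexity(token_nlls):
    """Perplexity is exp of the mean negative log-likelihood per token (nats)."""
    return math.exp(sum(token_nlls) / len(token_nlls))

def claim_refuted(ppl, scaling_bits, ppl_limit=10.0, scaling_limit=0.01):
    """True if either measurement crosses the thresholds stated above."""
    return ppl > ppl_limit or scaling_bits > scaling_limit

# The reported SAGE-PTQ numbers pass the check; BiLLM's reported perplexity would not.
print(claim_refuted(6.74, 0.004))   # prints False
print(claim_refuted(55.8, 0.004))   # prints True
```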

Figures

Figures reproduced from arXiv: 2606.05429 by Amir Hussein, Dinesh Manocha, Min Wu, Rayyan Abdalla.

**Figure 2.** Memory footprint of SAGE-PTQ versus BiLLM

**Figure 3.** Ablation results of SAGE-PTQ on OPT 6.7B, LLaMA 7B, and DeepSeek 7B. Full pipeline achieves best perplexity. [Plot: panels "Perplexity on Wikitext2" and "Perplexity on C4"; x-axis OPT model size, 1.3B to 66B; y-axis perplexity; series NLUT = 3, 4, 5, 6]

**Figure 5.** Distribution of self-attention projection layer weights in OPT-6.7B across selected layers

**Figure 6.** Distribution of self-attention projection layer weights in LLaMA-1 7B across Self-Attention …

**Figure 7.** Self-Attention Weight Distributions in DeepSeek-7B of selected layers (0,15,29): Non …

**Figure 8.** Extended evaluation of SAGE-PTQ versus BiLLM under a lookup table constraint of NLUT = 4 bits across multiple model families (OPT, LLaMA-1, LLaMA-2, and instruction-tuned Vicuna) ranging from 13B to 70B parameters. The bar plots report perplexity on WikiText2, PTB, and C4 datasets. SAGE-PTQ consistently outperforms BiLLM across all models, achieving an average precision of 1.03–1.07 bits per weight.

**Figure 9.** SAGE-PTQ Ablation study on saliency metrics across three model families (OPT-6.7B, …

**Figure 10.** Perplexity impact of varying lookup table (LUT) sizes in …

Original abstract

Post-training quantization (PTQ) is critical for the efficient deployment of large language models (LLMs). Recent ultra-low-bit PTQ methods rely on rigid weight-saliency assumptions or position heuristics, introducing substantial hidden scaling overhead. We propose SAGE-PTQ (Saliency-Aware Graph-guided Efficient PTQ), a novel ultra-low-bit quantization framework for LLMs that minimizes hidden scaling cost. SAGE-PTQ separates salient and unsalient weights using distributional statistics, then models subsampled unsalient weights as a sparse graph to estimate the optimal number of groups per layer. SAGE-PTQ applies dual-mode quantization, assigning multi-bit precision to salient weights and binarizing unsalient weights. To reduce scaling overhead, SAGE-PTQ uses one per-channel scale for salient weights and one scalar per unsalient group. Finally, SAGE-PTQ implements adaptive saliency thresholding to select the optimal saliency ratio per matrix. SAGE-PTQ achieves 1.03 weight bits and only 0.004 scaling bits per matrix on average, outperforming state-of-the-art methods such as BiLLM and PB-LLM. On LLaMA-3-8B, SAGE-PTQ achieves 6.74 WikiText2 perplexity, compared to 55.8 for BiLLM, while using less than 50% of BiLLM's GPU memory. On LLaMA-2-70B, SAGE-PTQ provides 1.5x faster decoding on one NVIDIA L40 GPU, demonstrating practical inference efficiency.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

SAGE-PTQ combines distributional saliency splitting with sparse-graph group estimation to cut scaling bits in ultra-low-bit PTQ, but the abstract gives no ablations or protocol, so the headline numbers are hard to trust yet.

The paper's core move is to split weights by distributional stats, treat the unsalient part as a sparse graph to pick per-layer group counts, then run dual-mode quantization with one scale per salient channel and one scalar per unsalient group, plus adaptive thresholding. That pipeline is presented as new and it directly targets the scaling overhead that shows up in BiLLM-style methods.

It does a clean job naming the practical cost (extra scales) and proposing a concrete fix that keeps average scaling bits at 0.004 while claiming 1.03 weight bits. The LLaMA-3-8B result (6.74 WikiText2 PPL vs 55.8 for BiLLM) and the 1.5x decode speedup on LLaMA-2-70B are the kind of numbers that would matter for memory-tight deployment if they hold.

The soft spot is exactly what the stress-test flags: the abstract supplies no ablation on whether the graph-derived group count matches what exhaustive search would pick, no stability check on the saliency threshold across layers or models, and no experimental protocol or error bars. Without those, the separation step and the graph step remain untested assumptions. The full text might contain the missing runs, but on the supplied material the central empirical claim rests on unshown evidence.

This is for groups already working on PTQ for LLMs who need sub-2-bit inference. A reader who wants to test graph-based group estimation would find the description useful to replicate, but anyone expecting verified numbers should wait for the details. It deserves a serious referee because the algorithmic construction is specific and the claimed gains are large enough to check.

Referee Report

2 major / 1 minor

Summary. The paper introduces SAGE-PTQ, a post-training quantization framework for LLMs that separates salient and unsalient weights using distributional statistics, models the unsalient subset as a sparse graph to estimate per-layer group counts, applies dual-mode quantization (multi-bit for salient weights, binary for unsalient), and reduces scaling overhead via one per-channel scale for salient weights and one scalar per unsalient group, plus adaptive saliency thresholding. It reports average 1.03 weight bits and 0.004 scaling bits per matrix, with LLaMA-3-8B achieving 6.74 WikiText2 perplexity (vs. 55.8 for BiLLM) and efficiency gains on LLaMA-2-70B.

Significance. If the empirical results and underlying assumptions hold under rigorous validation, the work would be significant for ultra-low-bit LLM quantization by directly targeting hidden scaling costs, a practical bottleneck in prior methods like BiLLM. The graph-guided group estimation offers a potentially parameter-light way to control overhead while preserving quality, which could influence deployment strategies if the separation and estimation steps prove robust across models.

major comments (2)

[Experimental Results] Experimental Results section: The headline claims (1.03 weight bits, 0.004 scaling bits, 6.74 WikiText2 PPL on LLaMA-3-8B vs. 55.8 for BiLLM) are presented without error bars, run counts, or explicit baseline re-implementation details and ablation tables; this directly affects verifiability of the dual-mode scheme's quality preservation at the reported bit rates.
[Method] Method section on sparse-graph group estimation: The central claim that distributional statistics cleanly partition weights and that the sparse-graph model on subsampled unsalient weights yields an accurate per-layer group count (enabling the 0.004 scaling-bit overhead) lacks any comparison to exhaustive search or ablation showing that the derived group numbers match those that would minimize quantization error without quality degradation.

minor comments (1)

[Method] The description of adaptive saliency thresholding would benefit from an explicit equation or pseudocode for how the optimal saliency ratio per matrix is selected, as the current prose leaves the decision criterion implicit.
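Since the source leaves the decision criterion implicit, the following is only one hypothetical way such a selection could work, not the paper's rule. It scans a small grid of saliency ratios and keeps the one that minimises relative reconstruction error plus a penalty on average bits. The single shared scale for salient weights, the candidate grid, and the penalty weight `lam` are assumptions.

```python
import numpy as np

def pick_saliency_ratio(W, ratios=(0.0, 0.01, 0.02, 0.05, 0.1), salient_bits=4, lam=0.05):
    """Hypothetical criterion: relative MSE plus lam times average bits per weight."""
    W = np.asarray(W, dtype=np.float64)
    mag = np.abs(W)
    qmax = 2 ** (salient_bits - 1) - 1
    best_cost, best_ratio = None, None
    for r in ratios:
        # Weights above the (1 - r) magnitude quantile are treated as salient.
        thresh = np.quantile(mag, 1.0 - r) if r > 0 else np.inf
        salient = mag > thresh
        W_hat = np.empty_like(W)
        if salient.any():
            scale = mag[salient].max() / qmax  # one shared scale, for brevity
            W_hat[salient] = np.round(W[salient] / scale) * scale
        rest = ~salient
        W_hat[rest] = mag[rest].mean() * np.sign(W[rest])  # binarize the rest
        avg_bits = r * salient_bits + (1.0 - r) * 1.0
        cost = np.mean((W - W_hat) ** 2) / np.mean(W ** 2) + lam * avg_bits
        if best_cost is None or cost < best_cost:
            best_cost, best_ratio = cost, r
    return best_ratio
```

A matrix whose entries are all +1 or -1 is reproduced exactly by binarization, so this criterion picks ratio 0 for it, while a matrix with a few large outliers is assigned a non-zero ratio.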

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on improving the verifiability of our empirical claims and the validation of the sparse-graph group estimation. We provide point-by-point responses to the major comments below, indicating where revisions will be made.

Referee: [Experimental Results] Experimental Results section: The headline claims (1.03 weight bits, 0.004 scaling bits, 6.74 WikiText2 PPL on LLaMA-3-8B vs. 55.8 for BiLLM) are presented without error bars, run counts, or explicit baseline re-implementation details and ablation tables; this directly affects verifiability of the dual-mode scheme's quality preservation at the reported bit rates.

Authors: Post-training quantization is a deterministic process once calibration data and hyperparameters are fixed, which is standard in the field and explains the absence of error bars or multiple run counts. We agree, however, that explicit details on baseline re-implementations and additional ablation tables would strengthen verifiability. In the revised manuscript we will add a dedicated subsection describing how BiLLM and PB-LLM were re-implemented (including calibration settings and code availability) and include new ablation tables on the dual-mode quantization components. The reported numbers remain unchanged as they were obtained under the documented protocol.

revision: partial

Referee: [Method] Method section on sparse-graph group estimation: The central claim that distributional statistics cleanly partition weights and that the sparse-graph model on subsampled unsalient weights yields an accurate per-layer group count (enabling the 0.004 scaling-bit overhead) lacks any comparison to exhaustive search or ablation showing that the derived group numbers match those that would minimize quantization error without quality degradation.

Authors: Exhaustive search over per-layer group counts is computationally prohibitive for models at the scale of LLaMA-70B. The sparse-graph model is presented as an efficient, structure-aware approximation rather than an exact optimizer. To address the concern we will add an ablation study on smaller models (LLaMA-7B and LLaMA-13B) that compares the group counts produced by the graph estimator against those obtained via grid search minimizing per-layer quantization error. The results of this comparison, together with the corresponding perplexity impact, will be reported in the revised Method and Experiments sections.

revision: yes

Circularity Check

0 steps flagged

No circularity; method is an independent algorithmic construction

The provided abstract and description contain no equations, derivations, or self-citations that reduce any claimed result to its own inputs by construction. Separation of salient/unsalient weights via distributional statistics, sparse-graph group estimation, dual-mode quantization, and adaptive thresholding are presented as procedural choices whose outputs are validated empirically on perplexity and bit-rate metrics. No fitted parameter is relabeled as a prediction, no uniqueness theorem is imported from prior self-work, and no ansatz is smuggled via citation. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available; no explicit free parameters, axioms, or invented entities can be extracted beyond the high-level algorithmic choices described.

Reference graph

Works this paper leans on

54 extracted references · 27 canonical work pages · 13 internal anchors

[1]

OPT: Open Pre-trained Transformer Language Models

Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen, Kantharaj Dewan, et al. Opt: Open pre-trained transformer language models.arXiv preprint arXiv:2205.01068, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[2]

LLaMA: Open and Efficient Foundation Language Models

Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, et al. Llama: Open and efficient foundation language models.arXiv preprint arXiv:2302.13971, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[3]

Llama 2: Open Foundation and Fine-Tuned Chat Models

Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Sergey Bashlykov, et al. Llama 2: Open foundation and fine-tuned chat models.arXiv preprint arXiv:2307.09288, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[4]

The Llama 3 Herd of Models

Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al- Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, Amy Yang, Angela Fan, Anirudh Goyal, et al. The llama 3 herd of models.arXiv preprint arXiv:2407.21783, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[5]

Deepseek-vl: Scaling vision-language models with decoupled pretraining

DeepSeek AI. Deepseek-vl: Scaling vision-language models with decoupled pretraining. https:// github.com/deepseek-ai, 2024. Accessed: 2024-07-25

2024
[6]

Evaluating Large Language Models Trained on Code

Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique de Oliveira Pinto, Jared Kaplan, Harri Edwards, et al. Evaluating large language models trained on code.arXiv preprint arXiv:2107.03374, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021
[7]

Chain-of-Thought Prompting Elicits Reasoning in Large Language Models

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc Le, and Denny Zhou. Chain-of-thought prompting elicits reasoning in large language models.arXiv preprint arXiv:2201.11903, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[8]

GPT-4 Technical Report

OpenAI. Gpt-4 technical report.arXiv preprint arXiv:2303.08774, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[9]

Abq-llm: Arbitrary-bit quantized inference acceleration for large language models

Chao Zeng, Songwei Liu, Yusheng Xie, Hong Liu, Xiaojian Wang, Miao Wei, Shu Yang, Fangmin Chen, and Xing Mei. Abq-llm: Arbitrary-bit quantized inference acceleration for large language models. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 22299–22307, 2025

2025
[10]

Post-training quantization for vision transformer.arXiv preprint arXiv:2208.13555, 2022

Xing Liu, Zhenhua Zhang, Qinghao Ye, Yiren Lin, Xiangyu Zhang, and Jian Sun. Post-training quantization for vision transformer.arXiv preprint arXiv:2208.13555, 2022

work page arXiv 2022
[11]

arXiv preprint arXiv:1909.10351 , year=

Xiaoqi Jiao, Yichun Yin, Lifeng Shang, Xin Jiang, Xiao Chen, Linlin Li, Fang Wang, and Qun Liu. Tinybert: Distilling bert for natural language understanding.arXiv preprint arXiv:1909.10351, 2019

work page arXiv 1909
[13]

Dynamic sparse attention for scalable transformer acceleration.arXiv preprint arXiv:2301.11270, 2023

Zhuohan Ma and et al. Dynamic sparse attention for scalable transformer acceleration.arXiv preprint arXiv:2301.11270, 2023

work page arXiv 2023
[14]

Knowledge distillation: A survey

Jianping Gou, Baosheng Yu, Stephen J Maybank, and Dacheng Tao. Knowledge distillation: A survey. International Journal of Computer Vision, 129:1789–1819, 2021

2021
[15]

Efficient language model distillation using hugging face transformers.arXiv preprint arXiv:2301.11734, 2023

Lewis Tunstall, Leandro von Werra, and Thomas Wolf. Efficient language model distillation using hugging face transformers.arXiv preprint arXiv:2301.11734, 2023

work page arXiv 2023
[16]

LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale

Tim Dettmers, Mike Lewis, Younes Belkada, and Luke Zettlemoyer. Llm.int8(): 8-bit matrix multiplication for transformers at scale.arXiv preprint arXiv:2208.07339, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[17]

Smoothquant: Accurate and efficient post-training quantization for large language models.arXiv preprint arXiv:2211.10438, 2023

Ziheng Xiao, Ren Yan, Shaohuai Wang, Zhe Wei, Ang Li, Mingyu Zhang, Xiaowei Li, and Yiying Wang. Smoothquant: Accurate and efficient post-training quantization for large language models.arXiv preprint arXiv:2211.10438, 2023

work page arXiv 2023
[18]

GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers

Elias Frantar, Saleh Ashkboos, Torsten Hoefler, and Dan Alistarh. Gptq: Accurate post-training quantization for generative pre-trained transformers.arXiv preprint arXiv:2210.17323, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[19]

Awq: Activation-aware weight quantization for on-device llm compression and acceleration.Proceedings of machine learning and systems, 6:87–100, 2024

Ji Lin, Jiaming Tang, Haotian Tang, Shang Yang, Wei-Ming Chen, Wei-Chen Wang, Guangxuan Xiao, Xingyu Dang, Chuang Gan, and Song Han. Awq: Activation-aware weight quantization for on-device llm compression and acceleration.Proceedings of machine learning and systems, 6:87–100, 2024

2024
[20]

Pb-llm: Partially binarized large language models.arXiv preprint arXiv:2310.00034, 2023

Yuzhang Shang, Zhihang Yuan, Qiang Wu, and Zhen Dong. Pb-llm: Partially binarized large language models.arXiv preprint arXiv:2310.00034, 2023. 10

work page arXiv 2023
[21]

Billm: Pushing the limit of post-training quantization for llms.arXiv preprint arXiv:2402.04291, 2024

Wei Huang, Yangdong Liu, Haotong Qin, Ying Li, Shiming Zhang, Xianglong Liu, Michele Magno, and Xi- aojuan Qi. Billm: Pushing the limit of post-training quantization for llms.arXiv preprint arXiv:2402.04291, 2024

work page arXiv 2024
[22]

Up or down? adaptive rounding for post-training quantization

Markus Nagel, Rana Ali Amjad, Mart Van Baalen, Christos Louizos, and Tijmen Blankevoort. Up or down? adaptive rounding for post-training quantization. InInternational conference on machine learning, pages 7197–7206. PMLR, 2020

2020
[23]

Brecq: Pushing the limit of post-training quantization by block reconstruction.arXiv preprint arXiv:2102.05426, 2021

Yuhang Li, Ruihao Gong, Xu Tan, Yang Yang, Peng Hu, Qi Zhang, Fengwei Yu, Wei Wang, and Shi Gu. Brecq: Pushing the limit of post-training quantization by block reconstruction.arXiv preprint arXiv:2102.05426, 2021

work page arXiv 2021
[24]

Zeroquant: Efficient and affordable post-training quantization for large-scale transformers.Advances in neural information processing systems, 35:27168–27183, 2022

Zhewei Yao, Reza Yazdani Aminabadi, Minjia Zhang, Xiaoxia Wu, Conglong Li, and Yuxiong He. Zeroquant: Efficient and affordable post-training quantization for large-scale transformers.Advances in neural information processing systems, 35:27168–27183, 2022

2022
[25]

Quip#: Even better llm quantization with hadamard incoherence and lattice codebooks.arXiv preprint arXiv:2402.04396,

Albert Tseng, Jerry Chee, Qingyao Sun, V olodymyr Kuleshov, and Christopher De Sa. Quip#: Even better llm quantization with hadamard incoherence and lattice codebooks.arXiv preprint arXiv:2402.04396, 2024

work page arXiv 2024
[26]

Sqpr: Stream query planning with reuse

Evangelia Kalyvianaki, Wolfram Wiesemann, Quang Hieu Vu, Daniel Kuhn, and Peter Pietzuch. Sqpr: Stream query planning with reuse. In2011 IEEE 27th International Conference on Data Engineering, pages 840–851. IEEE, 2011

2011
[27]

arXiv preprint arXiv:2306.07629 , year=

Sehoon Kim, Coleman Hooper, Amir Gholami, Zhen Dong, Xiuyu Li, Sheng Shen, Michael W Mahoney, and Kurt Keutzer. Squeezellm: Dense-and-sparse quantization.arXiv preprint arXiv:2306.07629, 2023

work page arXiv 2023
[28]

Omniquant: Omnidirectionally calibrated quantization for large lan- guage models.arXiv preprint arXiv:2308.13137,

Wenqi Shao, Mengzhao Chen, Zhaoyang Zhang, Peng Xu, Lirui Zhao, Zhiqian Li, Kaipeng Zhang, Peng Gao, Yu Qiao, and Ping Luo. Omniquant: Omnidirectionally calibrated quantization for large language models.arXiv preprint arXiv:2308.13137, 2023

work page arXiv 2023
[29]

Bcq: Block clustered quantization for 4-bit (w4a4) llm inference.arXiv preprint arXiv:2502.05376, 2025

Reena Elangovan, Charbel Sakr, Anand Raghunathan, and Brucek Khailany. Bcq: Block clustered quantization for 4-bit (w4a4) llm inference.arXiv preprint arXiv:2502.05376, 2025

work page arXiv 2025
[30]

Hawq: Hessian aware quantization of neural networks with mixed-precision

Zhen Dong, Zhewei Yao, Amir Gholami, Michael W Mahoney, and Kurt Keutzer. Hawq: Hessian aware quantization of neural networks with mixed-precision. InProceedings of the IEEE/CVF international conference on computer vision, pages 293–302, 2019

2019
[31]

Qdrop: Randomly dropping quantization for extremely low-bit post-training quantization.arXiv preprint arXiv:2203.05740, 2022

Xiuying Wei, Ruihao Gong, Yuhang Li, Xianglong Liu, and Fengwei Yu. Qdrop: Randomly dropping quantization for extremely low-bit post-training quantization.arXiv preprint arXiv:2203.05740, 2022

work page arXiv 2022
[32]

Self-tuning spectral clustering.Advances in neural information processing systems, 17:1601–1608, 2005

Lihi Zelnik-Manor and Pietro Perona. Self-tuning spectral clustering.Advances in neural information processing systems, 17:1601–1608, 2005

2005
[33]

On spectral clustering: Analysis and an algorithm

Andrew Y Ng, Michael I Jordan, and Yair Weiss. On spectral clustering: Analysis and an algorithm. Advances in neural information processing systems, 14:849–856, 2002

2002
[34]

A tutorial on spectral clustering.Statistics and computing, 17(4):395–416, 2007

Ulrike von Luxburg. A tutorial on spectral clustering.Statistics and computing, 17(4):395–416, 2007

2007
[35]

Silhouettes: a graphical aid to the interpretation and validation of cluster analysis

Peter J Rousseeuw. Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. Journal of computational and applied mathematics, 20:53–65, 1987

1987
[36]

Compressing deep convolutional networks using vector quantization

Yunchao Gong, Liu Liu, Ming Yang, and Lubomir Bourdev. Compressing deep convolutional networks using vector quantization. InInternational Conference on Learning Representations (ICLR), 2014

2014
[37]

Clip-q: Deep network compression learning by in-parallel pruning- quantization

Frederick Tung and Greg Mori. Clip-q: Deep network compression learning by in-parallel pruning- quantization. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 7873–7882, 2018

2018
[38]

Bi-vlm: Pushing ultra-low precision post-training quantization boundaries in vision-language models

Xijun Wang, Junyun Huang, Rayyan Abdalla, Chengyuan Zhang, Ruiqi Xian, and Dinesh Manocha. Bi-vlm: Pushing ultra-low precision post-training quantization boundaries in vision-language models. arXiv preprint arXiv:2509.18763, 2025

work page arXiv 2025
[39]

Towards superior quantization accuracy: A layer-sensitive approach.arXiv e-prints, pages arXiv–2503, 2025

Feng Zhang, Yanbin Liu, Weihua Li, Jie Lv, Xiaodan Wang, and Quan Bai. Towards superior quantization accuracy: A layer-sensitive approach.arXiv e-prints, pages arXiv–2503, 2025

2025
[40]

Courier Corporation, 2013

Richard P Brent.Algorithms for minimization without derivatives. Courier Corporation, 2013. 11

2013
[41]

Pytorch: An imperative style, high-performance deep learning library.Advances in neural information processing systems, 32, 2019

Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. Pytorch: An imperative style, high-performance deep learning library.Advances in neural information processing systems, 32, 2019

2019
[42]

HuggingFace's Transformers: State-of-the-art Natural Language Processing

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, et al. Huggingface’s transformers: State-of-the-art natural language processing.arXiv preprint arXiv:1910.03771, 2019

work page internal anchor Pith review Pith/arXiv arXiv 1910
[43]

The llama 3 herd of models.arXiv e-prints, pages arXiv–2407, 2024

Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. The llama 3 herd of models.arXiv e-prints, pages arXiv–2407, 2024

2024
[44]

Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality.See https://vicuna

Wei-Lin Chiang, Zhuohan Li, Ziqing Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E Gonzalez, et al. Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality.See https://vicuna. lmsys. org (accessed 14 April 2023), 2(3):6, 2023

2023
[45]

Qwen2.5: A party of foundation models

Qwen Team. Qwen2.5: A party of foundation models. https://qwenlm.github.io/blog/qwen2.5/,
[46]

Qwen3 Technical Report

An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, Chujie Zheng, Dayiheng Liu, Fan Zhou, Fei Huang, Feng Hu, et al. Qwen3 technical report.arXiv preprint arXiv:2505.09388, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[47]

Pointer Sentinel Mixture Models

Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. Pointer sentinel mixture models. In arXiv preprint arXiv:1609.07843, 2016

work page internal anchor Pith review Pith/arXiv arXiv 2016
[48]

Exploring the limits of transfer learning with a unified text-to-text transformer

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 2020

2020
[49]

Piqa: Reasoning about physi- cal commonsense in natural language

Yonatan Bisk, Rowan Zellers, Ronan Le Bras, Jianfeng Gao, and Yejin Choi. Piqa: Reasoning about physi- cal commonsense in natural language. InProceedings of the AAAI Conference on Artificial Intelligence, 2020

2020
[50]

Boolq: Exploring the surprising difficulty of natural yes/no questions

Christopher Clark and Kenton Lee. Boolq: Exploring the surprising difficulty of natural yes/no questions. InProceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics, 2019

2019
[51]

Can a suit of armor conduct electricity? a new dataset for open book question answering

Todor Mihaylov, Peter Clark, Tushar Khot, and Ashish Sabharwal. Can a suit of armor conduct electricity? a new dataset for open book question answering. InProceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, 2018

2018
[52]

Winogrande: An adversarial winograd schema challenge at scale

Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. Winogrande: An adversarial winograd schema challenge at scale. InProceedings of the AAAI Conference on Artificial Intelligence, 2021

2021
[53]

Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge

Peter Clark and Oren Etzioni. Think you have solved question answering? try arc, the ai2 reasoning challenge. InarXiv preprint arXiv:1803.05457, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018
[54]

Hellaswag: Can a machine really finish your sentence? InProceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 2019

Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. Hellaswag: Can a machine really finish your sentence? InProceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 2019

[55] Mitchell P Marcus, Mary Ann Marcinkiewicz, and Beatrice Santorini. The Penn Treebank: Annotating predicate argument structure. In Human Language Technology, 1994

A Statistical Analysis of Weight Matrices and Outlier Pattern Detection

Efficient post-training quantization for large language models (LLMs) demands careful modeling of weight matrix statisti...

