BPDQ: Bit-Plane Decomposition Quantization on a Variable Grid for Large Language Models

Chaofan Tao; He Xiao; Hongxia Yang; Jing Xiong; Jungang Li; Junyu Chen; Long Shi; Mengzhao Chen; Ngai Wong; Qingyao Yang

arxiv: 2602.04163 · v2 · pith:HVAIRJ6Jnew · submitted 2026-02-04 · 💻 cs.LG

Junyu Chen, Jungang Li, Jing Xiong, Wenjie Wang, Qingyao Yang, He Xiao, Zhen Li, Taiqiang Wu, Mengzhao Chen, Zhen Peng, Chaofan Tao, Long Shi, Hongxia Yang, Ngai Wong

Pith reviewed 2026-05-21 14:04 UTC · model grok-4.3

classification 💻 cs.LG

keywords post-training quantization · large language models · bit-plane decomposition · variable quantization grid · low-bit inference · Hessian-based optimization

The pith

BPDQ builds a variable grid from bit-planes to make 2-bit quantization work for large language models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes Bit-Plane Decomposition Quantization to overcome the restrictions of fixed grids in low-bit post-training quantization. Fixed uniform intervals limit how much error can be reduced at 2 or 3 bits. BPDQ instead forms a variable grid by breaking weights into bit-planes plus scalar coefficients and refines them step by step with second-order information to cut output mismatch. This change lets a 72-billion-parameter model run at 2 bits on a single consumer GPU while keeping GSM8K accuracy at 83.85 percent against 90.83 percent at full precision. The authors also show mathematically that the variable grid widens the set of usable quantizations and keeps the updates aligned with the Hessian-based objective.

Core claim

Bit-Plane Decomposition Quantization constructs a variable quantization grid through bit-plane decomposition and scalar coefficients, then iteratively refines these elements with second-order Hessian information to compensate for quantization errors and reduce the difference between quantized and original model outputs.

What carries the argument

Variable quantization grid created by bit-plane decomposition and scalar coefficients, refined iteratively using second-order information in Hessian-induced geometry.
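
To make the variable-grid idea concrete, the sketch below enumerates the four levels a 2-bit group can reach. It assumes a parameterization that this page does not spell out: each weight stores two bits b1, b2 in {0, 1} and is reconstructed as c0 + c1*b1 + c2*b2. The function names, sample weights, and coefficients are hypothetical. Under that assumption a uniform UINT2-style grid is the special case c1 = s, c2 = 2s, so every fixed-grid solution stays reachable.

```python
from itertools import product

def variable_grid(c0, c1, c2):
    """Levels c0 + c1*b1 + c2*b2 for bits b1, b2 in {0, 1} (assumed form)."""
    return sorted(c0 + c1 * b1 + c2 * b2 for b1, b2 in product((0, 1), repeat=2))

def uniform_grid(c0, s):
    """Fixed UINT2-style grid: four equally spaced levels c0 + s*q, q = 0..3."""
    return [c0 + s * q for q in range(4)]

def snap_error(weights, levels):
    """Sum of squared errors when each weight snaps to its nearest level."""
    return sum(min((w - l) ** 2 for l in levels) for w in weights)

# Hypothetical weight group with a cluster near zero and one outlier.
group = [-0.9, -0.8, -0.1, 0.0, 0.05, 1.0]

# The uniform grid is reproduced by choosing c1 = s and c2 = 2*s.
assert uniform_grid(-1.0, 0.5) == variable_grid(-1.0, 0.5, 1.0)

fixed_err = snap_error(group, uniform_grid(-0.9, 1.9 / 3))          # min-max uniform grid
variable_err = snap_error(group, variable_grid(-0.85, 0.85, 1.85))  # hand-picked coefficients
print(f"fixed grid SSE:    {fixed_err:.4f}")
print(f"variable grid SSE: {variable_err:.4f}")
```

Under this assumed form the smallest and largest levels always sum to the same value as the two middle levels, so the grid is more flexible than a uniform one but is not an arbitrary four-entry codebook.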

If this is right

A 72B model becomes runnable at 2 bits on a single RTX 3090 GPU while retaining 83.85 percent accuracy on GSM8K (90.83 percent at 16-bit).
Error minimization at low bit widths improves because the variable grid enlarges the set of reachable quantization points.
The refinement steps stay consistent with the second-order optimization direction, preserving more of the original model behavior.
Post-training quantization extends reliably to the 2-bit regime for inference on memory-limited hardware.
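
The single-GPU claim can be sanity-checked with back-of-the-envelope arithmetic. The sketch below counts weight storage only. The nominal parameter count of 72e9, the group size of 128, and the three FP16 coefficients per group are illustrative assumptions rather than values taken from the paper, and the KV cache and activations are ignored.

```python
def weight_memory_gib(n_params, bits_per_weight, group_size=None, meta_bits_per_group=0):
    """Approximate weight storage in GiB, optionally adding per-group metadata."""
    total_bits = n_params * bits_per_weight
    if group_size:
        total_bits += (n_params / group_size) * meta_bits_per_group
    return total_bits / 8 / 2**30

N_PARAMS = 72e9  # nominal size, assumed for illustration

fp16 = weight_memory_gib(N_PARAMS, 16)
two_bit = weight_memory_gib(N_PARAMS, 2)
# Hypothetical layout: groups of 128 weights, three FP16 coefficients per group.
two_bit_grouped = weight_memory_gib(N_PARAMS, 2, group_size=128, meta_bits_per_group=3 * 16)

for name, gib in [("16-bit", fp16), ("2-bit", two_bit), ("2-bit + coefficients", two_bit_grouped)]:
    print(f"{name:>22}: {gib:6.1f} GiB")
```

Under these assumptions the 2-bit weights fit within the 24 GB of an RTX 3090 with some headroom, while the 16-bit weights exceed it several times over.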

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same bit-plane structure might reduce memory in other model compression settings such as sparse attention or mixture-of-experts routing.
Hardware kernels built around variable bit-plane layouts could cut both memory traffic and compute time during inference.
Applying the refinement loop to activation tensors rather than only weights could further lower overall quantization error.

Load-bearing premise

Iterative adjustment of the bit-planes and coefficients using second-order information will reduce quantization errors enough to keep outputs close to the original without creating instability or needing model-specific tuning.
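
One common way to use second-order information in post-training quantization is the OBS-style compensation step popularized by GPTQ [6]: after one weight is fixed to a grid value, the remaining weights shift along the inverse Hessian to absorb the error. The sketch below illustrates that general technique on a two-weight example. It is not BPDQ's own update rule, which this page does not reproduce, and the Hessian and weights are made-up values.

```python
def compensate(w, q_index, q_value, h_inv):
    """Fix w[q_index] at q_value and shift all weights along column q_index of H^-1."""
    step = (w[q_index] - q_value) / h_inv[q_index][q_index]
    return [wi - step * h_inv[i][q_index] for i, wi in enumerate(w)]

def quad_loss(delta, h):
    """Second-order proxy 0.5 * delta^T H delta for the induced output error."""
    n = len(delta)
    return 0.5 * sum(delta[i] * h[i][j] * delta[j] for i in range(n) for j in range(n))

# Hypothetical 2x2 Hessian and its exact inverse.
H = [[2.0, 1.0], [1.0, 2.0]]
H_INV = [[2 / 3, -1 / 3], [-1 / 3, 2 / 3]]

w = [0.4, 0.3]
naive = [0.0, 0.3]                          # round w[0] to 0, leave w[1] alone
compensated = compensate(w, 0, 0.0, H_INV)  # round w[0] to 0, adjust w[1]

loss_naive = quad_loss([a - b for a, b in zip(naive, w)], H)
loss_comp = quad_loss([a - b for a, b in zip(compensated, w)], H)
print(f"naive loss:       {loss_naive:.3f}")
print(f"compensated loss: {loss_comp:.3f}")
```

In this toy case the compensated update moves the second weight from 0.3 to 0.5 and lowers the proxy loss from 0.16 to 0.12. The premise above amounts to assuming that repeated steps of this kind stay well behaved when the grid itself is also being adjusted.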

What would settle it

Quantize Qwen2.5-72B to 2 bits with BPDQ, run it on the GSM8K benchmark, and check whether accuracy falls well below 80 percent or the measured output discrepancy stays larger than with a fixed-grid baseline.
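
The output-discrepancy half of this test can be phrased as a small metric. The sketch below measures the squared change in one output row of a linear layer over calibration inputs and compares a min-max uniform grid against a variable grid. The helper names, weights, inputs, and grid coefficients are all hypothetical, and plain round-to-nearest stands in for the much stronger baselines a real comparison would use.

```python
def snap(weights, levels):
    """Round each weight to its nearest grid level."""
    return [min(levels, key=lambda l: abs(wi - l)) for wi in weights]

def output_discrepancy(w, w_hat, inputs):
    """Sum over inputs x of ((w - w_hat) . x)^2 for a single output row."""
    delta = [a - b for a, b in zip(w, w_hat)]
    return sum(sum(d * xi for d, xi in zip(delta, x)) ** 2 for x in inputs)

# Hypothetical weight row and calibration inputs.
w = [-0.9, -0.8, -0.1, 0.0, 0.1, 1.0]
inputs = [
    [1, 1, 1, 1, 1, 1],
    [1, -1, 1, -1, 1, -1],
    [1, 0, 0, 0, 0, 1],
]

s = 1.9 / 3                                  # min-max step for a 4-level uniform grid
uniform_levels = [-0.9 + s * q for q in range(4)]
variable_levels = [-0.85, 0.0, 1.0, 1.85]    # c0 + c1*b1 + c2*b2 with c0=-0.85, c1=0.85, c2=1.85

d_uniform = output_discrepancy(w, snap(w, uniform_levels), inputs)
d_variable = output_discrepancy(w, snap(w, variable_levels), inputs)
print(f"uniform grid:  {d_uniform:.4f}")
print(f"variable grid: {d_variable:.4f}")
```

On a real model the same quantity would be accumulated per layer over a calibration set; a variable grid that failed to beat the fixed grid on this metric would count against the paper's claim.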

Figures

Figures reproduced from arXiv: 2602.04163 by Chaofan Tao, He Xiao, Hongxia Yang, Jing Xiong, Jungang Li, Junyu Chen, Long Shi, Mengzhao Chen, Ngai Wong, Qingyao Yang, Taiqiang Wu, Wenjie Wang, Zhen Li, Zhen Peng.

**Figure 1.** (a) Fixed grids (Uniform/Non-Uniform) enforce shape invariance, where the relative spacing of quantization levels is shared across groups (scaled by s). BPDQ breaks this limitation by constructing a variable grid per group using bit-plane coefficients (c1, c2), expanding the feasible set. (b) Performance comparison of 2-bit quantized Qwen2.5-72B.

**Figure 2.** Overview of the 2-bit BPDQ quantization procedure.

Excerpt from the paper's Section 2 (Related Work), truncated: "Low-bit Quantization for LLMs. To achieve extreme compression rates, QAT methods optimize in the Boolean domain or utilize factorized representations (Tran & Nguyen, 2025; Lee et al., 2025), albeit at substantial training costs. Among PTQ methods, vector quantization (VQ) maps weights to codebooks (Egiazarian et al., 2024; Liu et al., 2024)…"

**Figure 3.** LongBench performance comparison on Qwen2.5-7B.

Excerpt from the adjacent paper text, truncated: "… which acts as a stress test for long-range dependency. GPTQ suffers severe degradation (score drops to 4.98%), indicating the loss of retrieval capabilities. In contrast, BPDQ sustains the performance at 53.75%, whereas VPTQ achieves higher resilience but at the cost of prohibitive quantization overhead. Furthermore, in summarization and classification tas…"

Original abstract

Large language model inference is often bounded by memory footprint and bandwidth in resource-constrained deployments, making quantization fundamental to efficient serving. While post-training quantization (PTQ) maintains high fidelity at 4-bit, it deteriorates at 2-3 bits. In essence, existing methods enforce a shape-invariant quantization grid (e.g., the fixed uniform intervals of UINT2) for each group, severely restricting the feasible set for error minimization. To address this, we propose Bit-Plane Decomposition Quantization (BPDQ), which constructs a variable quantization grid via bit-planes and scalar coefficients, and iteratively refines them using second-order information while progressively compensating for quantization errors to minimize output discrepancy. In the 2-bit regime, BPDQ enables serving Qwen2.5-72B on a single RTX 3090 with 83.85% GSM8K accuracy (vs. 90.83% at 16-bit). Moreover, we theoretically show that the variable grid expands the feasible set, and that the quantization process consistently aligns with the optimization objective in Hessian-induced geometry. The code is available at https://github.com/KingdalfGoodman/BPDQ.
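
For readers unfamiliar with the phrase "Hessian-induced geometry", the block below writes out the layer-wise objective that GPTQ-style methods [6] minimize, which is the usual setting for that phrase. It is a standard formulation offered as context only: the abstract does not state BPDQ's exact objective, and the symbols W (weights), X (calibration inputs), H, and the feasible sets Q are notation assumed here.

```latex
\min_{\hat{W} \in \mathcal{Q}} \; \bigl\| (W - \hat{W})\, X \bigr\|_F^2
  \;=\; \operatorname{tr}\!\bigl( (W - \hat{W})\, H \,(W - \hat{W})^{\top} \bigr),
  \qquad H = X X^{\top}.

% If the fixed-grid feasible set is contained in the variable-grid one,
% the attainable minimum of the same loss L cannot get worse:
\mathcal{Q}_{\text{fixed}} \subseteq \mathcal{Q}_{\text{var}}
  \;\Longrightarrow\;
  \min_{\hat{W} \in \mathcal{Q}_{\text{var}}} \mathcal{L}(\hat{W})
  \;\le\;
  \min_{\hat{W} \in \mathcal{Q}_{\text{fixed}}} \mathcal{L}(\hat{W}).
```

The inequality concerns the optimum only. Whether an iterative procedure actually reaches a better point on the larger set is a separate question, which is where the paper's alignment claim would have to do its work.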

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

BPDQ gets a 72B model to 2 bits with usable accuracy on one GPU, but the theory on feasible-set expansion and Hessian alignment stays thin on derivation.

The main point to take away is that BPDQ achieves usable 2-bit performance on a 72B-scale model, allowing Qwen2.5-72B to run on a single RTX 3090 with 83.85% accuracy on GSM8K. This is a meaningful step for practical low-bit serving.

The new element here is the bit-plane decomposition approach that builds a variable grid using bit-planes and scalar coefficients, moving away from shape-invariant fixed grids. The authors then iteratively refine these with second-order information to compensate for quantization errors and reduce output discrepancy. This setup is framed as expanding the options for minimization compared to standard PTQ.

On the strengths, the paper provides a clear empirical demonstration with a large model and specific numbers, which is more than many abstracts offer. Making the code available helps others verify and build on it.

The weaker areas are in the supporting theory and experimental transparency. The claim that the variable grid expands the feasible set and consistently aligns with the Hessian-induced optimization objective lacks a detailed derivation or proof sketch in the provided material. It risks coming across as circular without an independent check or bound on how much the expansion helps. The abstract also omits ablations, error bars, and a full protocol, making it hard to assess whether the results are robust or general.

This paper targets researchers and engineers focused on efficient inference for large language models, especially those dealing with memory constraints in deployment. A reader working on quantization techniques would find the method and results worth examining. I think it deserves peer review. The practical outcome and code release provide enough to justify referee time, with the expectation that revisions would address the theoretical gaps and add more validation details.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces Bit-Plane Decomposition Quantization (BPDQ) for post-training quantization of large language models. It argues that existing methods use fixed shape-invariant grids (e.g., uniform UINT2 intervals) that restrict the feasible set for error minimization, and proposes constructing a variable grid via bit-planes and scalar coefficients that are iteratively refined using second-order information to compensate quantization errors and minimize output discrepancy. The central empirical claim is that 2-bit BPDQ enables serving Qwen2.5-72B on a single RTX 3090 with 83.85% GSM8K accuracy (vs. 90.83% at 16-bit). The paper also asserts a theoretical result that the variable grid expands the feasible set and that the quantization process consistently aligns with the optimization objective in Hessian-induced geometry. Code is provided at a public repository.

Significance. If the empirical result holds under broader testing and the theoretical alignment is shown with explicit bounds or convergence arguments, the work could meaningfully advance low-bit PTQ for LLMs by relaxing grid constraints while leveraging Hessian geometry. The concrete 72B-model demonstration at 2 bits and open-sourced code are positive for reproducibility and practical impact.

major comments (2)

[Abstract] The claim that the variable grid 'expands the feasible set' and that the process 'consistently aligns with the optimization objective in Hessian-induced geometry' is asserted without a quantified comparison (e.g., cardinality or volume bound) or a convergence argument for the iterative second-order refinement; this is load-bearing for the reliability of the 2-bit regime.
[Experimental results] The 83.85% GSM8K figure for 2-bit Qwen2.5-72B is reported without error bars, an ablation isolating bit-plane/scalar refinement from the Hessian updates, or protocol details on iteration count and stopping criterion, leaving the weakest assumption about stability and generalizability untested.

minor comments (1)

[Abstract] The phrase 'progressively compensating for quantization errors' is used without specifying the discrepancy metric or convergence tolerance.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and the recommendation for major revision. We address each major comment below and indicate the revisions we will incorporate to strengthen the presentation of both the theoretical claims and the experimental results.

Referee: [Abstract] The claim that the variable grid 'expands the feasible set' and that the process 'consistently aligns with the optimization objective in Hessian-induced geometry' is asserted without a quantified comparison (e.g., cardinality or volume bound) or a convergence argument for the iterative second-order refinement; this is load-bearing for the reliability of the 2-bit regime.

Authors: The manuscript derives that the variable grid constructed from bit-planes and scalar coefficients admits a strictly larger set of representable values than fixed uniform grids, thereby expanding the feasible set available for error minimization. The iterative second-order refinement is formulated to follow the curvature directions given by the Hessian approximation, ensuring each step reduces output discrepancy in a manner aligned with the underlying optimization geometry. We acknowledge that the abstract states these results at a high level without explicit cardinality or volume bounds and without a formal convergence argument. We will revise the abstract to moderate the claim language and add an appendix containing a cardinality comparison for representative small cases together with a convergence sketch for the refinement procedure under standard bounded-Hessian assumptions.

Revision: yes

Referee: [Experimental results] The 83.85% GSM8K figure for 2-bit Qwen2.5-72B is reported without error bars, an ablation isolating bit-plane/scalar refinement from the Hessian updates, or protocol details on iteration count and stopping criterion, leaving the weakest assumption about stability and generalizability untested.

Authors: We agree that greater experimental transparency is warranted. We will add the precise iteration count, the stopping criterion based on output-error reduction, and full protocol details to the revised experimental section. We will also include an ablation that isolates the bit-plane decomposition and scalar-coefficient components from the Hessian-driven updates, performed on smaller models where repeated runs remain computationally tractable. Error bars for the 72B model are not feasible to obtain within reasonable resource limits; we will instead report variance across multiple runs on 7B- and 13B-scale models to support claims of stability.

Revision: partial

Circularity Check

0 steps flagged

No significant circularity detected; theoretical claims presented as independent demonstrations

The abstract describes BPDQ as constructing a variable grid via bit-planes and scalars, then iteratively refining with second-order information to compensate for errors. The theoretical statements (that the variable grid expands the feasible set and that the process aligns with the optimization objective in Hessian-induced geometry) are presented as demonstrated results rather than definitions or fits. No equations are exhibited that reduce the claimed expansion or alignment to the refinement loop by construction, nor are load-bearing self-citations or renamed empirical patterns identified. Performance numbers on Qwen2.5-72B and GSM8K serve as external benchmarks. The derivation chain remains self-contained against the stated assumptions.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Review performed on abstract only; no explicit free parameters, axioms, or invented entities are stated in the provided text. The scalar coefficients and bit-plane construction may implicitly involve fitted values, but none are enumerated.

pith-pipeline@v0.9.0 · 5785 in / 1222 out tokens · 101998 ms · 2026-05-21T14:04:40.716195+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: "BPDQ constructs a variable quantization grid via bit-planes and scalar coefficients... expands the feasible set... consistently aligns with the optimization objective in Hessian-induced geometry"

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
Paper passage: "the quantization process consistently aligns with the optimization objective in Hessian-induced geometry"

Each tag describes the relation between the paper passage and the cited Recognition theorem.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

26 extracted references · 26 canonical work pages · 11 internal anchors

[1] Chen, J., Li, J., Peng, Z., Wang, W., Ren, Y., Shi, L., and Hu, X. Lota-qaf: Lossless ternary adaptation for quantization-aware fine-tuning. arXiv preprint arXiv:2505.18724, 2025a. Chen, J., Shabanzadeh, Y., Crnčević, E., Hoefler, T., and Alistarh, D. The geometry of llm quantization: Gptq as babai’s nearest plane algorithm. arXiv preprint arXiv:2507.1...

[2] Clark, C., Lee, K., Chang, M.-W., Kwiatkowski, T., Collins, M., and Toutanova, K. Boolq: Exploring the surprising difficulty of natural yes/no questions. arXiv preprint arXiv:1905.10044,

[3] Clark, P., Cowhey, I., Etzioni, O., Khot, T., Sabharwal, A., Schoenick, C., and Tafjord, O. Think you have solved question answering? try arc, the ai2 reasoning challenge. arXiv preprint arXiv:1803.05457,

[4] Cobbe, K., Kosaraju, V., Bavarian, M., Chen, M., Jun, H., Kaiser, L., Plappert, M., Tworek, J., Hilton, J., Nakano, R., et al. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168,

[5] Egiazarian, V., Panferov, A., Kuznedelev, D., Frantar, E., Babenko, A., and Alistarh, D. Extreme compression of large language models via additive quantization. arXiv preprint arXiv:2401.06118,

[6] Frantar, E., Ashkboos, S., Hoefler, T., and Alistarh, D. Gptq: Accurate post-training quantization for generative pre-trained transformers. arXiv preprint arXiv:2210.17323,

[7] URL https://zenodo.org/records/12608602. Gong, R., Ding, Y., Wang, Z., Lv, C., Zheng, X., Du, J., Qin, H., Guo, J., Magno, M., and Liu, X. A survey of low-bit large language models: Basics, systems, and algorithms. arXiv preprint arXiv:2409.16694,

[8] Gu, X., Pang, T., Du, C., Liu, Q., Zhang, F., Du, C., Wang, Y., and Lin, M. When attention sink emerges in language models: An empirical view. arXiv preprint arXiv:2410.10781,

[9] Hendrycks, D., Burns, C., Basart, S., Zou, A., Mazeika, M., Song, D., and Steinhardt, J. Measuring massive multitask language understanding. arXiv preprint arXiv:2009.03300,

[10] Huang, W., Liu, Y., Qin, H., Li, Y., Zhang, S., Liu, X., Magno, M., and Qi, X. Billm: Pushing the limit of post-training quantization for llms. arXiv preprint arXiv:2402.04291, 2024a. Huang, W., Qin, H., Liu, Y., Li, Y., Liu, Q., Liu, X., Benini, L., Magno, M., Zhang, S., and Qi, X. Slim-llm: Salience-driven mixed-precision quantization for large lang...

[11] Lee, B., Kim, D., You, Y., and Kim, Y. Littlebit: Ultra low-bit quantization via latent factorization. arXiv preprint arXiv:2506.13771,

[12] Li, Z., Yan, X., Zhang, T., Qin, H., Xie, D., Tian, J., Kong, L., Zhang, Y., Yang, X., et al. Arb-llm: Alternating refined binarizations for large language models. arXiv preprint arXiv:2410.03129,

[13] Liu, Y., Wen, J., Wang, Y., Ye, S., Zhang, L. L., Cao, T., Li, C., and Yang, M. Vptq: Extreme low-bit vector post-training quantization for large language models. arXiv preprint arXiv:2409.17066,

[14] Liu, Z., Oguz, B., Zhao, C., Chang, E., Stock, P., Mehdad, Y., Shi, Y., Krishnamoorthi, R., and Chandra, V. Llm-qat: Data-free quantization aware training for large language models. arXiv preprint arXiv:2305.17888,

[15] Merity, S., Xiong, C., Bradbury, J., and Socher, R. Pointer sentinel mixture models. arXiv preprint arXiv:1609.07843,

[16] Miao, X., Oliaro, G., Zhang, Z., Cheng, X., Jin, H., Chen, T., and Jia, Z. Towards efficient generative large language model serving: A survey from algorithms to systems. arXiv preprint arXiv:2312.15234,

[17] Contact: qubitium@modelcloud.ai. Park, G., Park, B., Kim, M., Lee, S., Kim, J., Kwon, B., Kwon, S. J., Kim, B., Lee, Y., and Lee, D. Lut-gemm: Quantized matrix multiplication based on luts for efficient inference in large-scale generative language models. arXiv preprint arXiv:2206.09557,

[18] Park, G., Bae, J., Kwon, B., Kim, B., Kwon, S. J., and Lee, D. Anybcq: Hardware efficient flexible binary-coded quantization for multi-precision llms. arXiv preprint arXiv:2510.10467,

[19] Tran, B.-H. and Nguyen, V. M. Highly efficient and effective llms with multi-boolean architectures. arXiv preprint arXiv:2505.22811,

[20] Xiao, H., Yang, R., Yang, Q., Xu, W., Li, Z., Su, Y., Liu, Z., Yang, H., and Wong, N. Ptqtp: Post-training quantization to trit-planes for large language models. arXiv preprint arXiv:2509.16989,

[21] Xu, Y., Xie, L., Gu, X., Chen, X., Chang, H., Zhang, H., Chen, Z., Zhang, X., and Tian, Q. Qa-lora: Quantization-aware low-rank adaptation of large language models. arXiv preprint arXiv:2309.14717,

[22] Yan, X., Bao, C., Li, Z., Zhang, T., Yang, K., Qin, H., Xie, R., Sun, X., and Zhang, Y. Pt2-llm: Post-training ternarization for large language models. arXiv preprint arXiv:2510.03267,

[23] Yang, A., Li, A., Yang, B., Zhang, B., Hui, B., Zheng, B., Yu, B., Gao, C., Huang, C., Lv, C., et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388,

[24] Zellers, R., Holtzman, A., Bisk, Y., Farhadi, A., and Choi, Y. Hellaswag: Can a machine really finish your sentence? arXiv preprint arXiv:1905.07830,

[25] Zeng, S., Liu, J., Dai, G., et al. Flightllm: Efficient large language model inference with a complete mapping flow on fpgas. In Proceedings of the 2024 ACM/SIGDA International Symposium on Field Programmable Gate Arrays. Zhang, H., Zhang, S., Colbert, I., and Saab, R. Provable post-training quantization: Theoretical analysis of optq and qronos. arXiv p...

[26] Paper excerpt rather than a citation: "… are linearly independent, this provides an additional coefficient degree of freedom (3 parameters (c0, c1, c2) vs. 2 parameters (c0, s)). Crucially, the fixed grid enforces global rigidity across all groups, whereas the variable grid enables adaptability tailored for each group."
