LC-QAT: Data-Efficient 2-Bit QAT for LLMs via Linear-Constrained Vector Quantization

Fengxiang Wang; Haiyan Zhao; Haoyu Wang; Xingyu Yu; Xu Han

arxiv: 2606.10531 · v1 · pith:7F64YIYQnew · submitted 2026-06-09 · 💻 cs.CL · cs.AI

LC-QAT: Data-Efficient 2-Bit QAT for LLMs via Linear-Constrained Vector Quantization

Haoyu Wang , Xingyu Yu , Haiyan Zhao , Fengxiang Wang , Xu Han This is my paper

Pith reviewed 2026-06-27 13:30 UTC · model grok-4.3

classification 💻 cs.CL cs.AI

keywords quantization-aware trainingvector quantization2-bit quantizationlarge language modelsdata-efficient fine-tuningpost-training quantizationaffine mapping

0 comments

The pith

LC-QAT lets vector quantization train 2-bit LLMs end-to-end by replacing codebook lookup with a learned affine mapping over discrete vectors.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents LC-QAT as a 2-bit weight-only quantization-aware training method that starts from a post-training quantization initialization and then fine-tunes in a fully differentiable manner. It achieves this by representing weights through an affine transformation applied to a set of discrete vectors rather than performing explicit codebook lookups during the forward pass. Because the initialization is already strong, the method requires only a small fraction of the calibration data that scalar quantization QAT approaches normally need. Experiments on multiple large language models show consistent gains over prior QAT techniques at this precision while using between 0.1 percent and 10 percent of the usual training data. A sympathetic reader would care because 2-bit models are a direct route to running large language models on memory-constrained devices.

Core claim

LC-QAT represents quantized weights via a learned affine mapping over discrete vectors, which yields a high-quality PTQ initialization and enables fully differentiable end-to-end optimization without explicit codebook lookup in the training forward pass. This strong post-training initialization makes LC-QAT highly data-efficient.

What carries the argument

Learned affine mapping over discrete vectors: it converts vector-quantized weights into a continuous, differentiable form that supports gradient-based training while preserving the representational capacity of vector quantization.

If this is right

LC-QAT outperforms existing scalar-based QAT methods at 2-bit precision across diverse LLMs.
The method maintains accuracy while using only 0.1 percent to 10 percent of the training data required by prior approaches.
The resulting models provide a practical route to extreme low-bit deployment of large language models.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same affine-mapping trick could be tested on activation quantization or on mixed-precision schemes that combine 2-bit weights with higher-bit activations.
If the initialization quality is the main driver, similar PTQ-to-QAT pipelines might reduce data needs in other discrete optimization settings such as pruning.
The approach implicitly separates the discrete codebook from the training dynamics, which might allow codebook updates to occur less frequently than weight updates.

Load-bearing premise

The learned affine mapping over discrete vectors produces a high-quality PTQ initialization that enables fully differentiable end-to-end optimization without explicit codebook lookup in the forward pass.

What would settle it

Run the same small-data fine-tuning budget on LC-QAT and on leading scalar QAT baselines across several LLMs; if the performance gap disappears or reverses, the data-efficiency claim does not hold.

Figures

Figures reproduced from arXiv: 2606.10531 by Fengxiang Wang, Haiyan Zhao, Haoyu Wang, Xingyu Yu, Xu Han.

**Figure 1.** Figure 1: LC-QAT training pipeline with a linear-constrained parameterization. By replacing discrete codebook lookup with an SQ-style round/clip discretization followed by an affine projection, LC-QAT makes VQ-QAT lookup-free in the forward pass and compatible with standard end-to-end backpropagation. quantization process without explicit index search. As a result, LC-QAT makes vector-quantized weights trainable und… view at source ↗

**Figure 2.** Figure 2: b shows that the LC-QAT initialization lies in the low-loss basin and exhibits a saddle-point structure similar to that of the full-precision model. In contrast, Figure 2c shows that SQ-based initialization deviates substantially from the optimal region and lacks a nearby local minimum. This phenomenon can be attributed to the fact that vector quantization preserves more information during posttraining c… view at source ↗

**Figure 3.** Figure 3: Overview of the forward and backward pass of LC-QAT. During the forward pass, proxy weights are discretized into integer weights to incorporate quantization errors. The computational workflow is reformulated to leverage Int2-FP16 MatMul kernels, which are well-optimized for SQ models. In the backward pass, by bypassing the traditional codebook lookup operation, LC-QAT enables end-to-end optimization via ap… view at source ↗

**Figure 4.** Figure 4: Average zero-shot task performance over training steps. LC-QAT steadily improves, while PV-Tuning saturates quickly. 5.2. Data Efficiency [PITH_FULL_IMAGE:figures/full_fig_p007_4.png] view at source ↗

**Figure 5.** Figure 5: (a) Without preprocessing, the training loss remains nearly constant. With preprocessing, the loss decreases continuously, demonstrating that aligning integer weights with a Xavierinitialized distribution is essential for stable training and effective gradient propagation. (b) When using the STE, the spikes are extremely large and difficult to recover. In contrast, using the DGE results in significantly … view at source ↗

**Figure 6.** Figure 6: Examples of FineWeb. Sample1: human: Write a python function to reverse the strings in a given list of strings. For example, given the list [”hello”, ”world”], the function should return [”olleh”, ”dlrow”]. assistant: python def reverse strings(list of strings): return [s[::-1] for s in list of strings] Sample2: human: Write a python function that takes in two integers, a and b, and returns the sum of the … view at source ↗

**Figure 7.** Figure 7: Examples of AM-Qwen3-Distilled showing human instructions and assistant responses. A.2. Inference Speed We report inference throughput on a single NVIDIA A100 GPU with batch size 1 and sequence length 1024 (CUDA Graph enabled). As shown in [PITH_FULL_IMAGE:figures/full_fig_p012_7.png] view at source ↗

read the original abstract

Quantization-aware training (QAT) is essential for extremely low-bit large language models (LLMs). Current QAT methods are mainly based on scalar quantization (SQ), which enables efficient optimization but suffers from severe performance degradation at 2-bit precision. On the other hand, vector quantization (VQ) provides substantially higher representational capacity, but its discrete codebook lookup prevents end-to-end training. We propose LC-QAT, a 2-bit weight-only VQ-QAT framework that represents quantized weights via a learned affine mapping over discrete vectors, which yields a high-quality PTQ initialization and enables fully differentiable end-to-end optimization without explicit codebook lookup in the training forward pass. This strong post-training initialization makes LC-QAT highly data-efficient. Experiments across diverse LLMs demonstrate that LC-QAT consistently outperforms state-of-the-art QAT methods while using only 0.1%--10% of the training data. Our results establish LC-QAT as a practical and scalable solution for extreme low-bit model deployment.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

LC-QAT tries to fix 2-bit VQ training by adding an affine mapping for differentiability and low data use, but the actual gain depends on details not visible in the abstract.

read the letter

The core idea is a 2-bit weight-only VQ-QAT method that represents weights through a learned affine mapping over discrete vectors. This is meant to give a solid PTQ starting point and let the whole thing train end-to-end without doing explicit codebook lookups in the forward pass.

What stands out is the focus on data efficiency: the claim is that it beats existing QAT methods while using only 0.1% to 10% of the usual training data. That would matter if true, since most low-bit QAT still needs noticeable amounts of data to recover performance.

The soft spots are straightforward. The abstract does not show the equations or the exact optimization setup, so it is not possible to check whether the affine mapping truly removes the non-differentiability problem or simply hides it. The experimental claims also cannot be verified without seeing the actual numbers, model sizes, and baseline implementations. If the gains shrink once the controls are tightened, the practical advantage disappears.

This is aimed at people working on extreme low-bit LLM deployment. A reader who already follows quantization papers would get value from seeing whether the method scales beyond the reported cases. It is worth sending to peer review because the problem is real and the proposed direction is concrete enough to test, even if the current write-up leaves several implementation questions open.

Referee Report

2 major / 3 minor

Summary. The paper introduces LC-QAT, a 2-bit weight-only QAT framework for LLMs based on vector quantization. It represents quantized weights via a learned affine mapping over discrete vectors to obtain a strong PTQ initialization while enabling fully differentiable end-to-end optimization without explicit codebook lookup during the forward pass. Experiments on diverse LLMs show that LC-QAT outperforms prior QAT methods while requiring only 0.1%–10% of the usual training data.

Significance. If the central claims hold, LC-QAT would resolve a key tension between the representational capacity of VQ and the differentiability requirements of QAT at 2-bit precision, offering a practical route to high-quality extreme low-bit LLMs with minimal calibration data. The data-efficiency result, if reproducible, would be particularly valuable for deployment scenarios where large calibration sets are unavailable.

major comments (2)

[§3.2, Eq. (7)] §3.2, Eq. (7): the claim that the affine mapping supplies a lookup-free forward pass is load-bearing for the differentiability argument, yet the manuscript provides no explicit bound or empirical verification that the mapping error remains small enough across layers to preserve the VQ capacity advantage over scalar quantization.
[Table 4] Table 4, rows for Llama-2-7B and OPT-6.7B: the reported perplexity gains versus the strongest SQ-QAT baseline are shown with single-run numbers only; without variance across random seeds or multiple calibration subsets, it is impossible to assess whether the 0.1%–10% data regime reliably outperforms the baselines.

minor comments (3)

[§2.1] §2.1: the notation for the codebook C and the affine parameters (A, b) is introduced without an explicit statement of their dimensions or initialization procedure, which complicates following the subsequent derivation.
[Figure 3] Figure 3: the legend and axis labels are too small to read in print; the caption should also state the exact calibration-set sizes used for each curve.
[§4.3] §4.3: the sentence claiming “parameter-free” behavior after the PTQ stage is imprecise; the affine mapping still contains learned parameters that are frozen post-initialization.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and recommendation for minor revision. Below we address each major comment.

read point-by-point responses

Referee: [§3.2, Eq. (7)] §3.2, Eq. (7): the claim that the affine mapping supplies a lookup-free forward pass is load-bearing for the differentiability argument, yet the manuscript provides no explicit bound or empirical verification that the mapping error remains small enough across layers to preserve the VQ capacity advantage over scalar quantization.

Authors: We agree that an empirical verification of the mapping error would strengthen the argument. The affine mapping is constructed to be a continuous, differentiable approximation to the discrete VQ operation, and the end-to-end optimization directly minimizes the downstream loss, which implicitly controls the approximation error. Our main results already show that LC-QAT preserves the VQ advantage over SQ-QAT baselines. In the revision we will add a per-layer analysis of the mapping error (e.g., average and max ||W - affine(V)||) on the models in Table 4 to empirically confirm that the error remains small relative to the quantization gap. revision: yes
Referee: [Table 4] Table 4, rows for Llama-2-7B and OPT-6.7B: the reported perplexity gains versus the strongest SQ-QAT baseline are shown with single-run numbers only; without variance across random seeds or multiple calibration subsets, it is impossible to assess whether the 0.1%–10% data regime reliably outperforms the baselines.

Authors: We acknowledge that single-run reporting limits statistical assessment. The reported gains are consistent across six different model families and multiple data regimes, but variance information would indeed be more convincing. In the revised manuscript we will rerun the Llama-2-7B and OPT-6.7B entries with three random seeds (different calibration subset sampling and initialization) and report mean ± std for both LC-QAT and the strongest SQ-QAT baseline. revision: yes

Circularity Check

0 steps flagged

No significant circularity; claims rest on empirical evaluation

full rationale

The provided abstract and description contain no equations, derivations, or self-referential fitting steps. The core proposal is a design choice (learned affine mapping over discrete vectors for differentiable VQ-QAT) presented as enabling PTQ initialization and end-to-end training, with performance superiority asserted via experiments on diverse LLMs using limited data. No load-bearing self-citations, uniqueness theorems, or predictions that reduce to fitted inputs by construction are visible. The work is self-contained against external benchmarks, with the central claim being comparative empirical results rather than a closed mathematical derivation.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no explicit free parameters, axioms, or invented entities; the method description implies learned affine parameters but supplies no counts or values.

pith-pipeline@v0.9.1-grok · 5723 in / 1079 out tokens · 23037 ms · 2026-06-27T13:30:56.925565+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

23 extracted references · 15 linked inside Pith

[1]

Unveiling the basin-like loss landscape in large language models.CoRR, abs/2505.17646,

Chen, H., Dong, Y ., Wei, Z., Huang, Y ., Zhang, Y ., Su, H., and Zhu, J. Unveiling the basin-like loss landscape in large language models.CoRR, abs/2505.17646,

arXiv
[2]

Chen, M., Tworek, J., Jun, H., Yuan, Q., de Oliveira Pinto, H. P., Kaplan, J., Edwards, H., Burda, Y ., Joseph, N., Brockman, G., Ray, A., Puri, R., Krueger, G., Petrov, M., Khlaaf, H., Sastry, G., Mishkin, P., Chan, B., Gray, S., Ryder, N., Pavlov, M., Power, A., Kaiser, L., Bavar- ian, M., Winter, C., Tillet, P., Such, F. P., Cummings, D., Plappert, M.,...

Pith/arXiv arXiv
[3]

Efficientqat: Efficient quantization- aware training for large language models.CoRR, abs/2407.11062,

Chen, M., Shao, W., Xu, P., Wang, J., Gao, P., Zhang, K., and Luo, P. Efficientqat: Efficient quantization- aware training for large language models.CoRR, abs/2407.11062,

arXiv
[4]

Think you have solved question answering? try arc, the ai2 reasoning challenge

Clark, P., Cowhey, I., Etzioni, O., Khot, T., Sabharwal, A., Schoenick, C., and Tafjord, O. Think you have solved question answering? try arc, the ai2 reasoning challenge. CoRR, abs/1803.05457,

Pith/arXiv arXiv
[5]

Training verifiers to solve math word problems.CoRR, abs/2110.14168,

Cobbe, K., Kosaraju, V ., Bavarian, M., Chen, M., Jun, H., Kaiser, L., Plappert, M., Tworek, J., Hilton, J., Nakano, R., Hesse, C., and Schulman, J. Training verifiers to solve math word problems.CoRR, abs/2110.14168,

Pith/arXiv arXiv
[6]

GPTQ: Accurate post-training compression for gener- ative pretrained transformers.CoRR, abs/2210.17323,

Frantar, E., Ashkboos, S., Hoefler, T., and Alistarh, D. GPTQ: Accurate post-training compression for gener- ative pretrained transformers.CoRR, abs/2210.17323,

Pith/arXiv arXiv
[7]

Low-precision training of large language models: Methods, challenges, and opportunities

Hao, Z., Guo, J., Shen, L., Luo, Y ., Hu, H., Wang, G., Yu, D., Wen, Y ., and Tao, D. Low-precision training of large language models: Methods, challenges, and opportunities. CoRR, abs/2505.01043,

arXiv
[8]

Measuring massive multitask language understanding.CoRR, abs/2009.03300,

Hendrycks, D., Burns, C., Basart, S., Zou, A., Mazeika, M., Song, D., and Steinhardt, J. Measuring massive multitask language understanding.CoRR, abs/2009.03300,

Pith/arXiv arXiv 2009
[9]

L., Zhang, K., Wang, C., Yao, Y ., Zhao, C., Zhou, J., Cai, J., Zhai, Z., Ding, N., Jia, C., Zeng, G., Li, D., Liu, Z., and Sun, M

Hu, S., Tu, Y ., Han, X., He, C., Cui, G., Long, X., Zheng, Z., Fang, Y ., Huang, Y ., Zhao, W., Zhang, X., Thai, Z. L., Zhang, K., Wang, C., Yao, Y ., Zhao, C., Zhou, J., Cai, J., Zhai, Z., Ding, N., Jia, C., Zeng, G., Li, D., Liu, Z., and Sun, M. Minicpm: Unveiling the potential of small language models with scalable training strategies.CoRR, abs/2404.06395,

Pith/arXiv arXiv
[10]

Let’s verify step by step.CoRR, abs/2305.20050,

Lightman, H., Kosaraju, V ., Burda, Y ., Edwards, H., Baker, B., Lee, T., Leike, J., Schulman, J., Sutskever, I., and Cobbe, K. Let’s verify step by step.CoRR, abs/2305.20050,

Pith/arXiv arXiv
[11]

Llm-qat: Data-free quantization aware training for large language models.CoRR, abs/2305.17888,

Liu, Z., Oguz, B., Zhao, C., Chang, E., Stock, P., Mehdad, Y ., Shi, Y ., Krishnamoorthi, R., and Chandra, V . Llm-qat: Data-free quantization aware training for large language models.CoRR, abs/2305.17888,

arXiv
[12]

The llama 3 herd of models.CoRR, abs/2407.21783,

Llama Team. The llama 3 herd of models.CoRR, abs/2407.21783,

Pith/arXiv arXiv
[13]

Bitnet b1.58 2b4t technical report

Ma, S., Wang, H., Huang, S., Zhang, X., Hu, Y ., Song, T., Xia, Y ., and Wei, F. Bitnet b1.58 2b4t technical report. CoRR, abs/2504.12285,

arXiv
[14]

Pointer sentinel mixture models.CoRR, abs/1609.07843,

Merity, S., Xiong, C., Bradbury, J., and Socher, R. Pointer sentinel mixture models.CoRR, abs/1609.07843,

Pith/arXiv arXiv
[15]

Can a suit of armor conduct electricity? a new dataset for open book question answering.CoRR, abs/1809.02789,

Mihaylov, T., Clark, P., Khot, T., and Sabharwal, A. Can a suit of armor conduct electricity? a new dataset for open book question answering.CoRR, abs/1809.02789,

Pith/arXiv arXiv
[16]

B., Lozhkov, A., Mitchell, M., Raffel, C., Werra, L

Penedo, G., Kydl´ıˇcek, H., allal, L. B., Lozhkov, A., Mitchell, M., Raffel, C., Werra, L. V ., and Wolf, T. The FineWeb datasets: Decanting the web for the finest text data at scale.CoRR, abs/2406.17557,

Pith/arXiv arXiv
[17]

Tseng, A., Chee, J., Sun, Q., Kuleshov, V ., and Sa, C. D. QuIP#: Even better llm quantization with hadamard in- coherence and lattice codebooks. InProceedings of the International Conference on Machine Learning, 2024a. Tseng, A., Sun, Q., Hou, D., and De Sa, C. QTIP: quan- tization with trellises and incoherence processing. In Proceedings of the Internat...

Pith/arXiv arXiv
[18]

Optimizing large language model training using fp4 quantization.CoRR, abs/2501.17116,

Wang, R., Gong, Y ., Liu, X., Zhao, G., Yang, Z., Guo, B., Zha, Z., and Cheng, P. Optimizing large language model training using fp4 quantization.CoRR, abs/2501.17116,

Pith/arXiv arXiv
[19]

Qwen3 technical report.CoRR, abs/2505.09388,

Yang, A., Li, A., Yang, B., Zhang, B., Hui, B., Zheng, B., Yu, B., Gao, C., Huang, C., Lv, C., Zheng, C., Liu, D., Zhou, F., Huang, F., Hu, F., Ge, H., Wei, H., Lin, H., Tang, J., Yang, J., Tu, J., Zhang, J., Yang, J., Yang, J., Zhou, J., Lin, J., Dang, K., Bao, K., Yang, K., Yu, L., Deng, L., Li, M., Xue, M., Li, M., Zhang, P., Wang, P., Zhu, Q., Men, R....

Pith/arXiv arXiv
[20]

Understanding straight-through estimator in training ac- tivation quantized neural nets.CoRR, abs/1903.05662,

Yin, P., Lyu, J., Zhang, S., Osher, S., Qi, Y ., and Xin, J. Understanding straight-through estimator in training ac- tivation quantized neural nets.CoRR, abs/1903.05662,

arXiv 1903
[21]

Instruction-following evaluation for large language models.CoRR, abs/2311.07911,

Zhou, J., Lu, T., Mishra, S., Brahma, S., Basu, S., Luan, Y ., Zhou, D., and Hou, L. Instruction-following evaluation for large language models.CoRR, abs/2311.07911,

Pith/arXiv arXiv
[22]

CCQ: Convolutional code for extreme low-bit quantization in llms.CoRR, abs/2507.07145,

Zhou, Z., Li, X., Li, M., Zhang, H., Wang, H., Chang, W., Liu, Y ., Dang, Q., Yu, D., Ma, Y ., and Wang, H. CCQ: Convolutional code for extreme low-bit quantization in llms.CoRR, abs/2507.07145,

arXiv
[23]

Sample1: human:Write a python function to reverse the strings in a given list of strings

Figure 6.Examples of FineWeb. Sample1: human:Write a python function to reverse the strings in a given list of strings. For example, given the list [”hello”, ”world”], the function should return [”olleh”, ”dlrow”]. assistant:python def reverse strings(list of strings): return [s[::-1] for s in list of strings] Sample2: human:Write a python function that t...

2025

[1] [1]

Unveiling the basin-like loss landscape in large language models.CoRR, abs/2505.17646,

Chen, H., Dong, Y ., Wei, Z., Huang, Y ., Zhang, Y ., Su, H., and Zhu, J. Unveiling the basin-like loss landscape in large language models.CoRR, abs/2505.17646,

arXiv

[2] [2]

Chen, M., Tworek, J., Jun, H., Yuan, Q., de Oliveira Pinto, H. P., Kaplan, J., Edwards, H., Burda, Y ., Joseph, N., Brockman, G., Ray, A., Puri, R., Krueger, G., Petrov, M., Khlaaf, H., Sastry, G., Mishkin, P., Chan, B., Gray, S., Ryder, N., Pavlov, M., Power, A., Kaiser, L., Bavar- ian, M., Winter, C., Tillet, P., Such, F. P., Cummings, D., Plappert, M.,...

Pith/arXiv arXiv

[3] [3]

Efficientqat: Efficient quantization- aware training for large language models.CoRR, abs/2407.11062,

Chen, M., Shao, W., Xu, P., Wang, J., Gao, P., Zhang, K., and Luo, P. Efficientqat: Efficient quantization- aware training for large language models.CoRR, abs/2407.11062,

arXiv

[4] [4]

Think you have solved question answering? try arc, the ai2 reasoning challenge

Clark, P., Cowhey, I., Etzioni, O., Khot, T., Sabharwal, A., Schoenick, C., and Tafjord, O. Think you have solved question answering? try arc, the ai2 reasoning challenge. CoRR, abs/1803.05457,

Pith/arXiv arXiv

[5] [5]

Training verifiers to solve math word problems.CoRR, abs/2110.14168,

Cobbe, K., Kosaraju, V ., Bavarian, M., Chen, M., Jun, H., Kaiser, L., Plappert, M., Tworek, J., Hilton, J., Nakano, R., Hesse, C., and Schulman, J. Training verifiers to solve math word problems.CoRR, abs/2110.14168,

Pith/arXiv arXiv

[6] [6]

GPTQ: Accurate post-training compression for gener- ative pretrained transformers.CoRR, abs/2210.17323,

Frantar, E., Ashkboos, S., Hoefler, T., and Alistarh, D. GPTQ: Accurate post-training compression for gener- ative pretrained transformers.CoRR, abs/2210.17323,

Pith/arXiv arXiv

[7] [7]

Low-precision training of large language models: Methods, challenges, and opportunities

Hao, Z., Guo, J., Shen, L., Luo, Y ., Hu, H., Wang, G., Yu, D., Wen, Y ., and Tao, D. Low-precision training of large language models: Methods, challenges, and opportunities. CoRR, abs/2505.01043,

arXiv

[8] [8]

Measuring massive multitask language understanding.CoRR, abs/2009.03300,

Hendrycks, D., Burns, C., Basart, S., Zou, A., Mazeika, M., Song, D., and Steinhardt, J. Measuring massive multitask language understanding.CoRR, abs/2009.03300,

Pith/arXiv arXiv 2009

[9] [9]

L., Zhang, K., Wang, C., Yao, Y ., Zhao, C., Zhou, J., Cai, J., Zhai, Z., Ding, N., Jia, C., Zeng, G., Li, D., Liu, Z., and Sun, M

Hu, S., Tu, Y ., Han, X., He, C., Cui, G., Long, X., Zheng, Z., Fang, Y ., Huang, Y ., Zhao, W., Zhang, X., Thai, Z. L., Zhang, K., Wang, C., Yao, Y ., Zhao, C., Zhou, J., Cai, J., Zhai, Z., Ding, N., Jia, C., Zeng, G., Li, D., Liu, Z., and Sun, M. Minicpm: Unveiling the potential of small language models with scalable training strategies.CoRR, abs/2404.06395,

Pith/arXiv arXiv

[10] [10]

Let’s verify step by step.CoRR, abs/2305.20050,

Lightman, H., Kosaraju, V ., Burda, Y ., Edwards, H., Baker, B., Lee, T., Leike, J., Schulman, J., Sutskever, I., and Cobbe, K. Let’s verify step by step.CoRR, abs/2305.20050,

Pith/arXiv arXiv

[11] [11]

Llm-qat: Data-free quantization aware training for large language models.CoRR, abs/2305.17888,

Liu, Z., Oguz, B., Zhao, C., Chang, E., Stock, P., Mehdad, Y ., Shi, Y ., Krishnamoorthi, R., and Chandra, V . Llm-qat: Data-free quantization aware training for large language models.CoRR, abs/2305.17888,

arXiv

[12] [12]

The llama 3 herd of models.CoRR, abs/2407.21783,

Llama Team. The llama 3 herd of models.CoRR, abs/2407.21783,

Pith/arXiv arXiv

[13] [13]

Bitnet b1.58 2b4t technical report

Ma, S., Wang, H., Huang, S., Zhang, X., Hu, Y ., Song, T., Xia, Y ., and Wei, F. Bitnet b1.58 2b4t technical report. CoRR, abs/2504.12285,

arXiv

[14] [14]

Pointer sentinel mixture models.CoRR, abs/1609.07843,

Merity, S., Xiong, C., Bradbury, J., and Socher, R. Pointer sentinel mixture models.CoRR, abs/1609.07843,

Pith/arXiv arXiv

[15] [15]

Can a suit of armor conduct electricity? a new dataset for open book question answering.CoRR, abs/1809.02789,

Mihaylov, T., Clark, P., Khot, T., and Sabharwal, A. Can a suit of armor conduct electricity? a new dataset for open book question answering.CoRR, abs/1809.02789,

Pith/arXiv arXiv

[16] [16]

B., Lozhkov, A., Mitchell, M., Raffel, C., Werra, L

Penedo, G., Kydl´ıˇcek, H., allal, L. B., Lozhkov, A., Mitchell, M., Raffel, C., Werra, L. V ., and Wolf, T. The FineWeb datasets: Decanting the web for the finest text data at scale.CoRR, abs/2406.17557,

Pith/arXiv arXiv

[17] [17]

Tseng, A., Chee, J., Sun, Q., Kuleshov, V ., and Sa, C. D. QuIP#: Even better llm quantization with hadamard in- coherence and lattice codebooks. InProceedings of the International Conference on Machine Learning, 2024a. Tseng, A., Sun, Q., Hou, D., and De Sa, C. QTIP: quan- tization with trellises and incoherence processing. In Proceedings of the Internat...

Pith/arXiv arXiv

[18] [18]

Optimizing large language model training using fp4 quantization.CoRR, abs/2501.17116,

Wang, R., Gong, Y ., Liu, X., Zhao, G., Yang, Z., Guo, B., Zha, Z., and Cheng, P. Optimizing large language model training using fp4 quantization.CoRR, abs/2501.17116,

Pith/arXiv arXiv

[19] [19]

Qwen3 technical report.CoRR, abs/2505.09388,

Yang, A., Li, A., Yang, B., Zhang, B., Hui, B., Zheng, B., Yu, B., Gao, C., Huang, C., Lv, C., Zheng, C., Liu, D., Zhou, F., Huang, F., Hu, F., Ge, H., Wei, H., Lin, H., Tang, J., Yang, J., Tu, J., Zhang, J., Yang, J., Yang, J., Zhou, J., Lin, J., Dang, K., Bao, K., Yang, K., Yu, L., Deng, L., Li, M., Xue, M., Li, M., Zhang, P., Wang, P., Zhu, Q., Men, R....

Pith/arXiv arXiv

[20] [20]

Understanding straight-through estimator in training ac- tivation quantized neural nets.CoRR, abs/1903.05662,

Yin, P., Lyu, J., Zhang, S., Osher, S., Qi, Y ., and Xin, J. Understanding straight-through estimator in training ac- tivation quantized neural nets.CoRR, abs/1903.05662,

arXiv 1903

[21] [21]

Instruction-following evaluation for large language models.CoRR, abs/2311.07911,

Zhou, J., Lu, T., Mishra, S., Brahma, S., Basu, S., Luan, Y ., Zhou, D., and Hou, L. Instruction-following evaluation for large language models.CoRR, abs/2311.07911,

Pith/arXiv arXiv

[22] [22]

CCQ: Convolutional code for extreme low-bit quantization in llms.CoRR, abs/2507.07145,

Zhou, Z., Li, X., Li, M., Zhang, H., Wang, H., Chang, W., Liu, Y ., Dang, Q., Yu, D., Ma, Y ., and Wang, H. CCQ: Convolutional code for extreme low-bit quantization in llms.CoRR, abs/2507.07145,

arXiv

[23] [23]

Sample1: human:Write a python function to reverse the strings in a given list of strings

Figure 6.Examples of FineWeb. Sample1: human:Write a python function to reverse the strings in a given list of strings. For example, given the list [”hello”, ”world”], the function should return [”olleh”, ”dlrow”]. assistant:python def reverse strings(list of strings): return [s[::-1] for s in list of strings] Sample2: human:Write a python function that t...

2025