JacQuant: STE-Free Quantization-Aware Training via Learned Jacobian Surrogates

Harshit Khaitan; Kai Yi; Steven Li; Vignesh Vivekraja

arxiv: 2605.25469 · v1 · pith:NLC4HIYCnew · submitted 2026-05-25 · 💻 cs.LG

JacQuant: STE-Free Quantization-Aware Training via Learned Jacobian Surrogates

Kai Yi , Vignesh Vivekraja , Harshit Khaitan , Steven Li This is my paper

Pith reviewed 2026-06-29 22:51 UTC · model grok-4.3

classification 💻 cs.LG

keywords quantization-aware trainingstraight-through estimatorJacobian surrogatelow-bit quantizationLLM compressionvariance-reduced optimizationnon-convex convergence

0 comments

The pith

JacQuant replaces the straight-through estimator with a learned diagonal Jacobian surrogate to stabilize quantization-aware training and reach higher accuracy on LLMs at two bits and below.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Quantization-aware training typically relies on the straight-through estimator to pass gradients through non-differentiable quantizers, but this creates brittleness near bin boundaries and poor alignment with the final low-precision model. JacQuant instead learns a lightweight diagonal or block-diagonal surrogate of the model's local sensitivity to parameter changes and substitutes it into standard variance-reduced optimizers. The surrogate is data-driven, inexpensive to maintain, and leaves the forward quantizer unchanged. The paper proves convergence for non-convex objectives and linear rates under the PL condition, plus a calibration link between the surrogate and end-to-end output fidelity. On LLM benchmarks at ≤2 bits the method outperforms STE-based QAT while adding negligible runtime cost under practical group sizes.

Core claim

JacQuant learns a data-driven diagonal or block-diagonal approximation to the Jacobian of the model's output with respect to its parameters and uses this surrogate in place of the straight-through estimator during the backward pass, enabling stable training of ultra-low-bit models without any modification to the forward quantization operation.

What carries the argument

A learned diagonal or block-diagonal Jacobian surrogate that approximates the local sensitivity of model output to weight changes and is inserted into variance-reduced optimizers.

If this is right

Higher accuracy than STE-based QAT across LLM benchmarks at ≤2 bits.
Negligible added runtime cost under practical group sizes on various models.
Convergence guarantees for non-convex objectives and linear rates under the PL condition.
Drop-in compatibility with common weight and activation quantizers that leaves the forward pass unchanged.
A simple calibration argument relates the learned sensitivity directly to end-to-end output fidelity.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The inexpensive nature of the diagonal surrogate suggests it could be recomputed periodically during long training runs to track distribution shifts.
Similar learned sensitivity surrogates might be applied to other non-differentiable operations such as structured pruning or dynamic mixed-precision allocation.
Because the method requires no change to the forward quantizer, it could be combined with existing quantization libraries without code changes.
The calibration link between surrogate and output fidelity may extend to measuring how well other compression techniques preserve model behavior.

Load-bearing premise

The learned surrogate accurately approximates the model's local sensitivity to parameter changes and can be safely used inside standard variance-reduced optimizers without altering forward quantizer behavior.

What would settle it

On a small model where the true local Jacobian is computed exactly by automatic differentiation, replace JacQuant's learned surrogate with a version whose entries differ by more than a small constant factor and observe whether the accuracy advantage over STE disappears.

Figures

Figures reproduced from arXiv: 2605.25469 by Harshit Khaitan, Kai Yi, Steven Li, Vignesh Vivekraja.

**Figure 2.** Figure 2: Finite-difference gradient diagnostics. (a) JacQuant converges faster than STE and stabilizes within a smaller neighborhood; (b) variance of the coordinate-wise mismatch between a central-difference reference gradient and the approximate training gradients; (c) histogram of the learned group-wise Jacobian scalars bg at the target step, with the dashed line indicating the STE identity (b = 1). Scaling to ot… view at source ↗

read the original abstract

Quantization-aware training (QAT) is widely deployed but typically relies on the Straight-Through Estimator (STE), which passes gradients through non-differentiable quantizers by fiat. This often makes training brittle near bin boundaries and weakly aligned with the actual behavior of the low-precision model. We introduce JacQuant, a QAT framework that learns a lightweight surrogate of the model's local sensitivity to parameter changes and uses it to stabilize and accelerate training within standard variance-reduced optimizers. The surrogate is inexpensive (diagonal or block-diagonal), data-driven, and compatible with common weight and activation quantizers. On code-preserving training phases, we prove convergence for non-convex objectives and obtain linear rates under a PL condition, and we relate the learned sensitivity to end-to-end output fidelity via a simple calibration argument. Across LLM benchmarks at $\leq 2$ bits, JacQuant consistently reaches higher accuracy than STE-based QAT, and the runtime analyses on various models show that the added cost remains negligible under practical group sizes. The method is drop-in and requires no changes to the forward quantizers; our empirical claims are scoped to ultra-low-bit LLM QAT.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

JacQuant replaces STE with a learned lightweight Jacobian surrogate for QAT gradients, adds convergence theory scoped to code-preserving phases, and reports accuracy gains at 2 bits and below on LLMs with small overhead.

read the letter

The core contribution is a data-driven surrogate for the model's local sensitivity that gets plugged into standard variance-reduced optimizers instead of relying on the straight-through estimator. They keep the forward pass unchanged and train the surrogate to be diagonal or block-diagonal so the extra cost stays low. On the theory side they show convergence for non-convex objectives and linear rates under the PL condition, plus a calibration step that ties the surrogate back to output fidelity.

What stands out is the empirical scope: consistent accuracy improvements over STE-based QAT on LLM benchmarks at ≤2 bits, with runtime numbers showing the overhead is negligible at practical group sizes. The method is presented as drop-in, which is useful for people already running QAT pipelines.

The soft spots are mostly around verification. The convergence claims are limited to code-preserving phases, so it is not clear how much of a full training run this covers. The calibration argument that links the learned sensitivity to end-to-end fidelity is stated but would need a close look at the derivation to confirm it is not just restating the fitting objective. Because the surrogate is trained on the same data the model sees, there is always a risk that any reported gain partly reflects how well the surrogate was tuned rather than a fundamental improvement in gradient quality. The abstract does not compare against other learned surrogate or straight-through alternatives in detail, so the novelty relative to that literature is hard to judge from the summary alone.

This is a paper for researchers working on ultra-low-bit quantization for large models who already know the STE pain points. The combination of scoped theory and targeted experiments is enough to justify sending it to referees; the claims are narrow enough that a review can focus on whether the surrogate actually delivers independent signal and whether the runtime numbers hold under broader conditions.

Referee Report

0 major / 0 minor

Summary. The manuscript introduces JacQuant, a QAT framework that replaces the Straight-Through Estimator with a learned lightweight (diagonal or block-diagonal) surrogate of local parameter sensitivity. The surrogate is integrated into standard variance-reduced optimizers without altering forward quantizer behavior. Convergence is proved for non-convex objectives on code-preserving phases, with linear rates under the PL condition; a calibration argument relates the surrogate to end-to-end output fidelity. Empirically, the method yields higher accuracy than STE-based QAT on LLM benchmarks at ≤2 bits while adding negligible runtime cost under practical group sizes.

Significance. If the surrogate accurately captures sensitivity and the scoped theoretical results hold, the work supplies a practical, drop-in alternative to STE that improves training stability and fidelity alignment in ultra-low-bit LLM quantization. The negligible overhead and compatibility with common quantizers strengthen the practical contribution; the convergence analysis and calibration argument provide theoretical grounding within the stated scope.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive summary, significance assessment, and recommendation of minor revision. No specific major comments were raised in the report.

Circularity Check

0 steps flagged

No significant circularity; derivation and claims are self-contained

full rationale

The paper defines a data-driven surrogate Jacobian, applies it within standard variance-reduced optimizers, proves convergence on code-preserving phases using non-convex and PL analysis, and relates sensitivity to fidelity via a calibration step. These steps rely on external optimization theory and empirical benchmarks rather than reducing any prediction or result to a fitted input by construction. No self-citation chains, self-definitional loops, or renamed known results are present in the abstract or described claims. The empirical superiority over STE QAT is scoped to benchmarks and does not tautologically follow from the surrogate definition itself.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review yields no extractable free parameters, axioms, or invented entities. The surrogate itself may function as a learned entity, but no details on its training or independence are available.

pith-pipeline@v0.9.1-grok · 5743 in / 1091 out tokens · 22525 ms · 2026-06-29T22:51:27.393105+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

4 extracted references · 2 canonical work pages · 2 internal anchors

[1]

PACT: Parameterized Clipping Activation for Quantized Neural Networks

URLhttps://arxiv.org/abs/1805.06085. Clark, C., Lee, K., Chang, M.-W., Kwiatkowski, T., Collins, M., and Toutanova, K. Boolq: Exploring the surprising difficulty of natural yes/no questions. InProceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and S...

work page internal anchor Pith review Pith/arXiv arXiv 2019
[2]

DoReFa-Net: Training Low Bitwidth Convolutional Neural Networks with Low Bitwidth Gradients

URLhttps://arxiv.org/abs/1606.06160. 12 Contents 1 Introduction 1 2 Related Work 2 3 Method 3 3.1 Preliminaries: Grouped Quantization in LLMs . . . . . . . . . . . . . . . . . . . . 3 3.2 The Core Idea: Learning the Quantization Jacobian . . . . . . . . . . . . . . . . . 3 3.3 Practical Estimation ofB(W). . . . . . . . . . . . . . . . . . . . . . . . . . ...

work page internal anchor Pith review Pith/arXiv arXiv 2023
[3]

Despite empirical success, most QAT methods inheritSTE’s bias, which ignores discrete bin geometry and can induce optimization mismatch at 2–4 bits

learn per-layer step sizes, narrowing the accuracy gap at low bitwidths. Despite empirical success, most QAT methods inheritSTE’s bias, which ignores discrete bin geometry and can induce optimization mismatch at 2–4 bits. RecentSTE-free directions include proximal surrogates (e.g.,PV-T uning(Malinovskii et al., 2024)) and geometric transforms (e.g., theRo...

2024
[4]

interpretsVRasJacobian sketchingfor stochastic oracles, yielding low-variance quasi-gradient updates. A common and effective instantiation of this principle isSAGA-style memory (maintaining a table of historical per-index control variates), but suchdataset-sizedmemory is typically infeasible for LLM training.JacQuantadopts the control-variate principle bu...

2022

[1] [1]

PACT: Parameterized Clipping Activation for Quantized Neural Networks

URLhttps://arxiv.org/abs/1805.06085. Clark, C., Lee, K., Chang, M.-W., Kwiatkowski, T., Collins, M., and Toutanova, K. Boolq: Exploring the surprising difficulty of natural yes/no questions. InProceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and S...

work page internal anchor Pith review Pith/arXiv arXiv 2019

[2] [2]

DoReFa-Net: Training Low Bitwidth Convolutional Neural Networks with Low Bitwidth Gradients

URLhttps://arxiv.org/abs/1606.06160. 12 Contents 1 Introduction 1 2 Related Work 2 3 Method 3 3.1 Preliminaries: Grouped Quantization in LLMs . . . . . . . . . . . . . . . . . . . . 3 3.2 The Core Idea: Learning the Quantization Jacobian . . . . . . . . . . . . . . . . . 3 3.3 Practical Estimation ofB(W). . . . . . . . . . . . . . . . . . . . . . . . . . ...

work page internal anchor Pith review Pith/arXiv arXiv 2023

[3] [3]

Despite empirical success, most QAT methods inheritSTE’s bias, which ignores discrete bin geometry and can induce optimization mismatch at 2–4 bits

learn per-layer step sizes, narrowing the accuracy gap at low bitwidths. Despite empirical success, most QAT methods inheritSTE’s bias, which ignores discrete bin geometry and can induce optimization mismatch at 2–4 bits. RecentSTE-free directions include proximal surrogates (e.g.,PV-T uning(Malinovskii et al., 2024)) and geometric transforms (e.g., theRo...

2024

[4] [4]

interpretsVRasJacobian sketchingfor stochastic oracles, yielding low-variance quasi-gradient updates. A common and effective instantiation of this principle isSAGA-style memory (maintaining a table of historical per-index control variates), but suchdataset-sizedmemory is typically infeasible for LLM training.JacQuantadopts the control-variate principle bu...

2022