pith. sign in

arxiv: 2605.25469 · v1 · pith:NLC4HIYCnew · submitted 2026-05-25 · 💻 cs.LG

JacQuant: STE-Free Quantization-Aware Training via Learned Jacobian Surrogates

Pith reviewed 2026-06-29 22:51 UTC · model grok-4.3

classification 💻 cs.LG
keywords quantization-aware trainingstraight-through estimatorJacobian surrogatelow-bit quantizationLLM compressionvariance-reduced optimizationnon-convex convergence
0
0 comments X

The pith

JacQuant replaces the straight-through estimator with a learned diagonal Jacobian surrogate to stabilize quantization-aware training and reach higher accuracy on LLMs at two bits and below.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Quantization-aware training typically relies on the straight-through estimator to pass gradients through non-differentiable quantizers, but this creates brittleness near bin boundaries and poor alignment with the final low-precision model. JacQuant instead learns a lightweight diagonal or block-diagonal surrogate of the model's local sensitivity to parameter changes and substitutes it into standard variance-reduced optimizers. The surrogate is data-driven, inexpensive to maintain, and leaves the forward quantizer unchanged. The paper proves convergence for non-convex objectives and linear rates under the PL condition, plus a calibration link between the surrogate and end-to-end output fidelity. On LLM benchmarks at ≤2 bits the method outperforms STE-based QAT while adding negligible runtime cost under practical group sizes.

Core claim

JacQuant learns a data-driven diagonal or block-diagonal approximation to the Jacobian of the model's output with respect to its parameters and uses this surrogate in place of the straight-through estimator during the backward pass, enabling stable training of ultra-low-bit models without any modification to the forward quantization operation.

What carries the argument

A learned diagonal or block-diagonal Jacobian surrogate that approximates the local sensitivity of model output to weight changes and is inserted into variance-reduced optimizers.

If this is right

  • Higher accuracy than STE-based QAT across LLM benchmarks at ≤2 bits.
  • Negligible added runtime cost under practical group sizes on various models.
  • Convergence guarantees for non-convex objectives and linear rates under the PL condition.
  • Drop-in compatibility with common weight and activation quantizers that leaves the forward pass unchanged.
  • A simple calibration argument relates the learned sensitivity directly to end-to-end output fidelity.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The inexpensive nature of the diagonal surrogate suggests it could be recomputed periodically during long training runs to track distribution shifts.
  • Similar learned sensitivity surrogates might be applied to other non-differentiable operations such as structured pruning or dynamic mixed-precision allocation.
  • Because the method requires no change to the forward quantizer, it could be combined with existing quantization libraries without code changes.
  • The calibration link between surrogate and output fidelity may extend to measuring how well other compression techniques preserve model behavior.

Load-bearing premise

The learned surrogate accurately approximates the model's local sensitivity to parameter changes and can be safely used inside standard variance-reduced optimizers without altering forward quantizer behavior.

What would settle it

On a small model where the true local Jacobian is computed exactly by automatic differentiation, replace JacQuant's learned surrogate with a version whose entries differ by more than a small constant factor and observe whether the accuracy advantage over STE disappears.

Figures

Figures reproduced from arXiv: 2605.25469 by Harshit Khaitan, Kai Yi, Steven Li, Vignesh Vivekraja.

Figure 1
Figure 1. Figure 1: Gradient propagation schemes in quantized optimization. [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Finite-difference gradient diagnostics. (a) JacQuant converges faster than STE and stabilizes within a smaller neighborhood; (b) variance of the coordinate-wise mismatch between a central-difference reference gradient and the approximate training gradients; (c) histogram of the learned group-wise Jacobian scalars bg at the target step, with the dashed line indicating the STE identity (b = 1). Scaling to ot… view at source ↗
read the original abstract

Quantization-aware training (QAT) is widely deployed but typically relies on the Straight-Through Estimator (STE), which passes gradients through non-differentiable quantizers by fiat. This often makes training brittle near bin boundaries and weakly aligned with the actual behavior of the low-precision model. We introduce JacQuant, a QAT framework that learns a lightweight surrogate of the model's local sensitivity to parameter changes and uses it to stabilize and accelerate training within standard variance-reduced optimizers. The surrogate is inexpensive (diagonal or block-diagonal), data-driven, and compatible with common weight and activation quantizers. On code-preserving training phases, we prove convergence for non-convex objectives and obtain linear rates under a PL condition, and we relate the learned sensitivity to end-to-end output fidelity via a simple calibration argument. Across LLM benchmarks at $\leq 2$ bits, JacQuant consistently reaches higher accuracy than STE-based QAT, and the runtime analyses on various models show that the added cost remains negligible under practical group sizes. The method is drop-in and requires no changes to the forward quantizers; our empirical claims are scoped to ultra-low-bit LLM QAT.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 0 minor

Summary. The manuscript introduces JacQuant, a QAT framework that replaces the Straight-Through Estimator with a learned lightweight (diagonal or block-diagonal) surrogate of local parameter sensitivity. The surrogate is integrated into standard variance-reduced optimizers without altering forward quantizer behavior. Convergence is proved for non-convex objectives on code-preserving phases, with linear rates under the PL condition; a calibration argument relates the surrogate to end-to-end output fidelity. Empirically, the method yields higher accuracy than STE-based QAT on LLM benchmarks at ≤2 bits while adding negligible runtime cost under practical group sizes.

Significance. If the surrogate accurately captures sensitivity and the scoped theoretical results hold, the work supplies a practical, drop-in alternative to STE that improves training stability and fidelity alignment in ultra-low-bit LLM quantization. The negligible overhead and compatibility with common quantizers strengthen the practical contribution; the convergence analysis and calibration argument provide theoretical grounding within the stated scope.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive summary, significance assessment, and recommendation of minor revision. No specific major comments were raised in the report.

Circularity Check

0 steps flagged

No significant circularity; derivation and claims are self-contained

full rationale

The paper defines a data-driven surrogate Jacobian, applies it within standard variance-reduced optimizers, proves convergence on code-preserving phases using non-convex and PL analysis, and relates sensitivity to fidelity via a calibration step. These steps rely on external optimization theory and empirical benchmarks rather than reducing any prediction or result to a fitted input by construction. No self-citation chains, self-definitional loops, or renamed known results are present in the abstract or described claims. The empirical superiority over STE QAT is scoped to benchmarks and does not tautologically follow from the surrogate definition itself.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review yields no extractable free parameters, axioms, or invented entities. The surrogate itself may function as a learned entity, but no details on its training or independence are available.

pith-pipeline@v0.9.1-grok · 5743 in / 1091 out tokens · 22525 ms · 2026-06-29T22:51:27.393105+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

4 extracted references · 2 canonical work pages · 2 internal anchors

  1. [1]

    PACT: Parameterized Clipping Activation for Quantized Neural Networks

    URLhttps://arxiv.org/abs/1805.06085. Clark, C., Lee, K., Chang, M.-W., Kwiatkowski, T., Collins, M., and Toutanova, K. Boolq: Exploring the surprising difficulty of natural yes/no questions. InProceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and S...

  2. [2]

    DoReFa-Net: Training Low Bitwidth Convolutional Neural Networks with Low Bitwidth Gradients

    URLhttps://arxiv.org/abs/1606.06160. 12 Contents 1 Introduction 1 2 Related Work 2 3 Method 3 3.1 Preliminaries: Grouped Quantization in LLMs . . . . . . . . . . . . . . . . . . . . 3 3.2 The Core Idea: Learning the Quantization Jacobian . . . . . . . . . . . . . . . . . 3 3.3 Practical Estimation ofB(W). . . . . . . . . . . . . . . . . . . . . . . . . . ...

  3. [3]

    Despite empirical success, most QAT methods inheritSTE’s bias, which ignores discrete bin geometry and can induce optimization mismatch at 2–4 bits

    learn per-layer step sizes, narrowing the accuracy gap at low bitwidths. Despite empirical success, most QAT methods inheritSTE’s bias, which ignores discrete bin geometry and can induce optimization mismatch at 2–4 bits. RecentSTE-free directions include proximal surrogates (e.g.,PV-T uning(Malinovskii et al., 2024)) and geometric transforms (e.g., theRo...

  4. [4]

    interpretsVRasJacobian sketchingfor stochastic oracles, yielding low-variance quasi-gradient updates. A common and effective instantiation of this principle isSAGA-style memory (maintaining a table of historical per-index control variates), but suchdataset-sizedmemory is typically infeasible for LLM training.JacQuantadopts the control-variate principle bu...