Recognition: 2 Lean theorem links
Compander-Aligned Query Geometry for Quantized Zeroth-Order Optimization
Pith reviewed 2026-05-12 03:48 UTC · model grok-4.3
The pith
Aligning zeroth-order queries to the compander grid makes query-time residuals exactly zero.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central discovery is that for a quantizer Q = φ^{-1} ∘ U ∘ φ, forming Rademacher stencils z ± Δr, where z = φ(x) and r is a Rademacher direction, and mapping the endpoints back to x-space via φ^{-1} removes the grid-span mismatch. The theory decomposes the estimator residuals and proves stationarity bounds free of the residual channel that generic off-grid queries exhibit. Experiments on synthetic functions isolate the channel and confirm its absence under CAQ-ZO, while practical NF4 fine-tuning of Qwen and Llama models outperforms the unaligned baseline.
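To fix ideas, here is a minimal Python sketch of the composition Q = φ^{-1} ∘ U ∘ φ. The μ-law compander is our illustrative choice of φ (the paper's experiments use the NF4 codebook), and `MU`, `delta`, and the function names are assumptions of this sketch, not the paper's code.

```python
import numpy as np

# Illustrative compander: mu-law (ITU-T G.711 style), used here only as a
# concrete monotone phi on [-1, 1]; the paper's quantizer is NF4 instead.
MU = 255.0

def phi(x):
    # Forward compander: x in [-1, 1] -> z in [-1, 1].
    return np.sign(x) * np.log1p(MU * np.abs(x)) / np.log1p(MU)

def phi_inv(z):
    # Inverse compander.
    return np.sign(z) * np.expm1(np.abs(z) * np.log1p(MU)) / MU

def U(z, delta):
    # Uniform grid quantizer with step delta in the companded coordinate.
    return delta * np.round(z / delta)

def Q(x, delta):
    # Scalar nonuniform quantizer as the composition Q = phi^{-1} o U o phi.
    return phi_inv(U(phi(x), delta))
```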
What carries the argument
Compander-aligned query (CAQ) geometry: one-grid-step Rademacher stencils built in the uniform transformed domain before inverse companding.
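A sketch of one CAQ-ZO iteration under the same idealized model, reusing `phi` and `phi_inv` from the snippet above; the learning-rate handling and the update-in-z convention are simplified from the abstract's description, so read this as an illustration under stated assumptions, not the paper's algorithm.

```python
import numpy as np

def caq_zo_step(loss, x, delta, lr, rng):
    # One CAQ-ZO update (sketch). `loss` maps a weight vector to a scalar.
    # The stencil is one grid step wide in the companded coordinate, so if
    # z lies on the uniform grid, both endpoints z +/- delta*r do too, and
    # the low-precision engine rounds them to themselves.
    z = phi(x)                                   # move to the companded domain
    r = rng.choice([-1.0, 1.0], size=x.shape)    # Rademacher direction
    x_plus = phi_inv(z + delta * r)              # query endpoints in x-space
    x_minus = phi_inv(z - delta * r)
    g = (loss(x_plus) - loss(x_minus)) / (2.0 * delta)  # two-point estimate
    z_new = z - lr * g * r                       # update in z, per the abstract
    return phi_inv(z_new)
```

With Δ equal to the grid step of U, both query endpoints land exactly on codebook points; that is the geometric content of the zero-residual claim.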
If this is right
- Generic off-grid queries retain a Δ²/μ² residual channel in stationarity bounds (see the schematic bound after this list).
- CAQ-ZO achieves exactly zero query-time residual for the same nonuniform quantizer.
- The approach improves fine-tuning results for NF4-quantized Qwen and Llama under fixed budget.
- Query geometry is the key to predicting and controlling ZO behavior in quantized settings.
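For intuition, a schematic of the bound shape referenced in the first bullet, with constants and smoothness factors suppressed; this is our illustrative rendering, not the paper's exact statement (μ is the finite-difference radius, Δ the grid step, T the query budget):

```latex
\min_{t \le T} \mathbb{E}\,\|\nabla f(x_t)\|^{2} \;\lesssim\;
\underbrace{\tfrac{1}{\sqrt{T}}}_{\text{ZO rate}}
\;+\; \underbrace{\mu^{2}}_{\text{smoothing bias}}
\;+\; \underbrace{\tfrac{\Delta^{2}}{\mu^{2}}}_{\text{query-time residual: nonzero off-grid, zero under CAQ-ZO}}
```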
Where Pith is reading between the lines
- This alignment technique may apply to other low-precision derivative-free methods.
- It underscores the need to match query design to quantization geometry in hardware-constrained optimization.
- Scalability to larger models and different quantizers remains to be explored in follow-up work.
Load-bearing premise
Nonuniform quantization can be exactly represented as the composition Q = φ^{-1} ∘ U ∘ φ, with the stationarity bounds holding for the NF4 quantizer in the experiments.
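The premise is mild for scalar codebooks: any strictly increasing level set admits such a φ, for instance the piecewise-linear map sending the k-th level to the integer k, under which the codebook becomes a unit-step uniform grid. A sketch with a placeholder codebook (the levels below are not the actual NF4 values):

```python
import numpy as np

# Placeholder monotone codebook; the real NF4 levels differ.
levels = np.array([-1.0, -0.7, -0.45, -0.25, -0.1, 0.0, 0.1, 0.25, 0.5, 1.0])

def phi_pl(x):
    # Piecewise-linear compander: maps the k-th level to integer k, so the
    # codebook becomes the uniform grid {0, ..., K-1} with step Delta = 1.
    return np.interp(x, levels, np.arange(len(levels), dtype=float))

def phi_pl_inv(z):
    return np.interp(z, np.arange(len(levels), dtype=float), levels)

# Q = phi^{-1} o U o phi with Delta = 1 snaps to the nearest level *in z*,
# which generally differs from nearest-in-x: exactly the geometry at issue.
x = 0.3
print(phi_pl_inv(np.round(phi_pl(x))))  # -> 0.25
```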
What would settle it
A direct measurement of the estimator residual or stationarity gap on a controlled quantized problem, expecting the predicted nonzero channel for off-grid queries and zero for CAQ-ZO.
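A sketch of that measurement under the idealized model, reusing `phi`, `phi_inv`, and `U` from the first snippet; the quadratic test loss and the fixed weight-space radius are our choices, not the paper's protocol.

```python
import numpy as np

rng = np.random.default_rng(0)
delta = 1.0 / 16.0                      # grid step in the companded coordinate
F = lambda x: 0.5 * np.sum(x ** 2)      # controlled test loss

def loss_diff_residual(z, r, use_caq):
    # Residual between the loss difference seen through the quantizer and the
    # unquantized one; dividing by the stencil span gives the estimator residual.
    if use_caq:
        zp, zm = z + delta * r, z - delta * r          # endpoints on the z-grid
    else:
        x = phi_inv(z)
        mu_r = 0.01                                    # fixed weight-space radius
        zp, zm = phi(x + mu_r * r), phi(x - mu_r * r)  # generically off-grid
    quantized = F(phi_inv(U(zp, delta))) - F(phi_inv(U(zm, delta)))
    exact = F(phi_inv(zp)) - F(phi_inv(zm))
    return abs(quantized - exact)

z = U(rng.uniform(-0.9, 0.9, size=8), delta)           # start on the grid
r = rng.choice([-1.0, 1.0], size=8)
print("CAQ residual:     ", loss_diff_residual(z, r, True))    # exactly 0.0
print("off-grid residual:", loss_diff_residual(z, r, False))   # generically > 0
```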
Original abstract
Low-bit forward evaluation is an attractive route to memory-efficient zeroth-order (ZO) adaptation: the optimizer needs only scalar losses, and the model can be queried near deployment precision. The obstacle is that a quantized ZO query is not a continuous finite difference followed by harmless storage rounding. The query chooses endpoints, the low-precision engine rounds them, and the loss difference is measured along the rounded chord. For nonuniform companding quantizers, this makes the codebook insufficient to predict ZO behavior: a fixed weight-space radius can collapse in dense cells, over-span sparse cells, or assign a rounded chord to an unrounded update direction. We identify the missing object as query geometry and model scalar nonuniform quantization as $Q = \phi^{-1} \circ U \circ \phi$. CAQ-ZO (Compander-Aligned Queries for Zeroth-Order Optimization) forms one-grid-step Rademacher stencils $z \pm \Delta r$ in $z = \phi(x)$, maps endpoints back through $\phi^{-1}$, and updates in $z$. Our theory proves the grid-span mismatch, decomposes endpoint-rounding estimator residuals, and gives stationarity bounds in which generic off-grid queries retain a $\Delta^2/\mu^2$ residual channel while CAQ-ZO makes the query-time residual exactly zero. Synthetic experiments isolate this channel, and matched NF4 Qwen/Llama fine-tuning shows that CAQ-ZO improves the trained NF4 baseline under the same quantizer and evaluation budget.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes CAQ-ZO for quantized zeroth-order optimization. It models scalar nonuniform quantization exactly as the composition Q = φ^{-1} ∘ U ∘ φ, identifies a grid-span mismatch in query geometry, decomposes endpoint-rounding residuals in the finite-difference estimator, and derives stationarity bounds showing that generic off-grid queries retain a Δ²/μ² residual channel while CAQ-ZO (one-grid-step Rademacher stencils in the companded domain) makes the query-time residual exactly zero. Synthetic experiments isolate the channel and matched NF4 fine-tuning on Qwen/Llama models reports gains over the quantized baseline under fixed quantizer and evaluation budget.
Significance. If the central claims hold, the work supplies a principled, low-overhead correction for quantization-induced bias in ZO gradient estimates that is directly applicable to memory-efficient adaptation of large models. The explicit decomposition of residuals and the parameter-free zero-residual guarantee under the stated model are technically clean contributions; the real-model NF4 experiments add practical weight. The approach could inform the design of future quantized ZO and related low-precision optimizers.
major comments (2)
- [Theory (stationarity bounds derivation)] The stationarity bounds and the claim that CAQ-ZO achieves exactly zero query-time residual are derived under the exact representation Q = φ^{-1} ∘ U ∘ φ. Practical NF4 (with block scaling, clipping, and non-ideal rounding) may introduce additional unmodeled terms in the finite-difference estimator; the manuscript must either prove that these terms remain negligible or bound their effect on the residual channel, as this assumption is load-bearing for the 'exactly zero' result (see the block-scaling sketch after these comments).
- [Experiments] The experimental section reports that synthetic runs isolate the residual channel and that NF4 Qwen/Llama fine-tuning shows gains, yet the provided description lacks explicit error-bar statistics, number of independent runs, and data-exclusion criteria. Without these, it is impossible to confirm that the observed improvements are statistically robust and not sensitive to post-hoc choices.
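To make the first concern concrete, here is a toy sketch of NF4-style block quantization: per-block absmax scaling means each block sees its own rescaled compander, so no single fixed Q = φ^{-1} ∘ U ∘ φ describes the whole tensor. The block size and codebook are placeholders, not the actual NF4 specification.

```python
import numpy as np

def block_quantize(x, codebook, block=4):
    # NF4-style sketch: per-block absmax scale, then nearest-codebook rounding.
    # The effective scalar quantizer changes with each block's scale s, which
    # is the unmodeled term the referee asks the authors to bound.
    out = np.empty_like(x)
    for i in range(0, len(x), block):
        blk = x[i:i + block]
        s = np.max(np.abs(blk))
        s = s if s > 0 else 1.0                  # avoid dividing by zero
        idx = np.argmin(np.abs(blk[:, None] / s - codebook[None, :]), axis=1)
        out[i:i + block] = s * codebook[idx]
    return out

codebook = np.linspace(-1.0, 1.0, 16)            # placeholder, not NF4 levels
x = np.array([0.02, 0.03, -0.01, 0.9, 0.02, 0.03, -0.01, 0.04])
print(block_quantize(x, codebook))
# The repeated value 0.02 quantizes differently in the two blocks because the
# per-block scale differs; a single global compander cannot reproduce this.
```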
minor comments (1)
- [Abstract and §2] Notation for the compander φ and the grid step Δ should be introduced with a single forward reference to the model equation to avoid repeated re-definition.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed review. The comments highlight important aspects of the theoretical assumptions and experimental reporting. We address each major comment below, indicating the revisions we plan to incorporate.
Point-by-point responses
Referee: [Theory (stationarity bounds derivation)] The stationarity bounds and the claim that CAQ-ZO achieves exactly zero query-time residual are derived under the exact representation Q = φ^{-1} ∘ U ∘ φ. Practical NF4 (with block scaling, clipping, and non-ideal rounding) may introduce additional unmodeled terms in the finite-difference estimator; the manuscript must either prove that these terms remain negligible or bound their effect on the residual channel, as this assumption is load-bearing for the 'exactly zero' result.
Authors: We agree that the stationarity bounds and the exact-zero residual guarantee are derived under the idealized model Q = φ^{-1} ∘ U ∘ φ, which captures the core nonuniform companding behavior. The manuscript already notes that this is an exact representation for the scalar quantizer without block scaling. For practical NF4, block-wise scaling, clipping, and non-ideal rounding introduce secondary perturbations. Our synthetic experiments isolate the grid-span mismatch residual under the model, while the NF4 fine-tuning results on Qwen and Llama demonstrate that CAQ-ZO still yields measurable gains over the quantized baseline under identical quantizer and budget. In the revision we will add a new subsection that (i) explicitly states the scope of the idealized model, (ii) derives a first-order bound showing that the additional residual terms from block scaling and clipping contribute at most O(Δ) to the estimator (rather than inflating the Δ²/μ² channel), and (iii) reports an empirical ablation on a small model confirming that these terms remain small relative to the compander-induced residual for typical NF4 block sizes. This addresses the load-bearing nature of the assumption without overstating the guarantee. revision: partial
Referee: [Experiments] The experimental section reports that synthetic runs isolate the residual channel and that NF4 Qwen/Llama fine-tuning shows gains, yet the provided description lacks explicit error-bar statistics, number of independent runs, and data-exclusion criteria. Without these, it is impossible to confirm that the observed improvements are statistically robust and not sensitive to post-hoc choices.
Authors: We acknowledge that the current manuscript does not report error bars, the number of independent runs, or data-exclusion criteria, which limits assessment of statistical robustness. In the revised version we will expand the experimental section to include: results averaged over 5 independent runs with standard-error bars for both synthetic and NF4 fine-tuning experiments; explicit statement that no data points or runs were excluded; and the random seeds used for reproducibility. The synthetic isolation experiments will additionally report variance across multiple random quantization grids. These additions will make the statistical claims verifiable. revision: yes
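A minimal sketch of the promised reporting, with placeholder scores (only the mean-and-standard-error computation is shown):

```python
import numpy as np

# Mean and standard error over 5 independent seeded runs; the scores below
# are placeholders, not results from the paper.
scores = np.array([71.2, 70.8, 71.5, 70.9, 71.1])   # one metric, 5 runs
mean = scores.mean()
stderr = scores.std(ddof=1) / np.sqrt(len(scores))
print(f"{mean:.2f} +/- {stderr:.2f} (n={len(scores)}, standard error)")
```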
Circularity Check
No significant circularity; the derivation follows directly from the explicit model and standard assumptions.
Full rationale
The paper adopts the quantization representation Q = φ^{-1} ∘ U ∘ φ as an explicit modeling assumption and derives the grid-span mismatch, residual decomposition, and stationarity bounds (including the Δ²/μ² channel for off-grid queries and exact zero for CAQ-ZO) from this model combined with standard ZO finite-difference analysis. CAQ-ZO is defined to place stencils on the uniform grid in the companded space z = φ(x), so the zero-residual property holds by direct substitution into the model rather than by fitting or self-referential closure. No load-bearing self-citations, no parameters fitted to data then relabeled as predictions, and no uniqueness theorems imported from prior author work. Synthetic experiments isolate the modeled channel while NF4 runs use the same quantizer family as the assumption; the chain is self-contained against the stated model.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Nonuniform quantization is exactly representable as Q = φ^{-1} ∘ U ∘ φ for some compander φ.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel (tag: unclear; the relation between the paper passage and the cited Recognition theorem is ambiguous)
  Paper passage: "We write scalar nonuniform quantization as a companding quantizer Q = φ^{-1} ∘ U ∘ φ, where U is a uniform grid in the coordinate z = φ(x). ... CAQ-ZO forms one-grid-step Rademacher stencils z ± Δr in z = φ(x)."
- IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction (tag: unclear; the relation between the paper passage and the cited Recognition theorem is ambiguous)
  Paper passage: "Theorem 2 (Endpoint-Rounding Estimator Residual) ... for CAQ-ZO endpoints z ± Δr_k, if z ∈ G, then ∇̂(F ∘ Q ∘ φ^{-1})(z) − ∇̂(F ∘ φ^{-1})(z) = 0."
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [1] Noga Bar and Raja Giryes. ZOQO: Zero-order quantized optimization. In Proc. ICASSP, 2025.
- [2] Yelysei Bondarenko, Riccardo Del Chiaro, and Markus Nagel. Low-rank quantization-aware training for LLMs. arXiv:2406.06385, 2024.
- [3] Mengzhao Chen, Wenqi Shao, Peng Xu, Jiahao Wang, Peng Gao, Kaipeng Zhang, and Ping Luo. EfficientQAT: Efficient quantization-aware training for large language models. In Proc. ACL, 2025.
- [4] Zeshuai Deng, Guohao Chen, Shuaicheng Niu, Hui Luo, Shuhai Zhang, Yifan Yang, Renjie Chen, Wei Luo, and Mingkui Tan. Test-time model adaptation for quantized neural networks. In Proc. ACM MM, 2025.
- [5] Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer. QLoRA: Efficient finetuning of quantized LLMs. In Proc. NeurIPS, 2023.
- [6] John C. Duchi, Michael I. Jordan, Martin J. Wainwright, and Andre Wibisono. Optimal rates for zero-order convex optimization: The power of two function evaluations. IEEE Trans. Inf. Theory, 61(5):2788–2806, 2015.
- [7] Chen Feng, Shaojie Zhuo, Xiaopeng Zhang, Ramchalam Kinattinkara Ramakrishnan, Zhaocong Yuan, and Andrew Zou Li. Stepping forward on the last mile. In Proc. NeurIPS, 2024.
- [8] Yasong Feng and Tianyu Wang. Stochastic zeroth-order gradient and Hessian estimators: Variance reduction and refined bias bounds. Inf. Inference, 12(3):1514–1545, 2023.
- [9] Elias Frantar, Saleh Ashkboos, Torsten Hoefler, and Dan Alistarh. OPTQ: Accurate quantization for generative pre-trained transformers. In Proc. ICLR, 2023.
- [10] Saeed Ghadimi and Guanghui Lan. Stochastic first- and zeroth-order methods for nonconvex stochastic programming. SIAM J. Optim., 23(4):2341–2368, 2013.
- [11] Robert M. Gray and David L. Neuhoff. Quantization. IEEE Trans. Inf. Theory, 44(6):2325–2383, 1998.
- [12] Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models. In Proc. ICLR, 2022.
- [13] ITU-T. G.711: Pulse code modulation (PCM) of voice frequencies. Recommendation ITU-T G.711, International Telecommunication Union, 1988.
- [14] Benoit Jacob, Skirmantas Kligys, Bo Chen, Menglong Zhu, Matthew Tang, Andrew Howard, Hartwig Adam, and Dmitry Kalenichenko. Quantization and training of neural networks for efficient integer-arithmetic-only inference. In Proc. CVPR, 2018.
- [15] N. S. Jayant and Peter Noll. Digital Coding of Waveforms: Principles and Applications to Speech and Video. Prentice-Hall, 1984.
- [16] Yixiao Li, Yifan Yu, Chen Liang, Nikos Karampatziakis, Pengcheng He, Weizhu Chen, and Tuo Zhao. LoftQ: LoRA-fine-tuning-aware quantization for large language models. In Proc. ICLR, 2024.
- [17] Ji Lin, Jiaming Tang, Haotian Tang, Shang Yang, Wei-Ming Chen, Wei-Chen Wang, Guangxuan Xiao, Xingyu Dang, Chuang Gan, and Song Han. AWQ: Activation-aware weight quantization for on-device LLM compression and acceleration. In Proc. MLSys, 2024.
- [18] Stuart P. Lloyd. Least squares quantization in PCM. IEEE Trans. Inf. Theory, 28(2):129–137, 1982.
- [19] Sadhika Malladi, Tianyu Gao, Eshaan Nichani, Alex Damian, Jason D. Lee, Danqi Chen, and Sanjeev Arora. Fine-tuning language models with just forward passes. In Proc. NeurIPS, 2023.
- [20] Joel Max. Quantizing for minimum distortion. IRE Trans. Inf. Theory, 6(1):7–12, 1960.
- [21] Yurii Nesterov and Vladimir Spokoiny. Random gradient-free minimization of convex functions. Found. Comput. Math., 17(2):527–566, 2017.
- [22] Qwen Team. Qwen2.5 technical report. arXiv:2412.15115, 2025.
- [23] Ohad Shamir. An optimal algorithm for bandit and zero-order convex optimization with two-point feedback. JMLR, 18(52):1–11, 2017.
- [24] Sifeng Shang, Jiayi Zhou, Chenyu Lin, Minxian Li, and Kaiyang Zhou. Fine-tuning quantized neural networks with zeroth-order optimization. In Proc. ICLR, 2026.
- [25] James C. Spall. Multivariate stochastic approximation using a simultaneous perturbation gradient approximation. IEEE Trans. Automat. Control, 37(3):332–341, 1992.
- [26] Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Dan Bikel, Lukas Blecher, Cristian Canton Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller, Cynthia Gao, Vedanuj Goswami, Naman Goyal, Anthony Harts... Llama 2: Open Foundation and Fine-Tuned Chat Models. arXiv, 2023.
- [27] Yinggan Xu, Risto Miikkulainen, and Xin Qiu. Quantized evolution strategies: High-precision fine-tuning of quantized LLMs at low-precision cost. arXiv:2602.03120, 2026.
- [28] Yequan Zhao, Hai Li, Ian Young, and Zheng Zhang. Poor man's training on MCUs: A memory-efficient quantized back-propagation-free approach. arXiv:2411.05873, 2024.
- [29] Jiajun Zhou, Yifan Yang, Kai Zhen, Ziyue Liu, Yequan Zhao, Ershad Banijamali, Athanasios Mouchtaris, Ngai Wong, and Zheng Zhang. QuZO: Quantized zeroth-order fine-tuning for large language models. In Proc. EMNLP, 2025.