arxiv: 2602.20309 · v4 · submitted 2026-02-23 · 💻 cs.LG

QuantVLA: Scale-Calibrated Post-Training Quantization for Vision-Language-Action Models

Jingxuan Zhang , Yunta Hsieh , Zhongwei Wan , Haokun Lin , Xin Wang , Ziqi Wang , Yingtie Lei , Mi Zhang

Pith reviewed 2026-05-15 20:16 UTC · model grok-4.3

classification 💻 cs.LG

keywords: post-training quantization, vision-language-action models, diffusion transformer, embodied AI, low-bit inference, LIBERO benchmark, scale calibration

The pith

QuantVLA is a post-training quantization method for vision-language-action models that matches or exceeds full-precision task success while cutting quantized-component memory by about 70 percent.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents QuantVLA as the first training-free post-training quantization framework for VLA models, including the first successful quantization of a diffusion transformer action head. It applies three scale-calibrated techniques: a selective layout that integerizes linear layers but keeps attention projections in float, per-head temperature scaling folded into dequantization, and per-layer residual balancing to correct output drift. These steps use only a small unlabeled calibration buffer, leave the model architecture unchanged, and support integer kernels for weights and activations. On representative VLA models evaluated on LIBERO, the quantized versions surpass full-precision baselines in task success rate.

Core claim

QuantVLA shows that careful per-head and per-layer scale calibration during post-training quantization allows VLA models to be reduced to low-bit weights and activations without retraining, yielding both memory savings on the quantized parts and higher task success rates than the original full-precision models on the LIBERO benchmark.

What carries the argument

The combination of selective quantization layout, attention temperature matching, and output head balancing, which together set and fold scales to stabilize attention logits and residual energy after quantization.
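
A minimal sketch of how the two calibration ratios quoted in the Lean-theorem section of this page, αraw = Std(LT)/Std(LQ) for attention temperature matching and βraw(l) = RMS(ZT,l)/RMS(ZQ,l) for output head balancing, could be computed from a calibration buffer. Reading T as the full-precision model and Q as the quantized one is an assumption, as are all function names and the flat-list data layout. Any post-processing of the raw ratios that the paper may apply is not shown on this page and is not modeled here.

```python
import math
import statistics


def atm_alpha_raw(logits_fp, logits_q):
    """Per-head ratio Std(L_T) / Std(L_Q) over calibration attention logits."""
    return statistics.pstdev(logits_fp) / statistics.pstdev(logits_q)


def rms(values):
    """Root mean square of a flat list of activations."""
    return math.sqrt(sum(v * v for v in values) / len(values))


def ohb_beta_raw(z_fp, z_q):
    """Per-layer ratio RMS(Z_T,l) / RMS(Z_Q,l) at the residual interface."""
    return rms(z_fp) / rms(z_q)


def fold_into_scale(dequant_scale, factor):
    """Fold a calibration factor into an existing dequantization scale.

    Multiplying the scale once offline is equivalent to multiplying every
    dequantized value by `factor` at inference, so no operator is added.
    """
    return dequant_scale * factor


# Hypothetical toy values (not from the paper): the quantized logits are a
# shrunken copy of the full-precision ones.
logits_fp = [-2.0, -1.0, 0.0, 1.0, 2.0]
logits_q = [0.8 * x for x in logits_fp]
alpha = atm_alpha_raw(logits_fp, logits_q)   # a ratio above 1 restores the spread
rescaled = [alpha * x for x in logits_q]
```

In this toy case the rescaled logits recover the full-precision spread exactly, because the distortion was a pure scaling; real quantization noise would only be matched in its second moment.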

If this is right

VLA models become deployable on hardware with tight memory and power budgets while retaining or improving control performance.
Integer arithmetic kernels can replace floating-point ones for the majority of linear operations without architectural changes.
Scaling to longer-horizon or larger-backbone VLA systems becomes feasible under the same compute constraints.
No retraining step is required, so existing pretrained checkpoints can be quantized directly for new hardware targets.
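
To make the integer-kernel point concrete, here is a small sketch of symmetric per-tensor quantization with an integer dot product and a single dequantization multiply. This is the generic textbook scheme, not QuantVLA's kernel; the bit-width, the per-tensor granularity, and every name below are illustrative assumptions.

```python
def quantize_symmetric(values, bits=8):
    """Map floats to signed integers with one shared scale (symmetric, per-tensor)."""
    qmax = 2 ** (bits - 1) - 1               # 127 for 8 bits
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    ints = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return ints, scale


def int_dot(x_int, w_int):
    """Accumulate in integer arithmetic only."""
    return sum(a * b for a, b in zip(x_int, w_int))


def quantized_dot(x, w, bits=8):
    """Integer dot product followed by one floating-point dequantization."""
    x_int, sx = quantize_symmetric(x, bits)
    w_int, sw = quantize_symmetric(w, bits)
    return int_dot(x_int, w_int) * (sx * sw)


# Illustrative inputs only.
x = [0.5, -1.0, 0.25, 2.0]
w = [1.0, 0.5, -2.0, 0.75]
exact = sum(a * b for a, b in zip(x, w))
approx = quantized_dot(x, w)
```

The product of the two scales is the only floating-point multiply on the output path, which is also the natural place to fold in an extra calibration factor.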

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same calibration approach may transfer to other multimodal diffusion-based action heads in robotics or simulation.
The observed performance gains hint that quantization noise can sometimes regularize the policy; this could be tested by ablating the calibration steps.
Memory savings of this magnitude would allow on-device inference loops at higher frame rates for real-world robot control.

Load-bearing premise

A small unlabeled calibration buffer is representative enough to set per-head temperature scales and per-layer residual balances without introducing distribution shift that would degrade performance on unseen tasks.

What would settle it

Measure task success rates of the quantized model on a new set of LIBERO-style tasks drawn from environments absent from the calibration buffer and check whether rates fall below the full-precision baseline.
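
One way to frame this check, sketched under stated assumptions: treat each rollout as a Bernoulli trial, compute a Wilson score interval per model, and flag a regression only when the quantized model's interval lies entirely below the baseline's. The Wilson interval and the decision rule are choices made here for illustration; this page does not describe the paper's evaluation protocol, and the function names are hypothetical.

```python
import math


def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial success rate (z=1.96 is roughly 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return center - half, center + half


def clearly_below(quant_interval, baseline_interval):
    """True when the quantized interval sits entirely below the baseline interval."""
    return quant_interval[1] < baseline_interval[0]
```

Overlapping intervals do not prove the two models are equivalent; they only mean the rollout count is too small to separate them.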

Figures

Figures reproduced from arXiv: 2602.20309 by Haokun Lin, Jingxuan Zhang, Mi Zhang, Xin Wang, Yingtie Lei, Yunta Hsieh, Zhongwei Wan, Ziqi Wang.

**Figure 1.** Comparison of representative VLA efficiency frameworks. (1) TinyVLA focuses on compact multimodal transformers and lightweight diffusion-policy heads for architectural efficiency; (2) EfficientVLA accelerates inference by pruning redundant language layers and reusing intermediate representations; (3) VLA-Cache improves throughput through key–value reuse and static caching of vision tokens; (4) MoLe-VLA ado…

**Figure 2.** Overview of QuantVLA for VLAs with a DiT-based action head. The framework is training-free and preserves the original architecture and operator schedule. It combines: (1) a selective quantization layout that integerizes all linear layers in the LLM and all MLP layers in the DiT while keeping the attention projections Q, K, V, O in floating point; (2) Attention Temperature Matching (ATM), a per-head scalar…

**Figure 3.** ATM and OHB effects across attention blocks.

**Figure 4.** Memory saving of QuantVLA over the baseline on …

Original abstract

Vision-language-action (VLA) models unify perception, language, and control for embodied agents but face significant challenges in practical deployment due to rapidly increasing compute and memory demands, especially as models scale to longer horizons and larger backbones. To address these bottlenecks, we introduce QuantVLA, a training-free post-training quantization (PTQ) framework that, to our knowledge, is the first PTQ approach for VLA systems and the first to successfully quantize a diffusion transformer (DiT) action head. QuantVLA incorporates three scale-calibrated components: (1) a selective quantization layout that integerizes all linear layers in both the language backbone and the DiT while keeping attention projections in floating point to preserve the original operator schedule; (2) attention temperature matching, a lightweight per-head scaling mechanism that stabilizes attention logits and is folded into the dequantization scales at inference; and (3) output head balancing, a per-layer residual interface calibration that mitigates post-projection energy drift. The framework requires no additional training, uses only a small unlabeled calibration buffer, and supports integer kernels for low-bit weights and activations while leaving the architecture unchanged. Across representative VLA models on LIBERO, QuantVLA exceeds the task success rates of full-precision baselines, achieves about 70% relative memory savings on the quantized components, providing a practical pathway toward scalable low-bit embodied intelligence under strict compute, memory, and power constraints.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

QuantVLA claims better-than-full-precision results on VLA tasks via simple calibration, but lacks the ablations needed to trust the gains.

QuantVLA presents itself as the first post-training quantization method for vision-language-action models and the first to quantize a diffusion transformer action head. It keeps the model architecture unchanged and needs only a small unlabeled set to fit the scales.

The approach does a few things right. It leaves attention projections in float to avoid changing the operator schedule, folds the temperature scales into dequantization, and balances residuals per layer. The goal is integer kernels for weights and activations with performance preserved. On LIBERO, the quantized models reportedly beat the original full-precision success rates while cutting memory on the quantized components by roughly 70 percent. That would be a nice result if it holds.

The soft spots are visible from the abstract alone. There are no error bars, no ablation tables for calibration buffer size or for the individual components, and no discussion of how sensitive the results are to the choice of calibration data. The central assumption is that a small buffer captures the right statistics for per-head temperatures and per-layer balances across tasks and environments. If it does not, the gains could disappear or reverse on new data. The stress-test concern about distribution shift in the calibration buffer is on point. Without checks such as varying the buffer or holding out calibration data, it is difficult to separate real improvement from calibration luck. Still, the framework is simple enough that others could reproduce the calibration and test it themselves.

This paper would interest people deploying VLA models for robotics on resource-constrained devices. It is not deep on theory, but it offers a concrete starting point for low-bit embodied AI. I would bring it to a reading group to walk through the calibration details and see whether the numbers hold up under scrutiny.

It should go to peer review. The topic is timely, the method is practical, and the claims are testable even if the current evidence needs strengthening.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces QuantVLA, a training-free post-training quantization (PTQ) framework for vision-language-action (VLA) models. It selectively quantizes linear layers in the language backbone and diffusion transformer (DiT) action head while retaining attention projections in floating point, and introduces two calibration-based mechanisms—per-head attention temperature matching folded into dequantization scales and per-layer output-head residual balancing—derived from a small unlabeled calibration buffer. On LIBERO benchmarks the method is reported to exceed full-precision task success rates while delivering approximately 70% relative memory savings on the quantized components, without altering the model architecture or requiring retraining.

Significance. If the empirical claims prove robust, the work would represent a meaningful advance as the first PTQ approach for VLA systems and the first successful quantization of a DiT action head. The combination of memory reduction with reported performance gains over full-precision baselines would be practically relevant for deploying embodied agents under compute and power constraints. The training-free nature and use of a small calibration buffer are attractive engineering features, though the absence of ablations and statistical detail currently limits confidence in the generality of the result.

major comments (3)

[Abstract and §4 (Experimental Results)] The central claim that QuantVLA exceeds full-precision task success rates is presented without error bars, number of random seeds, or statistical significance tests. Because the scales are fitted on a small unlabeled buffer, this omission makes it impossible to determine whether the reported gains are robust or could be explained by calibration variance or task-specific distribution shift.
[§3.2 (Attention Temperature Matching) and §3.3 (Output Head Balancing)] The manuscript provides no description of the calibration-buffer size, selection procedure, or sensitivity analysis (e.g., buffer-size sweeps or held-out calibration ablations). These parameters directly determine the per-head temperature scales and residual balancing factors; without such evidence the assumption that the buffer is representative of the full LIBERO task distribution remains unverified and load-bearing for the performance claim.
[§4, Table 1 (or equivalent results table)] The reported 70% relative memory savings are stated at a high level without breakdown by component (weights vs. activations, language backbone vs. DiT head) or comparison against standard PTQ baselines such as GPTQ or SmoothQuant. This detail is necessary to assess whether the selective quantization layout is the primary driver of the savings.
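
For orientation, the relative saving on a quantized tensor follows from simple bit accounting. The sketch below is an assumption-laden illustration: the bit-widths, the group size, and the scale format are hypothetical knobs, not values reported for QuantVLA.

```python
def relative_memory_saving(fp_bits, q_bits, group_size=None, scale_bits=16):
    """Fraction of memory saved when fp_bits-wide values become q_bits-wide.

    If group_size is given, one scale of scale_bits is stored per group and
    its amortized cost is added to every quantized element.
    """
    overhead = scale_bits / group_size if group_size else 0.0
    return 1.0 - (q_bits + overhead) / fp_bits
```

With 16-bit originals and 4-bit values the function returns 0.75 when scales are ignored; storing one 16-bit scale per 128 values lowers that to about 0.742. Whether the paper's roughly 70% figure arises from accounting of this kind is exactly what the requested per-component breakdown would show.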

minor comments (2)

[Abstract] The abstract states that the framework 'supports integer kernels' but does not specify which kernels or hardware targets were used in the reported timing or memory measurements.
[§3.2] Notation for the folded dequantization scales in the attention-temperature mechanism could be clarified with an explicit equation showing how the per-head temperature is absorbed into the scale factor at inference time.
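
The folding asked for here can be written generically. The following is a sketch under assumptions, not the paper's equation: it takes a symmetric quantizer with dequantization scale $s_h$ for the tensor feeding head $h$ and a calibrated temperature $\alpha_h$; both symbols are introduced here for illustration.

```latex
\hat{x}_h = s_h \, x_h^{\mathrm{int}}, \qquad
\alpha_h \, \hat{x}_h = (\alpha_h s_h) \, x_h^{\mathrm{int}} = \tilde{s}_h \, x_h^{\mathrm{int}},
\qquad \tilde{s}_h := \alpha_h s_h .
```

Because $\tilde{s}_h$ can be precomputed offline, the temperature adds no operator at inference.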

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback and detailed comments. We address each major point below and commit to revisions that strengthen the empirical support and clarity of the manuscript.

Referee: [Abstract and §4 (Experimental Results)] The central claim that QuantVLA exceeds full-precision task success rates is presented without error bars, number of random seeds, or statistical significance tests. Because the scales are fitted on a small unlabeled buffer, this omission makes it impossible to determine whether the reported gains are robust or could be explained by calibration variance or task-specific distribution shift.

Authors: We agree that statistical rigor is essential. In the revised manuscript we will report task success rates averaged over five independent random seeds, include standard-error bars, and add paired t-test p-values against the full-precision baseline. These additions will confirm that the observed improvements are statistically significant and robust to calibration variance.
revision: yes

Referee: [§3.2 (Attention Temperature Matching) and §3.3 (Output Head Balancing)] The manuscript provides no description of the calibration-buffer size, selection procedure, or sensitivity analysis (e.g., buffer-size sweeps or held-out calibration ablations). These parameters directly determine the per-head temperature scales and residual balancing factors; without such evidence the assumption that the buffer is representative of the full LIBERO task distribution remains unverified and load-bearing for the performance claim.

Authors: We will expand §§3.2–3.3 with the missing details: the calibration buffer consists of 256 randomly sampled unlabeled trajectories drawn from the LIBERO training tasks. We will also add a sensitivity study showing stable performance for buffer sizes 128–512 and held-out calibration ablations demonstrating that the learned scales generalize to unseen tasks.
revision: yes

Referee: [§4, Table 1 (or equivalent results table)] The reported 70% relative memory savings are stated at a high level without breakdown by component (weights vs. activations, language backbone vs. DiT head) or comparison against standard PTQ baselines such as GPTQ or SmoothQuant. This detail is necessary to assess whether the selective quantization layout is the primary driver of the savings.

Authors: We accept that a finer-grained breakdown and baseline comparisons are needed. The revised Table 1 and surrounding text will report memory savings separately for weights and activations in both the language backbone and DiT head. We will also add direct comparisons against GPTQ and SmoothQuant applied to the same selective quantization layout, isolating the contribution of our temperature-matching and residual-balancing techniques.
revision: yes
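
The paired comparison over seeds promised in the first response can be sketched without external libraries. Note the swap: instead of the paired t-test named there, this uses an exact sign-flip permutation test on per-seed differences, which needs no t-distribution table and is exact for small seed counts. Function names are hypothetical, and no numbers from the paper are used.

```python
from itertools import product
from statistics import mean


def paired_sign_flip_pvalue(quant_scores, baseline_scores):
    """Two-sided exact sign-flip test on paired per-seed scores."""
    if len(quant_scores) != len(baseline_scores):
        raise ValueError("scores must be paired by seed")
    diffs = [q - b for q, b in zip(quant_scores, baseline_scores)]
    observed = abs(mean(diffs))
    hits = 0
    total = 0
    # Under the null, each paired difference is equally likely to flip sign.
    for signs in product((1, -1), repeat=len(diffs)):
        total += 1
        flipped = abs(mean(s * d for s, d in zip(signs, diffs)))
        if flipped >= observed - 1e-12:   # small tolerance for float ties
            hits += 1
    return hits / total
```

One design consequence worth noting: with five seeds there are only 32 sign patterns, so the smallest two-sided p-value this test can return is 2/32 = 0.0625. Under this particular test, five seeds cannot reach the conventional 0.05 threshold even when every seed favors the quantized model.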

Circularity Check

0 steps flagged

No significant circularity; empirical calibration method with held-out evaluation

The paper describes a training-free PTQ framework whose core operations (selective layout, per-head temperature matching, per-layer residual balancing) are determined by fitting scales to a small unlabeled calibration buffer. The headline performance claims consist of measured task success rates on held-out LIBERO tasks for representative VLA models; these rates are external observables and do not reduce by construction to the fitted scale values. No self-definitional equations, fitted-input predictions, or load-bearing self-citations appear in the derivation. The approach is therefore self-contained against external benchmarks and receives the default non-circularity finding.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 0 invented entities

The approach depends on empirical calibration of per-head and per-layer scaling factors from a small unlabeled buffer; these act as free parameters whose values are determined after training rather than derived from first principles.

free parameters (2)

per-head attention temperature scales: lightweight scaling factors fitted per attention head from the calibration buffer and folded into dequantization.
per-layer residual balancing factors: calibration values chosen to correct post-projection energy drift in the output heads.

axioms (1)

Domain assumption: selective integerization of linear layers while retaining attention projections in floating point preserves the original operator schedule and performance. Invoked in the description of the selective quantization layout.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: attention temperature matching... $\alpha_{\text{raw}} = \mathrm{Std}(L_T)/\mathrm{Std}(L_Q)$... output head balancing... $\beta_{\text{raw}}(l) = \mathrm{RMS}(Z_{T,l})/\mathrm{RMS}(Z_{Q,l})$

IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean · alpha_pin_under_high_calibration · unclear
Paper passage: selective quantization layout... keeping attention projections in floating point

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Realtime-VLA FLASH: Speculative Inference Framework for Diffusion-based VLAs
cs.RO 2026-05 unverdicted novelty 7.0

A new speculative inference system speeds up diffusion VLAs to 19.1 ms average latency (3.04x faster) on LIBERO by replacing most full 58 ms inferences with 7.8 ms draft rounds while preserving task performance.

DuQuant++: Fine-grained Rotation Enhances Microscaling FP4 Quantization
cs.CV 2026-04 unverdicted novelty 4.0

DuQuant++ adapts outlier-aware fine-grained rotation to MXFP4 by matching block size to the 32-element microscaling group, enabling a single rotation that smooths distributions and achieves SOTA performance on LLaMA-3...

Reference graph

Works this paper leans on

56 extracted references · 56 canonical work pages · cited by 2 Pith papers · 19 internal anchors

[1]

Quarot: Outlier-free 4- bit inference in rotated llms,

Saleh Ashkboos, Amirkeivan Mohtashami, Maximilian L Croci, Bo Li, Martin Jaggi, Dan Alistarh, Torsten Hoefler, and James Hensman. Quarot: Outlier-free 4-bit inference in rotated llms.arXiv preprint arXiv:2404.00456, 2024. 3

work page arXiv 2024
[2]

Omnisat: Self-supervised modality fusion for earth observation

Guillaume Astruc, Nicolas Gonthier, Clement Mallet, and Loic Landrieu. Omnisat: Self-supervised modality fusion for earth observation. InEuropean Conference on Computer Vision, pages 409–427. Springer, 2024. 3

work page 2024
[3]

OpenFlamingo: An Open-Source Framework for Training Large Autoregressive Vision-Language Models

Anas Awadalla, Irena Gao, Josh Gardner, Jack Hessel, Yusuf Hanafy, Wanrong Zhu, Kalyani Marathe, Yonatan Bitton, Samir Gadre, Shiori Sagawa, et al. Openflamingo: An open- source framework for training large autoregressive vision- language models.arXiv preprint arXiv:2308.01390, 2023. 1

work page internal anchor Pith review Pith/arXiv arXiv 2023
[4]

GR00T N1: An Open Foundation Model for Generalist Humanoid Robots

Johan Bjorck, Fernando Casta ˜neda, Nikita Cherniadev, Xingye Da, Runyu Ding, Linxi Fan, Yu Fang, Dieter Fox, Fengyuan Hu, Spencer Huang, et al. Gr00t n1: An open foundation model for generalist humanoid robots.arXiv preprint arXiv:2503.14734, 2025. 1, 3, 6

work page internal anchor Pith review Pith/arXiv arXiv 2025
[5]

$\pi_0$: A Vision-Language-Action Flow Model for General Robot Control

Kevin Black, Noah Brown, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Lachy Groom, Karol Hausman, Brian Ichter, et al. pi 0: A vision-language- action flow model for general robot control.arXiv preprint arXiv:2410.24164, 2024. 3

work page internal anchor Pith review Pith/arXiv arXiv 2024
[6]

RT-1: Robotics Transformer for Real-World Control at Scale

Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Joseph Dabis, Chelsea Finn, Keerthana Gopalakr- ishnan, Karol Hausman, Alex Herzog, Jasmine Hsu, et al. Rt-1: Robotics transformer for real-world control at scale. arXiv preprint arXiv:2212.06817, 2022. 2

work page internal anchor Pith review Pith/arXiv arXiv 2022
[7]

Diffusion policy: Visuomotor policy learning via action dif- fusion.The International Journal of Robotics Research, 44 (10-11):1684–1704, 2025

Cheng Chi, Zhenjia Xu, Siyuan Feng, Eric Cousineau, Yilun Du, Benjamin Burchfiel, Russ Tedrake, and Shuran Song. Diffusion policy: Visuomotor policy learning via action dif- fusion.The International Journal of Robotics Research, 44 (10-11):1684–1704, 2025. 2

work page 2025
[8]

Stbllm: Breaking the 1-bit barrier with struc- tured binary llms.arXiv preprint arXiv:2408.01803, 2024

Peijie Dong, Lujun Li, Yuedong Zhong, Dayou Du, Ruibo Fan, Yuhan Chen, Zhenheng Tang, Qiang Wang, Wei Xue, Yike Guo, et al. Stbllm: Breaking the 1-bit barrier with struc- tured binary llms.arXiv preprint arXiv:2408.01803, 2024. 3

work page arXiv 2024
[9]

Learning universal policies via text-guided video genera- tion.Advances in neural information processing systems, 36:9156–9172, 2023

Yilun Du, Sherry Yang, Bo Dai, Hanjun Dai, Ofir Nachum, Josh Tenenbaum, Dale Schuurmans, and Pieter Abbeel. Learning universal policies via text-guided video genera- tion.Advances in neural information processing systems, 36:9156–9172, 2023. 3

work page 2023
[10]

GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers

Elias Frantar, Saleh Ashkboos, Torsten Hoefler, and Dan Alistarh. Gptq: Accurate post-training quantization for generative pre-trained transformers.arXiv preprint arXiv:2210.17323, 2022. 3

work page internal anchor Pith review Pith/arXiv arXiv 2022
[11]

Ostquant: Refining large language model quantization with orthogonal and scaling transformations for better distribution fitting.arXiv preprint arXiv:2501.13987, 2025

Xing Hu, Yuan Cheng, Dawei Yang, Zukang Xu, Zhihang Yuan, Jiangyong Yu, Chen Xu, Zhe Jiang, and Sifan Zhou. Ostquant: Refining large language model quantization with orthogonal and scaling transformations for better distribution fitting.arXiv preprint arXiv:2501.13987, 2025. 3

work page arXiv 2025
[12]

Physical Intelligence, Kevin Black, Noah Brown, James Darpinian, Karan Dhabalia, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, et al.π0.5: a vision–language–action model with open-world generaliza- tion.arXiv preprint arXiv:2504.16054, 2025. 1, 3, 6

work page internal anchor Pith review Pith/arXiv arXiv 2025
[13]

Vision-language-action models for robotics: A review towards real-world applications.IEEE Access, 2025

Kento Kawaharazuka, Jihoon Oh, Jun Yamada, Ingmar Pos- ner, and Yuke Zhu. Vision-language-action models for robotics: A review towards real-world applications.IEEE Access, 2025. 3

work page 2025
[14]

OpenVLA: An Open-Source Vision-Language-Action Model

Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan Foster, Grace Lam, Pannag Sanketi, et al. Openvla: An open-source vision-language-action model.arXiv preprint arXiv:2406.09246, 2024. 1, 2

work page internal anchor Pith review Pith/arXiv arXiv 2024
[15]

Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation

Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. InInterna- tional conference on machine learning, pages 12888–12900. PMLR, 2022. 1

work page 2022
[16]

Svdquant: Absorbing outliers by low- rank components for 4-bit diffusion models.arXiv preprint arXiv:2411.05007, 2024

Muyang Li, Yujun Lin, Zhekai Zhang, Tianle Cai, Xiuyu Li, Junxian Guo, Enze Xie, Chenlin Meng, Jun-Yan Zhu, and Song Han. Svdquant: Absorbing outliers by low- rank components for 4-bit diffusion models.arXiv preprint arXiv:2411.05007, 2024. 3

work page arXiv 2024
[17]

Evaluating Real-World Robot Manipulation Policies in Simulation

Xuanlin Li, Kyle Hsu, Jiayuan Gu, Karl Pertsch, Oier Mees, Homer Rich Walke, Chuyuan Fu, Ishikaa Lunawat, Isabel Sieh, Sean Kirmani, Sergey Levine, Jiajun Wu, Chelsea Finn, Hao Su, Quan Vuong, and Ted Xiao. Evaluating real-world robot manipulation policies in simulation.arXiv preprint arXiv:2405.05941, 2024. 8

work page internal anchor Pith review Pith/arXiv arXiv 2024
[18]

Duquant: Distributing outliers via dual transformation makes stronger quantized llms.Advances in Neural Infor- mation Processing Systems, 37:87766–87800, 2024

Haokun Lin, Haobo Xu, Yichen Wu, Jingzhi Cui, Yingtao Zhang, Linzhan Mou, Linqi Song, Zhenan Sun, and Ying Wei. Duquant: Distributing outliers via dual transformation makes stronger quantized llms.Advances in Neural Infor- mation Processing Systems, 37:87766–87800, 2024. 2, 3, 4

work page 2024
[19]

Quantization meets dllms: A systematic study of post-training quantization for diffusion llms.arXiv preprint arXiv:2508.14896, 2025

Haokun Lin, Haobo Xu, Yichen Wu, Ziyu Guo, Renrui Zhang, Zhichao Lu, Ying Wei, Qingfu Zhang, and Zhenan Sun. Quantization meets dllms: A systematic study of post-training quantization for diffusion llms.arXiv preprint arXiv:2508.14896, 2025. 3

work page arXiv 2025
[20]

Efficient diffusion language models: A comprehensive survey.Authorea Preprints, 2026

Haokun Lin, Xinle Jia, Shaozhen Liu, Shujun Xia, Weitao Huang, Haobo Xu, Junyang Li, Yicheng Xiao, Xingrun Xing, Ziyu Guo, et al. Efficient diffusion language models: A comprehensive survey.Authorea Preprints, 2026. 3

work page 2026
[21]

Awq: Activation-aware weight quantization for on-device llm compression and accelera- tion.Proceedings of machine learning and systems, 6:87– 100, 2024

Ji Lin, Jiaming Tang, Haotian Tang, Shang Yang, Wei-Ming Chen, Wei-Chen Wang, Guangxuan Xiao, Xingyu Dang, Chuang Gan, and Song Han. Awq: Activation-aware weight quantization for on-device llm compression and accelera- tion.Proceedings of machine learning and systems, 6:87– 100, 2024. 3

work page 2024
[22]

Flow Matching for Generative Modeling

Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maximil- ian Nickel, and Matt Le. Flow matching for generative mod- eling.arXiv preprint arXiv:2210.02747, 2022. 1, 3, 4

work page internal anchor Pith review Pith/arXiv arXiv 2022
[23]

Libero: Benchmarking knowl- edge transfer for lifelong robot learning.Advances in Neural Information Processing Systems, 36:44776–44791, 2023

Bo Liu, Yifeng Zhu, Chongkai Gao, Yihao Feng, Qiang Liu, Yuke Zhu, and Peter Stone. Libero: Benchmarking knowl- edge transfer for lifelong robot learning.Advances in Neural Information Processing Systems, 36:44776–44791, 2023. 2, 6

work page 2023
[24]

RDT-1B: a Diffusion Foundation Model for Bimanual Manipulation

Songming Liu, Lingxuan Wu, Bangguo Li, Hengkai Tan, Huayu Chen, Zhengyi Wang, Ke Xu, Hang Su, and Jun Zhu. Rdt-1b: a diffusion foundation model for bimanual manipu- lation.arXiv preprint arXiv:2410.07864, 2024. 2, 3

work page internal anchor Pith review Pith/arXiv arXiv 2024
[25]

Post-training quantization for vision trans- former.Advances in Neural Information Processing Systems, 34:28092–28103, 2021

Zhenhua Liu, Yunhe Wang, Kai Han, Wei Zhang, Siwei Ma, and Wen Gao. Post-training quantization for vision trans- former.Advances in Neural Information Processing Systems, 34:28092–28103, 2021. 2

work page 2021
[26]

Up or down? adap- tive rounding for post-training quantization

Markus Nagel, Rana Ali Amjad, Mart Van Baalen, Chris- tos Louizos, and Tijmen Blankevoort. Up or down? adap- tive rounding for post-training quantization. InInternational conference on machine learning, pages 7197–7206. PMLR,

work page
[27]

DINOv2: Learning Robust Visual Features without Supervision

Maxime Oquab, Timoth ´ee Darcet, Th ´eo Moutakanni, Huy V o, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. Dinov2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193, 2023. 4

work page internal anchor Pith review Pith/arXiv arXiv 2023
[28]

Scalable diffusion models with transformers

William Peebles and Saining Xie. Scalable diffusion models with transformers. InProceedings of the IEEE/CVF inter- national conference on computer vision, pages 4195–4205,

work page
[29]

FAST: Efficient Action Tokenization for Vision-Language-Action Models

Karl Pertsch, Kyle Stachowicz, Brian Ichter, Danny Driess, Suraj Nair, Quan Vuong, Oier Mees, Chelsea Finn, and Sergey Levine. Fast: Efficient action tokenization for vision- language-action models.arXiv preprint arXiv:2501.09747,

work page internal anchor Pith review Pith/arXiv arXiv
[30]

Flower: Democratizing generalist robot policies with efficient vision-language-action flow policies.arXiv preprint arXiv:2509.04996, 2025

Moritz Reuss, Hongyi Zhou, Marcel R ¨uhle, ¨Omer Erdinc ¸ Ya˘gmurlu, Fabian Otto, and Rudolf Lioutikov. Flower: Democratizing generalist robot policies with efficient vision-language-action flow policies.arXiv preprint arXiv:2509.04996, 2025. 3

work page arXiv 2025
[31]

Omniquant: Omnidirectionally calibrated quantization for large language models

Wenqi Shao, Mengzhao Chen, Zhaoyang Zhang, Peng Xu, Lirui Zhao, Zhiqian Li, Kaipeng Zhang, Peng Gao, Yu Qiao, and Ping Luo. Omniquant: Omnidirectionally calibrated quantization for large language models. InThe Twelfth In- ternational Conference on Learning Representations, 2023. 3

work page 2023
[32]

Efficient diffusion models: A survey

Hui Shen, Jingxuan Zhang, Boning Xiong, Rui Hu, Shoufa Chen, Zhongwei Wan, Xin Wang, Yu Zhang, Zixuan Gong, Guangyin Bao, et al. Efficient diffusion models: A survey. arXiv preprint arXiv:2502.06805, 2025. 4

work page arXiv 2025
[33]

SmolVLA: A Vision-Language-Action Model for Affordable and Efficient Robotics

Mustafa Shukor, Dana Aubakirova, Francesco Capuano, Pepijn Kooijmans, Steven Palma, Adil Zouitine, Michel Ar- actingi, Caroline Pascal, Martino Russi, Andres Marafioti, et al. Smolvla: A vision-language-action model for afford- able and efficient robotics.arXiv preprint arXiv:2506.01844,

work page internal anchor Pith review Pith/arXiv arXiv
[34]

Score-Based Generative Modeling through Stochastic Differential Equations

Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Ab- hishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equa- tions.arXiv preprint arXiv:2011.13456, 2020. 4

[35]

Yuxuan Sun, Ruikang Liu, Haoli Bai, Han Bao, Kang Zhao, Yuening Li, Jiaxin Hu, Xianzhi Yu, Lu Hou, Chun Yuan, et al. Flatquant: Flatness matters for llm quantization. arXiv preprint arXiv:2410.09426, 2024.

[36]

Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.

[37]

Lirui Wang, Xinlei Chen, Jialiang Zhao, and Kaiming He. Scaling proprioceptive-visual learning with heterogeneous pre-trained transformers. Advances in neural information processing systems, 37:124420–124450, 2024.

[38]

Junjie Wen, Yichen Zhu, Jinming Li, Minjie Zhu, Zhibin Tang, Kun Wu, Zhiyuan Xu, Ning Liu, Ran Cheng, Chaomin Shen, et al. Tinyvla: Towards fast, data-efficient vision-language-action models for robotic manipulation. IEEE Robotics and Automation Letters, 2025.

[39]

Junyi Wu, Haoxuan Wang, Yuzhang Shang, Mubarak Shah, and Yan Yan. Ptq4dit: Post-training quantization for diffusion transformers. Advances in neural information processing systems, 37:62732–62755, 2024.

[40]

Guangxuan Xiao, Ji Lin, Mickael Seznec, Hao Wu, Julien Demouth, and Song Han. Smoothquant: Accurate and efficient post-training quantization for large language models. In International conference on machine learning, pages 38087–38099. PMLR, 2023.

[41]

Siyu Xu, Yunke Wang, Chenghao Xia, Dihao Zhu, Tao Huang, and Chang Xu. Vla-cache: Towards efficient vision-language-action model via adaptive token caching in robotic manipulation. arXiv preprint arXiv:2502.02175, 2025.

[42]

An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025.

[43]

Lianwei Yang, Haisong Gong, Haokun Lin, Yichen Wu, Zhenan Sun, and Qingyi Gu. Dopq-vit: Towards distribution-friendly and outlier-aware post-training quantization for vision transformers. arXiv preprint arXiv:2408.03291, 2024.

[44]

Lianwei Yang, Haokun Lin, Tianchen Zhao, Yichen Wu, Hongyu Zhu, Ruiqi Xie, Zhenan Sun, Yu Wang, and Qingyi Gu. Lrq-dit: Log-rotation post-training quantization of diffusion transformers for text-to-image generation. arXiv preprint arXiv:2508.03485, 2025.

[45]

Yantai Yang, Yuhao Wang, Zichen Wen, Luo Zhongwei, Chang Zou, Zhipeng Zhang, Chuan Wen, and Linfeng Zhang. Efficientvla: Training-free acceleration and compression for vision-language-action models. arXiv preprint arXiv:2506.10100, 2025.

[46]

Zhihang Yuan, Lin Niu, Jiawei Liu, Wenyu Liu, Xinggang Wang, Yuzhang Shang, Guangyu Sun, Qiang Wu, Jiaxiang Wu, and Bingzhe Wu. Rptq: Reorder-based post-training quantization for large language models. arXiv preprint arXiv:2304.01089, 2023.

[47]

Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. Sigmoid loss for language image pre-training. In Proceedings of the IEEE/CVF international conference on computer vision, pages 11975–11986, 2023.

[48]

Rongyu Zhang, Menghang Dong, Yuan Zhang, Liang Heng, Xiaowei Chi, Gaole Dai, Li Du, Yuan Du, and Shanghang Zhang. Mole-vla: Dynamic layer-skipping vision language action model via mixture-of-layers for efficient robot manipulation. arXiv preprint arXiv:2503.20384, 2025.

[49]

Tianchen Zhao, Tongcheng Fang, Haofeng Huang, Enshu Liu, Rui Wan, Widyadewi Soedarmadji, Shiyao Li, Zinan Lin, Guohao Dai, Shengen Yan, et al. Vidit-q: Efficient and accurate quantization of diffusion transformers for image and video generation. arXiv preprint arXiv:2406.02540,

[50]

Tianchen Zhao, Xuefei Ning, Tongcheng Fang, Enshu Liu, Guyue Huang, Zinan Lin, Shengen Yan, Guohao Dai, and Yu Wang. Mixdq: Memory-efficient few-step text-to-image diffusion models with metric-decoupled mixed precision quantization. In European Conference on Computer Vision, pages 285–302. Springer, 2024.

[51]

Tony Z Zhao, Vikash Kumar, Sergey Levine, and Chelsea Finn. Learning fine-grained bimanual manipulation with low-cost hardware. arXiv preprint arXiv:2304.13705, 2023.

[52]

Jinliang Zheng, Jianxiong Li, Zhihao Wang, Dongxiu Liu, Xirui Kang, Yuchun Feng, Yinan Zheng, Jiayin Zou, Yilun Chen, Jia Zeng, et al. X-vla: Soft-prompted transformer as scalable cross-embodiment vision-language-action model. arXiv preprint arXiv:2510.10274, 2025.

[53]

Yifan Zhong, Fengshuo Bai, Shaofei Cai, Xuchuan Huang, Zhang Chen, Xiaowei Zhang, Yuanfei Wang, Shaoyang Guo, Tianrui Guan, Ka Nam Lui, et al. A survey on vision-language-action models: An action tokenization perspective. arXiv preprint arXiv:2507.01925, 2025.

[54]

Hongyi Zhou, Weiran Liao, Xi Huang, Yucheng Tang, Fabian Otto, Xiaogang Jia, Xinkai Jiang, Simon Hilber, Ge Li, Qian Wang, et al. Beast: Efficient tokenization of b-splines encoded action sequences for imitation learning. arXiv preprint arXiv:2506.06072, 2025.

[55]

Siyuan Zhou, Yilun Du, Jiaben Chen, Yandong Li, Dit-Yan Yeung, and Chuang Gan. Robodreamer: Learning compositional world models for robot imagination. arXiv preprint arXiv:2404.12377, 2024.

[56]

Brianna Zitkovich, Tianhe Yu, Sichun Xu, Peng Xu, Ted Xiao, Fei Xia, Jialin Wu, Paul Wohlhart, Stefan Welker, Ayzaan Wahid, et al. Rt-2: Vision-language-action models transfer web knowledge to robotic control. In Conference on Robot Learning, pages 2165–2183. PMLR, 2023.

A. General Quantization Formulations

Post-training quantization (PTQ) [26, 39, 50] r...
