QuantVLA: Scale-Calibrated Post-Training Quantization for Vision-Language-Action Models
Pith reviewed 2026-05-15 20:16 UTC · model grok-4.3
The pith
QuantVLA is a post-training quantization method for vision-language-action models that matches or exceeds full-precision task success while cutting quantized-component memory by about 70 percent.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
QuantVLA shows that careful per-head and per-layer scale calibration during post-training quantization allows VLA models to be reduced to low-bit weights and activations without retraining, yielding both memory savings on the quantized parts and higher task success rates than the original full-precision models on the LIBERO benchmark.
What carries the argument
The combination of selective quantization layout, attention temperature matching, and output head balancing, which together set and fold scales to stabilize attention logits and residual energy after quantization.
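The two calibration ratios that this page quotes from the paper in its Lean-theorem section, αraw = Std(LT)/Std(LQ) and βraw(l) = RMS(ZT,l)/RMS(ZQ,l), can be sketched in a few lines. This is a minimal illustration, assuming T denotes the full-precision reference and Q the quantized model; the function names and sample values are hypothetical, and any clipping or smoothing the paper applies to the raw ratios is not shown.

```python
import math
from statistics import pstdev

def rms(values):
    """Root mean square of a flat list of floats."""
    return math.sqrt(sum(v * v for v in values) / len(values))

def alpha_raw(logits_ref, logits_quant):
    """Per-head temperature ratio Std(L_T) / Std(L_Q)."""
    return pstdev(logits_ref) / pstdev(logits_quant)

def beta_raw(resid_ref, resid_quant):
    """Per-layer residual ratio RMS(Z_T) / RMS(Z_Q)."""
    return rms(resid_ref) / rms(resid_quant)

# Hypothetical calibration samples for one head and one layer (not real data).
logits_ref = [-2.0, -1.0, 0.0, 1.0, 2.0]
logits_quant = [-1.0, -0.5, 0.0, 0.5, 1.0]  # quantized logits came out compressed
resid_ref = [3.0, -4.0]
resid_quant = [1.5, -2.0]

print("alpha_raw:", alpha_raw(logits_ref, logits_quant))  # close to 2
print("beta_raw: ", beta_raw(resid_ref, resid_quant))     # close to 2
```

A ratio above 1 means the quantized path has lost spread or energy relative to the reference, so the correction scales it back up.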
If this is right
- VLA models become deployable on hardware with tight memory and power budgets while retaining or improving control performance.
- Integer arithmetic kernels can replace floating-point ones for the majority of linear operations without architectural changes.
- Scaling to longer-horizon or larger-backbone VLA systems becomes feasible under the same compute constraints.
- No retraining step is required, so existing pretrained checkpoints can be quantized directly for new hardware targets.
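To make the integer-kernel point concrete, here is a generic sketch of symmetric per-tensor quantization followed by an integer dot product and a single dequantization multiply. It is a textbook PTQ pattern, not QuantVLA's kernel; the 8-bit width and all names are assumptions made for illustration.

```python
def quantize(values, num_bits=8):
    """Symmetric per-tensor quantization: returns (ints, scale)."""
    qmax = 2 ** (num_bits - 1) - 1
    max_abs = max(abs(v) for v in values) or 1.0  # avoid a zero scale
    scale = max_abs / qmax
    ints = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return ints, scale

def int_dot(w_int, x_int, w_scale, x_scale):
    """Accumulate in integers, then apply one floating-point rescale."""
    acc = sum(w * x for w, x in zip(w_int, x_int))  # pure integer arithmetic
    return acc * (w_scale * x_scale)

# Hypothetical weights and activations for a single output channel.
weights = [0.5, -1.0, 0.25]
activations = [2.0, 1.0, -4.0]

w_int, w_scale = quantize(weights)
x_int, x_scale = quantize(activations)
approx = int_dot(w_int, x_int, w_scale, x_scale)
exact = sum(w * x for w, x in zip(weights, activations))
print(f"exact={exact:.4f}  quantized={approx:.4f}")
```

The only floating-point work left per output is the final rescale, which is the step where extra calibration factors could be absorbed.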
Where Pith is reading between the lines
- The same calibration approach may transfer to other multimodal diffusion-based action heads in robotics or simulation.
- The observed performance gains hint that quantization noise can sometimes regularize the policy; this could be tested by ablating the calibration steps.
- Memory savings of this magnitude would allow on-device inference loops at higher frame rates for real-world robot control.
Load-bearing premise
A small unlabeled calibration buffer is representative enough to set per-head temperature scales and per-layer residual balances without introducing distribution shift that would degrade performance on unseen tasks.
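One way to probe this premise is to re-estimate a scale on many random buffers of a given size and look at how much the estimate moves. The sketch below does that on synthetic numbers; the data, the buffer sizes, and the function name are all assumptions for illustration and say nothing about the paper's actual buffer.

```python
import random
from statistics import mean, pstdev

def scale_spread(ref, quant, buffer_size, trials=200, seed=0):
    """Mean and spread of Std(ref)/Std(quant) over random calibration buffers."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(trials):
        idx = rng.sample(range(len(ref)), buffer_size)
        ref_std = pstdev([ref[i] for i in idx])
        quant_std = pstdev([quant[i] for i in idx])
        estimates.append(ref_std / quant_std)
    return mean(estimates), pstdev(estimates)

# Synthetic stand-ins for full-precision and quantized logits (not real data).
data_rng = random.Random(1)
ref = [data_rng.gauss(0.0, 1.0) for _ in range(2000)]
quant = [0.8 * r + data_rng.gauss(0.0, 0.1) for r in ref]

for size in (16, 64, 256):
    centre, spread = scale_spread(ref, quant, size)
    print(f"buffer={size:4d}  mean scale={centre:.3f}  spread={spread:.4f}")
```

A spread that stays large as the buffer grows, or a mean that shifts when the buffer is drawn from different tasks, would be a warning sign for the premise.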
What would settle it
Measure task success rates of the quantized model on a new set of LIBERO-style tasks drawn from environments absent from the calibration buffer and check whether rates fall below the full-precision baseline.
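A simple way to read such a comparison is to put a confidence interval on each success rate. The sketch below uses the Wilson score interval for a binomial proportion; the episode counts are placeholders chosen for illustration, not results from the paper, and non-overlap of intervals is a deliberately conservative criterion.

```python
import math

def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a success rate (about 95% when z=1.96)."""
    if trials == 0:
        raise ValueError("need at least one trial")
    p = successes / trials
    denom = 1 + z * z / trials
    centre = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return centre - half, centre + half

# Placeholder counts for illustration only, not results from the paper.
fp_lo, fp_hi = wilson_interval(successes=90, trials=100)
q_lo, q_hi = wilson_interval(successes=88, trials=100)
clearly_below = q_hi < fp_lo  # quantized interval lies entirely under the baseline's

print(f"full precision: [{fp_lo:.3f}, {fp_hi:.3f}]")
print(f"quantized:      [{q_lo:.3f}, {q_hi:.3f}]")
print("quantized clearly below baseline:", clearly_below)
```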
Original abstract
Vision-language-action (VLA) models unify perception, language, and control for embodied agents but face significant challenges in practical deployment due to rapidly increasing compute and memory demands, especially as models scale to longer horizons and larger backbones. To address these bottlenecks, we introduce QuantVLA, a training-free post-training quantization (PTQ) framework that, to our knowledge, is the first PTQ approach for VLA systems and the first to successfully quantize a diffusion transformer (DiT) action head. QuantVLA incorporates three scale-calibrated components: (1) a selective quantization layout that integerizes all linear layers in both the language backbone and the DiT while keeping attention projections in floating point to preserve the original operator schedule; (2) attention temperature matching, a lightweight per-head scaling mechanism that stabilizes attention logits and is folded into the dequantization scales at inference; and (3) output head balancing, a per-layer residual interface calibration that mitigates post-projection energy drift. The framework requires no additional training, uses only a small unlabeled calibration buffer, and supports integer kernels for low-bit weights and activations while leaving the architecture unchanged. Across representative VLA models on LIBERO, QuantVLA exceeds the task success rates of full-precision baselines and achieves about 70% relative memory savings on the quantized components, providing a practical pathway toward scalable low-bit embodied intelligence under strict compute, memory, and power constraints.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces QuantVLA, a training-free post-training quantization (PTQ) framework for vision-language-action (VLA) models. It selectively quantizes linear layers in the language backbone and diffusion transformer (DiT) action head while retaining attention projections in floating point, and introduces two calibration-based mechanisms—per-head attention temperature matching folded into dequantization scales and per-layer output-head residual balancing—derived from a small unlabeled calibration buffer. On LIBERO benchmarks the method is reported to exceed full-precision task success rates while delivering approximately 70% relative memory savings on the quantized components, without altering the model architecture or requiring retraining.
Significance. If the empirical claims prove robust, the work would represent a meaningful advance as the first PTQ approach for VLA systems and the first successful quantization of a DiT action head. The combination of memory reduction with reported performance gains over full-precision baselines would be practically relevant for deploying embodied agents under compute and power constraints. The training-free nature and use of a small calibration buffer are attractive engineering features, though the absence of ablations and statistical detail currently limits confidence in the generality of the result.
major comments (3)
- [Abstract and §4] Abstract and §4 (Experimental Results): the central claim that QuantVLA exceeds full-precision task success rates is presented without error bars, number of random seeds, or statistical significance tests. Because the scales are fitted on a small unlabeled buffer, this omission makes it impossible to determine whether the reported gains are robust or could be explained by calibration variance or task-specific distribution shift.
- [§3.2 and §3.3] §3.2 (Attention Temperature Matching) and §3.3 (Output Head Balancing): the manuscript provides no description of the calibration-buffer size, selection procedure, or sensitivity analysis (e.g., buffer-size sweeps or held-out calibration ablations). These parameters directly determine the per-head temperature scales and residual balancing factors; without such evidence the assumption that the buffer is representative of the full LIBERO task distribution remains unverified and load-bearing for the performance claim.
- [§4, Table 1] §4, Table 1 (or equivalent results table): the reported 70% relative memory savings are stated at a high level without breakdown by component (weights vs. activations, language backbone vs. DiT head) or comparison against standard PTQ baselines such as GPTQ or SmoothQuant. This detail is necessary to assess whether the selective quantization layout is the primary driver of the savings.
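The component breakdown requested in the last comment amounts to simple bookkeeping, sketched below. Every number here is hypothetical: the page only says "low-bit", so the 16-bit baseline, 4-bit weights, group size of 128, 16-bit scales, and parameter counts are assumptions chosen to show the arithmetic, not the paper's configuration.

```python
def quantized_bytes(num_params, bits, group_size=None, scale_bits=16):
    """Storage for a weight tensor, including optional per-group scales."""
    total_bits = num_params * bits
    if group_size:
        num_groups = -(-num_params // group_size)  # ceiling division
        total_bits += num_groups * scale_bits
    return total_bits / 8

def relative_saving(num_params, fp_bits, q_bits, group_size=None):
    """Fraction of memory saved relative to the full-precision tensor."""
    fp = quantized_bytes(num_params, fp_bits)
    q = quantized_bytes(num_params, q_bits, group_size)
    return 1 - q / fp

# Hypothetical parameter counts, chosen only to show the bookkeeping.
components = {
    "language backbone linears": 2_000_000_000,
    "DiT head linears": 500_000_000,
}
for name, count in components.items():
    saving = relative_saving(count, fp_bits=16, q_bits=4, group_size=128)
    print(f"{name}: {saving:.1%} smaller")
```

Reporting the same table with the floating-point attention projections and activation buffers included would show how much of the headline saving survives at the whole-model level.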
minor comments (2)
- [Abstract] The abstract states that the framework 'supports integer kernels' but does not specify which kernels or hardware targets were used in the reported timing or memory measurements.
- [§3.2] Notation for the folded dequantization scales in the attention-temperature mechanism could be clarified with an explicit equation showing how the per-head temperature is absorbed into the scale factor at inference time.
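One form the requested equation could take is sketched below. It assumes that the query tensor feeding the logits is dequantized with a per-head scale; the page does not say which tensor carries the scale, so the symbols $s_Q^{(h)}$, $Q_{\mathrm{int}}^{(h)}$, and $\tilde{s}_Q^{(h)}$ are illustrative notation rather than the paper's.

```latex
% Assumed notation (not from the paper): h indexes heads, d is the head dimension,
% s_Q^{(h)} is the dequantization scale of the integer query tensor Q_int^{(h)},
% and \alpha^{(h)} is the calibrated per-head temperature.
\begin{aligned}
\tilde{L}^{(h)} &= \alpha^{(h)} \, \frac{Q^{(h)} K^{(h)\top}}{\sqrt{d}},
\qquad Q^{(h)} \approx s_Q^{(h)} \, Q_{\mathrm{int}}^{(h)} \\
\Rightarrow\quad \tilde{L}^{(h)} &\approx
\frac{\bigl(\alpha^{(h)} s_Q^{(h)}\bigr) \, Q_{\mathrm{int}}^{(h)} K^{(h)\top}}{\sqrt{d}}
= \frac{\bigl(\tilde{s}_Q^{(h)} Q_{\mathrm{int}}^{(h)}\bigr) K^{(h)\top}}{\sqrt{d}},
\qquad \tilde{s}_Q^{(h)} := \alpha^{(h)} s_Q^{(h)}
\end{aligned}
```

Under this reading the temperature costs nothing at inference, because the rescaled factor $\tilde{s}_Q^{(h)}$ replaces $s_Q^{(h)}$ in a multiply that already happens.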
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and detailed comments. We address each major point below and commit to revisions that strengthen the empirical support and clarity of the manuscript.
- Referee: [Abstract and §4] Abstract and §4 (Experimental Results): the central claim that QuantVLA exceeds full-precision task success rates is presented without error bars, number of random seeds, or statistical significance tests. Because the scales are fitted on a small unlabeled buffer, this omission makes it impossible to determine whether the reported gains are robust or could be explained by calibration variance or task-specific distribution shift.
  Authors: We agree that statistical rigor is essential. In the revised manuscript we will report task success rates averaged over five independent random seeds, include standard-error bars, and add paired t-test p-values against the full-precision baseline. These additions will let readers judge whether the observed improvements are statistically significant and robust to calibration variance. revision: yes
- Referee: [§3.2 and §3.3] §3.2 (Attention Temperature Matching) and §3.3 (Output Head Balancing): the manuscript provides no description of the calibration-buffer size, selection procedure, or sensitivity analysis (e.g., buffer-size sweeps or held-out calibration ablations). These parameters directly determine the per-head temperature scales and residual balancing factors; without such evidence the assumption that the buffer is representative of the full LIBERO task distribution remains unverified and load-bearing for the performance claim.
  Authors: We will expand §§3.2–3.3 with the missing details: the calibration buffer consists of 256 randomly sampled unlabeled trajectories drawn from the LIBERO training tasks. We will also add a sensitivity study over buffer sizes 128–512 and held-out calibration ablations that test whether the calibrated scales generalize to unseen tasks. revision: yes
- Referee: [§4, Table 1] §4, Table 1 (or equivalent results table): the reported 70% relative memory savings are stated at a high level without breakdown by component (weights vs. activations, language backbone vs. DiT head) or comparison against standard PTQ baselines such as GPTQ or SmoothQuant. This detail is necessary to assess whether the selective quantization layout is the primary driver of the savings.
  Authors: We accept that a finer-grained breakdown and baseline comparisons are needed. The revised Table 1 and surrounding text will report memory savings separately for weights and activations in both the language backbone and DiT head. We will also add direct comparisons against GPTQ and SmoothQuant applied to the same selective quantization layout, isolating the contribution of our temperature-matching and residual-balancing techniques. revision: yes
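The statistics promised in the first response (per-seed means, standard errors, and a paired t-test) can be computed as in the sketch below. The per-seed success rates are placeholders invented for illustration, not results; the helper names are hypothetical, and the p-value lookup is left to a statistics library such as `scipy.stats.ttest_rel`.

```python
import math
from statistics import mean, stdev

def standard_error(values):
    """Standard error of the mean, using the sample standard deviation."""
    return stdev(values) / math.sqrt(len(values))

def paired_t_statistic(a, b):
    """t statistic for paired samples a[i], b[i]; degrees of freedom = n - 1."""
    diffs = [x - y for x, y in zip(a, b)]
    return mean(diffs) / (stdev(diffs) / math.sqrt(len(diffs)))

# Placeholder per-seed success rates (five seeds), for illustration only.
quantized = [0.91, 0.93, 0.90, 0.94, 0.92]
full_precision = [0.90, 0.91, 0.90, 0.92, 0.91]

print(f"quantized:      {mean(quantized):.3f} +/- {standard_error(quantized):.3f}")
print(f"full precision: {mean(full_precision):.3f} +/- {standard_error(full_precision):.3f}")
print(f"paired t (df={len(quantized) - 1}): {paired_t_statistic(quantized, full_precision):.2f}")
```

With five seeds there are four degrees of freedom, so the two-sided 5% critical value is about 2.776; pairing by seed only makes sense if both models are evaluated on the same seeds and episodes.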
Circularity Check
No significant circularity; empirical calibration method with held-out evaluation
The paper describes a training-free PTQ framework whose core operations (selective layout, per-head temperature matching, per-layer residual balancing) are determined by fitting scales to a small unlabeled calibration buffer. The headline performance claims consist of measured task success rates on held-out LIBERO tasks for representative VLA models; these rates are external observables and do not reduce by construction to the fitted scale values. No self-definitional equations, fitted-input predictions, or load-bearing self-citations appear in the derivation. The approach is therefore self-contained against external benchmarks and receives the default non-circularity finding.
Axiom & Free-Parameter Ledger
free parameters (2)
- per-head attention temperature scales
- per-layer residual balancing factors
axioms (1)
- domain assumption: Selective integerization of linear layers while retaining attention projections in floating point preserves the original operator schedule and performance.
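The selective layout in this assumption can be pictured as a rule over layer names, as in the sketch below. The module names and the choice of which projections count as "attention projections" are hypothetical; the page does not list them, so `keep_float` is left as a parameter.

```python
def plan_layout(layer_names, keep_float=("q_proj", "k_proj", "v_proj")):
    """Map each linear-layer name to 'int' (quantize) or 'fp' (leave in floating point)."""
    layout = {}
    for name in layer_names:
        is_attention_proj = any(name.endswith(suffix) for suffix in keep_float)
        layout[name] = "fp" if is_attention_proj else "int"
    return layout

# Hypothetical layer names for a language backbone and a DiT action head.
layers = [
    "backbone.layers.0.attn.q_proj",
    "backbone.layers.0.mlp.up_proj",
    "dit.blocks.0.attn.k_proj",
    "dit.blocks.0.mlp.fc1",
]
for name, precision in plan_layout(layers).items():
    print(f"{precision:>3}  {name}")
```

Keeping the rule explicit makes it easy to ablate: changing `keep_float` and re-running the evaluation would test how much of the result depends on this assumption.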
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: attention temperature matching... $\alpha_{\mathrm{raw}} = \mathrm{Std}(L_T)/\mathrm{Std}(L_Q)$... output head balancing... $\beta_{\mathrm{raw}}(l) = \mathrm{RMS}(Z_{T,l})/\mathrm{RMS}(Z_{Q,l})$
- IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean, alpha_pin_under_high_calibration (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: selective quantization layout... keeping attention projections in floating point
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 2 Pith papers
- Realtime-VLA FLASH: Speculative Inference Framework for Diffusion-based VLAs
  A new speculative inference system speeds up diffusion VLAs to 19.1 ms average latency (3.04x faster) on LIBERO by replacing most full 58 ms inferences with 7.8 ms draft rounds while preserving task performance.
- DuQuant++: Fine-grained Rotation Enhances Microscaling FP4 Quantization
  DuQuant++ adapts outlier-aware fine-grained rotation to MXFP4 by matching block size to the 32-element microscaling group, enabling a single rotation that smooths distributions and achieves SOTA performance on LLaMA-3...
Reference graph
Works this paper leans on
- [1] Saleh Ashkboos, Amirkeivan Mohtashami, Maximilian L Croci, Bo Li, Martin Jaggi, Dan Alistarh, Torsten Hoefler, and James Hensman. Quarot: Outlier-free 4-bit inference in rotated llms. arXiv preprint arXiv:2404.00456, 2024.
- [2] Guillaume Astruc, Nicolas Gonthier, Clement Mallet, and Loic Landrieu. Omnisat: Self-supervised modality fusion for earth observation. In European Conference on Computer Vision, pages 409–427. Springer, 2024.
- [3] Anas Awadalla, Irena Gao, Josh Gardner, Jack Hessel, Yusuf Hanafy, Wanrong Zhu, Kalyani Marathe, Yonatan Bitton, Samir Gadre, Shiori Sagawa, et al. Openflamingo: An open-source framework for training large autoregressive vision-language models. arXiv preprint arXiv:2308.01390, 2023.
- [4] Johan Bjorck, Fernando Castañeda, Nikita Cherniadev, Xingye Da, Runyu Ding, Linxi Fan, Yu Fang, Dieter Fox, Fengyuan Hu, Spencer Huang, et al. Gr00t n1: An open foundation model for generalist humanoid robots. arXiv preprint arXiv:2503.14734, 2025.
- [5] Kevin Black, Noah Brown, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Lachy Groom, Karol Hausman, Brian Ichter, et al. $\pi_0$: A vision-language-action flow model for general robot control. arXiv preprint arXiv:2410.24164, 2024.
- [6] Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Joseph Dabis, Chelsea Finn, Keerthana Gopalakrishnan, Karol Hausman, Alex Herzog, Jasmine Hsu, et al. Rt-1: Robotics transformer for real-world control at scale. arXiv preprint arXiv:2212.06817, 2022.
- [7] Cheng Chi, Zhenjia Xu, Siyuan Feng, Eric Cousineau, Yilun Du, Benjamin Burchfiel, Russ Tedrake, and Shuran Song. Diffusion policy: Visuomotor policy learning via action diffusion. The International Journal of Robotics Research, 44(10-11):1684–1704, 2025.
- [8] Peijie Dong, Lujun Li, Yuedong Zhong, Dayou Du, Ruibo Fan, Yuhan Chen, Zhenheng Tang, Qiang Wang, Wei Xue, Yike Guo, et al. Stbllm: Breaking the 1-bit barrier with structured binary llms. arXiv preprint arXiv:2408.01803, 2024.
- [9] Yilun Du, Sherry Yang, Bo Dai, Hanjun Dai, Ofir Nachum, Josh Tenenbaum, Dale Schuurmans, and Pieter Abbeel. Learning universal policies via text-guided video generation. Advances in neural information processing systems, 36:9156–9172, 2023.
- [10] Elias Frantar, Saleh Ashkboos, Torsten Hoefler, and Dan Alistarh. Gptq: Accurate post-training quantization for generative pre-trained transformers. arXiv preprint arXiv:2210.17323, 2022.
- [11] Xing Hu, Yuan Cheng, Dawei Yang, Zukang Xu, Zhihang Yuan, Jiangyong Yu, Chen Xu, Zhe Jiang, and Sifan Zhou. Ostquant: Refining large language model quantization with orthogonal and scaling transformations for better distribution fitting. arXiv preprint arXiv:2501.13987, 2025.
- [12] Physical Intelligence, Kevin Black, Noah Brown, James Darpinian, Karan Dhabalia, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, et al. $\pi_{0.5}$: a vision–language–action model with open-world generalization. arXiv preprint arXiv:2504.16054, 2025.
- [13] Kento Kawaharazuka, Jihoon Oh, Jun Yamada, Ingmar Posner, and Yuke Zhu. Vision-language-action models for robotics: A review towards real-world applications. IEEE Access, 2025.
- [14] Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan Foster, Grace Lam, Pannag Sanketi, et al. Openvla: An open-source vision-language-action model. arXiv preprint arXiv:2406.09246, 2024.
- [15] Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In International conference on machine learning, pages 12888–12900. PMLR, 2022.
- [16] Muyang Li, Yujun Lin, Zhekai Zhang, Tianle Cai, Xiuyu Li, Junxian Guo, Enze Xie, Chenlin Meng, Jun-Yan Zhu, and Song Han. Svdquant: Absorbing outliers by low-rank components for 4-bit diffusion models. arXiv preprint arXiv:2411.05007, 2024.
- [17] Xuanlin Li, Kyle Hsu, Jiayuan Gu, Karl Pertsch, Oier Mees, Homer Rich Walke, Chuyuan Fu, Ishikaa Lunawat, Isabel Sieh, Sean Kirmani, Sergey Levine, Jiajun Wu, Chelsea Finn, Hao Su, Quan Vuong, and Ted Xiao. Evaluating real-world robot manipulation policies in simulation. arXiv preprint arXiv:2405.05941, 2024.
- [18] Haokun Lin, Haobo Xu, Yichen Wu, Jingzhi Cui, Yingtao Zhang, Linzhan Mou, Linqi Song, Zhenan Sun, and Ying Wei. Duquant: Distributing outliers via dual transformation makes stronger quantized llms. Advances in Neural Information Processing Systems, 37:87766–87800, 2024.
- [19] Haokun Lin, Haobo Xu, Yichen Wu, Ziyu Guo, Renrui Zhang, Zhichao Lu, Ying Wei, Qingfu Zhang, and Zhenan Sun. Quantization meets dllms: A systematic study of post-training quantization for diffusion llms. arXiv preprint arXiv:2508.14896, 2025.
- [20] Haokun Lin, Xinle Jia, Shaozhen Liu, Shujun Xia, Weitao Huang, Haobo Xu, Junyang Li, Yicheng Xiao, Xingrun Xing, Ziyu Guo, et al. Efficient diffusion language models: A comprehensive survey. Authorea Preprints, 2026.
- [21] Ji Lin, Jiaming Tang, Haotian Tang, Shang Yang, Wei-Ming Chen, Wei-Chen Wang, Guangxuan Xiao, Xingyu Dang, Chuang Gan, and Song Han. Awq: Activation-aware weight quantization for on-device llm compression and acceleration. Proceedings of machine learning and systems, 6:87–100, 2024.
- [22] Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow matching for generative modeling. arXiv preprint arXiv:2210.02747, 2022.
- [23] Bo Liu, Yifeng Zhu, Chongkai Gao, Yihao Feng, Qiang Liu, Yuke Zhu, and Peter Stone. Libero: Benchmarking knowledge transfer for lifelong robot learning. Advances in Neural Information Processing Systems, 36:44776–44791, 2023.
- [24] Songming Liu, Lingxuan Wu, Bangguo Li, Hengkai Tan, Huayu Chen, Zhengyi Wang, Ke Xu, Hang Su, and Jun Zhu. Rdt-1b: a diffusion foundation model for bimanual manipulation. arXiv preprint arXiv:2410.07864, 2024.
- [25] Zhenhua Liu, Yunhe Wang, Kai Han, Wei Zhang, Siwei Ma, and Wen Gao. Post-training quantization for vision transformer. Advances in Neural Information Processing Systems, 34:28092–28103, 2021.
- [26] Markus Nagel, Rana Ali Amjad, Mart Van Baalen, Christos Louizos, and Tijmen Blankevoort. Up or down? adaptive rounding for post-training quantization. In International conference on machine learning, pages 7197–7206. PMLR,
- [27] Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. Dinov2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193, 2023.
- [28] William Peebles and Saining Xie. Scalable diffusion models with transformers. In Proceedings of the IEEE/CVF international conference on computer vision, pages 4195–4205,
- [29] Karl Pertsch, Kyle Stachowicz, Brian Ichter, Danny Driess, Suraj Nair, Quan Vuong, Oier Mees, Chelsea Finn, and Sergey Levine. Fast: Efficient action tokenization for vision-language-action models. arXiv preprint arXiv:2501.09747,
- [30] Moritz Reuss, Hongyi Zhou, Marcel Rühle, Ömer Erdinç Yağmurlu, Fabian Otto, and Rudolf Lioutikov. Flower: Democratizing generalist robot policies with efficient vision-language-action flow policies. arXiv preprint arXiv:2509.04996, 2025.
- [31] Wenqi Shao, Mengzhao Chen, Zhaoyang Zhang, Peng Xu, Lirui Zhao, Zhiqian Li, Kaipeng Zhang, Peng Gao, Yu Qiao, and Ping Luo. Omniquant: Omnidirectionally calibrated quantization for large language models. In The Twelfth International Conference on Learning Representations, 2023.
- [32] Hui Shen, Jingxuan Zhang, Boning Xiong, Rui Hu, Shoufa Chen, Zhongwei Wan, Xin Wang, Yu Zhang, Zixuan Gong, Guangyin Bao, et al. Efficient diffusion models: A survey. arXiv preprint arXiv:2502.06805, 2025.
- [33] Mustafa Shukor, Dana Aubakirova, Francesco Capuano, Pepijn Kooijmans, Steven Palma, Adil Zouitine, Michel Aractingi, Caroline Pascal, Martino Russi, Andres Marafioti, et al. Smolvla: A vision-language-action model for affordable and efficient robotics. arXiv preprint arXiv:2506.01844,
- [34] Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. arXiv preprint arXiv:2011.13456, 2020.
- [35] Yuxuan Sun, Ruikang Liu, Haoli Bai, Han Bao, Kang Zhao, Yuening Li, Jiaxin Hu, Xianzhi Yu, Lu Hou, Chun Yuan, et al. Flatquant: Flatness matters for llm quantization. arXiv preprint arXiv:2410.09426, 2024.
- [36] Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.
- [37] Lirui Wang, Xinlei Chen, Jialiang Zhao, and Kaiming He. Scaling proprioceptive-visual learning with heterogeneous pre-trained transformers. Advances in neural information processing systems, 37:124420–124450, 2024.
- [38] Junjie Wen, Yichen Zhu, Jinming Li, Minjie Zhu, Zhibin Tang, Kun Wu, Zhiyuan Xu, Ning Liu, Ran Cheng, Chaomin Shen, et al. Tinyvla: Towards fast, data-efficient vision-language-action models for robotic manipulation. IEEE Robotics and Automation Letters, 2025.
- [39] Junyi Wu, Haoxuan Wang, Yuzhang Shang, Mubarak Shah, and Yan Yan. Ptq4dit: Post-training quantization for diffusion transformers. Advances in neural information processing systems, 37:62732–62755, 2024.
- [40] Guangxuan Xiao, Ji Lin, Mickael Seznec, Hao Wu, Julien Demouth, and Song Han. Smoothquant: Accurate and efficient post-training quantization for large language models. In International conference on machine learning, pages 38087–38099. PMLR, 2023.
- [41] Siyu Xu, Yunke Wang, Chenghao Xia, Dihao Zhu, Tao Huang, and Chang Xu. Vla-cache: Towards efficient vision-language-action model via adaptive token caching in robotic manipulation. arXiv preprint arXiv:2502.02175, 2025.
- [42] An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025.
- [43] Lianwei Yang, Haisong Gong, Haokun Lin, Yichen Wu, Zhenan Sun, and Qingyi Gu. Dopq-vit: Towards distribution-friendly and outlier-aware post-training quantization for vision transformers. arXiv preprint arXiv:2408.03291, 2024.
- [44] Lianwei Yang, Haokun Lin, Tianchen Zhao, Yichen Wu, Hongyu Zhu, Ruiqi Xie, Zhenan Sun, Yu Wang, and Qingyi Gu. Lrq-dit: Log-rotation post-training quantization of diffusion transformers for text-to-image generation. arXiv preprint arXiv:2508.03485, 2025.
- [45] Yantai Yang, Yuhao Wang, Zichen Wen, Luo Zhongwei, Chang Zou, Zhipeng Zhang, Chuan Wen, and Linfeng Zhang. Efficientvla: Training-free acceleration and compression for vision-language-action models. arXiv preprint arXiv:2506.10100, 2025.
- [46] Zhihang Yuan, Lin Niu, Jiawei Liu, Wenyu Liu, Xinggang Wang, Yuzhang Shang, Guangyu Sun, Qiang Wu, Jiaxiang Wu, and Bingzhe Wu. Rptq: Reorder-based post-training quantization for large language models. arXiv preprint arXiv:2304.01089, 2023.
- [47] Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. Sigmoid loss for language image pre-training. In Proceedings of the IEEE/CVF international conference on computer vision, pages 11975–11986, 2023.
- [48] Rongyu Zhang, Menghang Dong, Yuan Zhang, Liang Heng, Xiaowei Chi, Gaole Dai, Li Du, Yuan Du, and Shanghang Zhang. Mole-vla: Dynamic layer-skipping vision language action model via mixture-of-layers for efficient robot manipulation. arXiv preprint arXiv:2503.20384, 2025.
- [49] Tianchen Zhao, Tongcheng Fang, Haofeng Huang, Enshu Liu, Rui Wan, Widyadewi Soedarmadji, Shiyao Li, Zinan Lin, Guohao Dai, Shengen Yan, et al. Vidit-q: Efficient and accurate quantization of diffusion transformers for image and video generation. arXiv preprint arXiv:2406.02540,
- [50] Tianchen Zhao, Xuefei Ning, Tongcheng Fang, Enshu Liu, Guyue Huang, Zinan Lin, Shengen Yan, Guohao Dai, and Yu Wang. Mixdq: Memory-efficient few-step text-to-image diffusion models with metric-decoupled mixed precision quantization. In European Conference on Computer Vision, pages 285–302. Springer, 2024.
- [51] Tony Z Zhao, Vikash Kumar, Sergey Levine, and Chelsea Finn. Learning fine-grained bimanual manipulation with low-cost hardware. arXiv preprint arXiv:2304.13705, 2023.
- [52] Jinliang Zheng, Jianxiong Li, Zhihao Wang, Dongxiu Liu, Xirui Kang, Yuchun Feng, Yinan Zheng, Jiayin Zou, Yilun Chen, Jia Zeng, et al. X-vla: Soft-prompted transformer as scalable cross-embodiment vision-language-action model. arXiv preprint arXiv:2510.10274, 2025.
- [53] Yifan Zhong, Fengshuo Bai, Shaofei Cai, Xuchuan Huang, Zhang Chen, Xiaowei Zhang, Yuanfei Wang, Shaoyang Guo, Tianrui Guan, Ka Nam Lui, et al. A survey on vision-language-action models: An action tokenization perspective. arXiv preprint arXiv:2507.01925, 2025.
- [54] Hongyi Zhou, Weiran Liao, Xi Huang, Yucheng Tang, Fabian Otto, Xiaogang Jia, Xinkai Jiang, Simon Hilber, Ge Li, Qian Wang, et al. Beast: Efficient tokenization of b-splines encoded action sequences for imitation learning. arXiv preprint arXiv:2506.06072, 2025.
- [55] Siyuan Zhou, Yilun Du, Jiaben Chen, Yandong Li, Dit-Yan Yeung, and Chuang Gan. Robodreamer: Learning compositional world models for robot imagination. arXiv preprint arXiv:2404.12377, 2024.
- [56] Brianna Zitkovich, Tianhe Yu, Sichun Xu, Peng Xu, Ted Xiao, Fei Xia, Jialin Wu, Paul Wohlhart, Stefan Welker, Ayzaan Wahid, et al. Rt-2: Vision-language-action models transfer web knowledge to robotic control. In Conference on Robot Learning, pages 2165–2183. PMLR, 2023.
A. General Quantization Formulations
Post-training quantization (PTQ) [26, 39, 50] r...