
arxiv: 2606.03724 · v1 · pith:I6ZEWYBKnew · submitted 2026-06-02 · 💻 cs.CR

Same Weights, Different Robot: A Deployment Safety View of VLA Policies

Pith reviewed 2026-06-28 09:28 UTC · model grok-4.3

classification 💻 cs.CR
keywords: VLA policies · deployment safety · action normalization · metadata mismatch · executable policy · LIBERO benchmark · quantile normalization

The pith

Identical VLA checkpoints can be executable-inequivalent due to action metadata differences.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Vision-language-action policies are often assumed to be defined solely by their weights, prompt, and benchmark. However, robot execution depends on action representation, metadata-selected unnormalizer, and controller conventions, creating a deployment safety gap. The paper formalizes this as an executable policy specification problem where identical checkpoints can produce different physical actions. For quantile-style normalization, a closed-form transform and ExecSpec certificate detect semantic drift without running the model. Replay experiments on LIBERO show that metadata substitution can drastically reduce success rates, supporting the need to check action-space metadata before rollout.

Core claim

We formalize the gap as an executable policy specification problem: a VLA policy includes the learned model, action representation, metadata-selected unnormalizer, and controller-facing conventions. Under this view, identical checkpoints can be executable-inequivalent. For quantile-style action normalization, we derive a closed-form metadata mismatch transform and an ExecSpec certificate that measures action-space semantic drift without model inference or rollout.
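A sketch of how such a closed-form transform can arise, assuming a quantile convention of the kind used by OpenVLA-style unnormalizers: a normalized output \hat a_i in [-1, 1] is mapped back to robot units using per-dimension low and high quantiles (\ell_i, h_i). The symbols and the exact convention are assumptions for illustration; the paper's own definitions may differ.

```latex
% Unnormalization under metadata key k (assumed quantile convention):
a_i^{(k)} = \ell_i^{(k)} + \tfrac{1}{2}\,(\hat a_i + 1)\,\bigl(h_i^{(k)} - \ell_i^{(k)}\bigr)

% Eliminating \hat a_i between the intended key A and a substituted key B
% (requires h_i^{(A)} \neq \ell_i^{(A)}):
a_i^{(B)} = s_i\, a_i^{(A)} + t_i,
\qquad
s_i = \frac{h_i^{(B)} - \ell_i^{(B)}}{h_i^{(A)} - \ell_i^{(A)}},
\qquad
t_i = \ell_i^{(B)} - s_i\,\ell_i^{(A)}
```

Under these assumptions the map is affine in each dimension and depends only on the two metadata entries, which is consistent with the claim that drift can be measured without model inference or rollout.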

What carries the argument

The ExecSpec certificate that measures action-space semantic drift from metadata mismatches in quantile-style action normalization without requiring model inference or rollout.
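To illustrate what a model-free drift measure can look like, here is a minimal sketch that compares two metadata entries directly. It is not the paper's ExecSpec definition: the `q01`/`q99` layout, the function names, the worst-case drift score, and the numbers are all hypothetical, and the unnormalization convention (normalized range [-1, 1] mapped linearly onto the quantile range) is an assumption.

```python
from typing import Dict, List, Sequence

# Hypothetical metadata layout, not taken from the paper:
# per-dimension low/high quantiles stored under "q01" and "q99".
Stats = Dict[str, List[float]]


def unnormalize(a_norm: Sequence[float], stats: Stats) -> List[float]:
    """Map normalized actions in [-1, 1] to robot units (assumed convention)."""
    return [
        lo + 0.5 * (a + 1.0) * (hi - lo)
        for a, lo, hi in zip(a_norm, stats["q01"], stats["q99"])
    ]


def worst_case_drift(intended: Stats, substituted: Stats) -> List[float]:
    """Per-dimension max |a_B - a_A| over all normalized outputs in [-1, 1].

    The difference between the two unnormalized actions is linear in the
    normalized output, so its maximum magnitude is reached at an endpoint.
    """
    return [
        max(abs(lo_b - lo_a), abs(hi_b - hi_a))
        for lo_a, hi_a, lo_b, hi_b in zip(
            intended["q01"], intended["q99"],
            substituted["q01"], substituted["q99"],
        )
    ]


# Made-up statistics for two sibling keys; illustrative only.
key_a = {"q01": [-0.5, -0.25], "q99": [0.5, 0.25]}
key_b = {"q01": [-0.75, -0.25], "q99": [1.0, 0.25]}

drift = worst_case_drift(key_a, key_b)
```

No model call appears anywhere in the sketch: the score is a function of the two metadata entries alone, so it could run as a pre-rollout check.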

Load-bearing premise

The replay-based substitution experiments on LIBERO benchmarks accurately indicate a general deployment safety issue.

What would settle it

Finding that different metadata keys produce identical unnormalized action sequences and success rates on the same checkpoint would falsify the claim of executable inequivalence.
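The action-sequence half of this test can be sketched as a replay comparison: push the same recorded normalized outputs through two unnormalizers and report where they first disagree. The metadata layout, function names, tolerance, and data below are hypothetical, and the quantile convention is an assumption rather than the paper's stated definition.

```python
from typing import Dict, List, Sequence

# Hypothetical metadata layout, not taken from the paper:
# per-dimension low/high quantiles stored under "q01" and "q99".
Stats = Dict[str, List[float]]


def unnormalize(a_norm: Sequence[float], stats: Stats) -> List[float]:
    """Assumed quantile convention: [-1, 1] maps linearly onto [q01, q99]."""
    return [
        lo + 0.5 * (a + 1.0) * (hi - lo)
        for a, lo, hi in zip(a_norm, stats["q01"], stats["q99"])
    ]


def first_divergence(
    normalized_seq: Sequence[Sequence[float]],
    stats_a: Stats,
    stats_b: Stats,
    tol: float = 1e-9,
) -> int:
    """Return the first step where the two keys disagree, or -1 if none."""
    for step, a_norm in enumerate(normalized_seq):
        act_a = unnormalize(a_norm, stats_a)
        act_b = unnormalize(a_norm, stats_b)
        if any(abs(x - y) > tol for x, y in zip(act_a, act_b)):
            return step
    return -1


# Made-up recorded outputs and statistics, for illustration only.
recorded = [[-1.0, 0.0], [0.0, 0.0], [1.0, 0.5]]
key_a = {"q01": [-0.5, -0.25], "q99": [0.5, 0.25]}
key_b = {"q01": [-0.5, -0.25], "q99": [1.0, 0.25]}

same_key_step = first_divergence(recorded, key_a, key_a)
sibling_step = first_divergence(recorded, key_a, key_b)
```

A result of -1 for every pair of keys on the same checkpoint, together with matching success rates, is the outcome that would count against executable inequivalence.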

Original abstract

Vision-language-action (VLA) policies are often treated as checkpoint-defined objects: if the weights, prompt, and benchmark suite match, the deployment is assumed to be the same policy. Robot execution breaks this assumption because the same normalized model output can become a different physical action after action unnormalization and controller conventions are applied. This creates a deployment-safety gap: safety review can certify the checkpoint while missing the executable robot policy that reaches the controller. We formalize this gap as an executable policy specification problem: a VLA policy includes the learned model, action representation, metadata-selected unnormalizer, and controller-facing conventions. Under this view, identical checkpoints can be executable-inequivalent. For quantile-style action normalization, we derive a closed-form metadata mismatch transform and an ExecSpec certificate that measures action-space semantic drift without model inference or rollout. On LIBERO-Goal replay, substituting a plausible sibling metadata key yields mean drift 0.199 over six non-gripper action dimensions and reduces success from 28/28 to 2/28 under full substitution. On LIBERO-Spatial replay, the same substituted key reduces success from 26/26 to 0/26. The same full-substitution protocol gives 0/28 success for all four Object substitutions and 0/23 or 1/23 success on Long. Identity-key, replay-validity, no-op filtering, raw-vs-correct replay, mask/gripper, synthetic upper-bound, and OpenVLA-style unnormalizer interface checks rule out several simpler explanations. These results do not certify closed-loop or hardware safety. They support a narrower deployment-safety view: action-space metadata is part of the executable policy and should be checked before rollout.
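One way the closing recommendation could be put into practice is to fingerprint the full executable specification rather than the checkpoint alone, and to compare fingerprints before rollout. The sketch below is an assumption-laden illustration, not a mechanism described in the abstract: the field names, key names, and values are hypothetical.

```python
import hashlib
import json
from typing import Any, Dict


def exec_spec_fingerprint(spec: Dict[str, Any]) -> str:
    """Hash a canonical JSON encoding of the executable specification."""
    canonical = json.dumps(spec, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Hypothetical specification reviewed before deployment.
spec_reviewed = {
    "checkpoint_sha256": "placeholder-checkpoint-hash",
    "unnorm_key": "libero_goal",  # hypothetical key name
    "q01": [-0.5, -0.25],
    "q99": [0.5, 0.25],
    "gripper_convention": "open_is_positive",  # hypothetical field
}

# Same weights, but a sibling metadata key selected at deployment time.
spec_deployed = dict(spec_reviewed, unnorm_key="libero_spatial")

fp_reviewed = exec_spec_fingerprint(spec_reviewed)
fp_deployed = exec_spec_fingerprint(spec_deployed)
rollout_allowed = fp_reviewed == fp_deployed
```

Because the checkpoint hash is identical in both specifications, a checkpoint-only comparison would pass here, while the specification-level fingerprint does not.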

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 3 minor

Summary. The manuscript claims that VLA policies are not fully specified by model weights alone, since action unnormalization metadata and controller conventions determine the executable robot policy. Identical checkpoints can therefore be executable-inequivalent. For quantile-style normalization the authors derive a closed-form metadata mismatch transform and introduce an ExecSpec certificate that quantifies action-space semantic drift without model inference or rollout. Replay substitution experiments on LIBERO-Goal and LIBERO-Spatial report mean drift of 0.199 across six non-gripper dimensions and success-rate drops (28/28 to 2/28; 26/26 to 0/26) under full substitution of a plausible sibling metadata key. Multiple controls (identity-key, replay-validity, no-op filtering, raw-vs-correct replay, mask/gripper, synthetic upper-bound, OpenVLA-style interface) rule out simpler explanations. The results support treating action metadata as part of the executable specification that should be checked before rollout, while explicitly stating that the replay protocol does not certify closed-loop or hardware safety.

Significance. If the central claim holds, the work highlights an under-appreciated deployment-safety consideration for VLA policies: metadata must be included in the policy specification. The closed-form derivation and the model-free ExecSpec certificate are concrete strengths that allow drift detection without inference or rollouts. The manuscript carefully scopes its conclusions to the replay setting, which prevents overgeneralization and directly addresses the bridging-assumption concern raised in the stress-test note. The empirical measurements on standard benchmarks provide direct, falsifiable evidence of the phenomenon.

minor comments (3)
  1. The abstract lists the controls that rule out simpler explanations but does not indicate where in the manuscript the detailed results of each control appear; a short summary table or dedicated paragraph would improve traceability.
  2. The ExecSpec certificate is introduced as a model-free measure, yet the abstract provides no equation or pseudocode; including the precise definition (even if only referenced) would aid reproducibility.
  3. Success rates are reported as exact fractions (28/28, 2/28) without accompanying confidence intervals, variance, or statistical tests; adding these details would strengthen the presentation of the quantitative results.
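As a sketch of the kind of uncertainty estimate comment 3 asks for, a Wilson score interval can be computed directly from the reported counts. The choice of the Wilson interval and the 95% level (z = 1.96) are assumptions made here for illustration, not something the manuscript or the report specifies.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion."""
    if trials <= 0 or not 0 <= successes <= trials:
        raise ValueError("need trials > 0 and 0 <= successes <= trials")
    p_hat = successes / trials
    z2 = z * z
    denom = 1.0 + z2 / trials
    center = (p_hat + z2 / (2.0 * trials)) / denom
    half = z * math.sqrt(p_hat * (1.0 - p_hat) / trials + z2 / (4.0 * trials * trials)) / denom
    # Clamp to [0, 1] to absorb floating-point overshoot at the extremes.
    return max(0.0, center - half), min(1.0, center + half)


# Counts taken from the abstract's LIBERO-Goal replay results.
baseline_low, baseline_high = wilson_interval(28, 28)
substituted_low, substituted_high = wilson_interval(2, 28)
intervals_overlap = substituted_high >= baseline_low
```

The Wilson interval is used rather than the normal approximation because it stays well behaved at proportions of exactly 0 or 1, which is where most of the reported results sit.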

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive and accurate summary of our manuscript, including the recognition of the closed-form metadata mismatch transform, the ExecSpec certificate, and the careful scoping to replay-based evidence. The recommendation of minor revision is noted; we will incorporate any editorial or minor clarifications in the revised version.

Circularity Check

0 steps flagged

No circularity: closed-form transform follows directly from quantile definition; results are independent empirical measurements

full rationale

The paper's central derivation is an algebraic closed-form transform obtained directly from the standard definition of quantile-style action normalization; this is ordinary mathematical expansion rather than any self-referential loop or fitted input renamed as prediction. The LIBERO replay substitution results are direct empirical observations of success-rate changes under metadata substitution and are not quantities generated by the paper's own equations. No self-citations, uniqueness theorems, or ansatzes imported from prior author work appear as load-bearing steps. The derivation chain is therefore self-contained against external benchmarks and does not reduce to its inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the domain assumption that quantile-style normalization applies and introduces the ExecSpec certificate as a new measurement tool without external falsifiable evidence.

axioms (1)
  • domain assumption: Action normalization follows a quantile-style process.
    The closed-form metadata mismatch transform is derived specifically under this normalization type.
invented entities (1)
  • ExecSpec certificate (no independent evidence)
    purpose: Measures action-space semantic drift without requiring model inference or rollout.
    Newly defined in the paper to quantify the executable policy gap.


