Embodied.cpp: A Portable Inference Runtime of Embodied AI Models on Heterogeneous Robots
Pith reviewed 2026-07-03 10:34 UTC · model grok-4.3
The pith
Embodied.cpp supplies a single C++ runtime that runs vision-language-action and world-action models on heterogeneous robots via one five-layer abstraction.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Embodied.cpp captures a shared execution path from representative VLA models and WAMs and organizes it into five layers—input adapters, sequence builders, backbone execution, head plugins, and deployment adapters—that together deliver the runtime contract required for embodied deployment: multi-rate execution inside closed-loop control, latency-first batch-1 inference on heterogeneous hardware, and extensible embodied interfaces beyond fixed token I/O.
What carries the argument
Five-layer structure (input adapters, sequence builders, backbone execution, head plugins, deployment adapters) that abstracts the common components of embodied models for a single backend.
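The five-layer split can be pictured as a chain of stages with one call per control step. The following is a minimal sketch only: `Tensor`, `Observation`, `Action`, `Pipeline`, and `make_toy_pipeline` are hypothetical names chosen for illustration, and nothing here is taken from the Embodied.cpp source, whose actual interfaces the abstract does not describe.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <vector>

// All names below are hypothetical; they are not the Embodied.cpp API.
using Tensor = std::vector<float>;

struct Observation {
    std::vector<float> image;   // flattened camera pixels
    std::vector<float> state;   // proprioceptive state
    std::string instruction;    // language instruction
};

struct Action { std::vector<float> joints; };

// One callable per layer, in the order the layers are listed above.
struct Pipeline {
    std::function<Tensor(const Observation&)> input_adapter;
    std::function<Tensor(const Tensor&)> sequence_builder;
    std::function<Tensor(const Tensor&)> backbone;
    std::function<Action(const Tensor&)> head_plugin;
    std::function<void(const Action&)> deployment_adapter;

    // Runs one observation through all five stages and returns the action.
    Action step(const Observation& obs) const {
        assert(input_adapter && sequence_builder && backbone &&
               head_plugin && deployment_adapter);
        Tensor features = input_adapter(obs);
        Tensor sequence = sequence_builder(features);
        Tensor hidden = backbone(sequence);
        Action action = head_plugin(hidden);
        deployment_adapter(action);
        return action;
    }
};

// Toy stages that stand in for real model components.
Pipeline make_toy_pipeline(std::vector<Action>& sink) {
    Pipeline p;
    p.input_adapter = [](const Observation& o) {
        Tensor t = o.image;                               // image first,
        t.insert(t.end(), o.state.begin(), o.state.end()); // then state
        return t;
    };
    p.sequence_builder = [](const Tensor& t) { return t; }; // identity
    p.backbone = [](const Tensor& t) {
        Tensor out = t;
        for (float& v : out) v *= 2.0f;                   // placeholder compute
        return out;
    };
    p.head_plugin = [](const Tensor& h) { return Action{h}; };
    p.deployment_adapter = [&sink](const Action& a) { sink.push_back(a); };
    return p;
}
```

Under this sketch, moving from a simulator to a robot would mean replacing only the `deployment_adapter` callable, and supporting a new model would mean replacing the middle stages while `step` stays unchanged.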
If this is right
- In closed-loop execution, the two evaluated VLA models, HY-VLA and pi0.5, reach task success rates of 100.0% and 91.0%, respectively.
- In the preliminary WAM benchmark, memory for a LingBot-VA Transformer block drops from 312.2 MiB to 88.1 MiB.
- One backend abstraction supports deployment on heterogeneous devices, robots, and simulators.
- Modular multi-rate execution and extensible I/O are available without per-model Python stacks.
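For scale, the reported memory figures can be turned into a relative reduction. This is plain arithmetic on the two numbers quoted above, not an additional result from the paper:

```latex
\[
\frac{312.2 - 88.1}{312.2} = \frac{224.1}{312.2} \approx 0.718,
\qquad
\frac{312.2}{88.1} \approx 3.54
\]
```

That is, the block uses roughly 72% less memory, or about 3.5 times less, than the 312.2 MiB starting point.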
Where Pith is reading between the lines
- The same five-layer split could be tested on future model families to check whether the abstraction remains stable.
- Hardware vendors could supply only the deployment-adapter layer while keeping the rest unchanged.
- Simulator loops could reuse the identical runtime binary to shorten the path from model training to robot deployment.
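The vendor-supplied adapter idea can be sketched as a small interface with interchangeable implementations. Everything here is an assumption made for illustration: `DeploymentAdapter`, `SimulatorAdapter`, `ClampedRobotAdapter`, and `run_policy_step` are hypothetical names, and the joint clamping is a stand-in for whatever a real vendor layer would do.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

struct Action { std::vector<float> joints; };

// Hypothetical interface; not the Embodied.cpp API.
class DeploymentAdapter {
public:
    virtual ~DeploymentAdapter() = default;
    virtual std::string name() const = 0;
    virtual void send(const Action& action) = 0;
};

// Simulator target: records every action it receives.
class SimulatorAdapter : public DeploymentAdapter {
public:
    std::string name() const override { return "simulator"; }
    void send(const Action& action) override { log_.push_back(action); }
    const std::vector<Action>& log() const { return log_; }
private:
    std::vector<Action> log_;
};

// Stand-in for a vendor adapter: clamps each joint command to a
// symmetric limit before it would be written to hardware.
class ClampedRobotAdapter : public DeploymentAdapter {
public:
    explicit ClampedRobotAdapter(float limit) : limit_(limit) {
        assert(limit_ > 0.0f);
    }
    std::string name() const override { return "clamped-robot"; }
    void send(const Action& action) override {
        last_ = action;
        for (float& j : last_.joints) {
            j = std::max(-limit_, std::min(limit_, j));
        }
    }
    const Action& last_sent() const { return last_; }
private:
    float limit_;
    Action last_;
};

// Runtime-side code sees only the interface, so it is the same for
// every target.
void run_policy_step(DeploymentAdapter& target, const Action& action) {
    target.send(action);
}
```

In this arrangement the same `run_policy_step` serves both the simulator and the robot, which is the property the two bullets above rely on.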
Load-bearing premise
The architectural analysis of representative VLA and WAM models has identified a shared execution path that the five-layer structure can capture for general use across devices and robots.
What would settle it
A new embodied model whose required execution steps fall outside the five layers and which, on equivalent benchmarks, either scores below the reported 100.0% and 91.0% task success rates or uses more memory than the reported 88.1 MiB.
Original abstract
Embodied AI models now span vision-language-action (VLA) models and world-action models (WAMs), but practical deployment remains fragmented across model-specific Python stacks, backend assumptions, and robot-side glue code, especially on heterogeneous edge devices. Existing inference runtimes are designed mainly for request-response serving and therefore do not satisfy the runtime contract of embodied deployment: multi-rate execution inside closed-loop control, latency-first batch-1 inference on heterogeneous hardware, and extensible embodied interfaces beyond fixed token I/O. We present Embodied.cpp, a portable C++ inference runtime for embodied models. Based on an architectural analysis of representative VLA models and WAMs, Embodied.cpp captures a shared execution path and organizes it into five layers: input adapters, sequence builders, backbone execution, head plugins, and deployment adapters. The runtime provides modular multi-rate execution, latency-first fused inference, and extensible operator and I/O support, enabling deployment across heterogeneous devices, robots, and simulators through one backend abstraction. We evaluate Embodied.cpp on two VLA models, HY-VLA and pi0.5, and on a preliminary WAM benchmark using a LingBot-VA Transformer block. The VLA deployments achieve successful closed-loop execution with 100.0% and 91.0% task success rates, respectively. The WAM benchmark reduces block memory from 312.2 MiB to 88.1 MiB. These results show that Embodied.cpp improves deployment efficiency while preserving high accuracy across diverse embodied model architectures.
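One way to read "multi-rate execution inside closed-loop control" is that an expensive backbone is refreshed only every few control ticks while a cheaper head runs on every tick against the cached result. The sketch below illustrates that reading only. `MultiRateRunner` and its parameters are hypothetical, the loop is synchronous for simplicity, and the abstract does not say how Embodied.cpp actually schedules its stages.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <utility>
#include <vector>

using Tensor = std::vector<float>;

// Hypothetical scheduler; not the Embodied.cpp API.
class MultiRateRunner {
public:
    MultiRateRunner(std::size_t backbone_period,
                    std::function<Tensor(const Tensor&)> backbone,
                    std::function<Tensor(const Tensor&, const Tensor&)> head)
        : period_(backbone_period),
          backbone_(std::move(backbone)),
          head_(std::move(head)) {
        assert(period_ > 0);
    }

    // One control tick: refresh the slow context on every period-th
    // tick (starting with the first), then always run the fast head.
    Tensor tick(const Tensor& observation) {
        if (tick_count_ % period_ == 0) {
            cached_context_ = backbone_(observation);
            ++backbone_calls_;
        }
        ++tick_count_;
        return head_(cached_context_, observation);
    }

    std::size_t backbone_calls() const { return backbone_calls_; }
    std::size_t ticks() const { return tick_count_; }

private:
    std::size_t period_;
    std::function<Tensor(const Tensor&)> backbone_;
    std::function<Tensor(const Tensor&, const Tensor&)> head_;
    Tensor cached_context_;
    std::size_t tick_count_ = 0;
    std::size_t backbone_calls_ = 0;
};
```

With a period of 4, ten control ticks trigger three backbone calls (ticks 0, 4, and 8) and ten head calls. A production runtime would more plausibly run the slow stage asynchronously so that it never blocks the control loop; that variant is not shown here.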
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Embodied.cpp, a portable C++ inference runtime for embodied AI models spanning vision-language-action (VLA) models and world-action models (WAMs). It claims that an architectural analysis of representative models reveals a shared execution path, which is captured in a five-layer structure (input adapters, sequence builders, backbone execution, head plugins, deployment adapters). The runtime supports modular multi-rate execution, latency-first fused inference, and extensible I/O for closed-loop control on heterogeneous devices. Evaluation reports 100.0% and 91.0% task success rates on HY-VLA and pi0.5, respectively, plus a memory reduction from 312.2 MiB to 88.1 MiB on a LingBot-VA Transformer block benchmark, arguing that the system improves deployment efficiency while preserving accuracy across architectures.
Significance. If the five-layer abstraction proves generalizable, Embodied.cpp would address a practical fragmentation problem in embodied AI deployment by supplying a unified C++ backend that replaces model-specific Python stacks and robot glue code. The concrete empirical results (high success rates under closed-loop conditions and substantial memory savings) provide direct evidence of utility for latency-sensitive, multi-rate inference on edge hardware. As an engineering contribution with explicit support for extensible operators, the work aligns with needs in robotics for reproducible deployment tools.
major comments (2)
- [Abstract] Abstract: The central claim that the five-layer decomposition captures a 'shared execution path' across VLA models and WAMs rests on an 'architectural analysis of representative VLA models and WAMs,' but the manuscript supplies no enumeration of the models examined, no discussion of architectural variations (e.g., differing tokenization, multi-rate sensor fusion, or non-transformer backbones), and no evidence that the layers suffice without model-specific extensions. This analysis is load-bearing for the portability claim.
- [Evaluation] Evaluation (results paragraph): The reported 100.0% and 91.0% success rates and the 312.2 MiB to 88.1 MiB memory reduction are presented without experimental protocol details, baseline comparisons, error bars, dataset specifications, or hardware configurations. Evaluation is restricted to two VLA models and one preliminary WAM block, which limits the ability to verify that the abstraction generalizes rather than fitting only the chosen examples.
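To make the error-bar objection concrete, the sketch below computes a Wilson score interval for a success rate. The formula is the standard one for a binomial proportion. The trial counts used with it are hypothetical, because the abstract reports percentages without the number of episodes, and `Interval` and `wilson_interval` are illustrative names.

```cpp
#include <cassert>
#include <cmath>

struct Interval { double low; double high; };

// Wilson score interval for a binomial proportion.
// z = 1.96 corresponds to roughly 95% coverage.
Interval wilson_interval(int successes, int trials, double z = 1.96) {
    assert(trials > 0 && successes >= 0 && successes <= trials);
    const double n = static_cast<double>(trials);
    const double p = successes / n;
    const double z2 = z * z;
    const double denom = 1.0 + z2 / n;
    const double center = (p + z2 / (2.0 * n)) / denom;
    const double half =
        z * std::sqrt(p * (1.0 - p) / n + z2 / (4.0 * n * n)) / denom;
    return {center - half, center + half};
}
```

If each rate came from a hypothetical 100 trials, 91 successes would give an interval of roughly 0.84 to 0.95, and 100 successes roughly 0.96 to 1.00. With fewer trials both intervals widen, which is why the missing protocol details matter for comparing the two models.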
minor comments (1)
- [Abstract] Abstract: Provide full citations or expanded names for HY-VLA, pi0.5, and LingBot-VA on first mention to improve accessibility for readers unfamiliar with the specific models.
Simulated Author's Rebuttal
We thank the referee for their thorough review and constructive comments on our manuscript. We address each of the major comments below and will incorporate revisions to strengthen the presentation of our architectural analysis and evaluation details.
Point-by-point responses
- Referee: [Abstract] Abstract: The central claim that the five-layer decomposition captures a 'shared execution path' across VLA models and WAMs rests on an 'architectural analysis of representative VLA models and WAMs,' but the manuscript supplies no enumeration of the models examined, no discussion of architectural variations (e.g., differing tokenization, multi-rate sensor fusion, or non-transformer backbones), and no evidence that the layers suffice without model-specific extensions. This analysis is load-bearing for the portability claim.
  Authors: We agree with the referee that the manuscript would be strengthened by providing an explicit enumeration of the models analyzed and a discussion of how the five-layer structure accommodates variations. The architectural analysis was conducted on HY-VLA and pi0.5 as representative VLA models, and the LingBot-VA Transformer block for WAMs. These were selected for their diversity in tokenization, sensor inputs, and backbone architectures. In the revision, we will add a dedicated paragraph or subsection detailing these models, the observed variations (including multi-rate sensor fusion and transformer vs. other backbones if applicable), and evidence from our implementation that the layers suffice without additional model-specific code. This will include mapping examples for each model to the five layers.
  revision: yes
- Referee: [Evaluation] Evaluation (results paragraph): The reported 100.0% and 91.0% success rates and the 312.2 MiB to 88.1 MiB memory reduction are presented without experimental protocol details, baseline comparisons, error bars, dataset specifications, or hardware configurations. Evaluation is restricted to two VLA models and one preliminary WAM block, which limits the ability to verify that the abstraction generalizes rather than fitting only the chosen examples.
  Authors: We acknowledge that the current evaluation section lacks sufficient details on the experimental setup. We will revise the manuscript to include comprehensive protocol descriptions, including the specific tasks, datasets, hardware platforms used for the success rate measurements, and the benchmark setup for the memory reduction. Where possible, we will add baseline comparisons to standard inference frameworks and report on multiple runs if error bars are available from our experiments. While the evaluation is indeed limited to these models, they were chosen as they cover key embodied AI paradigms, and we will explicitly state the rationale for this selection and note that broader validation is an important direction for future work. This should allow readers to better assess the generalizability of the abstraction.
  revision: yes
Circularity Check
No circularity; engineering implementation with direct measurements
Full rationale
The paper presents Embodied.cpp as a C++ runtime whose five-layer organization follows from architectural analysis of VLA/WAM models. No equations, fitted parameters, predictions, or self-citations appear in the provided text. Results are reported as measured task success rates and memory reductions rather than derived quantities. The central claim therefore does not reduce to its own inputs by construction.
Axiom & Free-Parameter Ledger
axioms (1)
- standard math: C++ code can be compiled and executed portably across heterogeneous edge devices and robot platforms via standard build tools and hardware abstraction.