Embodied.cpp: A Portable Inference Runtime of Embodied AI Models on Heterogeneous Robots

Borui Li; Chuanyou Li; Chuyu Han; Hao Wu; Ling Xu; Sheng Zhong; Shiqi Jiang; Shuai Wang; Ting Cao

arxiv: 2607.02501 · v1 · pith:6AWD7DZPnew · submitted 2026-07-02 · 💻 cs.RO · cs.CV · cs.OS

Pith reviewed 2026-07-03 10:34 UTC · model grok-4.3

classification 💻 cs.RO · cs.CV · cs.OS

keywords Embodied AI · Inference Runtime · Vision-Language-Action Models · World-Action Models · Portable C++ Deployment · Heterogeneous Robots · Closed-Loop Control

The pith

Embodied.cpp supplies a single C++ runtime that runs vision-language-action and world-action models on heterogeneous robots via one five-layer abstraction.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Embodied.cpp as a portable C++ inference runtime for embodied AI models that must run inside closed-loop robot control. Starting from an analysis of VLA and WAM architectures, the authors extract a shared execution path and organize it into five layers: input adapters, sequence builders, backbone execution, head plugins, and deployment adapters. This structure supplies modular multi-rate execution, latency-first fused inference, and extensible operator and I/O support, and is meant to replace model-specific Python stacks and robot glue code. The evaluation covers two VLA models, HY-VLA and pi0.5, which reach closed-loop task success rates of 100.0 percent and 91.0 percent respectively, and a preliminary WAM benchmark on a LingBot-VA Transformer block, where block memory falls from 312.2 MiB to 88.1 MiB. The central claim is that one backend abstraction suffices for deployment across heterogeneous devices, robots, and simulators while preserving high accuracy.

Core claim

Embodied.cpp captures a shared execution path from representative VLA models and WAMs and organizes it into five layers—input adapters, sequence builders, backbone execution, head plugins, and deployment adapters—that together deliver the runtime contract required for embodied deployment: multi-rate execution inside closed-loop control, latency-first batch-1 inference on heterogeneous hardware, and extensible embodied interfaces beyond fixed token I/O.

What carries the argument

Five-layer structure (input adapters, sequence builders, backbone execution, head plugins, deployment adapters) that abstracts the common components of embodied models for a single backend.
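To make the five-layer split concrete, here is a minimal C++ sketch of how such a pipeline could be composed. It is illustrative only: `Tensor`, `Observation`, `Pipeline`, and `make_toy_pipeline` are hypothetical names that do not come from the paper or the Embodied.cpp source, and the toy layer bodies stand in for real model code.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// Hypothetical types for illustration; not the Embodied.cpp API.
using Tensor = std::vector<float>;

struct Observation {
    std::vector<float> image;   // flattened camera frame
    std::vector<float> state;   // proprioceptive robot state
    std::string instruction;    // language command
};

struct Pipeline {
    std::function<Tensor(const Observation&)> input_adapter;   // layer 1: raw inputs -> features
    std::function<Tensor(const Tensor&)> sequence_builder;     // layer 2: features -> model sequence
    std::function<Tensor(const Tensor&)> backbone;             // layer 3: backbone execution
    std::function<Tensor(const Tensor&)> head_plugin;          // layer 4: hidden state -> action
    std::function<Tensor(const Tensor&)> deployment_adapter;   // layer 5: action -> robot command

    // Run the five layers in order for one observation.
    Tensor step(const Observation& obs) const {
        assert(input_adapter && sequence_builder && backbone &&
               head_plugin && deployment_adapter);
        return deployment_adapter(
            head_plugin(backbone(sequence_builder(input_adapter(obs)))));
    }
};

// Toy layers so the sketch can be exercised end to end.
inline Pipeline make_toy_pipeline() {
    Pipeline p;
    p.input_adapter = [](const Observation& o) {
        Tensor t = o.image;                       // concatenate image and state
        t.insert(t.end(), o.state.begin(), o.state.end());
        return t;
    };
    p.sequence_builder = [](const Tensor& t) { return t; };  // identity placeholder
    p.backbone = [](const Tensor& t) {
        Tensor out(t.size());
        for (std::size_t i = 0; i < t.size(); ++i) out[i] = 2.0f * t[i];
        return out;
    };
    p.head_plugin = [](const Tensor& h) {
        float sum = 0.0f;
        for (float v : h) sum += v;
        return Tensor{sum};                       // single scalar "action"
    };
    p.deployment_adapter = [](const Tensor& a) {
        Tensor out = a;                           // clamp to an assumed actuator range [-1, 1]
        for (float& v : out) v = std::max(-1.0f, std::min(1.0f, v));
        return out;
    };
    return p;
}
```

Under this sketch, swapping a model would mean replacing the backbone and head callables, while a new robot would only need a different deployment adapter.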

If this is right

HY-VLA and pi0.5 reach 100.0 percent and 91.0 percent task success rates, respectively, in closed-loop execution.
The LingBot-VA Transformer block in the preliminary WAM benchmark drops from 312.2 MiB to 88.1 MiB of memory.
One backend abstraction supports deployment on heterogeneous devices, robots, and simulators.
Modular multi-rate execution and extensible I/O are available without per-model Python stacks.
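Multi-rate execution can be pictured as a slow policy that refreshes a chunk of actions every few ticks while a fast control loop consumes one action per tick. The sketch below is a single-threaded illustration under that assumption; `MultiRateLoop`, its fields, and the hold-last-action fallback are hypothetical and are not taken from the paper.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical multi-rate loop; not the Embodied.cpp scheduler.
struct MultiRateLoop {
    using Policy = std::function<std::vector<float>(float)>;  // observation -> action chunk

    std::size_t policy_period;   // control ticks between two policy inferences
    Policy policy;               // slow model call
    std::vector<float> chunk;    // most recent action chunk
    std::size_t cursor;          // index of the next action in the chunk
    std::size_t policy_calls;    // how many times the slow model has run

    MultiRateLoop(std::size_t period, Policy p)
        : policy_period(period), policy(std::move(p)), cursor(0), policy_calls(0) {
        assert(policy_period > 0);
    }

    // One fast control tick: refresh the chunk on policy ticks, then emit one action.
    float tick(std::size_t t, float observation) {
        if (t % policy_period == 0) {
            chunk = policy(observation);
            cursor = 0;
            ++policy_calls;
        }
        if (chunk.empty()) return 0.0f;  // assumed safe default when no action exists
        // Hold the last action if the chunk runs out before the next inference.
        const std::size_t i = cursor < chunk.size() ? cursor : chunk.size() - 1;
        ++cursor;
        return chunk[i];
    }
};
```

With a period of 4, eight control ticks trigger only two policy calls; the remaining ticks are served from the cached chunk, which is the property that keeps the control loop responsive while the model runs at a lower rate.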

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same five-layer split could be tested on future model families to check whether the abstraction remains stable.
Hardware vendors could supply only the deployment-adapter layer while keeping the rest unchanged.
Simulator loops could reuse the identical runtime binary to shorten the path from model training to robot deployment.

Load-bearing premise

The architectural analysis of representative VLA and WAM models has identified a shared execution path that the five-layer structure can capture for general use across devices and robots.

What would settle it

A new embodied model whose required execution steps fall outside the five layers and which, on equivalent benchmarks, produces either a task success rate below the reported 100 percent and 91 percent or memory use above the reported 88.1 MiB.

Original abstract

Embodied AI models now span vision-language-action (VLA) models and world-action models (WAMs), but practical deployment remains fragmented across model-specific Python stacks, backend assumptions, and robot-side glue code, especially on heterogeneous edge devices. Existing inference runtimes are designed mainly for request-response serving and therefore do not satisfy the runtime contract of embodied deployment: multi-rate execution inside closed-loop control, latency-first batch-1 inference on heterogeneous hardware, and extensible embodied interfaces beyond fixed token I/O. We present Embodied.cpp, a portable C++ inference runtime for embodied models. Based on an architectural analysis of representative VLA models and WAMs, Embodied.cpp captures a shared execution path and organizes it into five layers: input adapters, sequence builders, backbone execution, head plugins, and deployment adapters. The runtime provides modular multi-rate execution, latency-first fused inference, and extensible operator and I/O support, enabling deployment across heterogeneous devices, robots, and simulators through one backend abstraction. We evaluate Embodied.cpp on two VLA models, HY-VLA and pi0.5, and on a preliminary WAM benchmark using a LingBot-VA Transformer block. The VLA deployments achieve successful closed-loop execution with 100.0% and 91.0% task success rates, respectively. The WAM benchmark reduces block memory from 312.2 MiB to 88.1 MiB. These results show that Embodied.cpp improves deployment efficiency while preserving high accuracy across diverse embodied model architectures.
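For scale, the two block-memory figures in the abstract correspond to a saving of about 72 percent, or roughly a 3.5x smaller footprint. The helpers below are a small worked calculation; the function names are hypothetical and only the two MiB values come from the abstract.

```cpp
#include <cassert>

// Fraction of memory saved when going from `before_mib` to `after_mib`.
inline double reduction_fraction(double before_mib, double after_mib) {
    assert(before_mib > 0.0);
    return (before_mib - after_mib) / before_mib;
}

// How many times smaller the footprint is after the change.
inline double shrink_factor(double before_mib, double after_mib) {
    assert(after_mib > 0.0);
    return before_mib / after_mib;
}
```

Calling `reduction_fraction(312.2, 88.1)` and `shrink_factor(312.2, 88.1)` reproduces the two figures quoted above.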

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Embodied.cpp gives a concrete five-layer C++ runtime for robot deployment, but the experiments are too narrow to back the generality claim.

The main takeaway is that this paper ships a new portable C++ inference runtime aimed at embodied models. It organizes execution into five layers—input adapters, sequence builders, backbone execution, head plugins, and deployment adapters—drawn from analysis of VLA and WAM architectures. The runtime targets multi-rate closed-loop control, batch-1 latency on edge hardware, and extensible I/O instead of request-response serving.

The work does address a practical gap. Existing inference runtimes are built mainly for request-response serving, so robot teams often end up with model-specific Python glue. The authors report 100.0% and 91.0% task success on HY-VLA and pi0.5, respectively, plus a drop from 312.2 MiB to 88.1 MiB on a LingBot-VA Transformer block. That shows the layers can be implemented and run on at least these cases.

The soft spots sit in the evaluation. Only two VLA models and one preliminary WAM block are shown, with no baselines, no hardware specs, no error bars, and no protocol details. The claim that the five layers capture a shared path across representative models rests on an analysis that is not enumerated or tested against variations like different tokenization or non-transformer backbones. Without that, it is unclear whether the abstraction generalizes or fits the chosen examples.

This is for engineers who deploy models on heterogeneous robots and need one backend instead of scattered Python stacks. A reader in that group can extract the layer design and the reported numbers, though they would need to re-run the tests themselves to verify them.

I would send it to peer review. The artifact is real and the problem statement is grounded, but the authors need to add comparative data and more transparency on the model survey before the generality claim can be assessed.

Referee Report

2 major / 1 minor

Summary. The paper introduces Embodied.cpp, a portable C++ inference runtime for embodied AI models spanning vision-language-action (VLA) models and world-action models (WAMs). It claims that an architectural analysis of representative models reveals a shared execution path, which is captured in a five-layer structure (input adapters, sequence builders, backbone execution, head plugins, deployment adapters). The runtime supports modular multi-rate execution, latency-first fused inference, and extensible I/O for closed-loop control on heterogeneous devices. Evaluation reports 100.0% and 91.0% task success rates on HY-VLA and pi0.5, respectively, plus a memory reduction from 312.2 MiB to 88.1 MiB on a LingBot-VA Transformer block benchmark, arguing that the system improves deployment efficiency while preserving accuracy across architectures.

Significance. If the five-layer abstraction proves generalizable, Embodied.cpp would address a practical fragmentation problem in embodied AI deployment by supplying a unified C++ backend that replaces model-specific Python stacks and robot glue code. The concrete empirical results (high success rates under closed-loop conditions and substantial memory savings) provide direct evidence of utility for latency-sensitive, multi-rate inference on edge hardware. As an engineering contribution with explicit support for extensible operators, the work aligns with needs in robotics for reproducible deployment tools.

major comments (2)

[Abstract] Abstract: The central claim that the five-layer decomposition captures a 'shared execution path' across VLA models and WAMs rests on an 'architectural analysis of representative VLA models and WAMs,' but the manuscript supplies no enumeration of the models examined, no discussion of architectural variations (e.g., differing tokenization, multi-rate sensor fusion, or non-transformer backbones), and no evidence that the layers suffice without model-specific extensions. This analysis is load-bearing for the portability claim.
[Evaluation] Evaluation (results paragraph): The reported 100.0% and 91.0% success rates and the 312.2 MiB to 88.1 MiB memory reduction are presented without experimental protocol details, baseline comparisons, error bars, dataset specifications, or hardware configurations. Evaluation is restricted to two VLA models and one preliminary WAM block, which limits the ability to verify that the abstraction generalizes rather than fitting only the chosen examples.
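One way the requested error bars could be reported for a success rate is a Wilson score interval over the number of trials. The sketch below is an illustration of that standard statistic, not the authors' protocol: the paper text reviewed here does not state trial counts, so any counts passed in are hypothetical, and `Interval` and `wilson_interval` are names invented for this example.

```cpp
#include <cassert>
#include <cmath>

struct Interval {
    double lo;
    double hi;
};

// Wilson score interval for a binomial proportion.
// z = 1.96 corresponds to an approximate 95% interval.
inline Interval wilson_interval(unsigned successes, unsigned trials, double z = 1.96) {
    assert(trials > 0 && successes <= trials);
    const double n = static_cast<double>(trials);
    const double p = static_cast<double>(successes) / n;
    const double z2 = z * z;
    const double denom = 1.0 + z2 / n;
    const double center = (p + z2 / (2.0 * n)) / denom;
    const double half =
        z * std::sqrt(p * (1.0 - p) / n + z2 / (4.0 * n * n)) / denom;
    return Interval{center - half, center + half};
}
```

As an example with assumed numbers, 91 successes in a hypothetical 100 trials would give an interval of roughly 0.84 to 0.95, which shows how much the reading of a 91.0% result depends on the unreported trial count.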

minor comments (1)

[Abstract] Abstract: Provide full citations or expanded names for HY-VLA, pi0.5, and LingBot-VA on first mention to improve accessibility for readers unfamiliar with the specific models.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thorough review and constructive comments on our manuscript. We address each of the major comments below and will incorporate revisions to strengthen the presentation of our architectural analysis and evaluation details.

Referee: [Abstract] Abstract: The central claim that the five-layer decomposition captures a 'shared execution path' across VLA models and WAMs rests on an 'architectural analysis of representative VLA models and WAMs,' but the manuscript supplies no enumeration of the models examined, no discussion of architectural variations (e.g., differing tokenization, multi-rate sensor fusion, or non-transformer backbones), and no evidence that the layers suffice without model-specific extensions. This analysis is load-bearing for the portability claim.

Authors: We agree with the referee that the manuscript would be strengthened by providing an explicit enumeration of the models analyzed and a discussion of how the five-layer structure accommodates variations. The architectural analysis was conducted on HY-VLA and pi0.5 as representative VLA models, and the LingBot-VA Transformer block for WAMs. These were selected for their diversity in tokenization, sensor inputs, and backbone architectures. In the revision, we will add a dedicated paragraph or subsection detailing these models, the observed variations (including multi-rate sensor fusion and transformer vs. other backbones if applicable), and evidence from our implementation that the layers suffice without additional model-specific code. This will include mapping examples for each model to the five layers.

Revision: yes

Referee: [Evaluation] Evaluation (results paragraph): The reported 100.0% and 91.0% success rates and the 312.2 MiB to 88.1 MiB memory reduction are presented without experimental protocol details, baseline comparisons, error bars, dataset specifications, or hardware configurations. Evaluation is restricted to two VLA models and one preliminary WAM block, which limits the ability to verify that the abstraction generalizes rather than fitting only the chosen examples.

Authors: We acknowledge that the current evaluation section lacks sufficient details on the experimental setup. We will revise the manuscript to include comprehensive protocol descriptions, including the specific tasks, datasets, hardware platforms used for the success rate measurements, and the benchmark setup for the memory reduction. Where possible, we will add baseline comparisons to standard inference frameworks and report on multiple runs if error bars are available from our experiments. While the evaluation is indeed limited to these models, they were chosen as they cover key embodied AI paradigms, and we will explicitly state the rationale for this selection and note that broader validation is an important direction for future work. This should allow readers to better assess the generalizability of the abstraction.

Revision: yes

Circularity Check

0 steps flagged

No circularity; engineering implementation with direct measurements

The paper presents Embodied.cpp as a C++ runtime whose five-layer organization follows from architectural analysis of VLA/WAM models. No equations, fitted parameters, predictions, or self-citations appear in the provided text. Results are reported as measured task success rates and memory reductions rather than derived quantities. The central claim therefore does not reduce to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper is a software systems contribution; it introduces no mathematical free parameters, no new physical entities, and relies only on standard assumptions about C++ compilation and hardware abstraction layers.

axioms (1)

standard math · C++ code can be compiled and executed portably across heterogeneous edge devices and robot platforms via standard build tools and hardware abstraction.
The runtime is presented as portable C++ without discussion of platform-specific exceptions or compilation failures.

[1] [1]

OpenVLA: An Open-Source Vision-Language-Action Model

Karl Pertsch et al. OpenVLA: An Open-Source Vision-Language-Action Model.arXiv preprint arXiv:2406.09246, 2024. URL:https://arxiv.org/abs/2406.09246

work page internal anchor Pith review Pith/arXiv arXiv 2024

[2] [2]

$\pi_0$: A Vision-Language-Action Flow Model for General Robot Control

Kevin Black et al. Pi0: A Vision-Language-Action Flow Model for General Robot Control.arXiv preprint arXiv:2410.24164, 2024. URL:https://arxiv.org/abs/2410.24164

work page internal anchor Pith review Pith/arXiv arXiv 2024

[3] [3]

$\pi_{0.5}$: a Vision-Language-Action Model with Open-World Generalization

Physical Intelligence et al. Pi0.5: A Vision-Language-Action Model with Open-World Generalization. arXiv preprint arXiv:2504.16054, 2025. URL:https://arxiv.org/abs/2504.16054

work page internal anchor Pith review Pith/arXiv arXiv 2025

[4] [4]

GR00T N1: An Open Foundation Model for Generalist Humanoid Robots

Johan Bjorck et al. GR00T N1: An Open Foundation Model for Generalist Humanoid Robots.arXiv preprint arXiv:2503.14734, 2025. URL:https://arxiv.org/abs/2503.14734

work page internal anchor Pith review Pith/arXiv arXiv 2025

[5] [5]

Causal World Modeling for Robot Control

L. Li et al. Causal World Modeling for Robot Control.arXiv preprint arXiv:2601.21998, 2026. URL: https://arxiv.org/abs/2601.21998

work page internal anchor Pith review Pith/arXiv arXiv 2026

[6] [6]

S. Wang, J. Shi, Z. Fu, X. He, F. Liu, C. Yang, Y. Zhou, Z. Fei, J. Gong, J. Fu, M. Z. Shou, X. Huang, X. Qiu, and Y.-G. Jiang. World Action Models: The Next Frontier in Embodied AI.arXiv preprint arXiv:2605.12090, 2026. Submitted May 12, 2026. URL: https://arxiv.org/abs/2605. 12090

work page internal anchor Pith review Pith/arXiv arXiv 2026

[7] [7]

Stop wandering: Efficient vision-language navigation via metacognitive reasoning.arXiv preprint arXiv:2604.02318, 2026

Xueying Li, Feng Lyu, Hao Wu, Mingliu Liu, Jia-Nan Liu, and Guozi Liu. Stop wandering: Efficient vision-language navigation via metacognitive reasoning.arXiv preprint arXiv:2604.02318, 2026. 9

work page arXiv 2026

[8] [8]

Hugging Face. LeRobot. GitHub repository, 2026. Accessed June 17, 2026. URL: https:// github.com/huggingface/lerobot

2026

[9] [9]

Open X-Embodiment

Open X-Embodiment Collaboration. Open X-Embodiment. Project website, 2026. Accessed June 17, 2026. URL:https://robotics-transformer-x.github.io/

2026

[10] [10]

ManiSkill

ManiSkill Team. ManiSkill. Project website, 2026. Accessed June 17, 2026. URL: https: //maniskill.ai/

2026

[11] [11]

LIBERO Team. LIBERO. Project website, 2026. Accessed June 17, 2026. URL: https:// libero-project.github.io/

2026

[12] [12]

Isaac Sim

NVIDIA. Isaac Sim. Product website, 2026. Accessed June 17, 2026. URL:https://developer. nvidia.com/isaac/sim

2026

[13] [13]

llama.cpp

Georgi Gerganov et al. llama.cpp. GitHub repository, 2026. 2023–2026. URL:https://github. com/ggml-org/llama.cpp

2026

[14] [14]

ONNX Runtime Documentation

Microsoft. ONNX Runtime Documentation. Official documentation, 2026. Accessed June 17, 2026. URL:https://onnxruntime.ai/docs/

2026

[15] [15]

LMSYS Org. SGLang. Official documentation and repository, 2026. Accessed June 17, 2026. URL: https://docs.sglang.io/

2026

[16] [16]

vLLM-Omni

vLLM Project. vLLM-Omni. Official documentation and repository, 2026. Accessed June 17, 2026. URL:https://docs.vllm.ai/projects/vllm-omni/en/latest/

2026

[17] [17]

RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control

Anthony Brohan et al. RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control.arXiv preprint arXiv:2307.15818, 2023. URL:https://arxiv.org/abs/2307.15818

work page internal anchor Pith review Pith/arXiv arXiv 2023

[18] [18]

Octo: An Open-Source Generalist Robot Policy

Octo Model Team et al. Octo: An Open-Source Generalist Robot Policy.arXiv preprint arXiv:2405.12213, 2024. URL:https://arxiv.org/abs/2405.12213

work page internal anchor Pith review Pith/arXiv arXiv 2024

[19] [19]

MuseVLA: An Adaptive Multimodal Sensing Vision-Language-Action Model for Robotic Manipulation

MuseVLA:AnAdaptiveMultimodalSensingVision-Language-ActionModelforRoboticManipulation. arXiv preprint arXiv:2606.17598, 2026. URL:https://arxiv.org/abs/2606.17598

work page internal anchor Pith review Pith/arXiv arXiv 2026

[20] [20]

Hi Robot: Open-Ended Instruction Following with Hierarchical Vision-Language-Action Models

L.X.Shietal. HiRobot: Open-EndedInstructionFollowingwithHierarchicalVision-Language-Action Models.arXiv preprint arXiv:2502.19417, 2025. URL:https://arxiv.org/abs/2502.19417

work page internal anchor Pith review Pith/arXiv arXiv 2025

[21] [21]

arXiv preprint arXiv:2602.04315, 2026

GeneralVLA: Generalizable Vision-Language-Action Models with Knowledge-Guided Trajectory Plan- ning. arXiv preprint arXiv:2602.04315, 2026. URL:https://arxiv.org/abs/2602.04315

work page arXiv 2026

[22] [23]

URL:https://arxiv.org/abs/2403.01823

work page internal anchor Pith review Pith/arXiv arXiv

[23] [25]

URL:https://arxiv.org/abs/2510.03342

work page internal anchor Pith review Pith/arXiv arXiv

[24] [26]

Cheng Chi, Zhenjia Xu, Siyuan Feng, Eric Cousineau, Yilun Du, Benjamin Burchfiel, Russ Tedrake, and Shuran Song

Fast-in-Slow: A Dual-System Foundation Model Unifying Fast Manipulation within Slow Reasoning. arXiv preprint arXiv:2506.01953, 2025. URL:https://arxiv.org/abs/2506.01953. 10

work page arXiv 2025

[25] [27]

DAM-VLA: Decoupled Asynchronous Multimodal Vision Language Action model

DAM-VLA: Decoupled Asynchronous Multimodal Vision Language Action model. arXiv preprint arXiv:2606.12105, 2026. URL:https://arxiv.org/abs/2606.12105

work page internal anchor Pith review Pith/arXiv arXiv 2026

[26] [28]

Tenenbaum, Dale Schuurmans, and Pieter Abbeel

Y. Du et al. Learning Universal Policies via Text-Guided Video Generation.arXiv preprint arXiv:2302.00111, 2023. URL:https://arxiv.org/abs/2302.00111

work page arXiv 2023

[27] [29]

WorldVLA: Towards Autoregressive Action World Model

WorldVLA: Towards Autoregressive Action World Model. arXiv preprint arXiv:2506.21539, 2025. URL:https://arxiv.org/abs/2506.21539

work page internal anchor Pith review Pith/arXiv arXiv 2025

[28] [30]

World Action Models are Zero-shot Policies. arXiv preprint arXiv:2602.15922, 2026. URL: https://arxiv.org/abs/2602.15922

[29] [31]

Fast-WAM: Do World Action Models Need Test-time Future Imagination? arXiv preprint arXiv:2603.16666, 2026. URL: https://arxiv.org/abs/2603.16666

[30] [32]

Cosmos Policy: Fine-Tuning Video Models for Visuomotor Control and Planning. arXiv preprint arXiv:2601.16163, 2026. URL: https://arxiv.org/abs/2601.16163

[31] [33]

Unified Video Action Model. arXiv preprint arXiv:2503.00200, 2025. URL: https://arxiv.org/abs/2503.00200

[32] [34]

LaWAM: Latent World Action Models for Efficient Dynamics-Aware Robot Policies. arXiv preprint arXiv:2606.15768, 2026. URL: https://arxiv.org/abs/2606.15768

[33] [36]

URL: https://arxiv.org/abs/2605.00078

[34] [37]

Hao Wu, Xuejin Tian, Minghao Li, Yunxin Liu, Ganesh Ananthanarayanan, Fengyuan Xu, and Sheng Zhong. Pecam: Privacy-enhanced video streaming and analytics via securely-reversible transformation. In Proceedings of the 27th Annual International Conference on Mobile Computing and Networking, pages 229–241, 2021

[35] [38]

Hao Wu, Jinghao Feng, Xuejin Tian, Edward Sun, Yunxin Liu, Bo Dong, Fengyuan Xu, and Sheng Zhong. Emo: Real-time emotion recognition from single-eye images for resource-constrained eyewear devices. In Proceedings of the 18th International Conference on Mobile Systems, Applications, and Services, pages 448–461, 2020

[36] [39]

Fei Zeng, Feng Lyu, Hao Wu, Zhanxi Li, Shucheng Li, Fengyuan Xu, and Yaoxue Zhang. H2o: Heterogeneity-aware hierarchical orchestration for memory-efficient on-device LLM inference. IEEE Transactions on Mobile Computing, 2025

[37] [40]

Borui Li, Tianen Liu, Weilong Wang, Chengqing Zhao, and Shuai Wang. Agent-as-a-service: An AI-native edge computing framework for 6G networks. IEEE Network, 39(2):44–51, 2024

[38] [41]

Borui Li, Tiange Xia, and Shuai Wang. Infscaler: Enabling efficient ML inference serving on multi-accelerator edge devices via asymmetric auto-scaling. In 2025 62nd ACM/IEEE Design Automation Conference (DAC), pages 1–7. IEEE, 2025

[39] [42]

Borui Li, Yitao Wang, Haoran Ma, Ligeng Chen, Jun Xiao, and Shuai Wang. Mobilora: Accelerating LoRA-based LLM inference on mobile devices via context-aware KV cache optimization. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 23400–23410, 2025

[40] [43]

K. D. Nguyen, H. T. Ho, C. T. Nguyen, T. Q. Duong, L. D. Le, D. M. H. Nguyen, V. A. Ngo, and A. T. Le. vla.cpp: A Unified Inference Runtime for Vision-Language-Action Models. arXiv preprint arXiv:2606.08094, 2026. Submitted June 6, 2026. URL: https://arxiv.org/abs/2606.08094

[41] [44]

I. Gim, Z. Ma, S.-S. Lee, and L. Zhong. Pie: A Programmable Serving System for Emerging LLM Applications. In Proceedings of the 31st ACM Symposium on Operating Systems Principles (SOSP),

[42] [45]

URL: https://doi.org/10.1145/3731569.3764814

[43] [46]

L. Su. Execution-State Capsules: Graph-Bound Execution-State Checkpoint and Restore for Low-Latency, Small-Batch, On-Device Physical-AI Serving. arXiv preprint arXiv:2606.20537, 2026. URL: https://arxiv.org/abs/2606.20537