pith. machine review for the scientific record.

arxiv: 2605.02525 · v1 · submitted 2026-05-04 · 💻 cs.RO · cs.AI

Recognition: 2 theorem links

A Semantic Autonomy Framework for VLM-Integrated Indoor Mobile Robots: Hybrid Deterministic Reasoning and Cross-Robot Adaptive Memory

Andrei-Alexandru Staicu, Bogdan Felician Abaza, Cristian Vasile Doicin


Pith reviewed 2026-05-08 17:44 UTC · model grok-4.3

classification 💻 cs.RO cs.AI
keywords semantic autonomy · indoor mobile robots · hybrid reasoning · vision-language models · cross-robot memory · adaptive memory · ROS 2 navigation · edge computing

The pith

A hybrid resolver and shared memory let indoor robots interpret natural language commands without slow vision-model delays.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that indoor mobile robots can achieve semantic autonomy by combining a fast deterministic resolver for most instructions with vision-language models reserved for ambiguous cases. This approach addresses the practical limits of pure VLM use, which incurs seconds of latency and loses context between sessions. A five-category semantic memory system with explicit scope rules enables knowledge learned on one robot to transfer directly to another as compiled deterministic rules. Validation on physical differential-drive robots shows full accuracy in transfer and resolution across dozens of decisions while running on low-power edge hardware with no training or GPU required.
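
Mechanically, this is a dispatch pattern: an ordered cascade of cheap deterministic checks runs first, and only a miss on every step pays the multi-second VLM cost; the magnitudes quoted in the abstract (2-9 s per VLM decision against a sub-0.1 ms resolve) put the achievable ratio in the tens of thousands. A minimal Python sketch of the pattern, with hypothetical step names rather than the paper's actual seven steps:

    # Sketch of deterministic-first dispatch; step names are illustrative.
    from typing import Callable, Optional

    Step = Callable[[str], Optional[str]]  # returns a graph node id, or None

    def make_resolver(steps: list[Step],
                      vlm_fallback: Callable[[str], str]) -> Callable[[str], str]:
        def resolve(instruction: str) -> str:
            for step in steps:                # ordered cascade: first match wins
                target = step(instruction)
                if target is not None:
                    return target             # deterministic path, sub-millisecond
            return vlm_fallback(instruction)  # escalation: camera + prompt + VLM
        return resolve

    def exact_node_name(instr: str) -> Optional[str]:
        """Illustrative step: direct lookup of a known node name."""
        known = {"go to lab_cb204": "node_5"}  # mapping assumed for illustration
        return known.get(instr.strip().lower())

The ordering is the design point: the common case returns on a dictionary lookup and never touches the camera or the model.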

Core claim

The Semantic Autonomy Stack is a six-layer reference architecture that integrates hybrid deterministic-VLM reasoning with cross-robot adaptive memory. A seven-step parametric resolver processes 88 percent of natural-language navigation instructions in under 0.1 milliseconds without invoking a camera, language model, or GPU. Only genuinely ambiguous commands escalate to VLM reasoning. A five-category semantic memory taxonomy (global environment knowledge, per-operator preferences, per-robot capabilities) records VLM-derived insights, promotes them to deterministic rules, and shares the compiled digest across robots, producing a measured 103,000-fold latency reduction and 100 percent semantic transfer and resolution accuracy.
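
The promotion step is what turns slow VLM insights into fast rules. A minimal sketch of scope-tagged memory entries compiled into a digest, where the field names and the confirmation threshold are labeled assumptions rather than the paper's values:

    # Sketch of scope-tagged memory and rule promotion; values are assumed.
    from dataclasses import dataclass

    @dataclass
    class MemoryEntry:
        scope: str            # "global", "operator:<id>", or "robot:<id>"
        pattern: str          # instruction pattern a VLM call resolved
        target: str           # navigation-graph node it resolved to
        confirmations: int = 0

    PROMOTION_THRESHOLD = 3   # assumed stand-in for the paper's M3 threshold

    def compile_digest(entries: list[MemoryEntry]) -> dict[str, str]:
        """Promote repeatedly confirmed VLM insights to deterministic rules."""
        return {e.pattern: e.target for e in entries
                if e.confirmations >= PROMOTION_THRESHOLD}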

What carries the argument

The seven-step parametric resolver, together with the five-category semantic memory framework that maintains an explicit scope taxonomy for global, operator, and robot-specific knowledge.

If this is right

  • Most natural-language navigation commands execute at deterministic speed without waiting for vision-language model inference.
  • Preferences or environment facts discovered on one robot become immediately available as fast rules on every other robot (see the digest sketch after this list).
  • Concurrent operation of multiple robots becomes feasible on identical low-power hardware without per-robot retraining.
  • The system requires zero training data and no onboard GPU, lowering the hardware barrier for semantic indoor navigation.
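
The transfer in the second bullet reduces, mechanically, to moving the compiled rule table between machines. A sketch of one plausible exchange, assuming JSON over a shared path; the paper specifies a shared compiled digest but not this serialization:

    # Sketch of cross-robot digest exchange; transport and format are assumptions.
    import json
    import pathlib

    def publish_digest(rules: dict[str, str], path: str = "digest.json") -> None:
        pathlib.Path(path).write_text(json.dumps(rules, indent=2))

    def load_digest(path: str = "digest.json") -> dict[str, str]:
        return json.loads(pathlib.Path(path).read_text())

    # Robot A: publish_digest(compile_digest(entries))
    # Robot B: rules = load_digest(); the resolver consults `rules` before any
    # VLM escalation, so transferred knowledge runs at deterministic speed.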

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same resolver-plus-memory pattern could be applied to other robot morphologies or non-navigation tasks once the parametric rules are extended.
  • Session-to-session memory retention might allow long-term operator-specific behaviors to accumulate without repeated VLM calls.
  • The latency reduction suggests the approach could support real-time semantic coordination among larger robot teams on shared networks.

Load-bearing premise

The seven-step parametric resolver correctly and safely resolves 88 percent of instructions without VLM escalation or errors, and the five-category memory taxonomy generalizes beyond the two tested robots and specific scenarios.

What would settle it

Run the same resolver on a fresh set of 50 instructions from a different environment or a third robot, and check whether accuracy drops below 95 percent or any resolved action violates a safety constraint.
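
One detail makes such a replication easy to score: the abstract's 95% CI [0.894, 1.000] is exactly the Clopper-Pearson interval for 33 successes in 33 trials, so a held-out run can report the same statistic. A sketch of the check, assuming a resolver function and a gold-labeled instruction set exist:

    # Exact binomial (Clopper-Pearson) confidence interval for k/n successes.
    from scipy.stats import beta

    def clopper_pearson(k: int, n: int, alpha: float = 0.05) -> tuple[float, float]:
        lo = beta.ppf(alpha / 2, k, n - k + 1) if k > 0 else 0.0
        hi = beta.ppf(1 - alpha / 2, k + 1, n - k) if k < n else 1.0
        return lo, hi

    def evaluate(resolver, labeled):  # labeled: list of (instruction, gold node)
        k = sum(resolver(instr) == gold for instr, gold in labeled)
        return k, len(labeled), clopper_pearson(k, len(labeled))

    print(clopper_pearson(33, 33))    # (0.894..., 1.0), matching the abstract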

Figures

Figures reproduced from arXiv: 2605.02525 by Andrei-Alexandru Staicu, Bogdan Felician Abaza, Cristian Vasile Doicin.

Figure 1
Figure 1: The Semantic Autonomy Stack (SAS): six-layer reference framework and validated Xplorer multi-robot deployment. It summarizes the platform-agnostic SAS layers and their I/O interfaces. L1 - Navigation execution: localization, route planning, path following, and local obstacle avoidance. L1 receives a target pose or a route specification from L3 and executes it using the platform's motion capabilities. Stand…
Figure 2
Figure 2: Hybrid reasoning flowchart. Natural language instructions are first processed by the L3a seven-step deterministic cascade; the first matching step returns immediately. If no step produces an unambiguous resolution, the instruction escalates to L3b, which acquires a camera image, constructs a structured prompt, and invokes the VLM. Both paths converge at the executive contract ⟨A, O, V, L⟩, which validates …
Figure 4
Figure 4: Xplorer-B (left) and Xplorer-C (right) robotic platforms. Both platforms run the same high-level ROS 2 Jazzy/Nav2 semantic-navigation stack, including the same navigation graph, static POI annotations, route-server configuration, and L3/L5 reasoning interface. Their low-level hardware differs in compute distribution, drivetrain electronics, encoder type, and IMU availability, as summarized in …
Figure 5
Figure 5: Navigation graph topology for the FIIR corridor environment. The graph comprises 24 nodes and 60 directed edges (30 bidirectional pairs). POI nodes (bold border) are color-coded by semantic category: blue - infrastructure, red - safety, green - environmental. Junction-only nodes are shown in grey.
Figure 6
Figure 6: Live operational view used during the experimental sessions, showing the Nav2 costmap, route graph overlay, 2D LiDAR scan, and real-time YOLO26n detections fused into the semantic-navigation context. The monitoring view was used for observation only; mission outcomes and metrics were derived from structured audit logs.
Figure 7
Figure 7: Validated Xplorer multi-robot deployment. L0–L2 run on each robot's Raspberry Pi 5 (CPU-only); L3 and L5 run on a shared GPU workstation; L4 provides the operator/fleet interface. Blue arrows denote HTTP context requests from each robot's perception bridge; green arrows denote ROS 2 DDS navigation actions and pose feedback. Dashed purple arrows denote the offline memory cycle. Both robots share the same co…
Figure 8
Figure 8: Learning cycle: L3b VLM inference times for S3new on Xplorer-C; 7 decisions with correct/incorrect nodes and the M3 promotion threshold. Note on implementation refinement: during Session A, an initial set of S3new runs produced inconsistent VLM resolutions due to YOLO false positives (transient chair detections) flooding the VLM context. This was identified as …
Figure 9
Figure 9: Transfer verification: L3a resolve times per category on Xplorer-B.
Figure 10
Figure 10: Latency comparison: L3b VLM (Xplorer-C) vs L3a M3 (Xplorer-B). (a) log-scale boxplot, (b) structured stats table.
Figure 11
Figure 11: Navigation outcome distribution on Xplorer-B (Session B). Semantic accuracy (100%) vs navigation completion (88%). From Section 7.3 (Deterministic consistency): the deterministic control scenarios verified that the L3a resolver produces identical results on both robots when operating on the same navigation graph and static POI set; S4 ("go to lab_cb204") resolved to node 5 via L3a step 2 (node name match) on both Xpl…
Original abstract

Autonomous indoor mobile robots can navigate reliably to metric coordinates using established frameworks such as ROS 2 Navigation 2, yet they lack the ability to interpret natural language instructions that express intent rather than positions. Vision-Language Models offer the semantic reasoning required to bridge this gap, but their inference latency (2-9 seconds per decision on consumer hardware) and session-by-session amnesia limit practical deployment. This paper presents the Semantic Autonomy Stack, a six-layer reference framework for semantically autonomous indoor navigation, and validates a complete instance featuring hybrid deterministic-VLM reasoning and cross-robot adaptive memory on physical robots with off-the-shelf edge hardware. A seven-step parametric resolver handles 88% of instructions in under 0.1 milliseconds without invoking a language model, camera, or GPU; only genuinely ambiguous instructions escalate to VLM reasoning. A five-category semantic memory framework with explicit scope taxonomy (global environment knowledge, per-operator preferences, per-robot capabilities) enables cross-session learning and cross-robot knowledge transfer: preferences learned through VLM interactions on one robot are promoted to deterministic resolution and transferred to a second robot via a shared compiled digest, achieving a measured latency reduction of 103,000-fold. Experimental validation on two custom-built differential-drive robots across 82 scenario-level decisions and three sessions demonstrates 100% semantic transfer accuracy (33/33, 95% CI [0.894, 1.000]), 100% semantic resolution accuracy, and concurrent multi-robot operation feasibility - all on Raspberry Pi 5 platforms with no onboard GPU, requiring zero training data.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript presents the Semantic Autonomy Stack, a six-layer reference framework for VLM-integrated indoor mobile robots. It introduces hybrid deterministic-VLM reasoning via a seven-step parametric resolver that handles 88% of instructions deterministically in under 0.1 ms, and a five-category semantic memory framework with scope taxonomy for cross-robot knowledge transfer. Experimental validation on two custom differential-drive robots using Raspberry Pi 5 hardware demonstrates 100% accuracy on 82 scenario-level decisions, 100% semantic transfer accuracy on 33 cases, and a 103,000-fold latency reduction without training data or onboard GPUs.

Significance. If the central claims hold, the work would be significant for practical deployment of semantic navigation on edge devices, as it provides a concrete mechanism to minimize VLM usage while enabling cross-robot learning. The direct hardware measurements, zero-training-data requirement, and physical robot experiments are notable strengths that could influence future designs in resource-constrained robotics.

major comments (3)
  1. [Abstract / Experimental validation] The 100% semantic transfer accuracy (33/33 cases, 95% CI [0.894, 1.000]) and 88% resolver coverage are reported from 82 scenario-level decisions across three sessions on two robots, but the manuscript provides no details on instruction diversity, sampling method for the 82 cases, or observed failure modes. This is load-bearing for the claim that the five-category semantic memory taxonomy (global environment, per-operator preferences, per-robot capabilities) generalizes.
  2. [Abstract] The seven-step parametric resolver is presented as correctly and safely resolving 88% of instructions without VLM escalation, yet the decision thresholds are free parameters with no description of how they were set, tuned, or validated against edge cases. This directly affects the safety and reproducibility of the hybrid reasoning claim.
  3. [Abstract] No baselines (e.g., pure VLM performance on the same 82 decisions, alternative deterministic parsers, or other hybrid systems) or ablation studies are provided, so the 103,000-fold latency reduction and 100% resolution accuracy cannot be contextualized beyond the internal VLM comparison.
minor comments (2)
  1. [Abstract] The abstract states 'concurrent multi-robot operation feasibility' but gives no metrics (e.g., communication overhead, conflict resolution) or section reference for how this was measured.
  2. [Abstract] Clarify whether the 33 cross-robot transfer cases are a subset of the 82 decisions and how the shared compiled digest was constructed and validated.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback, which highlights important areas for improving clarity, reproducibility, and contextualization. We address each major comment below and will incorporate revisions to strengthen the manuscript.

read point-by-point responses
  1. Referee: [Abstract / Experimental validation] The 100% semantic transfer accuracy (33/33 cases, 95% CI [0.894, 1.000]) and 88% resolver coverage are reported from 82 scenario-level decisions across three sessions on two robots, but the manuscript provides no details on instruction diversity, sampling method for the 82 cases, or observed failure modes. This is load-bearing for the claim that the five-category semantic memory taxonomy (global environment, per-operator preferences, per-robot capabilities) generalizes.

    Authors: We agree that additional details on the 82 scenarios are required to substantiate the generalization claims for the semantic memory taxonomy. In the revised manuscript, we will expand the experimental validation section with a breakdown of instruction diversity (categorized by the five memory scopes), the sampling method (covering representative operator interactions across sessions), and a discussion of observed failure modes (none occurred in the tested set, with examples of edge cases that correctly trigger VLM escalation). This will directly address the load-bearing nature of these results. revision: yes

  2. Referee: [Abstract] The seven-step parametric resolver is presented as correctly and safely resolving 88% of instructions without VLM escalation, yet the decision thresholds are free parameters with no description of how they were set, tuned, or validated against edge cases. This directly affects the safety and reproducibility of the hybrid reasoning claim.

    Authors: The referee is correct that the tuning and validation process for the resolver thresholds requires explicit description to support reproducibility and safety claims. The full manuscript defines the resolver steps and thresholds but does not elaborate on their selection. We will revise the methods section to include how thresholds were set based on safety margins (to prevent incorrect deterministic resolutions), the tuning procedure using preliminary tests, and validation against a set of edge cases. This addition will clarify the hybrid reasoning approach. revision: yes

  3. Referee: [Abstract] No baselines (e.g., pure VLM performance on the same 82 decisions, alternative deterministic parsers, or other hybrid systems) or ablation studies are provided, so the 103,000-fold latency reduction and 100% resolution accuracy cannot be contextualized beyond the internal VLM comparison.

    Authors: We acknowledge that baselines and ablations would better contextualize the reported gains. The latency reduction is derived from direct measurements (deterministic path <0.1 ms vs. VLM 2-9 s), but we did not evaluate pure VLM accuracy across all 82 decisions. In revision, we will add a baseline table comparing latency and accuracy on the escalated subset (using existing VLM data), a deeper discussion of related deterministic parsers from the literature, and an ablation analysis of the resolver steps showing incremental coverage. Full pure-VLM accuracy on the entire set would require new experiments; we will note this limitation while providing the strongest contextualization possible with available measurements. revision: partial

Circularity Check

0 steps flagged

No significant circularity; claims rest on direct empirical measurements

full rationale

The paper describes a six-layer Semantic Autonomy Stack with a seven-step parametric resolver and five-category semantic memory taxonomy, then reports performance via direct hardware experiments on two physical differential-drive robots (82 scenario decisions, 33/33 transfer cases). No mathematical derivations, first-principles predictions, or parameter fittings are presented that reduce to their own inputs by construction. Latency reduction (103,000-fold) and accuracy figures (100% semantic transfer and resolution) are stated as observed quantities from Raspberry Pi 5 deployments, not as outputs of any self-referential equation or fitted model renamed as prediction. No self-citations, uniqueness theorems, or ansatzes are invoked in a load-bearing way within the provided text. The central results therefore remain independent of the framework description itself.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 2 invented entities

The framework introduces new architectural layers and memory categories whose correctness rests on domain assumptions about VLM reliability and resolver coverage rather than prior established results.

free parameters (1)
  • Decision thresholds in seven-step parametric resolver
    Parameters implicitly set to achieve the reported 88% deterministic coverage rate.
axioms (1)
  • domain assumption Vision-language models provide accurate semantic interpretations for genuinely ambiguous instructions when invoked
    Invoked for the 12% of cases escalated from the parametric resolver.
invented entities (2)
  • Semantic Autonomy Stack (six-layer reference framework) no independent evidence
    purpose: Organize hybrid deterministic-VLM reasoning and memory for semantic navigation
    Newly proposed architecture not derived from prior literature.
  • Five-category semantic memory framework with scope taxonomy no independent evidence
    purpose: Enable cross-session learning and cross-robot knowledge transfer
    New taxonomy distinguishing global, per-operator, and per-robot knowledge.

pith-pipeline@v0.9.0 · 5598 in / 1587 out tokens · 43507 ms · 2026-05-08T17:44:18.365257+00:00 · methodology


Reference graph

Works this paper leans on

34 extracted references · 31 canonical work pages · 1 internal anchor

  1. [1]

    From the Desks of ROS Maintainers: A Survey of Modern & Capable Mobile Robotics Algorithms in the Robot Operating System 2,

    S. Macenski, T. Foote, B. Gerkey, C. Lalancette, and W. Woodall, "From the Desks of ROS Maintainers: A Survey of Modern & Capable Mobile Robotics Algorithms in the Robot Operating System 2," Robotics and Autonomous Systems, vol. 168, p. 104525, 2023. doi: 10.1016/j.robot.2023.104525

  2. [2]

    Model Predictive Path Integral Controller for Nav2,

S. Macenski et al., "Model Predictive Path Integral Controller for Nav2," presented at ROSCon 2023. Available: https://navigation.ros.org/configuration/packages/configuring-mppic.html

  3. [3]

    A survey on potentials, pathways and challenges of large language models in new-generation intelligent manufacturing,

    C. Zhang, Q. Xu, Y. Yu, G. Zhou, K. Zeng, F. Chang, and K. Ding, "A survey on potentials, pathways and challenges of large language models in new-generation intelligent manufacturing," Robotics and Computer-Integrated Manufacturing, vol. 92, Article 102883, 2025. doi: 10.1016/j.rcim.2024.102883

  4. [4]

    A Survey on Robot Semantic Navigation Systems for Indoor Environments,

    R. Alqobali, M. Alshmrani, R. Alnasser, A. Rashidi, and T. Alhmiedat, "A Survey on Robot Semantic Navigation Systems for Indoor Environments," Applied Sciences, vol. 14, no. 1, p. 89, 2024. doi: 10.3390/app14010089

  5. [5]

    Kahneman, Thinking, Fast and Slow

    D. Kahneman, Thinking, Fast and Slow. New York: Farrar, Straus and Giroux, 2011

  6. [6]

    IROS: A Dual-Process Architecture for Real-Time VLM-Based Indoor Navigation,

    J. Lee, H. Shin, and J. Ko, "IROS: A Dual-Process Architecture for Real-Time VLM-Based Indoor Navigation," arXiv preprint arXiv:2601.21506, 2026

  7. [7]

    Lightweight Semantic-Aware Route Planning with Monocular Camera–2D LiDAR Fusion for Indoor Mobile Robots,

    B. F. Abaza, A.-A. Staicu, and C. V. Doicin, "Lightweight Semantic-Aware Route Planning with Monocular Camera–2D LiDAR Fusion for Indoor Mobile Robots," Sensors, vol. 26, no. 7, p. 2232, 2026. doi: 10.3390/s26072232

  8. [8]

    Large VLM-based Vision-Language-Action Models for Robotic Manipulation: A Survey,

    R. Shao, W. Li, L. Zhang, R. Zhang, Z. Liu, R. Chen, and L. Nie, "Large VLM-based Vision-Language-Action Models for Robotic Manipulation: A Survey," arXiv preprint arXiv:2508.13073, 2025

  9. [9]

    Pure Vision Language Action (VLA) Models: A Comprehensive Survey,

    D. Zhang, J. Sun, C. Hu, X. Wu, Z. Yuan, R. Zhou, F. Shen, and Q. Zhou, "Pure Vision Language Action (VLA) Models: A Comprehensive Survey," arXiv preprint arXiv:2509.19012, 2025

  10. [10]

    Actions as Language: Fine-Tuning VLMs into VLAs Without Catastrophic Forgetting,

    A. J. Hancock, X. Wu, L. Zha, O. Russakovsky, and A. Majumdar, "Actions as Language: Fine-Tuning VLMs into VLAs Without Catastrophic Forgetting," in Proc. ICLR, 2025. arXiv:2509.22195

  11. [11]

    MemoryVLA: Perceptual-Cognitive Memory in Vision-Language-Action Models for Robotic Manipulation,

    H. Shi, B. Xie, Y. Liu, L. Sun, F. Liu, T. Wang, E. Zhou, H. Fan, X. Zhang, and G. Huang, "MemoryVLA: Perceptual-Cognitive Memory in Vision-Language-Action Models for Robotic Manipulation," arXiv preprint arXiv:2508.19236, 2025

  12. [12]

    Human-like Navigation in a World Built for Humans,

    B. Chandaka, G. X. Wang, H. Chen, H. Che, A. J. Zhai, and S. Wang, "Human-like Navigation in a World Built for Humans," in Proc. CoRL, Seoul, Korea, 2025. arXiv:2509.21189

  13. [13]

    VLM-Social-Nav: Socially Aware Robot Navigation through Scoring using Vision-Language Models,

    D. Song, J. Liang, A. Payandeh, A. H. Raj, X. Xiao, and D. Manocha, "VLM-Social-Nav: Socially Aware Robot Navigation through Scoring using Vision-Language Models," arXiv preprint arXiv:2404.00210, 2024

  14. [14]

    VLM-Based Human-Guided Mobile Robot Navigation in an Unstructured Environment for Human-Centric Smart Manufacturing,

    T. Wang et al., "VLM-Based Human-Guided Mobile Robot Navigation in an Unstructured Environment for Human-Centric Smart Manufacturing," Engineering, 2025. doi: 10.1016/j.eng.2025.04.028

  15. [15]

    Co-NavGPT: Multi-Robot Cooperative Visual Semantic Navigation Using Vision Language Models,

    B. Yu, Q. Yuan, K. Li, H. Kasaei, and M. Cao, "Co-NavGPT: Multi-Robot Cooperative Visual Semantic Navigation Using Vision Language Models," arXiv preprint arXiv:2310.07937, 2023, revised 2025

  16. [16]

    NaviTrace: Evaluating Embodied Navigation of Vision-Language Models,

    T. Windecker, M. Patel, M. Reuss, R. Schwarzkopf, C. Cadena, R. Lioutikov, M. Hutter, and J. Frey, "NaviTrace: Evaluating Embodied Navigation of Vision-Language Models," arXiv preprint arXiv:2510.26909, 2025

  17. [17]

    RANa: Retrieval-Augmented Navigation,

    G. Monaci, R. S. Rezende, R. Deffayet, G. Csurka, G. Bono, H. Déjean, S. Clinchant, and C. Wolf, "RANa: Retrieval-Augmented Navigation," arXiv preprint arXiv:2504.03524, 2025

  18. [18]

    CogDDN: A Cognitive Demand-Driven Navigation with Decision Optimization and Dual-Process Thinking,

    Y. Huang, L. Liu, S. Lei, Y. Ma, H. Su, J. Mei, P. Zhao, Y. Gu, Y. Liu, and J. Lv, "CogDDN: A Cognitive Demand-Driven Navigation with Decision Optimization and Dual-Process Thinking," in Proc. 33rd ACM Int. Conf. Multimedia (MM '25), Dublin, Ireland, 2025, pp. 1–10. doi: 10.1145/3746027.3755832

  19. [19]

    Hydra-Nav: Object Navigation via Adaptive Dual-Process Reasoning,

    Z. Wang, H. Fang, S. Wang, Y. Luo, H. Dong, W. Li, and Y. Gan, "Hydra-Nav: Object Navigation via Adaptive Dual-Process Reasoning," arXiv preprint arXiv:2602.09972, 2026

  20. [20]

    A Dual Process VLA: Efficient Robotic Manipulation Leveraging VLM,

    B. Han, J. Kim, and J. Jang, "A Dual Process VLA: Efficient Robotic Manipulation Leveraging VLM," arXiv preprint arXiv:2410.15549, 2024

  21. [21]

    MapNav: A Novel Memory Representation via Annotated Semantic Maps for VLM-based Vision-and-Language Navigation,

    L. Zhang, X. Hao, Q. Xu, Q. Zhang, X. Zhang, P. Wang, J. Zhang, Z. Wang, S. Zhang, and R. Xu, "MapNav: A Novel Memory Representation via Annotated Semantic Maps for VLM-based Vision-and-Language Navigation," in Proc. ACL (Volume 1: Long Papers), Vienna, Austria, 2025, pp. 13032–13056. doi: 10.18653/v1/2025.acl-long.638

  22. [22]

    CausalNav: A Long-term Embodied Navigation System for Autonomous Mobile Robots in Dynamic Outdoor Scenarios,

    H. Duan, S. Luo, Z. Deng, Y. Chen, Y. Chiang, Y. Liu, F. Liu, and X. Wang, "CausalNav: A Long-term Embodied Navigation System for Autonomous Mobile Robots in Dynamic Outdoor Scenarios," IEEE Robotics and Automation Letters, 2026. arXiv:2601.01872

  23. [23]

    Meta-Memory: Retrieving and Integrating Semantic-Spatial Memories for Robot Spatial Reasoning,

    Y. Mao, H. Ye, W. Dong, C. Zhang, and H. Zhang, "Meta-Memory: Retrieving and Integrating Semantic-Spatial Memories for Robot Spatial Reasoning," arXiv preprint arXiv:2509.20754, 2025

  24. [24]

    EchoVLA: Synergistic Declarative Memory for VLA-Driven Mobile Manipulation,

    M. Lin, X. Liang, B. Lin, L. Jingzhi, Z. Jiao, K. Li, Y. Sun, W. Liufu, Y. Ma, Y. Liu, S. Zhao, Y. Zhuang, and X. Liang, "EchoVLA: Synergistic Declarative Memory for VLA-Driven Mobile Manipulation," arXiv preprint arXiv:2511.18112, 2025

  25. [25]

    Memory in the Age of AI Agents

    Y. Hu, S. Liu, Y. Yue, G. Zhang et al., "Memory in the Age of AI Agents," arXiv preprint arXiv:2512.13564, 2025

  26. [26]

    ROSClaw: OpenClaw ROS 2 Framework for Agentic Robot Control,

    J. Cardenas et al., "ROSClaw: OpenClaw ROS 2 Framework for Agentic Robot Control," arXiv preprint arXiv:2603.26997, 2026

  27. [27]

    A Robot Operating System Framework for Using Large Language Models in Embodied AI,

    A. Mower, S. Wan et al., "A Robot Operating System Framework for Using Large Language Models in Embodied AI," Nature Machine Intelligence, 2026. doi: 10.1038/s42256-026-01186-z

  28. [28]

    Enabling Novel Mission Operations and Interactions with ROSA: The Robot Operating System Agent,

    R. Royce, M. Kaufmann, J. Becktor, S. Moon, K. Carpenter, K. Pak, A. Towler, R. Thakker, and S. Khattak, "Enabling Novel Mission Operations and Interactions with ROSA: The Robot Operating System Agent," arXiv preprint arXiv:2410.06472, 2024

  29. [29]

    RAI: A Flexible Agent Framework for Embodied AI,

    B. Rachwał et al., "RAI: A Flexible Agent Framework for Embodied AI," arXiv preprint arXiv:2505.07532, 2025

  30. [30]

    Agentic LLM-based Robotic Systems for Real-World Applications: A Review on Their Agenticness and Ethics,

    E. K. Raptis, A. Ch. Kapoutsis, and E. B. Kosmatopoulos, "Agentic LLM-based Robotic Systems for Real-World Applications: A Review on Their Agenticness and Ethics," Frontiers in Robotics and AI, vol. 12, p. 1605405, 2025. doi: 10.3389/frobt.2025.1605405

  31. [31]

    Empowering natural human–robot collaboration through multimodal language models and spatial intelligence: Pathways and perspectives,

    D. Wu, P. Zheng, Q. Zhao, S. Zhang, J. Qi, J. Hu, G.-N. Zhu, and L. Wang, "Empowering natural human–robot collaboration through multimodal language models and spatial intelligence: Pathways and perspectives," Robotics and Computer-Integrated Manufacturing, vol. 97, Article 103064, 2026. doi: 10.1016/j.rcim.2025.103064

  32. [32]

    Review and perspectives on multimodal perception, mutual cognition, and embodied execution for human–robot collaboration in Industry 5.0,

    K. Ding, Q. Mao, Y. Zhang, Y. Zhang, P. Zheng, and L. Wang, "Review and perspectives on multimodal perception, mutual cognition, and embodied execution for human–robot collaboration in Industry 5.0," Robotics and Computer-Integrated Manufacturing, vol. 101, p. 103280, 2026. doi: 10.1016/j.rcim.2026.103280

  33. [33]

    VDA 5050: Interface for the Communication between Automated Guided Vehicles (AGV) and a Master Control,

    VDA 5050: Interface for the Communication between Automated Guided Vehicles (AGV) and a Master Control. Version 2.0.0. VDA Technical Committee. https://github.com/VDA5050/VDA5050

  34. [34]

    Perception-decision-execution coordination mechanism driven dynamic autonomous collaboration method for human-like collaborative robot based on multimodal large language model,

    J. Chen, S. Huang, X. Wang, P. Wang, J. Zhu, Z. Xu, G. Wang, Y. Yan, and L. Wang, "Perception-decision-execution coordination mechanism driven dynamic autonomous collaboration method for human-like collaborative robot based on multimodal large language model," Robotics and Computer-Integrated Manufacturing, vol. 98, Article 103167, 2026. doi: 10.1016...