pith. machine review for the scientific record.

arxiv: 2604.21924 · v1 · submitted 2026-04-23 · 💻 cs.RO

Recognition: unknown

Long-Horizon Manipulation via Trace-Conditioned VLA Planning

An-Chieh Cheng, Geng Chen, Hongxu Yin, Isabella Liu, Ri-Zhao Qiu, Rui Yan, Sha Yi, Sifei Liu, Xiaolong Wang, Xueyan Zou

Authors on Pith: no claims yet

Pith reviewed 2026-05-09 21:06 UTC · model grok-4.3

classification 💻 cs.RO
keywords long-horizon manipulation · vision-language-action policies · visual traces · task-management VLM · receding-horizon planning · robot learning · embodied AI · implicit closed-loop control

The pith

Predicting the remaining plan at each step creates an implicit closed loop that lets vision-language-action policies complete long-horizon manipulation tasks by automatically replanning after errors.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a modular way to scale short-horizon robot policies to extended action sequences by adding a separate task-management model that, from the current camera view, outputs both a subtask list split into completed and remaining items and a compact visual trace of 2D keypoints indicating the next path. The executor policy is conditioned only on that rendered trace, converting the overall problem into repeated local control steps. If the approach holds, it removes the need for manually written recovery behaviors or stored visual histories, because any persistent failure simply remains visible in the next predicted plan and is addressed automatically. Experiments on simulated and real-robot tasks show gains in success rate, robustness, and generalization to new situations.

Core claim

LoHo-Manip decouples a dedicated task-management vision-language model from the executor vision-language-action policy. The manager runs in receding-horizon fashion: given the latest observation it predicts a progress-aware subtask sequence with an explicit done-plus-remaining split together with a visual trace consisting of a compact 2D keypoint trajectory. The executor is adapted to follow the rendered trace, turning long-horizon decisions into repeated short-horizon control. Because the manager re-predicts the entire remaining plan at every step, failed actions stay in the output and the trace updates, producing an implicit closed loop for continuation and replanning without hand-crafted recovery logic or brittle visual-history buffers.
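
The receding-horizon loop described above can be sketched in a few lines. This is a toy rendition of our own design, not the paper's implementation: `toy_manager` stands in for the task-management VLM and `toy_executor` for the trace-conditioned VLA, with one scripted failure to make the implicit closed loop visible.

```python
from dataclasses import dataclass

@dataclass
class Plan:
    done: list       # subtasks the manager reads off as complete
    remaining: list  # full remaining subtask sequence, re-predicted each step
    trace: list      # 2D keypoints guiding the next subtask

SUBTASKS = ["pick cup", "place cup", "close drawer"]

def toy_manager(obs):
    """Re-predict the whole remaining plan from the current observation only
    (no history buffer). Here progress is read straight off the toy state."""
    k = obs["completed"]
    return Plan(done=SUBTASKS[:k],
                remaining=SUBTASKS[k:],
                trace=[(0.1 * k, 0.1 * k), (0.1 * k + 0.05, 0.1 * k)])

def toy_executor(obs, trace, failures):
    """One short-horizon rollout conditioned on the trace. A scripted failure
    shows the implicit closed loop: no recovery code runs, the failed subtask
    simply reappears in the next plan's `remaining` list."""
    step = obs["completed"]
    if step in failures:
        failures.discard(step)   # fail once, then succeed on retry
        return obs               # no progress made
    return {"completed": step + 1}

def run(horizon=10):
    obs, log, failures = {"completed": 0}, [], {1}
    for _ in range(horizon):
        plan = toy_manager(obs)
        log.append(list(plan.remaining))
        if not plan.remaining:   # manager declares the task done
            break
        obs = toy_executor(obs, plan.trace, failures)
    return log
```

Running `run()` produces a log of remaining plans in which "place cup" persists across two consecutive steps after its failed attempt, then disappears once the retry succeeds — recovery without any recovery code.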

What carries the argument

The receding-horizon task-management VLM that repeatedly generates a progress-aware subtask sequence and a 2D keypoint visual trace to condition the executor VLA policy.
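
As a concrete (and assumed) picture of how a 2D keypoint trace could condition an image-based policy, one option is to render the trace directly into the observation the policy already consumes. The marker shape and color here are our invention, not the paper's recipe:

```python
import numpy as np

def render_trace(image, keypoints, color=(255, 0, 0), radius=2):
    """Burn a 2D keypoint trace into an RGB image so a VLA policy can be
    conditioned on it through its ordinary vision input. `keypoints` are
    (x, y) pixel coordinates ordered along the intended motion."""
    out = image.copy()
    h, w = out.shape[:2]
    for x, y in keypoints:
        x0, x1 = max(0, x - radius), min(w, x + radius + 1)
        y0, y1 = max(0, y - radius), min(h, y + radius + 1)
        out[y0:y1, x0:x1] = color   # filled square marker per keypoint
    return out
```

Because the conditioning signal lives in pixel space, the executor needs no new input heads — only fine-tuning on traced observations.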

If this is right

  • Failed execution steps remain visible in subsequent manager outputs and are automatically addressed without separate recovery code.
  • Long-horizon manipulation becomes repeated local control conditioned on fresh traces rather than monolithic planning.
  • The framework applies to embodied planning, long-horizon reasoning, trajectory prediction, and real-robot manipulation without brittle visual-history buffers.
  • Out-of-distribution generalization improves because the manager continuously re-interprets the current observation.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The separation between manager and executor may allow independent scaling or replacement of either component for different robot hardware.
  • Visual traces defined as 2D keypoints could serve as a lightweight interface for combining the system with other motion planners.
  • Testing the same manager on tasks with larger environmental changes would reveal how far the implicit replanning extends before prediction errors accumulate.

Load-bearing premise

The task-management vision-language model must reliably output accurate remaining subtasks and useful visual traces from current observations across different environments and task distributions.

What would settle it

Run the manager model on a set of held-out long-horizon tasks and check whether conditioning on its predicted subtask sequences and traces yields higher end-to-end success rates than a plain short-horizon VLA baseline on the same tasks.
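
Such a comparison only settles anything if the gap is statistically meaningful. A minimal check, with purely illustrative trial counts, is a pooled two-proportion z-test on success rates:

```python
import math

def two_proportion_z(success_a, n_a, success_b, n_b):
    """Normal-approximation z-statistic for a difference in success rates,
    e.g. manager-conditioned execution vs. a plain short-horizon VLA
    baseline on the same held-out tasks. Counts here are illustrative."""
    p_a, p_b = success_a / n_a, success_b / n_b
    p = (success_a + success_b) / (n_a + n_b)          # pooled success rate
    se = math.sqrt(p * (1 - p) * (1 / n_a + 1 / n_b))  # pooled standard error
    return (p_a - p_b) / se

# Hypothetical outcome: 42/50 successful trials vs. 29/50 for the baseline.
z = two_proportion_z(42, 50, 29, 50)   # z ≈ 2.87, significant at the 1% level
```

Reporting the trial counts alongside such a statistic (or exact binomial intervals) would address the referee's second major comment directly.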

Figures

Figures reproduced from arXiv: 2604.21924 by An-Chieh Cheng, Geng Chen, Hongxu Yin, Isabella Liu, Ri-Zhao Qiu, Rui Yan, Sha Yi, Sifei Liu, Xiaolong Wang, Xueyan Zou.

Figure 1
Figure 1. Our task manager decouples high-level planning from low-level control by decomposing a high-level instruction into sequential sub-tasks and tracking their completion over time. For each step, it predicts a spatial visual trace that acts as an actionable prompt for a low-level VLA executor, enabling reliable execution of complex multi-step manipulation tasks. Project page: https://www.liuisabella.com/LoHo… view at source ↗
Figure 2
Figure 2. Overview of the LoHo-Manip framework. Given a task instruction and the current observation, a vision–language task manager predicts the next sub-task together with a spatial visual trace indicating the intended interaction trajectory. A low-level executor then performs short-horizon control conditioned on this guidance. The system operates in a closed loop: after executing actions, new observations are fed ba… view at source ↗
Figure 3
Figure 3. Data pipeline. Given RGB manipulation videos, we use vision–language foundation models to perform frame grounding, object detection, and captioning to identify interaction events. The trajectory is segmented into atomic subtasks with corresponding start and end frames, while the robot end-effector positions are extracted to form a 2D visual trace. Additional grounding identifies target and graspable obje… view at source ↗
Figure 4
Figure 4. Samples of Sub-task Planning and Error Recovery. view at source ↗
Figure 5
Figure 5. Results on the semantic instruction track of VLABench. The task involves high-level abstract instructions that require reasoning to infer the appropriate manipulation actions. We visualize both the predicted subtasks and the corresponding trajectory generated by our method. view at source ↗
Figure 6
Figure 6. Robot execution results on multi-step tasks. We evaluate performance under both in-distribution and out-of-distribution settings, including scenarios with unseen objects and novel subtask combinations. view at source ↗
Figure 7
Figure 7. (Left) Real robot results. LoHo-Manip significantly outperforms π0.5 (our base VLA finetuned with the same data) in out-of-distribution settings involving novel instructions and objects, demonstrating superior generalization. (Right) IS and PS comparison on VLABench. LoHo-Manip consistently exceeds baselines in metrics requiring robust vision-language reasoning, moving beyond simple success rates. view at source ↗
Figure 8
Figure 8. Qualitative results of trajectory prediction on out-of-distribution em… view at source ↗
Figure 9
Figure 9. Results on different tracks of VLABench. We show the predicted sub-tasks and their corresponding trajectory. view at source ↗
Figure 10
Figure 10. Additional qualitative results from real robot experiments. We evaluate our model on different long-horizon manipulation tasks. C More Ablation Study. C.1 Quality of Sub-Task and Trace Curation. We study the effectiveness of the proposed sub-task and trace construction in our data pipeline for generating training data from real robot demonstrations. We compare the performance before and after incorporatin… view at source ↗
read the original abstract

Long-horizon manipulation remains challenging for vision-language-action (VLA) policies: real tasks are multi-step, progress-dependent, and brittle to compounding execution errors. We present LoHo-Manip, a modular framework that scales short-horizon VLA execution to long-horizon instruction following via a dedicated task-management VLM. The manager is decoupled from the executor and is invoked in a receding-horizon manner: given the current observation, it predicts a progress-aware remaining plan that combines (i) a subtask sequence with an explicit done + remaining split as lightweight language memory, and (ii) a visual trace -- a compact 2D keypoint trajectory prompt specifying where to go and what to approach next. The executor VLA is adapted to condition on the rendered trace, thereby turning long-horizon decision-making into repeated local control by following the trace. Crucially, predicting the remaining plan at each step yields an implicit closed loop: failed steps persist in subsequent outputs, and traces update accordingly, enabling automatic continuation and replanning without hand-crafted recovery logic or brittle visual-history buffers. Extensive experiments spanning embodied planning, long-horizon reasoning, trajectory prediction, and end-to-end manipulation in simulation and on a real Franka robot demonstrate strong gains in long-horizon success, robustness, and out-of-distribution generalization. Project page: https://www.liuisabella.com/LoHoManip

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces LoHo-Manip, a modular framework for long-horizon robotic manipulation. It decouples a task-management VLM from a short-horizon VLA executor: the VLM is invoked receding-horizon style on the current observation to output a progress-aware remaining plan consisting of (i) a subtask sequence with explicit done/remaining language split and (ii) a visual trace (compact 2D keypoint trajectory). The executor VLA is conditioned on the rendered trace to reduce long-horizon control to repeated local following. The central claim is that repeated prediction of the remaining plan creates an implicit closed loop that automatically continues or replans on failures without hand-crafted recovery logic or visual-history buffers. Experiments across embodied planning, trajectory prediction, and real Franka manipulation report strong gains in success, robustness, and OOD generalization.

Significance. If the central mechanism holds, the work offers a practical route to scale existing short-horizon VLAs to long-horizon tasks by adding a lightweight modular manager and visual-trace conditioning. This could reduce reliance on brittle recovery heuristics and improve generalization, which are persistent bottlenecks in real-world manipulation. The approach is modular and re-uses off-the-shelf VLAs, which is a positive engineering contribution.

major comments (3)
  1. [Abstract / method overview] Abstract and method description: the implicit closed-loop claim rests on the task-management VLM reliably outputting an accurate done/remaining split and corrective traces from the current observation alone. No quantitative analysis of progress-prediction error rates, failure modes under partial observability, or visual aliasing is provided; if the VLM drops failed subtasks, the loop collapses and the claimed robustness advantage disappears. This is load-bearing for the central contribution.
  2. [Experiments] Experiments (assumed §4–5): while 'strong gains' are stated, the manuscript must report per-task success rates, number of trials, exact baselines (e.g., VLA with history buffers, other receding-horizon planners), and error bars. Without these, it is impossible to assess whether the reported improvements are statistically meaningful or driven by the trace-conditioning mechanism versus other factors.
  3. [Method] Method section on visual trace: the 2D keypoint trajectory is introduced as a compact prompt, but the rendering process, how keypoints are chosen or updated, and the precise conditioning mechanism into the VLA (e.g., tokenization, training adaptations) are not specified in sufficient detail to reproduce or analyze the executor adaptation.
minor comments (2)
  1. [Abstract] The abstract mentions 'extensive experiments spanning embodied planning, long-horizon reasoning, trajectory prediction, and end-to-end manipulation'; a clearer mapping from these categories to the reported tables/figures would improve readability.
  2. [Method] Notation for the done/remaining split and trace rendering should be formalized (e.g., as equations or pseudocode) rather than left in prose.
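
One way to formalize the done/remaining split and trace rendering that the second minor comment asks for (the notation is ours, offered as a sketch, not the paper's):

```latex
% Manager output at step t, from the current observation o_t and goal g only:
(\mathcal{S}^{\text{done}}_t,\ \mathcal{S}^{\text{rem}}_t,\ \tau_t)
  = \mathrm{VLM}_\theta(o_t, g),
\qquad
\tau_t = \big((x_1, y_1), \dots, (x_K, y_K)\big) \in (\mathbb{R}^2)^K .

% Executor action, conditioned on the observation with the rendered trace,
% where R overlays the keypoint trajectory onto the image:
a_t = \pi_\phi\big(R(o_t, \tau_t)\big).
```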

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback and for recognizing the potential of our modular framework. We address each major comment below, indicating the revisions we will make to strengthen the manuscript.

read point-by-point responses
  1. Referee: [Abstract / method overview] Abstract and method description: the implicit closed-loop claim rests on the task-management VLM reliably outputting an accurate done/remaining split and corrective traces from the current observation alone. No quantitative analysis of progress-prediction error rates, failure modes under partial observability, or visual aliasing is provided; if the VLM drops failed subtasks, the loop collapses and the claimed robustness advantage disappears. This is load-bearing for the central contribution.

    Authors: We agree that the reliability of the VLM's progress prediction is central to the implicit closed-loop mechanism. While our end-to-end results demonstrate overall robustness, we did not include isolated quantitative metrics on progress-prediction accuracy. In the revision, we will add a new analysis subsection reporting error rates for done/remaining splits, performance under partial observability, and failure modes including visual aliasing, with metrics on how often the VLM correctly identifies completed subtasks and generates corrective traces. revision: yes

  2. Referee: [Experiments] Experiments (assumed §4–5): while 'strong gains' are stated, the manuscript must report per-task success rates, number of trials, exact baselines (e.g., VLA with history buffers, other receding-horizon planners), and error bars. Without these, it is impossible to assess whether the reported improvements are statistically meaningful or driven by the trace-conditioning mechanism versus other factors.

    Authors: We agree that more granular reporting is required. The current manuscript includes per-task success rates and baseline comparisons, but we will revise the experiments section to explicitly report the number of trials per task, include error bars (standard deviation), and detail exact baseline implementations such as VLA with history buffers and other receding-horizon planners. This will enable assessment of statistical significance and the specific contribution of trace conditioning. revision: yes

  3. Referee: [Method] Method section on visual trace: the 2D keypoint trajectory is introduced as a compact prompt, but the rendering process, how keypoints are chosen or updated, and the precise conditioning mechanism into the VLA (e.g., tokenization, training adaptations) are not specified in sufficient detail to reproduce or analyze the executor adaptation.

    Authors: We agree that the visual trace details require expansion for reproducibility. We will update the method section to specify the rendering process for the 2D keypoint trajectories, the criteria for selecting and updating keypoints based on subtask goals and observations, and the precise conditioning into the VLA including tokenization and any training adaptations to the executor. We will add pseudocode and figures to illustrate these steps. revision: yes

Circularity Check

0 steps flagged

No circularity: modular framework with independent components and external validation

full rationale

The paper introduces LoHo-Manip as a decoupled modular system: a task-management VLM predicts progress-aware subtask sequences plus visual traces in receding horizon, and an executor VLA conditions on rendered traces. The implicit closed-loop property follows directly from the design choice of re-predicting the remaining plan at each step with explicit done/remaining split; it is not obtained by redefining any quantity in terms of itself or by fitting a parameter to a subset and relabeling the output as a prediction. No equations, ansatzes, or uniqueness theorems are invoked that reduce the central claims to prior self-citations or inputs by construction. Experiments across simulation, real-robot, and out-of-distribution settings supply independent empirical support, rendering the derivation self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The paper's contribution is primarily the architectural design and the assumption that these components work together effectively, with experiments supporting but not fully detailing the assumptions.

axioms (1)
  • domain assumption The vision-language model can generate accurate and useful subtask sequences and visual traces from single observations
    This is central to the manager's role and not proven in the abstract.
invented entities (1)
  • Visual trace as 2D keypoint trajectory no independent evidence
    purpose: To provide spatial guidance to the VLA executor for local control
    Introduced as a new conditioning mechanism in this framework.

pith-pipeline@v0.9.0 · 5575 in / 1404 out tokens · 54758 ms · 2026-05-09T21:06:51.079750+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

71 extracted references · 21 canonical work pages · 10 internal anchors

  1. [1]

    GPT-4 Technical Report

    Achiam, J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I., Aleman, F.L., Almeida, D., Altenschmidt, J., Altman, S., Anadkat, S., et al.: Gpt-4 technical report. arXiv preprint arXiv:2303.08774 (2023)

  2. [2]

    In: CoRL (2022)

    Ahn, M., Brohan, A., Brown, N., Chebotar, Y., Cortes, O., David, B., Finn, C., Fu, C., Gopalakrishnan, K., Hausman, K., et al.: Do as i can, not as i say: Grounding language in robotic affordances. In: CoRL (2022)

  3. [3]

    In: NeurIPS (2025)

    Ahn, S., Choi, W., Lee, J., Park, J., Woo, H.: Towards reliable code-as-policies: A neuro-symbolic framework for embodied task planning. In: NeurIPS (2025)

  4. [4]

    Qwen3-VL Technical Report

    Bai, S., Cai, Y., Chen, R., Chen, K., Chen, X., Cheng, Z., Deng, L., Ding, W., Gao, C., Ge, C., et al.: Qwen3-vl technical report. arXiv preprint arXiv:2511.21631 (2025)

  5. [5]

    Qwen2.5-VL Technical Report

    Bai, S., Chen, K., Liu, X., Wang, J., Ge, W., Song, S., Dang, K., Wang, P., Wang, S., Tang, J., Zhong, H., Zhu, Y., Yang, M., Li, Z., Wan, J., Wang, P., Ding, W., Fu, Z., Xu, Y., Ye, J., Zhang, X., Xie, T., Cheng, Z., Zhang, H., Yang, Z., Xu, H., Lin, J.: Qwen2.5-vl technical report. arXiv preprint arXiv:2502.13923 (2025)

  6. [6]

    A careful examination of large behavior models for multitask dexterous manipulation.arXiv preprint arXiv:2507.05331, 2025

    Barreiros, J., Beaulieu, A., Bhat, A., Cory, R., Cousineau, E., Dai, H., Fang, C.H., Hashimoto, K., Irshad, M.Z., Itkina, M., et al.: A careful examination of large behavior models for multitask dexterous manipulation. arXiv preprint arXiv:2507.05331 (2025)

  7. [7]

    GR00T N1: An Open Foundation Model for Generalist Humanoid Robots

    Bjorck, J., Castañeda, F., Cherniadev, N., Da, X., Ding, R., Fan, L., Fang, Y., Fox, D., Hu, F., Huang, S., et al.: Gr00t n1: An open foundation model for generalist humanoid robots. arXiv preprint arXiv:2503.14734 (2025)

  8. [8]

    In: RSS (2025)

    Black, K., Brown, N., Driess, D., Esmail, A., Equi, M., Finn, C., Fusai, N., Groom, L., Hausman, K., Ichter, B., et al.: π0: A vision-language-action flow model for general robot control. In: RSS (2025)

  9. [9]

    In: RSS (2023)

    Brohan, A., Brown, N., Carbajal, J., Chebotar, Y., Dabis, J., Finn, C., Gopalakrishnan, K., Hausman, K., Herzog, A., Hsu, J., et al.: Rt-1: Robotics transformer for real-world control at scale. In: RSS (2023)

  10. [10]

    arXiv preprint arXiv: (2025)

    Cen, J., Yu, C., Yuan, H., Jiang, Y., Huang, S., Guo, J., Li, X., Song, Y., Luo, H., Wang, F., Zhao, D., Chen, H.: Worldvla: Towards autoregressive action world model. arXiv preprint arXiv: (2025)

  11. [11]

    Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling

    Chen, Z., Wang, W., Cao, Y., Liu, Y., Gao, Z., Cui, E., Zhu, J., Ye, S., Tian, H., Liu, Z., et al.: Expanding performance boundaries of open-source multimodal models with model, data, and test-time scaling. arXiv preprint arXiv:2412.05271 (2024)

  12. [12]

    In: RSS (2025)

    Cheng, A.C., Ji, Y., Yang, Z., Gongye, Z., Zou, X., Kautz, J., Bıyık, E., Yin, H., Liu, S., Wang, X.: Navila: Legged robot vision-language-action model for navigation. In: RSS (2025)

  13. [13]

    In: RSS (2023)

    Chi, C., Xu, Z., Feng, S., Cousineau, E., Du, Y., Burchfiel, B., Tedrake, R., Song, S.: Diffusion policy: Visuomotor policy learning via action diffusion. In: RSS (2023)

  14. [14]

    Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

    Comanici, G., Bieber, E., Schaekermann, M., Pasupat, I., Sachdeva, N., Dhillon, I., Blistein, M., Ram, O., Zhang, D., Rosen, E., et al.: Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. arXiv preprint arXiv:2507.06261 (2025)

  15. [15]

    StarVLA: A Lego-like Codebase for Vision-Language-Action Model Developing

    Community, S.: Starvla: A lego-like codebase for vision-language- action model developing. arXiv preprint arXiv:2604.05014 (2026)

  16. [16]

    RynnBrain: Open Embodied Foundation Models.arXiv preprint arXiv:2602.14979, 2026

    Dang, R., Guo, J., Hou, B., Leng, S., Li, K., Li, X., Liu, J., Mao, Y., Wang, Z., Yuan, Y., et al.: Rynnbrain: Open embodied foundation models. arXiv preprint arXiv:2602.14979 (2026)

  17. [17]

    In: ICRA (2025)

    Dasari, S., Mees, O., Zhao, S., Srirama, M.K., Levine, S.: The ingredients for robotic diffusion transformers. In: ICRA (2025)

  18. [18]

    In: ICRA (2017)

    Devin, C., Gupta, A., Darrell, T., Abbeel, P., Levine, S.: Learning modular neural network policies for multi-task and multi-robot transfer. In: ICRA (2017)

  19. [19]

    In: ICML (2023)

    Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. In: ICML (2023)

  20. [20]

    In: CoRL (2025)

    Fan, Y., Ding, P., Bai, S., Tong, X., Zhu, Y., Lu, H., Dai, F., Zhao, W., Liu, Y., Huang, S., et al.: Long-vla: Unleashing long-horizon capability of vision language action model for robot manipulation. In: CoRL (2025)

  21. [21]

    Robix: A unified model for robot interaction, reasoning and planning

    Fang, H., Zhang, M., Dong, H., Li, W., Wang, Z., Zhang, Q., Tian, X., Hu, Y., Li, H.: Robix: A unified model for robot interaction, reasoning and planning. arXiv preprint arXiv:2509.01106 (2025)

  22. [22]

    In: CoRL (2021)

    Florence, P., Lynch, C., Zeng, A., Ramirez, O.A., Wahid, A., Downs, L., Wong, A., Lee, J., Mordatch, I., Tompson, J.: Implicit behavioral cloning. In: CoRL (2021)

  23. [23]

    Garrett, C.R., Chitnis, R., Holladay, R., Kim, B., Silver, T., Kaelbling, L.P., Lozano-Pérez, T.: Integrated task and motion planning. Annual Review of Control, Robotics, and Autonomous Systems (2021)

  24. [24]

    Google DeepMind: Gemini 3 Flash: frontier intelligence built for speed. https://deepmind.google/models/gemini/flash (2025)

  25. [25]

    Bagelvla: Enhancing long-horizon manipulation via interleaved vision-language-action generation.CoRR, abs/2602.09849, 2026

    Hu, Y., Zhang, J., Luo, Y., Guo, Y., Chen, X., Sun, X., Feng, K., Lu, Q., Chen, S., Zhang, Y., Li, W., Chen, J.: Bagelvla: Enhancing long-horizon manipulation via interleaved vision-language-action generation. arXiv preprint arXiv:2602.09849 (2026)

  26. [26]

    In: CVPR (2026)

    Huang, C.P., Man, Y., Yu, Z., Chen, M.H., Kautz, J., Wang, Y.C.F., Yang, F.E.: Fast-thinkact: Efficient vision-language-action reasoning via verbalizable latent planning. In: CVPR (2026)

  27. [27]

    In: NeurIPS (2025)

    Huang, C.P., Wu, Y.H., Chen, M.H., Wang, Y.C.F., Yang, F.E.: Thinkact: Vision-language-action reasoning via reinforced visual latent planning. In: NeurIPS (2025)

  28. [28]

    In: CoRL (2023)

    Huang, W., Wang, C., Zhang, R., Li, Y., Wu, J., Fei-Fei, L.: Voxposer: Composable 3d value maps for robotic manipulation with language models. In: CoRL (2023)

  29. [29]

    Inner Monologue: Embodied Reasoning through Planning with Language Models

    Huang, W., Xia, F., Xiao, T., Chan, H., Liang, J., Florence, P., Zeng, A., Tompson, J., Mordatch, I., Chebotar, Y., et al.: Inner monologue: Embodied reasoning through planning with language models. arXiv preprint arXiv:2207.05608 (2022)

  30. [30]

    Nora: A small open-sourced generalist vision language action model for embodied tasks, 2025

    Hung, C.Y., Sun, Q., Hong, P., Zadeh, A., Li, C., Tan, U., Majumder, N., Poria, S., et al.: Nora: A small open-sourced generalist vision language action model for embodied tasks. arXiv preprint arXiv:2504.19854 (2025)

  31. [31]

    $\pi_{0.5}$: a Vision-Language-Action Model with Open-World Generalization

    Intelligence, P., Black, K., Brown, N., Darpinian, J., Dhabalia, K., Driess, D., Esmail, A., Equi, M., Finn, C., Fusai, N., et al.: π0.5: A vision-language-action model with open-world generalization. arXiv preprint arXiv:2504.16054 (2025)

  32. [32]

    In: ICML (2022)

    Janner, M., Du, Y., Tenenbaum, J.B., Levine, S.: Planning with diffusion for flexible behavior synthesis. In: ICML (2022)

  33. [33]

    In: CVPR (2025)

    Ji, Y., Tan, H., Shi, J., Hao, X., Zhang, Y., Zhang, H., Wang, P., Zhao, M., Mu, Y., An, P., et al.: Robobrain: A unified brain model for robotic manipulation from abstract to concrete. In: CVPR (2025)

  34. [34]

    In: ICML (2023)

    Jiang, Y., Gupta, A., Zhang, Z., Wang, G., Dou, Y., Chen, Y., Fei-Fei, L., Anandkumar, A., Zhu, Y., Fan, L.: Vima: Robot manipulation with multimodal prompts. In: ICML (2023)

  35. [35]

    Kaelbling, L.P., Lozano-Pérez, T.: Hierarchical task and motion planning in the now. Tech. rep. (2010), https://dspace.mit.edu/handle/1721.1/54780

  36. [36]

    In: CoRL (2024)

    Kim, M.J., Pertsch, K., Karamcheti, S., Xiao, T., Balakrishna, A., Nair, S., Rafailov, R., Foster, E., Lam, G., Sanketi, P., et al.: Openvla: An open-source vision-language-action model. In: CoRL (2024)

  37. [37]

    In: RSS (2021)

    Kumar, A., Fu, Z., Pathak, D., Malik, J.: Rma: Rapid motor adaptation for legged robots. In: RSS (2021)

  38. [38]

    In: ICRA (2026)

    Lee, J., Duan, J., Fang, H., Deng, Y., Liu, S., Li, B., Fang, B., Zhang, J., Wang, Y.R., Lee, S., et al.: Molmoact: Action reasoning models that can reason in space. In: ICRA (2026)

  39. [39]

    In: ICCV (2025)

    Li, D., Zhang, Y., Cao, M., Liu, D., Xie, W., Hui, T., Lin, L., Xie, Z., Li, Y.: Towards long-horizon vision-language-action system: Reasoning, acting and memory. In: ICCV (2025)

  40. [40]

    In: ICLR (2025)

    Li, Y., Deng, Y., Zhang, J., Jang, J., Memmel, M., Yu, R., Garrett, C.R., Ramos, F., Fox, D., Li, A., et al.: Hamster: Hierarchical action models for open-world robot manipulation. In: ICLR (2025)

  41. [41]

    In: ICRA (2023)

    Liang, J., Huang, W., Xia, F., Xu, P., Hausman, K., Ichter, B., Florence, P., Zeng, A.: Code as policies: Language model programs for embodied control. In: ICRA (2023)

  42. [42]

    In: NeurIPS (2023)

    Liu, B., Zhu, Y., Gao, C., Feng, Y., Liu, Q., Zhu, Y., Stone, P.: Libero: Benchmarking knowledge transfer for lifelong robot learning. In: NeurIPS (2023)

  43. [43]

    In: ICLR (2025)

    Liu, Y., Hamid, J.I., Xie, A., Lee, Y., Du, M., Finn, C.: Bidirectional decoding: Improving action chunking via guided test-time sampling. In: ICLR (2025)

  44. [44]

    In: CVPR (2025)

    Liu, Z., Zhu, L., Shi, B., Zhang, Z., Lou, Y., Yang, S., Xi, H., Cao, S., Gu, Y., Li, D., et al.: Nvila: Efficient frontier visual language models. In: CVPR (2025)

  45. [45]

    In: The International FLAIRS Conference Proceedings (2022)

    McArthur, M., Moshfeghi, Y., Cashmore, M.: Egoplan: A framework for multi-agent planning using single agent planners. In: The International FLAIRS Conference Proceedings (2022)

  46. [46]

    Llama 3.2

    Meta AI: Llama 3.2: Revolutionizing edge ai and vision with open, customizable models. https://ai.meta.com/blog/llama-3-2-connect-2024-vision-edge-mobile-devices (2024)

  47. [47]

    GR00T N1.5: An Improved Open Foundation Model for Generalist Humanoid Robots

    NVIDIA Research: GR00T N1.5: An Improved Open Foundation Model for Generalist Humanoid Robots. https://research.nvidia.com/labs/gear/gr00t-n1_5 (2025)

  48. [48]

    In: RSS (2024)

    Octo Model Team, Ghosh, D., Walke, H., Pertsch, K., Black, K., Mees, O., Dasari, S., Hejna, J., Xu, C., Luo, J., Kreiman, T., Tan, Y., Sanketi, P., Vuong, Q., Xiao, T., Sadigh, D., Finn, C., Levine, S.: Octo: An open-source generalist robot policy. In: RSS (2024)

  49. [49]

    GPT-4o Mini: Advancing Cost-Efficient Intelligence

    OpenAI: GPT-4o Mini: Advancing Cost-Efficient Intelligence. https://openai.com/index/gpt-4o-mini-advancing-cost-efficient-intelligence (2024)

  50. [50]

    In: ICRA (2024)

    O’Neill, A., Rehman, A., Maddukuri, A., Gupta, A., Padalkar, A., Lee, A., Pooley, A., Gupta, A., Mandlekar, A., Jain, A., et al.: Open x-embodiment: Robotic learning datasets and rt-x models: Open x-embodiment collaboration 0. In: ICRA (2024)

  51. [51]

    In: RSS (2025)

    Pertsch, K., Stachowicz, K., Ichter, B., Driess, D., Nair, S., Vuong, Q., Mees, O., Finn, C., Levine, S.: Fast: Efficient action tokenization for vision-language-action models. In: RSS (2025)

  52. [52]

    arXiv preprint arXiv:2412.04447 (2024)

    Qiu, L., Chen, Y., Ge, Y., Ge, Y., Shan, Y., Liu, X.: Egoplan-bench2: A benchmark for multimodal large language model planning in real-world scenarios. arXiv preprint arXiv:2412.04447 (2024)

  53. [53]

    In: RSS (2025)

    Qu, D., Song, H., Chen, Q., Yao, Y., Ye, X., Ding, Y., Wang, Z., Gu, J., Zhao, B., Wang, D., et al.: Spatialvla: Exploring spatial representations for visual-language-action model. In: RSS (2025)

  54. [54]

    Sayplan: Grounding large language models using 3d scene graphs for scalable task planning,

    Rana, K., Haviland, J., Garg, S., Abou-Chakra, J., Reid, I., Suenderhauf, N.: Sayplan: Grounding large language models using 3d scene graphs for scalable robot task planning. arXiv preprint arXiv:2307.06135 (2023)

  55. [55]

    In: Proceed- ings of the fourteenth international conference on artificial intelli- gence and statistics

    Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics. pp. 627–635. JMLR Workshop and Conference Proceedings (2011)

  56. Sermanet, P., Ding, T., Zhao, J., Xia, F., Dwibedi, D., Gopalakrishnan, K., Chan, C., Dulac-Arnold, G., Maddineni, S., Joshi, N.J., et al.: Robovqa: Multimodal long-horizon reasoning for robotics. In: ICRA (2024)

  57. Singh, I., Blukis, V., Mousavian, A., Goyal, A., Xu, D., Tremblay, J., Fox, D., Thomason, J., Garg, A.: Progprompt: Generating situated robot task plans using large language models. In: ICRA (2023)

  58. Tan, H., Zhou, E., Li, Z., Xu, Y., Ji, Y., Chen, X., Chi, C., Wang, P., Jia, H., Ao, Y., et al.: Robobrain 2.5: Depth in sight, time in mind. arXiv preprint arXiv:2601.14352 (2026)

  59. Team, B.R.: Robobrain 2.0 technical report. arXiv preprint arXiv:2507.02029 (2025)

  60. Torne, M., Pertsch, K., Walke, H., Vedder, K., Nair, S., Ichter, B., Ren, A.Z., Wang, H., Tang, J., Stachowicz, K., et al.: Mem: Multi-scale embodied memory for vision language action models. https://www.pi.website/download/Mem.pdf (2026)

  61. Wu, W., Lu, F., Wang, Y., Yang, S., Liu, S., Wang, F., Zhu, Q., Sun, H., Wang, Y., Ma, S., Ren, Y., Zhang, K., Yu, H., Zhao, J., Zhou, S., Qiu, Z., Xiong, H., Wang, Z., Wang, Z., Cheng, R., Li, Y., Huang, Y., Zhu, X., Shen, Y., Zheng, K.: A pragmatic vla foundation model. arXiv preprint arXiv:2601.18692 (2026)

  62. Yan, R., Fu, J., Yang, S., Paulsen, L., Cheng, X., Wang, X.: Ace-f: A cross embodiment foldable system with force feedback for dexterous teleoperation. arXiv preprint arXiv:2511.20887 (2025)

  63. Yang, J., Tan, R., Wu, Q., Zheng, R., Peng, B., Liang, Y., Gu, Y., Cai, M., Ye, S., Jang, J., et al.: Magma: A foundation model for multimodal ai agents. In: CVPR (2025)

  64. Yang, R., Chen, H., Zhang, J., Zhao, M., Qian, C., Wang, K., Wang, Q., Koripella, T.V., Movahedi, M., Li, M., et al.: Embodiedbench: Comprehensive benchmarking multi-modal large language models for vision-driven embodied agents. In: ICML (2025)

  65. Yuan, Y., Cui, H., Chen, Y., Dong, Z., Ni, F., Kou, L., Liu, J., Li, P., Zheng, Y., Hao, J.: From seeing to doing: Bridging reasoning and decision for robotic manipulation. In: ICLR (2026)

  66. Yuan, Y., Cui, H., Huang, Y., Chen, Y., Ni, F., Dong, Z., Li, P., Zheng, Y., Hao, J.: Embodied-r1: Reinforced embodied reasoning for general robotic manipulation. In: ICLR (2026)

  67. Zhang, S., Xu, Z., Liu, P., Yu, X., Li, Y., Gao, Q., Fei, Z., Yin, Z., Wu, Z., Jiang, Y.G., et al.: Vlabench: A large-scale benchmark for language-conditioned robotics manipulation with long-horizon reasoning tasks. In: ICCV (2025)

  68. Zhao, Q., Lu, Y., Kim, M.J., Fu, Z., Zhang, Z., Wu, Y., Li, Z., Ma, Q., Han, S., Finn, C., et al.: Cot-vla: Visual chain-of-thought reasoning for vision-language-action models. In: CVPR (2025)

  69. Zheng, R., Liang, Y., Huang, S., Gao, J., Daumé III, H., Kolobov, A., Huang, F., Yang, J.: TraceVLA: Visual trace prompting enhances spatial-temporal awareness for generalist robotic policies. In: ICLR (2025)

  70. Zhu, J., Wang, W., Chen, Z., Liu, Z., Ye, S., Gu, L., Tian, H., Duan, Y., Su, W., Shao, J., et al.: Internvl3: Exploring advanced training and test-time recipes for open-source multimodal models. arXiv preprint arXiv:2504.10479 (2025)

  71. Zitkovich, B., Yu, T., Xu, S., Xu, P., Xiao, T., Xia, F., Wu, J., Wohlhart, P., Welker, S., Wahid, A., et al.: Rt-2: Vision-language-action models transfer web knowledge to robotic control. In: CoRL (2023)