pith. machine review for the scientific record.

arxiv: 2603.15620 · v2 · submitted 2026-03-16 · 💻 cs.CV · cs.RO

Recognition: 2 theorem links · Lean Theorem

Towards Generalizable Robotic Manipulation in Dynamic Environments

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 09:46 UTC · model grok-4.3

classification 💻 cs.CV cs.RO
keywords dynamic manipulation · vision-language-action models · optical flow · robotic manipulation · DOMINO dataset · PUMA architecture · spatiotemporal reasoning · generalization

The pith

PUMA integrates scene-centric historical optical flow with specialized world queries to implicitly forecast object states, delivering a 6.3% success-rate gain on dynamic manipulation tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Vision-language-action models handle static scenes but lose performance when targets move, largely because existing datasets omit dynamics and models process only single frames. The work supplies DOMINO, a benchmark of 35 tasks with hierarchical complexities and over 110K expert trajectories, to close that data gap and to test training strategies for temporal awareness. PUMA then couples history-aware perception to short-horizon prediction by feeding optical-flow histories and world queries into the model, letting it anticipate object-centric futures without explicit physics. Experiments show the resulting architecture reaches state-of-the-art success rates and that dynamic training also strengthens static-task performance.

Core claim

By integrating scene-centric historical optical flow and specialized world queries to implicitly forecast object-centric future states, PUMA couples history-aware perception with short-horizon prediction and achieves state-of-the-art performance, yielding a 6.3% absolute improvement in success rate over baselines on dynamic manipulation tasks.

What carries the argument

PUMA architecture that integrates scene-centric historical optical flow and specialized world queries to enable implicit short-horizon prediction of object-centric future states.
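
As a concrete picture of this carrier, the sketch below illustrates the dual-query readout described in the paper (Action Queries for control, World Queries for implicit dynamics). It is a hedged illustration with hypothetical module names and dimensions, not the authors' released implementation.

```python
# Hedged sketch of a dual-query readout: learned Action Queries and World
# Queries cross-attend to fused scene tokens (flow history + current frame +
# instruction); action queries decode a continuous action chunk, world queries
# produce features later compared to future DINO features. All names and sizes
# are illustrative assumptions.
import torch
import torch.nn as nn

class DualQueryHead(nn.Module):
    def __init__(self, dim=512, n_action=8, n_world=16, action_dim=14):
        super().__init__()
        self.action_queries = nn.Parameter(torch.randn(n_action, dim) * 0.02)
        self.world_queries = nn.Parameter(torch.randn(n_world, dim) * 0.02)
        layer = nn.TransformerDecoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=4)
        self.action_head = nn.Linear(dim, action_dim)  # continuous dual-arm action chunk
        self.world_head = nn.Linear(dim, dim)          # implicit future-state features

    def forward(self, scene_tokens):
        # scene_tokens: (B, T, dim) fused flow-history, observation, and instruction tokens
        b = scene_tokens.size(0)
        queries = torch.cat([self.action_queries, self.world_queries], dim=0)
        out = self.decoder(queries.unsqueeze(0).expand(b, -1, -1), scene_tokens)
        n_a = self.action_queries.size(0)
        return self.action_head(out[:, :n_a]), self.world_head(out[:, n_a:])
```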

If this is right

  • Training on dynamic trajectories produces spatiotemporal representations that transfer to and improve performance on static manipulation tasks.
  • Existing VLA models can be ranked systematically by their ability to maintain success across the benchmark's multi-dimensional dynamic evaluation suite.
  • Hierarchical task complexities in the dataset expose which components of spatiotemporal reasoning remain weakest in current models.
  • Short-horizon implicit forecasting reduces reliance on separate motion-prediction modules during policy execution.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same flow-plus-query pattern could be stacked across multiple time steps to support longer-horizon planning without changing the core architecture.
  • Because the method avoids explicit object tracking, it may generalize more readily to cluttered scenes where individual object identities are hard to maintain.
  • Combining the learned implicit forecasts with lightweight physics priors at inference time could further raise robustness when prediction errors accumulate.

Load-bearing premise

Scene-centric historical optical flow combined with specialized world queries supplies enough information for reliable short-horizon forecasts of object motion without explicit physics models.
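
For a concrete picture of that input, the sketch below builds a scene-centric flow history from an RGB frame buffer using dense Farnebäck flow (the method the paper cites as [17]). The window length and frame shapes are assumptions, not the paper's exact pipeline.

```python
# Minimal sketch, assuming a buffer of recent RGB frames: stack per-pixel
# Farneback flow between consecutive frames into a short history tensor.
import cv2
import numpy as np

def flow_history(frames, history=4):
    """frames: list of HxWx3 uint8 RGB images, oldest first."""
    grays = [cv2.cvtColor(f, cv2.COLOR_RGB2GRAY) for f in frames[-(history + 1):]]
    flows = []
    for prev, curr in zip(grays[:-1], grays[1:]):
        flow = cv2.calcOpticalFlowFarneback(
            prev, curr, None,
            pyr_scale=0.5, levels=3, winsize=15,
            iterations=3, poly_n=5, poly_sigma=1.2, flags=0)
        flows.append(flow)               # HxWx2 per-pixel (dx, dy)
    return np.stack(flows, axis=0)       # (history, H, W, 2)
```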

What would settle it

Ablating the optical-flow input from PUMA and measuring whether success rates on the DOMINO dynamic tasks fall back to baseline levels.
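
A hedged sketch of that test: run the same evaluation twice, once with the flow-history input zeroed out, and compare success rates. The `policy.act` call and the task interface are assumed placeholders, not the released DOMINO harness.

```python
# Illustrative ablation harness under assumed interfaces (not the paper's code):
# success_rate(...) with zero_flow=True removes the optical-flow history signal.
import numpy as np

def success_rate(policy, tasks, episodes=50, zero_flow=False):
    results = []
    for task in tasks:
        for _ in range(episodes):
            obs = task.reset()
            done, success = False, False
            while not done:
                flow = np.zeros_like(obs["flow_history"]) if zero_flow else obs["flow_history"]
                action = policy.act(obs["rgb"], flow, obs["instruction"])
                obs, done, success = task.step(action)
            results.append(success)
    return float(np.mean(results))

# gap = success_rate(policy, dynamic_tasks) - success_rate(policy, dynamic_tasks, zero_flow=True)
```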

Figures

Figures reproduced from arXiv: 2603.15620 by Dingkang Liang, Heng Fang, Shangru Li, Shuhan Wang, Xiang Bai, Xuanyang Xi.

Figure 1
Figure 1: (a) Illustration of the defined dynamic difficulty levels, progressing from static (Level 0) to stochastic and abrupt dynamics (Level 3). (b) Dynamic awareness requires capturing historical context and anticipating future motion. (c) Performance of SOTA models degrades when shifting from static to dynamic environments. view at source ↗
Figure 2
Figure 2: Dataset visualization. We present the DOMINO dataset of 117,000 dynamic manipulation trajectories, covering 35 distinct tasks across five robot embodiments. view at source ↗
Figure 3
Figure 3: PUMA processes historical motion flows, current observations, and instructions to encode scene-centric historical dynamics. It employs a dual-query mechanism where Action Queries decode continuous action chunks and World Queries aggregate dynamic representations. During training, world queries are supervised via a similarity loss against future features extracted by DINO to predict object-centric dynamics… view at source ↗
Figure 4
Figure 4: Performance degradation of the ACT model across three dynamic complexity levels. view at source ↗
Figure 5
Figure 5: PUMA performs significantly better than other baselines on difficult tasks. Specifically, PUMA achieves the highest average success rate of 17.20%, substantially outperforming recent strong baselines such as OpenVLA-OFT (Qwen3-VL-based) [11] (10.90%) and π0.5 [4] (9.63%). Furthermore, our method attains a peak Manipulation Score of 34.97, indicating a higher quality of interaction with moving targets. view at source ↗
Figure 6
Figure 6: Qualitative demonstrations on the DOMINO dataset across hierarchical dynamic complexities. The first two columns illustrate expert trajectories in the clean setting, while the last two columns present those under domain randomization. Brief task descriptions are provided below each sequence. Best viewed in Adobe Acrobat Reader; animations play automatically or upon clicking. view at source ↗
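
The Figure 3 caption states that world queries are supervised with a similarity loss against future features extracted by DINO. A minimal sketch of one plausible form of that loss follows; the pooling and exact target construction are assumptions, not the authors' formulation.

```python
# Hedged sketch of a similarity loss between world-query outputs and future
# DINO features (teacher features are detached so gradients flow only into
# the policy); mean pooling is an assumed simplification.
import torch
import torch.nn.functional as F

def world_query_loss(world_feats, future_dino_feats):
    """world_feats: (B, n_world, D) world-query outputs.
    future_dino_feats: (B, n_patches, D) DINO features of a future frame."""
    pred = world_feats.mean(dim=1)                    # pool query outputs
    target = future_dino_feats.detach().mean(dim=1)   # pool patch features, stop gradient
    return 1.0 - F.cosine_similarity(pred, target, dim=-1).mean()
```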
read the original abstract

Vision-Language-Action (VLA) models excel in static manipulation but struggle in dynamic environments with moving targets. This performance gap primarily stems from a scarcity of dynamic manipulation datasets and the reliance of mainstream VLAs on single-frame observations, restricting their spatiotemporal reasoning capabilities. To address this, we introduce DOMINO, a large-scale dataset and benchmark for generalizable dynamic manipulation, featuring 35 tasks with hierarchical complexities, over 110K expert trajectories, and a multi-dimensional evaluation suite. Through comprehensive experiments, we systematically evaluate existing VLAs on dynamic tasks, explore effective training strategies for dynamic awareness, and validate the generalizability of dynamic data. Furthermore, we propose PUMA, a dynamics-aware VLA architecture. By integrating scene-centric historical optical flow and specialized world queries to implicitly forecast object-centric future states, PUMA couples history-aware perception with short-horizon prediction. Results demonstrate that PUMA achieves state-of-the-art performance, yielding a 6.3% absolute improvement in success rate over baselines. Moreover, we show that training on dynamic data fosters robust spatiotemporal representations that transfer to static tasks. All code and data are available at https://github.com/H-EmbodVis/DOMINO.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces the DOMINO dataset and benchmark for dynamic robotic manipulation (35 tasks, >110K trajectories) and proposes the PUMA VLA architecture, which integrates scene-centric historical optical flow with specialized world queries to implicitly forecast object-centric future states. It reports that PUMA achieves state-of-the-art results with a 6.3% absolute success-rate gain over baselines on dynamic tasks and that dynamic-data training yields transferable spatiotemporal representations for static tasks; code and data are released publicly.

Significance. If the empirical claims are substantiated, the work would be significant for closing the dynamic-environment gap in VLA models by supplying both a new benchmark and an architecture that couples history-aware perception with short-horizon prediction without explicit physics. Public code and data release is a clear strength that supports reproducibility and follow-on research.

major comments (2)
  1. [Experiments] Experiments section: the central 6.3% absolute improvement is presented without reported baseline implementation details, statistical significance tests, or explicit task-success definitions, preventing verification that the gain is attributable to the proposed components rather than implementation differences.
  2. [PUMA Architecture] PUMA architecture and results: the claim that scene-centric historical optical flow plus specialized world queries produce reliable short-horizon object-state forecasts rests solely on end-to-end task success; no isolated quantitative evaluation (e.g., future-position or velocity prediction error on held-out trajectories) is provided to confirm the forecasting mechanism operates as described rather than as richer history features.
minor comments (1)
  1. [Abstract] The abstract refers to a 'multi-dimensional evaluation suite' without enumerating the dimensions or metrics used.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which help strengthen the clarity and verifiability of our empirical claims. We address each major point below and will incorporate the requested details and evaluations into the revised manuscript.

read point-by-point responses
  1. Referee: [Experiments] Experiments section: the central 6.3% absolute improvement is presented without reported baseline implementation details, statistical significance tests, or explicit task-success definitions, preventing verification that the gain is attributable to the proposed components rather than implementation differences.

    Authors: We agree that the current experiments section omits key implementation details for the baselines, statistical significance testing, and precise task-success definitions. In the revision we will expand the Experiments section to include: (1) full baseline implementation details (model checkpoints, training hyperparameters, data augmentation, and optimization settings); (2) statistical significance results (e.g., mean success rates with standard deviations across 5 random seeds and paired t-test p-values); and (3) explicit per-task success criteria (e.g., object displacement thresholds, grasp stability conditions, and temporal windows for dynamic targets). These additions will allow readers to confirm that the reported 6.3% gain stems from the proposed components. revision: yes

  2. Referee: [PUMA Architecture] PUMA architecture and results: the claim that scene-centric historical optical flow plus specialized world queries produce reliable short-horizon object-state forecasts rests solely on end-to-end task success; no isolated quantitative evaluation (e.g., future-position or velocity prediction error on held-out trajectories) is provided to confirm the forecasting mechanism operates as described rather than as richer history features.

    Authors: We acknowledge that the forecasting behavior is currently evidenced only indirectly through end-to-end task success. While the architecture is explicitly designed to couple historical optical flow with world queries for implicit short-horizon prediction, we agree that isolated metrics would provide stronger support. In the revised manuscript we will add a dedicated ablation subsection reporting future-position and velocity prediction errors (L2 distance and angular error) on held-out trajectories for both PUMA and ablated variants (no optical flow, no world queries). This will directly quantify the forecasting contribution beyond richer history features. revision: yes
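
The rebuttal promises two analyses: seed-level success rates with a paired significance test, and held-out forecasting error (L2 position error, angular error). A minimal sketch of both computations under assumed array shapes; the names are illustrative, not the paper's evaluation code.

```python
# Hedged sketch of the promised analyses.
import numpy as np
from scipy import stats

def seed_comparison(puma_success, baseline_success):
    """puma_success, baseline_success: per-seed success rates, shape (n_seeds,)."""
    t, p = stats.ttest_rel(puma_success, baseline_success)
    return {"puma_mean_std": (puma_success.mean(), puma_success.std()),
            "baseline_mean_std": (baseline_success.mean(), baseline_success.std()),
            "paired_t": t, "p_value": p}

def forecast_errors(pred_pos, true_pos, pred_vel, true_vel):
    """pred_pos, true_pos: (N, 3) future object positions; pred_vel, true_vel: (N, 3) velocities."""
    l2 = np.linalg.norm(pred_pos - true_pos, axis=-1).mean()
    cos = np.sum(pred_vel * true_vel, axis=-1) / (
        np.linalg.norm(pred_vel, axis=-1) * np.linalg.norm(true_vel, axis=-1) + 1e-8)
    ang = np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))).mean()
    return {"l2_position_error": l2, "angular_error_deg": ang}
```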

Circularity Check

0 steps flagged

No circularity; empirical results on new dataset and architecture are self-contained

full rationale

The paper introduces the DOMINO dataset and PUMA architecture, with performance claims resting on direct experimental comparisons of success rates across tasks. No equations, derivations, or self-citations are shown that reduce the reported 6.3% improvement or the implicit forecasting claim to a fitted parameter, self-definition, or prior author result by construction. The integration of scene-centric optical flow and world queries is presented as a design choice whose value is assessed end-to-end via task metrics rather than assumed tautologically. This is the standard case of an empirical contribution with independent validation.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

Based on abstract only. The approach assumes single-frame observations inherently limit spatiotemporal reasoning and that optical flow plus world queries can substitute for explicit future-state modeling. No free parameters or invented entities are quantified in the provided text.

axioms (1)
  • domain assumption Reliance on single-frame observations restricts spatiotemporal reasoning capabilities in mainstream VLAs
    Stated directly as the primary cause of the performance gap in dynamic environments.
invented entities (1)
  • specialized world queries no independent evidence
    purpose: To implicitly forecast object-centric future states when combined with historical optical flow
    Introduced as part of the PUMA architecture to couple history-aware perception with short-horizon prediction.

pith-pipeline@v0.9.0 · 5516 in / 1350 out tokens · 39568 ms · 2026-05-15T09:46:39.307148+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

66 extracted references · 66 canonical work pages · 19 internal anchors

  1. [1]

    Qwen3-VL Technical Report

    Bai, S., Cai, Y., Chen, R., Chen, K., Chen, X., Cheng, Z., Deng, L., Ding, W., Gao, C., Ge, C., Ge, W., Guo, Z., Huang, Q., Huang, J., Huang, F., Hui, B., Jiang, S., Li, Z., Li, M., Li, M., Li, K., Lin, Z., Lin, J., Liu, X., Liu, J., Liu, C., Liu, Y., Liu, D., Liu, S., Lu, D., Luo, R., Lv, C., Men, R., Meng, L., Ren, X., Ren, X., Song, S., Sun, Y., Tang, ...

  2. [2]

    Belkhale, S., Ding, T., Xiao, T., Sermanet, P., Vuong, Q., Tompson, J., Chebotar, Y., Dwibedi, D., Sadigh, D.: RT-H: Action hierarchies using language. arXiv preprint arXiv:2403.01823 (2024) 13

  3. [3]

    GR00T N1: An Open Foundation Model for Generalist Humanoid Robots

    Bjorck, J., Castañeda, F., Cherniadev, N., Da, X., Ding, R., Fan, L., Fang, Y., Fox, D., Hu, F., Huang, S., et al.: Gr00t n1: An open foundation model for generalist humanoid robots. arXiv preprint arXiv:2503.14734 (2025) 1, 11, 13

  4. [4]

    In: 9th Annual Conference on Robot Learn- ing (2025) 1, 9, 11, 13, 26, 28, 29, 30, 31

    Black, K., Brown, N., Darpinian, J., Dhabalia, K., Driess, D., Esmail, A., Equi, M.R., Finn, C., Fusai, N., Galliker, M.Y., et al.: π0.5: a vision-language-action model with open-world generalization. In: 9th Annual Conference on Robot Learning (2025) 1, 9, 11, 13, 26, 28, 29, 30, 31

  5. [5]

    $\pi_0$: A Vision-Language-Action Flow Model for General Robot Control

    Black, K., Brown, N., Driess, D., Esmail, A., Equi, M., Finn, C., Fusai, N., Groom, L., Hausman, K., Ichter, B., et al.: π0: A vision-language-action flow model for general robot control. arXiv preprint arXiv:2410.24164 (2024) 1, 3, 9, 11, 13, 26

  6. [6]

    Robotics: Science and Systems XIX (2023) 2, 13

    Brohan, A., Brown, N., Carbajal, J., Chebotar, Y., Dabis, J., Finn, C., Gopalakr- ishnan, K., Hausman, K., Herzog, A., Hsu, J., et al.: Rt-1: Robotics transformer for real-world control at scale. Robotics: Science and Systems XIX (2023) 2, 13

  7. [7]

    AgiBot World Colosseo: A Large-scale Manipulation Platform for Scalable and Intelligent Embodied Systems

    Bu, Q., Cai, J., Chen, L., Cui, X., Ding, Y., Feng, S., Gao, S., He, X., Hu, X., Huang, X., et al.: Agibot world colosseo: A large-scale manipulation platform for scalable and intelligent embodied systems. arXiv preprint arXiv:2503.06669 (2025) 14

  8. [8]

    UniVLA: Learning to Act Anywhere with Task-centric Latent Actions

    Bu, Q., Yang, Y., Cai, J., Gao, S., Ren, G., Yao, M., Luo, P., Li, H.: Uni- vla: Learning to act anywhere with task-centric latent actions. arXiv preprint arXiv:2505.06111 (2025) 13

  9. [9]

    WorldVLA: Towards Autoregressive Action World Model

    Cen, J., Yu, C., Yuan, H., Jiang, Y., Huang, S., Guo, J., Li, X., Song, Y., Luo, H., Wang, F., et al.: Worldvla: Towards autoregressive action world model. arXiv preprint arXiv:2506.21539 (2025) 3

  10. [10]

    RoboTwin 2.0: A Scalable Data Generator and Benchmark with Strong Domain Randomization for Robust Bimanual Robotic Manipulation

    Chen, T., Chen, Z., Chen, B., Cai, Z., Liu, Y., Li, Z., Liang, Q., Lin, X., Ge, Y., Gu, Z., et al.: Robotwin 2.0: A scalable data generator and benchmark with strong domain randomization for robust bimanual robotic manipulation. arXiv preprint arXiv:2506.18088 (2025) 2, 5, 9, 14, 21

  11. [11]

    StarVLA: A Lego-like Codebase for Vision-Language-Action Model Developing

    starVLA Contributors: StarVLA: A lego-like codebase for vision-language-action model developing. GitHub repository (1 2025). https://doi.org/10.5281/zenodo.18264214, https://github.com/starVLA/starVLA 9, 11, 12, 27

  12. [12]

    InternVLA-M1: A Spatially Guided Vision-Language-Action Framework for Generalist Robot Policy

    Contributors, I.M.: Internvla-m1: A spatially guided vision-language-action frame- work for generalist robot policy. arXiv preprint arXiv:2510.13778 (2025) 9, 11, 13, 25, 26

  13. [13]

    OpenHelix: A Short Survey, Empirical Analysis, and Open-Source Dual-System VLA Model for Robotic Manipulation

    Cui, C., Ding, P., Song, W., Bai, S., Tong, X., Ge, Z., Suo, R., Zhou, W., Liu, Y., Jia, B., Zhao, H., Huang, S., Wang, D.: Openhelix: A short survey, empirical analysis, and open-source dual-system vla model for robotic manipulation. arXiv preprint arXiv:2505.03912 (2025) 1

  14. [14]

    In: International Conference on Machine Learning

    Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. In: International Conference on Machine Learning. pp. 8469–8488. PMLR (2023) 13

  15. [15]

    In: Proceedings of the IEEE/CVF conference on computer vision and pattern recog- nition

    Ehsani, K., Han, W., Herrasti, A., VanderBilt, E., Weihs, L., Kolve, E., Kembhavi, A., Mottaghi, R.: Manipulathor: A framework for visual object manipulation. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recog- nition. pp. 4497–4506 (2021) 14

  16. [16]

    arXiv preprint arXiv:2505.02152 (2025) 13

    Fan, C., Jia, X., Sun, Y., Wang, Y., Wei, J., Gong, Z., Zhao, X., Tomizuka, M., Yang, X., Yan, J., et al.: Interleave-vla: Enhancing robot manipulation with inter- leaved image-text instructions. arXiv preprint arXiv:2505.02152 (2025) 13

  17. [17]

    In: Scandinavian conference on Image analysis

    Farnebäck, G.: Two-frame motion estimation based on polynomial expansion. In: Scandinavian conference on Image analysis. pp. 363–370. Springer (2003) 23

  18. [18]

    IEEE Robotics and Automation Letters (2020) 14

    James, S., Ma, Z., Rovick Arrojo, D., Davison, A.J.: Rlbench: The robot learn- ing benchmark & learning environment. IEEE Robotics and Automation Letters (2020) 14

  19. [19]

    arXiv preprint arXiv:2510.04246 (2025) 13

    Jang, H., Yu, S., Kwon, H., Jeon, H., Seo, Y., Shin, J.: ContextVLA: Vision-language-action model with amortized multi-frame context. arXiv preprint arXiv:2510.04246 (2025) 13

  20. [20]

    In: International Conference on Machine Learning

    Jiang, Y., Gupta, A., Zhang, Z., Wang, G., Dou, Y., Chen, Y., Fei-Fei, L., Anand- kumar, A., Zhu, Y., Fan, L.: Vima: Robot manipulation with multimodal prompts. In: International Conference on Machine Learning. pp. 14975–15022. PMLR (2023) 13

  21. [21]

    In: 2025 IEEE International Conference on Robotics and Automation (ICRA)

    Jiang, Z., Xie, Y., Lin, K., Xu, Z., Wan, W., Mandlekar, A., Fan, L.J., Zhu, Y.: Dexmimicgen: Automated data generation for bimanual dexterous manipulation via imitation learning. In: 2025 IEEE International Conference on Robotics and Automation (ICRA). pp. 16923–16930. IEEE (2025) 14

  22. [22]

    Artificial intelligence101(1-2), 99–134 (1998) 3

    Kaelbling, L.P., Littman, M.L., Cassandra, A.R.: Planning and acting in partially observable stochastic domains. Artificial intelligence101(1-2), 99–134 (1998) 3

  23. [23]

    In: Robotics: Science and Systems (2024) 14

    Khazatsky, A., Pertsch, K., Nair, S., Balakrishna, A., Dasari, S., Karamcheti, S., Nasiriany, S., Srirama, M.K., Chen, L.Y., Ellis, K., et al.: Droid: A large-scale in-the-wild robot manipulation dataset. In: Robotics: Science and Systems (2024) 14

  24. [24]

    Fine-Tuning Vision-Language-Action Models: Optimizing Speed and Success

    Kim, M.J., Finn, C., Liang, P.: Fine-tuning vision-language-action models: Opti- mizing speed and success. arXiv preprint arXiv:2502.19645 (2025) 1, 9, 10, 11, 13, 24, 25, 28, 29, 30, 31

  25. [25]

    In: 8th Annual Conference on Robot Learning 1, 3, 9, 11, 13, 24, 25

    Kim, M.J., Pertsch, K., Karamcheti, S., Xiao, T., Balakrishna, A., Nair, S., Rafailov, R., Foster, E.P., Sanketi, P.R., Vuong, Q., et al.: Openvla: An open-source vision-language-action model. In: 8th Annual Conference on Robot Learning 1, 3, 9, 11, 13, 24, 25

  26. [26]

    HAMLET: Switch your Vision-Language-Action Model into a History-Aware Policy

    Koo, M., Choi, D., Kim, T., Lee, K., Kim, C., Seo, Y., Shin, J.: Hamlet: Switch your vision-language-action model into a history-aware policy. arXiv preprint arXiv:2510.00695 (2025) 13

  27. [27]

    Evaluating Real-World Robot Manipulation Policies in Simulation

    Li, X., Hsu, K., Gu, J., Pertsch, K., Mees, O., Walke, H.R., Fu, C., Lunawat, I., Sieh, I., Kirmani, S., Levine, S., Wu, J., Finn, C., Su, H., Vuong, Q., Xiao, T.: Evaluating real-world robot manipulation policies in simulation. arXiv preprint arXiv:2405.05941 (2024) 14

  28. [28]

    HiF-VLA: Hindsight, Insight and Foresight through Motion Representation for Vision-Language-Action Models

    Lin, M., Ding, P., Wang, S., Zhuang, Z., Liu, Y., Tong, X., Song, W., Lyu, S., Huang, S., Wang, D.: HiF-VLA: Hindsight, insight and foresight through motion representation for vision-language-action models. arXiv preprint arXiv:2512.09928 (2025) 13

  29. [29]

    Advances in Neural Information Processing Systems36, 44776–44791 (2023) 14

    Liu, B., Zhu, Y., Gao, C., Feng, Y., Liu, Q., Zhu, Y., Stone, P.: Libero: Benchmark- ing knowledge transfer for lifelong robot learning. Advances in Neural Information Processing Systems36, 44776–44791 (2023) 14

  30. [30]

    arXiv preprint arXiv:2508.19257 (2025) 13

    Liu, C., Zhang, J., Li, C., Zhou, Z., Wu, S., Huang, S., Duan, H.: Ttf-vla: Tempo- ral token fusion via pixel-attention integration for vision-language-action models. arXiv preprint arXiv:2508.19257 (2025) 13

  31. [31]

    Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection

    Liu, S., Zeng, Z., Ren, T., Li, F., Zhang, H., Yang, J., Li, C., Yang, J., Su, H., Zhu, J., et al.: Grounding dino: Marrying dino with grounded pre-training for open-set object detection. arXiv preprint arXiv:2303.05499 (2023) 8, 23

  32. [32]

    RDT-1B: a Diffusion Foundation Model for Bimanual Manipulation

    Liu, S., Wu, L., Li, B., Tan, H., Chen, H., Wang, Z., Xu, K., Su, H., Zhu, J.: Rdt-1b: a diffusion foundation model for bimanual manipulation. arXiv preprint arXiv:2410.07864 (2024) 3, 9, 11, 13, 24, 25

  33. [33]

    International Conference on Learning Representations (2019) 22

    Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. International Conference on Learning Representations (2019) 22

  34. [34]

    In: Thirty-fifth Confer- ence on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2) 14

    Makoviychuk, V., Wawrzyniak, L., Guo, Y., Lu, M., Storey, K., Macklin, M., Hoeller, D., Rudin, N., Allshire, A., Handa, A., et al.: Isaac gym: High perfor- mance gpu based physics simulation for robot learning. In: Thirty-fifth Confer- ence on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2) 14

  35. [35]

    IEEE Robotics and Automation Letters7(3), 7327–7334 (2022) 14

    Mees, O., Hermann, L., Rosete-Beas, E., Burgard, W.: Calvin: A benchmark for language-conditioned policy learning for long-horizon robot manipulation tasks. IEEE Robotics and Automation Letters7(3), 7327–7334 (2022) 14

  36. [36]

    In: Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2) 14

    Mu, T., Ling, Z., Xiang, F., Yang, D.C., Li, X., Tao, S., Huang, Z., Jia, Z., Su, H.: Maniskill: Generalizable manipulation skill benchmark with large-scale demon- strations. In: Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2) 14

  37. [37]

    In: Robotics: Science and Systems (2024) 14

    Nasiriany, S., Maddukuri, A., Zhang, L., Parikh, A., Lo, A., Joshi, A., Mandlekar, A., Zhu, Y.: Robocasa: Large-scale simulation of everyday tasks for generalist robots. In: Robotics: Science and Systems (2024) 14

  38. [38]

    O’Neill, A., Rehman, A., Maddukuri, A., Gupta, A., Padalkar, A., Lee, A., Pooley, A., Gupta, A., Mandlekar, A., Jain, A., et al.: Open x-embodiment: Robotic learn- ing datasets and rt-x models: Open x-embodiment collaboration 0. pp. 6892–6903. IEEE (2024) 2, 14

  39. [39]

    Optical Memory and Neural Networks 34(Suppl 1), S72–S82 (2025) 13

    Patratskiy, M.A., Kovalev, A.K., Panov, A.I.: Spatial traces: Enhancing vla mod- els with spatial-temporal understanding. Optical Memory and Neural Networks 34(Suppl 1), S72–S82 (2025) 13

  40. [40]

    FAST: Efficient Action Tokenization for Vision-Language-Action Models

    Pertsch, K., Stachowicz, K., Ichter, B., Driess, D., Nair, S., Vuong, Q., Mees, O., Finn, C., Levine, S.: Fast: Efficient action tokenization for vision-language-action models. arXiv preprint arXiv:2501.09747 (2025) 9, 11, 25, 26

  41. [41]

    arXiv preprint arXiv:2310.13724 (2023) 14

    Puig, X., Undersander, E., Szot, A., Cote, M.D., Yang, T.Y., Partsey, R., Desai, R., Clegg, A.W., Hlavac, M., Min, S.Y., et al.: Habitat 3.0: A co-habitat for humans, avatars and robots. arXiv preprint arXiv:2310.13724 (2023) 14

  42. [42]

    arXiv preprint arXiv:2402.08191 (2024) 14

    Pumacay, W., Singh, I., Duan, J., Krishna, R., Thomason, J., Fox, D.: The colos- seum: A benchmark for evaluating generalization for robotic manipulation. arXiv preprint arXiv:2402.08191 (2024) 14

  43. [43]

    arXiv preprint arXiv:2602.03983 (2026) 14

    Qiu, W., Huang, T., Feng, A., Ying, R.: Efficient long-horizon vision- language-action models via static-dynamic disentanglement. arXiv preprint arXiv:2602.03983 (2026) 14

  44. [44]

    SAM 2: Segment Anything in Images and Videos

    Ravi, N., Gabeur, V., Hu, Y.T., Hu, R., Ryali, C., Ma, T., Khedr, H., Rädle, R., Rolland, C., Gustafson, L., Mintun, E., Pan, J., Alwala, K.V., Carion, N., Wu, C.Y., Girshick, R., Dollár, P., Feichtenhofer, C.: SAM 2: Segment anything in images and videos (2024), https://arxiv.org/abs/2408.00714 8, 23

  45. [45]

    Transactions on Machine Learning Research 13

    Reed, S., Zolna, K., Parisotto, E., Colmenarejo, S.G., Novikov, A., Barth-maron, G., Giménez, M., Sulsky, Y., Kay, J., Springenberg, J.T., et al.: A generalist agent. Transactions on Machine Learning Research 13

  46. [46]

    Ren, T., Liu, S., Zeng, A., Lin, J., Li, K., Cao, H., Chen, J., Huang, X., Chen, Y., Yan, F., Zeng, Z., Zhang, H., Li, F., Yang, J., Li, H., Jiang, Q., Zhang, L.: Grounded sam: Assembling open-world models for diverse visual tasks (2024) 8, 23

  47. [47]

    MemoryVLA: Perceptual-Cognitive Memory in Vision-Language-Action Models for Robotic Manipulation

    Shi, H., Xie, B., Liu, Y., Sun, L., Liu, F., Wang, T., Zhou, E., Fan, H., Zhang, X., Huang, G.: Memoryvla: Perceptual-cognitive memory in vision-language-action models for robotic manipulation. arXiv preprint arXiv:2508.19236 (2025) 1, 13

  48. [48]

    In: Conference on robot learning

    Shridhar, M., Manuelli, L., Fox, D.: Cliport: What and where pathways for robotic manipulation. In: Conference on robot learning. pp. 894–906. PMLR (2022) 13

  49. [49]

    arXiv preprint arXiv:2508.10333 (2025) 14

    Song, W., Zhou, Z., Zhao, H., Chen, J., Ding, P., Yan, H., Huang, Y., Tang, F., Wang, D., Li, H.: Reconvla: Reconstructive vision-language-action model as effective robot perceiver. arXiv preprint arXiv:2508.10333 (2025) 14

  50. [50]

    Octo: An Open-Source Generalist Robot Policy

    Team, O.M., Ghosh, D., Walke, H., Pertsch, K., Black, K., Mees, O., Dasari, S., Hejna, J., Kreiman, T., Xu, C., et al.: Octo: An open-source generalist robot policy. arXiv preprint arXiv:2405.12213 (2024) 13

  51. [51]

    In: Conference on Robot Learning

    Walke, H.R., Black, K., Zhao, T.Z., Vuong, Q., Zheng, C., Hansen-Estruch, P., He, A.W., Myers, V., Kim, M.J., Du, M., et al.: Bridgedata v2: A dataset for robot learning at scale. In: Conference on Robot Learning. pp. 1723–1736. PMLR (2023) 2, 14

  52. [52]

    arXiv preprint arXiv:2509.09372 (2025) 1, 9, 11, 25, 26

    Wang, Y., Ding, P., Li, L., Cui, C., Ge, Z., Tong, X., Song, W., Zhao, H., Zhao, W., Hou, P., Huang, S., Tang, Y., Wang, W., Zhang, R., Liu, J., Wang, D.: Vla- adapter: An effective paradigm for tiny-scale vision-language-action model. arXiv preprint arXiv:2509.09372 (2025) 1, 9, 11, 25, 26

  53. [53]

    In: Findings of the Association for Computational Linguistics: EMNLP 2024

    Wang, Y., Yue, Z., Zeng, H., Wang, D., McAuley, J.: Train once, deploy anywhere: Matryoshka representation learning for multimodal recommendation. In: Findings of the Association for Computational Linguistics: EMNLP 2024. pp. 13461–13472 (2024) 14

  54. [54]

    arXiv preprint arXiv:2412.13877 (2024) 14

    Wu, K., Hou, C., Liu, J., Che, Z., Ju, X., Yang, Z., Li, M., Zhao, Y., Xu, Z., Yang, G., et al.: Robomind: Benchmark on multi-embodiment intelligence normative data for robot manipulation. arXiv preprint arXiv:2412.13877 (2024) 14

  55. [55]

    In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (June 2020) 2, 5

    Xiang, F., Qin, Y., Mo, K., Xia, Y., Zhu, H., Liu, F., Liu, M., Jiang, H., Yuan, Y., Wang, H., Yi, L., Chang, A.X., Guibas, L.J., Su, H.: SAPIEN: A simulated part-based interactive environment. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (June 2020) 2, 5

  56. [56]

    arXiv preprint arXiv:2510.17950 (2025) 14

    Yakefu, A., Xie, B., Xu, C., Zhang, E., Zhou, E., Jia, F., Yang, H., Fan, H., Zhang, H., Peng, H., et al.: Robochallenge: Large-scale real-robot evaluation of embodied policies. arXiv preprint arXiv:2510.17950 (2025) 14

  57. [57]

    arXiv preprint arXiv:2512.22615 (2025) 3

    Ye, J., Gong, S., Gao, J., Fan, J., Wu, S., Bi, W., Bai, H., Shang, L., Kong, L.: Dream-vl & dream-vla: Open vision-language and vision-language-action models with diffusion language model backbone. arXiv preprint arXiv:2512.22615 (2025) 3

  58. [58]

    In: Conference on robot learning

    Yu, T., Quillen, D., He, Z., Julian, R., Hausman, K., Finn, C., Levine, S.: Meta-World: A benchmark and evaluation for multi-task and meta reinforcement learning. In: Conference on robot learning. pp. 1094–1100. PMLR (2020) 14

  59. [59]

    Navid: Video-based vlm plans the next step for vision-and-language navigation,

    Zhang, J., Wang, K., Xu, R., Zhou, G., Hong, Y., Fang, X., Wu, Q., Zhang, Z., Wang, H.: Navid: Video-based vlm plans the next step for vision-and-language navigation. arXiv preprint arXiv:2402.15852 (2024) 13

  60. [60]

    In: Proceedings of the IEEE/CVF International Conference on Computer Vision

    Zhang, S., Xu, Z., Liu, P., Yu, X., Li, Y., Gao, Q., Fei, Z., Yin, Z., Wu, Z., Jiang, Y.G., et al.: Vlabench: A large-scale benchmark for language-conditioned robotics manipulation with long-horizon reasoning tasks. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 11142–11152 (2025) 14

  61. [61]

    arXiv preprint arXiv:2507.04447 (2025) 3, 7, 14

    Zhang, W., Liu, H., Qi, Z., Wang, Y., Yu, X., Zhang, J., Dong, R., He, J., Lu, F., Wang, H., et al.: Dreamvla: a vision-language-action model dreamed with compre- hensive world knowledge. arXiv preprint arXiv:2507.04447 (2025) 3, 7, 14

  62. [62]

    IEEE Robotics and Automation Letters (2025) 14

    Zhang, Y., Wang, R., Chen, X.: Dynamic behavior cloning with temporal feature prediction: Enhancing robotic arm manipulation in moving object tasks. IEEE Robotics and Automation Letters (2025) 14

  63. [63]

    Robotics: Science and Systems XIX (2023) 3, 6, 9, 28, 29, 30, 31

    Zhao, T., Kumar, V., Levine, S., Finn, C.: Learning fine-grained bimanual manip- ulation with low-cost hardware. Robotics: Science and Systems XIX (2023) 3, 6, 9, 28, 29, 30, 31

  64. [64]

    TraceVLA: Visual Trace Prompting Enhances Spatial-Temporal Awareness for Generalist Robotic Policies

    Zheng, R., Liang, Y., Huang, S., Gao, J., Daumé III, H., Kolobov, A., Huang, F., Yang, J.: Tracevla: Visual trace prompting enhances spatial-temporal awareness for generalist robotic policies. arXiv preprint arXiv:2412.10345 (2024) 13

  65. [65]

    robosuite: A Modular Simulation Framework and Benchmark for Robot Learning

    Zhu, Y., Wong, J., Mandlekar, A., Martín-Martín, R., Joshi, A., Lin, K., Mad- dukuri, A., Nasiriany, S., Zhu, Y.: robosuite: A modular simulation framework and benchmark for robot learning. arXiv preprint arXiv:2009.12293 (2020) 14

  66. [66]

    RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control

    Zitkovich, B., Yu, T., Xu, S., Xu, P., Xiao, T., Xia, F., Wu, J., Wohlhart, P., Welker, S., Wahid, A., et al.: Rt-2: Vision-language-action models transfer web knowledge to robotic control. In: Conference on Robot Learning. pp. 2165–2183. PMLR (2023) 13