pith. machine review for the scientific record.

arxiv: 2508.19236 · v2 · submitted 2025-08-26 · 💻 cs.RO · cs.CV

Recognition: 2 theorem links

MemoryVLA: Perceptual-Cognitive Memory in Vision-Language-Action Models for Robotic Manipulation

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 20:39 UTC · model grok-4.3

classification 💻 cs.RO cs.CV
keywords: vision-language-action models · robotic manipulation · memory bank · long-horizon tasks · perceptual-cognitive tokens · diffusion action expert · temporal context

The pith

MemoryVLA adds a perceptual-cognitive memory bank to vision-language-action models to supply temporal context for long-horizon robotic manipulation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Mainstream vision-language-action models treat each observation as independent and therefore falter when a task requires remembering earlier steps. The paper proposes a memory system modeled on human working memory and episodic recall that encodes current observations into perceptual and cognitive tokens, stores consolidated details and semantics in a dedicated bank, and retrieves relevant entries for fusion with the current state. These enriched tokens then condition a diffusion model that generates action sequences. The method is evaluated on simulation suites and on twelve real-world tasks across three robots, with the largest gains appearing on tasks that span many steps.

Core claim

A pretrained vision-language model produces perceptual and cognitive tokens that serve as working memory. These tokens interact with a Perceptual-Cognitive Memory Bank that retains low-level visual details and high-level semantic summaries. Adaptive retrieval selects relevant past entries, fuses them with the current tokens, and merges redundancies before updating the bank. The resulting memory-conditioned tokens drive a diffusion-based action expert that outputs sequences aware of temporal dependencies.
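
To make the claimed data flow concrete, here is a minimal PyTorch-style sketch of one memory-conditioned step, assuming retrieval is cross-attention from the current tokens into the bank and that each step is consolidated as a mean-pooled summary; the class, dimensions, and update rule below are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

class PerceptualCognitiveMemorySketch(nn.Module):
    """Illustrative memory bank: store past token summaries, retrieve them
    with cross-attention, fuse with the current working-memory tokens."""

    def __init__(self, dim: int = 512, max_entries: int = 256):
        super().__init__()
        self.max_entries = max_entries
        self.retrieve = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.fuse = nn.Linear(2 * dim, dim)
        self.register_buffer("bank", torch.zeros(0, dim))  # (N, dim) stored summaries

    def forward(self, working_tokens: torch.Tensor) -> torch.Tensor:
        # working_tokens: (1, T, dim) perceptual + cognitive tokens from the VLM
        if self.bank.shape[0] == 0:
            fused = working_tokens
        else:
            # query current tokens against all stored entries (adaptive retrieval)
            context, _ = self.retrieve(working_tokens, self.bank[None], self.bank[None])
            fused = self.fuse(torch.cat([working_tokens, context], dim=-1))
        # consolidate this step into the bank as a single pooled summary
        summary = fused.mean(dim=1)                                   # (1, dim)
        self.bank = torch.cat([self.bank, summary.detach()], dim=0)[-self.max_entries:]
        return fused  # memory-conditioned tokens for the diffusion action expert
```

In this reading, the fused tokens replace the raw working-memory tokens as the conditioning input to the diffusion action expert.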

What carries the argument

The Perceptual-Cognitive Memory Bank, which stores and adaptively retrieves low-level perceptual details together with high-level semantic gist from prior observations.

If this is right

  • Robots can complete manipulation sequences that span many steps without requiring hand-crafted history features.
  • Gains are concentrated on tasks with explicit temporal dependencies while general skills remain competitive.
  • The same memory bank can be paired with any pretrained vision-language model and any diffusion action head.
  • Success rates of 71.9 percent on SimplerEnv-Bridge, 72.7 percent on Fractal, 96.5 percent on LIBERO-5, and 84.0 percent on twelve real-world tasks become achievable.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same retrieval-and-fusion pattern could be tested in other sequential domains such as autonomous driving or game playing where history matters.
  • Scaling the bank size or adding decay mechanisms might be needed if the number of stored entries grows large.
  • Comparing the bank against a simple transformer memory layer on the same tasks would isolate the benefit of the perceptual-cognitive split; a minimal sketch of such a baseline follows below.
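
A minimal version of that baseline, purely as a sketch: self-attention over a sliding window of past frame summaries, with no perceptual-cognitive split, retrieval, or merging. The window length and dimensions are assumptions chosen for illustration.

```python
import torch
import torch.nn as nn

class SlidingWindowMemoryBaseline(nn.Module):
    """Naive history baseline: a transformer encoder over the last K
    frame summaries, no explicit bank and no redundancy merging."""

    def __init__(self, dim: int = 512, window: int = 16):
        super().__init__()
        self.window = window
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.history: list[torch.Tensor] = []

    def forward(self, frame_summary: torch.Tensor) -> torch.Tensor:
        # frame_summary: (1, dim) pooled tokens for the current observation
        self.history = (self.history + [frame_summary])[-self.window:]
        seq = torch.stack(self.history, dim=1)   # (1, K, dim)
        return self.encoder(seq)[:, -1]          # history-aware current state
```

Matching this baseline's compute and training budget to the full model would make any remaining gap attributable to the bank itself.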

Load-bearing premise

Adaptive retrieval, fusion, and redundancy merging will consistently deliver useful past context without injecting noise or stale information that degrades action generation.
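
The paper does not spell out the merge rule here, so the sketch below is only one plausible reading: a new summary is averaged into its nearest stored neighbour when cosine similarity exceeds a threshold, and appended otherwise. The threshold and the absence of any decay are exactly where stale or noisy context could enter.

```python
import torch
import torch.nn.functional as F

def merge_into_bank(bank: torch.Tensor, entry: torch.Tensor,
                    threshold: float = 0.9) -> torch.Tensor:
    """Hypothetical redundancy merge. bank: (N, dim), entry: (dim,)."""
    if bank.numel() == 0:
        return entry[None]
    sims = F.cosine_similarity(bank, entry[None], dim=-1)   # (N,)
    best = int(sims.argmax())
    if sims[best] >= threshold:
        bank = bank.clone()
        bank[best] = 0.5 * (bank[best] + entry)   # average near-duplicates
        return bank
    return torch.cat([bank, entry[None]], dim=0)  # keep genuinely new entries
```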

What would settle it

Ablating the memory bank entirely and measuring no drop (or an increase) in success rate specifically on the long-horizon real-world tasks would falsify the claim that the bank supplies necessary temporal context.
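
Sketched below under stated assumptions: the test only needs a shared rollout hook and the same long-horizon task set for both variants. The names `run_episode`, `memory_vla`, and `no_memory_vla` are hypothetical, not artifacts from the paper.

```python
def success_rate(policy, tasks, run_episode, trials_per_task=10):
    """Estimate success rate: run_episode(policy, task) is a hypothetical
    rollout hook that returns True when the episode ends in success."""
    outcomes = [run_episode(policy, task)
                for task in tasks
                for _ in range(trials_per_task)]
    return sum(outcomes) / len(outcomes)

# full = success_rate(memory_vla, long_horizon_tasks, rollout)
# ablated = success_rate(no_memory_vla, long_horizon_tasks, rollout)
# The bank's necessity claim fails if ablated >= full on these tasks.
```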

read the original abstract

Temporal context is essential for robotic manipulation because such tasks are inherently non-Markovian, yet mainstream VLA models typically overlook it and struggle with long-horizon, temporally dependent tasks. Cognitive science suggests that humans rely on working memory to buffer short-lived representations for immediate control, while the hippocampal system preserves verbatim episodic details and semantic gist of past experience for long-term memory. Inspired by these mechanisms, we propose MemoryVLA, a Cognition-Memory-Action framework for long-horizon robotic manipulation. A pretrained VLM encodes the observation into perceptual and cognitive tokens that form working memory, while a Perceptual-Cognitive Memory Bank stores low-level details and high-level semantics consolidated from it. Working memory retrieves decision-relevant entries from the bank, adaptively fuses them with current tokens, and updates the bank by merging redundancies. Using these tokens, a memory-conditioned diffusion action expert yields temporally aware action sequences. We evaluate MemoryVLA on 150+ simulation and real-world tasks across three robots. On SimplerEnv-Bridge, Fractal, LIBERO-5 suites and Mikasa-Robo, it achieves 71.9%, 72.7%, 96.5%, and 41.2% success rates, respectively, all outperforming state-of-the-art baselines CogACT and pi-0, with a notable +14.6 gain on Bridge and +11.8 gain on Mikasa-Robo. On 12 real-world tasks spanning general skills and long-horizon temporal dependencies, MemoryVLA achieves 84.0% success rate, with long-horizon tasks showing a +26 improvement over state-of-the-art baseline. Project Page: https://shihao1895.github.io/MemoryVLA

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated author's rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes MemoryVLA, a Cognition-Memory-Action framework for vision-language-action (VLA) models in robotic manipulation. It introduces a Perceptual-Cognitive Memory Bank that stores low-level perceptual details and high-level semantic tokens from a pretrained VLM, with working memory performing adaptive retrieval, fusion, and redundancy merging to supply temporal context. A memory-conditioned diffusion action expert then generates actions. The method is evaluated on 150+ simulation and real-world tasks across three robots, reporting success rates of 71.9% on SimplerEnv-Bridge, 72.7% on Fractal, 96.5% on LIBERO-5, 41.2% on Mikasa-Robo, and 84.0% on 12 real-world tasks (with +26 gain on long-horizon subsets), outperforming baselines CogACT and pi-0.

Significance. If the memory bank's operations are shown to be the causal driver of the reported gains, the work would meaningfully advance VLA models for non-Markovian, long-horizon manipulation by explicitly incorporating cognitive-inspired memory mechanisms. The breadth of evaluation across public simulation suites and real-world tasks on multiple robots provides a solid empirical foundation for assessing temporal awareness in action generation.

major comments (2)
  1. [Abstract and Experiments] The central claim that the Perceptual-Cognitive Memory Bank's adaptive retrieval, fusion, and redundancy-merging steps supply temporally relevant context and drive the +26 improvement on long-horizon tasks is not supported by any ablation that isolates these operations. No results are shown for a controlled no-memory VLA variant that retains the same VLM encoder and diffusion expert while removing the memory bank, leaving open the possibility that the gains arise from data scale, training differences, or the expert architecture instead.
  2. [Results] Reported success rates (e.g., 84.0% real-world, 71.9% on Bridge) are given as point estimates without error bars, trial counts, or statistical tests against the baselines, yet these figures are load-bearing for claims of consistent outperformance on temporally dependent tasks.
minor comments (2)
  1. [Abstract] The abstract states evaluation on '150+ simulation and real-world tasks' but does not break down the exact counts per suite or identify which of the 12 real-world tasks are the long-horizon subset used for the +26 gain.
  2. [Method] The description of the memory bank's redundancy-merging step would benefit from pseudocode or explicit update rules clarifying how stale or noisy entries are handled during fusion.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and constructive comments. We address each major point below and will revise the manuscript to strengthen the empirical support for our claims.

read point-by-point responses
  1. Referee: [Abstract and Experiments] The central claim that the Perceptual-Cognitive Memory Bank's adaptive retrieval, fusion, and redundancy-merging steps supply temporally relevant context and drive the +26 improvement on long-horizon tasks is not supported by any ablation that isolates these operations. No results are shown for a controlled no-memory VLA variant that retains the same VLM encoder and diffusion expert while removing the memory bank, leaving open the possibility that the gains arise from data scale, training differences, or the expert architecture instead.

    Authors: We agree that an explicit controlled ablation isolating the memory bank is necessary to strengthen the causal link between the memory mechanisms and the reported gains. In the revised manuscript we will add a new ablation that compares the full MemoryVLA model against a no-memory variant that uses exactly the same pretrained VLM encoder and diffusion action expert, with the memory bank, retrieval, fusion, and redundancy-merging components removed. This will be reported in the Experiments section with the same evaluation protocol. revision: yes

  2. Referee: [Results] Reported success rates (e.g., 84.0% real-world, 71.9% on Bridge) are given as point estimates without error bars, trial counts, or statistical tests against the baselines, yet these figures are load-bearing for claims of consistent outperformance on temporally dependent tasks.

    Authors: We acknowledge that the current presentation lacks measures of variability and statistical comparison. In the revised Results section we will report the exact number of evaluation trials per task (50 trials for simulation suites and 10 trials for each real-world task), include error bars showing standard deviation across three independent training seeds, and add paired t-test p-values comparing MemoryVLA against CogACT and pi-0 on the long-horizon subsets. Updated tables and figures will reflect these additions. revision: yes
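
For reference, the promised analysis reduces to a few lines of SciPy; the per-seed numbers below are placeholders for illustration, not results reported by the paper.

```python
import numpy as np
from scipy import stats

# Placeholder per-seed success rates (%) on a long-horizon subset, 3 seeds each.
memoryvla = np.array([83.0, 84.5, 84.5])
baseline  = np.array([58.0, 57.5, 60.0])

print(f"MemoryVLA: {memoryvla.mean():.1f} ± {memoryvla.std(ddof=1):.1f}")
print(f"Baseline:  {baseline.mean():.1f} ± {baseline.std(ddof=1):.1f}")

# Paired t-test across seeds (seed i of one method matched to seed i of the other).
t_stat, p_value = stats.ttest_rel(memoryvla, baseline)
print(f"paired t-test: t = {t_stat:.2f}, p = {p_value:.4f}")
```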

Circularity Check

0 steps flagged

No significant circularity; empirical claims rest on independent benchmark comparisons

full rationale

The paper presents an architectural proposal (Perceptual-Cognitive Memory Bank with retrieval/fusion/merging) whose performance is measured via direct success-rate comparisons on public suites (SimplerEnv-Bridge, LIBERO-5, Mikasa-Robo) and 12 real-world tasks against external baselines (CogACT, pi-0). No equation or claim reduces by construction to a fitted parameter, self-citation, or renamed input; the reported gains (+14.6 on Bridge, +26 on long-horizon real tasks) are obtained from held-out evaluation rather than any internal normalization or self-referential loop. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The framework rests on the domain assumption that robotic manipulation is non-Markovian and benefits from explicit memory consolidation; no free parameters or invented physical entities are introduced.

axioms (1)
  • domain assumption: Robotic manipulation tasks are inherently non-Markovian and require temporal context.
    Explicitly stated in the opening sentence of the abstract as the core motivation.
invented entities (1)
  • Perceptual-Cognitive Memory Bank (no independent evidence)
    purpose: Stores and consolidates low-level perceptual details and high-level semantic gist from working memory for later retrieval.
    New component introduced by the paper; no independent evidence outside the reported experiments is provided.

pith-pipeline@v0.9.0 · 5648 in / 1216 out tokens · 51123 ms · 2026-05-15T20:39:39.773638+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • HierarchyEmergence / PhiForcing · hierarchy_emergence_forces_phi · tagged unclear

    Relation between the paper passage and the cited Recognition theorem.

    On 12 real-world tasks spanning general skills and long-horizon temporal dependencies, MemoryVLA achieves 84.0% success rate, with long-horizon tasks showing a +26 improvement over state-of-the-art baseline.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 22 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. MindVLA-U1: VLA Beats VA with Unified Streaming Architecture for Autonomous Driving

    cs.RO 2026-05 unverdicted novelty 7.0

    MindVLA-U1 introduces a unified streaming VLA with shared backbone, framewise memory, and language-guided action diffusion that surpasses human drivers on WOD-E2E planning metrics.

  2. Overcoming Dynamics-Blindness: Training-Free Pace-and-Path Correction for VLA Models

    cs.RO 2026-05 unverdicted novelty 7.0

    Pace-and-Path Correction decomposes a quadratic cost minimization into orthogonal pace and path channels to correct chunked actions in VLA models, raising success rates by up to 28.8% in dynamic settings.

  3. ECHO: Continuous Hierarchical Memory for Vision-Language-Action Models

    cs.RO 2026-05 unverdicted novelty 7.0

    ECHO organizes VLA experiences into a hierarchical memory tree in hyperbolic space via autoencoder and entailment constraints, delivering a 12.8% success-rate gain on LIBERO-Long over the pi0 baseline.

  4. NoiseGate: Learning Per-Latent Timestep Schedules as Information Gating in World Action Models

    cs.RO 2026-05 unverdicted novelty 7.0

    NoiseGate learns per-latent timestep schedules as an information-gating policy in diffusion-based world action models, yielding consistent gains on RoboTwin manipulation tasks.

  5. OA-WAM: Object-Addressable World Action Model for Robust Robot Manipulation

    cs.RO 2026-05 unverdicted novelty 7.0

    OA-WAM uses persistent address vectors and dynamic content vectors in object slots to enable addressable world-action prediction, improving robustness on manipulation benchmarks under scene changes.

  6. CF-VLA: Efficient Coarse-to-Fine Action Generation for Vision-Language-Action Policies

    cs.CV 2026-04 unverdicted novelty 7.0

    CF-VLA uses a coarse initialization over endpoint velocity followed by single-step refinement to achieve strong performance with low inference steps on CALVIN, LIBERO, and real-robot tasks.

  7. π₀.₇: a Steerable Generalist Robotic Foundation Model with Emergent Capabilities

    cs.LG 2026-04 unverdicted novelty 7.0

    π₀.₇ is a steerable generalist robotic model that uses rich multimodal prompts including language, subgoal images, and performance metadata to achieve out-of-the-box generalization across tasks and robot bodies.

  8. Towards Generalizable Robotic Manipulation in Dynamic Environments

    cs.CV 2026-03 unverdicted novelty 7.0

    DOMINO dataset and PUMA architecture enable better dynamic robotic manipulation by incorporating motion history, delivering 6.3% higher success rates than prior VLA models.

  9. AR-VLA: True Autoregressive Action Expert for Vision-Language-Action Models

    cs.RO 2026-03 unverdicted novelty 7.0

    AR-VLA introduces a standalone autoregressive action expert with long-lived memory that generates context-aware continuous actions for VLAs, replacing chunk-based heads with smoother trajectories and maintained task success.

  10. PhysMem: Scaling Test-Time Memory for Embodied Physical Reasoning

    cs.RO 2026-02 unverdicted novelty 7.0

    PhysMem enables VLM-based robot planners to learn and verify physical properties through test-time interaction and hypothesis testing, raising success on a brick insertion task from 23% to 76%.

  11. MindVLA-U1: VLA Beats VA with Unified Streaming Architecture for Autonomous Driving

    cs.RO 2026-05 unverdicted novelty 6.0

    MindVLA-U1 is the first unified streaming VLA architecture that surpasses human drivers on WOD-E2E planning metrics while matching VA latency and preserving language interfaces.

  12. Overcoming Dynamics-Blindness: Training-Free Pace-and-Path Correction for VLA Models

    cs.RO 2026-05 unverdicted novelty 6.0

    Pace-and-Path Correction is a closed-form inference-time operator that decomposes a quadratic cost minimization into orthogonal pace compression and path offset channels to correct dynamics-blindness in chunked-action...

  13. RoboMemArena: A Comprehensive and Challenging Robotic Memory Benchmark

    cs.RO 2026-05 unverdicted novelty 6.0

    RoboMemArena is a new large-scale robotic memory benchmark with real-world tasks, and PrediMem is a dual VLA system that outperforms baselines by managing memory buffers with predictive coding.

  14. Retrieve-then-Steer: Online Success Memory for Test-Time Adaptation of Generative VLAs

    cs.RO 2026-05 unverdicted novelty 6.0

    Retrieve-then-steer stores successful observation-action segments in memory, retrieves relevant chunks, filters them, and uses an elite prior with confidence-adaptive guidance to steer a flow-matching action sampler f...

  15. Retrieve-then-Steer: Online Success Memory for Test-Time Adaptation of Generative VLAs

    cs.RO 2026-05 unverdicted novelty 6.0

    A retrieve-then-steer method stores successful robot actions in memory and uses them to steer a frozen VLA's flow-matching sampler for better test-time reliability without parameter updates.

  16. DexWorldModel: Causal Latent World Modeling towards Automated Learning of Embodied Tasks

    cs.CV 2026-04 unverdicted novelty 6.0

    CLWM with DINOv3 targets, O(1) TTT memory, SAI latency masking, and EmbodiChain training achieves SOTA dual-arm simulation performance and zero-shot sim-to-real transfer that beats real-data finetuned baselines.

  17. ProGAL-VLA: Grounded Alignment through Prospective Reasoning in Vision-Language-Action Models

    cs.RO 2026-04 unverdicted novelty 6.0

    ProGAL-VLA uses 3D graphs, symbolic sub-goals, and a Grounding Alignment Contrastive loss to ground actions on verified embeddings, raising robustness from 30.3% to 71.5% and ambiguity AUROC to 0.81 on robotic benchmarks.

  18. A Semantic Autonomy Framework for VLM-Integrated Indoor Mobile Robots: Hybrid Deterministic Reasoning and Cross-Robot Adaptive Memory

    cs.RO 2026-05 unverdicted novelty 5.0

    The Semantic Autonomy Stack combines a seven-step parametric resolver handling 88% of instructions in under 0.1 ms with VLM escalation and a five-category cross-robot memory system, achieving 100% accuracy and 103,000...

  19. Gated Memory Policy

    cs.RO 2026-04 unverdicted novelty 5.0

    GMP selectively activates and represents memory via a gate and lightweight cross-attention, yielding 30.1% higher success on non-Markovian robotic tasks while staying competitive on Markovian ones.

  20. World-Value-Action Model: Implicit Planning for Vision-Language-Action Systems

    cs.RO 2026-04 unverdicted novelty 5.0

    The World-Value-Action model enables implicit planning for VLA systems by performing inference over a learned latent representation of high-value future trajectories instead of direct action prediction.

  21. Goal2Skill: Long-Horizon Manipulation with Adaptive Planning and Reflection

    cs.RO 2026-04 unverdicted novelty 5.0

    A dual VLM-VLA framework for long-horizon robot manipulation achieves 32.4% success on RMBench tasks versus 9.8% for the strongest baseline via structured memory and closed-loop adaptive replanning.

  22. Causal World Modeling for Robot Control

    cs.CV 2026-01 unverdicted novelty 5.0

    LingBot-VA combines video world modeling with policy learning via Mixture-of-Transformers, closed-loop rollouts, and asynchronous inference to improve robot manipulation in simulation and real settings.
