arxiv: 2603.04639 · v2 · pith:PF2SNA4Nnew · submitted 2026-03-04 · 💻 cs.RO · cs.AI

RoboMME: Benchmarking and Understanding Memory for Robotic Generalist Policies

Pith reviewed 2026-05-21 12:23 UTC · model grok-4.3

classification 💻 cs.RO cs.AI
keywords robotic manipulation · memory mechanisms · vision-language-action models · long-horizon tasks · benchmark evaluation · task-dependent performance · generalist policies

The pith

The effectiveness of a memory representation for robotic policies depends on the specific task; no single design is best across all tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to create a consistent way to compare how different memory mechanisms help vision-language-action models handle robotic tasks that unfold over many steps and require recalling past information. It organizes evaluation around four memory categories and builds sixteen tasks to test them. Multiple variants are then created by adding different memory structures to one base model. A sympathetic reader would care because these kinds of tasks appear in everyday manipulation yet current models lack reliable ways to track history. If the findings are correct, future work can focus on choosing or combining memory types according to what a given task actually requires instead of searching for one universal solution.

Core claim

The authors claim that the effectiveness of memory representations is highly task-dependent, with each design offering distinct advantages and limitations across different tasks. This conclusion rests on a benchmark of sixteen manipulation tasks built under a taxonomy of temporal, spatial, object, and procedural memory, together with experiments on fourteen memory-augmented variants of a single base model.

What carries the argument

A taxonomy that divides memory requirements into temporal, spatial, object, and procedural categories, used to structure both the creation of test tasks and the comparison of integration strategies.

If this is right

  • Model builders should match memory mechanisms to the dominant requirement of a task, such as counting steps for temporal needs or recovering from occlusions for object needs.
  • Standardized benchmarks make it possible to measure incremental progress in history-dependent robotic manipulation instead of relying on isolated demonstrations.
  • Generalist policies may need to incorporate multiple memory types or switch between them when facing varied task demands.
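
The first implication above can be made concrete with a small sketch. It assumes a table of per-task success rates and a hand-written mapping from each task to the memory category it mainly stresses. The variant names, task names, and numbers are hypothetical placeholders for illustration, not results or identifiers from the paper.

```python
from collections import defaultdict

# Hypothetical success rates: variant -> task -> success rate in [0, 1].
# Illustrative placeholders only, not results from the paper.
SUCCESS = {
    "recurrent": {"count_cubes": 0.70, "find_hidden": 0.40},
    "history_attention": {"count_cubes": 0.55, "find_hidden": 0.65},
}

# Hypothetical mapping from task to the memory category it mainly stresses.
TASK_CATEGORY = {"count_cubes": "temporal", "find_hidden": "spatial"}


def best_variant_per_category(success, task_category):
    """Return {category: (variant, mean success)} averaged over that category's tasks."""
    rates = defaultdict(lambda: defaultdict(list))
    for variant, per_task in success.items():
        for task, rate in per_task.items():
            rates[task_category[task]][variant].append(rate)
    best = {}
    for category, per_variant in rates.items():
        means = {v: sum(r) / len(r) for v, r in per_variant.items()}
        winner = max(means, key=means.get)
        best[category] = (winner, means[winner])
    return best


if __name__ == "__main__":
    for category, (variant, score) in best_variant_per_category(
        SUCCESS, TASK_CATEGORY
    ).items():
        print(f"{category}: {variant} ({score:.2f})")
```

Averaging within a category is only one possible aggregation; a worst-case (minimum) score per category would be a more conservative choice when tasks in a category differ a lot in difficulty.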

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Hybrid memory systems that detect task features and activate the most suitable representation could extend performance across a wider range of scenarios.
  • Applying the same taxonomy to physical robot experiments would reveal whether simulation results hold when sensor noise and actuation errors are present.
  • Similar task-dependent patterns may appear in other sequential control problems such as navigation or assembly planning.

Load-bearing premise

The four memory categories accurately capture the needs of real long-horizon robotic manipulation and the sixteen tasks are representative enough to support general conclusions.

What would settle it

A follow-up test in which one memory design outperforms all others on every task in the set or on a fresh collection of long-horizon tasks that still fit the same overall description.
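
A minimal sketch of that falsification check, assuming results are available as a mapping from variant to per-task success rate. The function name and data layout are hypothetical and chosen for illustration; the paper does not prescribe this procedure.

```python
def dominant_variants(success):
    """Return the variants that are strictly best on every shared task.

    `success` maps variant -> task -> success rate. A tie on any task
    counts as a failure to dominate on that task.
    """
    variants = list(success)
    tasks = set.intersection(*(set(success[v]) for v in variants))
    if not tasks:
        raise ValueError("variants share no common tasks")
    winners = []
    for v in variants:
        others = [o for o in variants if o != v]
        if all(success[v][t] > success[o][t] for t in tasks for o in others):
            winners.append(v)
    return winners
```

An empty result is consistent with the task-dependence claim; a non-empty result on the full task set, or on a fresh set of tasks, would count against it. This sketch compares raw point estimates, so a real check would also need to account for run-to-run variance.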

Figures

Figures reproduced from arXiv: 2603.04639 by Chelsea Finn, Haoran Zhang, Hongze Fu, Jayjun Lee, Jianing Yang, Joyce Chai, Nima Fazeli, Yinpei Dai, Yuejiang Liu.

Figure 1: RoboMME is a large-scale robotic benchmark for evaluating memory-augmented manipulation, comprising four task suites that emphasize distinct memory demands. (1) The Counting suite targets temporal memory, requiring robots to accumulate and reason over past events, e.g., counting placed green cubes and stopping correctly (top-left). (2) The Permanence suite focuses on spatial memory, requiring tracking of o…
Original abstract

Memory is critical for long-horizon and history-dependent robotic manipulation. Such tasks often involve counting repeated actions or manipulating objects that become temporarily occluded. Recent vision-language-action (VLA) models have begun to incorporate memory mechanisms; however, their evaluations remain confined to narrow, non-standardized settings. This limits their systematic understanding, comparison, and progress measurement. To address these challenges, we introduce RoboMME: a large-scale standardized benchmark for evaluating and advancing VLA models in long-horizon, history-dependent scenarios. Our benchmark comprises 16 manipulation tasks constructed under a carefully designed taxonomy that evaluates temporal, spatial, object, and procedural memory. We further develop a suite of 14 memory-augmented VLA variants built on the π0.5 backbone to systematically explore different memory representations across multiple integration strategies. Experimental results show that the effectiveness of memory representations is highly task-dependent, with each design offering distinct advantages and limitations across different tasks. Videos and code can be found at our website https://robomme.github.io.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces RoboMME, a standardized benchmark of 16 long-horizon robotic manipulation tasks organized under a taxonomy of temporal, spatial, object, and procedural memory. It constructs 14 memory-augmented variants of the π0.5 VLA backbone via different integration strategies (recurrent, attention-over-history, external memory) and reports that memory effectiveness is highly task-dependent, with each design showing distinct advantages and limitations.

Significance. If the results hold after addressing controls, this benchmark could standardize evaluation of memory mechanisms in VLA models and clarify design trade-offs for long-horizon robotics. The public release of code and videos supports reproducibility and is a clear strength.

major comments (2)
  1. [§4] §4 (Experimental variants): The 14 variants are built by different integration strategies on the π0.5 backbone, yet the manuscript provides no parameter counts, FLOPs, or capacity-matched controls. This leaves open the possibility that task-dependent performance gaps reflect incidental differences in model capacity or training dynamics rather than intrinsic properties of the temporal/spatial/object/procedural taxonomy.
  2. [Results] Results section and abstract: The central claim of task-dependent effectiveness is presented without error bars, statistical significance tests, details on task construction, or data exclusion criteria. This makes it impossible to assess whether the reported patterns are reliable or generalizable beyond the specific 16 tasks.
minor comments (2)
  1. [§3] The taxonomy is introduced without explicit validation against real-world long-horizon task distributions; a short discussion or reference to how the 16 tasks were selected would strengthen the claim of representativeness.
  2. [Figures/Tables] Figure legends and tables comparing the 14 variants should include explicit capacity metrics to aid interpretation.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive review. The comments highlight important aspects of experimental rigor that we will address to strengthen the manuscript. We respond to each major comment below and indicate the planned revisions.

Point-by-point responses
  1. Referee: [§4] §4 (Experimental variants): The 14 variants are built by different integration strategies on the π0.5 backbone, yet the manuscript provides no parameter counts, FLOPs, or capacity-matched controls. This leaves open the possibility that task-dependent performance gaps reflect incidental differences in model capacity or training dynamics rather than intrinsic properties of the temporal/spatial/object/procedural taxonomy.

    Authors: We agree that parameter counts and FLOPs are necessary for transparent comparison. In the revised version we will add a dedicated table reporting parameter counts and estimated FLOPs for each of the 14 variants relative to the π0.5 backbone. On capacity-matched controls, the variants modify only the memory integration module while freezing the core VLA weights and architecture; this keeps overall capacity differences modest (typically <5% additional parameters). Nevertheless, to directly address the concern we will include a new paragraph discussing capacity implications and, where feasible, report results from a capacity-matched ablation that equalizes total parameters across representative variants. revision: partial

  2. Referee: [Results] Results section and abstract: The central claim of task-dependent effectiveness is presented without error bars, statistical significance tests, details on task construction, or data exclusion criteria. This makes it impossible to assess whether the reported patterns are reliable or generalizable beyond the specific 16 tasks.

    Authors: We accept that the current presentation lacks sufficient statistical detail. We will augment all result figures with error bars computed over multiple random seeds and add statistical significance tests (paired t-tests with Bonferroni correction) between memory variants on each task. Task construction details appear in Section 3, but we will expand this section with explicit criteria used to isolate temporal, spatial, object, and procedural memory requirements. No data points were excluded from the reported results; we will state this explicitly and describe the full evaluation protocol (including number of trials per task) to improve reproducibility and generalizability assessment. revision: yes
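
The statistical procedure promised in the second response (paired t-tests with Bonferroni correction) could be sketched as follows. This is an illustration using only the Python standard library, not the authors' evaluation code. It computes the paired t statistic directly and applies the Bonferroni threshold to p-values that are assumed to come from elsewhere, for example a library routine such as SciPy's `ttest_rel`. The function names and sample numbers are hypothetical.

```python
import math
from statistics import mean, stdev


def paired_t_statistic(a, b):
    """Paired t statistic and degrees of freedom for matched samples a and b."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equal-length samples with at least 2 pairs")
    diffs = [x - y for x, y in zip(a, b)]
    sd = stdev(diffs)  # sample standard deviation (n - 1 denominator)
    if sd == 0:
        raise ValueError("all paired differences are identical")
    t = mean(diffs) / (sd / math.sqrt(len(diffs)))
    return t, len(diffs) - 1


def bonferroni_reject(p_values, alpha=0.05):
    """Reject each hypothesis whose p-value is below alpha / (number of tests)."""
    threshold = alpha / len(p_values)
    return [p < threshold for p in p_values]


if __name__ == "__main__":
    # Hypothetical per-seed success rates for two variants on one task.
    variant_a = [0.8, 0.7, 0.9]
    variant_b = [0.6, 0.6, 0.7]
    print(paired_t_statistic(variant_a, variant_b))
    # Hypothetical p-values from three pairwise comparisons.
    print(bonferroni_reject([0.001, 0.02, 0.04]))
```

Pairing by seed assumes both variants were run with the same seeds and initial conditions. The correction becomes strict quickly: if all pairs among the 14 variants were compared on a single task, that would be 91 tests, so the per-test threshold at alpha = 0.05 would be roughly 0.00055.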

Circularity Check

0 steps flagged

Empirical benchmark with no derivation chain or self-referential reductions

Full rationale

This is a purely empirical benchmarking paper that introduces a taxonomy of memory types, constructs 16 tasks, and evaluates 14 variants on a fixed backbone through direct experimentation. No mathematical derivations, first-principles predictions, or fitted parameters are claimed to produce new results; the central claims rest on observed performance differences across tasks. The taxonomy and variants are presented as design choices for systematic comparison rather than outputs derived from prior results within the paper. No self-citation is used to justify uniqueness or forbid alternatives, and no step reduces to an input by construction. The skeptic's concern about capacity matching is a validity issue, not a circularity issue.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The work rests on the domain assumption that the four-category memory taxonomy is sufficient and that tasks constructed under it reflect genuine robotic needs; no free parameters or invented entities are introduced.

axioms (1)
  • domain assumption: The taxonomy of temporal, spatial, object, and procedural memory covers the relevant history-dependent aspects of robotic manipulation tasks.
    Invoked when constructing the 16 tasks and interpreting results as generalizable.

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. RoboMemArena: A Comprehensive and Challenging Robotic Memory Benchmark

    cs.RO · 2026-05 · unverdicted · novelty 6.0

    RoboMemArena is a new large-scale robotic memory benchmark with real-world tasks, and PrediMem is a dual VLA system that outperforms baselines by managing memory buffers with predictive coding.

  2. vla-eval: A Unified Evaluation Harness for Vision-Language-Action Models

    cs.AI · 2026-03 · accept · novelty 5.0

    vla-eval decouples VLA model inference from benchmark execution via WebSocket and Docker, supporting 14 benchmarks with up to 47x speedup and reproducing published scores across six codebases.

  3. World Action Models: The Next Frontier in Embodied AI

    cs.RO · 2026-05 · unverdicted · novelty 4.0

    The paper introduces World Action Models as a new paradigm unifying predictive world modeling with action generation in embodied foundation models and provides a taxonomy of existing approaches.

Reference graph

Works this paper leans on

82 extracted references · 82 canonical work pages · cited by 3 Pith papers · 9 internal anchors

  1. [1]

    Human memory: A proposed system and its control processes

    Richard C Atkinson and Richard M Shiffrin. Human memory: A proposed system and its control processes. InPsychology of learning and motivation, volume 2, pages 89–195. Elsevier, 1968

  2. [2]

    Object, spatial, and temporal memory: A behavioral analysis of visual scenes using a what, where, and when paradigm.Current psychology letters

    Stephanie J Babb and Ruth M Johnson. Object, spatial, and temporal memory: A behavioral analysis of visual scenes using a what, where, and when paradigm.Current psychology letters. Behaviour, brain & cognition, 26 (2, 2010), 2011

  3. [3]

    $\pi_0$: A Vision-Language-Action Flow Model for General Robot Control

    Kevin Black, Noah Brown, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Lachy Groom, Karol Hausman, Brian Ichter, et al.𝜋0: A vision-language-action flow model for general robot control. arXiv preprint arXiv:2410.24164, 2024

  4. [4]

    In Joseph Lim, Shuran Song, and Hae-Won Park, editors,Proceedings of The 9th Conference on Robot Learning, volume 305 ofProceedings of Machine Learning Research, pages 17–40

    Kevin Black, Noah Brown, James Darpinian, Karan Dhabalia, Danny Driess, Adnan Esmail, Michael Robert Equi, Chelsea Finn, and et al.𝜋0.5: a vision-language-action model with open-world generalization. In Joseph Lim, Shuran Song, and Hae-Won Park, editors,Proceedings of The 9th Conference on Robot Learning, volume 305 ofProceedings of Machine Learning Resea...

  5. [5]

    Learning to act anywhere with task-centric latent actions.RSS, 2025

    Qingwen Bu, Yanting Yang, Jisong Cai, Shenyuan Gao, Guanghui Ren, Maoqing Yao, Ping Luo, and Hongyang Li. Learning to act anywhere with task-centric latent actions.RSS, 2025

  6. [6]

    Recurrent memory transformer.Advances in Neural Information Processing Systems, 35:11079–11091, 2022

    Aydar Bulatov, Yury Kuratov, and Mikhail Burtsev. Recurrent memory transformer.Advances in Neural Information Processing Systems, 35:11079–11091, 2022

  7. [7]

    Scaling transformer to 1m tokens and beyond with rmt.arXiv preprint arXiv:2304.11062, 2023

    Aydar Bulatov, Yuri Kuratov, Yermek Kapushev, and Mikhail S Burtsev. Scaling transformer to 1m tokens and beyond with rmt.arXiv preprint arXiv:2304.11062, 2023

  8. [8]

    The human hippocampus and spatial and episodic memory.Neuron, 35(4):625–641, 2002

    Neil Burgess, Eleanor A Maguire, and John O’Keefe. The human hippocampus and spatial and episodic memory.Neuron, 35(4):625–641, 2002

  9. [9]

    History-Aware Visuomotor Policy Learning via Point Tracking, March 2026

    Jingjing Chen, Hongjie Fang, Chenxi Wang, Shiquan Wang, and Cewu Lu. History-aware visuomotor policy learning via point tracking.arXiv preprint arXiv:2509.17141, 2025

  10. [10]

    Videollm-online: Online video large language model for streaming video

    Joya Chen, Zhaoyang Lv, Shiwei Wu, Kevin Qinghong Lin, Chenan Song, Difei Gao, Jia-Wei Liu, Ziteng Gao, Dongxing Mao, and Mike Zheng Shou. Videollm-online: Online video large language model for streaming video. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18407–18418, 2024

  11. [11]

    Kovalev, and Aleksandr I

    Egor Cherepanov, Nikita Kachaev, Alexey K. Kovalev, and Aleksandr I. Panov. Memory, benchmark & robots: A benchmark for solving complex tasks with reinforcement learning. InProceedings of the 7th Robot Learning Workshop at ICLR, 2025. arXiv:2502.10550

  12. [12]

    Diffusion policy: Visuomotor policy learning via action diffusion.The International Journal of Robotics Research, 44(10-11):1684–1704, 2025

    Cheng Chi, Zhenjia Xu, Siyuan Feng, Eric Cousineau, Yilun Du, Benjamin Burchfiel, Russ Tedrake, and Shuran Song. Diffusion policy: Visuomotor policy learning via action diffusion.The International Journal of Robotics Research, 44(10-11):1684–1704, 2025

  13. [13]

    Universal manipulation interface: In-the-wild robot teaching without in-the-wild robots.RSS, 2025

    Cheng Chi, Zhenjia Xu, Chuer Pan, Eric Cousineau, Benjamin Burchfiel, Siyuan Feng, Russ Tedrake, and Shuran Song. Universal manipulation interface: In-the-wild robot teaching without in-the-wild robots.RSS, 2025. 11 RoboMME: Benchmarking and Understanding Memory for Robotic Generalist Policies

  14. [14]

    Think, act, and ask: Open-world interactive personalized robot navigation

    Yinpei Dai, Run Peng, Sikai Li, and Joyce Chai. Think, act, and ask: Open-world interactive personalized robot navigation. In2024 IEEE international conference on robotics and automation (ICRA), pages 3296–3303. IEEE, 2024

  15. [15]

    Racer: Rich language-guided failure recovery policies for imitation learning

    Yinpei Dai, Jayjun Lee, Nima Fazeli, and Joyce Chai. Racer: Rich language-guided failure recovery policies for imitation learning. In2025 IEEE International Conference on Robotics and Automation (ICRA), pages 15657–15664. IEEE, 2025

  16. [16]

    Aimbot: A simple auxiliary visual cue to enhance spatial awareness of visuomotor policies.arXiv preprint arXiv:2508.08113, 2025

    Yinpei Dai, Jayjun Lee, Yichi Zhang, Ziqiao Ma, Jed Yang, Amir Zadeh, Chuan Li, Nima Fazeli, and Joyce Chai. Aimbot: A simple auxiliary visual cue to enhance spatial awareness of visuomotor policies.arXiv preprint arXiv:2508.08113, 2025

  17. [17]

    Sam2act: Integrating visual foundation model with a memory architecture for robotic manipulation.ICML, 2025

    Haoquan Fang, Markus Grotz, Wilbert Pumacay, Yi Ru Wang, Dieter Fox, Ranjay Krishna, and Jiafei Duan. Sam2act: Integrating visual foundation model with a memory architecture for robotic manipulation.ICML, 2025

  18. [18]

    Mamba: Linear-time sequence modeling with selective state spaces

    Albert Gu and Tri Dao. Mamba: Linear-time sequence modeling with selective state spaces. InFirst conference on language modeling, 2024

  19. [19]

    Robocerebra: A large-scale benchmark for long-horizon robotic manipulation evaluation.arXiv preprint arXiv:2506.06677, 2025

    Songhao Han, Boxiang Qiu, Yue Liao, Siyuan Huang, Chen Gao, Shuicheng Yan, and Si Liu. Robocerebra: A large-scale benchmark for long-horizon robotic manipulation evaluation.arXiv preprint arXiv:2506.06677, 2025

  20. [20]

    Models of spatial and temporal dimensions of memory.Current Opinion in Behavioral Sciences, 17:27–33, 2017

    Michael E Hasselmo, James R Hinman, Holger Dannenberg, and Chantal E Stern. Models of spatial and temporal dimensions of memory.Current Opinion in Behavioral Sciences, 17:27–33, 2017

  21. [21]

    Rlbench: The robot learning benchmark & learning environment.IEEE Robotics and Automation Letters, 5(2):3019–3026, 2020

    Stephen James, Zicong Ma, David Rovick Arrojo, and Andrew J Davison. Rlbench: The robot learning benchmark & learning environment.IEEE Robotics and Automation Letters, 5(2):3019–3026, 2020

  22. [22]

    arXiv:2510.04246 [cs]

    Huiwon Jang, Sihyun Yu, Heeseung Kwon, Hojin Jeon, Younggyo Seo, and Jinwoo Shin. Contextvla: Vision- language-action model with amortized multi-frame context.arXiv preprint arXiv:2510.04246, 2025

  23. [23]

    DROID: A Large-Scale In-The-Wild Robot Manipulation Dataset

    Alexander Khazatsky, Karl Pertsch, Suraj Nair, Ashwin Balakrishna, Sudeep Dasari, Siddharth Karamcheti, Soroush Nasiriany, Mohan Kumar Srirama, Lawrence Yunliang Chen, Kirsty Ellis, et al. Droid: A large-scale in-the-wild robot manipulation dataset.arXiv preprint arXiv:2403.12945, 2024

  24. [24]

    HAMLET: Switch your Vision-Language-Action Model into a History-Aware Policy

    Myungkyu Koo, Daewon Choi, Taeyoung Kim, Kyungmin Lee, Changyeon Kim, Younggyo Seo, and Jin- woo Shin. Hamlet: Switch your vision-language-action model into a history-aware policy.arXiv preprint arXiv:2510.00695, 2025

  25. [25]

    Evaluating Real-World Robot Manipulation Policies in Simulation

    Xuanlin Li, Kyle Hsu, Jiayuan Gu, Karl Pertsch, Oier Mees, Homer Rich Walke, Chuyuan Fu, Ishikaa Lunawat, Isabel Sieh, Sean Kirmani, Sergey Levine, Jiajun Wu, Chelsea Finn, Hao Su, Quan Vuong, and Ted Xiao. Evaluating real-world robot manipulation policies in simulation.arXiv preprint arXiv:2405.05941, 2024

  26. [26]

    A systematic study of data modalities and strategies for co-training large behavior models for robot manipulation.arXiv preprint arXiv:2602.01067, 2026

    Fanqi Lin, Kushal Arora, Jean Mercat, Haruki Nishimura, Paarth Shah, Chen Xu, Mengchao Zhang, Mark Zolotas, Maya Angeles, Owen Pfannenstiehl, et al. A systematic study of data modalities and strategies for co-training large behavior models for robot manipulation.arXiv preprint arXiv:2602.01067, 2026

  27. [27]

    Libero: Benchmarking knowledge transfer for lifelong robot learning.Advances in Neural Information Processing Systems, 36: 44776–44791, 2023

    Bo Liu, Yifeng Zhu, Chongkai Gao, Yihao Feng, Qiang Liu, Yuke Zhu, and Peter Stone. Libero: Benchmarking knowledge transfer for lifelong robot learning.Advances in Neural Information Processing Systems, 36: 44776–44791, 2023

  28. [28]

    Robomamba: Efficient vision-language-action model for robotic reasoning and manipulation.Advances in Neural Information Processing Systems, 37:40085–40110, 2024

    Jiaming Liu, Mengzhen Liu, Zhenyu Wang, Pengju An, Xiaoqi Li, Kaichen Zhou, Senqiao Yang, Renrui Zhang, Yandong Guo, and Shanghang Zhang. Robomamba: Efficient vision-language-action model for robotic reasoning and manipulation.Advances in Neural Information Processing Systems, 37:40085–40110, 2024. 12 RoboMME: Benchmarking and Understanding Memory for Rob...

  29. [29]

    What Matters in Learning from Offline Human Demonstrations for Robot Manipulation

    Ajay Mandlekar, Danfei Xu, Josiah Wong, Soroush Nasiriany, Chen Wang, Rohun Kulkarni, Li Fei-Fei, Silvio Savarese, Yuke Zhu, and Roberto Martín-Martín. What matters in learning from offline human demonstrations for robot manipulation.arXiv preprint arXiv:2108.03298, 2021

  30. [30]

    Calvin: A benchmark for language- conditioned policy learning for long-horizon robot manipulation tasks.IEEE Robotics and Automation Letters, 7(3):7327–7334, 2022

    Oier Mees, Lukas Hermann, Erick Rosete-Beas, and Wolfram Burgard. Calvin: A benchmark for language- conditioned policy learning for long-horizon robot manipulation tasks.IEEE Robotics and Automation Letters, 7(3):7327–7334, 2022

  31. [31]

    Maniskill: Generalizable manipulation skill benchmark with large-scale demonstrations

    Tongzhou Mu, Zhan Ling, Fanbo Xiang, Derek Yang, Xuanlin Li, Stone Tao, Zhiao Huang, Zhiwei Jia, and Hao Su. Maniskill: Generalizable manipulation skill benchmark with large-scale demonstrations. InAdvances in Neural Information Processing Systems (NeurIPS), Track on Datasets and Benchmarks, 2021

  32. [32]

    Scalable diffusion models with transformers

    William Peebles and Saining Xie. Scalable diffusion models with transformers. InProceedings of the IEEE/CVF international conference on computer vision, pages 4195–4205, 2023

  33. [33]

    Memoryvla: Perceptual-cognitive memory in vision-language-action models for robotic manipulation.ICLR, 2026

    Hao Shi, Bin Xie, Yingfei Liu, Lin Sun, Fengrong Liu, Tiancai Wang, Erjin Zhou, Haoqiang Fan, Xiangyu Zhang, and Gao Huang. Memoryvla: Perceptual-cognitive memory in vision-language-action models for robotic manipulation.ICLR, 2026

  34. [34]

    Perceiver-actor: A multi-task transformer for robotic manipulation

    Mohit Shridhar, Lucas Manuelli, and Dieter Fox. Perceiver-actor: A multi-task transformer for robotic manipulation. InConference on Robot Learning, pages 785–799. PMLR, 2023

  35. [35]

    Memory systems of the brain: a brief history and current perspective.Neurobiology of learning and memory, 82(3):171–177, 2004

    Larry R Squire. Memory systems of the brain: a brief history and current perspective.Neurobiology of learning and memory, 82(3):171–177, 2004

  36. [36]

    Memer: Scaling up memory for robot control via experience retrieval.arXiv preprint arXiv:2510.20328, 2025

    Ajay Sridhar, Jennifer Pan, Satvik Sharma, and Chelsea Finn. Memer: Scaling up memory for robot control via experience retrieval.arXiv preprint arXiv:2510.20328, 2025

  37. [37]

    Learning to (learn at test time): Rnns with expressive hidden states.ICML, 2025

    Yu Sun, Xinhao Li, Karan Dalal, Jiarui Xu, Arjun Vikram, Genghan Zhang, Yann Dubois, Xinlei Chen, Xiaolong Wang, Sanmi Koyejo, et al. Learning to (learn at test time): Rnns with expressive hidden states.ICML, 2025

  38. [38]

    Gemini Robotics 1.5: Pushing the Frontier of Generalist Robots with Advanced Embodied Reasoning, Thinking, and Motion Transfer

    Gemini Robotics Team, Abbas Abdolmaleki, Saminda Abeyruwan, Joshua Ainslie, Jean-Baptiste Alayrac, Montserrat Gonzalez Arenas, Ashwin Balakrishna, Nathan Batchelor, Alex Bewley, Jeff Bingham, et al. Gemini robotics 1.5: Pushing the frontier of generalist robots with advanced embodied reasoning, thinking, and motion transfer.arXiv preprint arXiv:2510.03342, 2025

  39. [39]

    Learning Long-Context Diffusion Policies via Past-Token Prediction, May 2025

    Marcel Torne, Andy Tang, Yuejiang Liu, and Chelsea Finn. Learning long-context diffusion policies via past-token prediction.arXiv preprint arXiv:2505.09561, 2025

  40. [40]

    SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features

    Michael Tschannen, Alexey Gritsenko, Xiao Wang, Muhammad Ferjad Naeem, Ibrahim Alabdulmohsin, Nikhil Parthasarathy, Talfan Evans, Lucas Beyer, Ye Xia, Basil Mustafa, et al. Siglip 2: Multilingual vision- language encoders with improved semantic understanding, localization, and dense features.arXiv preprint arXiv:2502.14786, 2025

  41. [41]

    Attention is all you need.Advances in neural information processing systems, 30, 2017

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need.Advances in neural information processing systems, 30, 2017

  42. [42]

    tttlrm: Test-time training for long context and autoregressive 3d reconstruction.arXiv preprint arXiv:2602.20160, 2026

    Chen Wang, Hao Tan, Wang Yifan, Zhiqin Chen, Yuheng Liu, Kalyan Sunkavalli, Sai Bi, Lingjie Liu, and Yiwei Hu. tttlrm: Test-time training for long context and autoregressive 3d reconstruction.arXiv preprint arXiv:2602.20160, 2026

  43. [43]

    Vla-adapter: An effective paradigm 10 for tiny-scale vision-language-action model.arXiv preprint arXiv:2509.09372, 2025

    Yihao Wang, Pengxiang Ding, Lingxiao Li, Can Cui, Zirui Ge, Xinyang Tong, Wenxuan Song, Han Zhao, Wei Zhao, Pengxu Hou, et al. Vla-adapter: An effective paradigm for tiny-scale vision-language-action model. arXiv preprint arXiv:2509.09372, 2025

  44. [44]

    Timechat-online: 80% visual tokens are naturally redundant in streaming videos

    Linli Yao, Yicheng Li, Yuancheng Wei, Lei Li, Shuhuai Ren, Yuanxin Liu, Kun Ouyang, Lean Wang, Shicheng Li, Sida Li, et al. Timechat-online: 80% visual tokens are naturally redundant in streaming videos. InProceedings of the 33rd ACM International Conference on Multimedia, pages 10807–10816, 2025. 13 RoboMME: Benchmarking and Understanding Memory for Robo...

  45. [45]

    Vlabench: A large-scale benchmark for language-conditioned robotics manipulation with long-horizon reasoning tasks

    Shiduo Zhang, Zhe Xu, Peiju Liu, Xiaopeng Yu, Yuan Li, Qinghui Gao, Zhaoye Fei, Zhangyue Yin, Zuxuan Wu, Yu-Gang Jiang, et al. Vlabench: A large-scale benchmark for language-conditioned robotics manipulation with long-horizon reasoning tasks. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 11142–11152, 2025

  46. [46]

    Test-Time Training Done Right

    Tianyuan Zhang, Sai Bi, Yicong Hong, Kai Zhang, Fujun Luan, Songlin Yang, Kalyan Sunkavalli, William T Freeman, and Hao Tan. Test-time training done right.arXiv preprint arXiv:2505.23884, 2025

  47. [47]

    Mtil: Encoding full history with mamba for temporal imitation learning.IEEE Robotics and Automation Letters, 2025

    Yulin Zhou, Yuankai Lin, Fanzhe Peng, Jiahui Chen, Kaiji Huang, Hua Yang, and Zhouping Yin. Mtil: Encoding full history with mamba for temporal imitation learning.IEEE Robotics and Automation Letters, 2025

  48. [48]

    robosuite: A Modular Simulation Framework and Benchmark for Robot Learning

    Yuke Zhu, Josiah Wong, Ajay Mandlekar, Roberto Martín-Martín, Abhishek Joshi, Soroush Nasiriany, and Yifeng Zhu. robosuite: A modular simulation framework and benchmark for robot learning.arXiv preprint arXiv:2009.12293, 2020. 14 RoboMME: Benchmarking and Understanding Memory for Robotic Generalist Policies Appendix Outline •A. Model Architectures in the ...

  49. [49]

    put one blue cube into the bin and press the button to stop

    Placeonecube of a specified color.e.g., “put one blue cube into the bin and press the button to stop”

  50. [50]

    put one blue cube and two green cubes into the bin and press the button to stop

    Place cubes oftwospecified colors.e.g., “put one blue cube and two green cubes into the bin and press the button to stop”

  51. [51]

    put one blue cube, one green cube and two red cubes into the bin and press the button to stop

    Place cubes ofthreespecified colors.e.g., “put one blue cube, one green cube and two red cubes into the bin and press the button to stop”. Task CharacteristicsTo introduce dynamic and history-dependent behavior, we consider two settings randomly selected at environment initialization: 1.Static:all cubes are present at the beginning of the episode. 2.Strea...

  52. [52]

    pick up the blue cube and place it on the target, then press the button to stop

    Pick and place foronetime.e.g., “pick up the blue cube and place it on the target, then press the button to stop”

  53. [53]

    pick up the blue cube and place it on the target, repeating this pick-and-place action three times, then press the button to stop

    Pick and place multiple times, e.g., “pick up the blue cube and place it on the target, repeating this pick-and-place action three times, then press the button to stop”.

    Successful Pick-and-Place. A pick-and-place is considered successful when the robot lifts the cube above a predefined height threshold while maintaining a valid grasp, and then lowers it onto the target ...

  54. [54]

    The robot picks up a wrong cube

  55. [55]

    The button is pressed before all required repetitions are completed

  56. [56]

    The robot performs more pick-and-place repetitions than specified.

    Figure 14: SwingXtimes Task Example. In this instance, the goal is to pick up the green cube, first move it to the right-side target, then put down the cube on the left-side target (i.e., swing between targets one ...

  57. [57]

    Pick up the green cube, move it to the right-side target and then put down the cube on the left-side target and press the button to stop

    Perform one swing cycle, e.g., “Pick up the green cube, move it to the right-side target and then put down the cube on the left-side target and press the button to stop”

  58. [58]

    Perform multiple swing cycles, e.g., “Pick up the green cube, move it to the right-side target and then to the left-side target, repeating this right-to-left swing motion three times, then put down the cube and press the button to stop.”

    Successful Reach. A reach is successful when the cube is held nearly upright and positioned within a small tolerance of the...

  59. [59]

    The robot picks up the wrong cube

  60. [60]

    The button is pressed before all repetitions are completed

  61. [61]

    press the button to stop the cube exactly at the target on its third visit

    The robot reaches either target more than the specified number of times (excessive swings).

    Figure 15: StopCube Task Example. In this instance, the goal is to press the button to stop the cube exactly at the target on its second visit.

    E.4. StopCube

    Table 16: StopCube Task Configura...

  62. [62]

    Static Behavior: To succeed, the robot must position its end-effector over the button and remain static (hovering) until the correct timing

  63. [63]

    The robot must press at the exact moment the cube reaches the target zone in the specified cycle, accounting for the motion delay from hover to button

    Immediate Stop: Pressing the button stops the cube instantly. The robot must press at the exact moment the cube reaches the target zone in the specified cycle, accounting for the motion delay from hover to button.

    Success Criteria
    • Precise Synchronization: The button must be pressed strictly within the time window when the cube overlaps with the target.
    • C...

  64. [64]

    Watch the video carefully, then pick up the container hiding the green cube

    Pick one container hiding a specified cube, e.g., “Watch the video carefully, then pick up the container hiding the green cube.”

  65. [65]

    Watch the video carefully, then pick up the container hiding the green cube, and finally pick up the container hiding the red cube

    Pick two containers sequentially (order matters), e.g., “Watch the video carefully, then pick up the container hiding the green cube, and finally pick up the container hiding the red cube.”

    Task Characteristics. Each episode consists of a video phase followed by an execution phase.
    • Video: Multiple containers are placed on the table. Each cube (red/green/blue) is...

  66. [66]

    First press the button on the table, then pick up the container hiding the green cube

    Pick one specified container by cube color, e.g., “First press the button on the table, then pick up the container hiding the green cube.”

  67. [67]

    First press the button on the table, then pick up the container hiding the green cube, and finally pick up the container hiding the red cube

    Pick two specified containers sequentially (order matters), e.g., “First press the button on the table, then pick up the container hiding the green cube, and finally pick up the container hiding the red cube.”

    Task Characteristics
    • Button Pressing: The robot needs to press the button at the beginning. During the press, multiple containers are concurrently p...

  68. [69]

    The robot picks up an incorrect container (i.e., one hiding a non-specified cube).

    Figure 18: VideoUnmaskSwap Task Example. In this instance, the robot first watches the video, then picks up the container hiding the blue cube, followed by the one hiding the green cube. The vide...

  69. [70]

    Watch the video carefully, then pick up the container hiding the blue cube

    Pick one specified container, e.g., “Watch the video carefully, then pick up the container hiding the blue cube.”

  70. [71]

    Watch the video carefully, then pick up the container hiding the blue cube, and finally pick up the container hiding the red cube

    Pick two specified containers sequentially (order matters), e.g., “Watch the video carefully, then pick up the container hiding the blue cube, and finally pick up the container hiding the red cube.”

    Task Characteristics. Each episode consists of a video phase followed by an execution phase.
    • Video: Multiple containers are placed on the table. Each cube (red/green...

  71. [72]

    First press both buttons on the table, then pick up the container hiding the red cube

    Pick one specified container, e.g., “First press both buttons on the table, then pick up the container hiding the red cube.”

  72. [73]

    First press both buttons on the table, then pick up the container hiding the green cube, and finally pick up the container hiding the red cube

    Pick two specified containers sequentially (order matters), e.g., “First press both buttons on the table, then pick up the container hiding the green cube, and finally pick up the container hiding the red cube.”

    Task Characteristics
    • Button Pressing: The robot needs to press both buttons. During the pressing, multiple containers are placed on the table to en...

  73. [74]

    The robot picks up any container before completing the button-press phase

  74. [75]

    first press the button, then pick up all highlighted cubes, finally press the button again to stop

    The robot picks up an incorrect container (i.e., one hiding a non-specified cube).

    Figure 20: PickHighlight Task Example. In this instance, the goal is to first press the button, then pick up all cubes highlighted by a white area, and finally press the button again to stop. During the bu...

  75. [76]

    The robot fails to press the button before attempting a pick

  76. [77]

    Watch the video carefully, then pick up the same cube that was previously picked twice, and finally press the button to stop

    The robot picks up any wrong cube (i.e., a cube that was not highlighted).

    Figure 21: VideoRepick Task Example. In this instance, the robot first watches a video, then picks up the same previously picked cube twice, and finally presses the button to stop. Red-bordered frames d...

  77. [78]

    the robot picks up the wrong cube,

  78. [79]

    the button is pressed before finishing 𝑁 repetitions, or

  79. [80]

    Watch the video carefully and place the blue cube on the target where it was placed immediately before the button was pressed

    the robot completes more than 𝑁 repetitions.

    Figure 22: VideoPlaceButton Task Example. In this example, the robot first observes a video depicting an interleaved sequence of cube placements and button presses, and then places the cube onto the correct target corresponding to its ...

  80. [81]

    The robot picks up an incorrect peg
