RoboMME: Benchmarking and Understanding Memory for Robotic Generalist Policies

Chelsea Finn; Haoran Zhang; Hongze Fu; Jayjun Lee; Jianing Yang; Joyce Chai; Nima Fazeli; Yinpei Dai; Yuejiang Liu

arxiv: 2603.04639 · v2 · pith:PF2SNA4Nnew · submitted 2026-03-04 · 💻 cs.RO · cs.AI

Pith reviewed 2026-05-21 12:23 UTC · model grok-4.3

classification 💻 cs.RO cs.AI

keywords: robotic manipulation · memory mechanisms · vision-language-action models · long-horizon tasks · benchmark evaluation · task-dependent performance · generalist policies

The pith

How well a memory representation serves a robotic policy depends on the specific task; no single design is best across the board.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to create a consistent way to compare how different memory mechanisms help vision-language-action models handle robotic tasks that unfold over many steps and require recalling past information. It organizes evaluation around four memory categories and builds sixteen tasks to test them. Multiple variants are then created by adding different memory structures to one base model. A sympathetic reader would care because these kinds of tasks appear in everyday manipulation yet current models lack reliable ways to track history. If the findings are correct, future work can focus on choosing or combining memory types according to what a given task actually requires instead of searching for one universal solution.

Core claim

The authors claim that the effectiveness of memory representations is highly task-dependent, with each design offering distinct advantages and limitations across different tasks. This conclusion rests on a benchmark of sixteen manipulation tasks built under a taxonomy of temporal, spatial, object, and procedural memory, together with experiments on fourteen memory-augmented variants of a single base model.

What carries the argument

A taxonomy that divides memory requirements into temporal, spatial, object, and procedural categories, used to structure both the creation of test tasks and the comparison of integration strategies.

If this is right

Model builders should match memory mechanisms to the dominant requirement of a task, such as counting steps for temporal needs or recovering from occlusions for object needs.
Standardized benchmarks make it possible to measure incremental progress in history-dependent robotic manipulation instead of relying on isolated demonstrations.
Generalist policies may need to incorporate multiple memory types or switch between them when facing varied task demands.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Hybrid memory systems that detect task features and activate the most suitable representation could extend performance across a wider range of scenarios.
Applying the same taxonomy to physical robot experiments would reveal whether simulation results hold when sensor noise and actuation errors are present.
Similar task-dependent patterns may appear in other sequential control problems such as navigation or assembly planning.

Load-bearing premise

The four memory categories accurately capture the needs of real long-horizon robotic manipulation and the sixteen tasks are representative enough to support general conclusions.

What would settle it

A follow-up test in which one memory design outperforms all others on every task in the set or on a fresh collection of long-horizon tasks that still fit the same overall description.
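
The falsification test described above can be phrased as a simple check over a results table. The sketch below is illustrative only: the variant names, task names, and success rates are hypothetical placeholders, not numbers from the paper, and the helper functions are assumptions introduced for this example.

```python
from typing import Dict, Optional

# variant -> task -> success rate. All values are hypothetical placeholders.
SuccessTable = Dict[str, Dict[str, float]]


def best_variant_per_task(table: SuccessTable) -> Dict[str, str]:
    """Return, for each task, the variant with the highest success rate."""
    tasks = next(iter(table.values())).keys()
    return {task: max(table, key=lambda v: table[v][task]) for task in tasks}


def universal_winner(table: SuccessTable) -> Optional[str]:
    """Return the variant that is strictly best on every task, or None."""
    for candidate, scores in table.items():
        others = [v for v in table if v != candidate]
        if all(scores[t] > table[o][t] for t in scores for o in others):
            return candidate
    return None


# Hypothetical table in which each variant wins a different task.
example: SuccessTable = {
    "variant_a": {"task_1": 0.80, "task_2": 0.35},
    "variant_b": {"task_1": 0.55, "task_2": 0.70},
}

print(best_variant_per_task(example))
print(universal_winner(example))
```

If `universal_winner` returned a variant name on the full sixteen-task table, or on a fresh task collection, that would count against the task-dependence claim; a `None` result is consistent with it.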

Figures

Figures reproduced from arXiv: 2603.04639 by Chelsea Finn, Haoran Zhang, Hongze Fu, Jayjun Lee, Jianing Yang, Joyce Chai, Nima Fazeli, Yinpei Dai, Yuejiang Liu.

**Figure 1.** RoboMME is a large-scale robotic benchmark for evaluating memory-augmented manipulation, comprising four task suites that emphasize distinct memory demands. (1) The Counting suite targets temporal memory, requiring robots to accumulate and reason over past events, e.g., counting placed green cubes and stopping correctly (top-left). (2) The Permanence suite focuses on spatial memory, requiring tracking of o…

Original abstract

Memory is critical for long-horizon and history-dependent robotic manipulation. Such tasks often involve counting repeated actions or manipulating objects that become temporarily occluded. Recent vision-language-action (VLA) models have begun to incorporate memory mechanisms; however, their evaluations remain confined to narrow, non-standardized settings. This limits their systematic understanding, comparison, and progress measurement. To address these challenges, we introduce RoboMME: a large-scale standardized benchmark for evaluating and advancing VLA models in long-horizon, history-dependent scenarios. Our benchmark comprises 16 manipulation tasks constructed under a carefully designed taxonomy that evaluates temporal, spatial, object, and procedural memory. We further develop a suite of 14 memory-augmented VLA variants built on the π0.5 backbone to systematically explore different memory representations across multiple integration strategies. Experimental results show that the effectiveness of memory representations is highly task-dependent, with each design offering distinct advantages and limitations across different tasks. Videos and code can be found at our website https://robomme.github.io.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

RoboMME introduces a new benchmark and taxonomy for memory in robotic VLAs with 14 variants on π0.5, but the task-dependent results may trace to capacity differences rather than the memory designs themselves.

The main things to know: this paper creates RoboMME, a standardized benchmark of 16 long-horizon manipulation tasks organized by a taxonomy of temporal, spatial, object, and procedural memory, and then evaluates 14 memory-augmented variants built on the π0.5 backbone. The central observation is that which memory approach helps depends on the specific task, such as counting repeated actions or handling temporary occlusions. This setup fills a real gap in how VLA models are tested on history-dependent work, and the systematic variants plus public code and videos make the contribution concrete rather than just another model tweak. The benchmark itself looks like something others could actually use for comparisons.

On the soft spots, the capacity concern lands: the 14 variants use different integration strategies that likely shift parameter counts, hidden dimensions, or training behavior, yet there is no mention of capacity-matched controls or FLOPs reporting. That leaves open the possibility that the reported advantages come from incidental model differences instead of the intended memory categories. The abstract also gives no error bars, significance tests, or task-construction details, which makes the task-dependent claim harder to weigh without the full methods section. If those controls and statistics are absent from the paper too, the attribution is weakened.

This work is aimed at researchers building or evaluating generalist robotic policies who need better ways to measure memory. Anyone working on VLA models for long-horizon tasks would get practical value from trying the benchmark or the variants. It offers enough new artifacts and addresses a clear enough evaluation gap to deserve a serious referee, even if the experimental design needs tightening on capacity and statistics. I would send it to peer review.

Referee Report

2 major / 2 minor

Summary. The paper introduces RoboMME, a standardized benchmark of 16 long-horizon robotic manipulation tasks organized under a taxonomy of temporal, spatial, object, and procedural memory. It constructs 14 memory-augmented variants of the π0.5 VLA backbone via different integration strategies (recurrent, attention-over-history, external memory) and reports that memory effectiveness is highly task-dependent, with each design showing distinct advantages and limitations.

Significance. If the results hold after addressing controls, this benchmark could standardize evaluation of memory mechanisms in VLA models and clarify design trade-offs for long-horizon robotics. The public release of code and videos supports reproducibility and is a clear strength.

major comments (2)

[§4] §4 (Experimental variants): The 14 variants are built by different integration strategies on the π0.5 backbone, yet the manuscript provides no parameter counts, FLOPs, or capacity-matched controls. This leaves open the possibility that task-dependent performance gaps reflect incidental differences in model capacity or training dynamics rather than intrinsic properties of the temporal/spatial/object/procedural taxonomy.
[Results] Results section and abstract: The central claim of task-dependent effectiveness is presented without error bars, statistical significance tests, details on task construction, or data exclusion criteria. This makes it impossible to assess whether the reported patterns are reliable or generalizable beyond the specific 16 tasks.

minor comments (2)

[§3] The taxonomy is introduced without explicit validation against real-world long-horizon task distributions; a short discussion or reference to how the 16 tasks were selected would strengthen the claim of representativeness.
[Figures/Tables] Figure legends and tables comparing the 14 variants should include explicit capacity metrics to aid interpretation.
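
The capacity metric the report asks for can be as simple as each variant's parameter overhead relative to the backbone. The following is a minimal sketch; the function name and every parameter count are hypothetical placeholders, not figures from the paper.

```python
def parameter_overhead_pct(backbone_params: int, variant_params: int) -> float:
    """Extra parameters of a variant relative to the backbone, in percent."""
    if backbone_params <= 0:
        raise ValueError("backbone_params must be positive")
    return 100.0 * (variant_params - backbone_params) / backbone_params


# Placeholder counts for illustration only; NOT taken from the paper.
backbone = 1_000_000_000
variants = {
    "variant_a": 1_020_000_000,
    "variant_b": 1_150_000_000,
}

for name, count in variants.items():
    overhead = parameter_overhead_pct(backbone, count)
    print(f"{name}: {overhead:+.1f}% parameters vs. backbone")
```

Reporting this number next to each success rate would let a reader see at a glance whether a winning variant is also the largest one.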

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive review. The comments highlight important aspects of experimental rigor that we will address to strengthen the manuscript. We respond to each major comment below and indicate the planned revisions.

Referee: [§4] §4 (Experimental variants): The 14 variants are built by different integration strategies on the π0.5 backbone, yet the manuscript provides no parameter counts, FLOPs, or capacity-matched controls. This leaves open the possibility that task-dependent performance gaps reflect incidental differences in model capacity or training dynamics rather than intrinsic properties of the temporal/spatial/object/procedural taxonomy.

Authors: We agree that parameter counts and FLOPs are necessary for transparent comparison. In the revised version we will add a dedicated table reporting parameter counts and estimated FLOPs for each of the 14 variants relative to the π0.5 backbone. On capacity-matched controls, the variants modify only the memory integration module while freezing the core VLA weights and architecture; this keeps overall capacity differences modest (typically <5% additional parameters). Nevertheless, to directly address the concern we will include a new paragraph discussing capacity implications and, where feasible, report results from a capacity-matched ablation that equalizes total parameters across representative variants.

revision: partial

Referee: [Results] Results section and abstract: The central claim of task-dependent effectiveness is presented without error bars, statistical significance tests, details on task construction, or data exclusion criteria. This makes it impossible to assess whether the reported patterns are reliable or generalizable beyond the specific 16 tasks.

Authors: We accept that the current presentation lacks sufficient statistical detail. We will augment all result figures with error bars computed over multiple random seeds and add statistical significance tests (paired t-tests with Bonferroni correction) between memory variants on each task. Task construction details appear in Section 3, but we will expand this section with explicit criteria used to isolate temporal, spatial, object, and procedural memory requirements. No data points were excluded from the reported results; we will state this explicitly and describe the full evaluation protocol (including number of trials per task) to improve reproducibility and generalizability assessment.

revision: yes
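
The statistical additions promised in this response (error bars over seeds, paired tests, Bonferroni correction) could look roughly like the sketch below. Two caveats: the per-seed success rates are hypothetical placeholders, and the sketch swaps in an exact paired sign-flip permutation test in place of the paired t-test so that it runs on the Python standard library alone.

```python
from itertools import product
from math import sqrt
from statistics import mean, stdev
from typing import List, Sequence, Tuple


def mean_and_stderr(values: Sequence[float]) -> Tuple[float, float]:
    """Mean and standard error across seeds (one-SE error bar)."""
    return mean(values), stdev(values) / sqrt(len(values))


def paired_sign_flip_pvalue(a: Sequence[float], b: Sequence[float]) -> float:
    """Exact two-sided sign-flip permutation p-value for paired samples.

    Stand-in for a paired t-test; enumerates all 2**n sign assignments,
    so it is only practical for a small number of seeds.
    """
    diffs = [x - y for x, y in zip(a, b)]
    observed = abs(sum(diffs))
    hits = 0
    total = 0
    for signs in product((1, -1), repeat=len(diffs)):
        total += 1
        if abs(sum(s * d for s, d in zip(signs, diffs))) >= observed - 1e-12:
            hits += 1
    return hits / total


def bonferroni(p_values: List[float]) -> List[float]:
    """Bonferroni-adjusted p-values: multiply by the number of tests, cap at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


# Hypothetical per-seed success rates on one task; NOT data from the paper.
baseline = [0.60, 0.62, 0.58, 0.61, 0.59]
variant_a = [0.80, 0.82, 0.78, 0.81, 0.79]
variant_b = [0.61, 0.60, 0.63, 0.58, 0.60]

raw = [
    paired_sign_flip_pvalue(variant_a, baseline),
    paired_sign_flip_pvalue(variant_b, baseline),
]
print("error bar (variant_a):", mean_and_stderr(variant_a))
print("raw p-values:", raw)
print("Bonferroni-adjusted:", bonferroni(raw))
```

With only a handful of seeds, an exact permutation test cannot produce very small p-values, which is one reason the number of seeds and trials per task matters for the planned analysis.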

Circularity Check

0 steps flagged

Empirical benchmark with no derivation chain or self-referential reductions

This is a purely empirical benchmarking paper that introduces a taxonomy of memory types, constructs 16 tasks, and evaluates 14 variants on a fixed backbone through direct experimentation. No mathematical derivations, first-principles predictions, or fitted parameters are claimed to produce new results; the central claims rest on observed performance differences across tasks. The taxonomy and variants are presented as design choices for systematic comparison rather than outputs derived from prior results within the paper. No self-citation is used to justify uniqueness or rule out alternatives, and no step reduces to an input by construction. The skeptic's concern about capacity matching is a validity issue, not a circularity issue.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The work rests on the domain assumption that the four-category memory taxonomy is sufficient and that tasks constructed under it reflect genuine robotic needs; no free parameters or invented entities are introduced.

axioms (1)

Domain assumption: The taxonomy of temporal, spatial, object, and procedural memory covers the relevant history-dependent aspects of robotic manipulation tasks. Invoked when constructing the 16 tasks and interpreting results as generalizable.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · reality_from_one_distinction · tag: unclear

Paper passage: RoboMME categorizes memory into four cognitive dimensions: (1) temporal memory for event accumulation and ordering; (2) spatial memory for tracking object locations under occlusion...

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

RoboMemArena: A Comprehensive and Challenging Robotic Memory Benchmark
cs.RO · 2026-05 · unverdicted · novelty 6.0

RoboMemArena is a new large-scale robotic memory benchmark with real-world tasks, and PrediMem is a dual VLA system that outperforms baselines by managing memory buffers with predictive coding.

vla-eval: A Unified Evaluation Harness for Vision-Language-Action Models
cs.AI · 2026-03 · accept · novelty 5.0

vla-eval decouples VLA model inference from benchmark execution via WebSocket and Docker, supporting 14 benchmarks with up to 47x speedup and reproducing published scores across six codebases.

World Action Models: The Next Frontier in Embodied AI
cs.RO · 2026-05 · unverdicted · novelty 4.0

The paper introduces World Action Models as a new paradigm unifying predictive world modeling with action generation in embodied foundation models and provides a taxonomy of existing approaches.

Reference graph

Works this paper leans on

82 extracted references · 82 canonical work pages · cited by 3 Pith papers · 9 internal anchors

[1]

Human memory: A proposed system and its control processes

Richard C Atkinson and Richard M Shiffrin. Human memory: A proposed system and its control processes. InPsychology of learning and motivation, volume 2, pages 89–195. Elsevier, 1968

work page 1968
[2]

Object, spatial, and temporal memory: A behavioral analysis of visual scenes using a what, where, and when paradigm.Current psychology letters

Stephanie J Babb and Ruth M Johnson. Object, spatial, and temporal memory: A behavioral analysis of visual scenes using a what, where, and when paradigm.Current psychology letters. Behaviour, brain & cognition, 26 (2, 2010), 2011

work page 2010
[3]

$\pi_0$: A Vision-Language-Action Flow Model for General Robot Control

Kevin Black, Noah Brown, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Lachy Groom, Karol Hausman, Brian Ichter, et al.𝜋0: A vision-language-action flow model for general robot control. arXiv preprint arXiv:2410.24164, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[4]

In Joseph Lim, Shuran Song, and Hae-Won Park, editors,Proceedings of The 9th Conference on Robot Learning, volume 305 ofProceedings of Machine Learning Research, pages 17–40

Kevin Black, Noah Brown, James Darpinian, Karan Dhabalia, Danny Driess, Adnan Esmail, Michael Robert Equi, Chelsea Finn, and et al.𝜋0.5: a vision-language-action model with open-world generalization. In Joseph Lim, Shuran Song, and Hae-Won Park, editors,Proceedings of The 9th Conference on Robot Learning, volume 305 ofProceedings of Machine Learning Resea...

work page 2025
[5]

Learning to act anywhere with task-centric latent actions.RSS, 2025

Qingwen Bu, Yanting Yang, Jisong Cai, Shenyuan Gao, Guanghui Ren, Maoqing Yao, Ping Luo, and Hongyang Li. Learning to act anywhere with task-centric latent actions.RSS, 2025

work page 2025
[6]

Recurrent memory transformer.Advances in Neural Information Processing Systems, 35:11079–11091, 2022

Aydar Bulatov, Yury Kuratov, and Mikhail Burtsev. Recurrent memory transformer.Advances in Neural Information Processing Systems, 35:11079–11091, 2022

work page 2022
[7]

Scaling transformer to 1m tokens and beyond with rmt.arXiv preprint arXiv:2304.11062, 2023

Aydar Bulatov, Yuri Kuratov, Yermek Kapushev, and Mikhail S Burtsev. Scaling transformer to 1m tokens and beyond with rmt.arXiv preprint arXiv:2304.11062, 2023

work page arXiv 2023
[8]

The human hippocampus and spatial and episodic memory.Neuron, 35(4):625–641, 2002

Neil Burgess, Eleanor A Maguire, and John O’Keefe. The human hippocampus and spatial and episodic memory.Neuron, 35(4):625–641, 2002

work page 2002
[9]

History-Aware Visuomotor Policy Learning via Point Tracking, March 2026

Jingjing Chen, Hongjie Fang, Chenxi Wang, Shiquan Wang, and Cewu Lu. History-aware visuomotor policy learning via point tracking.arXiv preprint arXiv:2509.17141, 2025

work page arXiv 2025
[10]

Videollm-online: Online video large language model for streaming video

Joya Chen, Zhaoyang Lv, Shiwei Wu, Kevin Qinghong Lin, Chenan Song, Difei Gao, Jia-Wei Liu, Ziteng Gao, Dongxing Mao, and Mike Zheng Shou. Videollm-online: Online video large language model for streaming video. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18407–18418, 2024

work page 2024
[11]

Kovalev, and Aleksandr I

Egor Cherepanov, Nikita Kachaev, Alexey K. Kovalev, and Aleksandr I. Panov. Memory, benchmark & robots: A benchmark for solving complex tasks with reinforcement learning. InProceedings of the 7th Robot Learning Workshop at ICLR, 2025. arXiv:2502.10550

work page arXiv 2025
[12]

Diffusion policy: Visuomotor policy learning via action diffusion.The International Journal of Robotics Research, 44(10-11):1684–1704, 2025

Cheng Chi, Zhenjia Xu, Siyuan Feng, Eric Cousineau, Yilun Du, Benjamin Burchfiel, Russ Tedrake, and Shuran Song. Diffusion policy: Visuomotor policy learning via action diffusion.The International Journal of Robotics Research, 44(10-11):1684–1704, 2025

work page 2025
[13]

Universal manipulation interface: In-the-wild robot teaching without in-the-wild robots.RSS, 2025

Cheng Chi, Zhenjia Xu, Chuer Pan, Eric Cousineau, Benjamin Burchfiel, Siyuan Feng, Russ Tedrake, and Shuran Song. Universal manipulation interface: In-the-wild robot teaching without in-the-wild robots.RSS, 2025. 11 RoboMME: Benchmarking and Understanding Memory for Robotic Generalist Policies

work page 2025
[14]

Think, act, and ask: Open-world interactive personalized robot navigation

Yinpei Dai, Run Peng, Sikai Li, and Joyce Chai. Think, act, and ask: Open-world interactive personalized robot navigation. In2024 IEEE international conference on robotics and automation (ICRA), pages 3296–3303. IEEE, 2024

work page 2024
[15]

Racer: Rich language-guided failure recovery policies for imitation learning

Yinpei Dai, Jayjun Lee, Nima Fazeli, and Joyce Chai. Racer: Rich language-guided failure recovery policies for imitation learning. In2025 IEEE International Conference on Robotics and Automation (ICRA), pages 15657–15664. IEEE, 2025

work page 2025
[16]

Aimbot: A simple auxiliary visual cue to enhance spatial awareness of visuomotor policies.arXiv preprint arXiv:2508.08113, 2025

Yinpei Dai, Jayjun Lee, Yichi Zhang, Ziqiao Ma, Jed Yang, Amir Zadeh, Chuan Li, Nima Fazeli, and Joyce Chai. Aimbot: A simple auxiliary visual cue to enhance spatial awareness of visuomotor policies.arXiv preprint arXiv:2508.08113, 2025

work page arXiv 2025
[17]

Sam2act: Integrating visual foundation model with a memory architecture for robotic manipulation.ICML, 2025

Haoquan Fang, Markus Grotz, Wilbert Pumacay, Yi Ru Wang, Dieter Fox, Ranjay Krishna, and Jiafei Duan. Sam2act: Integrating visual foundation model with a memory architecture for robotic manipulation.ICML, 2025

work page 2025
[18]

Mamba: Linear-time sequence modeling with selective state spaces

Albert Gu and Tri Dao. Mamba: Linear-time sequence modeling with selective state spaces. InFirst conference on language modeling, 2024

work page 2024
[19]

Robocerebra: A large-scale benchmark for long-horizon robotic manipulation evaluation.arXiv preprint arXiv:2506.06677, 2025

Songhao Han, Boxiang Qiu, Yue Liao, Siyuan Huang, Chen Gao, Shuicheng Yan, and Si Liu. Robocerebra: A large-scale benchmark for long-horizon robotic manipulation evaluation.arXiv preprint arXiv:2506.06677, 2025

work page arXiv 2025
[20]

Models of spatial and temporal dimensions of memory.Current Opinion in Behavioral Sciences, 17:27–33, 2017

Michael E Hasselmo, James R Hinman, Holger Dannenberg, and Chantal E Stern. Models of spatial and temporal dimensions of memory.Current Opinion in Behavioral Sciences, 17:27–33, 2017

work page 2017
[21]

Rlbench: The robot learning benchmark & learning environment.IEEE Robotics and Automation Letters, 5(2):3019–3026, 2020

Stephen James, Zicong Ma, David Rovick Arrojo, and Andrew J Davison. Rlbench: The robot learning benchmark & learning environment.IEEE Robotics and Automation Letters, 5(2):3019–3026, 2020

work page 2020
[22]

arXiv:2510.04246 [cs]

Huiwon Jang, Sihyun Yu, Heeseung Kwon, Hojin Jeon, Younggyo Seo, and Jinwoo Shin. Contextvla: Vision- language-action model with amortized multi-frame context.arXiv preprint arXiv:2510.04246, 2025

work page arXiv 2025
[23]

DROID: A Large-Scale In-The-Wild Robot Manipulation Dataset

Alexander Khazatsky, Karl Pertsch, Suraj Nair, Ashwin Balakrishna, Sudeep Dasari, Siddharth Karamcheti, Soroush Nasiriany, Mohan Kumar Srirama, Lawrence Yunliang Chen, Kirsty Ellis, et al. Droid: A large-scale in-the-wild robot manipulation dataset.arXiv preprint arXiv:2403.12945, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[24]

HAMLET: Switch your Vision-Language-Action Model into a History-Aware Policy

Myungkyu Koo, Daewon Choi, Taeyoung Kim, Kyungmin Lee, Changyeon Kim, Younggyo Seo, and Jin- woo Shin. Hamlet: Switch your vision-language-action model into a history-aware policy.arXiv preprint arXiv:2510.00695, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[25]

Evaluating Real-World Robot Manipulation Policies in Simulation

Xuanlin Li, Kyle Hsu, Jiayuan Gu, Karl Pertsch, Oier Mees, Homer Rich Walke, Chuyuan Fu, Ishikaa Lunawat, Isabel Sieh, Sean Kirmani, Sergey Levine, Jiajun Wu, Chelsea Finn, Hao Su, Quan Vuong, and Ted Xiao. Evaluating real-world robot manipulation policies in simulation.arXiv preprint arXiv:2405.05941, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[26]

A systematic study of data modalities and strategies for co-training large behavior models for robot manipulation.arXiv preprint arXiv:2602.01067, 2026

Fanqi Lin, Kushal Arora, Jean Mercat, Haruki Nishimura, Paarth Shah, Chen Xu, Mengchao Zhang, Mark Zolotas, Maya Angeles, Owen Pfannenstiehl, et al. A systematic study of data modalities and strategies for co-training large behavior models for robot manipulation.arXiv preprint arXiv:2602.01067, 2026

work page arXiv 2026
[27]

Libero: Benchmarking knowledge transfer for lifelong robot learning.Advances in Neural Information Processing Systems, 36: 44776–44791, 2023

Bo Liu, Yifeng Zhu, Chongkai Gao, Yihao Feng, Qiang Liu, Yuke Zhu, and Peter Stone. Libero: Benchmarking knowledge transfer for lifelong robot learning.Advances in Neural Information Processing Systems, 36: 44776–44791, 2023

work page 2023
[28]

Robomamba: Efficient vision-language-action model for robotic reasoning and manipulation.Advances in Neural Information Processing Systems, 37:40085–40110, 2024

Jiaming Liu, Mengzhen Liu, Zhenyu Wang, Pengju An, Xiaoqi Li, Kaichen Zhou, Senqiao Yang, Renrui Zhang, Yandong Guo, and Shanghang Zhang. Robomamba: Efficient vision-language-action model for robotic reasoning and manipulation.Advances in Neural Information Processing Systems, 37:40085–40110, 2024. 12 RoboMME: Benchmarking and Understanding Memory for Rob...

work page 2024
[29]

What Matters in Learning from Offline Human Demonstrations for Robot Manipulation

Ajay Mandlekar, Danfei Xu, Josiah Wong, Soroush Nasiriany, Chen Wang, Rohun Kulkarni, Li Fei-Fei, Silvio Savarese, Yuke Zhu, and Roberto Martín-Martín. What matters in learning from offline human demonstrations for robot manipulation.arXiv preprint arXiv:2108.03298, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021
[30]

Calvin: A benchmark for language- conditioned policy learning for long-horizon robot manipulation tasks.IEEE Robotics and Automation Letters, 7(3):7327–7334, 2022

Oier Mees, Lukas Hermann, Erick Rosete-Beas, and Wolfram Burgard. Calvin: A benchmark for language- conditioned policy learning for long-horizon robot manipulation tasks.IEEE Robotics and Automation Letters, 7(3):7327–7334, 2022

work page 2022
[31]

Maniskill: Generalizable manipulation skill benchmark with large-scale demonstrations

Tongzhou Mu, Zhan Ling, Fanbo Xiang, Derek Yang, Xuanlin Li, Stone Tao, Zhiao Huang, Zhiwei Jia, and Hao Su. Maniskill: Generalizable manipulation skill benchmark with large-scale demonstrations. InAdvances in Neural Information Processing Systems (NeurIPS), Track on Datasets and Benchmarks, 2021

work page 2021
[32]

Scalable diffusion models with transformers

William Peebles and Saining Xie. Scalable diffusion models with transformers. InProceedings of the IEEE/CVF international conference on computer vision, pages 4195–4205, 2023

work page 2023
[33]

Memoryvla: Perceptual-cognitive memory in vision-language-action models for robotic manipulation.ICLR, 2026

Hao Shi, Bin Xie, Yingfei Liu, Lin Sun, Fengrong Liu, Tiancai Wang, Erjin Zhou, Haoqiang Fan, Xiangyu Zhang, and Gao Huang. Memoryvla: Perceptual-cognitive memory in vision-language-action models for robotic manipulation.ICLR, 2026

work page 2026
[34]

Perceiver-actor: A multi-task transformer for robotic manipulation

Mohit Shridhar, Lucas Manuelli, and Dieter Fox. Perceiver-actor: A multi-task transformer for robotic manipulation. InConference on Robot Learning, pages 785–799. PMLR, 2023

work page 2023
[35]

Memory systems of the brain: a brief history and current perspective.Neurobiology of learning and memory, 82(3):171–177, 2004

Larry R Squire. Memory systems of the brain: a brief history and current perspective.Neurobiology of learning and memory, 82(3):171–177, 2004

work page 2004
[36]

Memer: Scaling up memory for robot control via experience retrieval.arXiv preprint arXiv:2510.20328, 2025

Ajay Sridhar, Jennifer Pan, Satvik Sharma, and Chelsea Finn. Memer: Scaling up memory for robot control via experience retrieval.arXiv preprint arXiv:2510.20328, 2025

work page arXiv 2025
[37]

Learning to (learn at test time): Rnns with expressive hidden states.ICML, 2025

Yu Sun, Xinhao Li, Karan Dalal, Jiarui Xu, Arjun Vikram, Genghan Zhang, Yann Dubois, Xinlei Chen, Xiaolong Wang, Sanmi Koyejo, et al. Learning to (learn at test time): Rnns with expressive hidden states.ICML, 2025

work page 2025
[38]

Gemini Robotics 1.5: Pushing the Frontier of Generalist Robots with Advanced Embodied Reasoning, Thinking, and Motion Transfer

Gemini Robotics Team, Abbas Abdolmaleki, Saminda Abeyruwan, Joshua Ainslie, Jean-Baptiste Alayrac, Montserrat Gonzalez Arenas, Ashwin Balakrishna, Nathan Batchelor, Alex Bewley, Jeff Bingham, et al. Gemini robotics 1.5: Pushing the frontier of generalist robots with advanced embodied reasoning, thinking, and motion transfer.arXiv preprint arXiv:2510.03342, 2025

[39] Marcel Torne, Andy Tang, Yuejiang Liu, and Chelsea Finn. Learning long-context diffusion policies via past-token prediction. arXiv preprint arXiv:2505.09561, 2025.

[40] Michael Tschannen, Alexey Gritsenko, Xiao Wang, Muhammad Ferjad Naeem, Ibrahim Alabdulmohsin, Nikhil Parthasarathy, Talfan Evans, Lucas Beyer, Ye Xia, Basil Mustafa, et al. SigLIP 2: Multilingual vision-language encoders with improved semantic understanding, localization, and dense features. arXiv preprint arXiv:2502.14786, 2025.

[41] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in Neural Information Processing Systems, 30, 2017.

[42] Chen Wang, Hao Tan, Wang Yifan, Zhiqin Chen, Yuheng Liu, Kalyan Sunkavalli, Sai Bi, Lingjie Liu, and Yiwei Hu. tttlrm: Test-time training for long context and autoregressive 3d reconstruction. arXiv preprint arXiv:2602.20160, 2026.

[43] Yihao Wang, Pengxiang Ding, Lingxiao Li, Can Cui, Zirui Ge, Xinyang Tong, Wenxuan Song, Han Zhao, Wei Zhao, Pengxu Hou, et al. Vla-adapter: An effective paradigm for tiny-scale vision-language-action model. arXiv preprint arXiv:2509.09372, 2025.

[44] Linli Yao, Yicheng Li, Yuancheng Wei, Lei Li, Shuhuai Ren, Yuanxin Liu, Kun Ouyang, Lean Wang, Shicheng Li, Sida Li, et al. Timechat-online: 80% visual tokens are naturally redundant in streaming videos. In Proceedings of the 33rd ACM International Conference on Multimedia, pages 10807–10816, 2025.

[45] Shiduo Zhang, Zhe Xu, Peiju Liu, Xiaopeng Yu, Yuan Li, Qinghui Gao, Zhaoye Fei, Zhangyue Yin, Zuxuan Wu, Yu-Gang Jiang, et al. Vlabench: A large-scale benchmark for language-conditioned robotics manipulation with long-horizon reasoning tasks. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 11142–11152, 2025.

[46] Tianyuan Zhang, Sai Bi, Yicong Hong, Kai Zhang, Fujun Luan, Songlin Yang, Kalyan Sunkavalli, William T Freeman, and Hao Tan. Test-time training done right. arXiv preprint arXiv:2505.23884, 2025.

[47] Yulin Zhou, Yuankai Lin, Fanzhe Peng, Jiahui Chen, Kaiji Huang, Hua Yang, and Zhouping Yin. Mtil: Encoding full history with mamba for temporal imitation learning. IEEE Robotics and Automation Letters, 2025.

[48] Yuke Zhu, Josiah Wong, Ajay Mandlekar, Roberto Martín-Martín, Abhishek Joshi, Soroush Nasiriany, and Yifeng Zhu. robosuite: A modular simulation framework and benchmark for robot learning. arXiv preprint arXiv:2009.12293, 2020.

Appendix Outline

• A. Model Architectures in the ...

• Place one cube of a specified color, e.g., “put one blue cube into the bin and press the button to stop”.

• Place cubes of two specified colors, e.g., “put one blue cube and two green cubes into the bin and press the button to stop”.

• Place cubes of three specified colors, e.g., “put one blue cube, one green cube and two red cubes into the bin and press the button to stop”.

Task Characteristics: To introduce dynamic and history-dependent behavior, we consider two settings randomly selected at environment initialization:

1. Static: all cubes are present at the beginning of the episode.

2. Strea...

• Pick and place one time, e.g., “pick up the blue cube and place it on the target, then press the button to stop”.

• Pick and place multiple times, e.g., “pick up the blue cube and place it on the target, repeating this pick-and-place action three times, then press the button to stop”.

Successful Pick-and-Place: A pick-and-place is considered successful when the robot lifts the cube above a predefined height threshold while maintaining a valid grasp, and then lowers it onto the target ...

• The robot picks up a wrong cube.

• The button is pressed before all required repetitions are completed.

• The robot performs more pick-and-place repetitions than specified.

Figure 14: SwingXtimes Task Example. In this instance, the goal is to pick up the green cube, first move it to the right-side target, then put down the cube on the left-side target (i.e., swing between targets one ...

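The three conditions listed before the Figure 14 caption read as episode-level failure checks for the repeated pick-and-place task. A minimal Python sketch of that logic follows, assuming a hypothetical event log: the `Event` type, the event kinds, and the returned labels are illustrative names and are not taken from the benchmark code.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Event:
    """One logged episode event (hypothetical log format, not the benchmark's)."""
    kind: str                   # "pick_place" or "button"
    cube: Optional[str] = None  # cube color, set only for "pick_place" events


def check_repetition_episode(events: List[Event], target_cube: str, required: int) -> str:
    """Return "success" or a label for the first failure condition that fires."""
    completed = 0
    for event in events:
        if event.kind == "pick_place":
            if event.cube != target_cube:
                return "wrong_cube"              # a wrong cube was picked up
            completed += 1
            if completed > required:
                return "too_many_repetitions"    # more repetitions than specified
        elif event.kind == "button":
            if completed < required:
                return "button_pressed_early"    # pressed before all repetitions
            return "success"
    return "incomplete"                          # the button was never pressed
```

For example, three `Event("pick_place", "blue")` entries followed by `Event("button")` with `required=3` yield `"success"`, while a fourth pick-and-place before the button press yields `"too_many_repetitions"`.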
• Perform one swing cycle, e.g., “Pick up the green cube, move it to the right-side target and then put down the cube on the left-side target and press the button to stop”.

• Perform multiple swing cycles, e.g., “Pick up the green cube, move it to the right-side target and then to the left-side target, repeating this right-to-left swing motion three times, then put down the cube and press the button to stop.”

Successful Reach: A reach is successful when the cube is held nearly upright and positioned within a small tolerance of the...

• The robot picks up the wrong cube.

• The button is pressed before all repetitions are completed.

• The robot reaches either target more than the specified number of times (excessive swings).

Figure 15: StopCube Task Example. In this instance, the goal is to press the button to stop the cube exactly at the target on its second visit.

E.4. StopCube

Table 16: StopCube Task Configura...

Example StopCube instruction: “press the button to stop the cube exactly at the target on its third visit”.

• Static Behavior: To succeed, the robot must position its end-effector over the button and remain static (hovering) until the correct timing.

• Immediate Stop: Pressing the button stops the cube instantly. The robot must press at the exact moment the cube reaches the target zone in the specified cycle, accounting for the motion delay from hover to button.

Success Criteria

• Precise Synchronization: The button must be pressed strictly within the time window when the cube overlaps with the target.

• C...

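The precise-synchronization criterion amounts to a window check on the press time. The sketch below assumes a hypothetical per-step boolean trace of whether the cube would overlap the target if it were never stopped. The function names, the step-based timing, and the fixed `motion_delay` are illustrative assumptions, not the benchmark's implementation.

```python
from typing import List, Tuple


def visit_windows(overlap: List[bool]) -> List[Tuple[int, int]]:
    """Return [start, end) step ranges during which the cube overlaps the target."""
    windows = []
    start = None
    for step, inside in enumerate(overlap):
        if inside and start is None:
            start = step                      # cube enters the target zone
        elif not inside and start is not None:
            windows.append((start, step))     # cube leaves the target zone
            start = None
    if start is not None:                     # trace ended while still overlapping
        windows.append((start, len(overlap)))
    return windows


def press_is_synchronized(overlap: List[bool], press_step: int, visit_index: int) -> bool:
    """True if the press lands inside the window of the requested (1-based) visit."""
    windows = visit_windows(overlap)
    if visit_index < 1 or visit_index > len(windows):
        return False
    start, end = windows[visit_index - 1]
    return start <= press_step < end


def command_window(window: Tuple[int, int], motion_delay: int) -> Tuple[int, int]:
    """Shift a press window earlier by the hover-to-button delay (in steps)."""
    start, end = window
    return (start - motion_delay, end - motion_delay)
```

Under these assumptions, the robot would leave its hover pose at a step inside `command_window(...)` so that the press itself falls inside the overlap window of the specified visit.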
• Pick one container hiding a specified cube, e.g., “Watch the video carefully, then pick up the container hiding the green cube.”

• Pick two containers sequentially (order matters), e.g., “Watch the video carefully, then pick up the container hiding the green cube, and finally pick up the container hiding the red cube.”

Task Characteristics: Each episode consists of a video phase followed by an execution phase.

• Video: Multiple containers are placed on the table. Each cube (red/green/blue) is...

• Pick one specified container by cube color, e.g., “First press the button on the table, then pick up the container hiding the green cube.”

• Pick two specified containers sequentially (order matters), e.g., “First press the button on the table, then pick up the container hiding the green cube, and finally pick up the container hiding the red cube.”

Task Characteristics

• Button Pressing: The robot needs to press the button at the beginning. During the press, multiple containers are concurrently p...

• The robot picks up an incorrect container (i.e., one hiding a non-specified cube).

Figure 18: VideoUnmaskSwap Task Example. In this instance, the robot first watches the video, then picks up the container hiding the blue cube, followed by the one hiding the green cube. The vide...

• Pick one specified container, e.g., “Watch the video carefully, then pick up the container hiding the blue cube.”

• Pick two specified containers sequentially (order matters), e.g., “Watch the video carefully, then pick up the container hiding the blue cube, and finally pick up the container hiding the red cube.”

Task Characteristics: Each episode consists of a video phase followed by an execution phase.

• Video: Multiple containers are placed on the table. Each cube (red/green...

• Pick one specified container, e.g., “First press both buttons on the table, then pick up the container hiding the red cube.”

• Pick two specified containers sequentially (order matters), e.g., “First press both buttons on the table, then pick up the container hiding the green cube, and finally pick up the container hiding the red cube.”

Task Characteristics

• Button Pressing: The robot needs to press both buttons. During the pressing, multiple containers are placed on the table to en...

• The robot picks up any container before completing the button-press phase.

• The robot picks up an incorrect container (i.e., one hiding a non-specified cube).

Figure 20: PickHighlight Task Example. In this instance, the goal is to first press the button, then pick up all cubes highlighted by white area, and finally press the button again to stop. During the bu...

Example PickHighlight instruction: “first press the button, then pick up all highlighted cubes, finally press the button again to stop”.

• The robot fails to press the button before attempting a pick.

• The robot picks up any wrong cube (i.e., a cube that was not highlighted).

Figure 21: VideoRepick Task Example. In this instance, the robot first watches a video, then picks up the same previously picked cube twice, and finally presses the button to stop. Red-bordered frames d...

Example VideoRepick instruction: “Watch the video carefully, then pick up the same cube that was previously picked twice, and finally press the button to stop”.

• the robot picks up the wrong cube,

• the button is pressed before finishing 𝑁 repetitions, or

• the robot completes more than 𝑁 repetitions.

Figure 22: VideoPlaceButton Task Example. In this example, the robot first observes a video depicting an interleaved sequence of cube placements and button presses, and then places the cube onto the correct target corresponding to its ...

Example VideoPlaceButton instruction: “Watch the video carefully and place the blue cube on the target where it was placed immediately before the button was pressed”.

• The robot picks up an incorrect peg.

[39] [39]

Learning Long-Context Diffusion Policies via Past-Token Prediction, May 2025

Marcel Torne, Andy Tang, Yuejiang Liu, and Chelsea Finn. Learning long-context diffusion policies via past-token prediction.arXiv preprint arXiv:2505.09561, 2025

work page arXiv 2025

[40] [40]

SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features

Michael Tschannen, Alexey Gritsenko, Xiao Wang, Muhammad Ferjad Naeem, Ibrahim Alabdulmohsin, Nikhil Parthasarathy, Talfan Evans, Lucas Beyer, Ye Xia, Basil Mustafa, et al. Siglip 2: Multilingual vision- language encoders with improved semantic understanding, localization, and dense features.arXiv preprint arXiv:2502.14786, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[41] [41]

Attention is all you need.Advances in neural information processing systems, 30, 2017

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need.Advances in neural information processing systems, 30, 2017

work page 2017

[42] [42]

tttlrm: Test-time training for long context and autoregressive 3d reconstruction.arXiv preprint arXiv:2602.20160, 2026

Chen Wang, Hao Tan, Wang Yifan, Zhiqin Chen, Yuheng Liu, Kalyan Sunkavalli, Sai Bi, Lingjie Liu, and Yiwei Hu. tttlrm: Test-time training for long context and autoregressive 3d reconstruction.arXiv preprint arXiv:2602.20160, 2026

work page arXiv 2026

[43] [43]

Vla-adapter: An effective paradigm 10 for tiny-scale vision-language-action model.arXiv preprint arXiv:2509.09372, 2025

Yihao Wang, Pengxiang Ding, Lingxiao Li, Can Cui, Zirui Ge, Xinyang Tong, Wenxuan Song, Han Zhao, Wei Zhao, Pengxu Hou, et al. Vla-adapter: An effective paradigm for tiny-scale vision-language-action model. arXiv preprint arXiv:2509.09372, 2025

work page arXiv 2025

[44] [44]

Timechat-online: 80% visual tokens are naturally redundant in streaming videos

Linli Yao, Yicheng Li, Yuancheng Wei, Lei Li, Shuhuai Ren, Yuanxin Liu, Kun Ouyang, Lean Wang, Shicheng Li, Sida Li, et al. Timechat-online: 80% visual tokens are naturally redundant in streaming videos. InProceedings of the 33rd ACM International Conference on Multimedia, pages 10807–10816, 2025. 13 RoboMME: Benchmarking and Understanding Memory for Robo...

work page 2025

[45] [45]

Vlabench: A large-scale benchmark for language-conditioned robotics manipulation with long-horizon reasoning tasks

Shiduo Zhang, Zhe Xu, Peiju Liu, Xiaopeng Yu, Yuan Li, Qinghui Gao, Zhaoye Fei, Zhangyue Yin, Zuxuan Wu, Yu-Gang Jiang, et al. Vlabench: A large-scale benchmark for language-conditioned robotics manipulation with long-horizon reasoning tasks. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 11142–11152, 2025

work page 2025

[46] [46]

Test-Time Training Done Right

Tianyuan Zhang, Sai Bi, Yicong Hong, Kai Zhang, Fujun Luan, Songlin Yang, Kalyan Sunkavalli, William T Freeman, and Hao Tan. Test-time training done right.arXiv preprint arXiv:2505.23884, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[47] [47]

Mtil: Encoding full history with mamba for temporal imitation learning.IEEE Robotics and Automation Letters, 2025

Yulin Zhou, Yuankai Lin, Fanzhe Peng, Jiahui Chen, Kaiji Huang, Hua Yang, and Zhouping Yin. Mtil: Encoding full history with mamba for temporal imitation learning.IEEE Robotics and Automation Letters, 2025

work page 2025

[48] [48]

robosuite: A Modular Simulation Framework and Benchmark for Robot Learning

Yuke Zhu, Josiah Wong, Ajay Mandlekar, Roberto Martín-Martín, Abhishek Joshi, Soroush Nasiriany, and Yifeng Zhu. robosuite: A modular simulation framework and benchmark for robot learning.arXiv preprint arXiv:2009.12293, 2020. 14 RoboMME: Benchmarking and Understanding Memory for Robotic Generalist Policies Appendix Outline •A. Model Architectures in the ...

work page internal anchor Pith review Pith/arXiv arXiv 2009

[49] [49]

put one blue cube into the bin and press the button to stop

Placeonecube of a specified color.e.g., “put one blue cube into the bin and press the button to stop”

work page

[50] [50]

put one blue cube and two green cubes into the bin and press the button to stop

Place cubes oftwospecified colors.e.g., “put one blue cube and two green cubes into the bin and press the button to stop”

work page

[51] [51]

put one blue cube, one green cube and two red cubes into the bin and press the button to stop

Place cubes ofthreespecified colors.e.g., “put one blue cube, one green cube and two red cubes into the bin and press the button to stop”. Task CharacteristicsTo introduce dynamic and history-dependent behavior, we consider two settings randomly selected at environment initialization: 1.Static:all cubes are present at the beginning of the episode. 2.Strea...

work page

[52] [52]

pick up the blue cube and place it on the target, then press the button to stop

Pick and place foronetime.e.g., “pick up the blue cube and place it on the target, then press the button to stop”

work page

[53] [53]

pickupthebluecubeandplaceitonthetarget, repeatingthispick-and-place action three times, then press the button to stop

Pickandplaceformultipletimes.e.g., “pickupthebluecubeandplaceitonthetarget, repeatingthispick-and-place action three times, then press the button to stop”. Successful Pick-and-PlaceA pick-and-place is considered successful when the robot lifts the cube above a predefined height threshold while maintaining a valid grasp, and then lowers it onto the target ...

work page

[54] [54]

The robot picks up a wrong cube

work page

[55] [55]

The button is pressed before all required repetitions are completed

work page

[56] [56]

The robot performs more pick-and-place repetitions than specified. 38 RoboMME: Benchmarking and Understanding Memory for Robotic Generalist Policies Figure14:SwingXtimes Task Example.In this instance, the goal is pick up the green cube, first move it to the right-side target, then put down the cube on the left-side target (i.e., swing between targets one ...

work page

[57] [57]

Pick up the green cube, move it to the right-side target and then put down the cube on the left-side target and press the button to stop

Performoneswing cycle.e.g., “Pick up the green cube, move it to the right-side target and then put down the cube on the left-side target and press the button to stop”

work page

[58] [58]

Performmultipleswing cycles.e.g., “Pick up the green cube, move it to the right-side target and then to the left-side target, repeating this right-to-left swing motion three times, then put down the cube and press the button to stop.”. Successful ReachA reach is successful when the cube is held nearly upright and positioned within a small tolerance of the...

work page

[59] [59]

The robot picks up the wrong cube

work page

[60] [60]

The button is pressed before all repetitions are completed

[61] [61]

press the button to stop the cube exactly at the target on its third visit

The robot reaches either target more than the specified number of times (excessive swings).

Figure 15: StopCube Task Example. In this instance, the goal is to press the button to stop the cube exactly at the target on its second visit.

E.4. StopCube

Table 16: StopCube Task Configura...

[62] [62]

Static Behavior: To succeed, the robot must position its end-effector over the button and remain static (hovering) until the correct timing

[63] [63]

Immediate Stop: Pressing the button stops the cube instantly. The robot must press at the exact moment the cube reaches the target zone in the specified cycle, accounting for the motion delay from hover to button.

Success Criteria
• Precise Synchronization: The button must be pressed strictly within the time window when the cube overlaps with the target.
• C...
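
The timing requirement described above can be illustrated with a short calculation: if the cube overlaps the target during a known window on each visit, and moving from hover to button contact takes a fixed delay, the press motion has to start one delay earlier than the desired contact time. The sketch below assumes exactly that; the function names, the window list, and the constant delay are hypothetical and not taken from the benchmark.

```python
def press_start_window(visit_windows, visit, press_delay):
    """Return (earliest, latest) times to begin the press motion.

    visit_windows: list of (start, end) overlap times, one per visit.
    visit: 1-indexed visit on which the cube must be stopped.
    press_delay: assumed constant time from hover to button contact.
    """
    start, end = visit_windows[visit - 1]
    return start - press_delay, end - press_delay


def press_succeeds(visit_windows, visit, contact_time):
    """True if button contact falls strictly inside the chosen overlap window."""
    start, end = visit_windows[visit - 1]
    return start < contact_time < end


# Illustrative overlap windows for three visits, in seconds.
windows = [(2.0, 2.5), (6.0, 6.5), (10.0, 10.5)]
print(press_start_window(windows, visit=3, press_delay=0.5))
print(press_succeeds(windows, visit=3, contact_time=10.25))
```

The strict inequalities mirror the "strictly within the time window" wording, so a contact exactly at the window boundary is treated as a miss in this sketch.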

[64] [64]

Pick one container hiding a specified cube, e.g., “Watch the video carefully, then pick up the container hiding the green cube.”

[65] [65]

Pick two containers sequentially (order matters), e.g., “Watch the video carefully, then pick up the container hiding the green cube, and finally pick up the container hiding the red cube.”

Task Characteristics
Each episode consists of a video phase followed by an execution phase.
• Video: Multiple containers are placed on the table. Each cube (red/green/blue) is...

[66] [66]

Pick one specified container by cube color, e.g., “First press the button on the table, then pick up the container hiding the green cube.”

[67] [67]

Pick two specified containers sequentially (order matters), e.g., “First press the button on the table, then pick up the container hiding the green cube, and finally pick up the container hiding the red cube.”

Task Characteristics
• Button Pressing: The robot needs to press the button at the beginning. During the press, multiple containers are concurrently p...

[68] [69]

The robot picks up an incorrect container (i.e., one hiding a non-specified cube).

Figure 18: VideoUnmaskSwap Task Example. In this instance, the robot first watches the video, then picks up the container hiding the blue cube, followed by the one hiding the green cube. The vide...

[69] [70]

Pick one specified container, e.g., “Watch the video carefully, then pick up the container hiding the blue cube.”

[70] [71]

Pick two specified containers sequentially (order matters), e.g., “Watch the video carefully, then pick up the container hiding the blue cube, and finally pick up the container hiding the red cube.”

Task Characteristics
Each episode consists of a video phase followed by an execution phase.
• Video: Multiple containers are placed on the table. Each cube (red/green...

[71] [72]

Pick one specified container, e.g., “First press both buttons on the table, then pick up the container hiding the red cube.”

[72] [73]

Pick two specified containers sequentially (order matters), e.g., “First press both buttons on the table, then pick up the container hiding the green cube, and finally pick up the container hiding the red cube.”

Task Characteristics
• Button Pressing: The robot needs to press both buttons. During the pressing, multiple containers are placed on the table to en...

[73] [74]

The robot picks up any container before completing the button-press phase

[74] [75]

first press the button, then pick up all highlighted cubes, finally press the button again to stop

The robot picks up an incorrect container (i.e., one hiding a non-specified cube).

Figure 20: PickHighlight Task Example. In this instance, the goal is to first press the button, then pick up all cubes highlighted by the white area, and finally press the button again to stop. During the bu...

[75] [76]

The robot fails to press the button before attempting a pick

[76] [77]

Watch the video carefully, then pick up the same cube that was previously picked twice, and finally press the button to stop

The robot picks up any wrong cube (i.e., a cube that was not highlighted).

Figure 21: VideoRepick Task Example. In this instance, the robot first watches a video, then picks up the same previously picked cube twice, and finally presses the button to stop. Red-bordered frames d...

[77] [78]

the robot picks up the wrong cube,

[78] [79]

the button is pressed before finishing 𝑁 repetitions, or

[79] [80]

Watch the video carefully and place the blue cube on the target where it was placed immediately before the button was pressed

the robot completes more than 𝑁 repetitions.

Figure 22: VideoPlaceButton Task Example. In this example, the robot first observes a video depicting an interleaved sequence of cube placements and button presses, and then places the cube onto the correct target corresponding to its ...

[80] [81]

The robot picks up an incorrect peg
