RoboMME: Benchmarking and Understanding Memory for Robotic Generalist Policies

Chelsea Finn; Haoran Zhang; Hongze Fu; Jayjun Lee; Jianing Yang; Joyce Chai; Nima Fazeli; Yinpei Dai; Yuejiang Liu

arxiv: 2603.04639 · v3 · pith:PF2SNA4Nnew · submitted 2026-03-04 · 💻 cs.RO · cs.AI

Yinpei Dai, Hongze Fu, Jayjun Lee, Yuejiang Liu, Haoran Zhang, Jianing Yang, Chelsea Finn, Nima Fazeli, Joyce Chai

Pith reviewed 2026-05-21 12:23 UTC · model grok-4.3

keywords: robotic manipulation, memory mechanisms, vision-language-action models, long-horizon tasks, benchmark evaluation, task-dependent performance, generalist policies

The pith

Memory representations for robotic policies show effectiveness that depends on the specific task rather than a single best design.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to create a consistent way to compare how different memory mechanisms help vision-language-action models handle robotic tasks that unfold over many steps and require recalling past information. It organizes evaluation around four memory categories and builds sixteen tasks to test them. Multiple variants are then created by adding different memory structures to one base model. A sympathetic reader would care because these kinds of tasks appear in everyday manipulation yet current models lack reliable ways to track history. If the findings are correct, future work can focus on choosing or combining memory types according to what a given task actually requires instead of searching for one universal solution.

Core claim

The authors claim that the effectiveness of memory representations is highly task-dependent, with each design offering distinct advantages and limitations across different tasks. This conclusion rests on a benchmark of sixteen manipulation tasks built under a taxonomy of temporal, spatial, object, and procedural memory, together with experiments on fourteen memory-augmented variants of a single base model.

What carries the argument

A taxonomy that divides memory requirements into temporal, spatial, object, and procedural categories, used to structure both the creation of test tasks and the comparison of integration strategies.

If this is right

Model builders should match memory mechanisms to the dominant requirement of a task, such as counting steps for temporal needs or recovering from occlusions for object needs.
Standardized benchmarks make it possible to measure incremental progress in history-dependent robotic manipulation instead of relying on isolated demonstrations.
Generalist policies may need to incorporate multiple memory types or switch between them when facing varied task demands.
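As a toy illustration of why matching memory to the task matters, the sketch below contrasts two deliberately simplified designs. Both classes, the event names, and the window size are hypothetical and are not the paper's models or tasks. A sliding window keeps recent order but forgets old events, while a running counter keeps exact counts but discards order.

```python
from collections import deque
from typing import Deque, Dict, List


class SlidingWindowMemory:
    """Hypothetical toy design: remembers only the last `size` events, in order."""

    def __init__(self, size: int) -> None:
        self.buffer: Deque[str] = deque(maxlen=size)

    def observe(self, event: str) -> None:
        self.buffer.append(event)  # the oldest event is dropped once the window is full

    def count(self, event: str) -> int:
        return sum(1 for seen in self.buffer if seen == event)

    def recent(self) -> List[str]:
        return list(self.buffer)


class RunningCountMemory:
    """Hypothetical toy design: keeps a per-event counter and nothing else."""

    def __init__(self) -> None:
        self.counts: Dict[str, int] = {}

    def observe(self, event: str) -> None:
        self.counts[event] = self.counts.get(event, 0) + 1

    def count(self, event: str) -> int:
        return self.counts.get(event, 0)


# Illustrative event stream (made up for this sketch).
events = ["place_green", "place_red", "place_green", "place_green", "place_red"]
window = SlidingWindowMemory(size=3)
counter = RunningCountMemory()
for event in events:
    window.observe(event)
    counter.observe(event)

print(window.count("place_green"), counter.count("place_green"))  # 2 3
print(window.recent())  # order of the last three events; the counter cannot answer this
```

In this sketch the window undercounts once the episode outgrows it, while the counter cannot say what happened most recently, which mirrors the kind of trade-off the list above describes.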

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Hybrid memory systems that detect task features and activate the most suitable representation could extend performance across a wider range of scenarios.
Applying the same taxonomy to physical robot experiments would reveal whether simulation results hold when sensor noise and actuation errors are present.
Similar task-dependent patterns may appear in other sequential control problems such as navigation or assembly planning.

Load-bearing premise

The four memory categories accurately capture the needs of real long-horizon robotic manipulation and the sixteen tasks are representative enough to support general conclusions.

What would settle it

A follow-up test in which one memory design outperforms all others on every task in the set or on a fresh collection of long-horizon tasks that still fit the same overall description.
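The falsification test described above can be phrased as a dominance check over a table of per-task results. The following is a minimal sketch under stated assumptions: the function name, the variant and task labels, and all success rates are hypothetical and are not taken from the paper.

```python
from typing import Dict, Optional

# Mapping: variant name -> task name -> success rate.
Scores = Dict[str, Dict[str, float]]


def dominant_variant(scores: Scores) -> Optional[str]:
    """Return a variant that strictly beats every other variant on every task, else None."""
    for name, own in scores.items():
        rivals = [table for other, table in scores.items() if other != name]
        if rivals and all(own[task] > rival[task] for rival in rivals for task in own):
            return name
    return None


# Hypothetical numbers for illustration only; NOT results from the paper.
example_scores: Scores = {
    "variant_a": {"counting": 0.80, "occlusion": 0.40},
    "variant_b": {"counting": 0.55, "occlusion": 0.70},
}
print(dominant_variant(example_scores))  # None: each variant wins on a different task
```

A non-`None` result on the full task set, or on a fresh set of comparable tasks, would count against the task-dependence claim. A real analysis would also need to account for seed-to-seed variance before calling any gap a win.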

Figures

Figures reproduced from arXiv: 2603.04639 by Chelsea Finn, Haoran Zhang, Hongze Fu, Jayjun Lee, Jianing Yang, Joyce Chai, Nima Fazeli, Yinpei Dai, Yuejiang Liu.

**Figure 1.** RoboMME is a large-scale robotic benchmark for evaluating memory-augmented manipulation, comprising four task suites that emphasize distinct memory demands. (1) The Counting suite targets temporal memory, requiring robots to accumulate and reason over past events, e.g., counting placed green cubes and stopping correctly (top-left). (2) The Permanence suite focuses on spatial memory, requiring tracking of o…

Original abstract

Memory is critical for long-horizon and history-dependent robotic manipulation. Such tasks often involve counting repeated actions or manipulating objects that become temporarily occluded. Recent vision-language-action (VLA) models have begun to incorporate memory mechanisms; however, their evaluations remain confined to narrow, non-standardized settings. This limits systematic understanding, comparison, and progress measurement. To address these challenges, we introduce RoboMME: a large-scale standardized benchmark for evaluating and advancing VLA models in long-horizon, history-dependent scenarios. Our benchmark comprises 16 manipulation tasks constructed under a carefully designed taxonomy that evaluates temporal, spatial, object, and procedural memory. We further develop a suite of 14 memory-augmented VLA variants built on the π0.5 backbone to systematically explore different memory representations across multiple integration strategies. Experimental results show that the effectiveness of memory representations is highly task-dependent, with each design offering distinct advantages and limitations across different tasks. Videos and code can be found at our website https://robomme.github.io.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

RoboMME introduces a new benchmark and taxonomy for memory in robotic VLAs with 14 variants on π0.5, but the task-dependent results may trace to capacity differences rather than the memory designs themselves.

The main things to know are that this paper creates RoboMME, a standardized benchmark of 16 long-horizon manipulation tasks organized by a taxonomy of temporal, spatial, object, and procedural memory, and then evaluates 14 memory-augmented variants built on the π0.5 backbone. The central observation is that which memory approach helps depends on the specific task, such as counting repeated actions or handling temporary occlusions.

This setup fills a real gap in how VLA models get tested for history-dependent work, and the systematic variants plus public code and videos make the contribution concrete rather than just another model tweak. The benchmark itself looks like something others could actually use for comparisons.

On the soft spots, the stress-test concern lands: the 14 variants use different integration strategies that likely shift parameter counts, hidden dimensions, or training behavior, yet there is no mention of capacity-matched controls or FLOPs reporting. That leaves open the possibility that the reported advantages come from incidental model differences instead of the intended memory categories. The abstract also gives no error bars, significance tests, or task-construction details, which makes the task-dependent claim harder to weigh without the full methods section. If those controls and stats are absent from the paper too, it weakens the attribution.

This work is aimed at researchers building or evaluating generalist robotic policies who need better ways to measure memory. Anyone working on VLA models for long-horizon tasks would get practical value from trying the benchmark or the variants. It has enough new artifacts and addresses a clear evaluation gap to deserve a serious referee, even if the experimental design needs tightening on capacity and statistics. I would send it to peer review.

Referee Report

2 major / 2 minor

Summary. The paper introduces RoboMME, a standardized benchmark of 16 long-horizon robotic manipulation tasks organized under a taxonomy of temporal, spatial, object, and procedural memory. It constructs 14 memory-augmented variants of the π0.5 VLA backbone via different integration strategies (recurrent, attention-over-history, external memory) and reports that memory effectiveness is highly task-dependent, with each design showing distinct advantages and limitations.

Significance. If the results hold after addressing controls, this benchmark could standardize evaluation of memory mechanisms in VLA models and clarify design trade-offs for long-horizon robotics. The public release of code and videos supports reproducibility and is a clear strength.

major comments (2)

[§4] §4 (Experimental variants): The 14 variants are built by different integration strategies on the π0.5 backbone, yet the manuscript provides no parameter counts, FLOPs, or capacity-matched controls. This leaves open the possibility that task-dependent performance gaps reflect incidental differences in model capacity or training dynamics rather than intrinsic properties of the temporal/spatial/object/procedural taxonomy.
[Results] Results section and abstract: The central claim of task-dependent effectiveness is presented without error bars, statistical significance tests, details on task construction, or data exclusion criteria. This makes it impossible to assess whether the reported patterns are reliable or generalizable beyond the specific 16 tasks.

minor comments (2)

[§3] The taxonomy is introduced without explicit validation against real-world long-horizon task distributions; a short discussion or reference to how the 16 tasks were selected would strengthen the claim of representativeness.
[Figures/Tables] Figure legends and tables comparing the 14 variants should include explicit capacity metrics to aid interpretation.
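The capacity metrics requested in the comments above could be reported as relative parameter overhead per variant. The snippet below is a minimal sketch, assuming hypothetical parameter counts and variant labels; none of these numbers come from the paper or from the π0.5 model.

```python
from typing import Dict


def relative_overhead(backbone_params: int, variant_params: int) -> float:
    """Fraction of extra parameters a variant adds on top of the backbone."""
    if backbone_params <= 0:
        raise ValueError("backbone_params must be positive")
    return (variant_params - backbone_params) / backbone_params


# Hypothetical counts for illustration only; NOT measured values.
backbone = 3_000_000_000
variants: Dict[str, int] = {
    "recurrent": 3_060_000_000,
    "external_memory": 3_120_000_000,
}

for name, count in variants.items():
    print(f"{name}: {relative_overhead(backbone, count):.1%} extra parameters")
```

Reporting this column next to each success rate would let a reader see at a glance whether a performance gap coincides with a capacity gap. Parameter count alone does not capture compute, so FLOPs would need a separate measurement.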

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive review. The comments highlight important aspects of experimental rigor that we will address to strengthen the manuscript. We respond to each major comment below and indicate the planned revisions.

Referee: [§4] §4 (Experimental variants): The 14 variants are built by different integration strategies on the π0.5 backbone, yet the manuscript provides no parameter counts, FLOPs, or capacity-matched controls. This leaves open the possibility that task-dependent performance gaps reflect incidental differences in model capacity or training dynamics rather than intrinsic properties of the temporal/spatial/object/procedural taxonomy.

Authors: We agree that parameter counts and FLOPs are necessary for transparent comparison. In the revised version we will add a dedicated table reporting parameter counts and estimated FLOPs for each of the 14 variants relative to the π0.5 backbone. On capacity-matched controls, the variants modify only the memory integration module while freezing the core VLA weights and architecture; this keeps overall capacity differences modest (typically <5% additional parameters). Nevertheless, to directly address the concern we will include a new paragraph discussing capacity implications and, where feasible, report results from a capacity-matched ablation that equalizes total parameters across representative variants. revision: partial
Referee: [Results] Results section and abstract: The central claim of task-dependent effectiveness is presented without error bars, statistical significance tests, details on task construction, or data exclusion criteria. This makes it impossible to assess whether the reported patterns are reliable or generalizable beyond the specific 16 tasks.

Authors: We accept that the current presentation lacks sufficient statistical detail. We will augment all result figures with error bars computed over multiple random seeds and add statistical significance tests (paired t-tests with Bonferroni correction) between memory variants on each task. Task construction details appear in Section 3, but we will expand this section with explicit criteria used to isolate temporal, spatial, object, and procedural memory requirements. No data points were excluded from the reported results; we will state this explicitly and describe the full evaluation protocol (including number of trials per task) to improve reproducibility and generalizability assessment. revision: yes
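The statistics promised in the response above (error bars over seeds, paired tests, Bonferroni correction) can be sketched with the standard library alone. This is an illustrative sketch under stated assumptions: the function names, the per-seed success rates, and the choice of 16 comparisons are hypothetical, and the code stops at the t statistic rather than a p-value.

```python
import math
import statistics
from typing import List, Tuple


def mean_and_sem(values: List[float]) -> Tuple[float, float]:
    """Mean and standard error of the mean, e.g. across random seeds."""
    if len(values) < 2:
        raise ValueError("need at least two values")
    return statistics.mean(values), statistics.stdev(values) / math.sqrt(len(values))


def paired_t_statistic(a: List[float], b: List[float]) -> float:
    """t statistic of the paired differences a[i] - b[i], with len(a) - 1 degrees of freedom."""
    if len(a) != len(b):
        raise ValueError("paired samples must have equal length")
    mean_diff, sem_diff = mean_and_sem([x - y for x, y in zip(a, b)])
    if sem_diff == 0:
        raise ValueError("all paired differences are identical; t is undefined")
    return mean_diff / sem_diff


def bonferroni_alpha(alpha: float, num_comparisons: int) -> float:
    """Per-comparison significance threshold under Bonferroni correction."""
    if num_comparisons < 1:
        raise ValueError("num_comparisons must be at least 1")
    return alpha / num_comparisons


# Hypothetical per-seed success rates for two variants on one task.
variant_a = [0.70, 0.75, 0.80]
variant_b = [0.60, 0.70, 0.65]

print(mean_and_sem(variant_a))
print(paired_t_statistic(variant_a, variant_b))
print(bonferroni_alpha(0.05, 16))  # threshold if one comparison is made per task
```

Turning the t statistic into a p-value requires the Student t distribution, which the standard library does not provide. SciPy's `scipy.stats.ttest_rel` computes both in one call, and its p-value can then be compared against the corrected threshold.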

Circularity Check

0 steps flagged

Empirical benchmark with no derivation chain or self-referential reductions

This is a purely empirical benchmarking paper that introduces a taxonomy of memory types, constructs 16 tasks, and evaluates 14 variants on a fixed backbone through direct experimentation. No mathematical derivations, first-principles predictions, or fitted parameters are claimed to produce new results; the central claims rest on observed performance differences across tasks. The taxonomy and variants are presented as design choices for systematic comparison rather than outputs derived from prior results within the paper. No self-citation is used to justify uniqueness or forbid alternatives, and no step reduces to an input by construction. The skeptic concern about capacity matching is a validity issue, not a circularity reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The work rests on the domain assumption that the four-category memory taxonomy is sufficient and that tasks constructed under it reflect genuine robotic needs; no free parameters or invented entities are introduced.

axioms (1)

Domain assumption: the taxonomy of temporal, spatial, object, and procedural memory covers the relevant history-dependent aspects of robotic manipulation tasks.
Invoked when constructing the 16 tasks and interpreting results as generalizable.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean reality_from_one_distinction unclear

Relation between the paper passage and the cited Recognition theorem.

RoboMME categorizes memory into four cognitive dimensions: (1) temporal memory for event accumulation and ordering; (2) spatial memory for tracking object locations under occlusion...

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

RoboMemArena: A Comprehensive and Challenging Robotic Memory Benchmark
cs.RO · 2026-05 · unverdicted · novelty 6.0

RoboMemArena is a new large-scale robotic memory benchmark with real-world tasks, and PrediMem is a dual VLA system that outperforms baselines by managing memory buffers with predictive coding.

vla-eval: A Unified Evaluation Harness for Vision-Language-Action Models
cs.AI · 2026-03 · accept · novelty 5.0

vla-eval decouples VLA model inference from benchmark execution via WebSocket and Docker, supporting 14 benchmarks with up to 47x speedup and reproducing published scores across six codebases.

World Action Models: The Next Frontier in Embodied AI
cs.RO · 2026-05 · unverdicted · novelty 4.0

The paper introduces World Action Models as a new paradigm unifying predictive world modeling with action generation in embodied foundation models and provides a taxonomy of existing approaches.