Mem: Multi-scale embodied memory for vision language action models. arXiv preprint arXiv:2603.03596, 2026

Marcel Torne, Karl Pertsch, Homer Walke, Kyle Vedder, Suraj Nair, Brian Ichter, Allen Z. Ren, Haohuan Wang, Jiaming Tang, Kyle Stachowicz, et al. · 2026 · arXiv 2603.03596

16 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

background: 3 · method: 1

citation-polarity summary

background: 3 · use method: 1

representative citing papers

AURA: Action-Gated Memory for Robot Policies at Constant VRAM

cs.AI · 2026-06-01 · unverdicted · novelty 7.0

AURA-Mem uses an action-gated recurrent memory trained on closed-loop action error to deliver constant 4,224-byte state and 5-9x fewer writes than baselines while matching base policy success on LIBERO-Long.

ECHO: Continuous Hierarchical Memory for Vision-Language-Action Models

cs.RO · 2026-05-09 · unverdicted · novelty 7.0

ECHO organizes VLA experiences into a hierarchical memory tree in hyperbolic space via autoencoder and entailment constraints, delivering a 12.8% success-rate gain on LIBERO-Long over the pi0 baseline.

${\pi}_{0.7}$: a Steerable Generalist Robotic Foundation Model with Emergent Capabilities

cs.LG · 2026-04-16 · unverdicted · novelty 7.0

π₀.₇ is a steerable generalist robotic model that uses rich multimodal prompts including language, subgoal images, and performance metadata to achieve out-of-the-box generalization across tasks and robot bodies.

PhysMem: Scaling Test-Time Memory for Embodied Physical Reasoning

cs.RO · 2026-02-23 · unverdicted · novelty 7.0

PhysMem enables VLM-based robot planners to learn and verify physical properties through test-time interaction and hypothesis testing, raising success on a brick insertion task from 23% to 76%.

Freeform Preference Learning for Robotic Manipulation

cs.RO · 2026-06-30 · unverdicted · novelty 6.0

Freeform Preference Learning trains language-conditioned multi-axis reward models from human pairwise preferences to produce steerable and compositional robot policies that outperform sparse and binary-preference baselines by 38 percentage points.

DIM-WAM: World-Action Modeling with Diverse Historical Event Memory

cs.RO · 2026-06-26 · unverdicted · novelty 6.0

DiM-WAM is a memory-augmented world-action model that integrates multi-scale historical events and global task progress to improve long-horizon robot manipulation performance.

EventVLA: Event-Driven Visual Evidence Memory for Long-Horizon Vision-Language-Action Policies

cs.CV · 2026-06-18 · unverdicted · novelty 6.0

EventVLA introduces foundational visual anchors and a Keyframe Evidence Memory module that predicts future keyframe probabilities from VLA embeddings to improve long-horizon task success by an average of 40% on 17 simulation and 4 real-world tasks.

$\mu$VLA: On Recurrent Memory for Partially Observable Manipulation in VLA Models

cs.LG · 2026-06-10 · unverdicted · novelty 6.0

Adding recurrent memory tokens to VLA models raises success rates on partially observable manipulation tasks from 0.42 to 0.84 on training and 0.07 to 0.23 on held-out tasks while preserving performance under full observability.

Hide-and-Seek in Trajectories: Discovering Failure Signals for VLA Runtime Monitoring

cs.RO · 2026-05-29 · unverdicted · novelty 6.0

Hide-and-Seek uses contrastive objectives on trajectories to localize failure signals in VLA models from trajectory-level supervision alone.

RoboMemArena: A Comprehensive and Challenging Robotic Memory Benchmark

cs.RO · 2026-05-11 · unverdicted · novelty 6.0

RoboMemArena is a new large-scale robotic memory benchmark with real-world tasks, and PrediMem is a dual VLA system that outperforms baselines by managing memory buffers with predictive coding.

ExoActor: Exocentric Video Generation as Generalizable Interactive Humanoid Control

cs.RO · 2026-04-30 · unverdicted · novelty 6.0

ExoActor uses exocentric video generation to implicitly model robot-environment-object interactions and converts the resulting videos into task-conditioned humanoid control sequences.

DexWorldModel: Causal Latent World Modeling towards Automated Learning of Embodied Tasks

cs.CV · 2026-04-13 · unverdicted · novelty 6.0

CLWM with DINOv3 targets, O(1) TTT memory, SAI latency masking, and EmbodiChain training achieves SOTA dual-arm simulation performance and zero-shot sim-to-real transfer that beats real-data finetuned baselines.

Embodied-R1.5: Evolving Physical Intelligence via Embodied Foundation Models

cs.RO · 2026-06-09 · unverdicted · novelty 5.0

Embodied-R1.5 is an 8B EFM achieving SOTA on 16 of 24 embodied VLM benchmarks, fine-tunable to outperform leading VLAs, with claimed zero-shot real-robot generalization.

MemoryVLA++: Temporal Modeling via Memory and Imagination in Vision-Language-Action Models

cs.RO · 2026-06-08 · unverdicted · novelty 5.0

MemoryVLA++ integrates a perceptual-cognitive memory bank and denoising world model into VLA models to enable temporal reasoning, yielding performance gains on manipulation benchmarks and real-robot tasks.

Goal2Skill: Long-Horizon Manipulation with Adaptive Planning and Reflection

cs.RO · 2026-04-15 · unverdicted · novelty 5.0

A dual VLM-VLA framework for long-horizon robot manipulation achieves 32.4% success on RMBench tasks versus 9.8% for the strongest baseline via structured memory and closed-loop adaptive replanning.

Safe Embodied AI for Long-horizon Tasks: A Cross-layer Analysis of Robotic Manipulation

cs.RO · 2026-06-04 · unverdicted · novelty 3.0

A structured literature survey of safety mechanisms in long-horizon robotic manipulation organized by intervention timing and strength of supporting evidence.

citing papers explorer

Showing 16 of 16 citing papers.

AURA: Action-Gated Memory for Robot Policies at Constant VRAM · cs.AI · 2026-06-01 · unverdicted · none · ref 58
ECHO: Continuous Hierarchical Memory for Vision-Language-Action Models · cs.RO · 2026-05-09 · unverdicted · none · ref 5
${\pi}_{0.7}$: a Steerable Generalist Robotic Foundation Model with Emergent Capabilities · cs.LG · 2026-04-16 · unverdicted · none · ref 37
PhysMem: Scaling Test-Time Memory for Embodied Physical Reasoning · cs.RO · 2026-02-23 · unverdicted · none · ref 61
Freeform Preference Learning for Robotic Manipulation · cs.RO · 2026-06-30 · unverdicted · none · ref 25
DIM-WAM: World-Action Modeling with Diverse Historical Event Memory · cs.RO · 2026-06-26 · unverdicted · none · ref 16
EventVLA: Event-Driven Visual Evidence Memory for Long-Horizon Vision-Language-Action Policies · cs.CV · 2026-06-18 · unverdicted · none · ref 9
$\mu$VLA: On Recurrent Memory for Partially Observable Manipulation in VLA Models · cs.LG · 2026-06-10 · unverdicted · none · ref 60
Hide-and-Seek in Trajectories: Discovering Failure Signals for VLA Runtime Monitoring · cs.RO · 2026-05-29 · unverdicted · none · ref 7
RoboMemArena: A Comprehensive and Challenging Robotic Memory Benchmark · cs.RO · 2026-05-11 · unverdicted · none · ref 13
ExoActor: Exocentric Video Generation as Generalizable Interactive Humanoid Control · cs.RO · 2026-04-30 · unverdicted · none · ref 25
DexWorldModel: Causal Latent World Modeling towards Automated Learning of Embodied Tasks · cs.CV · 2026-04-13 · unverdicted · none · ref 25
Embodied-R1.5: Evolving Physical Intelligence via Embodied Foundation Models · cs.RO · 2026-06-09 · unverdicted · none · ref 64
MemoryVLA++: Temporal Modeling via Memory and Imagination in Vision-Language-Action Models · cs.RO · 2026-06-08 · unverdicted · none · ref 61
Goal2Skill: Long-Horizon Manipulation with Adaptive Planning and Reflection · cs.RO · 2026-04-15 · unverdicted · none · ref 16
Safe Embodied AI for Long-horizon Tasks: A Cross-layer Analysis of Robotic Manipulation · cs.RO · 2026-06-04 · unverdicted · none · ref 282
