Introduces RoboMME-Interference benchmark showing memory-augmented VLAs improve without distractors but decay steadily as unrelated sessions accumulate in history.
Title resolution pending
29 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
years
2026 29representative citing papers
AURA-Mem uses an action-gated recurrent memory trained on closed-loop action error to deliver constant 4,224-byte state and 5-9x fewer writes than baselines while matching base policy success on LIBERO-Long.
ECHO organizes VLA experiences into a hierarchical memory tree in hyperbolic space via autoencoder and entailment constraints, delivering a 12.8% success-rate gain on LIBERO-Long over the pi0 baseline.
π₀.₇ is a steerable generalist robotic model that uses rich multimodal prompts including language, subgoal images, and performance metadata to achieve out-of-the-box generalization across tasks and robot bodies.
PhysMem enables VLM-based robot planners to learn and verify physical properties through test-time interaction and hypothesis testing, raising success on a brick insertion task from 23% to 76%.
Freeform Preference Learning trains language-conditioned multi-axis reward models from human pairwise preferences to produce steerable and compositional robot policies that outperform sparse and binary-preference baselines by 38 percentage points.
DiM-WAM is a memory-augmented world-action model that integrates multi-scale historical events and global task progress to improve long-horizon robot manipulation performance.
KEMO is an event-driven keyframe memory system that improves VLA policy success rates by 23.6% on real dual-arm tasks by selectively preserving task-relevant history via kinematics-visual event detection and gated fusion.
w²VLA restructures VLA information flow to decouple declarative semantics from procedural skills, enabling zero-shot transfer to novel objects.
CAMP learns a compressed behavioral memory from action history to enable success in long-horizon partially observable object manipulation without extra supervision, showing gains over baselines in real-robot and simulation tests.
Vesta is a unified embodied generalist model that outperforms specialist baselines by over 20% on average and improves real-world robotic task success by over 35%.
EventVLA introduces foundational visual anchors and a Keyframe Evidence Memory module that predicts future keyframe probabilities from VLA embeddings to improve long-horizon task success by an average of 40% on 17 simulation and 4 real-world tasks.
SERF conditions VLA policies on online-updated neural point maps of environment and robot to improve long-horizon mobile manipulation on BEHAVIOR-1K.
DAM-VLA decouples per-modality temporal processing in vision-language-action models via latent buffers refreshed at sensor rates, achieving 95.2% average success versus 40.95% for synchronous baselines on seven real-world manipulation tasks while enabling 100 Hz control.
AEM pretrains compact history representations via masked modeling on interleaved vision-action sequences to boost downstream robot manipulation in simulation and real settings.
Adding recurrent memory tokens to VLA models raises success rates on partially observable manipulation tasks from 0.42 to 0.84 on training and 0.07 to 0.23 on held-out tasks while preserving performance under full observability.
VLA models with inference-time steering mitigate action leakage in implicit human-robot collaboration, supporting longer horizons and yielding faster, more reliable assembly than shorter-horizon baselines in a 16-person study.
Hide-and-Seek uses contrastive objectives on trajectories to localize failure signals in VLA models from trajectory-level supervision alone.
RoboMemArena is a new large-scale robotic memory benchmark with real-world tasks, and PrediMem is a dual VLA system that outperforms baselines by managing memory buffers with predictive coding.
ExoActor uses exocentric video generation to implicitly model robot-environment-object interactions and converts the resulting videos into task-conditioned humanoid control sequences.
CLWM with DINOv3 targets, O(1) TTT memory, SAI latency masking, and EmbodiChain training achieves SOTA dual-arm simulation performance and zero-shot sim-to-real transfer that beats real-data finetuned baselines.
FOCA improves few-shot VLA adaptation by explicitly predicting future interaction embeddings and implicitly aligning to goal observations, yielding up to 26% gains on real robots with only 20 demonstrations.
MemoryVAM integrates a Perceiver-based Recap Compressor and Cue Gate into video action models, raising success rates on long-horizon manipulation from 5% to 42.5% on LIBERO-Mem and 75-80% on real-robot counting, spatial recall, and tracking tasks.
Autoregressive VLA policies achieve real-time execution via tokenization horizon adjustment and constrained decoding, outperforming flow-matching policies in speed and performance across simulated and real environments.
citing papers explorer
-
${\pi}_{0.7}$: a Steerable Generalist Robotic Foundation Model with Emergent Capabilities
π₀.₇ is a steerable generalist robotic model that uses rich multimodal prompts including language, subgoal images, and performance metadata to achieve out-of-the-box generalization across tasks and robot bodies.
-
$\mu$VLA: On Recurrent Memory for Partially Observable Manipulation in VLA Models
Adding recurrent memory tokens to VLA models raises success rates on partially observable manipulation tasks from 0.42 to 0.84 on training and 0.07 to 0.23 on held-out tasks while preserving performance under full observability.