EventVLA: Event-Driven Visual Evidence Memory for Long-Horizon Vision-Language-Action Policies

Ganlin Yang; Jiafei Cao; Jiaqi Peng; Jifeng Dai; Jing Xiong; Junyi Dong; Sitong Mao; Tai Wang; Tianxing Chen; Wengang Zhou

arxiv: 2606.20092 · v1 · pith:RSNYAYAWnew · submitted 2026-06-18 · 💻 cs.CV

EventVLA: Event-Driven Visual Evidence Memory for Long-Horizon Vision-Language-Action Policies

Ganlin Yang , Zhangzheng Tu , Yuqiang Yang , Sitong Mao , Junyi Dong , Tianxing Chen , Jiaqi Peng , Jing Xiong

show 5 more authors

Jiafei Cao Jifeng Dai Wengang Zhou Yao Mu Tai Wang

This is my paper

Pith reviewed 2026-06-26 18:00 UTC · model grok-4.3

classification 💻 cs.CV

keywords EventVLAKeyframe Evidence MemoryVision-Language-Action policiesLong-horizon robotic manipulationSparse visual memoryNon-Markovian tasksBimanual manipulationVisual evidence prediction

0 comments

The pith

EventVLA stores only future-critical visual keyframes by predicting their probabilities directly from a VLA policy's latent state.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces EventVLA to solve memory loss in long-horizon robot tasks where relevant visual cues become hidden or change. It keeps an end-to-end Vision-Language-Action policy but adds a sparse evidence buffer that decides on the fly which frames to retain. The decision uses a Keyframe Evidence Memory module that looks at the current internal representation and estimates which observations will still matter later. This replaces both full history buffers and separate memory systems. On seventeen simulated tasks that require remembering past states plus four real bimanual manipulation tasks, the method raises average success rates by forty percent compared with prior memory-augmented VLAs.

Core claim

EventVLA is an end-to-end framework that combines foundational visual anchors for short-term context with a dynamic Keyframe Evidence Memory module; the module predicts future keyframe probabilities straight from the policy's latent embeddings, thereby capturing and storing only the sparse visual events whose causal utility will matter after the current observation becomes unavailable.

What carries the argument

Keyframe Evidence Memory (KEM) module that predicts future keyframe probabilities from the VLA's latent embeddings to select and retain transient task-critical visual evidence.

If this is right

The policy can evaluate the future causal utility of each observation before the evidence vanishes.
Visual storage remains sparse, avoiding both information bottlenecks and accumulation of redundant frames.
The same architecture works for both simulated non-Markovian tasks and real-world bimanual manipulation.
No separate memory network or high-latency dual system is required.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same prediction-from-latent idea might apply to any policy whose internal state already encodes task progress, not only VLAs.
If the keyframe predictor can be trained with less supervision, the approach could extend to settings where labeled future-critical frames are expensive to obtain.
Storing only predicted-critical frames could lower the memory footprint enough to run longer-horizon tasks on resource-limited robots.

Load-bearing premise

Predicting future keyframe probabilities from the current latent embeddings will correctly identify which visual evidence will remain useful before it disappears.

What would settle it

Running the same tasks while replacing the learned keyframe-probability predictor with random or fixed selection and measuring whether success rates fall back to the level of prior memory-augmented baselines.

Figures

Figures reproduced from arXiv: 2606.20092 by Ganlin Yang, Jiafei Cao, Jiaqi Peng, Jifeng Dai, Jing Xiong, Junyi Dong, Sitong Mao, Tai Wang, Tianxing Chen, Wengang Zhou, Yao Mu, Yuqiang Yang, Zhangzheng Tu.

**Figure 1.** Figure 1: Overview of EventVLA. EventVLA tackles long-horizon, memory-requiring manipulation tasks by storing sparse, task-critical visual evidence. The figure illustrates the (a) non-Markovian challenge, (b) our proposed and evaluated benchmarks, (c) event-driven memory design, and (d) strong gains across simulation and real-world tasks. To avoid the massive redundancy of standard memory buffers [12, 14], we identi… view at source ↗

**Figure 2.** Figure 2: EventVLA framework. EventVLA maintains a sparse visual evidence memory composed of foundational visual anchors and interaction-driven event keyframes, and uses the KEM module to proactively commit task-critical future key observations into memory. To actively capture transient, interaction-driven events that foundational visual anchors inherently miss, such as the brief exposure of an occluded object, we … view at source ↗

**Figure 3.** Figure 3: Overview of the 8 evaluation tasks in the RoboTwin-MeM benchmark. To rigorously evaluate the capacity for intermediate visual evidence retention, each task is explicitly parameterized by n (ranges from 1 to 5), denoting the exact number of transient, interaction-driven keyframes that must be memorized to succeed. These task-critical intermediate events are highlighted with blue borders. visual anchors At a… view at source ↗

**Figure 4.** Figure 4: Real-world experimental setups and results on the ARX ACONE bimanual robot. We [PITH_FULL_IMAGE:figures/full_fig_p008_4.png] view at source ↗

**Figure 5.** Figure 5: Expanded real-world execution sequences of EventVLA across the four manipulation [PITH_FULL_IMAGE:figures/full_fig_p016_5.png] view at source ↗

**Figure 6.** Figure 6: Qualitative rollouts of EventVLA on four RoboTwin-MeM simulation tasks: [PITH_FULL_IMAGE:figures/full_fig_p021_6.png] view at source ↗

**Figure 7.** Figure 7: Qualitative rollouts of EventVLA on the remaining four RoboTwin-MeM simulation tasks: [PITH_FULL_IMAGE:figures/full_fig_p022_7.png] view at source ↗

**Figure 8.** Figure 8: Qualitative real-world robot execution sequences of EventVLA on four tasks: [PITH_FULL_IMAGE:figures/full_fig_p023_8.png] view at source ↗

read the original abstract

Memory remains a critical bottleneck for long-horizon robotic manipulation, as standard Vision-Language-Action (VLA) policies often fail when task-relevant cues become occluded or unobservable over time. While existing memory-augmented methods utilize historical context, they either suffer from severe information bottlenecks, incur high latency via decoupled dual systems, or rely on unselective buffers that accumulate massive visual redundancies. To address these limitations, we introduce EventVLA, an end-to-end framework founded on the concept of sparse visual evidence memory that comprises two core components: foundational visual anchors to retain initial and short-term contexts, and a dynamic Keyframe Evidence Memory (KEM) module. Specifically, KEM directly predicts future keyframe probabilities from the VLA's latent embeddings to autonomously capture and store sparse, task-critical visual events. This foresight-driven mechanism empowers the policy to dynamically evaluate the future causal utility of current observations, preserving transient visual evidence before it becomes unobservable. Furthermore, we propose RoboTwin-MeM, a diagnostic benchmark specifically designed to evaluate non-Markovian manipulation tasks with interactive visual evidence. Extensive evaluations show that across 17 memory-requiring simulation tasks and 4 real-world bimanual tasks, EventVLA achieves an average success rate improvement of +40% over state-of-the-art memory-augmented VLAs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

EventVLA's KEM predictor from latent embeddings targets a real memory gap in long-horizon VLAs, but the +40% claim sits on thin experimental detail.

read the letter

The main thing here is that EventVLA adds a foresight-driven Keyframe Evidence Memory that predicts future keyframe utility straight from the VLA latent embeddings, paired with a new benchmark for non-Markovian tasks. That selective storage idea is the concrete piece worth checking.

The paper frames the problem clearly: standard buffers either bloat with redundancy or use decoupled systems that add latency. The split into foundational anchors plus the dynamic KEM, plus the RoboTwin-MeM benchmark for tasks needing recall of transient interactive evidence, gives a focused testbed. The evaluations span 17 simulation tasks and 4 real bimanual ones, which is a reasonable scope for the subfield.

The soft spot is the lack of supporting breakdowns on the central mechanism. The abstract states the average success-rate gain but supplies no architecture for the probability predictor, no training objective, no error analysis on missed or false-positive keyframes, and no ablations that isolate KEM from the rest of the policy. Without those, it is hard to tell whether the reported improvement comes from accurate foresight or from other changes in the setup. The stress-test concern lands because prediction errors could either drop critical cues or inject noise, yet nothing in the provided text lets us judge the trade-off.

This is for roboticists working on VLA models and memory for extended manipulation. Someone testing selective storage on partial-observability tasks could use the benchmark and the overall framing.

I would send it for peer review. The problem is genuine and the proposed mechanism is specific enough that referees can examine the experiments and see whether the numbers hold up.

Referee Report

2 major / 1 minor

Summary. The paper proposes EventVLA, an end-to-end VLA framework for long-horizon robotic manipulation that addresses memory bottlenecks via sparse visual evidence memory. It consists of foundational visual anchors for initial/short-term context and a Keyframe Evidence Memory (KEM) module that predicts future keyframe probabilities directly from the VLA's latent embeddings to autonomously store transient task-critical visual events before they become unobservable. The paper also introduces the RoboTwin-MeM benchmark for evaluating non-Markovian manipulation tasks and claims an average +40% success rate improvement over state-of-the-art memory-augmented VLAs across 17 simulation tasks and 4 real-world bimanual tasks.

Significance. If the empirical gains hold under rigorous verification, the foresight-driven KEM approach could meaningfully advance memory-augmented VLAs by reducing information bottlenecks and visual redundancies in long-horizon tasks. The new RoboTwin-MeM benchmark for diagnostic evaluation of memory-requiring tasks would also be a useful contribution to the field.

major comments (2)

[Abstract] Abstract: the central claim of an average +40% success rate improvement is presented with no experimental details, baseline descriptions, statistical tests, ablation results, or error analysis, rendering the performance claim impossible to assess from the provided text.
[Abstract] Abstract: KEM is described as directly predicting future keyframe probabilities from current latent embeddings to preserve transient evidence, but no predictor architecture, training objective, or analysis of prediction accuracy versus policy degradation is supplied; this is load-bearing for the weakest assumption that such predictions reliably capture relevant events without injecting noise or missing cues.

minor comments (1)

[Abstract] Abstract: the term 'foundational visual anchors' is introduced without a definition or description of its implementation or interaction with KEM.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed feedback. The abstract is intentionally concise as a high-level summary; all requested experimental and methodological details are provided in the main body of the manuscript. We respond to each comment below.

read point-by-point responses

Referee: [Abstract] Abstract: the central claim of an average +40% success rate improvement is presented with no experimental details, baseline descriptions, statistical tests, ablation results, or error analysis, rendering the performance claim impossible to assess from the provided text.

Authors: We agree the abstract omits granular experimental information by design. The full protocol—including baseline descriptions (comparisons against state-of-the-art memory-augmented VLAs), statistical tests, ablation results, and error analysis with standard deviations—is reported in Sections 4 and 5, supported by Tables 1–3 and Figures 3–6. The +40% figure represents the mean improvement across the 17 simulation and 4 real-world tasks, with per-task breakdowns and variance explicitly tabulated. This follows standard practice for abstracts in the field. revision: no
Referee: [Abstract] Abstract: KEM is described as directly predicting future keyframe probabilities from current latent embeddings to preserve transient evidence, but no predictor architecture, training objective, or analysis of prediction accuracy versus policy degradation is supplied; this is load-bearing for the weakest assumption that such predictions reliably capture relevant events without injecting noise or missing cues.

Authors: The KEM predictor architecture (lightweight head operating on VLA latents), training objective (joint supervised keyframe prediction and policy loss), and supporting analysis (prediction accuracy metrics plus ablations quantifying policy impact when predictions are noisy or incomplete) appear in Section 3.2 and Appendix B. Section 5.3 further includes targeted ablations demonstrating that the foresight mechanism yields net gains without measurable degradation from false positives or missed events. The abstract summarizes the mechanism at a high level; the load-bearing empirical validation is contained in the main text. revision: no

Circularity Check

0 steps flagged

No circularity: empirical results on benchmarks, no derivations or self-referential predictions

full rationale

The paper introduces EventVLA with a KEM module that predicts future keyframe probabilities from latent embeddings, but presents this as an architectural choice whose value is demonstrated through empirical success rates (+40% average improvement) on 17 simulation and 4 real-world tasks. No equations, fitted parameters renamed as predictions, or self-citation chains appear in the provided text. The central performance claim is an observed experimental outcome rather than a quantity derived by construction from the model's own inputs or prior self-referential results. This is the standard case of a self-contained empirical contribution.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available; no free parameters, axioms, or invented entities can be extracted or audited from the provided text.

pith-pipeline@v0.9.1-grok · 5810 in / 1132 out tokens · 22880 ms · 2026-06-26T18:00:36.719542+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

53 extracted references · 18 linked inside Pith

[1]

Intelligence, K

P. Intelligence, K. Black, N. Brown, J. Darpinian, K. Dhabalia, D. Driess, A. Esmail, M. Equi, C. Finn, N. Fusai, et al. π0.5: a vision-language-action model with open-world generalization. arXiv preprint arXiv:2504.16054, 2025

Pith/arXiv arXiv 2025
[2]

Intelligence, A

P. Intelligence, A. Amin, R. Aniceto, A. Balakrishna, K. Black, K. Conley, G. Connors, J. Darpinian, K. Dhabalia, J. DiCarlo, et al. π∗ 0.6: a vla that learns from experience.arXiv preprint arXiv:2511.14759, 2025

Pith/arXiv arXiv 2025
[3]

C. Chi, Z. Xu, S. Feng, E. Cousineau, Y . Du, B. Burchfiel, R. Tedrake, and S. Song. Diffusion policy: Visuomotor policy learning via action diffusion.The International Journal of Robotics Research, 44(10-11):1684–1704, 2025

2025
[4]

Y . Ze, G. Zhang, K. Zhang, C. Hu, M. Wang, and H. Xu. 3d diffusion policy: Generalizable visuomotor policy learning via simple 3d representations.arXiv preprint arXiv:2403.03954, 2024

Pith/arXiv arXiv 2024
[5]

T. Z. Zhao, V . Kumar, S. Levine, and C. Finn. Learning fine-grained bimanual manipulation with low-cost hardware.arXiv preprint arXiv:2304.13705, 2023

Pith/arXiv arXiv 2023
[6]

Y . Dai, H. Fu, J. Lee, Y . Liu, H. Zhang, J. Yang, C. Finn, N. Fazeli, and J. Chai. Robomme: Benchmarking and understanding memory for robotic generalist policies.arXiv preprint arXiv:2603.04639, 2026

Pith/arXiv arXiv 2026
[7]

Sridhar, J

A. Sridhar, J. Pan, S. Sharma, and C. Finn. Memer: Scaling up memory for robot control via experience retrieval.arXiv preprint arXiv:2510.20328, 2025

arXiv 2025
[8]

T. Chen, Y . Wang, M. Li, Y . Qin, H. Shi, Z. Li, Y . Hu, Y . Zhang, K. Wang, Y . Chen, et al. Rmbench: Memory-dependent robotic manipulation benchmark with insights into policy design. arXiv preprint arXiv:2603.01229, 2026

arXiv 2026
[9]

Torne, K

M. Torne, K. Pertsch, H. Walke, K. Vedder, S. Nair, B. Ichter, A. Z. Ren, H. Wang, J. Tang, K. Stachowicz, et al. Mem: Multi-scale embodied memory for vision language action models. arXiv preprint arXiv:2603.03596, 2026

arXiv 2026
[10]

L. Xiao, J. Li, J. Gao, F. Ye, Y . Jin, J. Qian, J. Zhang, Y . Wu, and X. Yu. Ava-vla: Improving vision-language-action models with active visual attention.arXiv preprint arXiv:2511.18960, 2025

Pith/arXiv arXiv 2025
[11]

Bulatov, Y

A. Bulatov, Y . Kuratov, and M. Burtsev. Recurrent memory transformer.Advances in Neural Information Processing Systems, 35:11079–11091, 2022. 9

2022
[12]

H. Shi, B. Xie, Y . Liu, L. Sun, F. Liu, T. Wang, E. Zhou, H. Fan, X. Zhang, and G. Huang. Mem- oryvla: Perceptual-cognitive memory in vision-language-action models for robotic manipulation. arXiv preprint arXiv:2508.19236, 2025

Pith/arXiv arXiv 2025
[13]

H. Li, S. Yang, Y . Chen, Y . Tian, X. Yang, X. Chen, H. Wang, T. Wang, F. Zhao, D. Lin, et al. Cronusvla: Transferring latent motion across time for multi-frame prediction in manipulation. arXiv e-prints, pages arXiv–2506, 2025

2025
[14]

X. Wang, X. Gao, J. Fu, Z. Li, D. Fortier, G. Mullins, A. Kolobov, and B. Guo. Lola: Long horizon latent action learning for general robot manipulation.arXiv preprint arXiv:2512.20166, 2025

arXiv 2025
[15]

S. Bai, Y . Cai, R. Chen, K. Chen, X. Chen, Z. Cheng, L. Deng, W. Ding, C. Gao, C. Ge, et al. Qwen3-vl technical report.arXiv preprint arXiv:2511.21631, 2025

Pith/arXiv arXiv 2025
[16]

S. Liu, B. Li, K. Ma, L. Wu, H. Tan, X. Ouyang, H. Su, and J. Zhu. Rdt2: Exploring the scaling limit of umi data towards zero-shot cross-embodiment generalization.arXiv preprint arXiv:2602.03310, 2026

arXiv 2026
[17]

Zheng, J

J. Zheng, J. Li, Z. Wang, D. Liu, X. Kang, Y . Feng, Y . Zheng, J. Zou, Y . Chen, J. Zeng, et al. X-vla: Soft-prompted transformer as scalable cross-embodiment vision-language-action model. arXiv preprint arXiv:2510.10274, 2025

Pith/arXiv arXiv 2025
[18]

T. Chen, Y . Mu, Z. Liang, Z. Chen, S. Peng, Q. Chen, M. Xu, R. Hu, H. Zhang, X. Li, et al. G3flow: Generative 3d semantic flow for pose-aware and generalizable object manipulation. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 1735–1744, 2025

2025
[19]

J. Wen, Y . Zhu, J. Li, Z. Tang, C. Shen, and F. Feng. Dexvla: Vision-language model with plug-in diffusion expert for general robot control.arXiv preprint arXiv:2502.05855, 2025

Pith/arXiv arXiv 2025
[20]

M. Lin, P. Ding, S. Wang, Z. Zhuang, Y . Liu, X. Tong, W. Song, S. Lyu, S. Huang, and D. Wang. Hif-vla: Hindsight, insight and foresight through motion representation for vision-language- action models.arXiv preprint arXiv:2512.09928, 2025

Pith/arXiv arXiv 2025
[21]

Liang, Y

Z. Liang, Y . Li, T. Yang, C. Wu, S. Mao, T. Nian, L. Pei, S. Zhou, X. Yang, J. Pang, et al. Discrete diffusion vla: Bringing discrete diffusion to action decoding in vision-language-action policies.arXiv preprint arXiv:2508.20072, 2025

Pith/arXiv arXiv 2025
[22]

G. Yang, T. Zhang, H. Hao, W. Wang, Y . Liu, D. Wang, G. Chen, Z. Cai, J. Chen, W. Su, et al. Vlaser: Vision-language-action model with synergistic embodied reasoning.arXiv preprint arXiv:2510.11027, 2025

arXiv 2025
[23]

W. Shen, Y . Liu, Y . Wu, Z. Liang, S. Gu, D. Wang, T. Nian, L. Xu, Y . Qin, J. Pang, et al. Expertise need not monopolize: Action-specialized mixture of experts for vision-language-action learning. arXiv preprint arXiv:2510.14300, 2025

arXiv 2025
[24]

J. Wen, Y . Zhu, M. Zhu, Z. Tang, J. Li, Z. Zhou, X. Liu, C. Shen, Y . Peng, and F. Feng. Diffusionvla: Scaling robot foundation models via unified diffusion and autoregression. In Forty-second International Conference on Machine Learning, 2025

2025
[25]

J. Wen, Y . Zhu, J. Li, M. Zhu, Z. Tang, K. Wu, Z. Xu, N. Liu, R. Cheng, C. Shen, et al. Tinyvla: Towards fast, data-efficient vision-language-action models for robotic manipulation.IEEE Robotics and Automation Letters, 2025

2025
[26]

Zheng, Y

R. Zheng, Y . Liang, S. Huang, J. Gao, H. Daum ´e III, A. Kolobov, F. Huang, and J. Yang. Tracevla: Visual trace prompting enhances spatial-temporal awareness for generalist robotic policies. InInternational Conference on Learning Representations, volume 2025, pages 54277– 54296, 2025. 10

2025
[27]

H. Tan, P. Co, Y . Xu, S. Rong, Y . Ji, C. Chi, X. Chen, Q. Zhang, Z. Zhao, P. Wang, et al. Action- sketcher: From reasoning to action via visual sketches for long-horizon robotic manipulation. arXiv preprint arXiv:2601.01618, 2026

arXiv 2026
[28]

H. Wang, Z. Jing, J. Ao, S. Song, X. Li, G. Huang, and C. Bai. Beyond short-horizon: Vq- memory for robust long-horizon manipulation in non-markovian simulation benchmarks.arXiv preprint arXiv:2603.09513, 2026

arXiv 2026
[29]

Torne, A

M. Torne, A. Tang, Y . Liu, and C. Finn. Learning long-context diffusion policies via past-token prediction.arXiv preprint arXiv:2505.09561, 2025

arXiv 2025
[30]

Y .-L. Wei, H. Liao, Y . Lin, P. Wang, Z. Liang, G. Liu, and W.-S. Zheng. Cyclemanip: Enabling cyclic task manipulation via effective historical perception and understanding.arXiv preprint arXiv:2512.01022, 2025

arXiv 2025
[31]

H. Jang, S. Yu, H. Kwon, H. Jeon, Y . Seo, and J. Shin. Contextvla: Vision-language-action model with amortized multi-frame context.arXiv preprint arXiv:2510.04246, 2025

arXiv 2025
[32]

M. Lin, X. Liang, B. Lin, L. Jingzhi, Z. Jiao, K. Li, Y . Ma, Y . Liu, S. Zhao, Y . Zhuang, et al. Echovla: Robotic vision-language-action model with synergistic declarative memory for mobile manipulation.arXiv preprint arXiv:2511.18112, 2025

arXiv 2025
[33]

Y . Lei, Z. Liang, H. Zhang, and P. Luo. Vpwem: Non-markovian visuomotor policy with working and episodic memory.arXiv preprint arXiv:2603.04910, 2026

arXiv 2026
[34]

L. Tan, J. Li, and G. Jing. Memoact: Atkinson-shiffrin-inspired memory-augmented visuomotor policy for robotic manipulation.arXiv preprint arXiv:2603.18494, 2026

arXiv 2026
[35]

T. Chen, Z. Chen, B. Chen, Z. Cai, Y . Liu, Z. Li, Q. Liang, X. Lin, Y . Ge, Z. Gu, et al. Robotwin 2.0: A scalable data generator and benchmark with strong domain randomization for robust bimanual robotic manipulation.arXiv preprint arXiv:2506.18088, 2025

Pith/arXiv arXiv 2025
[36]

Nasiriany, A

S. Nasiriany, A. Maddukuri, L. Zhang, A. Parikh, A. Lo, A. Joshi, A. Mandlekar, and Y . Zhu. Robocasa: Large-scale simulation of everyday tasks for generalist robots.arXiv preprint arXiv:2406.02523, 2024

Pith/arXiv arXiv 2024
[37]

S. Tao, F. Xiang, A. Shukla, Y . Qin, X. Hinrichsen, X. Yuan, C. Bao, X. Lin, Y . Liu, T.-k. Chan, et al. Maniskill3: Gpu parallelized robotics simulation and rendering for generalizable embodied ai.arXiv preprint arXiv:2410.00425, 2024

arXiv 2024
[38]

C. Li, R. Zhang, J. Wong, C. Gokmen, S. Srivastava, R. Mart´ın-Mart´ın, C. Wang, G. Levine, M. Lingelbach, J. Sun, et al. Behavior-1k: A benchmark for embodied ai with 1,000 everyday activities and realistic simulation. InConference on Robot Learning, pages 80–93. PMLR, 2023

2023
[39]

X. Li, K. Hsu, J. Gu, K. Pertsch, O. Mees, H. R. Walke, C. Fu, I. Lunawat, I. Sieh, S. Kir- mani, et al. Evaluating real-world robot manipulation policies in simulation.arXiv preprint arXiv:2405.05941, 2024

Pith/arXiv arXiv 2024
[40]

H. Fang, M. Grotz, W. Pumacay, Y . R. Wang, D. Fox, R. Krishna, and J. Duan. Sam2act: Integrating visual foundation model with a memory architecture for robotic manipulation.arXiv preprint arXiv:2501.18564, 2025

arXiv 2025
[41]

Cherepanov, N

E. Cherepanov, N. Kachaev, A. K. Kovalev, and A. I. Panov. Memory, benchmark & robots: A benchmark for solving complex tasks with reinforcement learning.arXiv preprint arXiv:2502.10550, 2025

arXiv 2025
[42]

B. Liu, Y . Zhu, C. Gao, Y . Feng, Q. Liu, Y . Zhu, and P. Stone. Libero: Benchmarking knowledge transfer for lifelong robot learning.Advances in Neural Information Processing Systems, 36: 44776–44791, 2023. 11

2023
[43]

S. Han, B. Qiu, Y . Liao, S. Huang, C. Gao, S. Yan, and S. Liu. Robocerebra: A large-scale benchmark for long-horizon robotic manipulation evaluation.Advances in Neural Information Processing Systems, 38, 2026

2026
[44]

H. Lei, W. Song, H. Zhang, J. Pei, J. Chen, H. Yan, H. Zhao, P. Ding, Z. Zhang, L. Huang, et al. Robomemarena: A comprehensive and challenging robotic memory benchmark.arXiv preprint arXiv:2605.10921, 2026

Pith/arXiv arXiv 2026
[45]

Xiang, Y

F. Xiang, Y . Qin, K. Mo, Y . Xia, H. Zhu, F. Liu, M. Liu, H. Jiang, Y . Yuan, H. Wang, et al. Sapien: A simulated part-based interactive environment. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 11097–11107, 2020

2020
[46]

Community

S. Community. Starvla: A lego-like codebase for vision-language-action model developing. arXiv preprint arXiv:2604.05014, 2026

Pith/arXiv arXiv 2026
[47]

M. J. Kim, C. Finn, and P. Liang. Fine-tuning vision-language-action models: Optimizing speed and success.arXiv preprint arXiv:2502.19645, 2025

Pith/arXiv arXiv 2025
[48]

W. Kwon, Z. Li, S. Zhuang, Y . Sheng, L. Zheng, C. H. Yu, J. E. Gonzalez, H. Zhang, and I. Stoica. Efficient memory management for large language model serving with pagedattention. InProceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles, 2023. 12 Appendix A Implementation Details of EventVLA A.1 Training Formulations and Curriculum ...

2023
[49]

Find key state transitions for this task (e.g., stable grasp acquired, object placed, cycle transition)
[50]

Keep keyframes representative and temporally ordered across the full task progress
[51]

Output format constraints:

For repeated pick/place cycles, pick the most stable and recognizable moments per cycle. Output format constraints:
[52]

No markdown, no explanations

Return JSON only. No markdown, no explanations
[53]

Format:{”keyframe steps”: [int, int, ...]} 3.keyframe stepsmust: - have length exactly<num keyframes> - be strictly increasing - be in [0,<total frames - 1>] - contain no duplicates Annotation Reliability and Error Analysis.To rigorously validate the reliability of this automated pipeline, we conducted a comprehensive cross-validation study. In the simula...

2000

[1] [1]

Intelligence, K

P. Intelligence, K. Black, N. Brown, J. Darpinian, K. Dhabalia, D. Driess, A. Esmail, M. Equi, C. Finn, N. Fusai, et al. π0.5: a vision-language-action model with open-world generalization. arXiv preprint arXiv:2504.16054, 2025

Pith/arXiv arXiv 2025

[2] [2]

Intelligence, A

P. Intelligence, A. Amin, R. Aniceto, A. Balakrishna, K. Black, K. Conley, G. Connors, J. Darpinian, K. Dhabalia, J. DiCarlo, et al. π∗ 0.6: a vla that learns from experience.arXiv preprint arXiv:2511.14759, 2025

Pith/arXiv arXiv 2025

[3] [3]

C. Chi, Z. Xu, S. Feng, E. Cousineau, Y . Du, B. Burchfiel, R. Tedrake, and S. Song. Diffusion policy: Visuomotor policy learning via action diffusion.The International Journal of Robotics Research, 44(10-11):1684–1704, 2025

2025

[4] [4]

Y . Ze, G. Zhang, K. Zhang, C. Hu, M. Wang, and H. Xu. 3d diffusion policy: Generalizable visuomotor policy learning via simple 3d representations.arXiv preprint arXiv:2403.03954, 2024

Pith/arXiv arXiv 2024

[5] [5]

T. Z. Zhao, V . Kumar, S. Levine, and C. Finn. Learning fine-grained bimanual manipulation with low-cost hardware.arXiv preprint arXiv:2304.13705, 2023

Pith/arXiv arXiv 2023

[6] [6]

Y . Dai, H. Fu, J. Lee, Y . Liu, H. Zhang, J. Yang, C. Finn, N. Fazeli, and J. Chai. Robomme: Benchmarking and understanding memory for robotic generalist policies.arXiv preprint arXiv:2603.04639, 2026

Pith/arXiv arXiv 2026

[7] [7]

Sridhar, J

A. Sridhar, J. Pan, S. Sharma, and C. Finn. Memer: Scaling up memory for robot control via experience retrieval.arXiv preprint arXiv:2510.20328, 2025

arXiv 2025

[8] [8]

T. Chen, Y . Wang, M. Li, Y . Qin, H. Shi, Z. Li, Y . Hu, Y . Zhang, K. Wang, Y . Chen, et al. Rmbench: Memory-dependent robotic manipulation benchmark with insights into policy design. arXiv preprint arXiv:2603.01229, 2026

arXiv 2026

[9] [9]

Torne, K

M. Torne, K. Pertsch, H. Walke, K. Vedder, S. Nair, B. Ichter, A. Z. Ren, H. Wang, J. Tang, K. Stachowicz, et al. Mem: Multi-scale embodied memory for vision language action models. arXiv preprint arXiv:2603.03596, 2026

arXiv 2026

[10] [10]

L. Xiao, J. Li, J. Gao, F. Ye, Y . Jin, J. Qian, J. Zhang, Y . Wu, and X. Yu. Ava-vla: Improving vision-language-action models with active visual attention.arXiv preprint arXiv:2511.18960, 2025

Pith/arXiv arXiv 2025

[11] [11]

Bulatov, Y

A. Bulatov, Y . Kuratov, and M. Burtsev. Recurrent memory transformer.Advances in Neural Information Processing Systems, 35:11079–11091, 2022. 9

2022

[12] [12]

H. Shi, B. Xie, Y . Liu, L. Sun, F. Liu, T. Wang, E. Zhou, H. Fan, X. Zhang, and G. Huang. Mem- oryvla: Perceptual-cognitive memory in vision-language-action models for robotic manipulation. arXiv preprint arXiv:2508.19236, 2025

Pith/arXiv arXiv 2025

[13] [13]

H. Li, S. Yang, Y . Chen, Y . Tian, X. Yang, X. Chen, H. Wang, T. Wang, F. Zhao, D. Lin, et al. Cronusvla: Transferring latent motion across time for multi-frame prediction in manipulation. arXiv e-prints, pages arXiv–2506, 2025

2025

[14] [14]

X. Wang, X. Gao, J. Fu, Z. Li, D. Fortier, G. Mullins, A. Kolobov, and B. Guo. Lola: Long horizon latent action learning for general robot manipulation.arXiv preprint arXiv:2512.20166, 2025

arXiv 2025

[15] [15]

S. Bai, Y . Cai, R. Chen, K. Chen, X. Chen, Z. Cheng, L. Deng, W. Ding, C. Gao, C. Ge, et al. Qwen3-vl technical report.arXiv preprint arXiv:2511.21631, 2025

Pith/arXiv arXiv 2025

[16] [16]

S. Liu, B. Li, K. Ma, L. Wu, H. Tan, X. Ouyang, H. Su, and J. Zhu. Rdt2: Exploring the scaling limit of umi data towards zero-shot cross-embodiment generalization.arXiv preprint arXiv:2602.03310, 2026

arXiv 2026

[17] [17]

Zheng, J

J. Zheng, J. Li, Z. Wang, D. Liu, X. Kang, Y . Feng, Y . Zheng, J. Zou, Y . Chen, J. Zeng, et al. X-vla: Soft-prompted transformer as scalable cross-embodiment vision-language-action model. arXiv preprint arXiv:2510.10274, 2025

Pith/arXiv arXiv 2025

[18] [18]

T. Chen, Y . Mu, Z. Liang, Z. Chen, S. Peng, Q. Chen, M. Xu, R. Hu, H. Zhang, X. Li, et al. G3flow: Generative 3d semantic flow for pose-aware and generalizable object manipulation. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 1735–1744, 2025

2025

[19] [19]

J. Wen, Y . Zhu, J. Li, Z. Tang, C. Shen, and F. Feng. Dexvla: Vision-language model with plug-in diffusion expert for general robot control.arXiv preprint arXiv:2502.05855, 2025

Pith/arXiv arXiv 2025

[20] [20]

M. Lin, P. Ding, S. Wang, Z. Zhuang, Y . Liu, X. Tong, W. Song, S. Lyu, S. Huang, and D. Wang. Hif-vla: Hindsight, insight and foresight through motion representation for vision-language- action models.arXiv preprint arXiv:2512.09928, 2025

Pith/arXiv arXiv 2025

[21] [21]

Liang, Y

Z. Liang, Y . Li, T. Yang, C. Wu, S. Mao, T. Nian, L. Pei, S. Zhou, X. Yang, J. Pang, et al. Discrete diffusion vla: Bringing discrete diffusion to action decoding in vision-language-action policies.arXiv preprint arXiv:2508.20072, 2025

Pith/arXiv arXiv 2025

[22] [22]

G. Yang, T. Zhang, H. Hao, W. Wang, Y . Liu, D. Wang, G. Chen, Z. Cai, J. Chen, W. Su, et al. Vlaser: Vision-language-action model with synergistic embodied reasoning.arXiv preprint arXiv:2510.11027, 2025

arXiv 2025

[23] [23]

W. Shen, Y . Liu, Y . Wu, Z. Liang, S. Gu, D. Wang, T. Nian, L. Xu, Y . Qin, J. Pang, et al. Expertise need not monopolize: Action-specialized mixture of experts for vision-language-action learning. arXiv preprint arXiv:2510.14300, 2025

arXiv 2025

[24] [24]

J. Wen, Y . Zhu, M. Zhu, Z. Tang, J. Li, Z. Zhou, X. Liu, C. Shen, Y . Peng, and F. Feng. Diffusionvla: Scaling robot foundation models via unified diffusion and autoregression. In Forty-second International Conference on Machine Learning, 2025

2025

[25] [25]

J. Wen, Y . Zhu, J. Li, M. Zhu, Z. Tang, K. Wu, Z. Xu, N. Liu, R. Cheng, C. Shen, et al. Tinyvla: Towards fast, data-efficient vision-language-action models for robotic manipulation.IEEE Robotics and Automation Letters, 2025

2025

[26] [26]

Zheng, Y

R. Zheng, Y . Liang, S. Huang, J. Gao, H. Daum ´e III, A. Kolobov, F. Huang, and J. Yang. Tracevla: Visual trace prompting enhances spatial-temporal awareness for generalist robotic policies. InInternational Conference on Learning Representations, volume 2025, pages 54277– 54296, 2025. 10

2025

[27] [27]

H. Tan, P. Co, Y . Xu, S. Rong, Y . Ji, C. Chi, X. Chen, Q. Zhang, Z. Zhao, P. Wang, et al. Action- sketcher: From reasoning to action via visual sketches for long-horizon robotic manipulation. arXiv preprint arXiv:2601.01618, 2026

arXiv 2026

[28] [28]

H. Wang, Z. Jing, J. Ao, S. Song, X. Li, G. Huang, and C. Bai. Beyond short-horizon: Vq- memory for robust long-horizon manipulation in non-markovian simulation benchmarks.arXiv preprint arXiv:2603.09513, 2026

arXiv 2026

[29] [29]

Torne, A

M. Torne, A. Tang, Y . Liu, and C. Finn. Learning long-context diffusion policies via past-token prediction.arXiv preprint arXiv:2505.09561, 2025

arXiv 2025

[30] [30]

Y .-L. Wei, H. Liao, Y . Lin, P. Wang, Z. Liang, G. Liu, and W.-S. Zheng. Cyclemanip: Enabling cyclic task manipulation via effective historical perception and understanding.arXiv preprint arXiv:2512.01022, 2025

arXiv 2025

[31] [31]

H. Jang, S. Yu, H. Kwon, H. Jeon, Y . Seo, and J. Shin. Contextvla: Vision-language-action model with amortized multi-frame context.arXiv preprint arXiv:2510.04246, 2025

arXiv 2025

[32] [32]

M. Lin, X. Liang, B. Lin, L. Jingzhi, Z. Jiao, K. Li, Y . Ma, Y . Liu, S. Zhao, Y . Zhuang, et al. Echovla: Robotic vision-language-action model with synergistic declarative memory for mobile manipulation.arXiv preprint arXiv:2511.18112, 2025

arXiv 2025

[33] [33]

Y . Lei, Z. Liang, H. Zhang, and P. Luo. Vpwem: Non-markovian visuomotor policy with working and episodic memory.arXiv preprint arXiv:2603.04910, 2026

arXiv 2026

[34] [34]

L. Tan, J. Li, and G. Jing. Memoact: Atkinson-shiffrin-inspired memory-augmented visuomotor policy for robotic manipulation.arXiv preprint arXiv:2603.18494, 2026

arXiv 2026

[35] [35]

T. Chen, Z. Chen, B. Chen, Z. Cai, Y . Liu, Z. Li, Q. Liang, X. Lin, Y . Ge, Z. Gu, et al. Robotwin 2.0: A scalable data generator and benchmark with strong domain randomization for robust bimanual robotic manipulation.arXiv preprint arXiv:2506.18088, 2025

Pith/arXiv arXiv 2025

[36] [36]

Nasiriany, A

S. Nasiriany, A. Maddukuri, L. Zhang, A. Parikh, A. Lo, A. Joshi, A. Mandlekar, and Y . Zhu. Robocasa: Large-scale simulation of everyday tasks for generalist robots.arXiv preprint arXiv:2406.02523, 2024

Pith/arXiv arXiv 2024

[37] [37]

S. Tao, F. Xiang, A. Shukla, Y . Qin, X. Hinrichsen, X. Yuan, C. Bao, X. Lin, Y . Liu, T.-k. Chan, et al. Maniskill3: Gpu parallelized robotics simulation and rendering for generalizable embodied ai.arXiv preprint arXiv:2410.00425, 2024

arXiv 2024

[38] [38]

C. Li, R. Zhang, J. Wong, C. Gokmen, S. Srivastava, R. Mart´ın-Mart´ın, C. Wang, G. Levine, M. Lingelbach, J. Sun, et al. Behavior-1k: A benchmark for embodied ai with 1,000 everyday activities and realistic simulation. InConference on Robot Learning, pages 80–93. PMLR, 2023

2023

[39] [39]

X. Li, K. Hsu, J. Gu, K. Pertsch, O. Mees, H. R. Walke, C. Fu, I. Lunawat, I. Sieh, S. Kir- mani, et al. Evaluating real-world robot manipulation policies in simulation.arXiv preprint arXiv:2405.05941, 2024

Pith/arXiv arXiv 2024

[40] [40]

H. Fang, M. Grotz, W. Pumacay, Y . R. Wang, D. Fox, R. Krishna, and J. Duan. Sam2act: Integrating visual foundation model with a memory architecture for robotic manipulation.arXiv preprint arXiv:2501.18564, 2025

arXiv 2025

[41] [41]

Cherepanov, N

E. Cherepanov, N. Kachaev, A. K. Kovalev, and A. I. Panov. Memory, benchmark & robots: A benchmark for solving complex tasks with reinforcement learning.arXiv preprint arXiv:2502.10550, 2025

arXiv 2025

[42] [42]

B. Liu, Y . Zhu, C. Gao, Y . Feng, Q. Liu, Y . Zhu, and P. Stone. Libero: Benchmarking knowledge transfer for lifelong robot learning.Advances in Neural Information Processing Systems, 36: 44776–44791, 2023. 11

2023

[43] [43]

S. Han, B. Qiu, Y . Liao, S. Huang, C. Gao, S. Yan, and S. Liu. Robocerebra: A large-scale benchmark for long-horizon robotic manipulation evaluation.Advances in Neural Information Processing Systems, 38, 2026

2026

[44] [44]

H. Lei, W. Song, H. Zhang, J. Pei, J. Chen, H. Yan, H. Zhao, P. Ding, Z. Zhang, L. Huang, et al. Robomemarena: A comprehensive and challenging robotic memory benchmark.arXiv preprint arXiv:2605.10921, 2026

Pith/arXiv arXiv 2026

[45] [45]

Xiang, Y

F. Xiang, Y . Qin, K. Mo, Y . Xia, H. Zhu, F. Liu, M. Liu, H. Jiang, Y . Yuan, H. Wang, et al. Sapien: A simulated part-based interactive environment. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 11097–11107, 2020

2020

[46] [46]

Community

S. Community. Starvla: A lego-like codebase for vision-language-action model developing. arXiv preprint arXiv:2604.05014, 2026

Pith/arXiv arXiv 2026

[47] [47]

M. J. Kim, C. Finn, and P. Liang. Fine-tuning vision-language-action models: Optimizing speed and success.arXiv preprint arXiv:2502.19645, 2025

Pith/arXiv arXiv 2025

[48] [48]

W. Kwon, Z. Li, S. Zhuang, Y . Sheng, L. Zheng, C. H. Yu, J. E. Gonzalez, H. Zhang, and I. Stoica. Efficient memory management for large language model serving with pagedattention. InProceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles, 2023. 12 Appendix A Implementation Details of EventVLA A.1 Training Formulations and Curriculum ...

2023

[49] [49]

Find key state transitions for this task (e.g., stable grasp acquired, object placed, cycle transition)

[50] [50]

Keep keyframes representative and temporally ordered across the full task progress

[51] [51]

Output format constraints:

For repeated pick/place cycles, pick the most stable and recognizable moments per cycle. Output format constraints:

[52] [52]

No markdown, no explanations

Return JSON only. No markdown, no explanations

[53] [53]

Format:{”keyframe steps”: [int, int, ...]} 3.keyframe stepsmust: - have length exactly<num keyframes> - be strictly increasing - be in [0,<total frames - 1>] - contain no duplicates Annotation Reliability and Error Analysis.To rigorously validate the reliability of this automated pipeline, we conducted a comprehensive cross-validation study. In the simula...

2000