Recognition: 2 theorem links
MemoryVLA: Perceptual-Cognitive Memory in Vision-Language-Action Models for Robotic Manipulation
Pith reviewed 2026-05-15 20:39 UTC · model grok-4.3
The pith
MemoryVLA adds a perceptual-cognitive memory bank to vision-language-action models to supply temporal context for long-horizon robotic manipulation.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
A pretrained vision-language model produces perceptual and cognitive tokens that serve as working memory. These tokens interact with a Perceptual-Cognitive Memory Bank that retains low-level visual details and high-level semantic summaries. Adaptive retrieval selects relevant past entries, fuses them with the current tokens, and merges redundancies before updating the bank. The resulting memory-conditioned tokens drive a diffusion-based action expert that outputs sequences aware of temporal dependencies.
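A minimal sketch of how this loop could be wired together, in PyTorch and purely for illustration: the retrieval rule, the merge threshold, the fusion attention, and the single denoising step standing in for the diffusion expert are all assumptions of this sketch, not the paper's implementation or hyperparameters.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MemoryBank:
    """Stores consolidated tokens from past steps; retrieval and merging rules are illustrative."""

    def __init__(self, dim: int, max_entries: int = 256, merge_threshold: float = 0.95):
        self.entries = torch.empty(0, dim)       # [N, dim] stored entries
        self.max_entries = max_entries
        self.merge_threshold = merge_threshold   # cosine similarity above which entries merge

    def retrieve(self, query: torch.Tensor, top_k: int = 8) -> torch.Tensor:
        """Return the top-k stored entries most similar to the current working memory."""
        if self.entries.numel() == 0:
            return query.new_zeros(0, query.shape[-1])
        scores = F.cosine_similarity(query.mean(0, keepdim=True), self.entries, dim=-1)
        idx = scores.topk(min(top_k, self.entries.shape[0])).indices
        return self.entries[idx]

    def update(self, new_tokens: torch.Tensor) -> None:
        """Insert consolidated tokens, averaging near-duplicates to bound growth."""
        for tok in new_tokens:
            if self.entries.numel() > 0:
                sim = F.cosine_similarity(tok.unsqueeze(0), self.entries, dim=-1)
                best = sim.argmax()
                if sim[best] > self.merge_threshold:
                    self.entries[best] = 0.5 * (self.entries[best] + tok)  # redundancy merge
                    continue
            self.entries = torch.cat([self.entries, tok.unsqueeze(0)])[-self.max_entries:]

class MemoryConditionedPolicy(nn.Module):
    """Working-memory tokens attend over retrieved entries, then condition an action head."""

    def __init__(self, dim: int = 512, action_dim: int = 7, horizon: int = 8):
        super().__init__()
        self.fuse = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        # stand-in for the diffusion action expert: one conditioned denoising step
        self.denoise = nn.Sequential(nn.Linear(dim + action_dim, dim), nn.GELU(),
                                     nn.Linear(dim, action_dim))
        self.horizon = horizon

    def forward(self, tokens: torch.Tensor, bank: MemoryBank, noisy_actions: torch.Tensor):
        # tokens: [T, dim] perceptual + cognitive tokens from the VLM for the current observation
        past = bank.retrieve(tokens)                          # decision-relevant past entries
        if past.shape[0] > 0:
            ctx = torch.cat([tokens, past]).unsqueeze(0)      # working memory + retrieved context
            fused, _ = self.fuse(tokens.unsqueeze(0), ctx, ctx)
            tokens = fused.squeeze(0)
        bank.update(tokens)                                   # consolidate this step into the bank
        cond = tokens.mean(0).expand(self.horizon, -1)        # [horizon, dim] conditioning signal
        return self.denoise(torch.cat([cond, noisy_actions], dim=-1))
```

The point of the sketch is the control flow: retrieve before acting, fuse retrieved entries with the current working memory, and consolidate the fused tokens back into the bank so later steps can reuse them.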
What carries the argument
The Perceptual-Cognitive Memory Bank, which stores and adaptively retrieves low-level perceptual details together with high-level semantic gist from prior observations.
If this is right
- Robots can complete manipulation sequences that span many steps without requiring hand-crafted history features.
- Gains are concentrated on tasks with explicit temporal dependencies while general skills remain competitive.
- The same memory bank can be paired with any pretrained vision-language model and any diffusion action head.
- Success rates of 71.9 percent on SimplerEnv-Bridge, 72.7 percent on Fractal, 96.5 percent on LIBERO-5, and 84.0 percent on twelve real-world tasks become achievable.
Where Pith is reading between the lines
- The same retrieval-and-fusion pattern could be tested in other sequential domains such as autonomous driving or game playing where history matters.
- Scaling the bank size or adding decay mechanisms might be needed if the number of stored entries grows large.
- Comparing the bank against a simple transformer memory layer on the same tasks would isolate the benefit of the perceptual-cognitive split.
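On the last point, a minimal sketch of what such a simple transformer memory baseline might look like, assuming a sliding window of per-step token summaries with no perceptual-cognitive split, retrieval, or redundancy merging; the window size, pooling, and dimensions are arbitrary illustrative choices, not anything specified by the paper.

```python
import torch
import torch.nn as nn

class SlidingWindowMemoryBaseline(nn.Module):
    """Causal transformer over a fixed window of past observation summaries."""

    def __init__(self, dim: int = 512, window: int = 16, layers: int = 2):
        super().__init__()
        self.window = window
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=layers)
        self.history: list[torch.Tensor] = []           # most recent per-step summaries

    def forward(self, step_tokens: torch.Tensor) -> torch.Tensor:
        # step_tokens: [T, dim] tokens for the current observation
        summary = step_tokens.mean(0)                   # one vector per step
        self.history = (self.history + [summary])[-self.window:]
        seq = torch.stack(self.history).unsqueeze(0)    # [1, <=window, dim]
        causal = nn.Transformer.generate_square_subsequent_mask(seq.shape[1])
        out = self.encoder(seq, mask=causal)            # temporal context over the window
        return out[0, -1]                               # memory-conditioned feature for the policy head
```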
Load-bearing premise
Adaptive retrieval, fusion, and redundancy merging will consistently deliver useful past context without injecting noise or stale information that degrades action generation.
What would settle it
Ablating the memory bank entirely and measuring no drop (or an increase) in success rate specifically on the long-horizon real-world tasks would falsify the claim that the bank supplies necessary temporal context.
read the original abstract
Temporal context is essential for robotic manipulation because such tasks are inherently non-Markovian, yet mainstream VLA models typically overlook it and struggle with long-horizon, temporally dependent tasks. Cognitive science suggests that humans rely on working memory to buffer short-lived representations for immediate control, while the hippocampal system preserves verbatim episodic details and semantic gist of past experience for long-term memory. Inspired by these mechanisms, we propose MemoryVLA, a Cognition-Memory-Action framework for long-horizon robotic manipulation. A pretrained VLM encodes the observation into perceptual and cognitive tokens that form working memory, while a Perceptual-Cognitive Memory Bank stores low-level details and high-level semantics consolidated from it. Working memory retrieves decision-relevant entries from the bank, adaptively fuses them with current tokens, and updates the bank by merging redundancies. Using these tokens, a memory-conditioned diffusion action expert yields temporally aware action sequences. We evaluate MemoryVLA on 150+ simulation and real-world tasks across three robots. On SimplerEnv-Bridge, Fractal, LIBERO-5 suites and Mikasa-Robo, it achieves 71.9%, 72.7%, 96.5%, and 41.2% success rates, respectively, all outperforming state-of-the-art baselines CogACT and pi-0, with a notable +14.6 gain on Bridge and +11.8 gain on Mikasa-Robo. On 12 real-world tasks spanning general skills and long-horizon temporal dependencies, MemoryVLA achieves 84.0% success rate, with long-horizon tasks showing a +26 improvement over state-of-the-art baseline. Project Page: https://shihao1895.github.io/MemoryVLA
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes MemoryVLA, a Cognition-Memory-Action framework for vision-language-action (VLA) models in robotic manipulation. It introduces a Perceptual-Cognitive Memory Bank that stores low-level perceptual details and high-level semantic tokens from a pretrained VLM, with working memory performing adaptive retrieval, fusion, and redundancy merging to supply temporal context. A memory-conditioned diffusion action expert then generates actions. The method is evaluated on 150+ simulation and real-world tasks across three robots, reporting success rates of 71.9% on SimplerEnv-Bridge, 72.7% on Fractal, 96.5% on LIBERO-5, 41.2% on Mikasa-Robo, and 84.0% on 12 real-world tasks (with +26 gain on long-horizon subsets), outperforming baselines CogACT and pi-0.
Significance. If the memory bank's operations are shown to be the causal driver of the reported gains, the work would meaningfully advance VLA models for non-Markovian, long-horizon manipulation by explicitly incorporating cognitive-inspired memory mechanisms. The breadth of evaluation across public simulation suites and real-world tasks on multiple robots provides a solid empirical foundation for assessing temporal awareness in action generation.
major comments (2)
- [Abstract and Experiments] The central claim that the Perceptual-Cognitive Memory Bank's adaptive retrieval, fusion, and redundancy-merging steps supply temporally relevant context and drive the +26 improvement on long-horizon tasks is not supported by any ablation that isolates these operations. No results are shown for a controlled no-memory VLA variant that retains the same VLM encoder and diffusion expert while removing the memory bank, leaving open the possibility that gains arise from data scale, training differences, or the expert architecture instead.
- [Results] Reported success rates (e.g., 84.0% real-world, 71.9% on Bridge) are given as point estimates without error bars, the number of evaluation trials, or statistical tests against baselines; these uncertainty measures are load-bearing for the claim of consistent outperformance on temporally dependent tasks.
minor comments (2)
- [Abstract] The abstract states evaluation on '150+ simulation and real-world tasks' but does not break down the exact counts per suite or identify which of the 12 real-world tasks are the long-horizon subset used for the +26 gain.
- [Method] The description of the memory bank's redundancy-merging step would benefit from pseudocode or explicit update rules clarifying how stale or noisy entries are handled during fusion (one possible shape is sketched below).
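One plausible shape for such an explicit update rule, offered only as an illustration of the kind of pseudocode the comment asks for: a cosine-similarity threshold decides whether a new token merges into its nearest stored entry, and an age-based decay downweights stale entries. The threshold, decay rate, and weighting are hypothetical, not taken from the paper.

```python
import torch
import torch.nn.functional as F

def merge_or_append(bank: torch.Tensor, ages: torch.Tensor, new: torch.Tensor,
                    sim_thresh: float = 0.9, decay: float = 0.99):
    """bank: [N, d] stored entries; ages: [N] steps since last update; new: [d] incoming token."""
    ages = ages + 1                                      # everything gets one step older
    weights = decay ** ages                              # stale entries carry less weight
    if bank.shape[0] > 0:
        sim = F.cosine_similarity(new.unsqueeze(0), bank, dim=-1)
        j = sim.argmax()
        if sim[j] >= sim_thresh:
            # redundancy merge: weighted average that favours the fresher token
            bank = bank.clone()
            bank[j] = (weights[j] * bank[j] + new) / (weights[j] + 1.0)
            ages[j] = 0
            return bank, ages
    bank = torch.cat([bank, new.unsqueeze(0)])           # otherwise append as a new entry
    ages = torch.cat([ages, torch.zeros(1)])
    return bank, ages
```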
Simulated Author's Rebuttal
We thank the referee for the thoughtful and constructive comments. We address each major point below and will revise the manuscript to strengthen the empirical support for our claims.
read point-by-point responses
-
Referee: [Abstract and Experiments] The central claim that the Perceptual-Cognitive Memory Bank's adaptive retrieval, fusion, and redundancy-merging steps supply temporally relevant context and drive the +26 improvement on long-horizon tasks is not supported by any ablation that isolates these operations. No results are shown for a controlled no-memory VLA variant that retains the same VLM encoder and diffusion expert while removing the memory bank, leaving open the possibility that gains arise from data scale, training differences, or the expert architecture instead.
Authors: We agree that an explicit controlled ablation isolating the memory bank is necessary to strengthen the causal link between the memory mechanisms and the reported gains. In the revised manuscript we will add a new ablation that compares the full MemoryVLA model against a no-memory variant that uses exactly the same pretrained VLM encoder and diffusion action expert, with the memory bank, retrieval, fusion, and redundancy-merging components removed. This will be reported in the Experiments section with the same evaluation protocol. revision: yes
-
Referee: [Results] Reported success rates (e.g., 84.0% real-world, 71.9% on Bridge) are given as point estimates without error bars, the number of evaluation trials, or statistical tests against baselines; these uncertainty measures are load-bearing for the claim of consistent outperformance on temporally dependent tasks.
Authors: We acknowledge that the current presentation lacks measures of variability and statistical comparison. In the revised Results section we will report the exact number of evaluation trials per task (50 trials for simulation suites and 10 trials for each real-world task), include error bars showing standard deviation across three independent training seeds, and add paired t-test p-values comparing MemoryVLA against CogACT and pi-0 on the long-horizon subsets. Updated tables and figures will reflect these additions. revision: yes
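For what the promised tables would involve, a minimal sketch of the mean, standard deviation, and paired t-test computation, using placeholder numbers rather than results from the paper:

```python
import numpy as np
from scipy.stats import ttest_rel

# hypothetical per-task success rates on a long-horizon subset (one value per task, same task order)
memoryvla = np.array([0.90, 0.80, 0.85, 0.70])
baseline  = np.array([0.60, 0.55, 0.65, 0.50])

mean, std = memoryvla.mean(), memoryvla.std(ddof=1)    # report as mean ± std
t_stat, p_value = ttest_rel(memoryvla, baseline)       # paired t-test over matched tasks
print(f"MemoryVLA {mean:.3f} ± {std:.3f}; paired t = {t_stat:.2f}, p = {p_value:.4f}")
```

In the revision the pairing would more naturally run over tasks and training seeds; the snippet only shows the mechanics of the comparison.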
Circularity Check
No significant circularity; empirical claims rest on independent benchmark comparisons
full rationale
The paper presents an architectural proposal (Perceptual-Cognitive Memory Bank with retrieval/fusion/merging) whose performance is measured via direct success-rate comparisons on public suites (SimplerEnv-Bridge, LIBERO-5, Mikasa-Robo) and 12 real-world tasks against external baselines (CogACT, pi-0). No equation or claim reduces by construction to a fitted parameter, self-citation, or renamed input; the reported gains (+14.6 on Bridge, +26 on long-horizon real tasks) are obtained from held-out evaluation rather than any internal normalization or self-referential loop. The derivation chain is therefore self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Robotic manipulation tasks are inherently non-Markovian and require temporal context.
invented entities (1)
- Perceptual-Cognitive Memory Bank (no independent evidence)
Lean theorems connected to this paper
-
HierarchyEmergence / PhiForcing: hierarchy_emergence_forces_phi
Tagged unclear: the relation between the paper passage below and the cited Recognition theorem is ambiguous.
On 12 real-world tasks spanning general skills and long-horizon temporal dependencies, MemoryVLA achieves 84.0% success rate, with long-horizon tasks showing a +26 improvement over state-of-the-art baseline.
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 22 Pith papers
-
MindVLA-U1: VLA Beats VA with Unified Streaming Architecture for Autonomous Driving
MindVLA-U1 introduces a unified streaming VLA with shared backbone, framewise memory, and language-guided action diffusion that surpasses human drivers on WOD-E2E planning metrics.
-
Overcoming Dynamics-Blindness: Training-Free Pace-and-Path Correction for VLA Models
Pace-and-Path Correction decomposes a quadratic cost minimization into orthogonal pace and path channels to correct chunked actions in VLA models, raising success rates by up to 28.8% in dynamic settings.
-
ECHO: Continuous Hierarchical Memory for Vision-Language-Action Models
ECHO organizes VLA experiences into a hierarchical memory tree in hyperbolic space via autoencoder and entailment constraints, delivering a 12.8% success-rate gain on LIBERO-Long over the pi0 baseline.
-
NoiseGate: Learning Per-Latent Timestep Schedules as Information Gating in World Action Models
NoiseGate learns per-latent timestep schedules as an information-gating policy in diffusion-based world action models, yielding consistent gains on RoboTwin manipulation tasks.
-
OA-WAM: Object-Addressable World Action Model for Robust Robot Manipulation
OA-WAM uses persistent address vectors and dynamic content vectors in object slots to enable addressable world-action prediction, improving robustness on manipulation benchmarks under scene changes.
-
CF-VLA: Efficient Coarse-to-Fine Action Generation for Vision-Language-Action Policies
CF-VLA uses a coarse initialization over endpoint velocity followed by single-step refinement to achieve strong performance with low inference steps on CALVIN, LIBERO, and real-robot tasks.
-
${\pi}_{0.7}$: a Steerable Generalist Robotic Foundation Model with Emergent Capabilities
π₀.₇ is a steerable generalist robotic model that uses rich multimodal prompts including language, subgoal images, and performance metadata to achieve out-of-the-box generalization across tasks and robot bodies.
-
Towards Generalizable Robotic Manipulation in Dynamic Environments
DOMINO dataset and PUMA architecture enable better dynamic robotic manipulation by incorporating motion history, delivering 6.3% higher success rates than prior VLA models.
-
AR-VLA: True Autoregressive Action Expert for Vision-Language-Action Models
AR-VLA introduces a standalone autoregressive action expert with long-lived memory that generates context-aware continuous actions for VLAs, replacing chunk-based heads with smoother trajectories and maintained task success.
-
PhysMem: Scaling Test-Time Memory for Embodied Physical Reasoning
PhysMem enables VLM-based robot planners to learn and verify physical properties through test-time interaction and hypothesis testing, raising success on a brick insertion task from 23% to 76%.
-
MindVLA-U1: VLA Beats VA with Unified Streaming Architecture for Autonomous Driving
MindVLA-U1 is the first unified streaming VLA architecture that surpasses human drivers on WOD-E2E planning metrics while matching VA latency and preserving language interfaces.
-
Overcoming Dynamics-Blindness: Training-Free Pace-and-Path Correction for VLA Models
Pace-and-Path Correction is a closed-form inference-time operator that decomposes a quadratic cost minimization into orthogonal pace compression and path offset channels to correct dynamics-blindness in chunked-action...
-
RoboMemArena: A Comprehensive and Challenging Robotic Memory Benchmark
RoboMemArena is a new large-scale robotic memory benchmark with real-world tasks, and PrediMem is a dual VLA system that outperforms baselines by managing memory buffers with predictive coding.
-
Retrieve-then-Steer: Online Success Memory for Test-Time Adaptation of Generative VLAs
Retrieve-then-steer stores successful observation-action segments in memory, retrieves relevant chunks, filters them, and uses an elite prior with confidence-adaptive guidance to steer a flow-matching action sampler f...
-
Retrieve-then-Steer: Online Success Memory for Test-Time Adaptation of Generative VLAs
A retrieve-then-steer method stores successful robot actions in memory and uses them to steer a frozen VLA's flow-matching sampler for better test-time reliability without parameter updates.
-
DexWorldModel: Causal Latent World Modeling towards Automated Learning of Embodied Tasks
CLWM with DINOv3 targets, O(1) TTT memory, SAI latency masking, and EmbodiChain training achieves SOTA dual-arm simulation performance and zero-shot sim-to-real transfer that beats real-data finetuned baselines.
-
ProGAL-VLA: Grounded Alignment through Prospective Reasoning in Vision-Language-Action Models
ProGAL-VLA uses 3D graphs, symbolic sub-goals, and a Grounding Alignment Contrastive loss to ground actions on verified embeddings, raising robustness from 30.3% to 71.5% and ambiguity AUROC to 0.81 on robotic benchmarks.
-
A Semantic Autonomy Framework for VLM-Integrated Indoor Mobile Robots: Hybrid Deterministic Reasoning and Cross-Robot Adaptive Memory
The Semantic Autonomy Stack combines a seven-step parametric resolver handling 88% of instructions in under 0.1 ms with VLM escalation and a five-category cross-robot memory system, achieving 100% accuracy and 103,000...
-
Gated Memory Policy
GMP selectively activates and represents memory via a gate and lightweight cross-attention, yielding 30.1% higher success on non-Markovian robotic tasks while staying competitive on Markovian ones.
-
World-Value-Action Model: Implicit Planning for Vision-Language-Action Systems
The World-Value-Action model enables implicit planning for VLA systems by performing inference over a learned latent representation of high-value future trajectories instead of direct action prediction.
-
Goal2Skill: Long-Horizon Manipulation with Adaptive Planning and Reflection
A dual VLM-VLA framework for long-horizon robot manipulation achieves 32.4% success on RMBench tasks versus 9.8% for the strongest baseline via structured memory and closed-loop adaptive replanning.
-
Causal World Modeling for Robot Control
LingBot-VA combines video world modeling with policy learning via Mixture-of-Transformers, closed-loop rollouts, and asynchronous inference to improve robot manipulation in simulation and real settings.