Key-Gram: Extensible World Knowledge for Embodied Manipulation

Botao Ren; Jingjing Fan; Siyuan Li; Zhidong Deng

arxiv: 2605.18556 · v1 · pith:5VQMO4GKnew · submitted 2026-05-18 · 💻 cs.RO · cs.AI

Pith reviewed 2026-05-20 09:05 UTC · model grok-4.3

keywords embodied manipulation · key-gram · external memory · vision-language-action · compositional instructions · robot control · knowledge extension · transfer learning

The pith

Key-Gram decouples linguistic knowledge from visual reasoning in embodied policies using an external memory of key-grams.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes Key-Gram to address the coupling of language and visual computation in current vision-language-action policies. It introduces a memory module that decomposes instructions into task-specific key-grams and retrieves static linguistic priors via deterministic hashed lookup. These entries are then injected into selected hidden layers through context-aware gating and lightweight convolutional fusion. This separation lets the backbone focus on visual reasoning and action inference while linguistic knowledge stays in an extensible external store. Experiments demonstrate consistent gains across simulation benchmarks and real-world dual-arm manipulation tasks for two different backbones.

Core claim

Key-Gram is a conditional-memory framework that separates language-derived world knowledge from visual-state reasoning by decomposing an instruction into task-specific key-grams, retrieving static linguistic priors through deterministic hashed lookup, and injecting the retrieved entries into selected hidden layers through context-aware gating and lightweight convolutional fusion, allowing the backbone to devote its main capacity to visual reasoning and action inference while reusable instruction knowledge is stored in an extensible external memory.
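The retrieval step in the claim above can be illustrated with a minimal sketch. It assumes a fixed-size table indexed by a stable hash of the key-gram string; the names `keygram_slot` and `lookup`, the use of SHA-256, the table size, and the example key-grams are all hypothetical choices for illustration, since this page does not specify the paper's hash function or table layout.

```python
import hashlib

TABLE_SIZE = 1024  # hypothetical number of slots
EMBED_DIM = 4      # hypothetical width of each stored prior


def keygram_slot(keygram: str, table_size: int = TABLE_SIZE) -> int:
    """Map a key-gram to a slot with a stable hash: same text, same slot."""
    # Normalise case and whitespace so trivially different spellings collide on purpose.
    normalized = " ".join(keygram.lower().split())
    digest = hashlib.sha256(normalized.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % table_size


def lookup(table: list[list[float]], keygrams: list[str]) -> list[list[float]]:
    """Fetch one stored vector per key-gram; each fetch is a single index operation."""
    return [table[keygram_slot(k, len(table))] for k in keygrams]


# Zero-initialised stand-in for a learned memory table (assumption).
memory_table = [[0.0] * EMBED_DIM for _ in range(TABLE_SIZE)]
priors = lookup(memory_table, ["pick up sponge", "wipe wooden table"])
```

Because the slot depends only on the key-gram text, the lookup needs no search over the table, which is what makes a constant-time access pattern possible. Distinct key-grams can still collide in the same slot; how the paper handles collisions is not stated here.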

What carries the argument

Memory module that decomposes instructions into task-specific key-grams, retrieves linguistic priors via deterministic hashed lookup, and injects them into hidden layers via context-aware gating and convolutional fusion.
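The "convolutional fusion" step is not specified on this page. One plausible reading, sketched below in plain Python, is a small depthwise 1-D convolution along the token axis that mixes each injected token with its neighbours. The function name `conv1d_same`, the odd kernel length, and the zero padding are assumptions for illustration, not the authors' implementation.

```python
def conv1d_same(seq: list[list[float]], kernel: list[float]) -> list[list[float]]:
    """Depthwise 1-D convolution over the token axis with zero padding.

    seq is a list of tokens, each a list of feature values; the same kernel
    is applied independently to every feature channel. Assumes an odd kernel length.
    """
    half = len(kernel) // 2
    dim = len(seq[0])
    out = []
    for t in range(len(seq)):
        row = [0.0] * dim
        for k, weight in enumerate(kernel):
            src = t + k - half
            if 0 <= src < len(seq):  # positions outside the sequence count as zeros
                for d in range(dim):
                    row[d] += weight * seq[src][d]
        out.append(row)
    return out


tokens = [[4.0], [4.0], [4.0]]
smoothed = conv1d_same(tokens, [0.25, 0.5, 0.25])
```

With a three-tap kernel the fusion adds only three weights per layer, which is consistent with the word "lightweight" but is otherwise a guess about the actual parameterisation.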

If this is right

Improves both π0 and π0.5 backbones with average relative gains of 29.5 percent and 9.9 percent on RoboTwin2.0.
Achieves 35.8 percent and 4.5 percent gains on LIBERO-Plus transfer without target-domain fine-tuning.
Delivers 15.4 percent and 8.1 percent gains on real-world long-horizon dual-arm tasks.
Allows the logical memory table to be partitioned during training and placed on host memory with O(1) lookup at inference time.
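The partitioning point in the list above can be made concrete with a small sketch. It assumes the logical table is split into contiguous, equally sized shards, so a logical slot index resolves to a shard and an offset with one integer division. The shard count, shard size, and the names `locate`, `build_shards`, and `read` are hypothetical; the page does not say how the paper partitions its table.

```python
NUM_SHARDS = 4         # hypothetical
SLOTS_PER_SHARD = 256  # hypothetical
EMBED_DIM = 2          # hypothetical


def locate(slot: int) -> tuple[int, int]:
    """Split a logical slot index into (shard id, offset inside that shard)."""
    if not 0 <= slot < NUM_SHARDS * SLOTS_PER_SHARD:
        raise IndexError(f"slot {slot} is outside the logical table")
    return divmod(slot, SLOTS_PER_SHARD)


def build_shards() -> list[list[list[float]]]:
    """Allocate every shard as its own list, standing in for separate devices or host memory."""
    return [[[0.0] * EMBED_DIM for _ in range(SLOTS_PER_SHARD)] for _ in range(NUM_SHARDS)]


def read(shards: list[list[list[float]]], slot: int) -> list[float]:
    """Constant-time read: one divmod plus two index operations."""
    shard_id, offset = locate(slot)
    return shards[shard_id][offset]


shards = build_shards()
shards[1][1] = [1.0, 2.0]
```

Under this layout, the cost of a read does not depend on how many shards or slots exist, so the table can grow without changing lookup latency.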

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Editing the memory table alone could add new world knowledge to a deployed policy without any backbone retraining.
The constant-time lookup pattern may allow the same architecture to scale to much larger instruction sets in real-time control.
Partitioning the memory table by domain during training could support rapid adaptation to new task families.

Load-bearing premise

That static linguistic priors, obtained by decomposing instructions into task-specific key-grams and retrieving entries through deterministic hashed lookup, can be injected into selected hidden layers via context-aware gating without losing critical information or introducing new interference with visual reasoning.
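A minimal sketch of what gated injection could look like helps show why this premise matters. It assumes a residual form, hidden + g * prior, where the scalar gate g is a sigmoid of a scaled dot product between the hidden state and the retrieved prior. The function `gated_inject`, the dot-product gate, and the `gate_bias` parameter are assumptions; the paper's actual gate is not described on this page.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def gated_inject(hidden: list[float], prior: list[float], gate_bias: float = 0.0) -> list[float]:
    """Residual injection: hidden + g * prior, with g computed from both inputs."""
    score = sum(h * p for h, p in zip(hidden, prior)) / math.sqrt(len(hidden))
    gate = sigmoid(score + gate_bias)  # context-dependent scalar in (0, 1)
    return [h + gate * p for h, p in zip(hidden, prior)]


untouched = gated_inject([1.0, 2.0], [0.0, 0.0])
nearly_closed = gated_inject([1.0, 2.0], [0.5, 0.5], gate_bias=-50.0)
```

In this form, a closed gate or an empty prior leaves the hidden state unchanged, which is the behaviour the premise requires for visual reasoning to stay intact. Whether a trained gate actually closes when the prior is unhelpful is an empirical question.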

What would settle it

A controlled experiment in which Key-Gram is added to the π0 or π0.5 backbone and produces no improvement or a measurable drop in success rates on RoboTwin2.0 or LIBERO-Plus would show that the injection step fails to enhance or actively harms visual reasoning.

Figures

Figures reproduced from arXiv: 2605.18556 by Botao Ren, Jingjing Fan, Siyuan Li, Zhidong Deng.

Figure 2: Extensible memory allocation of Key-Gram. The memory is a logical table composed of ...

Figure 3: Demonstrations show the execution process of ...

Figure 4: Qualitative examples from real-world expansion tasks. Both ...

Figure 5: Layer-placement ablation on RoboTwin2.0. Shaded curves denote the weighted task score, ...

Original abstract

Embodied control increasingly requires models to follow compositional language instructions while reasoning over dynamic visual states. However, current vision-language-action policies and world-action models often couple linguistic knowledge with visual computation in a shared backbone or conditioning pathway, leading to modality competition and making knowledge extension dependent on backbone updates. In this paper, we introduce Key-Gram, a conditional-memory framework that separates language-derived world knowledge from visual-state reasoning for embodied control. At its core is a memory module that decomposes an instruction into task-specific key-grams, retrieves static linguistic priors through deterministic hashed lookup, and injects the retrieved entries into selected hidden layers through context-aware gating and lightweight convolutional fusion. This design allows the backbone to devote its main capacity to visual reasoning and action inference, while reusable instruction knowledge is stored in an extensible external memory. The logical memory table can be conveniently partitioned during training and, due to its $O(1)$ lookup pattern, efficiently placed on host memory during inference. Across RoboTwin2.0, LIBERO/LIBERO-Plus, and real-world dual-arm manipulation, Key-Gram consistently improves both $\pi_{0}$ and $\pi_{0.5}$ backbones, with average relative gains of $29.5\%/9.9\%$ on RoboTwin2.0, $35.8\%/4.5\%$ on LIBERO-Plus transfer without target-domain fine-tuning, and $15.4\%/8.1\%$ on real-world long-horizon tasks. These results demonstrate that externalized linguistic memory provides an effective and extensible mechanism for improving compositional grounding, transfer, and real-world manipulation.
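The abstract reports relative rather than absolute gains, and the difference matters when baselines are low. The sketch below shows the usual definition, (new - base) / base, using placeholder numbers that are NOT taken from the paper. The function name `relative_gain` is hypothetical, and the page does not say whether the reported averages are means of per-task relative gains or relative gains of mean scores.

```python
def relative_gain(base: float, new: float) -> float:
    """Relative improvement of new over base, in percent."""
    if base == 0:
        raise ValueError("relative gain is undefined for a zero baseline")
    return (new - base) / base * 100.0


# Placeholder success rates (%), for illustration only.
gain_low_baseline = relative_gain(40.0, 50.0)   # 10 points absolute
gain_high_baseline = relative_gain(80.0, 90.0)  # also 10 points absolute
```

The same 10-point absolute improvement yields a 25% relative gain from a 40% baseline but only 12.5% from an 80% baseline, which may partly explain why reported gains differ so much between a weaker and a stronger backbone.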

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces Key-Gram, a conditional-memory framework for embodied manipulation policies that decouples language-derived world knowledge from visual-state reasoning. Instructions are decomposed into task-specific key-grams whose static linguistic priors are retrieved via deterministic hashed lookup and injected into selected hidden layers of the backbone (π0 or π0.5) through context-aware gating plus lightweight convolutional fusion. The external memory is claimed to be extensible and to support O(1) lookup. Empirical results report average relative gains of 29.5%/9.9% on RoboTwin2.0, 35.8%/4.5% on LIBERO-Plus transfer without target-domain fine-tuning, and 15.4%/8.1% on real-world long-horizon dual-arm tasks.

Significance. If the central mechanism is shown to deliver the claimed separation without modality interference or capacity-driven artifacts, the work would offer a practical route to extensible linguistic priors in vision-language-action models, reducing the cost of knowledge updates and improving compositional transfer. The reported gains on standard benchmarks and real-world tasks would be noteworthy for the robotics community if properly controlled.

major comments (3)

§3.2 (Context-aware gating and fusion): The manuscript provides no layer-wise activation analysis, content-controlled ablations (e.g., random vs. retrieved key-grams), or interference metrics to verify that the injected priors leave visual reasoning intact and do not introduce modality competition. This is load-bearing for the claim that externalization, rather than added parameters or fusion capacity, drives the reported gains.
§4 (Experimental protocol): Relative performance gains are stated without reporting number of random seeds, statistical significance tests, error bars, exact baseline implementations, or controls that isolate the contribution of the memory module versus the added gating/fusion parameters. This prevents assessment of whether the data support the mechanism-level claims.
§4.3 (Ablation studies): No ablation removes the retrieved linguistic content while retaining the gating and fusion architecture, leaving open the possibility that performance improvements stem from architectural capacity rather than the extensible memory design.
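The content-controlled ablation requested in these comments can be sketched as a switch on what the lookup returns while everything downstream stays fixed. The function `ablated_lookup` and the mode names are hypothetical; this is one way to set up such a control, not a description of experiments in the paper.

```python
import random


def ablated_lookup(table: list[list[float]], slot: int, mode: str, rng: random.Random) -> list[float]:
    """Return the vector fed to the gating/fusion path under one ablation condition."""
    if mode == "retrieved":  # full method: the prior stored at the hashed slot
        return list(table[slot])
    if mode == "random":     # a prior from a randomly chosen slot, unrelated to the instruction
        return list(table[rng.randrange(len(table))])
    if mode == "empty":      # fusion layers stay active but receive zeros
        return [0.0] * len(table[0])
    raise ValueError(f"unknown ablation mode: {mode}")


toy_table = [[1.0, 2.0], [3.0, 4.0]]
rng = random.Random(0)
full = ablated_lookup(toy_table, 1, "retrieved", rng)
blank = ablated_lookup(toy_table, 1, "empty", rng)
```

Because the gating and fusion parameters are identical across the three modes, any performance gap between "retrieved" and the other two would be attributable to the linguistic content rather than to added capacity.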

minor comments (2)

Notation for π0 and π0.5 backbones should be defined on first use and cross-referenced to the original papers.
Figure 2 (memory table visualization) would benefit from an explicit legend distinguishing hashed keys from retrieved priors.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive comments on our work. We address each of the major comments in detail below, indicating the revisions we plan to make to strengthen the manuscript.

Point-by-point responses

Referee: §3.2 (Context-aware gating and fusion): The manuscript provides no layer-wise activation analysis, content-controlled ablations (e.g., random vs. retrieved key-grams), or interference metrics to verify that the injected priors leave visual reasoning intact and do not introduce modality competition. This is load-bearing for the claim that externalization, rather than added parameters or fusion capacity, drives the reported gains.

Authors: We agree that demonstrating the lack of modality interference is important for validating our central claim. In the revised version, we will add layer-wise activation analysis showing the impact of key-gram injection on visual features. Additionally, we will include content-controlled ablations using random key-grams and report quantitative interference metrics, such as the change in visual feature norms and cross-modal attention scores. These will help confirm that the gains arise from the externalized knowledge rather than capacity increases.

Revision: yes

Referee: §4 (Experimental protocol): Relative performance gains are stated without reporting number of random seeds, statistical significance tests, error bars, exact baseline implementations, or controls that isolate the contribution of the memory module versus the added gating/fusion parameters. This prevents assessment of whether the data support the mechanism-level claims.

Authors: We acknowledge the need for more rigorous statistical reporting. The experiments were run with 5 random seeds; we will report mean and standard deviation with error bars in the updated figures. We will also include statistical significance tests (e.g., t-tests) comparing Key-Gram to baselines. We will clarify the baseline implementations by referencing the exact code versions and hyperparameters used. To isolate the memory contribution, we plan to add a control where the fusion modules are active but fed with non-informative inputs.

Revision: yes

Referee: §4.3 (Ablation studies): No ablation removes the retrieved linguistic content while retaining the gating and fusion architecture, leaving open the possibility that performance improvements stem from architectural capacity rather than the extensible memory design.

Authors: This observation is correct, and we will address it by adding the requested ablation in the revised Section 4.3. Specifically, we will train and evaluate a variant where the key-gram lookup returns empty or random vectors, while keeping the gating and convolutional fusion layers intact. The performance difference between this variant and the full Key-Gram will quantify the benefit of the linguistic content over mere architectural additions.

Revision: yes
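The statistical reporting promised in the responses above (mean, standard deviation, and a t-test over seeds) can be sketched with the Python standard library. The per-seed numbers below are placeholders, NOT results from the paper, and `welch_t` is a hypothetical helper that computes only the test statistic; a p-value would additionally need the t distribution, for example from SciPy.

```python
import math
import statistics


def welch_t(a: list[float], b: list[float]) -> float:
    """Welch's t statistic for two samples with possibly unequal variances."""
    var_a, var_b = statistics.variance(a), statistics.variance(b)
    std_err = math.sqrt(var_a / len(a) + var_b / len(b))
    return (statistics.mean(a) - statistics.mean(b)) / std_err


# Placeholder per-seed success rates (%) for 5 seeds, for illustration only.
baseline = [40.0, 42.0, 41.0, 39.0, 43.0]
with_memory = [50.0, 52.0, 51.0, 49.0, 53.0]

mean_gap = statistics.mean(with_memory) - statistics.mean(baseline)
spread = statistics.stdev(baseline)
t_stat = welch_t(with_memory, baseline)
```

With only five seeds per condition, the standard deviation estimate is itself noisy, so reporting the individual per-seed values alongside the summary statistics would make the comparison easier to audit.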

Circularity Check

0 steps flagged

No significant circularity: empirical gains from design choice, not self-referential derivation

Full rationale

The paper introduces Key-Gram as an architectural design that decomposes instructions into key-grams, retrieves priors via hashed lookup, and injects them via gating and fusion to separate linguistic memory from visual reasoning. Reported improvements (e.g., 29.5%/9.9% on RoboTwin2.0) are presented as outcomes of experiments on standard benchmarks rather than predictions derived from equations or first principles. No load-bearing step reduces a claimed result to a fitted parameter renamed as prediction, a self-citation chain, or an ansatz smuggled through prior work. The central mechanism is a proposed engineering separation whose effectiveness is tested externally on held-out tasks and real-world scenarios, keeping the derivation self-contained against independent benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entity

Based solely on the abstract, no specific free parameters, axioms, or invented entities with independent evidence are detailed. The framework itself introduces a new memory structure for linguistic priors as the core contribution.

invented entities (1)

key-grams (no independent evidence)
Purpose: Task-specific decomposition of language instructions for deterministic memory retrieval.
Introduced in the abstract as the core mechanism for breaking down instructions, but no independent validation or external evidence is provided.

pith-pipeline@v0.9.0 · 5837 in / 1258 out tokens · 47342 ms · 2026-05-20T09:05:28.309582+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "decomposes an instruction into task-specific key-grams, retrieves static linguistic priors through deterministic hashed lookup, and injects the retrieved entries into selected hidden layers through context-aware gating and lightweight convolutional fusion"

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "The logical memory table can be conveniently partitioned during training and, due to its O(1) lookup pattern, efficiently placed on host memory during inference"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

49 extracted references · 49 canonical work pages · 12 internal anchors

[1]

(2023) RT-2: Vision-language-action models transfer web knowledge to robotic control

Zitkovich, B., Yu, T., Xu, S., Xu, P., Xiao, T., Xia, F., Wu, J., Wohlhart, P., Welker, S., Wahid, A., Vuong, Q., Vanhoucke, V ., Tran, H., Soricut, R., Singh, A., Singh, J., Sermanet, P., Sanketi, P.R., Salazar, G., Ryoo, M.S., et al. (2023) RT-2: Vision-language-action models transfer web knowledge to robotic control. In J. Tan, M. Toussaint and K. Darv...

work page 2023
[2]

& Finn, C

Kim, M.J., Pertsch, K., Karamcheti, S., Xiao, T., Balakrishna, A., Nair, S., Rafailov, R., Foster, E.P., Sanketi, P.R., Vuong, Q., Kollar, T., Burchfiel, B., Tedrake, R., Sadigh, D., Levine, S., Liang, P. & Finn, C. (2025) OpenVLA: An open-source vision-language-action model. In P. Agrawal, O. Kroemer and W. Burgard (eds.),Proceedings of The 8th Conferenc...

work page 2025
[3]

Fine-Tuning Vision-Language-Action Models: Optimizing Speed and Success

Kim, M.J., Finn, C. & Liang, P. (2025) Fine-tuning vision-language-action models: Optimizing speed and success.arXiv preprint arXiv:2502.19645

work page internal anchor Pith review Pith/arXiv arXiv 2025
[4]

$\pi_0$: A Vision-Language-Action Flow Model for General Robot Control

Black, K., Brown, N., Driess, D., Esmail, A., Equi, M., Finn, C., Fusai, N., Groom, L., Hausman, K., Ichter, B., Jakubczak, S., Jones, T., Ke, L., Levine, S., Li-Bell, A., Mothukuri, M., Nair, S., Pertsch, K., Shi, L.X., Tanner, J., et al. (2024) π0: A vision-language-action flow model for general robot control.arXiv preprint arXiv:2410.24164

work page internal anchor Pith review Pith/arXiv arXiv 2024
[5]

(2025)π 0.5: A vision-language-action model with open-world generalization

Black, K., Brown, N., Darpinian, J., Dhabalia, K., Driess, D., Esmail, A., Equi, M., Finn, C., Fusai, N., Galliker, M.Y ., Ghosh, D., Groom, L., Hausman, K., Ichter, B., Jakubczak, S., Jones, T., Ke, L., LeBlanc, D., Levine, S., Li-Bell, A., et al. (2025)π 0.5: A vision-language-action model with open-world generalization. InProceedings of The 9th Confere...

work page 2025
[6]

RDT-1B: a Diffusion Foundation Model for Bimanual Manipulation

Liu, S., Wu, L., Li, B., Tan, H., Chen, H., Wang, Z., Xu, K., Su, H. & Zhu, J. (2024) RDT-1B: A diffusion foundation model for bimanual manipulation.arXiv preprint arXiv:2410.07864

work page internal anchor Pith review Pith/arXiv arXiv 2024
[7]

& Zhan, X

Zheng, J., Li, J., Wang, Z., Liu, D., Kang, X., Feng, Y ., Zheng, Y ., Zou, J., Chen, Y ., Zeng, J., Zhang, Y .-Q., Pang, J., Liu, J., Wang, T. & Zhan, X. (2026) X-VLA: Soft-prompted transformer as scalable cross- embodiment vision-language-action model. InInternational Conference on Learning Representations

work page 2026
[8]

GR00T N1: An Open Foundation Model for Generalist Humanoid Robots

NVIDIA, Bjorck, J., Castañeda, F., Cherniadev, N., Da, X., Ding, R., Fan, L., Fang, Y ., Fox, D., Hu, F., Huang, S., Jang, J., Jiang, Z., Kautz, J., Kundalia, K., Lao, L., Li, Z., Lin, Z., Lin, K., Liu, G., Llontop, E., et al. (2025) GR00T N1: An open foundation model for generalist humanoid robots.arXiv preprint arXiv:2503.14734

work page internal anchor Pith review Pith/arXiv arXiv 2025
[9]

GR-3 Technical Report

Cheang, C.L., Chen, S., Cui, Z., Hu, Y ., Huang, L., Kong, T., Li, H., Li, Y ., Liu, Y ., Ma, X., Niu, H., Ou, W., Peng, W., Ren, Z., Shi, H., Tian, J., Wu, H., Xiao, X., Xiao, Y ., Xu, J. & Yang, Y . (2025) GR-3 technical report.arXiv preprint arXiv:2507.15493

work page internal anchor Pith review Pith/arXiv arXiv 2025
[10]

GR-2: A Generative Video-Language-Action Model with Web-Scale Knowledge for Robot Manipulation

Cheang, C.L., Chen, G., Jing, Y ., Kong, T., Li, H., Li, Y ., Liu, Y ., Wu, H., Xu, J., Yang, Y ., Zhang, H. & Zhu, M. (2024) GR-2: A generative video-language-action model with web-scale knowledge for robot manipulation.arXiv preprint arXiv:2410.06158

work page internal anchor Pith review Pith/arXiv arXiv 2024
[11]

Enerverse: Envisioning embodied future space for robotics manipulation

Huang, S., Chen, L., Zhou, P., Chen, S., Jiang, Z., Hu, Y ., Liao, Y ., Gao, P., Li, H., Yao, M. & Ren, G. (2025) EnerVerse: Envisioning embodied future space for robotics manipulation.arXiv preprint arXiv:2501.01895

work page arXiv 2025
[12]

Genie Envisioner: A Unified World Foundation Platform for Robotic Manipulation

Liao, Y ., Zhou, P., Huang, S., Yang, D., Chen, S., Jiang, Y ., Hu, Y ., Cai, J., Liu, S., Luo, J., Chen, L., Yan, S., Yao, M. & Ren, G. (2025) Genie Envisioner: A unified world foundation platform for robotic manipulation.arXiv preprint arXiv:2508.05635

work page internal anchor Pith review Pith/arXiv arXiv 2025
[13]

& Huang, S

Lu, G., Jia, B., Li, P., Chen, Y ., Wang, Z., Tang, Y . & Huang, S. (2025) GWM: Towards scalable Gaussian world models for robotic manipulation. InProceedings of the IEEE/CVF International Conference on Computer Vision

work page 2025
[14]

Li, L., Zhang, Q., Luo, Y ., Yang, S., Wang, R., Han, F., Yu, M., Gao, Z., Xue, N., Zhu, X., Shen, Y . & Xu, Y . (2026) Causal world modeling for robot control.arXiv preprint arXiv:2601.21998

work page internal anchor Pith review Pith/arXiv arXiv 2026
[15]

World Action Models are Zero-shot Policies

Ye, S., Ge, Y ., Zheng, K., Gao, S., Yu, S., Kurian, G., Indupuru, S., Tan, Y .L., Zhu, C., Xiang, J., Malik, A., Lee, K., Liang, W., Ranawaka, N., Gu, J., Xu, Y ., Wang, G., Hu, F., Narayan, A., Bjorck, J., et al. (2026) World action models are zero-shot policies.arXiv preprint arXiv:2602.15922

work page internal anchor Pith review Pith/arXiv arXiv 2026
[16]

& Song, S

Chi, C., Xu, Z., Feng, S., Cousineau, E., Du, Y ., Burchfiel, B., Tedrake, R. & Song, S. (2023) Diffusion policy: Visuomotor policy learning via action diffusion. InProceedings of Robotics: Science and Systems. 11

work page 2023
[17]

RoboTwin 2.0: A Scalable Data Generator and Benchmark with Strong Domain Randomization for Robust Bimanual Robotic Manipulation

Chen, T., Chen, Z., Chen, B., Cai, Z., Liu, Y ., Li, Z., Liang, Q., Lin, X., Ge, Y ., Gu, Z., Deng, W., Guo, Y ., Nian, T., Xie, X., Chen, Q., Su, K., Xu, T., Liu, G., Hu, M., Gao, H., et al. (2025) RoboTwin 2.0: A scalable data generator and benchmark with strong domain randomization for robust bimanual robotic manipulation.arXiv preprint arXiv:2506.18088

work page internal anchor Pith review Pith/arXiv arXiv 2025
[18]

& Stone, P

Liu, B., Zhu, Y ., Gao, C., Feng, Y ., Liu, Q., Zhu, Y . & Stone, P. (2023) LIBERO: Benchmarking knowledge transfer for lifelong robot learning. InAdvances in Neural Information Processing Systems 36, pp. 44776– 44791

work page 2023
[19]

LIBERO-Plus: In-depth Robustness Analysis of Vision-Language-Action Models

Fei, S., Wang, S., Shi, J., Dai, Z., Cai, J., Qian, P., Ji, L., He, X., Zhang, S., Fei, Z., Fu, J., Gong, J. & Qiu, X. (2025) LIBERO-Plus: In-depth robustness analysis of vision-language-action models.arXiv preprint arXiv:2510.13626

work page internal anchor Pith review Pith/arXiv arXiv 2025
[20]

& Jégou, H

Lample, G., Sablayrolles, A., Ranzato, M.A., Denoyer, L. & Jégou, H. (2019) Large memory layers with product keys. InAdvances in Neural Information Processing Systems 32

work page 2019
[21]

& Chang, M.-W

Guu, K., Lee, K., Tung, Z., Pasupat, P. & Chang, M.-W. (2020) REALM: Retrieval-augmented language model pre-training. InProceedings of the 37th International Conference on Machine Learning, pp. 3929–

work page 2020
[22]

(2022) Improving language models by retrieving from trillions of tokens

Borgeaud, S., Mensch, A., Hoffmann, J., Cai, T., Rutherford, E., Millican, K., van den Driessche, G.B., Lespiau, J.-B., Damoc, B., Clark, A., de Las Casas, D., Guy, A., Menick, J., Ring, R., Hennigan, T., Huang, S., Maggiore, L., Jones, C., Cassirer, A., Brock, A., et al. (2022) Improving language models by retrieving from trillions of tokens. InProceedin...

work page 2022
[23]

Conditional Memory via Scalable Lookup: A New Axis of Sparsity for Large Language Models

Cheng, X., Zeng, W., Dai, D., Chen, Q., Wang, B., Xie, Z., Huang, K., Yu, X., Hao, Z., Li, Y ., Zhang, H., Zhang, H., Zhao, D. & Liang, W. (2026) Conditional memory via scalable lookup: A new axis of sparsity for large language models.arXiv preprint arXiv:2601.07372

work page internal anchor Pith review Pith/arXiv arXiv 2026
[24]

& Cai, X

Liu, H., Zhang, J., Wang, C., Hu, X., Lyu, L., Sun, J., Yang, X., Wang, B., Li, F., Qian, Y ., Si, L., Sun, Y ., Li, R., Pei, P., Xie, Y . & Cai, X. (2026) Scaling embeddings outperforms scaling experts in language models.arXiv preprint arXiv:2601.21204

work page arXiv 2026
[25]

Meki: Memory-based expert knowledge injection for efficient llm scaling.arXiv preprint arXiv:2602.03359,

Ding, N., Liu, F., Kim, K., Hao, L., Lee, K.-H., Ko, H. & Tang, Y . (2026) MeKi: Memory-based expert knowledge injection for efficient LLM scaling.arXiv preprint arXiv:2602.03359

work page arXiv 2026
[26]

Accessed May 7, 2026

Google (2026) Gemma 4 model overview.Google AI for Developers Documentation. Accessed May 7, 2026

work page 2026
[27]

& Courville, A

Perez, E., Strub, F., de Vries, H., Dumoulin, V . & Courville, A. (2018) FiLM: Visual reasoning with a general conditioning layer. InProceedings of the AAAI Conference on Artificial Intelligence32(1)

work page 2018
[28]

Dumoulin , author E

Dumoulin, V ., Perez, E., Schucher, N., Strub, F., de Vries, H., Courville, A. & Bengio, Y . (2018) Feature- wise transformations.Distill. doi:10.23915/distill.00011

work page doi:10.23915/distill.00011 2018
[29]

& Xie, S

Peebles, W. & Xie, S. (2023) Scalable diffusion models with transformers. InProceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4195–4205

work page 2023
[30]

& Levine, S

Dasari, S., Mees, O., Zhao, S., Srirama, M.K. & Levine, S. (2024) The ingredients for robotic diffusion transformers.arXiv preprint arXiv:2410.10088

work page arXiv 2024
[31]

& Cohen, N.J

McCloskey, M. & Cohen, N.J. (1989) Catastrophic interference in connectionist networks: The sequential learning problem. In G.H. Bower (ed.),Psychology of Learning and Motivation, V ol.24, pp. 109–165. Academic Press

work page 1989
[32]

(1999) Catastrophic forgetting in connectionist networks.Trends in Cognitive Sciences 3(4):128–135

French, R.M. (1999) Catastrophic forgetting in connectionist networks.Trends in Cognitive Sciences 3(4):128–135

work page 1999
[33]

& Kiela, D

Lewis, P., Perez, E., Piktus, A., Petroni, F., Karpukhin, V ., Goyal, N., Küttler, H., Lewis, M., Yih, W.-t., Rocktäschel, T., Riedel, S. & Kiela, D. (2020) Retrieval-augmented generation for knowledge-intensive NLP tasks. InAdvances in Neural Information Processing Systems 33, pp. 9459–9474

work page 2020
[34]

& Wei, F

Wang, W., Dong, L., Cheng, H., Liu, X., Yan, X., Gao, J. & Wei, F. (2023) Augmenting language models with long-term memory. InAdvances in Neural Information Processing Systems 36, pp. 74530–74543

work page 2023
[35]

& Szegedy, C

Wu, Y ., Rabe, M.N., Hutchins, D. & Szegedy, C. (2022) Memorizing transformers. InInternational Conference on Learning Representations. 12 A Technical appendices and supplementary material A.1 Full RoboTwin2.0 Results Table 6: Full RoboTwin2.0 results (%). Gains in parentheses for KG variants are relative improvements over their corresponding base backbon...

work page 2022
[36]

Output exactly 8 keywords

work page
[37]

Each keyword must contain 2 to 4 words

work page
[38]

Prefer high-information phrases that combine multiple semantic roles in one phrase

work page
[39]

Prefer action-centered phrases over static descriptive phrases whenever possible

work page
[40]

At least 3 of the 8 keywords must explicitly contain an action verb

work page
[41]

verb + object + relation/target/source b

Prefer these phrase types, in this priority order: a. verb + object + relation/target/source b. verb + particle + object c. verb + prep + object d. object + prep + object e. attribute + object

work page
[42]

A good keyword should ideally compress 2 or more semantic elements, such as: - action + object - action + object + source - action + object + target - object + attribute - object + location

work page
[43]

Use standalone static noun phrases only when they add important information that is not already covered elsewhere

work page
[44]

Use at most 5 standalone noun phrases

work page
[45]

If a static phrase can be replaced by a more informative action phrase, prefer the action phrase

work page
[46]

pick up" -

Prefer phrases like: - "pick up" - "pick bowl from drawer" - "pick up bowl" - "place bowl on plate" - "bowl in top drawer" - "black bowl"

work page
[47]

place it on

Avoid: - fragmented phrases - fake combinations across unrelated spans - pronoun-centered phrases like "place it on" - low-information phrases 14 - too many static environment phrases - duplicated semantics across multiple keywords - more than 4 words in a keyword - less or more than 8 keywords

work page
[48]

Do not explain anything

work page
[49]

keywords

Return valid JSON only. Example: Instruction: pick up the green sponge from the sink and wipe the wooden table near the window Output: { "keywords": [ "pick and wipe", "pick sponge from sink", "pick up sponge", "green sponge", "wipe wooden table", "wipe table near window", "table near window", "wooden table" ] } MUST FOLLOW: - Do NOT less or more than 8 k...

work page

[1] [1]

(2023) RT-2: Vision-language-action models transfer web knowledge to robotic control

Zitkovich, B., Yu, T., Xu, S., Xu, P., Xiao, T., Xia, F., Wu, J., Wohlhart, P., Welker, S., Wahid, A., Vuong, Q., Vanhoucke, V ., Tran, H., Soricut, R., Singh, A., Singh, J., Sermanet, P., Sanketi, P.R., Salazar, G., Ryoo, M.S., et al. (2023) RT-2: Vision-language-action models transfer web knowledge to robotic control. In J. Tan, M. Toussaint and K. Darv...

work page 2023

[2] [2]

& Finn, C

Kim, M.J., Pertsch, K., Karamcheti, S., Xiao, T., Balakrishna, A., Nair, S., Rafailov, R., Foster, E.P., Sanketi, P.R., Vuong, Q., Kollar, T., Burchfiel, B., Tedrake, R., Sadigh, D., Levine, S., Liang, P. & Finn, C. (2025) OpenVLA: An open-source vision-language-action model. In P. Agrawal, O. Kroemer and W. Burgard (eds.),Proceedings of The 8th Conferenc...

work page 2025

[3] [3]

Fine-Tuning Vision-Language-Action Models: Optimizing Speed and Success

Kim, M.J., Finn, C. & Liang, P. (2025) Fine-tuning vision-language-action models: Optimizing speed and success.arXiv preprint arXiv:2502.19645

work page internal anchor Pith review Pith/arXiv arXiv 2025

[4] [4]

$\pi_0$: A Vision-Language-Action Flow Model for General Robot Control

Black, K., Brown, N., Driess, D., Esmail, A., Equi, M., Finn, C., Fusai, N., Groom, L., Hausman, K., Ichter, B., Jakubczak, S., Jones, T., Ke, L., Levine, S., Li-Bell, A., Mothukuri, M., Nair, S., Pertsch, K., Shi, L.X., Tanner, J., et al. (2024) π0: A vision-language-action flow model for general robot control.arXiv preprint arXiv:2410.24164

work page internal anchor Pith review Pith/arXiv arXiv 2024

[5] [5]

(2025)π 0.5: A vision-language-action model with open-world generalization

Black, K., Brown, N., Darpinian, J., Dhabalia, K., Driess, D., Esmail, A., Equi, M., Finn, C., Fusai, N., Galliker, M.Y ., Ghosh, D., Groom, L., Hausman, K., Ichter, B., Jakubczak, S., Jones, T., Ke, L., LeBlanc, D., Levine, S., Li-Bell, A., et al. (2025)π 0.5: A vision-language-action model with open-world generalization. InProceedings of The 9th Confere...

work page 2025

[6] [6]

RDT-1B: a Diffusion Foundation Model for Bimanual Manipulation

Liu, S., Wu, L., Li, B., Tan, H., Chen, H., Wang, Z., Xu, K., Su, H. & Zhu, J. (2024) RDT-1B: A diffusion foundation model for bimanual manipulation.arXiv preprint arXiv:2410.07864

work page internal anchor Pith review Pith/arXiv arXiv 2024

[7] [7]

& Zhan, X

Zheng, J., Li, J., Wang, Z., Liu, D., Kang, X., Feng, Y ., Zheng, Y ., Zou, J., Chen, Y ., Zeng, J., Zhang, Y .-Q., Pang, J., Liu, J., Wang, T. & Zhan, X. (2026) X-VLA: Soft-prompted transformer as scalable cross- embodiment vision-language-action model. InInternational Conference on Learning Representations

work page 2026

[8] NVIDIA, Bjorck, J., Castañeda, F., Cherniadev, N., Da, X., Ding, R., Fan, L., Fang, Y., Fox, D., Hu, F., Huang, S., Jang, J., Jiang, Z., Kautz, J., Kundalia, K., Lao, L., Li, Z., Lin, Z., Lin, K., Liu, G., Llontop, E., et al. (2025) GR00T N1: An open foundation model for generalist humanoid robots. arXiv preprint arXiv:2503.14734

[9] Cheang, C.L., Chen, S., Cui, Z., Hu, Y., Huang, L., Kong, T., Li, H., Li, Y., Liu, Y., Ma, X., Niu, H., Ou, W., Peng, W., Ren, Z., Shi, H., Tian, J., Wu, H., Xiao, X., Xiao, Y., Xu, J. & Yang, Y. (2025) GR-3 technical report. arXiv preprint arXiv:2507.15493

[10] Cheang, C.L., Chen, G., Jing, Y., Kong, T., Li, H., Li, Y., Liu, Y., Wu, H., Xu, J., Yang, Y., Zhang, H. & Zhu, M. (2024) GR-2: A generative video-language-action model with web-scale knowledge for robot manipulation. arXiv preprint arXiv:2410.06158

[11] Huang, S., Chen, L., Zhou, P., Chen, S., Jiang, Z., Hu, Y., Liao, Y., Gao, P., Li, H., Yao, M. & Ren, G. (2025) EnerVerse: Envisioning embodied future space for robotics manipulation. arXiv preprint arXiv:2501.01895

[12] Liao, Y., Zhou, P., Huang, S., Yang, D., Chen, S., Jiang, Y., Hu, Y., Cai, J., Liu, S., Luo, J., Chen, L., Yan, S., Yao, M. & Ren, G. (2025) Genie Envisioner: A unified world foundation platform for robotic manipulation. arXiv preprint arXiv:2508.05635

[13] Lu, G., Jia, B., Li, P., Chen, Y., Wang, Z., Tang, Y. & Huang, S. (2025) GWM: Towards scalable Gaussian world models for robotic manipulation. In Proceedings of the IEEE/CVF International Conference on Computer Vision

[14] Li, L., Zhang, Q., Luo, Y., Yang, S., Wang, R., Han, F., Yu, M., Gao, Z., Xue, N., Zhu, X., Shen, Y. & Xu, Y. (2026) Causal world modeling for robot control. arXiv preprint arXiv:2601.21998

[15] Ye, S., Ge, Y., Zheng, K., Gao, S., Yu, S., Kurian, G., Indupuru, S., Tan, Y.L., Zhu, C., Xiang, J., Malik, A., Lee, K., Liang, W., Ranawaka, N., Gu, J., Xu, Y., Wang, G., Hu, F., Narayan, A., Bjorck, J., et al. (2026) World action models are zero-shot policies. arXiv preprint arXiv:2602.15922

[16] Chi, C., Xu, Z., Feng, S., Cousineau, E., Du, Y., Burchfiel, B., Tedrake, R. & Song, S. (2023) Diffusion policy: Visuomotor policy learning via action diffusion. In Proceedings of Robotics: Science and Systems

[17] Chen, T., Chen, Z., Chen, B., Cai, Z., Liu, Y., Li, Z., Liang, Q., Lin, X., Ge, Y., Gu, Z., Deng, W., Guo, Y., Nian, T., Xie, X., Chen, Q., Su, K., Xu, T., Liu, G., Hu, M., Gao, H., et al. (2025) RoboTwin 2.0: A scalable data generator and benchmark with strong domain randomization for robust bimanual robotic manipulation. arXiv preprint arXiv:2506.18088

[18] Liu, B., Zhu, Y., Gao, C., Feng, Y., Liu, Q., Zhu, Y. & Stone, P. (2023) LIBERO: Benchmarking knowledge transfer for lifelong robot learning. In Advances in Neural Information Processing Systems 36, pp. 44776–44791

[19] Fei, S., Wang, S., Shi, J., Dai, Z., Cai, J., Qian, P., Ji, L., He, X., Zhang, S., Fei, Z., Fu, J., Gong, J. & Qiu, X. (2025) LIBERO-Plus: In-depth robustness analysis of vision-language-action models. arXiv preprint arXiv:2510.13626

[20] Lample, G., Sablayrolles, A., Ranzato, M.A., Denoyer, L. & Jégou, H. (2019) Large memory layers with product keys. In Advances in Neural Information Processing Systems 32

[21] Guu, K., Lee, K., Tung, Z., Pasupat, P. & Chang, M.-W. (2020) REALM: Retrieval-augmented language model pre-training. In Proceedings of the 37th International Conference on Machine Learning, pp. 3929–

[22] Borgeaud, S., Mensch, A., Hoffmann, J., Cai, T., Rutherford, E., Millican, K., van den Driessche, G.B., Lespiau, J.-B., Damoc, B., Clark, A., de Las Casas, D., Guy, A., Menick, J., Ring, R., Hennigan, T., Huang, S., Maggiore, L., Jones, C., Cassirer, A., Brock, A., et al. (2022) Improving language models by retrieving from trillions of tokens. In Proceedin...

[23] Cheng, X., Zeng, W., Dai, D., Chen, Q., Wang, B., Xie, Z., Huang, K., Yu, X., Hao, Z., Li, Y., Zhang, H., Zhang, H., Zhao, D. & Liang, W. (2026) Conditional memory via scalable lookup: A new axis of sparsity for large language models. arXiv preprint arXiv:2601.07372

[24] Liu, H., Zhang, J., Wang, C., Hu, X., Lyu, L., Sun, J., Yang, X., Wang, B., Li, F., Qian, Y., Si, L., Sun, Y., Li, R., Pei, P., Xie, Y. & Cai, X. (2026) Scaling embeddings outperforms scaling experts in language models. arXiv preprint arXiv:2601.21204

[25] Ding, N., Liu, F., Kim, K., Hao, L., Lee, K.-H., Ko, H. & Tang, Y. (2026) MeKi: Memory-based expert knowledge injection for efficient LLM scaling. arXiv preprint arXiv:2602.03359

[26] Google (2026) Gemma 4 model overview. Google AI for Developers Documentation. Accessed May 7, 2026

[27] Perez, E., Strub, F., de Vries, H., Dumoulin, V. & Courville, A. (2018) FiLM: Visual reasoning with a general conditioning layer. In Proceedings of the AAAI Conference on Artificial Intelligence 32(1)

[28] Dumoulin, V., Perez, E., Schucher, N., Strub, F., de Vries, H., Courville, A. & Bengio, Y. (2018) Feature-wise transformations. Distill. doi:10.23915/distill.00011

[29] Peebles, W. & Xie, S. (2023) Scalable diffusion models with transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4195–4205

[30] Dasari, S., Mees, O., Zhao, S., Srirama, M.K. & Levine, S. (2024) The ingredients for robotic diffusion transformers. arXiv preprint arXiv:2410.10088

[31] McCloskey, M. & Cohen, N.J. (1989) Catastrophic interference in connectionist networks: The sequential learning problem. In G.H. Bower (ed.), Psychology of Learning and Motivation, Vol. 24, pp. 109–165. Academic Press

[32] French, R.M. (1999) Catastrophic forgetting in connectionist networks. Trends in Cognitive Sciences 3(4):128–135

[33] Lewis, P., Perez, E., Piktus, A., Petroni, F., Karpukhin, V., Goyal, N., Küttler, H., Lewis, M., Yih, W.-t., Rocktäschel, T., Riedel, S. & Kiela, D. (2020) Retrieval-augmented generation for knowledge-intensive NLP tasks. In Advances in Neural Information Processing Systems 33, pp. 9459–9474

[34] Wang, W., Dong, L., Cheng, H., Liu, X., Yan, X., Gao, J. & Wei, F. (2023) Augmenting language models with long-term memory. In Advances in Neural Information Processing Systems 36, pp. 74530–74543

[35] Wu, Y., Rabe, M.N., Hutchins, D. & Szegedy, C. (2022) Memorizing transformers. In International Conference on Learning Representations.

A Technical appendices and supplementary material

A.1 Full RoboTwin2.0 Results

Table 6: Full RoboTwin2.0 results (%). Gains in parentheses for KG variants are relative improvements over their corresponding base backbon...

- Output exactly 8 keywords
- Each keyword must contain 2 to 4 words
- Prefer high-information phrases that combine multiple semantic roles in one phrase
- Prefer action-centered phrases over static descriptive phrases whenever possible
- At least 3 of the 8 keywords must explicitly contain an action verb
- Prefer these phrase types, in this priority order:
  a. verb + object + relation/target/source
  b. verb + particle + object
  c. verb + prep + object
  d. object + prep + object
  e. attribute + object
- A good keyword should ideally compress 2 or more semantic elements, such as:
  - action + object
  - action + object + source
  - action + object + target
  - object + attribute
  - object + location
- Use standalone static noun phrases only when they add important information that is not already covered elsewhere
- Use at most 5 standalone noun phrases
- If a static phrase can be replaced by a more informative action phrase, prefer the action phrase
- Prefer phrases like:
  - "pick up"
  - "pick bowl from drawer"
  - "pick up bowl"
  - "place bowl on plate"
  - "bowl in top drawer"
  - "black bowl"
- Avoid:
  - fragmented phrases
  - fake combinations across unrelated spans
  - pronoun-centered phrases like "place it on"
  - low-information phrases
  - too many static environment phrases
  - duplicated semantics across multiple keywords
  - more than 4 words in a keyword
  - less or more than 8 keywords
- Do not explain anything
- Return valid JSON only.

Example:
Instruction: pick up the green sponge from the sink and wipe the wooden table near the window
Output: { "keywords": [ "pick and wipe", "pick sponge from sink", "pick up sponge", "green sponge", "wipe wooden table", "wipe table near window", "table near window", "wooden table" ] }

MUST FOLLOW:
- Do NOT less or more than 8 k...
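The mechanically checkable rules in the prompt above (valid JSON, exactly 8 keywords, 2 to 4 words each, no pronoun-centered phrases) could be verified on a model's reply before it is used. The following is a minimal sketch, not the authors' code: the function name `validate_keywords`, the `PRONOUNS` set, and the exact-duplicate check are assumptions made for illustration. Semantic rules such as "at least 3 keywords contain an action verb" or "duplicated semantics" are not covered.

```python
import json

# Hypothetical pronoun list; the prompt only names "place it on" as an example.
PRONOUNS = {"it", "them", "this", "that"}


def validate_keywords(raw):
    """Return a list of rule violations; an empty list means the reply passes."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return ["not valid JSON"]

    keywords = data.get("keywords") if isinstance(data, dict) else None
    if not isinstance(keywords, list) or not all(isinstance(k, str) for k in keywords):
        return ['missing "keywords" list of strings']

    problems = []
    if len(keywords) != 8:
        problems.append(f"expected 8 keywords, got {len(keywords)}")

    seen = set()
    for kw in keywords:
        words = kw.lower().split()
        if not 2 <= len(words) <= 4:
            problems.append(f"{kw!r}: must contain 2 to 4 words")
        if PRONOUNS & set(words):
            problems.append(f"{kw!r}: pronoun-centered phrase")
        normalized = " ".join(words)
        if normalized in seen:  # only exact repeats; semantic overlap is not detected
            problems.append(f"{kw!r}: duplicate keyword")
        seen.add(normalized)
    return problems
```

Calling `validate_keywords` on the example output from the prompt returns an empty list, while a reply with too few keywords or a phrase such as "place it on" yields one message per violated rule.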