
arXiv: 2605.20856 · v1 · pith:MBT54SFTnew · submitted 2026-05-20 · 💻 cs.RO · cs.AI · cs.LG

DISC: Decoupling Instruction from State-Conditioned Control via Policy Generation

Pith reviewed 2026-05-21 04:55 UTC · model grok-4.3

classification 💻 cs.RO · cs.AI · cs.LG
keywords policy generation · hypernetworks · language-conditioned robotics · visuomotor policies · decoupling language and state · robot manipulation · task-specific policies

The pith

A hypernetwork generates complete task-specific robot policies from language instructions alone, so the resulting controller has no direct access to language and must encode task awareness in its parameters.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Standard language-conditioned policies share network weights between instructions and observations, creating a pathway for visual shortcuts that ignore the instruction. DISC instead routes the instruction through a hypernetwork that outputs every weight of a separate visuomotor policy. The generated policy then executes using only observations, forcing any correct behavior to originate from parameters shaped by language. This separation yields higher success rates than entangled baselines on LIBERO-90 and Meta-World, with the gap widening on long-horizon tasks, and produces clear gains on real-world benchmarks where every task shares identical visual scenes.

Core claim

DISC replaces conditioning a shared policy on both language and observations with a hypernetwork that produces the full parameter set of a dedicated task-specific policy from the instruction alone. Because this generated policy never receives language input, its task-specific actions must derive from the parameters rather than from any observation-to-action mapping learned during training. A two-stage hypernetwork design incorporates the structure of gradient-based optimization as a feed-forward inductive bias to synthesize globally consistent high-dimensional weights without running actual optimization at inference time.
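
For orientation, the analogy between gradient descent and the feed-forward refinement can be written out as below. This is a sketch under assumptions, not the paper's equations: the symbols $\eta$, $\mathcal{L}$, $f_{\psi}$, and $g_{\phi}$ are illustrative names introduced here, while $\theta^{(0)}$, $\theta^{(T)}$, and $e_l$ follow the notation visible in the Figure 2 caption.

```latex
% Plain gradient descent on a task loss (the structure the refinement stage is said to mirror):
\theta^{(t+1)} = \theta^{(t)} - \eta \, \nabla_{\theta} \mathcal{L}\big(\theta^{(t)}\big)

% Hypothetical feed-forward counterpart; no loss or gradient is evaluated at inference:
\theta^{(0)} = f_{\psi}(e_l), \qquad
\theta^{(t+1)} = \theta^{(t)} + g_{\phi}\big(\theta^{(t)}, e_l\big), \quad t = 0, \dots, T-1

% The generated policy then maps observations to actions without seeing the instruction:
a = \pi_{\theta^{(T)}}(o)
```

Under this reading, the learned map $g_{\phi}$ plays the role that $-\eta \nabla_{\theta} \mathcal{L}$ plays in ordinary optimization, and the instruction enters only through $e_l$.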

What carries the argument

A two-stage hypernetwork that maps an instruction to an initial set of policy parameters and then refines them through a feed-forward module whose structure mirrors gradient descent steps.
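
The data flow described above can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions, not the authors' implementation: `ToyHypernetwork`, `make_policy`, the linear policy, the `tanh` update, and the placeholder embeddings are all hypothetical, and the weights are random rather than trained.

```python
import math
import random

def linear(weights, bias, x):
    """Apply y = W x + b with W stored as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]

def init_matrix(rng, rows, cols, scale=0.1):
    """Random Gaussian matrix; stands in for trained hypernetwork weights."""
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]

class ToyHypernetwork:
    """Hypothetical two-stage generator: initial guess, then fixed refinement steps."""

    def __init__(self, embed_dim, obs_dim, act_dim, steps=3, seed=0):
        rng = random.Random(seed)
        self.steps = steps
        self.n_params = act_dim * obs_dim + act_dim  # policy weight matrix plus bias
        # Stage 1: instruction embedding -> initial policy parameters theta^(0).
        self.w_init = init_matrix(rng, self.n_params, embed_dim)
        # Stage 2: (theta^(t), embedding) -> additive update, reused at every step.
        self.w_refine = init_matrix(rng, self.n_params, self.n_params + embed_dim)

    def generate(self, embedding):
        zeros = [0.0] * self.n_params
        theta = linear(self.w_init, zeros, embedding)
        for _ in range(self.steps):
            # Shaped like theta <- theta + step, but the step is predicted by a
            # feed-forward map; no loss or gradient is evaluated here.
            delta = linear(self.w_refine, zeros, theta + embedding)
            theta = [t + math.tanh(d) for t, d in zip(theta, delta)]
        return theta

def make_policy(theta, obs_dim, act_dim):
    """Unpack flat parameters into a linear policy that sees observations only."""
    split = act_dim * obs_dim
    weights = [theta[i * obs_dim:(i + 1) * obs_dim] for i in range(act_dim)]
    bias = theta[split:]

    def policy(observation):
        return linear(weights, bias, observation)

    return policy

hyper = ToyHypernetwork(embed_dim=4, obs_dim=3, act_dim=2)
# Placeholder vectors standing in for two encoded instructions.
embed_a = [1.0, 0.0, 0.5, -0.5]
embed_b = [0.0, 1.0, -0.5, 0.5]
policy_a = make_policy(hyper.generate(embed_a), obs_dim=3, act_dim=2)
policy_b = make_policy(hyper.generate(embed_b), obs_dim=3, act_dim=2)

shared_observation = [0.2, -0.1, 0.4]
action_a = policy_a(shared_observation)
action_b = policy_b(shared_observation)
print(action_a, action_b)
```

Note that `policy` takes no instruction argument: any difference between `action_a` and `action_b` on the same observation can only come from the generated parameters, which is the structural point of the design.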

If this is right

  • Outperforms all language-conditioned baselines on LIBERO-90 and Meta-World, with larger margins on complex long-horizon tasks.
  • Surpasses a large-scale pretrained policy model without using any external pretraining data.
  • Delivers substantially higher success on real-world tasks sharing identical visual contexts, confirming that generated parameters rather than visual shortcuts drive the behavior.
  • Learns a semantically structured parameter manifold that supports few-shot adaptation from minimal demonstrations and robust performance under paraphrased instructions.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The learned parameter manifold could support interpolation between tasks to create policies for novel instruction combinations without additional training.
  • The same generation approach might extend to other task specifications such as goal images or sketches.
  • Separating parameter generation from execution could reduce interference in other sequential control settings where multiple objectives must be satisfied from the same observations.

Load-bearing premise

The hypernetwork can produce coherent, high-dimensional policy parameters that correctly solve the instructed task when the generated policy receives only visual observations.

What would settle it

The central claim would be undermined if, on a real-world test where every instruction is paired with the identical visual scene, the policies generated for different instructions produced indistinguishable behaviors despite the change in language.
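
One simple way to operationalize "indistinguishable behaviors" is to compare the actions two generated policies emit from the same starting scene. The sketch below is an assumption-laden illustration, not a protocol from the paper: the function names, the time-aligned comparison, and the tolerance are hypothetical, and the rollouts are tiny made-up examples.

```python
import math

def mean_action_gap(rollout_a, rollout_b):
    """Average Euclidean distance between time-aligned actions of two rollouts."""
    if len(rollout_a) != len(rollout_b) or not rollout_a:
        raise ValueError("rollouts must be non-empty and equally long")
    gaps = [math.dist(a, b) for a, b in zip(rollout_a, rollout_b)]
    return sum(gaps) / len(gaps)

def behaviors_indistinguishable(rollout_a, rollout_b, tolerance):
    """True when the average action gap stays within the chosen tolerance."""
    return mean_action_gap(rollout_a, rollout_b) <= tolerance

# Made-up 2-D action sequences recorded from the same initial scene.
instruction_1 = [[0.0, 0.0], [1.0, 0.0]]
instruction_1_repeat = [[0.0, 0.0], [1.0, 0.0]]
instruction_2 = [[3.0, 4.0], [1.0, 0.0]]

same_gap = mean_action_gap(instruction_1, instruction_1_repeat)
different_gap = mean_action_gap(instruction_1, instruction_2)
print(same_gap, different_gap)
```

A time-aligned comparison is only meaningful over the first few steps, before the trajectories leave the shared initial state; a task-level success metric would still be needed to confirm that the distinct behaviors are the correct ones.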

Figures

Figures reproduced from arXiv: 2605.20856 by Hanxiang Ren, Pei Zhou, Xunzhe Zhou, Yanchao Yang.

Figure 1: Behavioral Evidence of Task-State Entanglement. Current policies trained with entangled architectures fail to ground language instructions faithfully. Left: given the instruction regarding “white bowl,” Octo instead approaches the microwave – executing a behavior associated with a different task that shares similar visual context. Right: given the multi-step instruction “turn on the stove and put the fryin…
Figure 2: The DISC architecture. The task specification, instantiated as a language instruction $l$, is encoded into embedding $e_l$ and processed by the hypernetwork through two stages: (1) the Weight Initialization Network generates initial parameters $\theta_\pi^{(0)}$ (labeled as $\theta^{(0)}$), and (2) the learned iterative refinement module updates parameters over $T$ steps to produce final $\theta^{(T)}$. The refinement module mimics optimiza…
Figure 3: Real-World Combinatorial Benchmark Visualization. (a) Initial State: The setup presents inherent visual ambiguity: the target object (e.g., Red Apple) has multiple potential destinations (indicated by arrows). (b) Language-Conditioned Outcomes: Starting from the identical initial state shown in (a), DISC successfully executes three distinct tasks specified by different language instructions: (b1) placing t…
Figure 4: Few-Shot Adaptation Efficiency on LIBERO-Spatial.
Figure 5: t-SNE visualization of generated policy parameters.
Figure 6: Full Visualization of Real-World Combinatorial Tasks. We evaluate DISC on all 9 combinations of the task matrix. Each row corresponds to a specific target object: Top Row: Red Apple; Middle Row: Green Apple; Bottom Row: Watermelon Slice. Column 1 (Initial State): Shows the cluttered starting configuration. Note that within each row, the visual observation is shared across all three downstream tasks. Column…
Figure 7: Attention Map Visualization for Task-State Entangled Architectures. We visualize attention maps for an Octo-style Transformer policy on two evaluation tasks. Despite specific language instructions (e.g., “Pick up the book”), the attention mechanisms frequently fail to localize the task-relevant objects. Instead, attention is heavily concentrated on the robot’s manipulator arm (a constant visual feature) o…

Original abstract

Language-conditioned manipulation policies typically process instructions and observations through shared network parameters. This task-state entanglement provides a pathway for observation leakage -- networks learn scene-to-action shortcuts that bypass language grounding entirely. DISC eliminates this failure structurally. Rather than conditioning a universal policy on language, DISC uses a hypernetwork to generate the entire parameter set of a task-specific visuomotor policy from the instruction alone. The generated policy never directly accesses language; therefore, its task-awareness must come from the language-generated parameters. Consequently, observation leakage has no pathway to emerge. On the other hand, generating coherent high-dimensional policy weights is itself a challenging problem. We address it with a two-stage hypernetwork whose refinement stage embeds the structure of gradient-based optimization as a feed-forward inductive bias, producing globally consistent parameters without actual gradient computation. Trained entirely from scratch on standard data budgets, DISC outperforms all entangled baselines on LIBERO-90 and Meta-World, with advantages that widen on complex, long-horizon tasks -- and surpasses the large-scale pretrained $\pi_0$ despite using no external pretraining data. On a real-world benchmark where all tasks share identical visual context, DISC substantially outperforms entangled alternatives, directly confirming that language-generated policy parameters, not visual shortcuts, drive behavior. The hypernetwork further learns a semantically structured parameter manifold that enables few-shot adaptation from minimal demonstrations and robust generalization across paraphrased instructions. Our code is available at: https://github.com/ReNginx/DISC.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces DISC, a method for language-conditioned robotic manipulation that decouples instructions from state-conditioned control. Rather than using a shared network that processes both language and observations, DISC employs a hypernetwork to generate the full parameter set of a task-specific visuomotor policy from the instruction alone; the resulting policy receives only observations. A two-stage hypernetwork design is used, with a refinement stage that incorporates an inductive bias mimicking gradient-based optimization to produce coherent high-dimensional parameters. The paper claims this structural change eliminates observation leakage, and reports outperformance over entangled baselines on LIBERO-90, Meta-World, and a real-world benchmark with identical visual contexts across tasks, plus few-shot adaptation via a learned parameter manifold, all without external pretraining.

Significance. If the central claims hold, the work provides a structural alternative to data-driven or regularization-based approaches for avoiding shortcut learning in language-conditioned policies, which could improve reliability in long-horizon manipulation tasks. The real-world experiment with controlled visual context offers direct support for the decoupling hypothesis, and the reported outperformance of a large pretrained model (π0) without pretraining data is a notable empirical result. Reproducible code is provided, which aids verification. The significance hinges on whether the hypernetwork reliably generates globally consistent, task-solving parameters from instructions.

major comments (3)
  1. [§3] §3 (Method, two-stage hypernetwork description): The refinement stage is presented as embedding the structure of gradient-based optimization as a feed-forward inductive bias to generate globally consistent parameters. However, the manuscript does not provide explicit equations or a detailed derivation showing how the feed-forward layers map to optimization steps (e.g., no equivalent to an unrolled gradient update), making it unclear whether this produces coherent high-dimensional policy weights or merely local approximations that could fail to encode task behavior reliably.
  2. [Results section] Results section (LIBERO-90 and Meta-World evaluations): The abstract states outperformance with widening advantages on complex tasks, but the reported results lack error bars, ablation details on the refinement stage, and explicit data exclusion rules. This weakens verification that improvements arise from the structural decoupling rather than implementation specifics, directly affecting the load-bearing claim that language-generated parameters drive behavior.
  3. [Real-world benchmark section] Real-world benchmark section: While the identical-visual-context setup is a strong test for leakage, the quantitative results (e.g., success rates) are summarized without variance measures or statistical tests. This makes it difficult to assess the robustness of the claim that 'language-generated policy parameters, not visual shortcuts, drive behavior.'
minor comments (2)
  1. [Abstract] Abstract: The statement that 'observation leakage has no pathway to emerge' is absolute; consider qualifying it as 'substantially reduces pathways for leakage' given that empirical validation is still required.
  2. [Notation] Notation throughout: Ensure that symbols for generated policy parameters (e.g., θ) and hypernetwork outputs are defined at first use and used consistently across equations and figures.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments, which help clarify and strengthen the presentation of our work. We address each major comment below and have revised the manuscript accordingly to improve clarity, reproducibility, and statistical rigor.

Point-by-point responses
  1. Referee: [§3] §3 (Method, two-stage hypernetwork description): The refinement stage is presented as embedding the structure of gradient-based optimization as a feed-forward inductive bias to generate globally consistent parameters. However, the manuscript does not provide explicit equations or a detailed derivation showing how the feed-forward layers map to optimization steps (e.g., no equivalent to an unrolled gradient update), making it unclear whether this produces coherent high-dimensional policy weights or merely local approximations that could fail to encode task behavior reliably.

    Authors: We agree that additional mathematical detail would strengthen the explanation of the refinement stage. In the revised manuscript, we have expanded §3 with explicit equations and a short derivation showing how the feed-forward refinement layers approximate a single step of gradient-based parameter optimization (including the mapping from layer operations to an implicit update rule). This makes the inductive bias more transparent while preserving the feed-forward nature of the architecture. revision: yes

  2. Referee: [Results section] Results section (LIBERO-90 and Meta-World evaluations): The abstract states outperformance with widening advantages on complex tasks, but the reported results lack error bars, ablation details on the refinement stage, and explicit data exclusion rules. This weakens verification that improvements arise from the structural decoupling rather than implementation specifics, directly affecting the load-bearing claim that language-generated parameters drive behavior.

    Authors: We acknowledge these omissions reduce verifiability. The revised results section now reports mean success rates with standard deviation error bars computed over 5 random seeds for all LIBERO-90 and Meta-World tables and figures. We have added a dedicated ablation table isolating the contribution of the refinement stage. We have also clarified in the experimental protocol that no tasks or episodes were excluded from the standard benchmark splits and that all reported numbers follow the official evaluation protocols without post-hoc filtering. revision: yes

  3. Referee: [Real-world benchmark section] Real-world benchmark section: While the identical-visual-context setup is a strong test for leakage, the quantitative results (e.g., success rates) are summarized without variance measures or statistical tests. This makes it difficult to assess the robustness of the claim that 'language-generated policy parameters, not visual shortcuts, drive behavior.'

    Authors: We agree that variance and significance testing would better support the real-world claims. In the revised real-world section we now include per-task success rates with standard deviations across 10 independent rollouts per task, together with a paired t-test (p < 0.05) comparing DISC against each entangled baseline. These additions directly quantify the robustness of the observed performance gap under identical visual contexts. revision: yes
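
The statistics promised in responses 2 and 3 can be sketched with the Python standard library. Everything below is illustrative: the numbers are made up for the example and are NOT results from the paper, pairing the two methods across 9 tasks is an assumption about how the test would be set up, and the helper names are hypothetical.

```python
import math
import statistics

def mean_and_std(values):
    """Mean and sample standard deviation (n - 1 denominator)."""
    return statistics.mean(values), statistics.stdev(values)

def paired_t_statistic(xs, ys):
    """t statistic for paired samples; degrees of freedom are len(xs) - 1."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equally long samples with at least two pairs")
    diffs = [x - y for x, y in zip(xs, ys)]
    spread = statistics.stdev(diffs)
    if spread == 0:
        raise ValueError("all paired differences are identical; t is undefined")
    return statistics.mean(diffs) / (spread / math.sqrt(len(diffs)))

# Made-up success rates for illustration only; NOT taken from the paper.
seed_success = [0.80, 0.84, 0.78, 0.82, 0.86]  # one method, 5 random seeds
method_a = [0.9, 0.8, 0.7, 0.9, 0.8, 0.6, 0.9, 0.7, 0.8]  # per-task rates, 9 tasks
method_b = [0.5, 0.6, 0.4, 0.7, 0.5, 0.4, 0.6, 0.5, 0.5]

mean, std = mean_and_std(seed_success)
t_value = paired_t_statistic(method_a, method_b)

# Two-sided 5% critical value of Student's t with 8 degrees of freedom.
T_CRITICAL_DF8 = 2.306
significant = abs(t_value) > T_CRITICAL_DF8
print(f"{mean:.3f} +/- {std:.3f}, t = {t_value:.2f}, significant = {significant}")
```

Because per-task success rates from 10 rollouts are coarse proportions, a t-test is only a rough fit; an exact or bootstrap test on the underlying binary outcomes may be a more defensible choice.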

Circularity Check

0 steps flagged

No circularity: architectural decoupling supported by empirical benchmarks

full rationale

The paper's core contribution is an architectural proposal: a hypernetwork generates the full parameter set of a task-specific visuomotor policy from the instruction alone, so the resulting policy receives only observations and never directly accesses language. This structural separation is presented as eliminating observation leakage by design, with the two-stage refinement stage described as embedding gradient-based optimization structure as a feed-forward bias. No equations or derivations are shown that reduce a claimed prediction or first-principles result to the inputs by construction. Performance claims rest on empirical comparisons against entangled baselines on LIBERO-90, Meta-World, and a real-world identical-visual-context benchmark, plus few-shot adaptation results. No load-bearing self-citations, fitted-input predictions, or ansatz smuggling appear in the provided text; the method is self-contained against external benchmarks and does not rely on tautological renaming or uniqueness theorems imported from prior author work.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the unverified capacity of the two-stage hypernetwork to map instructions to usable policy parameters; no free parameters or external benchmarks are specified in the abstract.

axioms (1)
  • domain assumption: A hypernetwork can map language instructions to coherent high-dimensional visuomotor policy parameters.
    Invoked when the paper states that the generated policy must derive its task-awareness solely from the language-encoded weights.
invented entities (1)
  • Two-stage hypernetwork with gradient-optimization-mimicking refinement stage (no independent evidence)
    purpose: To produce globally consistent policy weights from instructions without running actual gradient steps
    Introduced to solve the challenge of generating high-dimensional parameters; no independent evidence outside the method is provided in the abstract.



