arxiv: 2606.07217 · v1 · pith:RTJMVY4Snew · submitted 2026-06-05 · 💻 cs.RO · cs.CV · cs.LG

Robotic Policy Adaptation via Weight-Space Meta-Learning

Pith reviewed 2026-06-27 21:41 UTC · model grok-4.3

classification 💻 cs.RO · cs.CV · cs.LG
keywords robotic manipulation · vision-language-action · meta-learning · LoRA adaptation · policy adaptation · weight-space learning

The pith

WIZARD predicts task-specific LoRA weights for frozen vision-language-action policies from a language instruction and short video in one forward pass.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces WIZARD, a meta-learning method that trains a model to output low-rank adaptation (LoRA) weights for a frozen robotic policy. The input is only a language command plus a brief demonstration video, and the output is the set of LoRA weight updates for that task. Meta-training runs on a collection of seen tasks, each paired with an expert LoRA adapter, so the system learns how task evidence maps to the right adaptation. If that mapping generalizes, new tasks can be handled at deployment time without collecting action labels or running any extra optimization.

Core claim

WIZARD learns during meta-training to map task evidence consisting of language instructions and short videos directly to expert LoRA updates, enabling the prediction of task-specific adaptation weights for a frozen VLA policy in one forward pass for unseen tasks.

What carries the argument

Weight-space meta-learner that predicts LoRA parameters from task evidence.
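
A minimal sketch of what such a meta-learner could look like, assuming a small MLP hypernetwork that maps a task embedding to the two low-rank LoRA factors of a single frozen linear layer. The dimensions, the MLP layout, and the names `init_hypernet`, `predict_lora`, and `adapted_forward` are hypothetical; this summary does not specify WIZARD's actual architecture.

```python
import numpy as np

def init_hypernet(rng, embed_dim, hidden_dim, d_in, d_out, rank):
    """Random parameters for a toy two-layer MLP hypernetwork (hypothetical layout)."""
    n_out = rank * d_in + d_out * rank  # flattened sizes of A and B
    return {
        "W1": rng.normal(0.0, 0.1, (hidden_dim, embed_dim)),
        "b1": np.zeros(hidden_dim),
        "W2": rng.normal(0.0, 0.1, (n_out, hidden_dim)),
        "b2": np.zeros(n_out),
    }

def predict_lora(params, z, d_in, d_out, rank):
    """One forward pass: task embedding z -> LoRA factors A (rank x d_in), B (d_out x rank)."""
    h = np.tanh(params["W1"] @ z + params["b1"])
    flat = params["W2"] @ h + params["b2"]
    A = flat[: rank * d_in].reshape(rank, d_in)
    B = flat[rank * d_in:].reshape(d_out, rank)
    return A, B

def adapted_forward(W_frozen, A, B, x, scale=1.0):
    """Apply the frozen weight plus the low-rank update: (W + scale * B A) x."""
    return W_frozen @ x + scale * (B @ (A @ x))

rng = np.random.default_rng(0)
embed_dim, hidden_dim, d_in, d_out, rank = 16, 32, 8, 6, 2
params = init_hypernet(rng, embed_dim, hidden_dim, d_in, d_out, rank)
z = rng.normal(size=embed_dim)             # stand-in for an instruction+video embedding
W_frozen = rng.normal(size=(d_out, d_in))  # stand-in for one frozen policy layer
A, B = predict_lora(params, z, d_in, d_out, rank)
y = adapted_forward(W_frozen, A, B, rng.normal(size=d_in))
```

The frozen weight is never modified; only the generated factors change from task to task, which is what would make single-forward-pass adaptation possible in a design like this.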

If this is right

  • Performance improves by up to ~2x on unseen dataset collections and up to ~14x on unseen tasks in the LIBERO benchmark.
  • Generated adapters outperform a real-domain adapted baseline when tested on a physical Franka Emika Panda robot.
  • Adaptation occurs without collecting target-task action labels or performing test-time optimization.
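
A relative multiplier is hard to interpret without the baseline it is measured against. The short sketch below makes this concrete; the success rates are hypothetical values chosen for illustration and are not taken from the paper, and `relative_gain` is an illustrative helper name.

```python
import math

def relative_gain(adapted_rate, baseline_rate):
    """Ratio of adapted to baseline success rate (the 'Nx' multiplier)."""
    if not (0.0 < baseline_rate <= 1.0 and 0.0 <= adapted_rate <= 1.0):
        raise ValueError("rates must lie in [0, 1] and the baseline must be > 0")
    return adapted_rate / baseline_rate

# Hypothetical (baseline, adapted) success rates, NOT taken from the paper.
scenarios = {
    "low baseline": (0.02, 0.28),
    "higher baseline": (0.05, 0.70),
}
for name, (base, adapted) in scenarios.items():
    gain = relative_gain(adapted, base)
    print(f"{name}: {base:.0%} -> {adapted:.0%} (gain {gain:.1f}x)")
```

Both hypothetical scenarios give the same multiplier, yet they describe very different policies in absolute terms, which is why the absolute rates matter alongside the ratio.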

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same evidence-to-weight mapping could be tested on other parameter-efficient tuning methods beyond LoRA.
  • If the meta-learner generalizes across robot embodiments, it might reduce the data needed when moving policies between hardware platforms.
  • Task relationships captured in weight space could support rapid switching among many tasks without storing separate adapted policies.

Load-bearing premise

Meta-training on seen tasks produces a mapping from task evidence to expert LoRA updates that generalizes to entirely unseen tasks and real-robot conditions without any further optimization or labels.

What would settle it

Run WIZARD on a set of tasks whose language and visual features have no overlap with the meta-training distribution and measure whether success rates remain higher than the unadapted baseline policy.

Figures

Figures reproduced from arXiv: 2606.07217 by Alessio Sampieri, Andrea Roberti, Christian Bianchi, Fabio Galasso, Luca Franco, Luca Rigazio, Siamak Yousefi.

Figure 1. WIZARD: Weight-space Inference for Zero-shot Adaptation from Robotic Demonstration. (Left) Meta-Training. A repository of task experts is built from LIBERO-Goal, -Object, and -10 datasets. For each task, a LoRA adapter (∆W_i) is trained while a multimodal encoder extracts a task embedding z_i from the instruction and visual demonstration. The meta-network learns to map embeddings to LoRA parameters by rec…
Figure 2. A real-world banana rollout, from approach to grasp.
Figure 3. Real-world setup. 4.4 Analysis and discussion: Beyond zero-shot execution, we test whether generated adapters reduce supervision and accelerate fine-tuning when adaptation is needed. Data efficiency: We compare WIZARD with MT-VLA policies fine-tuned with increasing numbers of task-specific demonstrations on LIBERO-Spatial Task 1. As shown in Fig. 4a, WIZARD reaches 90% success without task-specific gradient…
Figure 4. Adaptation efficiency. (a) WIZARD matches 25-demo MT-VLA. (b) Generated weights warm-start fine-tuning and reach expert performance faster. 5 Conclusions: We introduced WIZARD, a weight-space meta-learning framework that generates LoRA adapters for frozen VLA policies from language and video evidence. WIZARD enables zero-shot adaptation without action labels or test-time optimization, showing strong general…
Figure 5. Structure of the conditioning and latent spaces. The t-SNE visualizations of (left) task embeddings z_i across all LIBERO suites, (middle) embeddings for LIBERO-Spatial tasks, and (right) the latent representation at the final layer of the meta-network. The plots show that task embeddings form distinct task-level clusters and are transformed into a structured representation in weight space. Dataset embedding…
Figure 6. Additional zero-shot qualitative rollouts on LIBERO-Spatial.
Figure 7. Additional zero-shot qualitative rollouts on LIBERO-Spatial. (a) Task 1: pick up the alphabet soup and place it in the basket. (b) Task 2: pick up the cream cheese and place it in the basket. (c) Task 5: pick up the ketchup and place it in t…
Figure 8. Additional zero-shot qualitative rollouts on LIBERO-Object.
Figure 9. Additional zero-shot qualitative rollouts on LIBERO-Goal.
Figure 10. Additional zero-shot qualitative rollouts on LIBERO-10.
Figure 11. Additional zero-shot qualitative rollouts on LIBERO-10.
Figure 12. Additional qualitative rollouts in Real-World.
Figure 13. Handheld input device used for robot teleoperation during data acquisition.
Abstract

Vision-Language-Action (VLA) models are emerging as a promising paradigm for robotic manipulation, enabling general-purpose policies trained from large corpora of demonstrations and action labels. However, adapting these models to new tasks still typically requires task-specific demonstrations, action annotations, and additional fine-tuning, making deployment costly and difficult to scale. We propose WIZARD, a weight-space meta-learning framework that sidesteps task-specific fine-tuning by generating task-specific LoRA parameters for a frozen VLA policy. Given only a language instruction and a short demonstration video, WIZARD predicts the corresponding adaptation weights in a single forward pass, without target-task action labels or test-time optimization. During meta-training, WIZARD learns to map task evidence directly to expert LoRA updates, capturing relationships between tasks in weight space. Experiments on LIBERO show that WIZARD improves performance by up to ~2x on unseen dataset collections and up to ~14x on unseen tasks. On a Franka Emika Panda, WIZARD consistently improves over a real-domain adapted baseline, showing that generated adapters provide task-level specialization beyond simulation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces WIZARD, a weight-space meta-learning approach for Vision-Language-Action (VLA) policies. It trains a meta-learner to map task evidence (language instruction plus short demonstration video) directly to expert LoRA adaptation weights for a frozen base policy. At inference, adaptation occurs in a single forward pass with no target-task action labels and no test-time optimization. Experiments on the LIBERO benchmark are reported to yield up to ~2× gains on unseen dataset collections and ~14× on unseen tasks; real-robot results on a Franka Emika Panda are claimed to outperform a real-domain adapted baseline.

Significance. If the reported generalization holds, the method would materially lower the data and compute cost of deploying VLA policies on new tasks. The core technical idea—learning an explicit mapping from task evidence to weight-space updates rather than performing per-task optimization—is a clear departure from standard fine-tuning or test-time adaptation pipelines and could influence future meta-learning work in robotics.

major comments (3)
  1. [Abstract] Abstract: the central generalization claim (single-forward-pass adaptation to entirely unseen tasks and real-robot conditions) is supported only by aggregate performance multipliers (~2× and ~14×) with no accompanying experimental protocol, baseline definitions, number of trials, statistical tests, or error bars. This leaves the load-bearing claim that the meta-trained mapping extrapolates beyond the training distribution without further evidence.
  2. [Experiments (LIBERO results)] The weakest assumption identified in the stress-test note is not addressed: the paper must demonstrate that the LIBERO 'unseen' task splits are distributionally disjoint from meta-training tasks with respect to object categories, skill primitives, and scene structure; otherwise the reported gains may reflect interpolation rather than the claimed extrapolation in weight space.
  3. [Real-robot experiments] Real-robot section: the claim that generated adapters provide task-level specialization beyond simulation requires explicit comparison of the distribution of visual and proprioceptive evidence between simulation meta-training and real-robot test conditions; without this, the transfer result cannot be isolated from possible domain-gap artifacts.
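
The error bars requested in major comment 1 could be reported, for example, as a Wilson score interval on each per-task success rate. The sketch below is one standard choice, not the paper's protocol; the trial counts in the usage line are hypothetical.

```python
import math

def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial success rate (z=1.96 gives ~95% coverage)."""
    if trials <= 0 or not 0 <= successes <= trials:
        raise ValueError("need trials > 0 and 0 <= successes <= trials")
    p = successes / trials
    denom = 1.0 + z * z / trials
    center = (p + z * z / (2.0 * trials)) / denom
    half = z * math.sqrt(p * (1.0 - p) / trials + z * z / (4.0 * trials * trials)) / denom
    return center - half, center + half

# Hypothetical example: 45 successes in 50 rollouts of one task.
low, high = wilson_interval(45, 50)
print(f"success rate 90%, interval [{low:.3f}, {high:.3f}]")
```

Unlike the normal-approximation interval, the Wilson interval stays within [0, 1] up to floating-point rounding and behaves sensibly for rates near 0 or 1, which is common for per-task robot success rates with few trials.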
minor comments (2)
  1. [Abstract] Abstract: the multipliers '~2×' and '~14×' are presented without stating the absolute success rates or the precise baseline policies against which they are measured; these quantities should appear in the main experimental tables.
  2. [Method] Notation: the mapping f(evidence) → ΔLoRA is described at a high level but the precise input representation (video encoding, language embedding) and output parameterization (which LoRA layers, rank, scaling) are not formalized in an equation; adding a compact definition would improve clarity.
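
One compact way to write the definition requested in minor comment 2 is sketched below. It follows the standard LoRA parameterization; the encoder $E$, the meta-network $f_\theta$, the adapted layer set $\mathcal{L}$, the rank $r$, and the scaling $\alpha$ are notational assumptions and are not taken from the paper.

```latex
\begin{aligned}
z_i &= E(\ell_i, v_i) && \text{task embedding from instruction } \ell_i \text{ and video } v_i \\
\{(A_i^{(l)}, B_i^{(l)})\}_{l \in \mathcal{L}} &= f_\theta(z_i) && \text{predicted LoRA factors for adapted layers } \mathcal{L} \\
\Delta \hat{W}_i^{(l)} &= \tfrac{\alpha}{r}\, B_i^{(l)} A_i^{(l)}, \quad A_i^{(l)} \in \mathbb{R}^{r \times d_{\mathrm{in}}},\; B_i^{(l)} \in \mathbb{R}^{d_{\mathrm{out}} \times r} \\
W_i^{(l)} &= W_0^{(l)} + \Delta \hat{W}_i^{(l)} && \text{frozen weight plus generated update}
\end{aligned}
```

Stating which layers belong to $\mathcal{L}$ and the values of $r$ and $\alpha$ would pin down the output parameterization the referee asks about.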

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback. We address each major comment point by point below, agreeing where additional evidence or clarification is warranted and outlining the corresponding revisions.

point-by-point responses
  1. Referee: [Abstract] Abstract: the central generalization claim (single-forward-pass adaptation to entirely unseen tasks and real-robot conditions) is supported only by aggregate performance multipliers (~2× and ~14×) with no accompanying experimental protocol, baseline definitions, number of trials, statistical tests, or error bars. This leaves the load-bearing claim that the meta-trained mapping extrapolates beyond the training distribution without further evidence.

    Authors: We agree that the abstract, due to length constraints, presents only aggregate multipliers without protocol details. In the revision we will update the abstract to reference the experimental settings (e.g., number of trials per task, multiple random seeds for error bars, and baseline definitions) and add explicit cross-references to Section 4 and the appendix, where full protocols, statistical tests, and per-task results already appear. This strengthens the presentation of the generalization claim without changing its substance. revision: yes

  2. Referee: [Experiments (LIBERO results)] The weakest assumption identified in the stress-test note is not addressed: the paper must demonstrate that the LIBERO 'unseen' task splits are distributionally disjoint from meta-training tasks with respect to object categories, skill primitives, and scene structure; otherwise the reported gains may reflect interpolation rather than the claimed extrapolation in weight space.

    Authors: This is a fair and important observation. The original manuscript relies on the official LIBERO unseen splits without an explicit distributional analysis. We will add this analysis in the revised experiments section, including quantitative comparisons of object category overlap, skill primitive distributions (via available annotations), and scene structure embeddings between meta-training and unseen tasks. The results of this analysis will be reported transparently to support or qualify the extrapolation interpretation. revision: yes

  3. Referee: [Real-robot experiments] Real-robot section: the claim that generated adapters provide task-level specialization beyond simulation requires explicit comparison of the distribution of visual and proprioceptive evidence between simulation meta-training and real-robot test conditions; without this, the transfer result cannot be isolated from possible domain-gap artifacts.

    Authors: We acknowledge that an explicit distributional comparison is needed to isolate task specialization from domain-gap effects. In the revision we will add quantitative and visual comparisons (feature histograms, statistical distances on visual embeddings, and proprioceptive statistics) between the simulation meta-training distribution and the real-robot test conditions. This will be placed in the real-robot experiments section or appendix to substantiate the claim. revision: yes
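
The overlap analysis promised in response 2 could start from something as simple as a per-field Jaccard overlap between the annotations of meta-training and unseen tasks. The sketch below assumes each task carries sets of object and skill labels; the field names, the helper names `jaccard` and `overlap_report`, and the example annotations are all hypothetical.

```python
def jaccard(a, b):
    """Jaccard overlap |a ∩ b| / |a ∪ b|; defined as 0.0 when both sets are empty."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def overlap_report(train_tasks, unseen_tasks, fields=("objects", "skills")):
    """Per-field Jaccard overlap between the pooled annotations of two task lists."""
    report = {}
    for field in fields:
        train_vals = {v for task in train_tasks for v in task[field]}
        unseen_vals = {v for task in unseen_tasks for v in task[field]}
        report[field] = jaccard(train_vals, unseen_vals)
    return report

# Hypothetical annotations, for illustration only.
train_tasks = [
    {"objects": {"bowl", "plate"}, "skills": {"pick", "place"}},
    {"objects": {"drawer"}, "skills": {"open"}},
]
unseen_tasks = [
    {"objects": {"bowl", "mug"}, "skills": {"pick", "place"}},
]
report = overlap_report(train_tasks, unseen_tasks)
```

A high overlap on every field would suggest the unseen tasks recombine familiar elements (interpolation), while low overlap would make the extrapolation reading more plausible. Set overlap ignores visual and scene-level similarity, so it would complement rather than replace the embedding-based comparisons the authors mention.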

Circularity Check

0 steps flagged

No circularity: standard meta-learning setup with independent generalization claims

full rationale

The paper trains a predictor during meta-training to map task evidence (language + video) to expert LoRA updates on seen tasks, then evaluates single-forward-pass inference on held-out unseen tasks and real-robot conditions. No equations, fitted parameters, or self-citations are shown that reduce the reported gains to inputs by construction. This is ordinary supervised meta-learning whose validity rests on empirical splits rather than definitional equivalence.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim depends on the untested premise that meta-training on seen tasks yields a generalizable weight-space mapping; no free parameters or invented physical entities are visible in the abstract.

axioms (1)
  • domain assumption: Meta-training on seen tasks produces a mapping from task evidence to expert LoRA updates that generalizes to unseen tasks without further optimization.
    This premise is required for the single-forward-pass claim to hold on new tasks.
invented entities (1)
  • WIZARD meta-learner (no independent evidence)
    purpose: Predicts task-specific LoRA parameters from language and video evidence.
    New method introduced by the paper; no independent evidence outside the reported experiments.

pith-pipeline@v0.9.1-grok · 5746 in / 1337 out tokens · 20019 ms · 2026-06-27T21:41:47.133078+00:00 · methodology

