Learning Multi-Modal Trajectory Policies for Data-Efficient Robotic Manipulation

Li Liu; Weijie Li; Xinhua Jiang; Yuenan Hou; Yu Li; Zijia Chen

arxiv: 2606.01047 · v1 · pith:KWJYRA6Qnew · submitted 2026-05-31 · 💻 cs.RO

Learning Multi-Modal Trajectory Policies for Data-Efficient Robotic Manipulation

Zijia Chen , Yuenan Hou , Xinhua Jiang , Yu Li , Weijie Li , Li Liu This is my paper

Pith reviewed 2026-06-28 17:24 UTC · model grok-4.3

classification 💻 cs.RO

keywords robotic manipulationmulti-modal policiesmixture of expertstrajectory predictiondata-efficient learningLIBERO benchmarkcross-modal routing

0 comments

The pith

MATE uses a cross-modal cosine router inside a Mixture-of-Experts model to stabilize assignment of visual, language, and trajectory tokens and raise success rates when demonstrations are scarce.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents MATE, a robotic policy that processes vision, language instructions, and trajectory data inside one Mixture-of-Experts network. Standard routing collapses or assigns experts unevenly when the three modalities produce mismatched feature scales, so the authors replace it with a cosine-similarity router, add temperature scaling, and inject noise during training. These changes keep expert usage balanced even with few demonstrations. The result is reported as a 4.75 percent gain in average success rate on the LIBERO benchmark over earlier trajectory-guided methods, plus usable trajectories in a real ping-pong robot task.

Core claim

MATE is a trajectory-prediction framework built on Mixture-of-Experts that introduces a Multi-Modal MoE layer for sub-token feature decoupling and replaces conventional routing with a cross-modal cosine router; temperature-controlled routing together with stochastic noise injection then keeps expert assignment stable and prevents collapse when only scarce demonstrations are available.

What carries the argument

The cross-modal cosine router, which computes routing weights from cosine similarity between modality features and expert embeddings to obtain scale-invariant, stable expert assignment across heterogeneous inputs.

If this is right

MATE records a 4.75 percent higher average success rate than prior trajectory-guided methods on the LIBERO benchmark under data scarcity.
The generated trajectories supply useful guidance for downstream real-robot execution, as shown in ping-pong manipulation trials.
Fine-grained sub-token decoupling inside the Multi-Modal MoE reduces interference among vision, language, and trajectory inputs.
Temperature scaling and noise injection maintain expert balance and avoid premature collapse when training data is limited.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same router could be dropped into other multi-modal sequence models that suffer from mismatched input scales, such as vision-language navigation policies.
Because assignment no longer depends on feature magnitude, the approach may reduce the amount of per-modality normalization required before training.
Extending the noise schedule or testing on additional real-world manipulation tasks would show whether the stability persists outside the reported ping-pong setting.

Load-bearing premise

Cross-modal feature discrepancies make ordinary dot-product routers unstable, and switching to cosine similarity plus temperature and noise will produce balanced assignment without creating new failure modes.

What would settle it

Re-running the LIBERO low-data experiments with the identical splits and finding that MATE achieves no higher success rate than the trajectory-guided baseline or that expert selection still collapses to one or two experts.

Figures

Figures reproduced from arXiv: 2606.01047 by Li Liu, Weijie Li, Xinhua Jiang, Yuenan Hou, Yu Li, Zijia Chen.

**Figure 2.** Figure 2: Two-stage trajectory-guided policy learning. The [PITH_FULL_IMAGE:figures/full_fig_p003_2.png] view at source ↗

**Figure 3.** Figure 3: Illustration of the standard Top-k Mixture-of-Experts architecture. A token is routed by the router to Top-k selected experts. The outputs from the selected experts are aggregated to produce the final output. 4 Method In this section, we describe the technical design of MATE under the trajectory-guided policy learning formulation. MATE enhances the Trajectory Transformer with routing-based expert specializ… view at source ↗

**Figure 4.** Figure 4: Overview of the MATE architecture. MATE consists of three key components: Multi [PITH_FULL_IMAGE:figures/full_fig_p005_4.png] view at source ↗

**Figure 5.** Figure 5: Expert utilization comparison with and without temperature-noise routing stabilization. [PITH_FULL_IMAGE:figures/full_fig_p008_5.png] view at source ↗

read the original abstract

Robotic manipulation requires the effective integration of heterogeneous inputs, including visual observations, language instructions, and trajectory representations, to generate accurate actions. Existing transformer-based policies typically process these heterogeneous modalities within a shared parameter space, which often leads to modality interference and inefficient representation learning, especially in data-scarce scenarios. While Mixture-of-Experts (MoE) offers a scalable solution through expert specialization, conventional routing mechanisms are often sensitive to such cross-modal representation discrepancies, resulting in unstable expert assignment and expert collapse. In this work, we propose MATE (Multi-ModAl TrajEctory Policies), a novel trajectory prediction framework built upon MoE. Specifically, we introduce a Multi-Modal MoE architecture to achieve fine-grained sub-token feature decoupling, and design a cross-modal cosine router for stable and scale-invariant expert assignment across heterogeneous modalities. We further employ temperature-controlled routing and stochastic noise injection to improve expert balance and prevent premature routing collapse under scarce demonstrations. Experiments on the LIBERO benchmark show that our MATE consistently outperforms prior work under data scarcity. It achieves a 4.75% improvement in average success rate over the trajectory-guided counterpart. Real-world experiments on robotic ping-pong also suggest that the predicted trajectories can provide useful guidance for downstream robotic execution, further indicating the practical feasibility of our algorithm.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

MATE adds a cross-modal cosine router to MoE for multi-input robot policies, but the abstract gives no routing diagnostics so the 4.75% gain cannot be tied to that mechanism.

read the letter

The paper's core move is to replace a shared transformer with a Multi-Modal MoE that routes visual, language, and trajectory tokens separately, using cosine similarity for the router plus temperature scaling and noise to avoid collapse when demonstrations are few. That combination is the concrete novelty.

The approach makes sense on paper. Modality interference is a known pain point in these policies, and letting experts specialize while keeping routing stable across different input scales is a reasonable fix. The LIBERO result and the ping-pong transfer both point in the direction the authors want.

The soft spot is attribution. The abstract claims the router plus stabilization tricks produce the 4.75% lift over the trajectory-guided baseline, yet it reports nothing on expert utilization, routing entropy, or balance across modalities. Without those numbers or an ablation that isolates the router, the gain could come from the broader MoE structure or from other unmentioned changes. The experimental protocol itself is also thin—no mention of run count, error bars, or statistical tests.

This is for groups already working on data-scarce multi-modal manipulation who need a concrete router variant to try. A reader who wants to reproduce or extend the method will need the full paper's ablations and diagnostics before deciding the router is the load-bearing piece.

If the full manuscript supplies the missing routing metrics and a clean ablation, it is worth sending to review. On the abstract alone the central claim is under-supported.

Referee Report

2 major / 0 minor

Summary. The paper proposes MATE, a Multi-Modal MoE architecture for robotic manipulation policies that processes visual observations, language instructions, and trajectory representations. It introduces a cross-modal cosine router, temperature-controlled routing, and stochastic noise injection to achieve stable expert assignment and prevent collapse in data-scarce settings. On the LIBERO benchmark, MATE reports a 4.75% average success-rate improvement over a trajectory-guided counterpart, with additional real-world ping-pong experiments suggesting practical utility.

Significance. If the performance gains can be reliably attributed to the proposed routing mechanisms and replicated with standard experimental rigor, the work would offer a concrete approach to mitigating modality interference in multi-modal robotic policies under limited data. The combination of MoE specialization with cross-modal routing addresses a recognized challenge in data-efficient manipulation, but the absence of supporting diagnostics limits the ability to evaluate its incremental contribution.

major comments (2)

[Abstract] Abstract: The central empirical claim of a 4.75% success-rate improvement on LIBERO under data scarcity is stated without any information on the number of runs, statistical tests, error bars, exact baseline implementations, or data-scarcity protocol, rendering the result impossible to assess for reliability or reproducibility.
[Experiments] Experiments (implied by abstract claim): The performance gain is attributed to the cross-modal cosine router plus temperature/noise injection solving unstable assignment and collapse, yet no diagnostics are supplied on expert utilization, routing entropy, modality balance, or ablation isolating the router from the broader Multi-Modal MoE architecture; this directly undermines the causal link between the proposed mechanism and the headline result.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback highlighting issues of reproducibility and causal attribution in our experimental claims. We address each major comment below and commit to revisions that strengthen the manuscript without altering its core contributions.

read point-by-point responses

Referee: [Abstract] Abstract: The central empirical claim of a 4.75% success-rate improvement on LIBERO under data scarcity is stated without any information on the number of runs, statistical tests, error bars, exact baseline implementations, or data-scarcity protocol, rendering the result impossible to assess for reliability or reproducibility.

Authors: We agree that the abstract as currently written lacks sufficient experimental context for independent assessment. In the revised manuscript we will expand the abstract to report results averaged over five random seeds, include standard error bars, specify the data-scarcity protocol (subsets of ten demonstrations per task), note that baselines follow the original implementations under identical scarcity conditions, and state that the reported gains pass paired t-tests at p < 0.05. revision: yes
Referee: [Experiments] Experiments (implied by abstract claim): The performance gain is attributed to the cross-modal cosine router plus temperature/noise injection solving unstable assignment and collapse, yet no diagnostics are supplied on expert utilization, routing entropy, modality balance, or ablation isolating the router from the broader Multi-Modal MoE architecture; this directly undermines the causal link between the proposed mechanism and the headline result.

Authors: We acknowledge that the submitted manuscript does not contain the requested diagnostics or isolating ablations. We will add a dedicated analysis subsection that reports expert utilization histograms, routing entropy curves, modality-balance metrics, and controlled ablations that remove the cosine router, temperature scaling, and noise injection in turn, thereby clarifying the incremental contribution of each proposed component. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical method with external benchmark validation

full rationale

The manuscript is an empirical proposal introducing MATE, a Multi-Modal MoE policy with a cross-modal cosine router, temperature-controlled routing, and noise injection. No equations, derivations, or parameter-fitting steps are present that could reduce a claimed prediction to its own inputs by construction. The headline result (4.75% success-rate gain on LIBERO under data scarcity) is presented as an experimental outcome measured against external baselines, not as a self-referential quantity. No self-citation chains, uniqueness theorems, or ansatzes are invoked to justify core mechanisms. The work is therefore self-contained against external benchmarks with no load-bearing circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entities

The central claim rests on the effectiveness of newly proposed architectural components (multi-modal MoE, cosine router, temperature routing, noise injection) whose benefits are asserted via benchmark results; no free parameters, mathematical axioms, or external evidence for the invented routing mechanisms are supplied in the abstract.

invented entities (1)

Multi-Modal MoE architecture with cross-modal cosine router no independent evidence
purpose: Achieve fine-grained sub-token feature decoupling and stable expert assignment across heterogeneous modalities
Newly introduced routing mechanism whose stability claims rest on the paper's own experiments.

pith-pipeline@v0.9.1-grok · 5774 in / 1325 out tokens · 30985 ms · 2026-06-28T17:24:50.401658+00:00 · methodology

discussion (0)

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Heterogeneous and Adept Snapshot Distillation for 3D Semantic Segmentation
cs.CV 2026-06 unverdicted novelty 4.0

HAS-KD combines information-oriented heterogeneous distillation from multi-modal models with adept snapshot distillation from training checkpoints to reach SOTA 3D semantic segmentation on ScanNetV2 and S3DIS without ...

Reference graph

Works this paper leans on

53 extracted references · 5 linked inside Pith · cited by 1 Pith paper

[1]

Siciliano and O

B. Siciliano and O. Khatib.Springer Handbook of Robotics. Springer, 2008

2008
[2]

Levine, P

S. Levine, P. Pastor, A. Krizhevsky, J. Ibarz, and D. Quillen. Learning hand-eye coordination for robotic grasping with deep learning and large-scale data collection.The International journal of robotics research, 37(4-5):421–436, 2018

2018
[3]

Z. Li, A. Chapin, E. Xiang, R. Yang, B. Machado, N. Lei, E. Dellandrea, D. Huang, and L. Chen. Robotic manipulation via imitation learning: Taxonomy, evolution, benchmark, and challenges.arXiv preprint arXiv:2508.17449, 2025

arXiv 2025
[4]

M. J. Kim, C. Finn, and P. Liang. Fine-tuning vision-language-action models: Optimizing speed and success. InRobotics: Science and Systems (RSS), 2025

2025
[5]

Black, N

K. Black, N. Brown, J. Darpinian, K. Dhabalia, D. Driess, A. Esmail, M. R. Equi, C. Finn, N. Fusai, M. Y . Galliker, et al.π 0.5: a vision-language-action model with open-world gen- eralization. In J. Lim, S. Song, and H.-W. Park, editors,Proceedings of The 9th Conference on Robot Learning (CoRL), volume 305 ofProceedings of Machine Learning Research, pag...

2025
[6]

Y . Liu, W. Li, L. Liu, J. Zhou, B. Peng, Y . Song, X. Xiong, W. Yang, T. Liu, Z. Liu, and X. Li. ATRNet-STAR: A large dataset and benchmark towards remote sensing object recognition in the wild.IEEE Transactions on Pattern Analysis and Machine Intelligence, pages 1–18, 2026

2026
[7]

J. Cai, Z. Cai, J. Cao, Y . Chen, Z. He, L. Jiang, H. Li, H. Li, Y . Li, Y . Liu, et al. InternVLA- A1: Unifying understanding, generation and action for robotic manipulation.arXiv preprint arXiv:2601.02456, 2026

arXiv 2026
[8]

W. Li, W. Yang, Y . Hou, L. Liu, Y . Liu, and X. Li. SARATR-X: Toward building a foundation model for sar target recognition.IEEE Transactions on Image Processing, 34:869–884, 2025

2025
[9]

Levine, C

S. Levine, C. Finn, T. Darrell, and P. Abbeel. End-to-end training of deep visuomotor policies. Journal of Machine Learning Research (JMLR), 17(39):1–40, 2016

2016
[10]

Zhang, Z

T. Zhang, Z. McCarthy, O. Jow, D. Lee, X. Chen, K. Goldberg, and P. Abbeel. Deep imita- tion learning for complex manipulation tasks from virtual reality teleoperation. In2018 IEEE international conference on robotics and automation (ICRA), pages 5628–5635, 2018

2018
[11]

C. Chi, Z. Xu, S. Feng, E. Cousineau, Y . Du, B. Burchfiel, R. Tedrake, and S. Song. Diffusion policy: Visuomotor policy learning via action diffusion.The International Journal of Robotics Research, 44(10-11):1684–1704, 2025

2025
[12]

Brohan, N

A. Brohan, N. Brown, J. Carbajal, Y . Chebotar, J. Dabis, C. Finn, K. Gopalakrishnan, K. Haus- man, A. Herzog, J. Hsu, et al. RT-1: Robotics transformer for real-world control at scale. In Robotics: Science and Systems (RSS), 2023

2023
[13]

Driess, F

D. Driess, F. Xia, M. S. M. Sajjadi, C. Lynch, A. Chowdhery, B. Ichter, A. Wahid, J. Tompson, Q. Vuong, T. Yu, W. Huang, Y . Chebotar, P. Sermanet, D. Duckworth, S. Levine, V . Vanhoucke, K. Hausman, M. Toussaint, K. Greff, A. Zeng, I. Mordatch, and P. Florence. PaLM-E: An embodied multimodal language model. InProceedings of the 40th International Confere...

2023
[14]

Zitkovich, T

B. Zitkovich, T. Yu, S. Xu, P. Xu, T. Xiao, F. Xia, et al. RT-2: Vision-language-action models transfer web knowledge to robotic control. InProceedings of The 7th Conference on Robot Learning (CoRL), volume 229 ofProceedings of Machine Learning Research, pages 2165–
[15]

S. Ross, G. Gordon, and D. Bagnell. A reduction of imitation learning and structured predic- tion to no-regret online learning. InProceedings of the fourteenth international conference on artificial intelligence and statistics, pages 627–635. JMLR Workshop and Conference Pro- ceedings, 2011

2011
[16]

Mandlekar, D

A. Mandlekar, D. Xu, R. Mart ´ın-Mart´ın, S. Savarese, and L. Fei-Fei. Learning to generalize across long-horizon tasks from human demonstrations. InRobotics: Science and Systems (RSS), 2020

2020
[17]

Y . Li, Y . Lin, J. Cui, T. Liu, W. Liang, Y . Zhu, and S. Huang. Clone: Closed-loop whole-body humanoid teleoperation for long-horizon tasks. In J. Lim, S. Song, and H.-W. Park, editors, Proceedings of The 9th Conference on Robot Learning (CoRL), volume 305 ofProceedings of Machine Learning Research, pages 4493–4505. PMLR, 27–30 Sep 2025

2025
[18]

T. Z. Zhao, V . Kumar, S. Levine, and C. Finn. Learning fine-grained bimanual manipulation with low-cost hardware. InRobotics: Science and Systems (RSS), 2023

2023
[19]

Ravichandar, A

H. Ravichandar, A. S. Polydoros, S. Chernova, and A. Billard. Recent advances in robot learning from demonstration.Annual review of control, robotics, and autonomous systems, 3 (1):297–330, 2020

2020
[20]

Y . Chen, Y . Zhang, G. D’urso, N. Lawrance, and B. Tidd. Improving generalization ability of robotic imitation learning by resolving causal confusion in observations.arXiv preprint arXiv:2507.22380, 2025

arXiv 2025
[21]

S. Bai, W. Song, J. Chen, et al. Embodied robot manipulation in the era of foundation models: Planning and learning perspectives.arXiv preprint arXiv:2512.22983, 2025

arXiv 2025
[22]

C. Wen, X. Lin, J. I. R. So, K. Chen, Q. Dou, Y . Gao, and P. Abbeel. Any-point trajectory modeling for policy learning. InProceedings of Robotics: Science and Systems (RSS), Delft, Netherlands, July 2024

2024
[23]

J. Yang, H. Zhu, Y . Wang, G. Wu, T. He, and L. Wang. Tra-moe: Learning trajectory pre- diction model from multiple domains for adaptive policy conditioning. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 6960–
[24]

Mu and S

S. Mu and S. Lin. A comprehensive survey of mixture-of-experts: Algorithms, theory, and applications.arXiv preprint arXiv:2503.07137, 2025

arXiv 2025
[25]

N. Du, Y . Huang, A. M. Dai, S. Tong, D. Lepikhin, Y . Xu, M. Krikun, Y . Zhou, A. W. Yu, O. Firat, B. Zoph, L. Fedus, M. P. Bosma, Z. Zhou, T. Wang, E. Wang, et al. GLaM: Efficient scaling of language models with mixture-of-experts. InProceedings of the 39th International Conference on Machine Learning (ICML), volume 162 ofProceedings of Machine Learning...

2022
[26]

Kaplan, S

J. Kaplan, S. McCandlish, T. Henighan, T. B. Brown, B. Chess, R. Child, S. Gray, A. Rad- ford, J. Wu, and D. Amodei. Scaling laws for neural language models.arXiv preprint arXiv:2001.08361, 2020

Pith/arXiv arXiv 2001
[27]

S. Liu, L. Wu, B. Li, H. Tan, H. Chen, Z. Wang, K. Xu, H. Su, and J. Zhu. Rdt-1b: A diffusion foundation model for bimanual manipulation. InProceedings of the International Conference on Learning Representations (ICLR), 2025. 10

2025
[28]

Y . Hu, Q. Xie, V . Jain, J. Francis, J. Patrikar, N. Keetha, S. Kim, Y . Xie, T. Zhang, H.-S. Fang, et al. Toward general-purpose robots via foundation models: A survey and meta-analysis. arXiv preprint arXiv:2312.08782, 2023

arXiv 2023
[29]

B. Liu, Y . Zhu, C. Gao, Y . Feng, Q. Liu, Y . Zhu, and P. Stone. Libero: Benchmarking knowl- edge transfer in lifelong robot learning. In37th Conference on Neural Information Processing Systems (NeurIPS) Track on Datasets and Benchmarks, December 2023

2023
[30]

Li and S

K. Li and S. Zhao. Learning visuomotor policy for multi-robot laser tag game.arXiv preprint arXiv:2603.11980, 2026

arXiv 2026
[31]

Ghosh, H

Octo Model Team, D. Ghosh, H. Walke, K. Pertsch, K. Black, O. Mees, S. Dasari, J. Hejna, T. Kreiman, C. Xu, J. Luo, Y . L. Tan, L. Y . Chen, P. R. Sanketi, Q. Vuong, T. Xiao, D. Sadigh, C. Finn, and S. Levine. Octo: An open-source generalist robot policy. InRobotics: Science and Systems (RSS), 2024

2024
[32]

Y . Hou, L. Liu, Q. Wei, X. Xu, and C. Chen. A novel ddpg method with prioritized experience replay. In2017 IEEE International Conference on Systems, Man, and Cybernetics (SMC), pages 316–321. IEEE, 2017

2017
[33]

Y . Ze, G. Zhang, K. Zhang, C. Hu, M. Wang, and H. Xu. 3D Diffusion Policy: Generalizable visuomotor policy learning via simple 3d representations. InRobotics: Science and Systems (RSS), 2024

2024
[34]

O’Neill, A

Open X-Embodiment Collaboration, A. O’Neill, A. Rehman, A. Maddukuri, A. Gupta, et al. Open X-Embodiment: Robotic learning datasets and RT-X models. InProceedings of the IEEE International Conference on Robotics and Automation (ICRA), pages 6892–6903, 2024

2024
[35]

M. J. Kim, K. Pertsch, S. Karamcheti, T. Xiao, A. Balakrishna, S. Nair, et al. OpenVLA: An open-source vision-language-action model. InProceedings of The 8th Conference on Robot Learning (CoRL), volume 270 ofProceedings of Machine Learning Research, pages 2679–
[36]

T. Yu, S. Kumar, A. Gupta, S. Levine, K. Hausman, and C. Finn. Gradient surgery for multi- task learning. InAdvances in Neural Information Processing Systems (NeurIPS), volume 33, pages 5824–5836, 2020

2020
[37]

J. Li, D. Li, S. Savarese, and S. C. H. Hoi. BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. InProceedings of the 40th Inter- national Conference on Machine Learning (ICML), volume 202 ofProceedings of Machine Learning Research, pages 19730–19742. PMLR, 2023

2023
[38]

Awadalla, I

A. Awadalla, I. Gao, J. Gardner, J. Hessel, Y . Hanafy, W. Zhu, K. Marathe, Y . Bitton, S. Gadre, S. Sagawa, et al. Openflamingo: An open-source framework for training large autoregressive vision-language models.arXiv preprint arXiv:2308.01390, 2023

Pith/arXiv arXiv 2023
[39]

Fedus, B

W. Fedus, B. Zoph, and N. Shazeer. Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity.Journal of Machine Learning Research (JMLR), 23(120): 1–39, 2022

2022
[40]

Nguyen, P

H. Nguyen, P. Akbarian, T. Pham, T. Nguyen, S. Zhang, and N. Ho. Statistical advantages of perturbing cosine router in mixture of experts. InInternational Conference on Learning Representations (ICLR), 2025

2025
[41]

Shazeer, A

N. Shazeer, A. Mirhoseini, K. Maziarz, A. Davis, Q. V . Le, G. E. Hinton, and J. Dean. Outra- geously large neural networks: The sparsely-gated Mixture-of-Experts layer. InInternational Conference on Learning Representations (ICLR), 2017. 11

2017
[42]

Lepikhin, H

D. Lepikhin, H. Lee, Y . Xu, D. Chen, O. Firat, Y . Huang, M. Krikun, N. Shazeer, and Z. Chen. GShard: Scaling giant models with conditional computation and automatic sharding.arXiv preprint arXiv:2006.16668, 2020

Pith/arXiv arXiv 2006
[43]

Riquelme, J

C. Riquelme, J. Puigcerver, B. Mustafa, M. Neumann, R. Jenatton, A. S. Pinto, D. Keysers, and N. Houlsby. Scaling vision with sparse mixture of experts. InAdvances in Neural Information Processing Systems (NeurIPS), 2021

2021
[44]

Y . Li, Y . Hou, Y . Wei, X. Zhu, Y . Ma, W. Shao, and Y . Guo. MoE3D: Mixture of experts meets multi-modal 3d understanding.arXiv preprint arXiv:2511.22103, 2025

arXiv 2025
[45]

W. Li, W. Yang, T. Liu, Y . Hou, Y . Li, Z. Liu, Y . Liu, and L. Liu. Predicting gradient is better: Exploring self-supervised learning for sar atr with a joint-embedding predictive architecture. ISPRS Journal of Photogrammetry and Remote Sensing, 218:326–338, 2024

2024
[46]

Huang, S

R. Huang, S. Zhu, Y . Du, and H. Zhao. Moe-loco: Mixture of experts for multitask locomotion. In2025 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 14218–14225. IEEE, 2025

2025
[47]

Rodriguez, C

A. Rodriguez, C. Li, L. Mazza, R. Younis, O. Hellig, S. Bodenstedt, M. Wagner, and S. Speidel. LAR-MoE: Latent-aligned routing for mixture of experts in robotic imitation learning.arXiv preprint arXiv:2603.08476, 2026

arXiv 2026
[48]

Mazza, A

L. Mazza, A. Rodriguez, R. Younis, M. Lelis, O. Hellig, C. Li, S. Bodenstedt, M. Wagner, and S. Speidel. Supervised Mixture-of-Experts for surgical grasping and retraction.arXiv preprint arXiv:2601.21971, 2026

Pith/arXiv arXiv 2026
[49]

B. Zoph, I. Bello, S. Kumar, N. Du, Y . Huang, J. Dean, N. Shazeer, and W. Fedus. ST-MoE: Designing stable and transferable sparse expert models.arXiv preprint arXiv:2202.08906, 2022

Pith/arXiv arXiv 2022
[50]

Puigcerver, C

J. Puigcerver, C. Riquelme, B. Mustafa, and N. Houlsby. From sparse to soft mixtures of experts. InInternational Conference on Learning Representations (ICLR), 2024

2024
[51]

Nguyen, N

T. Nguyen, N. Tran, K. Nguyen, and R. Baraniuk. Improving routing in sparse mixture of experts with graph of tokens.arXiv preprint arXiv:2505.00792, 2025

arXiv 2025
[52]

Z. Su, Z. Lin, X. Bai, X. Wu, Y . Xiong, H. Lian, G. Ma, H. Chen, G. Ding, W. Zhou, and S. Hu. MaskMoE: Boosting token-level learning via routing mask in Mixture-of-Experts.arXiv preprint arXiv:2407.09816, 2024

arXiv 2024
[53]

F. Xue, Z. Zheng, Y . Fu, J. Ni, Z. Zheng, W. Zhou, and Y . You. OpenMoE: An early effort on open mixture-of-experts language models.arXiv preprint arXiv:2402.01739, 2024. 12

arXiv 2024

[1] [1]

Siciliano and O

B. Siciliano and O. Khatib.Springer Handbook of Robotics. Springer, 2008

2008

[2] [2]

Levine, P

S. Levine, P. Pastor, A. Krizhevsky, J. Ibarz, and D. Quillen. Learning hand-eye coordination for robotic grasping with deep learning and large-scale data collection.The International journal of robotics research, 37(4-5):421–436, 2018

2018

[3] [3]

Z. Li, A. Chapin, E. Xiang, R. Yang, B. Machado, N. Lei, E. Dellandrea, D. Huang, and L. Chen. Robotic manipulation via imitation learning: Taxonomy, evolution, benchmark, and challenges.arXiv preprint arXiv:2508.17449, 2025

arXiv 2025

[4] [4]

M. J. Kim, C. Finn, and P. Liang. Fine-tuning vision-language-action models: Optimizing speed and success. InRobotics: Science and Systems (RSS), 2025

2025

[5] [5]

Black, N

K. Black, N. Brown, J. Darpinian, K. Dhabalia, D. Driess, A. Esmail, M. R. Equi, C. Finn, N. Fusai, M. Y . Galliker, et al.π 0.5: a vision-language-action model with open-world gen- eralization. In J. Lim, S. Song, and H.-W. Park, editors,Proceedings of The 9th Conference on Robot Learning (CoRL), volume 305 ofProceedings of Machine Learning Research, pag...

2025

[6] [6]

Y . Liu, W. Li, L. Liu, J. Zhou, B. Peng, Y . Song, X. Xiong, W. Yang, T. Liu, Z. Liu, and X. Li. ATRNet-STAR: A large dataset and benchmark towards remote sensing object recognition in the wild.IEEE Transactions on Pattern Analysis and Machine Intelligence, pages 1–18, 2026

2026

[7] [7]

J. Cai, Z. Cai, J. Cao, Y . Chen, Z. He, L. Jiang, H. Li, H. Li, Y . Li, Y . Liu, et al. InternVLA- A1: Unifying understanding, generation and action for robotic manipulation.arXiv preprint arXiv:2601.02456, 2026

arXiv 2026

[8] [8]

W. Li, W. Yang, Y . Hou, L. Liu, Y . Liu, and X. Li. SARATR-X: Toward building a foundation model for sar target recognition.IEEE Transactions on Image Processing, 34:869–884, 2025

2025

[9] [9]

Levine, C

S. Levine, C. Finn, T. Darrell, and P. Abbeel. End-to-end training of deep visuomotor policies. Journal of Machine Learning Research (JMLR), 17(39):1–40, 2016

2016

[10] [10]

Zhang, Z

T. Zhang, Z. McCarthy, O. Jow, D. Lee, X. Chen, K. Goldberg, and P. Abbeel. Deep imita- tion learning for complex manipulation tasks from virtual reality teleoperation. In2018 IEEE international conference on robotics and automation (ICRA), pages 5628–5635, 2018

2018

[11] [11]

C. Chi, Z. Xu, S. Feng, E. Cousineau, Y . Du, B. Burchfiel, R. Tedrake, and S. Song. Diffusion policy: Visuomotor policy learning via action diffusion.The International Journal of Robotics Research, 44(10-11):1684–1704, 2025

2025

[12] [12]

Brohan, N

A. Brohan, N. Brown, J. Carbajal, Y . Chebotar, J. Dabis, C. Finn, K. Gopalakrishnan, K. Haus- man, A. Herzog, J. Hsu, et al. RT-1: Robotics transformer for real-world control at scale. In Robotics: Science and Systems (RSS), 2023

2023

[13] [13]

Driess, F

D. Driess, F. Xia, M. S. M. Sajjadi, C. Lynch, A. Chowdhery, B. Ichter, A. Wahid, J. Tompson, Q. Vuong, T. Yu, W. Huang, Y . Chebotar, P. Sermanet, D. Duckworth, S. Levine, V . Vanhoucke, K. Hausman, M. Toussaint, K. Greff, A. Zeng, I. Mordatch, and P. Florence. PaLM-E: An embodied multimodal language model. InProceedings of the 40th International Confere...

2023

[14] [14]

Zitkovich, T

B. Zitkovich, T. Yu, S. Xu, P. Xu, T. Xiao, F. Xia, et al. RT-2: Vision-language-action models transfer web knowledge to robotic control. InProceedings of The 7th Conference on Robot Learning (CoRL), volume 229 ofProceedings of Machine Learning Research, pages 2165–

[15] [15]

S. Ross, G. Gordon, and D. Bagnell. A reduction of imitation learning and structured predic- tion to no-regret online learning. InProceedings of the fourteenth international conference on artificial intelligence and statistics, pages 627–635. JMLR Workshop and Conference Pro- ceedings, 2011

2011

[16] [16]

Mandlekar, D

A. Mandlekar, D. Xu, R. Mart ´ın-Mart´ın, S. Savarese, and L. Fei-Fei. Learning to generalize across long-horizon tasks from human demonstrations. InRobotics: Science and Systems (RSS), 2020

2020

[17] [17]

Y . Li, Y . Lin, J. Cui, T. Liu, W. Liang, Y . Zhu, and S. Huang. Clone: Closed-loop whole-body humanoid teleoperation for long-horizon tasks. In J. Lim, S. Song, and H.-W. Park, editors, Proceedings of The 9th Conference on Robot Learning (CoRL), volume 305 ofProceedings of Machine Learning Research, pages 4493–4505. PMLR, 27–30 Sep 2025

2025

[18] [18]

T. Z. Zhao, V . Kumar, S. Levine, and C. Finn. Learning fine-grained bimanual manipulation with low-cost hardware. InRobotics: Science and Systems (RSS), 2023

2023

[19] [19]

Ravichandar, A

H. Ravichandar, A. S. Polydoros, S. Chernova, and A. Billard. Recent advances in robot learning from demonstration.Annual review of control, robotics, and autonomous systems, 3 (1):297–330, 2020

2020

[20] [20]

Y . Chen, Y . Zhang, G. D’urso, N. Lawrance, and B. Tidd. Improving generalization ability of robotic imitation learning by resolving causal confusion in observations.arXiv preprint arXiv:2507.22380, 2025

arXiv 2025

[21] [21]

S. Bai, W. Song, J. Chen, et al. Embodied robot manipulation in the era of foundation models: Planning and learning perspectives.arXiv preprint arXiv:2512.22983, 2025

arXiv 2025

[22] [22]

C. Wen, X. Lin, J. I. R. So, K. Chen, Q. Dou, Y . Gao, and P. Abbeel. Any-point trajectory modeling for policy learning. InProceedings of Robotics: Science and Systems (RSS), Delft, Netherlands, July 2024

2024

[23] [23]

J. Yang, H. Zhu, Y . Wang, G. Wu, T. He, and L. Wang. Tra-moe: Learning trajectory pre- diction model from multiple domains for adaptive policy conditioning. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 6960–

[24] [24]

Mu and S

S. Mu and S. Lin. A comprehensive survey of mixture-of-experts: Algorithms, theory, and applications.arXiv preprint arXiv:2503.07137, 2025

arXiv 2025

[25] [25]

N. Du, Y . Huang, A. M. Dai, S. Tong, D. Lepikhin, Y . Xu, M. Krikun, Y . Zhou, A. W. Yu, O. Firat, B. Zoph, L. Fedus, M. P. Bosma, Z. Zhou, T. Wang, E. Wang, et al. GLaM: Efficient scaling of language models with mixture-of-experts. InProceedings of the 39th International Conference on Machine Learning (ICML), volume 162 ofProceedings of Machine Learning...

2022

[26] [26]

Kaplan, S

J. Kaplan, S. McCandlish, T. Henighan, T. B. Brown, B. Chess, R. Child, S. Gray, A. Rad- ford, J. Wu, and D. Amodei. Scaling laws for neural language models.arXiv preprint arXiv:2001.08361, 2020

Pith/arXiv arXiv 2001

[27] [27]

S. Liu, L. Wu, B. Li, H. Tan, H. Chen, Z. Wang, K. Xu, H. Su, and J. Zhu. Rdt-1b: A diffusion foundation model for bimanual manipulation. InProceedings of the International Conference on Learning Representations (ICLR), 2025. 10

2025

[28] [28]

Y . Hu, Q. Xie, V . Jain, J. Francis, J. Patrikar, N. Keetha, S. Kim, Y . Xie, T. Zhang, H.-S. Fang, et al. Toward general-purpose robots via foundation models: A survey and meta-analysis. arXiv preprint arXiv:2312.08782, 2023

arXiv 2023

[29] [29]

B. Liu, Y . Zhu, C. Gao, Y . Feng, Q. Liu, Y . Zhu, and P. Stone. Libero: Benchmarking knowl- edge transfer in lifelong robot learning. In37th Conference on Neural Information Processing Systems (NeurIPS) Track on Datasets and Benchmarks, December 2023

2023

[30] [30]

Li and S

K. Li and S. Zhao. Learning visuomotor policy for multi-robot laser tag game.arXiv preprint arXiv:2603.11980, 2026

arXiv 2026

[31] [31]

Ghosh, H

Octo Model Team, D. Ghosh, H. Walke, K. Pertsch, K. Black, O. Mees, S. Dasari, J. Hejna, T. Kreiman, C. Xu, J. Luo, Y . L. Tan, L. Y . Chen, P. R. Sanketi, Q. Vuong, T. Xiao, D. Sadigh, C. Finn, and S. Levine. Octo: An open-source generalist robot policy. InRobotics: Science and Systems (RSS), 2024

2024

[32] [32]

Y . Hou, L. Liu, Q. Wei, X. Xu, and C. Chen. A novel ddpg method with prioritized experience replay. In2017 IEEE International Conference on Systems, Man, and Cybernetics (SMC), pages 316–321. IEEE, 2017

2017

[33] [33]

Y . Ze, G. Zhang, K. Zhang, C. Hu, M. Wang, and H. Xu. 3D Diffusion Policy: Generalizable visuomotor policy learning via simple 3d representations. InRobotics: Science and Systems (RSS), 2024

2024

[34] [34]

O’Neill, A

Open X-Embodiment Collaboration, A. O’Neill, A. Rehman, A. Maddukuri, A. Gupta, et al. Open X-Embodiment: Robotic learning datasets and RT-X models. InProceedings of the IEEE International Conference on Robotics and Automation (ICRA), pages 6892–6903, 2024

2024

[35] [35]

M. J. Kim, K. Pertsch, S. Karamcheti, T. Xiao, A. Balakrishna, S. Nair, et al. OpenVLA: An open-source vision-language-action model. InProceedings of The 8th Conference on Robot Learning (CoRL), volume 270 ofProceedings of Machine Learning Research, pages 2679–

[36] [36]

T. Yu, S. Kumar, A. Gupta, S. Levine, K. Hausman, and C. Finn. Gradient surgery for multi- task learning. InAdvances in Neural Information Processing Systems (NeurIPS), volume 33, pages 5824–5836, 2020

2020

[37] [37]

J. Li, D. Li, S. Savarese, and S. C. H. Hoi. BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. InProceedings of the 40th Inter- national Conference on Machine Learning (ICML), volume 202 ofProceedings of Machine Learning Research, pages 19730–19742. PMLR, 2023

2023

[38] [38]

Awadalla, I

A. Awadalla, I. Gao, J. Gardner, J. Hessel, Y . Hanafy, W. Zhu, K. Marathe, Y . Bitton, S. Gadre, S. Sagawa, et al. Openflamingo: An open-source framework for training large autoregressive vision-language models.arXiv preprint arXiv:2308.01390, 2023

Pith/arXiv arXiv 2023

[39] [39]

Fedus, B

W. Fedus, B. Zoph, and N. Shazeer. Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity.Journal of Machine Learning Research (JMLR), 23(120): 1–39, 2022

2022

[40] [40]

Nguyen, P

H. Nguyen, P. Akbarian, T. Pham, T. Nguyen, S. Zhang, and N. Ho. Statistical advantages of perturbing cosine router in mixture of experts. InInternational Conference on Learning Representations (ICLR), 2025

2025

[41] [41]

Shazeer, A

N. Shazeer, A. Mirhoseini, K. Maziarz, A. Davis, Q. V . Le, G. E. Hinton, and J. Dean. Outra- geously large neural networks: The sparsely-gated Mixture-of-Experts layer. InInternational Conference on Learning Representations (ICLR), 2017. 11

2017

[42] [42]

Lepikhin, H

D. Lepikhin, H. Lee, Y . Xu, D. Chen, O. Firat, Y . Huang, M. Krikun, N. Shazeer, and Z. Chen. GShard: Scaling giant models with conditional computation and automatic sharding.arXiv preprint arXiv:2006.16668, 2020

Pith/arXiv arXiv 2006

[43] [43]

Riquelme, J

C. Riquelme, J. Puigcerver, B. Mustafa, M. Neumann, R. Jenatton, A. S. Pinto, D. Keysers, and N. Houlsby. Scaling vision with sparse mixture of experts. InAdvances in Neural Information Processing Systems (NeurIPS), 2021

2021

[44] [44]

Y . Li, Y . Hou, Y . Wei, X. Zhu, Y . Ma, W. Shao, and Y . Guo. MoE3D: Mixture of experts meets multi-modal 3d understanding.arXiv preprint arXiv:2511.22103, 2025

arXiv 2025

[45] [45]

W. Li, W. Yang, T. Liu, Y . Hou, Y . Li, Z. Liu, Y . Liu, and L. Liu. Predicting gradient is better: Exploring self-supervised learning for sar atr with a joint-embedding predictive architecture. ISPRS Journal of Photogrammetry and Remote Sensing, 218:326–338, 2024

2024

[46] [46]

Huang, S

R. Huang, S. Zhu, Y . Du, and H. Zhao. Moe-loco: Mixture of experts for multitask locomotion. In2025 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 14218–14225. IEEE, 2025

2025

[47] [47]

Rodriguez, C

A. Rodriguez, C. Li, L. Mazza, R. Younis, O. Hellig, S. Bodenstedt, M. Wagner, and S. Speidel. LAR-MoE: Latent-aligned routing for mixture of experts in robotic imitation learning.arXiv preprint arXiv:2603.08476, 2026

arXiv 2026

[48] [48]

Mazza, A

L. Mazza, A. Rodriguez, R. Younis, M. Lelis, O. Hellig, C. Li, S. Bodenstedt, M. Wagner, and S. Speidel. Supervised Mixture-of-Experts for surgical grasping and retraction.arXiv preprint arXiv:2601.21971, 2026

Pith/arXiv arXiv 2026

[49] [49]

B. Zoph, I. Bello, S. Kumar, N. Du, Y . Huang, J. Dean, N. Shazeer, and W. Fedus. ST-MoE: Designing stable and transferable sparse expert models.arXiv preprint arXiv:2202.08906, 2022

Pith/arXiv arXiv 2022

[50] [50]

Puigcerver, C

J. Puigcerver, C. Riquelme, B. Mustafa, and N. Houlsby. From sparse to soft mixtures of experts. InInternational Conference on Learning Representations (ICLR), 2024

2024

[51] [51]

Nguyen, N

T. Nguyen, N. Tran, K. Nguyen, and R. Baraniuk. Improving routing in sparse mixture of experts with graph of tokens.arXiv preprint arXiv:2505.00792, 2025

arXiv 2025

[52] [52]

Z. Su, Z. Lin, X. Bai, X. Wu, Y . Xiong, H. Lian, G. Ma, H. Chen, G. Ding, W. Zhou, and S. Hu. MaskMoE: Boosting token-level learning via routing mask in Mixture-of-Experts.arXiv preprint arXiv:2407.09816, 2024

arXiv 2024

[53] [53]

F. Xue, Z. Zheng, Y . Fu, J. Ni, Z. Zheng, W. Zhou, and Y . You. OpenMoE: An early effort on open mixture-of-experts language models.arXiv preprint arXiv:2402.01739, 2024. 12

arXiv 2024