pith. machine review for the scientific record.

arxiv: 2604.13645 · v1 · submitted 2026-04-15 · 💻 cs.RO · cs.AI · cs.LG

Recognition: unknown

A Mechanistic Analysis of Sim-and-Real Co-Training in Generative Robot Policies

Abhiram Maddukuri, Minghuan Liu, Yuke Zhu, Yu Lei, Zhenyu Jiang

Pith reviewed 2026-05-10 12:25 UTC · model grok-4.3

classification 💻 cs.RO · cs.AI · cs.LG
keywords sim-and-real co-training · generative robot policies · representation alignment · importance reweighting · robot manipulation · mechanistic analysis · domain adaptation

The pith

Structured representation alignment between simulation and real data primarily drives success in co-training generative robot policies.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines why combining scarce real robot data with abundant simulation data succeeds in training generative policies for robots. It identifies two governing effects: a primary one called structured representation alignment that balances how much representations match across domains with how well domains can still be distinguished, and a secondary importance reweighting effect that modulates action importance by domain. This matters because co-training is widely used yet its mechanisms were not clear, so clarifying them allows more reliable training when real data is limited. Validation comes from a controlled toy model plus sim-and-sim and sim-and-real manipulation experiments. The analysis unifies existing techniques and yields a straightforward improvement method.

Core claim

The authors establish that sim-and-real co-training performance is governed by two intrinsic effects. Structured representation alignment, reflecting the balance between cross-domain representation alignment and domain discernibility, plays the primary role in downstream performance. The importance reweighting effect, which arises from domain-dependent modulation of action weighting, operates at a secondary level. These effects are validated with theoretical analysis, a controlled toy model, and extensive sim-and-sim and sim-and-real robot manipulation experiments. The framework offers a unified interpretation of recent co-training techniques and motivates a simple method that consistently improves upon prior approaches.

What carries the argument

Structured representation alignment, the balance between cross-domain representation alignment and domain discernibility, which primarily determines downstream policy performance in co-training.

If this is right

  • Recent co-training techniques can be reinterpreted as operating through the primary alignment effect and the secondary reweighting effect.
  • Prioritizing balanced cross-domain alignment over pure domain mixing leads to stronger policy performance.
  • Modulating the secondary importance reweighting effect provides additional but smaller gains.
  • A simple adjustment method derived from the analysis improves results over prior co-training approaches in manipulation tasks.
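The "domain mixing" in the bullets above is usually controlled by a mixing ratio, written w in the Figure 3 caption. A minimal sketch of how such a ratio could drive batch sampling is given here. It assumes w is the per-sample probability of drawing from the real (target) domain, which this page does not confirm; `sample_cotraining_batch` and the integer stand-in datasets are hypothetical.

```python
import random


def sample_cotraining_batch(real_data, sim_data, w, batch_size, rng):
    """Draw a batch in which each item comes from real_data with probability w.

    Assumption: w is the probability of picking the real (target) domain for
    each sample; the paper may define its mixing ratio differently.
    """
    if not 0.0 <= w <= 1.0:
        raise ValueError("w must lie in [0, 1]")
    batch = []
    for _ in range(batch_size):
        if rng.random() < w:
            batch.append(("real", rng.choice(real_data)))
        else:
            batch.append(("sim", rng.choice(sim_data)))
    return batch


# Stand-in datasets: roughly 30 real and 3000 simulated samples,
# matching the sizes quoted in the Figure 3 caption.
rng = random.Random(0)
real = list(range(30))
sim = list(range(3000))

batch = sample_cotraining_batch(real, sim, w=0.1, batch_size=10_000, rng=rng)
real_fraction = sum(1 for domain, _ in batch if domain == "real") / len(batch)
print(f"fraction of real samples in batch: {real_fraction:.3f}")
```

With only about 30 real samples, even a small w means each real sample is revisited far more often than any simulated one, which is the kind of imbalance a mixing ratio is meant to control.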

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same alignment balance may guide co-training when using cross-embodiment data instead of simulation.
  • Direct measurement of representation alignment could become a diagnostic tool for choosing surrogate datasets in robotics.
  • The framework suggests testing whether the primary effect scales to higher-dimensional control tasks beyond the studied manipulations.
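Using alignment as a diagnostic would come down to correlating an alignment score with success rate, as the Figure 7 caption does with Spearman and Pearson coefficients. The sketch below implements both in plain Python. The five data points are synthetic placeholders chosen to be monotone but non-linear, not results from the paper, and the alignment score itself is left abstract.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


# Synthetic placeholder values (NOT from the paper): success grows
# monotonically but non-linearly with the alignment score.
alignment = [0.1, 0.2, 0.3, 0.4, 0.5]
success = [0.05, 0.10, 0.20, 0.45, 0.90]

print("Pearson :", round(pearson(alignment, success), 3))
print("Spearman:", round(spearman(alignment, success), 3))
```

On monotone but non-linear data the rank-based Spearman coefficient reaches 1 while Pearson stays below it, which is one reason to report both.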

Load-bearing premise

The controlled toy model and the chosen sim-and-sim and sim-and-real manipulation experiments isolate the two effects without confounding influences from policy architecture or task specifics.

What would settle it

An experiment in the toy model where structured representation alignment is deliberately disrupted while keeping importance reweighting fixed, yet co-training performance does not decline as predicted, would falsify the primary role of alignment.

Figures

Figures reproduced from arXiv: 2604.13645 by Abhiram Maddukuri, Minghuan Liu, Yuke Zhu, Yu Lei, Zhenyu Jiang.

Figure 1: (a) A workflow example of co-training systems for generative robot policies. We identify Structured Representation Alignment as the main intrinsic effect in co-training: it refers to both representation alignment and domain discernibility, which lay the foundation of action transfer and adaptation. (b) Representative observations: representation alignment can be implicitly learned with appropriate mixing r…
Figure 2
Figure 3: Visualization of controlled toy example. We co-train ∼30 and ∼3000 samples from target and source domain. Vertically in each column, the differences of prediction samples showcase the impact of representation alignment; horizontally in each row, the differences showcase the shift of importance reweighting effect controlled by mixing ratio w. In a special case, we can further have the following relation of …
Figure 4: L2 loss with sweeping mixing ratio and delta z distance, and ANOVA variance decomposition. When representations overlap (red line) or are disjoint (blue line), co-trained models are insensitive to mixing ratios. The importance reweighting effect (blue bar) can only explain 20% of the performance variance. experiments on real-world robotic manipulation tasks. 4. Sim-and-Real Co-Training for Manipulation T…
Figure 5: Visualizations of the designed sim-and-sim and sim-and-real tasks. In the physics-only setting, we vary the physical parameters of objects, including mass, friction, and size, while keeping the appearance similar. Across all target environments, we tune the robot controller configurations to match those in the real world. gaps and construct three sim-and-sim co-training settings, i.e., visual-only, physics…
Figure 6: UMAP visualization of latent features. We visualize the features after the vision stem and after the encoder trunk fϕ. Red and blue dots represent the real and simulation features, respectively. We show the results for a specific mixing ratio in this figure. More details and results are provided in the Appendix D.2. man) is lower (e.g., ∼0.4), suggesting that the relationship may be non-linear or only par…
Figure 7: Correlation between representation alignment and success rate. We group the results of different tasks in each setting and compute the Spearman and Pearson correlation coefficients after normalization. Blue, yellow, and green-series dots represent the same task in …
Figure 8: Success rate of sim-and-sim co-training with different domain gaps. Each data point is computed over three policy checkpoints, and each checkpoint is evaluated for 200 trials. The best performance is consistently achieved in the range of (0.016, 0.3). NutAssembly and MugCleanup. As we largely change the object’s physical parameters while keeping its visual appearance similar, it is harder for co-trained po…
Figure 9: Average performance on sim-and-sim experiments. We can find that alignment-based methods (+OT/+ADDA) get improvement in the balanced mixing ratio group, while discernibility-based methods (+CFG) get improvement mainly in the unbalanced mixing ratio group. More detailed results are available in the Appendix D.1. Method Success Rate Avg NutAssembly MugCleanup MugHang Real-only 11/30 8/30 6/30 8.6/30 Mixing R…
Figure 10: Performance of different co-training methods on MugCleanup with w = 0.1 in sim-and-sim settings. The plots are similar for all tasks and balanced mixing ratios. results are shown in …
Figure 11: N/M scaling B.3. Empirical Estimation [Fact 1. Gaussian concentration in high-dimensional space]: for a standard Gaussian vector ϵ ∼ N(0, I_d), we have ||ϵ|| = √d (1 + O(1)) with high probability. So for each component in p_r(x_t), p_s(x_t), it concentrates on a set of thin spherical shells centered at α_t x_i with radius σ_t √d. [Empirical evidence 1. Data points separation]: for most of the timesteps t that ar…
Figure 12: Empirical measurement about training sample overlapping using Bhattacharyya coefficient. We compute it in the same way as in Song et al. (2025), but include the distances between observation features as we are modeling the conditional distribution. [A Special Case Study]: Let’s assume an ideal case where the action chunks are uniformly distributed across different states on the trajectories, so the ratio …
Figure 13: Detailed results of sim-and-sim evaluations. D. Additional Experiments Results D.1. Detailed Results of Sim-and-sim Evaluation We provide the detailed sim-and-sim evaluation results of comparing different co-training techniques. In …
Figure 14: Balanced mixing ratios are robust to different policy architectures. Each data point is computed over three policy checkpoints, and each checkpoint is evaluated for 200 trials. The best performance is consistently achieved in the range of (0.016, 0.3). that the best performance consistently lies in a narrow range ([0.016, 0.3]), indicating robustness beyond the architecture used in the main paper. D.4. Co…
Figure 15: Experiments on different dataset sizes. Each data point is computed over three policy checkpoints, and each checkpoint is evaluated for 200 trials. The best performance is consistently achieved in the range of (wn, wq), which supports our guideline below.
Figure 16: Existence of three regimes in co-training.
Figure 17: More UMAP visualizations of the observation representations.
Figure 18: More UMAP visualizations of the deeper layer representations.
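The Figure 12 caption relies on the Bhattacharyya coefficient as an overlap measure. For two discrete distributions p and q over the same bins it is the sum of sqrt(p_i * q_i), equal to 1 for identical distributions and 0 for disjoint ones. The sketch below is a generic 1-D histogram version; the binning, the scalar features, and the function names are assumptions, and it does not reproduce the procedure of Song et al. (2025) that the paper follows.

```python
import math


def histogram(samples, lo, hi, bins):
    """Normalized histogram of the samples that fall in [lo, hi]."""
    counts = [0] * bins
    width = (hi - lo) / bins
    for s in samples:
        if lo <= s <= hi:
            idx = min(int((s - lo) / width), bins - 1)
            counts[idx] += 1
    total = sum(counts)
    if total == 0:
        raise ValueError("no samples fall inside [lo, hi]")
    return [c / total for c in counts]


def bhattacharyya_coefficient(p, q):
    """Sum over bins of sqrt(p_i * q_i) for two discrete distributions."""
    if len(p) != len(q):
        raise ValueError("p and q must use the same bins")
    return sum(math.sqrt(a * b) for a, b in zip(p, q))


# Hypothetical scalar features for two domains (illustration only).
real_feats = [0.05, 0.15, 0.25, 0.35]
sim_feats = [0.25, 0.35, 0.45, 0.55]

p_real = histogram(real_feats, 0.0, 1.0, bins=10)
p_sim = histogram(sim_feats, 0.0, 1.0, bins=10)
print("overlap:", bhattacharyya_coefficient(p_real, p_sim))
```

A histogram estimate like this is sensitive to the bin width, so in practice the binning would need to be fixed across all the comparisons being made.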
Original abstract

Co-training, which combines limited in-domain real-world data with abundant surrogate data such as simulation or cross-embodiment robot data, is widely used for training generative robot policies. Despite its empirical success, the mechanisms that determine when and why co-training is effective remain poorly understood. We investigate the mechanism of sim-and-real co-training through theoretical analysis and empirical study, and identify two intrinsic effects governing performance. The first, "structured representation alignment", reflects a balance between cross-domain representation alignment and domain discernibility, and plays a primary role in downstream performance. The second, the "importance reweighting effect", arises from domain-dependent modulation of action weighting and operates at a secondary level. We validate these effects with controlled experiments on a toy model and extensive sim-and-sim and sim-and-real robot manipulation experiments. Our analysis offers a unified interpretation of recent co-training techniques and motivates a simple method that consistently improves upon prior approaches. More broadly, our aim is to examine the inner workings of co-training and to facilitate research in this direction.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that sim-and-real co-training performance in generative robot policies is governed by two intrinsic effects: structured representation alignment (a balance between cross-domain alignment and domain discernibility) as the primary driver of downstream performance, and importance reweighting (domain-dependent action weighting) as secondary. It supports this via theoretical analysis, controlled toy-model experiments, and sim-and-sim plus sim-and-real robot manipulation tasks, providing a unified interpretation of prior techniques and motivating a simple improvement method.

Significance. If the mechanistic claims hold, the work offers a principled framework for understanding and improving co-training, which is widely used but poorly understood in robot learning. The dual theoretical-empirical approach and the concrete improvement method are practical strengths that could reduce reliance on ad-hoc tuning.

major comments (2)
  1. [§4] §4 (toy model): The claim that structured representation alignment is primary requires explicit isolation from importance reweighting. The experiments must show that alignment metrics explain substantially more performance variance than reweighting (e.g., via ablation or regression); without this, the primary/secondary distinction risks being confounded by generative sampling dynamics.
  2. [§5] §5 (sim-and-real manipulation experiments): Attribution of effects to policy performance would be strengthened by a variance decomposition or controlled ablation that holds reweighting fixed while modulating alignment (or vice versa). Current setups may entangle representation learning with action distribution changes, weakening causal claims about the two effects.
minor comments (2)
  1. [§3] Clarify notation for 'domain discernibility' and 'structured representation alignment' early in the theoretical section to distinguish them from standard contrastive or mutual-information metrics.
  2. [Figure 3] Figure captions for the toy-model results should explicitly state the controlled variables and any post-hoc data exclusion rules.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and for highlighting opportunities to strengthen the isolation of effects in our analysis. We address each major comment below and describe the revisions we will make to the manuscript.

Point-by-point responses
  1. Referee: [§4] §4 (toy model): The claim that structured representation alignment is primary requires explicit isolation from importance reweighting. The experiments must show that alignment metrics explain substantially more performance variance than reweighting (e.g., via ablation or regression); without this, the primary/secondary distinction risks being confounded by generative sampling dynamics.

    Authors: In the toy-model experiments of §4, we constructed the generative process such that the latent representation parameters (controlling cross-domain alignment and domain discernibility) are varied independently while the action distributions are held identical across domains. This design eliminates domain-dependent action reweighting by construction, allowing performance differences to be attributed to structured representation alignment. To make the primary/secondary distinction more rigorous, we will add in the revision (i) a set of ablations that explicitly fix reweighting and modulate alignment, and (ii) a regression analysis reporting the fraction of performance variance explained by alignment metrics versus reweighting terms. revision: yes

  2. Referee: [§5] §5 (sim-and-real manipulation experiments): Attribution of effects to policy performance would be strengthened by a variance decomposition or controlled ablation that holds reweighting fixed while modulating alignment (or vice versa). Current setups may entangle representation learning with action distribution changes, weakening causal claims about the two effects.

    Authors: We acknowledge that the sim-and-real tasks involve some coupling between representation learning and action distributions. Our sim-and-sim experiments already provide a more separable setting in which the two effects can be modulated independently. For the sim-and-real results, we will introduce additional controlled ablations in the revised manuscript: we will hold action reweighting fixed by employing domain-invariant action sampling and vary alignment through targeted regularization on the representation space. We will also report a variance decomposition that quantifies the relative contribution of each effect to downstream success rate. revision: yes
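Both the referee and the rebuttal call for a variance decomposition, and the Figure 4 caption mentions an ANOVA decomposition of this kind. As a rough illustration, for a balanced grid of results with one value per (alignment level, mixing ratio) cell, the share of variance attributable to each factor can be read off the main-effect sums of squares. The function name and the synthetic additive grid are assumptions for illustration; they are not the paper's data or analysis code.

```python
def variance_decomposition(grid):
    """Split total variance of a balanced two-factor grid into main effects.

    grid[i][j] is the metric at alignment level i and mixing-ratio level j
    (one observation per cell). Returns the fraction of the total sum of
    squares due to each factor and to the remainder (interaction + noise).
    """
    n_a, n_b = len(grid), len(grid[0])
    grand = sum(sum(row) for row in grid) / (n_a * n_b)
    row_means = [sum(row) / n_b for row in grid]
    col_means = [sum(grid[i][j] for i in range(n_a)) / n_a for j in range(n_b)]

    ss_total = sum((v - grand) ** 2 for row in grid for v in row)
    if ss_total == 0:
        raise ValueError("grid is constant; nothing to decompose")
    ss_alignment = n_b * sum((m - grand) ** 2 for m in row_means)
    ss_mixing = n_a * sum((m - grand) ** 2 for m in col_means)
    ss_residual = ss_total - ss_alignment - ss_mixing

    return {
        "alignment": ss_alignment / ss_total,
        "mixing_ratio": ss_mixing / ss_total,
        "residual": ss_residual / ss_total,
    }


# Synthetic additive grid (NOT from the paper): the alignment factor moves
# the metric four times as much per level as the mixing-ratio factor.
grid = [[4.0 * i + 1.0 * j for j in range(3)] for i in range(3)]
shares = variance_decomposition(grid)
for name, share in shares.items():
    print(f"{name}: {share:.3f}")
```

A decomposition of this form only separates the two effects cleanly when the design is balanced and the factors are varied independently, which is the condition the referee's comments ask the experiments to meet.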

Circularity Check

0 steps flagged

No significant circularity; analysis derives effects from theory then validates independently on held-out experiments

full rationale

The paper's chain begins with theoretical identification of two effects (structured representation alignment as primary, importance reweighting as secondary) from mechanistic analysis of co-training, followed by validation on a controlled toy model plus separate sim-and-sim and sim-and-real manipulation tasks. No quoted equations or claims show a 'prediction' or derived quantity reducing by construction to the same fitted inputs or data used to define the effects. Self-citations, if present, are not load-bearing for the central claims, and experiments are described as isolating the effects via controlled variation. The derivation remains self-contained against external benchmarks rather than tautological.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central claim rests on the assumption that the toy model faithfully captures the representation and weighting dynamics of real generative policies; no free parameters or invented entities are described in the abstract.



Reference graph

Works this paper leans on

47 extracted references · 34 canonical work pages · 9 internal anchors

  1. [1]

    A careful examination of large behavior models for multitask dexterous manipulation. arXiv preprint arXiv:2507.05331, 2025

    Barreiros, J., Beaulieu, A., Bhat, A., Cory, R., Cousineau, E., Dai, H., Fang, C.-H., Hashimoto, K., Irshad, M. Z., Itkina, M., et al. A careful examination of large behavior models for multitask dexterous manipulation. arXiv preprint arXiv:2507.05331,

  2. [2]

    GR00T N1: An Open Foundation Model for Generalist Humanoid Robots

    Bjorck, J., Castañeda, F., Cherniadev, N., Da, X., Ding, R., Fan, L., Fang, Y., Fox, D., Hu, F., Huang, S., et al. Gr00t n1: An open foundation model for generalist humanoid robots. arXiv preprint arXiv:2503.14734,

  3. [3]

    In-n-on: Scaling egocentric manipulation with in-the-wild and on-task data

    Cai, X., Qiu, R.-Z., Chen, G., Wei, L., Liu, I., Huang, T., Cheng, X., and Wang, X. In-n-on: Scaling egocentric manipulation with in-the-wild and on-task data. arXiv preprint arXiv:2511.15704,

  4. [4]

    Generalizable domain adaptation for sim-and-real policy co-training

    Cheng, S., Ma, L., Chen, Z., Mandlekar, A., Garrett, C., and Xu, D. Generalizable domain adaptation for sim-and-real policy co-training. arXiv preprint arXiv:2509.18631,

  5. [5]

    Scaling cross-embodied learning: One policy for manipulation, navigation, locomotion and aviation

    Doshi, R., Walke, H., Mees, O., Dasari, S., and Levine, S. Scaling cross-embodied learning: One policy for manipulation, navigation, locomotion and aviation. arXiv preprint arXiv:2408.11812,

  6. [6]

    Diffusion models and gaussian flow matching: Two sides of the same coin

    Gao, R., Hoogeboom, E., Heek, J., De Bortoli, V., Murphy, K. P., and Salimans, T. Diffusion models and gaussian flow matching: Two sides of the same coin. In The Fourth Blogpost Track at ICLR 2025,

  7. [7]

    Demystifying diffusion policies: Action memorization and simple lookup table alternatives

    He, C., Liu, X., Camps, G. S., Sartoretti, G., and Schwager, M. Demystifying diffusion policies: Action memorization and simple lookup table alternatives. arXiv preprint arXiv:2505.05787,

  8. [8]

    Classifier-Free Diffusion Guidance

    Ho, J. and Salimans, T. Classifier-free diffusion guidance. arXiv preprint arXiv:2207.12598,

  9. [9]

    PolaRiS: Scalable real-to-sim evaluations for generalist robot policies, 2025

    Jain, A., Zhang, M., Arora, K., Chen, W., Torne, M., Irshad, M. Z., Zakharov, S., Wang, Y., Levine, S., Finn, C., et al. Polaris: Scalable real-to-sim evaluations for generalist robot policies. arXiv preprint arXiv:2512.16881,

  10. [10]

    Dexmimicgen: Automated data generation for bimanual dexterous manipulation via imitation learning

    Jiang, Z., Xie, Y., Lin, K., Xu, Z., Wan, W., Mandlekar, A., Fan, L. J., and Zhu, Y. Dexmimicgen: Automated data generation for bimanual dexterous manipulation via imitation learning. In 2025 IEEE International Conference on Robotics and Automation (ICRA), pp. 16923–16930. IEEE,

  11. [11]

    Emergence of Human to Robot Transfer in Vision-Language-Action Models. arXiv preprint arXiv:2512.22414, 2025

    Kareer, S., Patel, D., Punamiya, R., Mathur, P., Cheng, S., Wang, C., Hoffman, J., and Xu, D. Egomimic: Scaling imitation learning via egocentric video. In 2025 IEEE International Conference on Robotics and Automation (ICRA), pp. 13226–13233. IEEE, 2025a. 11 How Does Sim-and-Real Co-Training Work in Generative Robot Policies? Kareer, S., Pertsch, K., Darp...

  12. [12]

    Masquerade: Learning from in-the-wild human videos using data-editing. arXiv preprint arXiv:2508.09976, 2025

    Lepert, M., Fang, J., and Bohg, J. Masquerade: Learning from in-the-wild human videos using data-editing. arXiv preprint arXiv:2508.09976, 2025a. Lepert, M., Fang, J., and Bohg, J. Phantom: Training robots without robots using only human videos. arXiv preprint arXiv:2503.00779, 2025b. Li, S., Gao, Y., Sadigh, D., and Song, S. Unified video action model. ...

  13. [13]

    Video generators are robot policies. arXiv preprint arXiv:2508.00795, 2025

    Liang, J., Tokmakov, P., Liu, R., Sudhakar, S., Shah, P., Ambrus, R., and Vondrick, C. Video generators are robot policies. arXiv preprint arXiv:2508.00795,

  14. [14]

    A Systematic Study of Data Modalities and Strategies for Co-training Large Behavior Models for Robot Manipulation

    Lin, F., Arora, K., Mercat, J., Nishimura, H., Shah, P., Xu, C., Zhang, M., Zolotas, M., Angeles, M., Pfannenstiehl, O., et al. A systematic study of data modalities and strategies for co-training large behavior models for robot manipulation. arXiv preprint arXiv:2602.01067,

  15. [15]

    Constraint-preserving data generation for visuomotor policy learning

    Lin, K., Ragunath, V., McAlinden, A., Prasad, A., Wu, J., Zhu, Y., and Bohg, J. Constraint-preserving data generation for visuomotor policy learning. arXiv preprint arXiv:2508.03944,

  16. [16]

    Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow

    Liu, X., Gong, C., and Liu, Q. Flow straight and fast: Learning to generate and transfer data with rectified flow. arXiv preprint arXiv:2209.03003,

  17. [17]

    Being-h0: Vision-language-action pretraining from large-scale human videos, 2025

    Luo, H., Feng, Y., Zhang, W., Zheng, S., Wang, Y., Yuan, H., Liu, J., Xu, C., Jin, Q., and Lu, Z. Being-h0: vision-language-action pretraining from large-scale human videos. arXiv preprint arXiv:2507.15597,

  18. [18]

    Sim-and-real co-training: A simple recipe for vision-based robotic manipulation

    Maddukuri, A., Jiang, Z., Chen, L. Y., Nasiriany, S., Xie, Y., Fang, Y., Huang, W., Wang, Z., Xu, Z., Chernyadev, N., et al. Sim-and-real co-training: A simple recipe for vision-based robotic manipulation. arXiv preprint arXiv:2503.24361,

  19. [19]

    Mimicgen: A data generation system for scalable robot learning using human demonstrations, 2023

    Mandlekar, A., Nasiriany, S., Wen, B., Akinola, I., Narang, Y., Fan, L., Zhu, Y., and Fox, D. Mimicgen: A data generation system for scalable robot learning using human demonstrations. arXiv preprint arXiv:2310.17596,

  20. [20]

    UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction

    McInnes, L., Healy, J., and Melville, J. Umap: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426,

  21. [21]

    Isaac Lab: A GPU-Accelerated Simulation Framework for Multi-Modal Robot Learning

    Mittal, M., Roth, P., Tigue, J., Richard, A., Zhang, O., Du, P., Serrano-Muñoz, A., Yao, X., Zurbrügg, R., Rudin, N., et al. Isaac lab: A gpu-accelerated simulation framework for multi-modal robot learning. arXiv preprint arXiv:2511.04831,

  22. [22]

    RoboCasa: Large-Scale Simulation of Everyday Tasks for Generalist Robots

    Nasiriany, S., Maddukuri, A., Zhang, L., Parikh, A., Lo, A., Joshi, A., Mandlekar, A., and Zhu, Y. Robocasa: Large-scale simulation of everyday tasks for generalist robots. arXiv preprint arXiv:2406.02523,

  23. [23]

    6892–6903

    In 2024 IEEE International Conference on Robotics and Automation (ICRA), pp. 6892–6903. IEEE,

  24. [24]

    Much ado about noising: Dispelling the myths of generative robotic control,

    Pan, C., Anantharaman, G., Huang, N.-C., Jin, C., Pfrommer, D., Yuan, C., Permenter, F., Qu, G., Boffi, N., Shi, G., et al. Much ado about noising: Dispelling the myths of generative robotic control. arXiv preprint arXiv:2512.01809,

  25. [25]

    π0.5: a Vision-Language-Action Model with Open-World Generalization

    Physical Intelligence. π0.5: a vision-language-action model with open-world generalization. arXiv preprint arXiv:2504.16054,

  26. [26]

    Selective underfitting in diffusion models. arXiv preprint arXiv:2510.01378,

    Song, K., Kim, J., Chen, S., Du, Y., Kakade, S., and Sitzmann, V. Selective underfitting in diffusion models. arXiv preprint arXiv:2510.01378,

  27. [27]

    Empirical analysis of sim-and-real cotraining of diffusion policies for planar pushing from pixels. arXiv preprint arXiv:2503.22634,

    Wei, A., Agarwal, A., Chen, B., Bosworth, R., Pfaff, N., and Tedrake, R. Empirical analysis of sim-and-real cotraining of diffusion policies for planar pushing from pixels. arXiv preprint arXiv:2503.22634,

  28. [28]

    Dexumi: Using human hand as the universal manipulation interface for dexterous manipulation, 2025

    Xu, M., Zhang, H., Hou, Y., Xu, Z., Fan, L., Veloso, M., and Song, S. Dexumi: Using human hand as the universal manipulation interface for dexterous manipulation. arXiv preprint arXiv:2505.21864,

  29. [29]

    Pushing the limits of cross-embodiment learning for manipulation and navigation

    Yang, J., Glossop, C., Bhorkar, A., Shah, D., Vuong, Q., Finn, C., Sadigh, D., and Levine, S. Pushing the limits of cross-embodiment learning for manipulation and navigation. arXiv preprint arXiv:2402.19432,

  30. [30]

    Latent Action Pretraining from Videos

    Ye, S., Jang, J., Jeon, B., Joo, S., Yang, J., Peng, B., Mandlekar, A., Tan, R., Chao, Y.-W., Lin, B. Y., et al. Latent action pretraining from videos. arXiv preprint arXiv:2410.11758,

  31. [31]

    Real2render2real: Scaling robot data without dynamics simulation or robot hardware,

    Yu, J., Fu, L., Huang, H., El-Refai, K., Ambrus, R. A., Cheng, R., Irshad, M. Z., and Goldberg, K. Real2render2real: Scaling robot data without dynamics simulation or robot hardware. arXiv preprint arXiv:2505.09601,

  32. [32]

    Motiontrans: Human vr data enable motion-level learning for robotic manipulation policies. arXiv preprint arXiv:2509.17759, 2025

    Yuan, C., Zhou, R., Liu, M., Hu, Y., Wang, S., Yi, L., Wen, C., Zhang, S., and Gao, Y. Motiontrans: Human vr data enable motion-level learning for robotic manipulation policies. arXiv preprint arXiv:2509.17759,

  33. [33]

    Mujoco playground,

    Zakka, K., Tabanpour, B., Liao, Q., Haiderbhai, M., Holt, S., Luo, J. Y., Allshire, A., Frey, E., Sreenath, K., Kahrs, L. A., et al. Mujoco playground. arXiv preprint arXiv:2502.08844,

  34. [34]

    Robotic control via embodied chain-of-thought reasoning

    Zawalski, M., Chen, W., Pertsch, K., Mees, O., Finn, C., and Levine, S. Robotic control via embodied chain-of-thought reasoning. arXiv preprint arXiv:2407.08693,

  35. [35]

    DINO-WM: World models on pre-trained visual features enable zero-shot planning. arXiv preprint arXiv:2411.04983,

    Zhou, G., Pan, H., LeCun, Y., and Pinto, L. Dino-wm: World models on pre-trained visual features enable zero-shot planning. arXiv preprint arXiv:2411.04983,

  36. [36]

    Unified World Models: Coupling Video and Action Diffusion for Pretraining on Large Robotic Datasets

    Zhu, C., Yu, R., Feng, S., Burchfiel, B., Shah, P., and Gupta, A. Unified world models: Coupling video and action diffusion for pretraining on large robotic datasets. arXiv preprint arXiv:2504.02792,

  37. [37]

    robosuite: A Modular Simulation Framework and Benchmark for Robot Learning

    Zhu, Y., Wong, J., Mandlekar, A., Martín-Martín, R., Joshi, A., Nasiriany, S., and Zhu, Y. robosuite: A modular simulation framework and benchmark for robot learning. arXiv preprint arXiv:2009.12293,

  38. [38]

    Simulation Pre-training+Real Fine-tine

    Contents: 1 Introduction · 2 Theoretical Analysis of Co-Training · 2.1 Structured Representation Alignment · 2.2 Importance Reweighting Effect ...

  39. [39]

    and automated data generation tools (Mandlekar et al., 2023; Jiang et al., 2025; Lin et al., 2025), high-quality and massive robot trajectories can be obtained easily. Many works have demonstrated the effectiveness of sim-and-real co-training on challenging manipulation tasks with small diffusion policy models (Maddukuri et al., 2025; Wei et al., 2025; Ch...

  40. [40]

    There are also works (Barreiros et al., 2025; Jain et al.,

    and even large Vision-Language-Action (VLA) models (Bjorck et al., 2025; Yu et al., 2025). There are also works (Barreiros et al., 2025; Jain et al.,

  41. [41]

    Cross-Embodiment Co-Training. A large amount of work (Doshi et al., 2024; Yang et al., 2024; O’Neill et al., 2024; Physical Intelligence, 2025; Yuan et al.,

    utilizing relatively large-scale real-world data and a small amount of simulation data for co-training, so that the performance evaluated in simulation can effectively reflect the performance in the real world. Cross-Embodiment Co-Training. A large amount of work (Doshi et al., 2024; Yang et al., 2024; O’Neill et al., 2024; Physical Intelligence, 2025; Yua...

  42. [42]

    The main domain gaps include the visual appearance and the embodiment-physics gap

    has explored using cross-embodiment robot data for co-training, where a single policy is trained with multiple embodiments with a unified architecture and action representation. The main domain gaps include the visual appearance and the embodiment-physics gap. Human data is a special case of them, which can be directly treated as another embodiment. These...

  43. [43]

    Another line of work (Xu et al., 2025; Lepert et al., 2025b;a) forces input data distribution alignment via image editing

    utilize representation alignment regularization, such as optimal transport and adversarial discriminator. Another line of work (Xu et al., 2025; Lepert et al., 2025b;a) forces input data distribution alignment via image editing. Non-Robot Data Co-Training. Internet-scale multimodal data without action labels is another valuable co-training resource. Numero...

  44. [44]

    explicitly extract action labels but with the problem of accuracy; some works (Ye et al., 2024; Zhou et al.,

  45. [45]

    explored latent action representations by encoding changes between video frames; other works (Li et al., 2025; Zhu et al., 2025; Liang et al.,

  46. [46]

    propose to co-train the action- and actionless video data within a unified architecture. A.2. Representation Learning in Domain Adaptation Co-training can also be viewed as semi-supervised or unsupervised domain adaptation. A central theme in domain adaptation is learning representations that enable knowledge transfer across domains while mitigating distr...

  47. [47]

    These make the softmax weights extremely imbalanced, which is nearly biased towards the nearest training data. [Figure 12 plot, "Overlap measurement": x-axis x, y-axis Bhattacharyya coefficient, series Real-Sim and Real-Real] Figure 12. Empirical measurement about training sample overla...