HyperSim: A Holistic Sim-To-Real Framework For Robust Robotic Manipulation

Haotian Luo; Heng Zhang; Jingyi Guo; Junyi Dong; Ping Luo; Qiuyu Feng; Shengwei Bian; Shunbo Zhou; Sitong Mao; Wenhao Chen

arxiv: 2605.26638 · v1 · pith:UY5EN7XQnew · submitted 2026-05-26 · 💻 cs.RO

HyperSim: A Holistic Sim-To-Real Framework For Robust Robotic Manipulation

Junyi Dong , Haotian Luo , Ziwei Xu , Shengwei Bian , Heng Zhang , Sitong Mao , Jingyi Guo , Yang Xu

show 6 more authors

Wenhao Chen Qiuyu Feng Yao Mu Ping Luo Shunbo Zhou Xiaodong Wu

This is my paper

Pith reviewed 2026-06-29 17:20 UTC · model grok-4.3

classification 💻 cs.RO

keywords sim-to-real transferrobotic manipulationadversarial trajectoriessynthetic data generationpolicy co-trainingdomain gaprobustness to perturbations

0 comments

The pith

HyperSim uses three pillars to reach 80-95 percent sim-to-real success in robotic manipulation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents HyperSim as a full pipeline that starts with synthetic data and ends with policies that run on physical robots. It builds on three elements: making simulated scenes match real visuals and physics closely, generating trajectories that expose policy weaknesses, and training models on a mix of simulated and real data to learn features that stay consistent across domains. Tests across 400 real executions show the approach delivers 80 percent success with one policy model and 95 percent with another, while also raising robustness to physical changes by 35 percent. A reader would care because the method offers a scalable route to training capable robots without needing equally large amounts of expensive real-world data collection.

Core claim

HyperSim bridges the sim-to-real gap through high-fidelity environment synthesis to match visual details, adversarial trajectory generation to cover hard cases, and sim-and-real co-training to learn invariant features. Validated on 400 real executions, it delivers 80 percent success with ACT and 95 percent with π0 models, plus 35 percent better robustness to physical perturbations.

What carries the argument

The three core pillars of high-fidelity environment synthesis, adversarial trajectory generation, and sim-and-real co-training that together reduce visual, coverage, and representation gaps between simulation and reality.

If this is right

The full pipeline produces 80 percent success when transferring ACT policies to physical robots.
The full pipeline produces 95 percent success when transferring π0 policies to physical robots.
Policies trained on the generated adversarial trajectories complete tasks at a 35 percent higher rate when facing physical perturbations.
The combination of the three pillars reduces the effective domain gap enough for reliable deployment after limited real-world fine-tuning.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same three-pillar structure could be applied to other robot skills such as navigation or assembly if the synthesis and co-training steps generalize.
Future tests could check whether the co-training step still works with far fewer than 400 real executions to further cut data costs.
The robustness improvement might compound in settings with multiple simultaneous uncertainties like moving obstacles and sensor noise.

Load-bearing premise

The three pillars together close the domain gap for the tested tasks without introducing new failure modes that the 400 real executions do not capture.

What would settle it

Additional real-world trials on the same tasks but with new variations in lighting, object properties, or dynamics that produce success rates well below 80 percent or 95 percent would indicate the claim does not hold.

Figures

Figures reproduced from arXiv: 2605.26638 by Haotian Luo, Heng Zhang, Jingyi Guo, Junyi Dong, Ping Luo, Qiuyu Feng, Shengwei Bian, Shunbo Zhou, Sitong Mao, Wenhao Chen, Xiaodong Wu, Yang Xu, Yao Mu, Ziwei Xu.

**Figure 1.** Figure 1: Requiring minimal human-collected data (a one-time environment scan and a few dozen demonstrations), our method reconstructs photorealistic [PITH_FULL_IMAGE:figures/full_fig_p001_1.png] view at source ↗

**Figure 2.** Figure 2: Overview of the HyperSim framework. HyperSim couples a standard data-to-policy pipeline with an enhancement layer to support physical deployment. By integrating high-fidelity environments, adversarial trajectories, and co-training, HyperSim systematically bridges the sim-to-real gap in visual fidelity, data coverage, and cross-domain feature representation [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗

**Figure 3.** Figure 3: Spatial relation constraints used for foreground scene generation. 1 Huawei Proprietary - Restricted Distribution x y z [PITH_FULL_IMAGE:figures/full_fig_p004_3.png] view at source ↗

**Figure 4.** Figure 4: The bottleneck pose and local coordinate frames. The gripper TCP enters a small hemisphere of radius d with respect to the target object. gripper controllers [12], [14]. An illustrative example of these segmented primitives is provided in Fig.5. 2) Adversarial Perturbation And Recovery: Building upon this piecewise formulation, we propose a mechanism to inject abrupt perturbations into the target’s state (… view at source ↗

**Figure 6.** Figure 6: Initialization of 20 real-world evaluation trials with varied target poses [PITH_FULL_IMAGE:figures/full_fig_p005_6.png] view at source ↗

**Figure 7.** Figure 7: Comparison of target 2D pose distribution (a) and orientation distribution (b). The 2D poses in BaseSim are heavily concentrated at the workspace center, forming an elongated pattern. In contrast, ADSim expands these poses across the entire workspace to increase the spatial coverage. Similarly, the orientation distribution in BaseSim is skewed toward the [0, 180o ] interval, while ADSim provides a more uni… view at source ↗

**Figure 8.** Figure 8: Comparison of robot head-camera observations. Compared to the real-world observation (a), BaseSim and ADSim generate observations in a clean background (b), whereas 3DGS-ADSim (c) provides aligned observations with complex background details. priors inherent in large-scale pre-trained models act synergistically with our high-quality synthetic data, dramatically lowering the zero-shot sim-to-real barrier. … view at source ↗

read the original abstract

Scaling data volume and diversity is critical for generalizing embodied intelligence. While synthetic data generation offers a scalable alternative to expensive physical data acquisition, transferring robotic manipulation policies from simulation to the real world (sim-to-real) remains a formidable challenge due to the domain gap. This paper presents HyperSim, a holistic framework spanning from synthetic data generation to policy training and seamless real-world deployment. To systematically bridge the sim-to-real gap, HyperSim is realized through three core pillars: high-fidelity environment synthesis, adversarial trajectory generation, and sim-and-real co-training. Collectively, these modules address domain discrepancies by enhancing visual fidelity, expanding data coverage, and enforcing domain-invariant representations. We rigorously validate HyperSim through a large-scale empirical study involving 400 real-world task executions across two representative manipulation models. Assessed across three fine-grained metrics, our complete pipeline achieves remarkable sim-to-real success rates of 80% and 95% with ACT and \pi_{0}, respectively. Furthermore, policies trained on our adversarial trajectories exhibit significantly enhanced robustness against dynamic uncertainties, achieving a 35% higher completion rate under physical perturbations.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper names a three-pillar sim-to-real pipeline and reports 80-95% real-robot success plus a 35% robustness gain, but the abstract gives no baselines so the actual contribution cannot be judged.

read the letter

The central claim is that high-fidelity synthesis, adversarial trajectories, and sim-real co-training together produce 80% success with ACT and 95% with π0 across 400 real executions, plus 35% better robustness under perturbations. Without any control numbers the claim is impossible to assess.

The work does put those three pieces into one named framework and runs a sizable hardware study on two different policies. Four hundred real trials is more physical validation than most sim-to-real papers supply, and the tasks are standard manipulation ones. That scale is the main concrete output.

The soft spot is exactly what the stress-test note flags: no baselines appear for the non-adversarial case, the non-co-trained case, or simpler data-augmentation baselines on the same tasks. The 35% figure is relative only, with no absolute rates, no variance, and no per-task breakdown. The abstract also does not say how the 400 executions were chosen or whether any new failure modes showed up only after the full pipeline. These omissions make it hard to know whether the three pillars are doing the work or whether the numbers simply reflect more total data.

This is for groups that need a practical checklist for moving manipulation policies from sim to hardware and are willing to add their own controls. The ideas are standard enough that a referee could check the missing comparisons in one round. I would send it to peer review rather than desk-reject because the real-robot scale is there and the gaps are fixable with straightforward additions.

Referee Report

3 major / 1 minor

Summary. The paper presents HyperSim, a holistic sim-to-real framework for robotic manipulation built on three pillars: high-fidelity environment synthesis, adversarial trajectory generation, and sim-and-real co-training. It claims that the complete pipeline achieves 80% success with ACT and 95% with π₀ across 400 real-world executions, plus a 35% robustness gain under perturbations, validated on three fine-grained metrics.

Significance. If the empirical claims hold with proper controls, the three-pillar design could meaningfully advance scalable sim-to-real transfer by jointly addressing visual fidelity, data coverage, and domain invariance. The large-scale real-world validation (400 executions) is a positive feature, but the absence of supporting details prevents assessing whether the gains are attributable to the proposed modules.

major comments (3)

[Abstract] Abstract: the reported 80%/95% success rates and 35% robustness improvement are presented as aggregate outcomes from 400 executions with no baselines (e.g., non-adversarial or non-co-trained variants), no per-task counts, no variance or statistical tests, and no description of execution sampling. This directly undermines evaluation of whether the three pillars close the domain gap.
[Abstract] Abstract: the claim that adversarial trajectories yield enhanced robustness lacks any comparison to the non-adversarial baseline on the same tasks and perturbations, making the 35% figure impossible to interpret as evidence for the second pillar.
[Abstract] Abstract: no discussion of failure modes or whether the full pipeline introduces new ones not captured in the 400 executions, which is load-bearing for the weakest assumption that the pillars are jointly sufficient without side effects.

minor comments (1)

[Abstract] The notation π₀ should be defined on first use or in a table of symbols for clarity.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the careful reading and for identifying areas where the abstract's presentation of results could be strengthened. We address each major comment below. Where the abstract lacks explicit detail, we will revise it to improve self-containment while ensuring the body of the paper already supplies the supporting analyses.

read point-by-point responses

Referee: [Abstract] Abstract: the reported 80%/95% success rates and 35% robustness improvement are presented as aggregate outcomes from 400 executions with no baselines (e.g., non-adversarial or non-co-trained variants), no per-task counts, no variance or statistical tests, and no description of execution sampling. This directly undermines evaluation of whether the three pillars close the domain gap.

Authors: The abstract summarizes headline outcomes; the Experiments section and supplementary material contain the requested controls, including baseline variants without adversarial trajectories or co-training, per-task success counts, standard deviations across runs, and the randomized sampling protocol used for the 400 executions. To make these elements visible at the abstract level, we will add a concise clause referencing the controlled comparisons and statistical reporting. revision: yes
Referee: [Abstract] Abstract: the claim that adversarial trajectories yield enhanced robustness lacks any comparison to the non-adversarial baseline on the same tasks and perturbations, making the 35% figure impossible to interpret as evidence for the second pillar.

Authors: The 35% robustness gain is computed from paired experiments that directly compare policies trained with and without the adversarial trajectory module under identical perturbation conditions; these comparisons appear in Section 4.3. We will revise the abstract to explicitly state that the reported improvement is measured against the non-adversarial baseline on the same task set. revision: yes
Referee: [Abstract] Abstract: no discussion of failure modes or whether the full pipeline introduces new ones not captured in the 400 executions, which is load-bearing for the weakest assumption that the pillars are jointly sufficient without side effects.

Authors: Failure-mode analysis, including cases where the full pipeline does not improve or introduces new error patterns, is presented in the supplementary material and briefly summarized in the discussion section. The abstract will be updated to note that no novel failure modes attributable to the combined pipeline were observed beyond those already present in the individual components. revision: yes

Circularity Check

0 steps flagged

No circularity: claims are direct empirical measurements from real executions

full rationale

The paper reports success rates (80%/95%) and robustness gains (35%) as outcomes of 400 real-world task executions across two models. The abstract and provided text contain no equations, no fitted parameters renamed as predictions, no self-citations used to justify uniqueness or ansatzes, and no derivation chain that reduces to its own inputs. The three pillars are presented as engineering components whose joint effect is measured externally rather than derived by construction. This is a standard empirical validation case with no load-bearing circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Only the abstract is available, so the ledger records the domain assumption stated in the text; no free parameters or invented entities are named.

axioms (1)

domain assumption Domain discrepancies can be addressed by enhancing visual fidelity, expanding data coverage, and enforcing domain-invariant representations.
The abstract states that the three pillars collectively address domain discrepancies via these three mechanisms.

pith-pipeline@v0.9.1-grok · 5766 in / 1332 out tokens · 27169 ms · 2026-06-29T17:20:08.178126+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

25 extracted references · 13 canonical work pages · 5 internal anchors

[1]

OpenVLA: An Open-Source Vision-Language-Action Model

M. Kim, K. Pertsch, S. Karamcheti, T. Xiao, A. Balakrishna, S. Nair, R. Rafailov, E. Foster, G. Lam, P. Sanketi, Q. Vuong, T. Kollar, B. Burch- fiel, R. Tedrake, D. Sadigh, S. Levine, P. Liang, and C. Finn, “Open- vla: An open-source vision-language-action model,”arXiv preprint arXiv:2406.09246, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[2]

RDT-1B: a Diffusion Foundation Model for Bimanual Manipulation

S. Liu, L. Wu, B. Li, H. Tan, H. Chen, Z. Wang, K. Xu, H. Su, and J. Zhu, “Rdt-1b: a diffusion foundation model for bimanual manipula- tion,”arXiv preprint arXiv:2410.07864, 2024. TABLE V SPATIALRELATIONCONSTRAINTS Name Description scale(OBJ, RANGE) scale OBJ within the RANGE pose2D(OBJ, RANGE) randomize OBJ 2D pose within the RANGE pose3D(OBJ, RANGE) ran...

work page internal anchor Pith review Pith/arXiv arXiv 2024
[3]

Robocasa: Large-scale simulation of ev- eryday tasks for generalist robots,

S. Nasiriany, A. Maddukuri, L. Zhang, A. Parikh, A. Lo, A. Joshi, A. Mandlekar, and Y . Zhu, “Robocasa: Large-scale simulation of ev- eryday tasks for generalist robots,” inRobotics: Science and Systems (RSS), 2024

2024
[4]

Robomind: Benchmark on multi-embodiment intelligence normative data for robot manipulation,

K. Wu, C. Hou, J. Liu, Z. Che, X. Ju, Z. Yang, M. Li, Y . Zhao, Z. Xu, G. Yanget al., “Robomind: Benchmark on multi-embodiment intelligence normative data for robot manipulation,” inRobotics: Science and Systems (RSS) 2025. Robotics: Science and Systems Foundation, 2025. [Online]. Available: https://www.roboticsproceedings.org/rss21/p152.pdf

2025
[5]

Twinaligner: Visual-dynamic alignment empowers physics-aware real2sim2real for robotic manipulation,

H. Fan, H. Dai, J. Zhang, J. Li, Q. Yan, Y . Zhao, M. Gao, J. Wu, H. Tang, and H. Dong, “Twinaligner: Visual-dynamic alignment empowers physics-aware real2sim2real for robotic manipulation,” 2025. [Online]. Available: https://arxiv.org/abs/2512.19390

work page arXiv 2025
[6]

$\pi_{0.5}$: a Vision-Language-Action Model with Open-World Generalization

P. Intelligence, K. Black, N. Brown, J. Darpinian, K. Dhabalia, D. Driess, A. Esmail, M. Equi, C. Finn, N. Fusai, M. Y . Galliker, D. Ghosh, L. Groom, K. Hausman, B. Ichter, S. Jakubczak, T. Jones, L. Ke, D. LeBlanc, S. Levine, A. Li-Bell, M. Mothukuri, S. Nair, K. Pertsch, A. Z. Ren, L. X. Shi, L. Smith, J. T. Springenberg, K. Stachowicz, J. Tanner, Q. V...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[7]

Gen-0: Embodied foundation models that scale with physical interaction,

G. A. Team, “Gen-0: Embodied foundation models that scale with physical interaction,”Generalist AI Blog, 2025, https://generalistai.com/blog/nov-04-2025-GEN-0

2025
[8]

Robogen: Towards unleashing infinite data for automated robot learning via generative simulation.arXiv preprint arXiv:2311.01455, 2023

Y . Wang, Z. Xian, F. Chen, T.-H. Wang, Y . Wang, K. Fragkiadaki, Z. Erickson, D. Held, and C. Gan, “Robogen: Towards unleashing infinite data for automated robot learning via generative simulation,” arXiv preprint arXiv:2311.01455, 2023

work page arXiv 2023
[9]

MimicGen: A Data Generation System for Scalable Robot Learning using Human Demonstrations

A. Mandlekar, S. Nasiriany, B. Wen, I. Akinola, Y . Narang, L. Fan, Y . Zhu, and D. Fox, “Mimicgen: A data generation system for scalable robot learning using human demonstrations,”arXiv preprint arXiv:2310.17596, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[10]

Scaling up and distilling down: Language-guided robot skill acquisition,

H. Ha, P. Florence, and S. Song, “Scaling up and distilling down: Language-guided robot skill acquisition,” 2023

2023
[11]

Gensim2: Scaling robot data generation with multi-modal and reason- ing llms,

P. Hua, M. Liu, A. Macaluso, Y . Lin, W. Zhang, H. Xu, and L. Wang, “Gensim2: Scaling robot data generation with multi-modal and reason- ing llms,” in8th Annual Conference on Robot Learning
[12]

RoboTwin 2.0: A Scalable Data Generator and Benchmark with Strong Domain Randomization for Robust Bimanual Robotic Manipulation

T. Chen, Z. Chen, B. Chen, Z. Cai, Y . Liu, Z. Li, Q. Liang, X. Lin, Y . Ge, Z. Guet al., “Robotwin 2.0: A scalable data generator and benchmark with strong domain randomization for robust bimanual robotic manipulation,”arXiv preprint arXiv:2506.18088, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[13]

Ascent: Autonomous skill learning toward complex embodied tasks with foundation models,

H. Wu, Y . Liu, J. Dong, H. Zhang, S. Mao, H. Wang, W. Wu, and S. Zhou, “Ascent: Autonomous skill learning toward complex embodied tasks with foundation models,” in2025 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2025, pp. 16 752–16 758

2025
[14]

Interndata-a1: Pioneering high-fidelity synthetic data for pre-training generalist policy

Y . Tian, Y . Yang, Y . Xie, Z. Cai, X. Shi, N. Gao, H. Liu, X. Jiang, Z. Qiu, F. Yuan, Y . Li, P. Wang, J. Cai, J. Zeng, H. Dong, and J. Pang, “Interndata-a1: Pioneering high-fidelity synthetic data for pre-training generalist policy,” 2025. [Online]. Available: https://arxiv.org/abs/2511.16651

work page arXiv 2025
[15]

ACM Trans

B. Kerbl, G. Kopanas, T. Leimkuehler, and G. Drettakis, “3d gaussian splatting for real-time radiance field rendering,”ACM Trans. Graph., vol. 42, no. 4, Jul. 2023. [Online]. Available: https://doi.org/10.1145/3592433

work page doi:10.1145/3592433 2023
[16]

In: ACM SIGGRAPH 2024 Conference Pa- pers

B. Huang, Z. Yu, A. Chen, A. Geiger, and S. Gao, “2d gaussian splatting for geometrically accurate radiance fields,” inACM SIGGRAPH 2024 Conference Papers, ser. SIGGRAPH ’24. New York, NY , USA: Association for Computing Machinery, 2024. [Online]. Available: https://doi.org/10.1145/3641519.3657428

work page doi:10.1145/3641519.3657428 2024
[17]

Pgsr: Planar-based gaussian splatting for efficient and high-fidelity surface reconstruction,

D. Chen, H. Li, W. Ye, Y . Wang, W. Xie, S. Zhai, N. Wang, H. Liu, H. Bao, and G. Zhang, “Pgsr: Planar-based gaussian splatting for efficient and high-fidelity surface reconstruction,”IEEE Transactions on Visualization and Computer Graphics, vol. 31, no. 9, pp. 6100–6111, 2025

2025
[18]

Novel demonstration generation with gaussian splatting enables robust one-shot manipulation,

S. Yang, W. Yu, J. Zeng, J. Lv, K. Ren, C. Lu, D. Lin, and J. Pang, “Novel demonstration generation with gaussian splatting enables robust one-shot manipulation,” 2025. [Online]. Available: https://arxiv.org/abs/2504.13175

work page arXiv 2025
[19]

Coarse-to-fine imitation learning: Robot manipulation from a single demonstration,

E. Johns, “Coarse-to-fine imitation learning: Robot manipulation from a single demonstration,” in2021 IEEE international conference on robotics and automation (ICRA). IEEE, 2021, pp. 4613–4619

2021
[20]

Learning multi-stage tasks with one demon- stration via self-replay,

N. Di Palo and E. Johns, “Learning multi-stage tasks with one demon- stration via self-replay,” inConference on Robot Learning. PMLR, 2022, pp. 1180–1189

2022
[21]

Empirical analysis of sim-and-real cotraining of diffusion policies for planar pushing from pixels.arXiv preprint arXiv:2503.22634, 2025

A. Wei, A. Agarwal, B. Chen, R. Bosworth, N. Pfaff, and R. Tedrake, “Empirical analysis of sim-and-real cotraining of diffusion policies for planar pushing from pixels,” 2025. [Online]. Available: https://arxiv.org/abs/2503.22634

work page arXiv 2025
[22]

Sim-and-real co-training: A sim- ple recipe for vision-based robotic manipulation,

A. Maddukuri, Z. Jiang, L. Y . Chen, S. Nasiriany, Y . Xie, Y . Fang, W. Huang, Z. Wang, Z. Xu, N. Chernyadev, S. Reed, K. Goldberg, A. Mandlekar, L. Fan, and Y . Zhu, “Sim-and-real co-training: A sim- ple recipe for vision-based robotic manipulation,” inProceedings of Robotics: Science and Systems (RSS), Los Angeles, CA, USA, 2025

2025
[23]

Invariance co-training for robot visual generalization,

J. Yang, C. Finn, and D. Sadigh, “Invariance co-training for robot visual generalization,” 2025. [Online]. Available: https://arxiv.org/abs/2512.05230

work page arXiv 2025
[24]

Gpgs: Geometric priors for 3d gaussian splatting in structural environments,

Z. Xu, W. Chen, S. Wang, Z. Ouyang, S. Bian, and S. Zhou, “Gpgs: Geometric priors for 3d gaussian splatting in structural environments,” in2025 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2025, pp. 15 695–15 702

2025
[25]

O3DE Documentation,

Open 3D Foundation, “O3DE Documentation,” 2021. [Online]. Available: https://docs.o3de.org

2021

[1] [1]

OpenVLA: An Open-Source Vision-Language-Action Model

M. Kim, K. Pertsch, S. Karamcheti, T. Xiao, A. Balakrishna, S. Nair, R. Rafailov, E. Foster, G. Lam, P. Sanketi, Q. Vuong, T. Kollar, B. Burch- fiel, R. Tedrake, D. Sadigh, S. Levine, P. Liang, and C. Finn, “Open- vla: An open-source vision-language-action model,”arXiv preprint arXiv:2406.09246, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[2] [2]

RDT-1B: a Diffusion Foundation Model for Bimanual Manipulation

S. Liu, L. Wu, B. Li, H. Tan, H. Chen, Z. Wang, K. Xu, H. Su, and J. Zhu, “Rdt-1b: a diffusion foundation model for bimanual manipula- tion,”arXiv preprint arXiv:2410.07864, 2024. TABLE V SPATIALRELATIONCONSTRAINTS Name Description scale(OBJ, RANGE) scale OBJ within the RANGE pose2D(OBJ, RANGE) randomize OBJ 2D pose within the RANGE pose3D(OBJ, RANGE) ran...

work page internal anchor Pith review Pith/arXiv arXiv 2024

[3] [3]

Robocasa: Large-scale simulation of ev- eryday tasks for generalist robots,

S. Nasiriany, A. Maddukuri, L. Zhang, A. Parikh, A. Lo, A. Joshi, A. Mandlekar, and Y . Zhu, “Robocasa: Large-scale simulation of ev- eryday tasks for generalist robots,” inRobotics: Science and Systems (RSS), 2024

2024

[4] [4]

Robomind: Benchmark on multi-embodiment intelligence normative data for robot manipulation,

K. Wu, C. Hou, J. Liu, Z. Che, X. Ju, Z. Yang, M. Li, Y . Zhao, Z. Xu, G. Yanget al., “Robomind: Benchmark on multi-embodiment intelligence normative data for robot manipulation,” inRobotics: Science and Systems (RSS) 2025. Robotics: Science and Systems Foundation, 2025. [Online]. Available: https://www.roboticsproceedings.org/rss21/p152.pdf

2025

[5] [5]

Twinaligner: Visual-dynamic alignment empowers physics-aware real2sim2real for robotic manipulation,

H. Fan, H. Dai, J. Zhang, J. Li, Q. Yan, Y . Zhao, M. Gao, J. Wu, H. Tang, and H. Dong, “Twinaligner: Visual-dynamic alignment empowers physics-aware real2sim2real for robotic manipulation,” 2025. [Online]. Available: https://arxiv.org/abs/2512.19390

work page arXiv 2025

[6] [6]

$\pi_{0.5}$: a Vision-Language-Action Model with Open-World Generalization

P. Intelligence, K. Black, N. Brown, J. Darpinian, K. Dhabalia, D. Driess, A. Esmail, M. Equi, C. Finn, N. Fusai, M. Y . Galliker, D. Ghosh, L. Groom, K. Hausman, B. Ichter, S. Jakubczak, T. Jones, L. Ke, D. LeBlanc, S. Levine, A. Li-Bell, M. Mothukuri, S. Nair, K. Pertsch, A. Z. Ren, L. X. Shi, L. Smith, J. T. Springenberg, K. Stachowicz, J. Tanner, Q. V...

work page internal anchor Pith review Pith/arXiv arXiv 2025

[7] [7]

Gen-0: Embodied foundation models that scale with physical interaction,

G. A. Team, “Gen-0: Embodied foundation models that scale with physical interaction,”Generalist AI Blog, 2025, https://generalistai.com/blog/nov-04-2025-GEN-0

2025

[8] [8]

Robogen: Towards unleashing infinite data for automated robot learning via generative simulation.arXiv preprint arXiv:2311.01455, 2023

Y . Wang, Z. Xian, F. Chen, T.-H. Wang, Y . Wang, K. Fragkiadaki, Z. Erickson, D. Held, and C. Gan, “Robogen: Towards unleashing infinite data for automated robot learning via generative simulation,” arXiv preprint arXiv:2311.01455, 2023

work page arXiv 2023

[9] [9]

MimicGen: A Data Generation System for Scalable Robot Learning using Human Demonstrations

A. Mandlekar, S. Nasiriany, B. Wen, I. Akinola, Y . Narang, L. Fan, Y . Zhu, and D. Fox, “Mimicgen: A data generation system for scalable robot learning using human demonstrations,”arXiv preprint arXiv:2310.17596, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[10] [10]

Scaling up and distilling down: Language-guided robot skill acquisition,

H. Ha, P. Florence, and S. Song, “Scaling up and distilling down: Language-guided robot skill acquisition,” 2023

2023

[11] [11]

Gensim2: Scaling robot data generation with multi-modal and reason- ing llms,

P. Hua, M. Liu, A. Macaluso, Y . Lin, W. Zhang, H. Xu, and L. Wang, “Gensim2: Scaling robot data generation with multi-modal and reason- ing llms,” in8th Annual Conference on Robot Learning

[12] [12]

RoboTwin 2.0: A Scalable Data Generator and Benchmark with Strong Domain Randomization for Robust Bimanual Robotic Manipulation

T. Chen, Z. Chen, B. Chen, Z. Cai, Y . Liu, Z. Li, Q. Liang, X. Lin, Y . Ge, Z. Guet al., “Robotwin 2.0: A scalable data generator and benchmark with strong domain randomization for robust bimanual robotic manipulation,”arXiv preprint arXiv:2506.18088, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[13] [13]

Ascent: Autonomous skill learning toward complex embodied tasks with foundation models,

H. Wu, Y . Liu, J. Dong, H. Zhang, S. Mao, H. Wang, W. Wu, and S. Zhou, “Ascent: Autonomous skill learning toward complex embodied tasks with foundation models,” in2025 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2025, pp. 16 752–16 758

2025

[14] [14]

Interndata-a1: Pioneering high-fidelity synthetic data for pre-training generalist policy

Y . Tian, Y . Yang, Y . Xie, Z. Cai, X. Shi, N. Gao, H. Liu, X. Jiang, Z. Qiu, F. Yuan, Y . Li, P. Wang, J. Cai, J. Zeng, H. Dong, and J. Pang, “Interndata-a1: Pioneering high-fidelity synthetic data for pre-training generalist policy,” 2025. [Online]. Available: https://arxiv.org/abs/2511.16651

work page arXiv 2025

[15] [15]

ACM Trans

B. Kerbl, G. Kopanas, T. Leimkuehler, and G. Drettakis, “3d gaussian splatting for real-time radiance field rendering,”ACM Trans. Graph., vol. 42, no. 4, Jul. 2023. [Online]. Available: https://doi.org/10.1145/3592433

work page doi:10.1145/3592433 2023

[16] [16]

In: ACM SIGGRAPH 2024 Conference Pa- pers

B. Huang, Z. Yu, A. Chen, A. Geiger, and S. Gao, “2d gaussian splatting for geometrically accurate radiance fields,” inACM SIGGRAPH 2024 Conference Papers, ser. SIGGRAPH ’24. New York, NY , USA: Association for Computing Machinery, 2024. [Online]. Available: https://doi.org/10.1145/3641519.3657428

work page doi:10.1145/3641519.3657428 2024

[17] [17]

Pgsr: Planar-based gaussian splatting for efficient and high-fidelity surface reconstruction,

D. Chen, H. Li, W. Ye, Y . Wang, W. Xie, S. Zhai, N. Wang, H. Liu, H. Bao, and G. Zhang, “Pgsr: Planar-based gaussian splatting for efficient and high-fidelity surface reconstruction,”IEEE Transactions on Visualization and Computer Graphics, vol. 31, no. 9, pp. 6100–6111, 2025

2025

[18] [18]

Novel demonstration generation with gaussian splatting enables robust one-shot manipulation,

S. Yang, W. Yu, J. Zeng, J. Lv, K. Ren, C. Lu, D. Lin, and J. Pang, “Novel demonstration generation with gaussian splatting enables robust one-shot manipulation,” 2025. [Online]. Available: https://arxiv.org/abs/2504.13175

work page arXiv 2025

[19] [19]

Coarse-to-fine imitation learning: Robot manipulation from a single demonstration,

E. Johns, “Coarse-to-fine imitation learning: Robot manipulation from a single demonstration,” in2021 IEEE international conference on robotics and automation (ICRA). IEEE, 2021, pp. 4613–4619

2021

[20] [20]

Learning multi-stage tasks with one demon- stration via self-replay,

N. Di Palo and E. Johns, “Learning multi-stage tasks with one demon- stration via self-replay,” inConference on Robot Learning. PMLR, 2022, pp. 1180–1189

2022

[21] [21]

Empirical analysis of sim-and-real cotraining of diffusion policies for planar pushing from pixels.arXiv preprint arXiv:2503.22634, 2025

A. Wei, A. Agarwal, B. Chen, R. Bosworth, N. Pfaff, and R. Tedrake, “Empirical analysis of sim-and-real cotraining of diffusion policies for planar pushing from pixels,” 2025. [Online]. Available: https://arxiv.org/abs/2503.22634

work page arXiv 2025

[22] [22]

Sim-and-real co-training: A sim- ple recipe for vision-based robotic manipulation,

A. Maddukuri, Z. Jiang, L. Y . Chen, S. Nasiriany, Y . Xie, Y . Fang, W. Huang, Z. Wang, Z. Xu, N. Chernyadev, S. Reed, K. Goldberg, A. Mandlekar, L. Fan, and Y . Zhu, “Sim-and-real co-training: A sim- ple recipe for vision-based robotic manipulation,” inProceedings of Robotics: Science and Systems (RSS), Los Angeles, CA, USA, 2025

2025

[23] [23]

Invariance co-training for robot visual generalization,

J. Yang, C. Finn, and D. Sadigh, “Invariance co-training for robot visual generalization,” 2025. [Online]. Available: https://arxiv.org/abs/2512.05230

work page arXiv 2025

[24] [24]

Gpgs: Geometric priors for 3d gaussian splatting in structural environments,

Z. Xu, W. Chen, S. Wang, Z. Ouyang, S. Bian, and S. Zhou, “Gpgs: Geometric priors for 3d gaussian splatting in structural environments,” in2025 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2025, pp. 15 695–15 702

2025

[25] [25]

O3DE Documentation,

Open 3D Foundation, “O3DE Documentation,” 2021. [Online]. Available: https://docs.o3de.org

2021