GS-Playground: A High-Throughput Photorealistic Simulator for Vision-Informed Robot Learning
Pith reviewed 2026-05-07 16:07 UTC · model grok-4.3
The pith
GS-Playground reaches 10,000 frames per second of photorealistic rendering inside a parallel physics simulator for vision-based robot learning.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
GS-Playground is a multi-modal framework whose parallel physics engine integrates directly with a batch 3D Gaussian Splatting rendering pipeline to deliver synchronized, photorealistic images at 10^4 FPS at 640x480 resolution. An automated Real2Sim workflow reconstructs complex environments that are simultaneously photorealistic, physically consistent, and memory-efficient. Experiments across locomotion, navigation, and manipulation tasks show that policies trained in the system close the perceptual and physical gaps that have limited prior simulators.
What carries the argument
Batch 3D Gaussian Splatting rendering pipeline integrated with a custom parallel physics engine, which produces synchronized visual and physical states at high throughput while preserving fidelity.
Load-bearing premise
The batch rendering pipeline and physics engine stay synchronized at full speed without measurable loss of visual or physical accuracy, and the automated Real2Sim scenes are consistent enough for policies to transfer successfully to real robots.
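A minimal sketch of the lockstep pattern this premise assumes. Every name here (`physics_step`, `batch_render`, the toy dynamics) is a hypothetical stand-in of our own, not the paper's API: the batched renderer reads the same state buffer the vectorized physics step just wrote, so by construction the image for step t cannot lag the physics state of step t.

```python
import numpy as np

# Toy stand-ins (our assumptions, not GS-Playground code): a parallel
# physics step that advances N environments in one vectorized call, and
# a batch "render" that derives one image per environment directly from
# the current state buffer.
N_ENVS, STATE_DIM, H, W = 4, 3, 4, 4

def physics_step(state, dt=0.01):
    # Advance all environments at once (constant-velocity toy dynamics).
    pos, vel = state
    return pos + vel * dt, vel

def batch_render(pos):
    # One frame per environment, computed from the same buffer the
    # physics step just wrote -- the lockstep guarantee.
    return np.tile(pos[:, :1, None], (1, H, W))

pos = np.zeros((N_ENVS, STATE_DIM))
vel = np.ones((N_ENVS, STATE_DIM))
frames = []
for t in range(5):
    pos, vel = physics_step((pos, vel))   # 1. step physics for all envs
    frames.append(batch_render(pos))      # 2. render from the same state

print(len(frames), frames[-1].shape)  # 5 frames, each (N_ENVS, H, W)
```

The point of the sketch is only that synchronization is structural (render reads the buffer physics wrote) rather than enforced by timestamps after the fact.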
What would settle it
Measure end-to-end training throughput and sim-to-real success rate for a contact-rich manipulation policy; if FPS falls below 2000 or transfer success is no better than prior simulators under identical compute, the central performance claim does not hold.
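A throughput check of the kind proposed above can be sketched as follows; the harness and the dummy workload are our construction, not GS-Playground code. Effective FPS for a batched simulator is environments × steps / wall time, which is how a 10^4 FPS headline number would be verified in practice.

```python
import time
import numpy as np

def measure_fps(step_and_render, n_envs, n_steps):
    # Effective frames per second across all parallel environments.
    t0 = time.perf_counter()
    for _ in range(n_steps):
        step_and_render()
    elapsed = time.perf_counter() - t0
    return n_envs * n_steps / elapsed

# Stand-in workload: one vectorized state update plus allocation of a
# dummy 640x480 frame batch (a real check would call the simulator's
# actual step + render and synchronize the GPU before timing).
state = np.zeros((8, 6))
def fake_step_and_render():
    global state
    state = state + 0.01
    _ = np.empty((8, 480, 640), dtype=np.uint8)

fps = measure_fps(fake_step_and_render, n_envs=8, n_steps=50)
print(f"effective FPS: {fps:.0f}")
```

On a GPU pipeline the timing would additionally need a device synchronization before and after the loop, or the measured wall time understates the true cost.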
Original abstract
Embodied AI research is undergoing a shift toward vision-centric perceptual paradigms. While massively parallel simulators have catalyzed breakthroughs in proprioception-based locomotion, their potential remains largely untapped for vision-informed tasks due to the prohibitive computational overhead of large-scale photorealistic rendering. Furthermore, the creation of simulation-ready 3D assets heavily relies on labor-intensive manual modeling, while the significant sim-to-real physical gap hinders the transfer of contact-rich manipulation policies. To address these bottlenecks, we propose GS-Playground, a multi-modal simulation framework designed to accelerate end-to-end perceptual learning. We develop a novel high-performance parallel physics engine, specifically designed to integrate with a batch 3D Gaussian Splatting (3DGS) rendering pipeline to ensure high-fidelity synchronization. Our system achieves a breakthrough throughput of 10^4 FPS at 640x480 resolution, significantly lowering the barrier for large-scale visual RL. Additionally, we introduce an automated Real2Sim workflow that reconstructs photorealistic, physically consistent, and memory-efficient environments, streamlining the generation of complex simulation-ready scenes. Extensive experiments on locomotion, navigation, and manipulation demonstrate that GS-Playground effectively bridges the perceptual and physical gaps across diverse embodied tasks. Project homepage: https://gsplayground.github.io.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces GS-Playground, a multi-modal simulation framework for vision-informed robot learning. It combines a custom parallel physics engine with a batch 3D Gaussian Splatting (3DGS) rendering pipeline to achieve claimed high-throughput photorealistic simulation, reports a throughput of 10^4 FPS at 640x480 resolution, presents an automated Real2Sim workflow for reconstructing photorealistic and physically consistent environments, and evaluates the system on locomotion, navigation, and manipulation tasks to demonstrate bridging of perceptual and physical gaps.
Significance. If the throughput, synchronization fidelity, and policy-transfer results hold under rigorous validation, the work could meaningfully advance large-scale visual RL by reducing rendering overhead and manual asset creation, enabling more realistic sim-to-real transfer in contact-rich settings. The automated Real2Sim pipeline, if shown to produce memory-efficient and consistent scenes, represents a practical contribution to simulation asset generation.
Major comments (2)
- [Abstract] The headline claim of 10^4 FPS throughput at 640x480 resolution is presented as arising from the novel batch 3DGS + parallel physics integration, yet the abstract supplies no quantitative results, baselines, error bars, or experimental details on achieved FPS, rendering latency, or physics update rates. This absence prevents evaluation of the central performance claim.
- [Method] The claim that the batch 3DGS rendering pipeline's integration with the parallel physics engine delivers high-fidelity synchronization without loss of speed or visual/physical accuracy is load-bearing for the throughput result. No quantitative bounds on maximum frame-to-physics lag, desync error metrics, ablation studies isolating synchronization overhead, or comparisons of rendered depth/normal consistency against non-batched baselines in contact-rich regimes are referenced, leaving open the possibility that modest desynchronization under locomotion or manipulation loads degrades fidelity even as the reported throughput is preserved.
Minor comments (2)
- [Abstract] The statement that 'extensive experiments... demonstrate that GS-Playground effectively bridges the perceptual and physical gaps' is not accompanied by any summary statistics, success rates, or comparisons to prior simulators.
- [Abstract] The project homepage is referenced, but no information on code release, reproducibility artifacts, or dataset availability is provided in the manuscript.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment below with clarifications from the manuscript and indicate where revisions will be made to strengthen the presentation.
Point-by-point responses
- Referee: [Abstract] The headline claim of 10^4 FPS throughput at 640x480 resolution is presented as arising from the novel batch 3DGS + parallel physics integration, yet the abstract supplies no quantitative results, baselines, error bars, or experimental details on achieved FPS, rendering latency, or physics update rates. This absence prevents evaluation of the central performance claim.
Authors: The abstract is a high-level summary, while the full quantitative evaluation—including the 10^4 FPS measurement at 640x480, comparisons against baseline simulators, error bars from multiple runs, rendering latency, and physics update rates—is provided in the Experiments section with supporting tables and figures. To improve immediate evaluability of the central claim, we will revise the abstract to include concise references to these metrics and experimental conditions. revision: partial
- Referee: [Method] The claim that the batch 3DGS rendering pipeline's integration with the parallel physics engine delivers high-fidelity synchronization without loss of speed or visual/physical accuracy is load-bearing for the throughput result. No quantitative bounds on maximum frame-to-physics lag, desync error metrics, ablation studies isolating synchronization overhead, or comparisons of rendered depth/normal consistency against non-batched baselines in contact-rich regimes are referenced, leaving open the possibility that modest desynchronization under locomotion or manipulation loads degrades fidelity even as the reported throughput is preserved.
Authors: The parallel physics engine and batch 3DGS pipeline are explicitly co-designed for lockstep updates, with the reported 10^4 FPS achieved on contact-rich locomotion, navigation, and manipulation tasks serving as indirect evidence that synchronization overhead remains negligible. However, we acknowledge that explicit quantitative validation would strengthen the claim. In the revised manuscript we will add: (i) measured bounds on frame-to-physics lag, (ii) desync error metrics (position/velocity discrepancies), (iii) ablation studies isolating synchronization overhead, and (iv) depth/normal consistency comparisons against non-batched baselines under manipulation loads. revision: yes
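One simple way the promised desync error metric (item ii) could be operationalized — a sketch under our own assumptions, not the authors' definition: model the rendered pose at step t as coming from physics step t − k, and report the RMS position discrepancy a k-step lag would induce in a visual policy's observations.

```python
import numpy as np

def desync_rms(physics_traj, lag_steps):
    # physics_traj: (T, n_envs, 3) positions. The rendered pose at step t
    # is assumed to come from step t - lag_steps; the RMS difference
    # between lagged and actual positions is the desync error.
    if lag_steps == 0:
        return 0.0
    rendered = physics_traj[:-lag_steps]
    actual = physics_traj[lag_steps:]
    return float(np.sqrt(np.mean((actual - rendered) ** 2)))

# Synthetic trajectory: every environment drifts a steady 1 cm per step.
traj = np.cumsum(np.full((100, 4, 3), 0.01), axis=0)
print(desync_rms(traj, 0))  # lockstep rendering -> zero desync
print(desync_rms(traj, 2))  # two-step lag -> ~0.02 (2 cm RMS error)
```

A real evaluation would use recorded simulator trajectories and also report the velocity analogue, since contact-rich manipulation is more sensitive to velocity error than a constant-drift toy suggests.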
Circularity Check
No circularity; claims rest on implementation and empirical validation
Full rationale
The paper is a systems contribution describing an integrated simulator (batch 3DGS rendering + parallel physics engine) and an automated Real2Sim workflow. No mathematical derivations, equations, fitted parameters presented as predictions, or first-principles results appear in the abstract or method descriptions. Throughput figures and synchronization claims are presented as measured outcomes of the implementation rather than derived quantities that reduce to their own inputs. No self-citations are invoked as load-bearing uniqueness theorems, and no ansatz or renaming of known results is used to justify core claims. The central assertions are externally falsifiable via benchmarks on locomotion, navigation, and manipulation tasks.
Axiom & Free-Parameter Ledger
Axioms (2)
- Domain assumption: Batch 3D Gaussian Splatting can be integrated with a parallel physics engine while maintaining high-fidelity visual and physical synchronization at scale.
- Domain assumption: An automated Real2Sim workflow can produce photorealistic, physically consistent, and memory-efficient simulation environments suitable for policy transfer.