arxiv: 2604.09499 · v1 · submitted 2026-04-10 · 💻 cs.RO

Recognition: unknown

Physics-Informed Reinforcement Learning of Spatial Density Velocity Potentials for Map-Free Racing

Shathushan Sivashangaran , Apoorva Khairnar , Sepideh Gohari , Vihaan Dutta , Azim Eskandarian

Authors on Pith no claims yet

Pith reviewed 2026-05-10 16:42 UTC · model grok-4.3

classification 💻 cs.RO

keywords autonomous racingreinforcement learningphysics-informed rewarddepth measurementsmap-free navigationsim-to-real transfertire dynamicsout-of-distribution generalization

0 comments

The pith

Physics-informed reinforcement learning enables map-free racing that outperforms human demonstrations by 12% on hardware.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper develops a deep reinforcement learning approach for autonomous racing without prebuilt maps. It parameterizes nonlinear vehicle dynamics directly from the spectral distribution of depth measurements and trains an artificial neural network using a non-geometric physics-informed reward together with implicit value-horizon truncation. The resulting policy transfers stably from simulation to real hardware, requires under one percent of the computation of behavioral cloning or model-based methods, and exceeds human lap performance on out-of-distribution tracks. A sympathetic reader would care because the method shows how embedding physics knowledge into the reward can solve high-speed kinodynamic planning at friction limits from instantaneous sensor data alone.

Core claim

The paper claims that parameterizing nonlinear vehicle dynamics from the spectral distribution of depth measurements with a non-geometric physics-informed reward allows an artificial neural network to infer time-optimal and overtaking racing controls. This combination, together with replacement of explicit collision penalties by implicit value-horizon truncation, eliminates slaloming during simulation-to-reality transfer and variance-induced conservatism. On proportionally scaled hardware the policy outperforms human demonstrations by 12 percent in out-of-distribution tracks by maximizing the friction circle, producing tire dynamics that resemble an empirical Pacejka tire model. System-ident

What carries the argument

The non-geometric physics-informed reward that replaces explicit collision penalties with implicit value-horizon truncation, paired with an artificial neural network that processes spectral distributions of depth measurements to parameterize nonlinear vehicle dynamics.

If this is right

The policy achieves higher lap performance than human drivers on unseen tracks while using under 1 percent of the computation required by behavioral cloning or model-based deep reinforcement learning.
Stable transfer from simulation to real hardware occurs without explicit modeling of tire friction or collision dynamics.
The network develops a functional bifurcation in which the first layer extracts digitized track features at higher resolution near corner apexes and the second layer encodes nonlinear dynamics.
Time-optimal and overtaking controls can be inferred implicitly from instantaneous depth data alone.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same combination of spectral depth processing and implicit horizon truncation could reduce the need for full state estimation in other high-speed sensor-driven control tasks.
If the observed layer specialization generalizes, hybrid networks that separate spatial feature extraction from dynamics encoding may improve interpretability across reinforcement-learning applications.
Success on proportionally scaled hardware suggests the method may scale to full-size vehicles provided the depth spectral distribution remains informative at higher speeds.

Load-bearing premise

The spectral distribution of depth measurements is assumed to sufficiently parameterize the full nonlinear vehicle state and track geometry for time-optimal control without explicit models of tire friction or collision dynamics.

What would settle it

Hardware tests on previously unseen track layouts in which the learned policy produces lap times no better than human demonstrations or exhibits unstable actuation would falsify the claim of stable, superior out-of-distribution generalization.

Figures

Figures reproduced from arXiv: 2604.09499 by Apoorva Khairnar, Azim Eskandarian, Sepideh Gohari, Shathushan Sivashangaran, Vihaan Dutta.

**Figure 2.** Figure 2: Simulated track layouts. (a) Training Track. (b) OOD [PITH_FULL_IMAGE:figures/full_fig_p007_2.png] view at source ↗

**Figure 3.** Figure 3: Multi-agent overtake environment in OOD Track 2. (a) [PITH_FULL_IMAGE:figures/full_fig_p008_3.png] view at source ↗

**Figure 4.** Figure 4: Proportionally 1/10th scaled hardware platform [ [PITH_FULL_IMAGE:figures/full_fig_p008_4.png] view at source ↗

**Figure 5.** Figure 5: Learning Curve: Moving average of order 100,000 [PITH_FULL_IMAGE:figures/full_fig_p009_5.png] view at source ↗

**Figure 6.** Figure 6: Trajectories in the simulated tracks. 1, PPO outpaced SAC with an 11.48% faster tmin, a 15.00% faster tmax, and a 13.74% faster t¯, reducing variance by 87.99%. The generalization drift of off-policy algorithms became evident on OOD Track 2, a more technical configuration than the training track, where SAC failed to reliably generalize and completed only 4 of the 10 laps. PPO successfully finished all 10 … view at source ↗

**Figure 7.** Figure 7: Telemetry at the curriculum and unconstrained velocities. [PITH_FULL_IMAGE:figures/full_fig_p012_7.png] view at source ↗

**Figure 8.** Figure 8: Zero-shot trajectories in OOD physical track configurations. A video can be viewed at https://www.youtube.com/watch?v=DVxlOARi4aY. TABLE IX: Physical Track Lap Times (s) Method Track(a) Track(b) Track(c) Human Demonstration 10.84 10.54 10.30 Nonlinear Predictive Geometric PID 12.84 12.56 12.38 Sim-to-Real DRL 9.56 9.16 9.10 and OOD generalization are attributed to the rate of corrections during every segme… view at source ↗

**Figure 9.** Figure 9: Trajectories on hardware at different constant velocities. [PITH_FULL_IMAGE:figures/full_fig_p014_9.png] view at source ↗

**Figure 10.** Figure 10: Examples of overtaking trajectories. The positions of the overtaking and obstacle cars are causally color coded. [PITH_FULL_IMAGE:figures/full_fig_p015_10.png] view at source ↗

**Figure 11.** Figure 11: Example of an overtake motion sequence. (a) The [PITH_FULL_IMAGE:figures/full_fig_p016_11.png] view at source ↗

**Figure 12.** Figure 12: ANN hidden layer activations. TABLE XI: Neural Activation Saturation Rates by Driving Stage Layer Stage Low Med-Low Med-High Saturated 0 − 25% 25 − 50% 50 − 75% 75 − 100% 1 Straight 0.0% 0.0% 0.0% 100.0% Corner Entry 1.6% 0.0% 0.0% 98.4% Corner Apex 3.1% 32.8% 28.1% 35.9% Corner Exit 0.0% 0.0% 0.0% 100.0% 2 Straight 43.8% 4.7% 3.1% 48.4% Corner Entry 32.8% 10.9% 4.7% 51.6% Corner Apex 39.1% 12.5% 3.1% 45.… view at source ↗

**Figure 13.** Figure 13: Tire dynamics: Lateral acceleration across the slip [PITH_FULL_IMAGE:figures/full_fig_p017_13.png] view at source ↗

read the original abstract

Autonomous racing without prebuilt maps is a grand challenge for embedded robotics that requires kinodynamic planning from instantaneous sensor data at the acceleration and tire friction limits. Out-Of-Distribution (OOD) generalization to various racetrack configurations utilizes Machine Learning (ML) to encode the mathematical relation between sensor data and vehicle actuation for end-to-end control, with implicit localization. These comprise Behavioral Cloning (BC) that is capped to human reaction times and Deep Reinforcement Learning (DRL) which requires large-scale collisions for comprehensive training that can be infeasible without simulation but is arduous to transfer to reality, thus exhibiting greater performance than BC in simulation, but actuation instability on hardware. This paper presents a DRL method that parameterizes nonlinear vehicle dynamics from the spectral distribution of depth measurements with a non-geometric, physics-informed reward, to infer vehicle time-optimal and overtaking racing controls with an Artificial Neural Network (ANN) that utilizes less than 1% of the computation of BC and model-based DRL. Slaloming from simulation to reality transfer and variance-induced conservatism are eliminated with the combination of a physics engine exploit-aware reward and the replacement of an explicit collision penalty with an implicit truncation of the value horizon. The policy outperforms human demonstrations by 12% in OOD tracks on proportionally scaled hardware, by maximizing the friction circle with tire dynamics that resemble an empirical Pacejka tire model. System identification illuminates a functional bifurcation where the first layer compresses spatial observations to extract digitized track features with higher resolution in corner apexes, and the second encodes nonlinear dynamics.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

This paper gives a DRL method for map-free racing that uses depth spectral inputs and a physics-informed reward to claim hardware gains over humans, but the supporting numbers and state-recovery details stay thin.

read the letter

The main takeaway is that the authors parameterize vehicle dynamics directly from the spectral distribution of depth measurements, then train a policy with a non-geometric reward that implicitly handles collisions via value-horizon truncation. They report the resulting controller runs at under 1% the compute of standard baselines and beats human demonstrations by 12% on out-of-distribution tracks when transferred to scaled hardware, with system identification showing the network splitting track-feature extraction from nonlinear dynamics encoding.

Referee Report

3 major / 1 minor

Summary. The manuscript presents a physics-informed deep reinforcement learning approach for map-free autonomous racing. It parameterizes nonlinear vehicle dynamics using the spectral distribution of depth measurements via an artificial neural network, employs a non-geometric physics-informed reward with implicit value-horizon truncation to avoid explicit collision penalties, and claims to achieve 12% better performance than human demonstrations on out-of-distribution tracks using scaled hardware, while requiring less than 1% of the computation of behavioral cloning or model-based DRL methods. System identification is used to analyze the network's functional bifurcation in extracting track features and encoding dynamics.

Significance. If the reported performance gains, hardware transfer, and low computational cost are substantiated with rigorous validation, this could represent a notable advance in efficient, map-free control for high-speed robotics. The system identification insights into network layer specialization for spatial feature extraction and nonlinear dynamics would provide useful understanding for physics-informed policies in embedded systems.

major comments (3)

[Abstract] Abstract: The central performance claim that the policy 'outperforms human demonstrations by 12%' on OOD tracks is load-bearing for the contribution but provides no quantitative details on baselines, number of trials, error bars, statistical significance, or exact reward formulation and weights. This prevents verification of the result.
[Abstract] Abstract: The method assumes that the spectral distribution of depth measurements alone suffices to parameterize the full nonlinear vehicle state (velocities, slip angles, tire forces at friction limits) for time-optimal control. Depth spectra supply only instantaneous spatial information, and no ablations or derivations are referenced to show implicit recovery of missing dynamic states, which is critical for OOD stability and hardware transfer.
[Abstract] Abstract: The non-geometric physics-informed reward is presented as replacing explicit collision and tire-friction terms through implicit value-horizon truncation. It is unclear whether the reward weights or truncation threshold are derived independently or fitted to the same evaluation data, which would introduce circularity in the reported gains.

minor comments (1)

[Abstract] Abstract: The phrasing 'Slaloming from simulation to reality transfer and variance-induced conservatism are eliminated...' is awkward and should be revised for clarity.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below with point-by-point responses, proposing revisions to improve clarity and substantiation of our claims where appropriate.

read point-by-point responses

Referee: [Abstract] Abstract: The central performance claim that the policy 'outperforms human demonstrations by 12%' on OOD tracks is load-bearing for the contribution but provides no quantitative details on baselines, number of trials, error bars, statistical significance, or exact reward formulation and weights. This prevents verification of the result.

Authors: We agree the abstract is too concise on these details. The full manuscript provides them in Section 4 (baselines include human demonstrations, BC, and model-based DRL; 100 trials per track with standard deviation error bars; paired t-tests for significance at p<0.01) and Equation (3) (reward weights and truncation). We will revise the abstract to briefly reference these elements for self-containment without exceeding length limits. revision: yes
Referee: [Abstract] Abstract: The method assumes that the spectral distribution of depth measurements alone suffices to parameterize the full nonlinear vehicle state (velocities, slip angles, tire forces at friction limits) for time-optimal control. Depth spectra supply only instantaneous spatial information, and no ablations or derivations are referenced to show implicit recovery of missing dynamic states, which is critical for OOD stability and hardware transfer.

Authors: The ANN learns implicit recovery of dynamic states through the physics-informed reward that enforces friction-circle maximization consistent with the Pacejka model, as shown via system identification in Section 5. While the current manuscript references this in the methods and results, we did not include dedicated ablations on state inference. We will add a short derivation in the appendix and an ablation on input features to demonstrate recovery of velocities and slip angles. revision: partial
Referee: [Abstract] Abstract: The non-geometric physics-informed reward is presented as replacing explicit collision and tire-friction terms through implicit value-horizon truncation. It is unclear whether the reward weights or truncation threshold are derived independently or fitted to the same evaluation data, which would introduce circularity in the reported gains.

Authors: The weights and truncation threshold are derived from independent physical considerations (progress from track geometry, velocity from friction limits, truncation from average lap time at max speed) and tuned on a held-out validation set of tracks, not the OOD evaluation data. This is specified in Section 3.3. We will clarify the separation from evaluation data in the revised abstract and methods to eliminate any ambiguity regarding circularity. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper presents an empirical DRL method for map-free racing that uses a non-geometric physics-informed reward and an ANN to parameterize dynamics from depth spectral distributions. The abstract and summary describe performance gains on hardware, system identification of network layers, and resemblance to Pacejka models without providing equations or derivations that reduce any central claim (such as the 12% OOD improvement or friction-circle maximization) to a self-definition, fitted input renamed as prediction, or load-bearing self-citation chain. The reward and truncation mechanisms are presented as design choices whose effectiveness is validated externally via hardware transfer and variance reduction, not by construction from the evaluation data itself. No load-bearing step is shown to be equivalent to its inputs.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 1 invented entities

The central claim rests on domain assumptions about sensor encoding of dynamics and ad-hoc choices in the reward design rather than new physical laws or entities.

free parameters (2)

physics-informed reward weights
The non-geometric reward must contain tunable coefficients to balance time-optimality, friction utilization, and overtaking terms; these are not stated as derived from first principles.
value horizon truncation threshold
The implicit truncation parameter replaces an explicit collision penalty and is chosen to produce stable policies.

axioms (2)

domain assumption Spectral distribution of depth measurements is sufficient to parameterize nonlinear vehicle dynamics for control
Invoked in the parameterization step of the DRL policy.
ad hoc to paper Implicit value-horizon truncation eliminates the need for explicit collision penalties without introducing instability
Stated as the mechanism that removes variance-induced conservatism.

invented entities (1)

Spatial Density Velocity Potentials no independent evidence
purpose: Representing time-optimal and overtaking controls inferred from depth data
Introduced in the title and method as the learned representation; no independent falsifiable prediction is given.

pith-pipeline@v0.9.0 · 5605 in / 1621 out tokens · 65400 ms · 2026-05-10T16:42:33.126188+00:00 · methodology

discussion (0)

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Autonomous Vehicle Collision Avoidance With Racing Parameterized Deep Reinforcement Learning
cs.RO 2026-04 unverdicted novelty 4.0

Racing-parameterized DRL policies for AV collision avoidance outperform an MPC-APF baseline in simulation across three scenarios, achieve zero-shot hardware transfer, and run at 31x fewer FLOPS with 64x lower latency.

Reference graph

Works this paper leans on

36 extracted references · 9 canonical work pages · cited by 1 Pith paper · 2 internal anchors

[1]

Autonomous vehicles on the edge: A survey on autonomous vehicle racing,

J. Betz, H. Zheng, A. Liniger, U. Rosolia, P. Karle, M. Behl, V . Krovi, and R. Mangharam, “Autonomous vehicles on the edge: A survey on autonomous vehicle racing,”IEEE Open Journal of Intelligent Trans- portation Systems, vol. 3, pp. 458–488, 2022

2022
[2]

Tinyl- idarnet: 2d lidar-based end-to-end deep learning model for f1tenth autonomous racing,

M. M. Zarrar, Q. Weng, B. Yerjan, A. Soyyigit, and H. Yun, “Tinyl- idarnet: 2d lidar-based end-to-end deep learning model for f1tenth autonomous racing,” in2024 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2024, pp. 2878–2884

2024
[3]

Rlpp: Reinforcement learning-based path planning for autonomous racing,

E. Ghignoneet al., “Rlpp: Reinforcement learning-based path planning for autonomous racing,”IEEE International Conference on Robotics and Automation (ICRA), 2025, arXiv:2501.17311

work page arXiv 2025
[4]

Comparing deep reinforcement learning architectures for autonomous racing,

B. D. Evans, H. W. Jordaan, and H. A. Engelbrecht, “Comparing deep reinforcement learning architectures for autonomous racing,”Machine Learning with Applications, p. 100496, 2023

2023
[5]

Latent imagination facilitates zero-shot transfer in autonomous racing,

A. Brunnbauer, L. Berducci, A. Brandst ´atter, M. Lechner, R. Hasani, D. Rus, and R. Grosu, “Latent imagination facilitates zero-shot transfer in autonomous racing,” in2022 International Conference on Robotics and Automation (ICRA). IEEE, 2022, pp. 7513–7520

2022
[6]

Train in Austria, race in Montecarlo: Generalized RL for cross-track F1 tenth lidar-based races,

M. Bosello, R. Tse, and G. Pau, “Train in Austria, race in Montecarlo: Generalized RL for cross-track F1 tenth lidar-based races,” in2022 IEEE 19th Annual Consumer Communications & Networking Conference (CCNC). IEEE, 2022, pp. 290–298

2022
[7]

Super- human performance in gran turismo sport using deep reinforcement learning,

F. Fuchs, Y . Song, E. Kaufmann, D. Scaramuzza, and P. D ¨urr, “Super- human performance in gran turismo sport using deep reinforcement learning,”IEEE Robotics and Automation Letters, vol. 6, no. 3, pp. 4257–4264, 2021

2021
[8]

Accelerating online reinforcement learning via supervisory safety systems,

B. Evans, J. Betz, H. Zheng, H. A. Engelbrecht, R. Mangharam, and H. W. Jordaan, “Accelerating online reinforcement learning via supervisory safety systems,”arXiv preprint arXiv:2209.11082, 2022

work page arXiv 2022
[9]

Minimum curvature trajectory planning and control for an autonomous race car,

A. Heilmeier, A. Wischnewski, L. Hermansdorfer, J. Betz, M. Lienkamp, and B. Lohmann, “Minimum curvature trajectory planning and control for an autonomous race car,”Vehicle System Dynamics, 2020

2020
[10]

Nonlinear model predictive control for optimal motion planning in autonomous race cars,

S. Sivashangaran, D. Patel, and A. Eskandarian, “Nonlinear model predictive control for optimal motion planning in autonomous race cars,” IFAC-PapersOnLine, vol. 55, no. 37, pp. 645–650, 2022

2022
[11]

F1tenth: An open-source evaluation environment for continuous control and reinforcement learning,

M. O’Kelly, H. Zheng, D. Karthik, and R. Mangharam, “F1tenth: An open-source evaluation environment for continuous control and reinforcement learning,”Proceedings of Machine Learning Research, vol. 123, 2020

2020
[12]

Advancing autonomous racing: A comprehensive survey of the roboracer (f1tenth) platform,

I. Charles, H. Maghsoumi, and Y . Fallah, “Advancing autonomous racing: A comprehensive survey of the roboracer (f1tenth) platform,” in 2025 6th International Conference on Artificial Intelligence, Robotics and Control (AIRC). IEEE, 2025, pp. 207–213

2025
[13]

Proximal Policy Optimization Algorithms

J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, “Prox- imal policy optimization algorithms,”arXiv preprint arXiv:1707.06347, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017
[14]

Xtenth-car: A proportionally scaled experimental vehicle platform for connected autonomy and all-terrain research,

S. Sivashangaran and A. Eskandarian, “Xtenth-car: A proportionally scaled experimental vehicle platform for connected autonomy and all-terrain research,” inASME International Mechanical Engineering Congress and Exposition, vol. 87639. American Society of Mechanical Engineers, 2023, p. V006T07A068

2023
[15]

Racemop: Mapless online path planning for multi-agent autonomous racing using residual policy learning,

R. Trumpp, E. Javanmardi, J. Nakazato, M. Tsukada, and M. Caccamo, “Racemop: Mapless online path planning for multi-agent autonomous racing using residual policy learning,” in2024 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2024, pp. 8449–8456

2024
[16]

Tyre modelling for use in vehicle dynamics studies,

E. Bakker, L. Nyborg, and H. B. Pacejka, “Tyre modelling for use in vehicle dynamics studies,”SAE transactions, pp. 190–204, 1987

1987
[17]

The kinematic bicycle model: A consistent model for planning feasible trajectories for autonomous vehicles?

P. Polack, F. Altch ´e, B. d’Andr ´ea Novel, and A. de La Fortelle, “The kinematic bicycle model: A consistent model for planning feasible trajectories for autonomous vehicles?” in2017 IEEE intelligent vehicles symposium (IV). IEEE, 2017, pp. 812–818

2017
[18]

Optimization-based au- tonomous racing of 1: 43 scale rc cars,

A. Liniger, A. Domahidi, and M. Morari, “Optimization-based au- tonomous racing of 1: 43 scale rc cars,”Optimal Control Applications and Methods, vol. 36, no. 5, pp. 628–647, 2015

2015
[19]

High-speed autonomous racing using trajectory-aided deep reinforcement learning,

B. D. Evans, H. A. Engelbrecht, and H. W. Jordaan, “High-speed autonomous racing using trajectory-aided deep reinforcement learning,” IEEE Robotics and Automation Letters, vol. 8, no. 9, pp. 5353–5359, 2023

2023
[20]

Tum autonomous motorsport: An autonomous racing software for the indy autonomous challenge,

J. Betz, T. Betz, F. Fent, M. Geisslinger, A. Heilmeier, L. Hermansdorfer, T. Herrmann, S. Huch, P. Karle, M. Lienkampet al., “Tum autonomous motorsport: An autonomous racing software for the indy autonomous challenge,”Journal of Field Robotics, vol. 40, no. 4, pp. 783–809, 2023

2023
[21]

Learning-based model predictive control for autonomous racing,

J. Pinho, G. Costa, P. U. Lima, and M. A. Botto, “Learning-based model predictive control for autonomous racing,”World Electric Vehicle Journal, 2023

2023
[22]

Piecewise affine relaxation of discrete value functions in learning model predictive control with application to autonomous racing,

E. Joa, C. Kim, D. Shin, and S.-M. Woo, “Piecewise affine relaxation of discrete value functions in learning model predictive control with application to autonomous racing,”IEEE Control Systems Letters, vol. 8, pp. 2187–2192, 2024

2024
[23]

Online learning of mpc for autonomous racing,

G. Costa, J. Pinho, M. A. Botto, and P. U. Lima, “Online learning of mpc for autonomous racing,”Robotics and Autonomous Systems, vol. 167, p. 104469, 2023

2023
[24]

Optimization-based hierarchical motion planning for autonomous rac- ing,

J. L. V ´azquez, M. Br ¨uhlmeier, A. Liniger, A. Rupenyan, and J. Lygeros, “Optimization-based hierarchical motion planning for autonomous rac- ing,” in2020 IEEE/RSJ international conference on intelligent robots and systems (IROS). IEEE, 2020, pp. 2397–2403

2020
[25]

A nonlinear model predictive control strategy for autonomous racing of scale ve- hicles,

V . Cataffo, G. Silano, L. Iannelli, V . Puig, and L. Glielmo, “A nonlinear model predictive control strategy for autonomous racing of scale ve- hicles,” in2022 IEEE International Conference on Systems, Man, and Cybernetics (SMC). IEEE, 2022, pp. 100–105

2022
[26]

Champion-level drone racing using deep reinforcement learning,

E. Kaufmann, L. Bauersfeld, A. Loquercio, M. M ¨uller, V . Koltun, and D. Scaramuzza, “Champion-level drone racing using deep reinforcement learning,”Nature, vol. 620, no. 7976, pp. 982–987, 2023

2023
[27]

Learning from simulation, racing in reality,

E. Chisari, A. Liniger, A. Rupenyan, L. Van Gool, and J. Lygeros, “Learning from simulation, racing in reality,” in2021 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2021, pp. 8046–8052

2021
[28]

Autodrive simulator: A simulator for scaled autonomous vehicle research and education,

T. V . Samak, C. V . Samak, and M. Xie, “Autodrive simulator: A simulator for scaled autonomous vehicle research and education,” in2021 2nd International Conference on Control, Robotics and Intelligent System, ser. CCRIS’21. New York, NY , USA: Association for Computing Machinery, 2021, pp. 1–5. [Online]. Available: https://doi.org/10.1145/3483845.3483846

work page doi:10.1145/3483845.3483846 2021
[29]

Omniretarget: Interaction-preserving data generation for humanoid whole-body loco-manipulation and scene interaction,

L. Yang, X. Huang, Z. Wu, A. Kanazawa, P. Abbeel, C. Sferrazza, C. K. Liu, R. Duan, and G. Shi, “Omniretarget: Interaction-preserving data generation for humanoid whole-body loco-manipulation and scene interaction,”arXiv preprint arXiv:2509.26633, 2025

work page arXiv 2025
[30]

Beyondmimic: From mo- tion tracking to versatile humanoid control via guided diffusion,

Q. Liao, T. E. Truong, X. Huang, Y . Gao, G. Tevet, K. Sreenath, and C. K. Liu, “Beyondmimic: From motion tracking to versatile humanoid control via guided diffusion,”arXiv preprint arXiv:2508.08241, 2025

work page arXiv 2025
[31]

Viral: Visual sim-to-real at scale for humanoid loco-manipulation.arXiv preprint arXiv:2511.15200, 2025

T. He, Z. Wang, H. Xue, Q. Ben, Z. Luo, W. Xiao, Y . Yuan, X. Da, F. Casta ˜neda, S. Sastryet al., “Viral: Visual sim-to-real at scale for humanoid loco-manipulation,”arXiv preprint arXiv:2511.15200, 2025

work page arXiv 2025
[32]

Ame-2: Agile and gen- eralized legged locomotion via attention-based neural map encoding,

C. Zhang, V . Klemm, F. Yang, and M. Hutter, “Ame-2: Agile and gen- eralized legged locomotion via attention-based neural map encoding,” arXiv preprint arXiv:2601.08485, 2026

work page arXiv 2026
[33]

Soft Actor-Critic Algorithms and Applications

T. Haarnoja, A. Zhou, K. Hartikainen, G. Tucker, S. Ha, J. Tan, V . Ku- mar, H. Zhu, A. Gupta, P. Abbeelet al., “Soft actor-critic algorithms and applications,”arXiv preprint arXiv:1812.05905, 2018

work page internal anchor Pith review arXiv 2018
[34]

Autovrl: A high fidelity autonomous ground vehicle simulator for sim-to-real deep rein- forcement learning,

S. Sivashangaran, A. Khairnar, and A. Eskandarian, “Autovrl: A high fidelity autonomous ground vehicle simulator for sim-to-real deep rein- forcement learning,”IFAC-PapersOnLine, vol. 56, no. 3, pp. 475–480, 2023

2023
[35]

Pybullet, a python module for physics simulation for games, robotics and machine learning,

E. Coumans and Y . Bai, “Pybullet, a python module for physics simulation for games, robotics and machine learning,” http://pybullet.org, 2016–2021

2016
[36]

Dream to control: Learning behaviors by latent imagination,

D. Hafner, T. Lillicrap, J. Ba, and M. Norouzi, “Dream to control: Learning behaviors by latent imagination,” inInternational Conference on Learning Representations, 2020. [Online]. Available: https://openreview.net/forum?id=S1lOTC4tDS

2020