arXiv: 2604.11805 · v1 · submitted 2026-04-13 · cs.LG · cs.AI · cs.CV · cs.RO

Solving Physics Olympiad via Reinforcement Learning on Physics Simulators

Mihir Prabhudesai, Aryan Satpathy, Yangmin Li, Zheyang Qin, Nikash Bhardwaj, Amir Zadeh, Chuan Li, Katerina Fragkiadaki, Deepak Pathak

Pith reviewed 2026-05-10 15:45 UTC · model grok-4.3

keywords: physics simulators · reinforcement learning · synthetic data · LLM reasoning · sim-to-real transfer · IPhO · physical reasoning

The pith

Training large language models solely on synthetic data from physics simulators improves their performance on International Physics Olympiad problems by 5-10 percentage points.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that physics engines can generate unlimited synthetic question-answer pairs from random simulated scenes to train LLMs for physical reasoning via reinforcement learning. This sidesteps the shortage of real physics QA data on the internet and produces models that transfer directly to real olympiad questions without any real-world examples in training. The approach yields consistent gains across model sizes on the IPhO benchmark, suggesting simulators can scale supervision for scientific reasoning beyond human-curated sources.

Core claim

We generate random scenes in physics engines, create synthetic question-answer pairs from simulated interactions, and train LLMs using reinforcement learning on this synthetic data. Our models exhibit zero-shot sim-to-real transfer to real-world physics benchmarks: for example, training solely on synthetic simulated data improves performance on IPhO problems by 5-10 percentage points across model sizes.
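
To make the data-generation step concrete, here is a minimal sketch of turning a random simulated scene into a question-answer pair. Everything in it is an assumption for illustration: the toy projectile "simulator", the function names `sample_scene`, `simulate_range`, and `make_qa_pair`, and the question template are hypothetical and are not the paper's engine, DSL, or code.

```python
import math
import random

G = 9.81  # gravitational acceleration in m/s^2 (toy constant, an assumption)

def sample_scene(rng):
    """Draw a random launch speed (m/s) and launch angle (degrees)."""
    return {
        "speed": round(rng.uniform(5.0, 30.0), 1),
        "angle_deg": round(rng.uniform(15.0, 75.0), 1),
    }

def simulate_range(scene, dt=1e-4):
    """Step the projectile forward in time and return where it lands (m)."""
    theta = math.radians(scene["angle_deg"])
    vx = scene["speed"] * math.cos(theta)
    vy = scene["speed"] * math.sin(theta)
    x, y = 0.0, 0.0
    while True:
        vy_next = vy - G * dt      # semi-implicit Euler: update velocity first
        x_next = x + vx * dt
        y_next = y + vy_next * dt
        if y_next < 0.0:
            # Interpolate linearly to the instant the projectile reaches y = 0.
            frac = y / (y - y_next)
            return x + frac * (x_next - x)
        x, y, vy = x_next, y_next, vy_next

def make_qa_pair(rng):
    """Build one synthetic QA pair whose answer comes from the simulation."""
    scene = sample_scene(rng)
    question = (
        f"A ball is launched from level ground at {scene['speed']} m/s, "
        f"{scene['angle_deg']} degrees above the horizontal. Ignoring air "
        f"resistance and taking g = {G} m/s^2, how far away does it land, in metres?"
    )
    return {"scene": scene, "question": question,
            "answer": round(simulate_range(scene), 2)}
```

The answer is read off the simulated trajectory rather than a closed-form formula, which is the property that would let the same pattern extend to scenes with no simple analytic solution.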

What carries the argument

Reinforcement learning applied to synthetic question-answer pairs generated from random scenes in physics simulation engines, which produces zero-shot transfer to real physics problems.
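
Reinforcement learning on such pairs needs a reward that can be computed automatically from the simulator's answer. A minimal sketch of one plausible choice follows, assuming numeric answers and a binary reward with a relative tolerance; the names `extract_final_number` and `numeric_reward` and the 2% tolerance are hypothetical, and the paper's actual reward is not described in the text above.

```python
import re

# Matches plain or scientific-notation numbers such as 12, -3.5, or 4.1e-2.
_NUMBER = re.compile(r"[-+]?\d+(?:\.\d+)?(?:[eE][-+]?\d+)?")

def extract_final_number(text):
    """Return the last number that appears in the model output, or None."""
    matches = _NUMBER.findall(text)
    return float(matches[-1]) if matches else None

def numeric_reward(model_output, reference, rel_tol=0.02):
    """Binary reward: 1.0 if the final number is within rel_tol of the reference."""
    predicted = extract_final_number(model_output)
    if predicted is None:
        return 0.0
    scale = max(abs(reference), 1e-9)  # avoid dividing by zero for reference = 0
    return 1.0 if abs(predicted - reference) / scale <= rel_tol else 0.0
```

A reward of this shape only checks the final value, so it cannot by itself separate correct physics from a lucky guess; that limitation is one reason format effects are a live alternative explanation for the gains.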

If this is right

LLMs can acquire physical reasoning skills from synthetic data alone without internet-scale QA pairs.
Physics simulators act as scalable generators for training data in domains where real examples are scarce.
Zero-shot transfer from simulation to real benchmarks holds across different model sizes.
Reinforcement learning on simulator outputs enables reasoning beyond the limitations of existing question-answer datasets.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same simulator-driven RL pipeline could extend to other domains with accurate engines, such as chemistry or engineering mechanics.
Iterating the training loop inside the simulator might allow models to explore rare or extreme physical scenarios not present in real data.
Combining simulator training with minimal real-world fine-tuning could further close any remaining sim-to-real gaps on applied tasks.

Load-bearing premise

That measured gains on olympiad problems reflect genuine learning of physical principles instead of overfitting to simulator-specific patterns or unintended leakage from the test set.

What would settle it

Evaluating the trained models on a fresh set of olympiad problems that involve physical interactions or object types absent from the training simulators and finding no performance improvement over the base model.

Original abstract

We have witnessed remarkable advances in LLM reasoning capabilities with the advent of DeepSeek-R1. However, much of this progress has been fueled by the abundance of internet question-answer (QA) pairs, a major bottleneck going forward, since such data is limited in scale and concentrated mainly in domains like mathematics. In contrast, other sciences such as physics lack large-scale QA datasets to effectively train reasoning-capable models. In this work, we show that physics simulators can serve as a powerful alternative source of supervision for training LLMs for physical reasoning. We generate random scenes in physics engines, create synthetic question-answer pairs from simulated interactions, and train LLMs using reinforcement learning on this synthetic data. Our models exhibit zero-shot sim-to-real transfer to real-world physics benchmarks: for example, training solely on synthetic simulated data improves performance on IPhO (International Physics Olympiad) problems by 5-10 percentage points across model sizes. These results demonstrate that physics simulators can act as scalable data generators, enabling LLMs to acquire deep physical reasoning skills beyond the limitations of internet-scale QA data. Code available at: https://sim2reason.github.io/.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

RL on simulator-generated physics QA gives 5-10 point IPhO gains, but experimental details are sparse.

The one thing to know is that this paper trains LLMs with reinforcement learning on synthetic physics data from simulators and reports 5-10 point gains on real IPhO problems without any real-world training data. That is the core claim. The new part is the end-to-end setup: creating random scenes in physics engines, deriving QA pairs from the simulations, and using RL to optimize the model on that data. This sidesteps the scarcity of high-quality physics QA on the internet. The paper does a good job of showing that sim-to-real transfer is possible for reasoning tasks, not just perception or control. The authors also make the code available, which helps others test the approach.

On the downside, the abstract is light on specifics. We don't see the exact baselines, whether the authors compared against standard RL or supervised fine-tuning, or any checks for data contamination between the synthetic set and the olympiad problems. The gains could come from better handling of structured answers rather than deeper physics understanding. The stress test points this out accurately, and until these gaps are addressed in the full text, the result feels preliminary.

Readers working on LLM reasoning for STEM fields or synthetic data generation will find this relevant. It is the kind of paper that could spark follow-up work on using simulators for other sciences. I would recommend sending it to peer review: the problem it tackles is real and the method is straightforward to build on, even if the current evidence needs bolstering.

Referee Report

2 major / 2 minor

Summary. The paper claims that physics simulators can generate scalable synthetic QA pairs from random scenes and that training LLMs on these pairs via reinforcement learning produces zero-shot gains of 5-10 percentage points on International Physics Olympiad (IPhO) problems across model sizes. This is positioned as overcoming the scarcity of physics QA data relative to mathematics, with code released for reproducibility.

Significance. If the empirical transfer holds after proper controls, the approach would offer a concrete, simulator-driven alternative to internet-scale data for instilling physical reasoning in LLMs. The explicit release of code strengthens the contribution by enabling direct verification of the data-generation and RL pipeline.

major comments (2)

Abstract: the reported 5-10 percentage point IPhO gains are presented without any description of the baseline models, evaluation protocol (number of problems, prompting format), statistical tests, or variance across runs. These omissions are load-bearing because the central claim is that the gains reflect acquisition of physical reasoning rather than RL format effects or other confounds.
Experimental section: no quantitative details are supplied on decontamination between the synthetic scene distribution and IPhO problem statements, nor on ablations that isolate the contribution of the physics simulator versus generic RL on structured QA. Without these, the sim-to-real transfer cannot be distinguished from alternative explanations.
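
One way the requested statistical test could be run is a paired bootstrap over the benchmark problems. The sketch below is an assumption about how such a check might look, not something the paper reports; `paired_bootstrap_gain` is a hypothetical helper that takes per-problem 0/1 correctness for the base and trained models on the same problems.

```python
import random

def paired_bootstrap_gain(base_correct, trained_correct, n_resamples=10000, seed=0):
    """Return (observed gain, 2.5th percentile, 97.5th percentile) of the
    accuracy difference, resampling problems with replacement."""
    if len(base_correct) != len(trained_correct) or not base_correct:
        raise ValueError("need two non-empty score lists of equal length")
    rng = random.Random(seed)
    n = len(base_correct)
    # Per-problem difference keeps the pairing between the two models.
    diffs = [t - b for b, t in zip(base_correct, trained_correct)]
    observed = sum(diffs) / n
    samples = []
    for _ in range(n_resamples):
        resample = [diffs[rng.randrange(n)] for _ in range(n)]
        samples.append(sum(resample) / n)
    samples.sort()
    lower = samples[int(0.025 * n_resamples)]
    upper = samples[int(0.975 * n_resamples) - 1]
    return observed, lower, upper
```

If the resulting interval excludes zero, the gain is unlikely to be an artifact of which problems happened to be in the evaluation set; it says nothing about variance across training runs, which would need repeated runs with different seeds.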

minor comments (2)

The code link is provided, but the manuscript should include a brief summary of the repository contents (e.g., scene generator, reward model, training scripts) to guide readers.
Notation for the RL objective and synthetic QA generation process could be formalized with a short equation or pseudocode block for clarity.
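
For orientation, a generic form such an objective could take is sketched below. This is a standard expected-reward formulation written under assumed notation, not the paper's own: $\mathcal{D}_{\mathrm{sim}}$ denotes the distribution of simulator-generated question-answer pairs $(q, a^{*})$, $\pi_{\theta}$ the LLM policy, $y$ a sampled response, and $r$ a reward that compares the response with the simulator's answer.

```latex
\max_{\theta}\; J(\theta)
  = \mathbb{E}_{(q,\,a^{*}) \sim \mathcal{D}_{\mathrm{sim}}}\;
    \mathbb{E}_{y \sim \pi_{\theta}(\cdot \mid q)}
    \bigl[\, r(y, a^{*}) \,\bigr]
```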

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below and commit to revisions that will strengthen the presentation of our results and controls.

Referee: Abstract: the reported 5-10 percentage point IPhO gains are presented without any description of the baseline models, evaluation protocol (number of problems, prompting format), statistical tests, or variance across runs. These omissions are load-bearing because the central claim is that the gains reflect acquisition of physical reasoning rather than RL format effects or other confounds.

Authors: We agree that the abstract should provide these details to support the central claim. In the revised manuscript we will expand the abstract to specify the baseline models (pre-RL LLMs and relevant controls), the evaluation protocol (exact number of IPhO problems, zero-shot prompting format), and report mean performance with standard deviation across multiple runs together with statistical significance tests. These additions will clarify that the observed gains arise from physical reasoning acquired via simulation rather than format or other artifacts.
revision: yes

Referee: Experimental section: no quantitative details are supplied on decontamination between the synthetic scene distribution and IPhO problem statements, nor on ablations that isolate the contribution of the physics simulator versus generic RL on structured QA. Without these, the sim-to-real transfer cannot be distinguished from alternative explanations.

Authors: We acknowledge the necessity of these controls. We will add a dedicated subsection on data decontamination that reports quantitative metrics (e.g., embedding cosine similarity and n-gram overlap statistics) between the synthetic QA distribution and IPhO problem statements. We will also include ablation experiments that compare our physics-simulator RL pipeline against generic RL training on structured QA data lacking simulator-generated physics, thereby isolating the simulator's contribution to the transfer gains.
revision: yes
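
The n-gram overlap statistic mentioned in the response can be illustrated with a short sketch. It assumes whitespace tokenization and word trigrams, both arbitrary choices made here for illustration; `word_ngrams`, `ngram_overlap`, and `max_overlap` are hypothetical names, not functions from the paper's repository.

```python
def word_ngrams(text, n=3):
    """Set of lowercase word n-grams in a text (whitespace tokenization)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def ngram_overlap(candidate, reference, n=3):
    """Fraction of the candidate's n-grams that also occur in the reference."""
    cand = word_ngrams(candidate, n)
    if not cand:
        return 0.0  # candidate shorter than n words
    return len(cand & word_ngrams(reference, n)) / len(cand)

def max_overlap(candidate, benchmark, n=3):
    """Highest overlap between one synthetic question and any benchmark problem."""
    return max((ngram_overlap(candidate, ref, n) for ref in benchmark), default=0.0)
```

Reporting the distribution of `max_overlap` over the synthetic set would show whether any generated question closely mirrors a benchmark problem. Surface overlap of this kind would not catch paraphrased leakage, which is where the embedding-similarity check comes in.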

Circularity Check

0 steps flagged

No significant circularity identified

The paper describes an empirical pipeline: random scene generation in physics engines, creation of synthetic QA pairs, RL training on that data, and zero-shot evaluation on IPhO benchmarks. No equations, fitted parameters, or theoretical derivations are presented that could reduce to self-definitional inputs or self-citation chains. The central claim rests on reported performance deltas from sim-to-real transfer, which are externally falsifiable via controls and ablations rather than being forced by construction or renamed known results. The work is self-contained as an experimental demonstration without load-bearing self-referential steps.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No mathematical derivations, free parameters, or new entities are introduced; the work is an empirical ML experiment relying on standard RL and simulation assumptions.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · relation: unclear

Paper passage: We generate random scenes in physics engines, create synthetic question-answer pairs from simulated interactions, and train LLMs using reinforcement learning on this synthetic data.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · relation: unclear

Paper passage: training solely on synthetic simulated data improves performance on IPhO mechanics problems by 5–10 percentage points

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
