Hierarchical Experimentalist Agents

Abhranil Chandra; Sankaran Vaidyanathan; Scott Niekum; Utsav Dhanuka; Varun Gandhi

arxiv: 2606.29315 · v1 · pith:N7IAKPLZnew · submitted 2026-06-28 · 💻 cs.AI · cs.LG

Pith reviewed 2026-06-30 07:27 UTC · model grok-4.3

keywords: hierarchical agents · active experimentation · LLM agents · physics simulation · skill library · Interphyre benchmark · in-context self-improvement · tool-calling agents

The pith

Hierarchical Experimentalist Agents let LLMs learn reusable skills through active experimentation, raising success on the hardest physics tasks from 2 percent to as high as 77 percent.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Hierarchical Experimentalist Agents as an in-context framework that lets language models handle novel domains by designing experiments, extracting skills from results, and reusing them. It demonstrates this on a new physics simulation benchmark where standard agents fail but the new loop succeeds. The approach requires no model training, external data, or supervision. A sympathetic reader would care because it shows a path for agents to discover rather than recall the information needed for long-horizon physical tasks.

Core claim

HExA iteratively designs and refines query-relevant experiments, learns a reusable library of composable skills from experience, and integrates experimental evidence to answer queries or take actions. On the Interphyre benchmark the method raises the success rate of the same base model from 2 percent to as high as 77 percent on the hardest levels. Skills acquired only from easier levels transfer to achieve 44 percent success on harder levels without any further active experimentation.

What carries the argument

The hierarchical loop that alternates between experiment design in simulation, extraction of composable skills into a growing library, and integration of evidence to solve the original query.
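
The loop described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: `Skill`, `SkillBank`, `run_meta_episode`, the toy actor and evolver, and the cap of 8 skills are all hypothetical. In the paper the actor and evolver are LLM calls and the environment is Interphyre; here a single number stands in for a placement.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Skill:
    text: str      # what the skill says (here: a placement that worked)
    reward: float  # how well it has paid off so far


@dataclass
class SkillBank:
    max_skills: int = 8  # hypothetical cap on bank size
    skills: List[Skill] = field(default_factory=list)

    def update(self, candidates: List[Skill]) -> None:
        """Merge candidates by text, keep the best reward, prune to the cap."""
        merged = {s.text: s for s in self.skills}
        for c in candidates:
            if c.text not in merged or c.reward > merged[c.text].reward:
                merged[c.text] = c
        ranked = sorted(merged.values(), key=lambda s: s.reward, reverse=True)
        self.skills = ranked[: self.max_skills]

    def retrieve(self, k: int) -> List[Skill]:
        """Return the k highest-reward skills for injection into the actor."""
        return self.skills[:k]


Trajectory = List[Tuple[float, bool]]  # (placement tried, success flag)


def toy_actor(seed: int, hints: List[Skill]) -> Trajectory:
    """Try hinted placements first, then a fixed sweep; stop at first success."""
    guesses = [float(h.text) for h in hints] + [0.1, 0.4, 0.7]
    trajectory = []
    for g in guesses:
        ok = abs(g - 0.7) < 1e-9  # toy success condition (assumption)
        trajectory.append((g, ok))
        if ok:
            break
    return trajectory


def toy_evolver(trajectory: Trajectory) -> List[Skill]:
    """Distill successful placements into candidate skills."""
    return [Skill(text=str(a), reward=1.0) for a, ok in trajectory if ok]


def run_meta_episode(seeds, actor, evolver, bank: SkillBank, k: int = 3) -> int:
    """Actor -> evolver -> retriever loop over a group of seeds."""
    solved = 0
    for seed in seeds:
        trajectory = actor(seed, bank.retrieve(k))
        solved += any(ok for _, ok in trajectory)
        bank.update(evolver(trajectory))
    return solved


bank = SkillBank()
solved = run_meta_episode(range(3), toy_actor, toy_evolver, bank)
```

In this toy setting the first seed needs the full sweep, while later seeds succeed on the first attempt because the retrieved skill is tried first. That is the efficiency effect the page attributes to accumulated experience, reproduced only in miniature.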

If this is right

Agents can transfer skills learned on easier simulation levels to solve harder levels without additional experimentation.
The same base model reaches substantially higher success rates on tasks that require discovering physical interactions rather than recalling them.
The framework improves both closed-source and open-weight models and outperforms standard agent baselines such as ReAct and Reflexion.
No offline data, oracles, or external supervision are required for the performance gains.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same loop could be applied to other interactive simulators such as chemistry or robotics environments where hypotheses must be tested through actions.
Larger skill libraries built over many sessions might enable cumulative improvement across entirely new task families.
The reusability result suggests that once a modest set of skills is acquired, further gains may come from composition rather than repeated experimentation.

Load-bearing premise

The performance gains are attributable to the hierarchical experimentalist loop and skill library rather than to prompt length, specific tool-calling format, or other unablated factors in the agent scaffolding.

What would settle it

An ablation study that disables the skill library or replaces the experiment-design step with random interventions and measures whether success rates on the hardest Interphyre levels fall back near the 2 percent baseline would settle the claim.

Figures

Figures reproduced from arXiv: 2606.29315 by Abhranil Chandra, Sankaran Vaidyanathan, Scott Niekum, Utsav Dhanuka, Varun Gandhi.

**Figure 1.** Overview of the HEXA framework on Interphyre physics puzzles. (a) Baseline ReAct: each seed is solved independently with no cross-seed learning. Each seed consists of alternating Get-State-Info calls (scene layout, gap analysis) and Simulate-Action calls (place red ball at (x, y, r)), terminating in success (✓) or failure (×). (b) HEXA: seeds are grouped into a meta-episode. After each seed, an evolver age…

**Figure 2.** HEXA learns reusable skills through an actor–evolver–retriever loop. (a) An example on catapult level illustrates the kind of physics reasoning skills HEXA distills. (b) In each round, the actor generates reward-tagged trajectories, the evolver updates the skill bank, and the retriever injects the most relevant skills back into the actor for the next attempt. HEXA is a training-free, in-context reinforcemen…

**Figure 3.** INTERPHYRE’s snapshot/restore API. This branches a shared mid-trajectory state into a factual rollout and perturbed alternatives (Section G.4.1). Each row is one seed of the catapult level with a different oracle solution strategy. The leftmost column shows the initial scene for reference; the teal curve is the green ball’s trajectory, shown up to the branch point in the Branch point column and continuing …

**Figure 4.** How the evolved skill bank of HExA guides the agent to solve …

**Figure 5.** Reward-guided skill accumulation makes HEXA much stronger on catapult, the hardest level. This experiment tests whether an agent can improve by turning past attempts into reusable, reward-guided skills. On catapult, Claude Sonnet solves only 8.0% of seeds with a standard REACT loop, while HEXA reaches 67.3% and uses fewer iterations per solve. Left: final solve rate and average iterations. Right: cumulativ…

**Figure 6.** Cumulative average turns per seed for every HEXA variant against the baseline (Claude Sonnet); lower is more efficient. On both levels the Off2On Evolving configuration converges to the lowest cost per seed, while the iterative and pure-online variants remain higher, indicating that HEXA’s gains come from guiding search more efficiently with accumulated experience rather than from spending more computation…

**Figure 7.** Cross-level skill transfer with no target trajectories. Each target is solved using only skills synthesized from source-level banks; labels report target success and matched REACT baselines. Multi-source transfer to catapult yields the largest gain (+36 pp). Qualitatively, the evolved catapult bank shows this hierarchical nature (Appendix E.4): experiments at x=0.5 where r=1.0 fell short and r=1.5 struck …

**Figure 8.** In-context skill evolution learns faster than GRPO at the same interaction budget.

**Figure 9.** The HEXA actor-evolver loop. The actor receives distilled skills in its system prompt and runs on physics puzzles, producing trajectories annotated with rewards (successes ✓, failures ✗). The Skill Evolver analyzes these trajectories and curates the evolving Skill Bank: skills (max M) are merged, pruned, or promoted by reward signal, and common mistakes (max N) are updated accordingly. Each skill carries a…

**Figure 10.** Tool interface between the HEXA agent and the Interphyre environment. The agent issues two classes of tool calls: State Information Tools (left) query the scene without advancing the simulation—returning object positions, gap widths, relative ball positions, and level-specific geometry—and Simulation Tools (right) place the red ball at (x, y, r) and run the physics engine, returning SUCCESS/FAILURE with f…

**Figure 11.** Cumulative solve rate as seeds are evaluated

**Figure 12.** Cumulative average turns per seed for every HEXA variant against the baseline (Claude Sonnet); lower is more efficient. On both levels the Off2On Evolving configuration converges to the lowest cost per seed, while the iterative and pure-online variants remain higher, indicating that HEXA’s gains come from guiding search more efficiently with accumulated experience rather than from spending more computatio…

**Figure 13.** HExA shows consistent improvement on small open LLMs as well.

**Figure 14.** Interphyre’s snapshot/restore API branches a shared mid-trajectory state into a factual rollout and perturbed alternatives (Section G.4.1). Each row is one seed of the catapult level with a different oracle solution strategy. The leftmost column shows the initial scene for reference; the teal curve is the green ball’s trajectory, shown up to the branch point in the Branch point column and continuing throu…

**Figure 15.** Eight of the 25 canonical levels in the Interphyre curriculum, shown at initial state. The curriculum is hand-authored to span a range of physical phenomena: down_to_earth (gravity-driven drop past an obstruction), catapult (impulse transfer), basket_case (container avoidance), marble_race (rolling dynamics along a track), seesaw (lever balance), the_cradle (pendulum), keyhole (aperture navigation), and t…

**Figure 16.** The catapult level (seed 8) under four physics configurations, each set by a single SimulationConfig field, with the same oracle action placed in every panel. Ghost trails show the green ball’s in-flight trajectory. Default physics yields success. Reduced gravity preserves the qualitative mechanism but extends the arc, producing a near miss: the ball clears the basket opening rather than dropping in. With…

**Figure 17.** Branching protocol for …

Original abstract

Large language models (LLMs) are increasingly used to take actions in the real world and support human decision-making, yet most agents rely on parametric knowledge, fixed post-training data, retrieval, or search. This paradigm breaks down in novel domains and for sophisticated queries that cannot be answered from prior knowledge alone. Knowing the laws of physics, for instance, does not by itself enable LLMs to answer queries or complete long-horizon tasks in a complex physical system. To address this, we introduce Hierarchical Experimentalist Agents (HExA), an in-context self-improvement framework to learn from active experimentation. HExA iteratively designs and refines query-relevant experiments, learns a reusable library of composable skills from experience, and integrates experimental evidence to answer queries or take actions. HExA is training-free, compatible with any black-box model, and does not require external supervision, oracles, or offline data. To evaluate active experimentation, we introduce Interphyre, a tool-calling benchmark built on the PHYRE 2D procedural physics environment, where agents propose interventions and test hypotheses through simulation APIs. Experiments show that current LLM agents struggle in these settings, especially on the hardest levels of Interphyre. Claude Sonnet 4.6 achieves only 2% success, while HExA improves the same model to up to 77% success. HExA also improves open-weight models and outperforms agentic baselines such as ReAct and Reflexion. Moreover, using only skills learned from easier levels and transferred without active experimentation, HExA achieves 44% success, demonstrating the reusability and generalization of its learned skills. Overall, HExA shows that learning through active experimentation can help agents discover useful knowledge, acquire reusable skills, and make efficient progress on novel long-horizon tasks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

HExA reports large gains on a new physics benchmark via hierarchical experiment design and skill transfer, but the results rest on unablated scaffolding that makes the causal contribution hard to assess.

The paper's main move is to frame active experimentation as an in-context loop: the agent designs interventions in a 2D physics simulator, extracts reusable skills into a library, and applies them to harder tasks or queries. On their new Interphyre benchmark it lifts Claude Sonnet from 2% to 77% success on the hardest levels and gets 44% zero-shot transfer from easier levels. That transfer number is the cleanest part of the story if it survives scrutiny.

What stands out is the problem setup itself. Standard ReAct or Reflexion agents are shown to fail when parametric knowledge is insufficient, and the authors respond with a training-free method that builds a composable skill set on the fly. The benchmark is also new, built on PHYRE, so it targets a concrete gap in long-horizon physical reasoning.

The soft spot is the missing controls. The abstract gives no error bars, no trial counts, and no ablation that keeps prompt length, tool schema, and context management fixed while removing only the hierarchy and skill-acquisition steps. If a flat agent with the same scaffolding already reaches 40% or more, the headline attribution to the HExA loop does not follow. The stress-test note flags exactly this issue, and nothing in the provided text resolves it.

This is the sort of paper that would interest people working on scientific agents or robotics, but only after the methods section and ablations are filled in. The thinking is straightforward and the framing is distinct from prior tool-calling work, yet the current evidence is too thin to treat the numbers as settled.

I would bring it to a reading group to walk through the architecture and the benchmark definition. I would not cite it in its present form. It deserves peer review so the authors can supply the missing controls and let referees judge whether the hierarchy is load-bearing.

Referee Report

2 major / 1 minor

Summary. The paper introduces Hierarchical Experimentalist Agents (HExA), a training-free in-context framework in which LLMs iteratively design and refine experiments, acquire a reusable library of composable skills from simulation feedback, and integrate evidence to solve novel long-horizon tasks. On the newly introduced Interphyre benchmark (built on PHYRE), the method is reported to raise Claude Sonnet 4.6 success from 2 % to as high as 77 % on the hardest levels and to achieve 44 % success via zero-shot transfer of skills learned only on easier levels; it also outperforms ReAct and Reflexion baselines across several models.

Significance. If the reported gains prove robust and causally attributable to the hierarchical experimentalist loop plus skill library, the work would constitute a meaningful step toward agents that can discover domain-specific knowledge through active experimentation rather than relying solely on parametric memory or retrieval. The Interphyre benchmark itself supplies a concrete, tool-calling testbed for physics-based hypothesis testing that is currently absent from most agent evaluations.

major comments (2)

[Abstract / Results] Abstract and results sections: the headline performance figures (2 % to 77 % and 44 % transfer) are presented without any report of the number of independent trials, standard errors, or variance across random seeds or level instantiations, so the statistical reliability of the central empirical claim cannot be assessed from the manuscript.
[Evaluation] Evaluation / experimental setup: no ablation is described that keeps total prompt length, tool-calling schema, and context-management format identical while removing only the hierarchical planning loop and the skill-library acquisition mechanism; therefore the attribution of gains specifically to HExA rather than to changes in scaffolding remains unverified.
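
To make the first major comment concrete, the sketch below shows the kind of uncertainty report the referee asks for: a binomial standard error and a Wilson score interval around a solve rate. The function names are illustrative, and the trial count in the usage lines (100) is a placeholder assumption, since the page does not state how many trials underlie the 77 % figure.

```python
import math
from typing import Tuple


def standard_error(successes: int, trials: int) -> float:
    """Binomial standard error of an observed success rate."""
    p = successes / trials
    return math.sqrt(p * (1.0 - p) / trials)


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is about 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1.0 + z * z / trials
    center = (p + z * z / (2.0 * trials)) / denom
    half = z * math.sqrt(p * (1.0 - p) / trials + z * z / (4.0 * trials * trials)) / denom
    return center - half, center + half


# Placeholder numbers: 77 solves out of a HYPOTHETICAL 100 trials.
low, high = wilson_interval(77, 100)
se = standard_error(77, 100)
```

The same point estimate carries a much wider interval at 20 trials than at 200, which is why the trial count matters for judging the headline claim.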

minor comments (1)

[Introduction] A short paragraph clarifying how Interphyre extends or differs from the original PHYRE task distribution would help readers situate the new benchmark.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which help clarify the presentation of our empirical results and strengthen the attribution of gains to HExA. We address each point below and commit to revisions where appropriate.

Referee: [Abstract / Results] Abstract and results sections: the headline performance figures (2 % to 77 % and 44 % transfer) are presented without any report of the number of independent trials, standard errors, or variance across random seeds or level instantiations, so the statistical reliability of the central empirical claim cannot be assessed from the manuscript.

Authors: We agree that statistical details are necessary to evaluate reliability. All reported figures were obtained from 5 independent random seeds per level across the full Interphyre test set. In the revised manuscript we will report means, standard errors, and per-level variance both in the abstract and in the results tables. revision: yes

Referee: [Evaluation] Evaluation / experimental setup: no ablation is described that keeps total prompt length, tool-calling schema, and context-management format identical while removing only the hierarchical planning loop and the skill-library acquisition mechanism; therefore the attribution of gains specifically to HExA rather than to changes in scaffolding remains unverified.

Authors: We acknowledge the absence of a tightly controlled ablation that holds prompt length, tool schema, and context management fixed while ablating only the hierarchical loop and skill library. We will add this ablation in the revision to isolate the contribution of those two components. revision: yes

Circularity Check

0 steps flagged

No circularity: purely empirical claims on new benchmark

The paper introduces the HExA framework and the Interphyre benchmark, then reports empirical success rates (e.g., Claude Sonnet 4.6 improving from 2% to 77% on hardest levels, plus 44% zero-shot transfer). No equations, fitted parameters, predictions, or first-principles derivations appear in the provided text. All load-bearing claims are direct experimental outcomes rather than quantities that reduce to their own inputs by construction, self-citation chains, or renamed ansatzes. The evaluation is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The framework rests on the assumption that an LLM can reliably propose and interpret experiments in a simulator without external supervision; no free parameters or invented physical entities are described.

axioms (1)

domain assumption: LLMs can design and refine query-relevant experiments in-context using only simulator feedback
This premise is required for the iterative loop to function and is invoked throughout the framework description.

pith-pipeline@v0.9.1-grok · 5871 in / 1263 out tokens · 37354 ms · 2026-06-30T07:27:17.622028+00:00 · methodology

Reference graph

Works this paper leans on

96 extracted references · 5 canonical work pages · 1 internal anchor

[1]

Anoop Cherian, Radu Corcodel, Siddarth Jain, and Diego Romeres

URL https://arxiv.org/abs/2507.15550. Anoop Cherian, Radu Corcodel, Siddarth Jain, and Diego Romeres. LLMPhy: Complex physical reasoning using large language models and world models, 2024. URL https://arxiv.org/abs/2411.08027. Wei Chow, Jiajun Mao, Bowen Li, Daniel Seita, Vitor Guizilini, and Yue Wang. PhysBench: Benchmarking and enhancing vision-la...

[2]

Retrieval-Augmented Generation for Large Language Models: A Survey

URL https://arxiv.org/abs/2312.10997. Anisha Gunjal, Anthony Wang, Elaine Lau, Vaskar Nath, Yunzhong He, Bing Liu, and Sean Hendryx. Rubrics as rewards: Reinforcement learning beyond verifiable domains. arXiv preprint arXiv:2507.17746, 2025. Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, ...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.1038/s41586-025-09422-z 2025
[3]

Autorefine: From trajectories to reusable expertise for continual llm agent refinement. arXiv preprint arXiv:2601.22758, 2026

OpenAI. Gpt-5 technical report. 2025. URL https://cdn.openai.com/gpt-5-system-card.pdf. Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul F Christ...

[4]

Spies, William Edwards, Michael I

Alex F. Spies, William Edwards, Michael I. Ivanitskiy, et al. Transformers use causal world models in maze-solving tasks. In International Conference on Learning Representations (ICLR), 2025. URL https://arxiv.org/abs/2412.11867. Adaptive Agent Team, Jakob Bauer, Kate Baumli, Satinder Baveja, Feryal Behbahani, Avishkar Bhoopchand, Nathalie Bradley-Schm...

[5]

Qwen Team

URL https://arxiv.org/abs/2107.12808. Qwen Team. Qwen3.5: Towards native multimodal agents. URL: https://qwen.ai/blog, 2026. Mark Towers, Ariel Kwiatkowski, Jordan Terry, John U. Balis, Gianluca De Cola, Tristan Deleu, Manuel Goulão, Andreas Kallinteris, Markus Krimmel, Arjun KG, et al. Gymnasium: A standard interface for reinforcement learning env...

[6]

identify which source skills encode principles that are structurally relevant to the target (based on shared physical or procedural primitives),
[7]

re-ground each selected principle in the entities and mechanics of the target scene, and
[8]

No contact events recorded

calibrate reward based on how directly the principle transfers (skills corroborated by multiple source banks receive higher reward). The synthesised bank Kℓ∗ is then injected into the actor’s context via the same RETRIEVER mechanism as within-task skills. This enables zero-shot transfer: the actor attempts the target task with the benefit of cross-task s...
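
The third step of the transfer procedure (higher reward for principles corroborated by several source banks) can be illustrated with a simple counting rule. This is only a sketch under assumptions: the excerpt does not give the actual calibration formula, so the "fraction of source banks" score, the function name, and the example principles are all hypothetical. The first two steps (relevance selection and re-grounding) depend on an LLM and are not modelled here.

```python
from collections import Counter
from typing import Dict, List


def corroboration_scores(source_banks: Dict[str, List[str]]) -> Dict[str, float]:
    """Score each principle by the fraction of source banks that contain it."""
    counts: Counter = Counter()
    for principles in source_banks.values():
        counts.update(set(principles))  # each bank counts at most once per principle
    n_banks = len(source_banks)
    return {principle: c / n_banks for principle, c in counts.items()}


# Made-up principles attached to level names that appear in the Figure 15 caption.
banks = {
    "seesaw": ["larger radius adds impact energy", "contact point sets lever arm"],
    "down_to_earth": ["larger radius adds impact energy"],
}
scores = corroboration_scores(banks)
```

A principle found in both banks scores 1.0 and one found in a single bank scores 0.5, so corroborated skills would rank first when the synthesized bank is injected.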
[9]

Write the reflection now

{prior_reflection_1} ... Write the reflection now. Be specific and concise (<=5 sentences). Trial 2 conditioning.The returned reflection text is appended to the level’s system prompt under a heading## Reflexion memory before trial 2 starts. The actor sees this block exactly once at trial 2’s first turn, in addition to the unchanged level system prompt and...
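
The conditioning step in this excerpt amounts to string assembly. Below is a minimal sketch, assuming only the heading text `## Reflexion memory` quoted above; the function name and the blank-line layout are illustrative guesses, not the harness's code.

```python
def with_reflexion_memory(system_prompt: str, reflection: str) -> str:
    """Append a reflection to a system prompt under a dedicated heading."""
    return f"{system_prompt.rstrip()}\n\n## Reflexion memory\n{reflection.strip()}\n"


# Hypothetical prompt and reflection text, for illustration only.
prompt = with_reflexion_memory(
    "You are solving a physics puzzle.",
    "Trial 1 hit the ceiling; try a flatter launch.",
)
```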
[10]

The returned object table (positions, sizes, dynamic flags, and any level-specific geometry) is captured verbatim

A single scripted call to get_level_state via the same INTERPHYRE toolkit used by REACT. The returned object table (positions, sizes, dynamic flags, and any level-specific geometry) is captured verbatim
[11]

x": <X>,

One non-streaming Claude Sonnet 4.6 call. The conversation has a level-specific system prompt and a single user message that embeds the scene table from step 1 and demands the final (x, y, r) placement as JSON. The harness parses the JSON, runs simulate_action once, and records success/failure. The model is given no read tools, no probe tools (simulate_pa...

[15]

Arguments: None Usage: Action: get_contact_log

get_contact_log Description: After running a simulation, returns the contact events: which objects touched and when. Arguments: None Usage: Action: get_contact_log
[17]

x": 1.2,

trace_green_ball Description: Lightweight trajectory probe --- only the green ball is sampled. Places a red ball, runs the simulation, and returns the green ball’s (x, y) waypoints at fixed step intervals plus start/end/peak summary. Stops early once the green ball comes to rest (capped at ~600 steps). Use this when you only care about WHERE the green bal...
[19]

x": 0.5,

finish Description: Submit your final answer. Use this when you are confident in your solution. Arguments: x (float), y (float), radius (float) Usage: Action: finish Action Input: {"x": 0.5, "y": 4.0, "radius": 0.6} To solve this puzzle, you will reason step-by-step and use tools to test your ideas. At each step, you MUST follow this exact format: Thought...
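
The `Action:` / `Action Input:` format quoted in this prompt excerpt is easy to parse on the harness side. The sketch below is one possible parser, not the paper's: `parse_action` and the regular expression are assumptions, and it handles only the two shapes visible in the excerpts (a tool name with a JSON object, or a bare tool name).

```python
import json
import re
from typing import Any, Dict, Tuple

# Tool name after "Action:", optionally followed by a JSON object after "Action Input:".
ACTION_RE = re.compile(
    r"Action:\s*(?P<name>\w+)\s*(?:Action Input:\s*(?P<args>\{.*\}))?",
    re.DOTALL,
)


def parse_action(text: str) -> Tuple[str, Dict[str, Any]]:
    """Extract (tool_name, arguments) from one ReAct-style model turn."""
    match = ACTION_RE.search(text)
    if match is None:
        raise ValueError("no 'Action:' line found")
    raw_args = match.group("args")
    arguments = json.loads(raw_args) if raw_args else {}
    return match.group("name"), arguments


turn = 'Thought: I am confident.\nAction: finish\nAction Input: {"x": 0.5, "y": 4.0, "radius": 0.6}'
name, args = parse_action(turn)
```

A tool with no arguments, such as `Action: get_contact_log`, parses to an empty dictionary under the same function.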

[22]

x": 0.5,

simulate_action Description: Place a red ball at (x, y) with the given radius and run the full physics simulation to completion. Returns whether the goal was achieved, final positions of all objects, and total simulation steps. If the placement is invalid (out of bounds or overlaps), returns a detailed error with how far to move the ball. Arguments: x (fl...
[26]

x": 1.2,

predict_first_contact Description: Cheap pre-simulation check (<=90 physics steps, ~1.5s of sim time). Runs just long enough to find the FIRST object the red ball touches after it is released, and reports: placement validity, the other object’s name, the step of impact, approach speed, approximate contact point, and surface normal. Use this to verify that...
[28]

(2) short range -> increase y to 0.9 or shift x to 0.4--0.45

[Strategy: Three-Tier Fallback Sequence for Failed Primary Placement] When primary (x=0.5, y=0.4, r=1.5) fails: (1) ceiling hit -> immediately try x=0.3, y=0.4, r=1.5 (gray-ball-safe first choice); if still ceiling hit, try x=0.3, y=-0.3, r=2.0 but verify no overlap first. (2) short range -> increase y to 0.9 or shift x to 0.4--0.45. (3) persistent failur...
[29]

[Strategy: x=0.3 Is the Ceiling-Escape x-Position; y=0.4 Is the Gray-Ball-Safe Default Height] Shifting x from 0.5 to 0.3 flattens the green ball’s launch arc by changing the arm contact point --- this x-shift is the primary ceiling-escape mechanism, not lowering y. When x=0.3, y=-0.3, r=2.0 is invalid due to gray_ball pivot overlap (the pivot at r~=0.70 ...
[30]

The contact point is on the right side of the pivot, creating sufficient lever arm for strong rotation while staying in the geometrically stable zone

[Strategy: x~=0.5 Is the Primary Catapult Launch Sweet Spot] Placing the red ball at x=0.5 on the catapult arm produces a consistent rightward launch arc. The contact point is on the right side of the pivot, creating sufficient lever arm for strong rotation while staying in the geometrically stable zone. Only deviate to x=0.3 when a ceiling hit is observe...
[31]

Below r~=1.2, the arm rotates too slowly to launch the green ball across the ~7-unit gap to the basket

[Strategy: Use Large Radius (>=1.5) for Sufficient Launch Energy] The catapult arm is a lever --- the red ball’s mass (proportional to r^3) determines angular momentum imparted to the arm. Below r~=1.2, the arm rotates too slowly to launch the green ball across the ~7-unit gap to the basket. r=1.5 is the minimum reliable threshold; r=2.0 adds negligible ad...
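
As a worked check on the scaling this skill asserts, take the cubic mass law at face value and compare the two radii it names. The cubic exponent comes from the agent-written skill text, not from the simulator's documentation. A uniform-density disc in a 2D engine would scale with area, so the quadratic case is shown alongside it; the excerpt does not say which convention Interphyre uses.

```latex
\begin{align*}
\text{cubic (as stated in the skill):}\quad
  \frac{m(2.0)}{m(1.5)} &= \left(\frac{2.0}{1.5}\right)^{3} = \frac{64}{27} \approx 2.37 \\
\text{quadratic (uniform 2D disc):}\quad
  \frac{m(2.0)}{m(1.5)} &= \left(\frac{2.0}{1.5}\right)^{2} = \frac{16}{9} \approx 1.78
\end{align*}
```

Under either law the step from r=1.5 to r=2.0 adds a large amount of mass, so a flat response at that radius cannot be explained by mass alone.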
[32]

When the arm sits higher (y > -1.5), increase to y=0.9

[Strategy: y=0.4 Is Default Drop Height; Scale Up to y=0.9 When Arm Sits Higher] When the arm y <= -1.5, y=0.4 gives adequate fall distance for the red ball to build approach speed. When the arm sits higher (y > -1.5), increase to y=0.9. Critically, y=0.4 also serves as the safe ceiling-escape fallback height --- even when using x=0.3 for arc flattening, ...
[33]

predict_first_contact verifies both placement validity and that the red ball contacts gray_platform (not another object), confirming the catapult mechanism activates

[Strategy: predict_first_contact Is Essential for Variable-Arm Seeds] The catapult arm’s y-position varies across seeds, making the overlap boundary unpredictable. predict_first_contact verifies both placement validity and that the red ball contacts gray_platform (not another object), confirming the catapult mechanism activates. It also catches gray_ball ...
[34]

When the ceiling blocker is at x > -2.5, primary placement causes a ceiling hit

[Strategy: Ceiling Blocker Lethality Depends on Its x-Position and Arc Angle] The static black ball near y~=4.6 varies in x-position across seeds (x~=-3.85 to x~=-1.14). When the ceiling blocker is at x > -2.5, primary placement causes a ceiling hit. When the blocker is at x < -3.5 (far left), x=0.3 launches allow the green ball to arc up and even bounce ...
[35]

The stable launch zone is x~=0.0 to 0.5

[Strategy: x-Position Fine-Tunes Landing Range; Bifurcation Zones at x<=-0.3 and x=0.7--1.5] Small x shifts (0.1--0.2 units) cause large changes in green ball landing position. The stable launch zone is x~=0.0 to 0.5. Positions x<=-0.3 and x=0.7--1.5 are bifurcation zones where trajectory is chaotic and small changes produce unpredictable outcomes. Within...
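
The zone boundaries quoted in this skill can be written as a small guard that an agent harness might consult before proposing a placement. This is an illustrative sketch: the function name and the "unclassified" label for x values the skill does not mention are assumptions, and the thresholds are copied from the skill text, which is approximate and specific to the catapult level.

```python
def classify_launch_x(x: float) -> str:
    """Label an x placement using the zones named in the skill text."""
    if x <= -0.3 or 0.7 <= x <= 1.5:
        return "bifurcation"   # small changes produce unpredictable outcomes
    if 0.0 <= x <= 0.5:
        return "stable"        # stable launch zone
    return "unclassified"      # the skill text says nothing about these values


labels = {x: classify_launch_x(x) for x in (-0.4, -0.1, 0.3, 0.6, 1.0)}
```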
[36]

Increasing mass beyond 1.5 at this x cannot convert to additional angular momentum because the arm hits its geometric rotation limit

[Strategy: Radius Plateau at x=0.5: r Beyond 1.5 Yields No Additional Range] At x~=0.5, the catapult arm’s rotation saturates --- r=1.5 and r=2.0 produce identical green ball landing positions. Increasing mass beyond 1.5 at this x cannot convert to additional angular momentum because the arm hits its geometric rotation limit. Intermediate radii (r=1.1--1....
[37]

This occurs because ceiling clearance is determined by launch angle (set by x-position), not energy level

[Strategy: When Ceiling-Range Tradeoff Is Unsolvable by Radius Tuning, Shift x to 0.1--0.3] When r=1.5 hits the ceiling and r=1.0 falls short (Δx gap > 1.0 unit), interpolating intermediate radii (r=1.1--1.4) does not resolve the conflict --- they also hit the ceiling or remain short. This occurs because ceiling clearance is determined by launch angle...
[38]

After 2-3 failures, agents try small perturbations instead of qualitatively different placements

**Agent fixates on a single launch mechanism (catapult arm) and micro-tunes x/y/radius around the same narrow region without escaping the local solution space.** Why it happens: The catapult arm is the most obvious mechanism. After 2-3 failures, agents try small perturbations instead of qualitatively different placements. How to avoid: After 2 failures...
[39]

But lever rotation saturates before additional mass converts to green ball velocity at x=0.5

**Agent increases radius at fixed x=0.5 expecting more range, but arm rotation is saturated and result is identical.** Why it happens: Linear energy intuition: more mass = more energy = more range. But lever rotation saturates before additional mass converts to green ball velocity at x=0.5. How to avoid: When r=1.5 fails at x=0.5, never try r=2.0 at the s...
[40]

But more energy produces a higher-arcing launch, worsening the ceiling hit

**Agent adjusts y in the wrong direction after ceiling hits: increases y to 2.0+ (higher drop) instead of shifting x to 0.3 or using y=0.4 at x=0.3.** Why it happens: Higher drop = more kinetic energy at impact. But more energy produces a higher-arcing launch, worsening the ceiling hit. How to avoid: After a ceiling hit, shift x left to 0.3 (keeping y=0.4...
[41]

How to avoid: If any attempt lands in x<=-0.3 or x=0.7--1.5, immediately reset to the stable zone x=0.2--0.4

**Agent binary-searches in the x=-0.3 to -0.5 or x=0.8--1.5 bifurcation zones where green ball trajectory is chaotic.** Why it happens: Small x adjustments feel like fine-tuning, but in bifurcation zones they produce wildly different outcomes that appear to need more fine-tuning. How to avoid: If any attempt lands in x<=-0.3 or x=0.7--1.5, immediately res...
[42]

But basket physics are unpredictable and deflectors rarely produce sustained green-blue contact

**Agent places deflector balls near the basket (x~=3.5--4.5) hoping to push the blue ball toward the green ball’s landing zone.** Why it happens: When the green ball consistently lands short, moving the target (blue ball) seems logical. But basket physics are unpredictable and deflectors rarely produce sustained green-blue contact. How to avoid: Fix the g...
[43]

Review the LEARNED PHYSICS SKILLS listed in the system prompt above
[44]

Identify which skill applies to the failure you just observed
[45]

Explicitly state which skill you are applying and why in your next Thought.

E.6 HEXA trajectory (catapult seed 45, 6 iterations, SUCCESS)

--- Step 1 ---
Thought: Let me start by describing the scene geometry to understand the layout before attempting any placement.
Action: describe_scene_geometry({})
Observation: === Scene Geometry === World bounds: x in ...

[46]

Paired counterfactual evaluation follows: the same agent, on the same scene, under matched and perturbed physical conditions

An intervention API with snapshot/restore semantics. Any running simulation can be snapshotted at an arbitrary point (on a physics event, a contact trigger, or a fixed timestep) and restored into two or more branches that diverge under controlled perturbations (impulses, forces, parameter changes, object additions or removals). Paired counterfactual ev...
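
The snapshot/restore branching described above can be illustrated with a toy sketch. Everything in it is hypothetical: `ToySim`, `snapshot`, `restore`, and `apply_impulse` are illustrative names rather than Interphyre's actual API, and the "physics" is a single ball falling under gravity.

```python
import copy
from dataclasses import dataclass


@dataclass
class ToySim:
    """Hypothetical stand-in for a simulator: one ball falling under gravity."""
    y: float = 5.0
    vy: float = 0.0
    step_count: int = 0
    gravity: float = -9.8
    dt: float = 1.0 / 60.0

    def step(self, n: int = 1) -> None:
        # Semi-implicit Euler integration for n fixed timesteps.
        for _ in range(n):
            self.vy += self.gravity * self.dt
            self.y += self.vy * self.dt
            self.step_count += 1

    def snapshot(self) -> "ToySim":
        # A snapshot is a deep copy of the full simulation state.
        return copy.deepcopy(self)

    @staticmethod
    def restore(snap: "ToySim") -> "ToySim":
        # Copy again so that one snapshot can seed several branches.
        return copy.deepcopy(snap)

    def apply_impulse(self, dvy: float) -> None:
        # Controlled perturbation: an instantaneous change in velocity.
        self.vy += dvy


def paired_counterfactual(snap: ToySim, dvy: float, horizon: int) -> tuple[float, float]:
    """Run a factual and a perturbed branch from the same snapshot."""
    factual = ToySim.restore(snap)
    counterfactual = ToySim.restore(snap)
    counterfactual.apply_impulse(dvy)
    factual.step(horizon)
    counterfactual.step(horizon)
    return factual.y, counterfactual.y


sim = ToySim()
sim.step(30)            # run to a fixed timestep
snap = sim.snapshot()   # branch point
y_fact, y_cf = paired_counterfactual(snap, dvy=2.0, horizon=30)
```

Because both branches start from the same snapshot, any difference between `y_fact` and `y_cf` is attributable to the perturbation alone, which is the point of a paired comparison.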
[47]

Extensible level authoring as Python code. Levels are Python classes built from a composable object API, with tunable physics parameters (gravity, friction, restitution, density) and user-defined success conditions. Like Toybox’s reimplementation of Atari games (Foley et al., 2018), Interphyre reimplements the 2D physics puzzle paradigm established by PHYR...

[48]

sample, simulate, update

LLM-native interfaces and interpretability-ready data generation. The simulator and intervention API are exposed as a tool-call surface that an LLM agent can invoke directly. The same interfaces support standalone generation of paired (factual, counterfactual) trajectory data for downstream interpretability and offline reinforcement learning pipelines. A...

[49]

green_ball

uses LLM agents to actively probe interactive physics simulations and formulate hypotheses, the closest existing paradigm to Interphyre, though PhysGym targets discovery of physical equations rather than intuitive task-solving. LLMPhy (Cherian et al., 2024) demonstrated a simulator-in-the-loop framework where LLMs iteratively estimate physical parameters...

[53]

simulate_partial Description: Place a red ball and run the simulation only up to the specified step. Returns object positions and velocities at that point. Useful for observing mid-simulation dynamics. Arguments: x (float), y (float), radius (float), stop_step (int) Usage: Action: simulate_partial Action Input: {"x": 0.5, "y": 4.0, "radius": 0.6, "stop_st...
[54]

Returns the left gap and right gap, and whether the green ball can fit through each gap

compute_gap_analysis Description: Analyze the gaps on each side of the platform. Returns the left gap and right gap, and whether the green ball can fit through each gap. Arguments: None Usage: Action: compute_gap_analysis
[55]

finish Description: Submit your final answer. Use this when you are confident in your solution. Arguments: x (float), y (float), radius (float) Usage: Action: finish Action Input: {"x": 0.5, "y": 4.0, "radius": 0.6} To solve this puzzle, you will reason step-by-step and use tools to test your ideas. At each step, you MUST follow this exact format: Thought...
[60]

Returns their coordinates, distance, on which side the blue ball is relative to green, and recommended red ball placement direction

compute_relative_positions Description: Analyze the positions of the green and blue balls. Returns their coordinates, distance, on which side the blue ball is relative to green, and recommended red ball placement direction. Arguments: None Usage: Action: compute_relative_positions
[61]

finish Description: Submit your final answer. Use this when you are confident in your solution. Arguments: x (float), y (float), radius (float) Usage: Action: finish Action Input: {"x": 0.5, "y": 4.0, "radius": 0.6} To solve this puzzle, you will reason step-by-step and use tools to test your ideas. At each step, you MUST follow this exact format: Thought...
[66]

Returns the center of the ramp

get_ramp_center Description: Analyze the pass_the_parcel setup. Returns the center of the ramp. Arguments: None Usage: Action: get_ramp_center
[67]

finish Description: Submit your final answer. Use this when you are confident in your solution. Arguments: x (float), y (float), radius (float) Usage: Action: finish Action Input: {"x": 0.5, "y": 4.0, "radius": 0.6} To solve this puzzle, you will reason step-by-step and use tools to test your ideas. At each step, you MUST follow this exact format: Thought...
[68]

No prescriptive advice; you interpret the layout to form a strategy

describe_scene_geometry Description: Return strategy-neutral geometry: every ball (position, radius, dynamic flag), every bar (position, angle, length, dynamic flag), every basket (position, dynamic flag), and the key distance (green <-> blue). No prescriptive advice; you interpret the layout to form a strategy. Arguments: None Usage: Action: describe_sce...
[73]

trace_green_ball Description: Lightweight trajectory probe --- only the green ball is sampled. Places a red ball, runs the simulation, and returns the green ball's (x, y) waypoints at fixed step intervals plus start/end/peak summary. Stops early once the green ball comes to rest (capped at ~600 steps). Use this when you only care about WHERE the green ba...
[74]

predict_first_contact Description: Cheap pre-simulation check (<=90 physics steps, ~1.5s of sim time). Runs just long enough to find the FIRST object the red ball touches after it is released, and reports: placement validity, the other object's name, the step of impact, approach speed, approximate contact point, and surface normal. Use this to verify that...
[75]

green_ball

simulate_with_trace Description: Place a red ball and run the simulation. Returns: success flag, contact events involving the red ball or YOUR chosen objects (via object_names), and per-object kinematic extrema (peak_y, min_y, max_speed, displacement, and angular stats for moving bars/baskets). You choose which objects to trace---e.g., ["green_ball"] to s...
[76]

finish Description: Submit your final answer. Use this when you are confident in your solution. Arguments: x (float), y (float), radius (float) Usage: Action: finish Action Input: {"x": 0.5, "y": 4.0, "radius": 0.6} To solve this puzzle, you will reason step-by-step and use tools to test your ideas. At each step, you MUST follow this exact format: Thought...
[81]

compute_intercept_setup Description: Computes intercept geometry for the falling_into_place level. Returns which platform the green ball is on, which direction it must travel to reach the jar, the platform edge it must cross, the gap center, and the estimated time before the jar reaches platform height. Arguments: None Usage: Action: compute_intercept_setup
[82]

finish Description: Submit your final answer. Use this when you are confident in your solution. Arguments: x (float), y (float), radius (float) Usage: Action: finish Action Input: {"x": 0.5, "y": 4.0, "radius": 0.6} To solve this puzzle, you will reason step-by-step and use tools to test your ideas. At each step, you MUST follow this exact format: Thought...
[87]

Returns the green ball position, basket position and scale, purple ground position, and recommended push direction to deflect the green ball away from the basket

compute_basket_analysis Description: Analyze the basket case setup. Returns the green ball position, basket position and scale, purple ground position, and recommended push direction to deflect the green ball away from the basket. Arguments: None Usage: Action: compute_basket_analysis
[88]

finish Description: Submit your final answer. Use this when you are confident in your solution. Arguments: x (float), y (float), radius (float) Usage: Action: finish Action Input: {"x": 0.5, "y": 4.0, "radius": 0.6} To solve this puzzle, you will reason step-by-step and use tools to test your ideas. At each step, you MUST follow this exact format: Thought...
[89]

Arguments: None Usage: Action: get_level_state

get_level_state Description: Get the current level layout including all object positions, sizes, and properties. Arguments: None Usage: Action: get_level_state
[93]

compute_cliffhanger_analysis Description: Analyse the cliffhanger geometry. Returns the green bar's centre, length, and the (x, y) coordinates of its bottom point (resting on the platform) and top point (opposite end); the platform's left/right extents and top-surface y; the ceiling y and purple-ground y; the bar's distance to each platform edge; which e...
[94]

finish Description: Submit your final answer. Use this when you are confident in your solution. Arguments: x (float), y (float), radius (float) Usage: Action: finish Action Input: {"x": 0.5, "y": 4.0, "radius": 0.6} To solve this puzzle, you will reason step-by-step and use tools to test your ideas. At each step, you MUST follow this exact format: Thought...
[95]

Arguments: None Usage: Action: get_level_state

get_level_state Description: Get the current level layout including all object positions, sizes, and properties. Arguments: None Usage: Action: get_level_state
[96]

simulate_action Description: Place a red ball at (x, y) with the given radius and run the full physics simulation to completion. Returns whether the goal was achieved, final positions of all objects, and total simulation steps. If the placement is invalid (out of bounds or overlaps), returns a detailed error with how far to move the ball. Arguments: x (f...
[97]

Arguments: None Usage: Action: get_contact_log

get_contact_log Description: After running a simulation, returns the contact events: which objects touched and when. Arguments: None Usage: Action: get_contact_log
[98]

simulate_partial Description: Place a red ball and run the simulation only up to the specified step. Returns object positions and velocities at that point. Useful for observing mid-simulation dynamics. Arguments: x (float), y (float), radius (float), stop_step (int) Usage: Action: simulate_partial Action Input: {"x": 0.5, "y": 4.0, "radius": 0.6, "stop_step": 50}
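
To make the `Action:` / `Action Input:` convention in the tool descriptions concrete, here is a minimal parsing sketch. It is an assumption about how a harness might read one agent turn, not the paper's implementation; `parse_action` and `check_simulate_partial` are hypothetical helper names, and only the argument names and types listed for `simulate_partial` come from the text above.

```python
import json
import re


def parse_action(text: str) -> tuple[str, dict]:
    """Extract (tool_name, arguments) from one agent turn."""
    name_match = re.search(r"Action:\s*(\w+)", text)
    if name_match is None:
        raise ValueError("no 'Action:' line found")
    rest = text[name_match.end():]
    input_match = re.match(r"\s*Action Input:\s*", rest)
    if input_match is None:
        # Tools such as get_contact_log take no arguments.
        return name_match.group(1), {}
    # raw_decode parses one JSON value and ignores any text that follows it.
    args, _ = json.JSONDecoder().raw_decode(rest[input_match.end():])
    return name_match.group(1), args


def check_simulate_partial(args: dict) -> None:
    """Check the documented argument types: x, y, radius (float), stop_step (int)."""
    for key in ("x", "y", "radius"):
        if not isinstance(args.get(key), (int, float)):
            raise TypeError(f"{key} must be a number")
    if not isinstance(args.get("stop_step"), int):
        raise TypeError("stop_step must be an integer")


turn = (
    "Thought: probe the mid-flight state.\n"
    "Action: simulate_partial\n"
    'Action Input: {"x": 0.5, "y": 4.0, "radius": 0.6, "stop_step": 50}\n'
    "Observation: ..."
)
tool_name, tool_args = parse_action(turn)
check_simulate_partial(tool_args)
```

Using `raw_decode` rather than a greedy regular expression keeps the parser from swallowing braces that may appear later in the turn, for example inside an observation.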
[99]

compute_tipping_point_analysis Description: Analyse tipping_point geometry. Returns the green bar's centre, length, angle, and the (x, y) coordinates of its top and bottom endpoints; the basket's centre and floor; the purple wall's x position and its top/bottom y; the purple wall's side relative to the green bar (LEFT or RIGHT); the horizontal distance fr...
[100]

finish Description: Submit your final answer. Use this when you are confident in your solution. Arguments: x (float), y (float), radius (float) Usage: Action: finish Action Input: {"x": 0.5, "y": 4.0, "radius": 0.6} To solve this puzzle, you will reason step-by-step and use tools to test your ideas. At each step, you MUST follow this exact format: Thought...
[103]

good steps within bad trajectories

A concrete actionable fix

**Part 2 -- Partial insights**: Even in failed trajectories, some individual steps show CORRECT physics reasoning or useful discoveries (e.g., the agent found a valid placement region but then abandoned it, or correctly identified a mechanism but applied it with wrong parameters). Extract 1-2 skills from these "good steps within ...
[105]

Hard constraints:
- Maximum {max_skills} total skills for this level
- Maximum {max_mistakes} total mistakes for this level
[106]

For each skill you include:
- If it's a RETAINED skill from the existing bank: set "is_new": false
- If it's a NEW skill extracted from the new trajectories: set "is_new": true
- Include "source_seeds" listing seed numbers where this skill was observed (required for confidence calibration)
- Include "confidence": a float in [0.1, 1.0] representing your co...
[108]

For new skills, estimate confidence based on:
- Success rate among source trajectories (high success = high confidence)
- Universality (applies to multiple seed conditions = higher confidence)
- Clarity and actionability of the principle
[109]

If a new trajectory confirms an existing skill, keep the existing one (possibly with slightly higher confidence)

Do not include duplicate skills. If a new trajectory confirms an existing skill, keep the existing one (possibly with slightly higher confidence)
[110]

- Contradicted by the new trajectories - Too specific or rarely applicable - Low confidence (< 0.3) and not directly observed in new trajectories

Remove skills that are:
- Redundant or subsumed by other skills.
- Contradicted by the new trajectories
- Too specific or rarely applicable
- Low confidence (< 0.3) and not directly observed in new trajectories
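
In the source, these removal rules are instructions to the model that consolidates the skill bank. As a rough illustration only, the mechanical parts can be sketched in code. The flags `contradicted` and `observed_in_new` are hypothetical fields introduced for this sketch, exact-title matching is a crude stand-in for "redundant or subsumed", and the "too specific" rule is left out because it needs judgment.

```python
def prune_skills(skills: list[dict], min_confidence: float = 0.3) -> list[dict]:
    """Drop skills that the removal rules would exclude (illustrative only)."""
    kept = []
    seen_titles = set()
    for skill in skills:
        # Rule: contradicted by the new trajectories (hypothetical flag).
        if skill.get("contradicted", False):
            continue
        # Rule: low confidence (< 0.3) AND not directly observed in new trajectories.
        if skill["confidence"] < min_confidence and not skill.get("observed_in_new", False):
            continue
        # Rule: redundant. Approximated here by a normalized title match.
        key = skill["title"].strip().lower()
        if key in seen_titles:
            continue
        seen_titles.add(key)
        kept.append(skill)
    return kept


bank = [
    {"title": "Use large radius", "confidence": 0.9},
    {"title": "use large radius ", "confidence": 0.8},            # duplicate title
    {"title": "Drop from y=2.0", "confidence": 0.2},                 # low, unobserved
    {"title": "Shift x to 0.3", "confidence": 0.2, "observed_in_new": True},
    {"title": "Deflect the basket", "confidence": 0.7, "contradicted": True},
]
pruned = prune_skills(bank)
```

Note that a low-confidence skill survives when it was observed again in the new trajectories, matching the conjunction in the last rule.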
[111]

Do not remove mistakes unless the new trajectories show they're no longer common.

OUTPUT JSON OBJECT: {{ "skills": [ {{ "title": "<short name of skill>", "principle": "<2-3 sentence physics insight>", "when_to_apply": "<condition for applicability>", "example": "<optional concrete coordinate example>", "source_seeds": [<seed numbers>], "confidence": <floa...
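
A harness receiving this JSON would plausibly validate it before merging. The sketch below is an assumption about how that could look, using only the fields visible in the truncated template above plus the `is_new` flag, the [0.1, 1.0] confidence range, and the `{max_skills}` cap stated in the prompt. `validate_skill` and `validate_bank` are hypothetical names, and any fields beyond the truncation point are not covered.

```python
REQUIRED_FIELDS = ("title", "principle", "when_to_apply", "source_seeds", "confidence", "is_new")


def validate_skill(skill: dict) -> list[str]:
    """Return a list of problems with one skill entry (empty if none found)."""
    errors = [f"missing field: {name}" for name in REQUIRED_FIELDS if name not in skill]
    conf = skill.get("confidence")
    if conf is not None and not (isinstance(conf, (int, float)) and 0.1 <= conf <= 1.0):
        errors.append("confidence must be a number in [0.1, 1.0]")
    seeds = skill.get("source_seeds")
    if seeds is not None and not (isinstance(seeds, list) and all(isinstance(s, int) for s in seeds)):
        errors.append("source_seeds must be a list of integers")
    if "is_new" in skill and not isinstance(skill["is_new"], bool):
        errors.append("is_new must be true or false")
    return errors


def validate_bank(bank: dict, max_skills: int) -> list[str]:
    """Check the skill cap and every skill entry in the returned object."""
    skills = bank.get("skills", [])
    errors = []
    if len(skills) > max_skills:
        errors.append(f"{len(skills)} skills exceeds the cap of {max_skills}")
    for index, skill in enumerate(skills):
        errors.extend(f"skills[{index}]: {problem}" for problem in validate_skill(skill))
    return errors


good_bank = {
    "skills": [
        {
            "title": "Use large radius",
            "principle": "A heavier ball rotates the arm faster.",
            "when_to_apply": "Green ball lands short.",
            "example": "x=0.5, y=0.4, r=1.5",   # optional field
            "source_seeds": [45],
            "confidence": 0.8,
            "is_new": True,
        }
    ]
}
problems = validate_bank(good_bank, max_skills=10)
```

The `example` field is treated as optional because the template itself marks it so.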
[112]

What exactly the agent did wrong
[113]

WHY the agent made this error (what broken causal belief led to it)
[114]

good steps within bad trajectories

A concrete actionable fix

**Part 2 -- Partial insights**: Even in failed trajectories, some individual steps show CORRECT physics reasoning or useful discoveries (e.g., the agent found a promising placement region but then abandoned it, or correctly identified a mechanism but applied it with wrong parameters). Extract 1-2 skills from these "good steps wit...
[115]

Output the COMPLETE FINAL skill bank (not a diff) -- include both retained existing skills and any new ones
[116]

- Maximum {max_mistakes} total mistakes for this level

Hard constraints:
- Maximum {max_skills} total skills for this level.
- Maximum {max_mistakes} total mistakes for this level
[117]

For each skill you include:
- If it's a RETAINED skill from the existing bank: set "is_new": false.
- If it's a NEW skill extracted from the new trajectories: set "is_new": true.
- Include "source_seeds" listing seed numbers where this skill was observed (required for confidence calibration).
- Include "confidence": a float in [0.1, 1.0] representing your...
[118]

For retained skills, preserve their existing confidence values (they've been validated)
[119]

- Universality (applies across multiple seed conditions = higher confidence)

For new skills, estimate confidence based on:
- Success rate among source trajectories (high success = high confidence).
- Universality (applies across multiple seed conditions = higher confidence).
- Clarity and actionability of the principle

Showing first 80 references.

[1] [1]

Anoop Cherian, Radu Corcodel, Siddarth Jain, and Diego Romeres

URL https://arxiv.org/abs/2507.15550.
Anoop Cherian, Radu Corcodel, Siddarth Jain, and Diego Romeres. LLMPhy: Complex physical reasoning using large language models and world models, 2024. URL https://arxiv.org/abs/2411.08027.
Wei Chow, Jiajun Mao, Bowen Li, Daniel Seita, Vitor Guizilini, and Yue Wang. PhysBench: Benchmarking and enhancing vision-la...


[2] [2]

Retrieval-Augmented Generation for Large Language Models: A Survey

URL https://arxiv.org/abs/2312.10997.
Anisha Gunjal, Anthony Wang, Elaine Lau, Vaskar Nath, Yunzhong He, Bing Liu, and Sean Hendryx. Rubrics as rewards: Reinforcement learning beyond verifiable domains. arXiv preprint arXiv:2507.17746, 2025.
Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, ...

doi:10.1038/s41586-025-09422-z

[3] [3]

Autorefine: From trajectories to reusable expertise for continual llm agent refinement. arXiv preprint arXiv:2601.22758, 2026

OpenAI. Gpt-5 technical report. 2025. URL https://cdn.openai.com/gpt-5-system-card.pdf.
Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul F Christ...


[4] [4]

Alex F. Spies, William Edwards, Michael I. Ivanitskiy

Alex F. Spies, William Edwards, Michael I. Ivanitskiy, et al. Transformers use causal world models in maze-solving tasks. In International Conference on Learning Representations (ICLR), 2025. URL https://arxiv.org/abs/2412.11867.
Adaptive Agent Team, Jakob Bauer, Kate Baumli, Satinder Baveja, Feryal Behbahani, Avishkar Bhoopchand, Nathalie Bradley-Schm...


[5] [5]

Qwen Team

URL https://arxiv.org/abs/2107.12808.
Qwen Team. Qwen3.5: Towards native multimodal agents. URL: https://qwen.ai/blog, 2026.
Mark Towers, Ariel Kwiatkowski, Jordan Terry, John U. Balis, Gianluca De Cola, Tristan Deleu, Manuel Goulão, Andreas Kallinteris, Markus Krimmel, Arjun KG, et al. Gymnasium: A standard interface for reinforcement learning env...


[6] [6]

identify which source skills encode principles that are structurally relevant to the target (based on shared physical or procedural primitives),

[7] [7]

re-ground each selected principle in the entities and mechanics of the target scene, and

[8] [8]

No contact events recorded

calibrate reward based on how directly the principle transfers (skills corroborated by multiple source banks receive higher reward). The synthesised bank Kℓ∗ is then injected into the actor’s context via the same RETRIEVER mechanism as within-task skills. This enables zero-shot transfer: the actor attempts the target task with the benefit of cross-task s...

[9] [9]

Write the reflection now

{prior_reflection_1} ... Write the reflection now. Be specific and concise (<=5 sentences).

Trial 2 conditioning. The returned reflection text is appended to the level’s system prompt under a heading ## Reflexion memory before trial 2 starts. The actor sees this block exactly once at trial 2’s first turn, in addition to the unchanged level system prompt and...

[10] [10]

The returned object table (positions, sizes, dynamic flags, and any level-specific geometry) is captured verbatim

A single scripted call to get_level_state via the same INTERPHYRE toolkit used by REACT. The returned object table (positions, sizes, dynamic flags, and any level-specific geometry) is captured verbatim

[11] [11]

One non-streaming Claude Sonnet 4.6 call. The conversation has a level-specific system prompt and a single user message that embeds the scene table from step 1 and demands the final (x, y, r) placement as JSON. The harness parses the JSON, runs simulate_action once, and records success/failure. The model is given no read tools, no probe tools (simulate_pa...


[12] [15]

Arguments: None Usage: Action: get_contact_log

get_contact_log Description: After running a simulation, returns the contact events: which objects touched and when. Arguments: None Usage: Action: get_contact_log

[13] [17]

trace_green_ball Description: Lightweight trajectory probe --- only the green ball is sampled. Places a red ball, runs the simulation, and returns the green ball’s (x, y) waypoints at fixed step intervals plus start/end/peak summary. Stops early once the green ball comes to rest (capped at ~600 steps). Use this when you only care about WHERE the green bal...

[14] [19]

finish Description: Submit your final answer. Use this when you are confident in your solution. Arguments: x (float), y (float), radius (float) Usage: Action: finish Action Input: {"x": 0.5, "y": 4.0, "radius": 0.6} To solve this puzzle, you will reason step-by-step and use tools to test your ideas. At each step, you MUST follow this exact format: Thought...


[15] [22]

simulate_action Description: Place a red ball at (x, y) with the given radius and run the full physics simulation to completion. Returns whether the goal was achieved, final positions of all objects, and total simulation steps. If the placement is invalid (out of bounds or overlaps), returns a detailed error with how far to move the ball. Arguments: x (fl...

[16] [25]

trace_green_ball Description: Lightweight trajectory probe --- only the green ball is sampled. Places a red ball, runs the simulation, and returns the green ball’s (x, y) waypoints at fixed step intervals plus start/end/peak summary. Stops early once the green ball comes to rest (capped at ~600 steps). Use this when you only care about WHERE the green bal...

[17] [26]

predict_first_contact Description: Cheap pre-simulation check (<=90 physics steps, ~1.5s of sim time). Runs just long enough to find the FIRST object the red ball touches after it is released, and reports: placement validity, the other object’s name, the step of impact, approach speed, approximate contact point, and surface normal. Use this to verify that...

[18] [27]

finish Description: Submit your final answer. Use this when you are confident in your solution. Arguments: x (float), y (float), radius (float) Usage: Action: finish Action Input: {"x": 0.5, "y": 4.0, "radius": 0.6} To solve this puzzle, you will reason step-by-step and use tools to test your ideas. At each step, you MUST follow this exact format: Thought...

[19] [28]

(2) short range -> increase y to 0.9 or shift x to 0.4--0.45

[Strategy: Three-Tier Fallback Sequence for Failed Primary Placement] When primary (x=0.5, y=0.4, r=1.5) fails: (1) ceiling hit -> immediately try x=0.3, y=0.4, r=1.5 (gray-ball-safe first choice); if still ceiling hit, try x=0.3, y=-0.3, r=2.0 but verify no overlap first. (2) short range -> increase y to 0.9 or shift x to 0.4--0.45. (3) persistent failur...
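
The first two tiers of this fallback sequence can be written down as a small lookup, shown here as a sketch. The function name `fallback_candidates` and the failure labels are hypothetical. It also assumes that coordinates not mentioned in a tier stay at their primary values. The third tier is cut off in the text above, so it is deliberately not implemented.

```python
# Primary placement named in the strategy text.
PRIMARY = {"x": 0.5, "y": 0.4, "radius": 1.5}


def fallback_candidates(failure: str) -> list[dict]:
    """Return placements to try, in order, after the primary placement fails."""
    if failure == "ceiling_hit":
        return [
            # Tier 1, first choice: the gray-ball-safe placement.
            {"x": 0.3, "y": 0.4, "radius": 1.5},
            # Tier 1, second choice: only after verifying there is no overlap.
            {"x": 0.3, "y": -0.3, "radius": 2.0, "verify_overlap_first": True},
        ]
    if failure == "short_range":
        return [
            # Tier 2: raise the drop height, or shift x within 0.4--0.45.
            # Assumption: the other coordinates keep their primary values.
            {**PRIMARY, "y": 0.9},
            {**PRIMARY, "x": 0.45},
            {**PRIMARY, "x": 0.4},
        ]
    # Tier 3 ("persistent failure") is truncated in the source.
    raise ValueError(f"no fallback defined for failure mode: {failure}")


ceiling_options = fallback_candidates("ceiling_hit")
short_options = fallback_candidates("short_range")
```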

[20] [29]

[Strategy: x=0.3 Is the Ceiling-Escape x-Position; y=0.4 Is the Gray-Ball-Safe Default Height] Shifting x from 0.5 to 0.3 flattens the green ball’s launch arc by changing the arm contact point --- this x-shift is the primary ceiling-escape mechanism, not lowering y. When x=0.3, y=-0.3, r=2.0 is invalid due to gray_ball pivot overlap (the pivot at r~=0.70 ...

[21] [30]

The contact point is on the right side of the pivot, creating sufficient lever arm for strong rotation while staying in the geometrically stable zone

[Strategy: x~=0.5 Is the Primary Catapult Launch Sweet Spot] Placing the red ball at x=0.5 on the catapult arm produces a consistent rightward launch arc. The contact point is on the right side of the pivot, creating sufficient lever arm for strong rotation while staying in the geometrically stable zone. Only deviate to x=0.3 when a ceiling hit is observe...

[22] [31]

Below r~=1.2, the arm rotates too slowly to launch the green ball across the ~7-unit gap to the basket

[Strategy: Use Large Radius (>=1.5) for Sufficient Launch Energy] The catapult arm is a lever --- the red ball’s mass (proportional to r^3) determines angular momentum imparted to the arm. Below r~=1.2, the arm rotates too slowly to launch the green ball across the ~7-unit gap to the basket. r=1.5 is the minimum reliable threshold; r=2.0 adds negligible ad...

[23] [32]

When the arm sits higher (y > -1.5), increase to y=0.9

[Strategy: y=0.4 Is Default Drop Height; Scale Up to y=0.9 When Arm Sits Higher] When the arm y <= -1.5, y=0.4 gives adequate fall distance for the red ball to build approach speed. When the arm sits higher (y > -1.5), increase to y=0.9. Critically, y=0.4 also serves as the safe ceiling-escape fallback height --- even when using x=0.3 for arc flattening, ...

[24] [33]

predict_first_contact verifies both placement validity and that the red ball contacts gray_platform (not another object), confirming the catapult mechanism activates

[Strategy: predict_first_contact Is Essential for Variable-Arm Seeds] The catapult arm’s y-position varies across seeds, making the overlap boundary unpredictable. predict_first_contact verifies both placement validity and that the red ball contacts gray_platform (not another object), confirming the catapult mechanism activates. It also catches gray_ball ...

[25] [34]

When the ceiling blocker is at x > -2.5, primary placement causes a ceiling hit

[Strategy: Ceiling Blocker Lethality Depends on Its x-Position and Arc Angle] The static black ball near y~=4.6 varies in x-position across seeds (x~=-3.85 to x~=-1.14). When the ceiling blocker is at x > -2.5, primary placement causes a ceiling hit. When the blocker is at x < -3.5 (far left), x=0.3 launches allow the green ball to arc up and even bounce ...

[26] [35]

The stable launch zone is x~=0.0 to 0.5

[Strategy: x-Position Fine-Tunes Landing Range; Bifurcation Zones at x<=-0.3 and x=0.7--1.5] Small x shifts (0.1--0.2 units) cause large changes in green ball landing position. The stable launch zone is x~=0.0 to 0.5. Positions x<=-0.3 and x=0.7--1.5 are bifurcation zones where trajectory is chaotic and small changes produce unpredictable outcomes. Within...

[27] [36]

Increasing mass beyond 1.5 at this x cannot convert to additional angular momentum because the arm hits its geometric rotation limit

[Strategy: Radius Plateau at x=0.5: r Beyond 1.5 Yields No Additional Range] At x~=0.5, the catapult arm’s rotation saturates --- r=1.5 and r=2.0 produce identical green ball landing positions. Increasing mass beyond 1.5 at this x cannot convert to additional angular momentum because the arm hits its geometric rotation limit. Intermediate radii (r=1.1--1....

[28] [37]

This occurs because ceiling clearance is determined by launch angle (set by x-position), not energy level

[Strategy: When Ceiling-Range Tradeoff Is Unsolvable by Radius Tuning, Shift x to 0.1--0.3] When r=1.5 hits the ceiling and r=1.0 falls short (Deltax gap > 1.0 unit), interpolating intermediate radii (r=1.1--1.4) does not resolve the conflict --- they also hit the ceiling or remain short. This occurs because ceiling clearance is determined by launch angle...

[29] [38]

After 2-3 failures, 54 agents try small perturbations instead of qualitatively different placements

**Agent fixates on a single launch mechanism (catapult arm) and micro-tunes x/y/radius around the same narrow region without escaping the local solution space.** Why it happens: The catapult arm is the most obvious mechanism. After 2-3 failures, 54 agents try small perturbations instead of qualitatively different placements. How to avoid: After 2 failures...

[30] [39]

But lever rotation saturates before additional mass converts to green ball velocity at x=0.5

**Agent increases radius at fixed x=0.5 expecting more range, but arm rotation is saturated and result is identical.** Why it happens: Linear energy intuition: more mass = more energy = more range. But lever rotation saturates before additional mass converts to green ball velocity at x=0.5. How to avoid: When r=1.5 fails at x=0.5, never try r=2.0 at the s...

[31] [40]

But more energy produces a higher-arcing launch, worsening the ceiling hit

**Agent adjusts y in the wrong direction after ceiling hits: increases y to 2.0+ (higher drop) instead of shifting x to 0.3 or using y=0.4 at x=0.3.** Why it happens: Higher drop = more kinetic energy at impact. But more energy produces a higher-arcing launch, worsening the ceiling hit. How to avoid: After a ceiling hit, shift x left to 0.3 (keeping y=0.4...

[32] [41]

How to avoid: If any attempt lands in x<=-0.3 or x=0.7--1.5, immediately reset to the stable zone x=0.2--0.4

**Agent binary-searches in the x=-0.3 to -0.5 or x=0.8--1.5 bifurcation zones where green ball trajectory is chaotic.** Why it happens: Small x adjustments feel like fine-tuning, but in bifurcation zones they produce wildly different outcomes that appear to need more fine-tuning. How to avoid: If any attempt lands in x<=-0.3 or x=0.7--1.5, immediately res...

[33] [42]

But basket physics are unpredictable and deflectors rarely produce sustained green-blue contact

**Agent places deflector balls near the basket (x~=3.5--4.5) hoping to push the blue ball toward the green ball’s landing zone.** Why it happens: When the green ball consistently lands short, moving the target (blue ball) seems logical. But basket physics are unpredictable and deflectors rarely produce sustained green-blue contact. How to avoid: Fix the g...

[34] [43]

Review the LEARNED PHYSICS SKILLS listed in the system prompt above

[35] [44]

Identify which skill applies to the failure you just observed

[36] [45]

Explicitly state which skill you are applying and why in your next Thought. E.6 HEXA trajectory (catapult seed 45, 6 iterations, SUCCESS) --- Step 1 --- Thought: Let me start by describing the scene geometry to understand the layout before attempting any placement. Action: describe_scene_geometry({}) Observation: === Scene Geometry === World bounds: x in ...

2000

[37] [46]

Paired counterfactual evaluation follows: the same agent, on the same scene, under matched and perturbed physical conditions

An intervention API with snapshot/restore semantics.Any running simulation can be snapshot- ted at an arbitrary point (on a physics event, a contact trigger, or a fixed timestep) and restored into two or more branches that diverge under controlled perturbations (impulses, forces, parame- ter changes, object additions or removals). Paired counterfactual ev...

[38] [47]

Extensible level authoring as Python code.Levels are Python classes built from a composable object API, with tunable physics parameters (gravity, friction, restitution, density) and user-defined success conditions. Like Toybox’s reimplementation of Atari games (Foley et al., 2018), Interphyre reimplements the 2D physics puzzle paradigm established by PHYR...

2018

[39] [48]

sample, simulate, update

LLM-native interfaces and interpretability-ready data generation.The simulator and inter- vention API are exposed as a tool-call surface that an LLM agent can invoke directly. The same interfaces support standalone generation of paired (factual, counterfactual) trajectory data for downstream interpretability and offline reinforcement learning pipelines. A...

2019

[40] [49]

green_ball

uses LLM agents to actively probe interactive physics simulations and formulate hypotheses, the closest existing paradigm to Interphyre, though PhysGym targets discovery of physical equations rather than intuitive task-solving. LLMPhy (Cherian et al., 2024) demonstrated a simulator-in-the- loop framework where LLMs iteratively estimate physical parameters...

2024

[41] [53]

x": 0.5,

simulate_partial Description: Place a red ball and run the simulation only up to the specified step. Returns object positions and velocities at that point. Useful for observing mid-simulation dynamics. Arguments: x (float), y (float), radius (float), stop_step (int) Usage: Action: simulate_partial Action Input: {"x": 0.5, "y": 4.0, "radius": 0.6, "stop_st...

[42] [54]

Returns the left gap and right gap, and whether the green ball can fit through each gap

compute_gap_analysis Description: Analyze the gaps on each side of the platform. Returns the left gap and right gap, and whether the green ball can fit through each gap. Arguments: None Usage: Action: compute_gap_analysis
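To make the description of this tool concrete, here is a minimal sketch of a gap computation. The real tool takes no arguments and reads the scene itself; the function name `gap_analysis_sketch`, its explicit geometry arguments, and the rule that a ball fits when the gap exceeds its diameter are all assumptions made for illustration.

```python
def gap_analysis_sketch(platform_left, platform_right,
                        wall_left, wall_right, ball_radius):
    """Hypothetical gap check; geometry is passed in explicitly (assumption)."""
    left_gap = platform_left - wall_left       # free space left of the platform
    right_gap = wall_right - platform_right    # free space right of the platform
    diameter = 2.0 * ball_radius
    return {
        "left_gap": left_gap,
        "right_gap": right_gap,
        # Assumed fit rule: the gap must be strictly wider than the ball.
        "fits_left": left_gap > diameter,
        "fits_right": right_gap > diameter,
    }


result = gap_analysis_sketch(platform_left=-3.0, platform_right=4.0,
                             wall_left=-5.0, wall_right=5.0,
                             ball_radius=0.75)
```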

[43] [55]

finish Description: Submit your final answer. Use this when you are confident in your solution. Arguments: x (float), y (float), radius (float) Usage: Action: finish Action Input: {"x": 0.5, "y": 4.0, "radius": 0.6} To solve this puzzle, you will reason step-by-step and use tools to test your ideas. At each step, you MUST follow this exact format: Thought...

[44] [60]

Returns their coordinates, distance, on which side the blue ball is relative to green, and recommended red ball placement direction

compute_relative_positions Description: Analyze the positions of the green and blue balls. Returns their coordinates, distance, on which side the blue ball is relative to green, and recommended red ball placement direction. Arguments: None Usage: Action: compute_relative_positions

[45] [61]

finish Description: Submit your final answer. Use this when you are confident in your solution. Arguments: x (float), y (float), radius (float) Usage: Action: finish Action Input: {"x": 0.5, "y": 4.0, "radius": 0.6} To solve this puzzle, you will reason step-by-step and use tools to test your ideas. At each step, you MUST follow this exact format: Thought...

[46] [66]

Returns the center of the ramp

get_ramp_center Description: Analyze the pass_the_parcel setup. Returns the center of the ramp. Arguments: None Usage: Action: get_ramp_center

[47] [67]

finish Description: Submit your final answer. Use this when you are confident in your solution. Arguments: x (float), y (float), radius (float) Usage: Action: finish Action Input: {"x": 0.5, "y": 4.0, "radius": 0.6} To solve this puzzle, you will reason step-by-step and use tools to test your ideas. At each step, you MUST follow this exact format: Thought...

[48] [68]

No prescriptive advice; you interpret the layout to form a strategy

describe_scene_geometry Description: Return strategy-neutral geometry: every ball (position, radius, dynamic flag), every bar (position, angle, length, dynamic flag), every basket (position, dynamic flag), and the key distance (green <-> blue). No prescriptive advice; you interpret the layout to form a strategy. Arguments: None Usage: Action: describe_sce...

[49] [73]

trace_green_ball Description: Lightweight trajectory probe: only the green ball is sampled. Places a red ball, runs the simulation, and returns the green ball's (x, y) waypoints at fixed step intervals plus start/end/peak summary. Stops early once the green ball comes to rest (capped at ~600 steps). Use this when you only care about WHERE the green ba...

[50] [74]

predict_first_contact Description: Cheap pre-simulation check (<=90 physics steps, ~1.5s of sim time). Runs just long enough to find the FIRST object the red ball touches after it is released, and reports: placement validity, the other object's name, the step of impact, approach speed, approximate contact point, and surface normal. Use this to verify that...

[51] [75]

green_ball

simulate_with_trace Description: Place a red ball and run the simulation. Returns: success flag, contact events involving the red ball or YOUR chosen objects (via object_names), and per-object kinematic extrema (peak_y, min_y, max_speed, displacement, and angular stats for moving bars/baskets). You choose which objects to trace, e.g., ["green_ball"] to s...

[52] [76]

finish Description: Submit your final answer. Use this when you are confident in your solution. Arguments: x (float), y (float), radius (float) Usage: Action: finish Action Input: {"x": 0.5, "y": 4.0, "radius": 0.6} To solve this puzzle, you will reason step-by-step and use tools to test your ideas. At each step, you MUST follow this exact format: Thought...

[53] [81]

compute_intercept_setup Description: Computes intercept geometry for the falling_into_place level. Returns which platform the green ball is on, which direction it must travel to reach the jar, the platform edge it must cross, the gap center, and the estimated time before the jar reaches platform height. Arguments: None Usage: Action: compute_intercept_setup

[54] [82]

finish Description: Submit your final answer. Use this when you are confident in your solution. Arguments: x (float), y (float), radius (float) Usage: Action: finish Action Input: {"x": 0.5, "y": 4.0, "radius": 0.6} To solve this puzzle, you will reason step-by-step and use tools to test your ideas. At each step, you MUST follow this exact format: Thought...

[55] [87]

Returns the green ball position, basket position and scale, purple ground position, and recommended push direction to deflect the green ball away from the basket

compute_basket_analysis Description: Analyze the basket case setup. Returns the green ball position, basket position and scale, purple ground position, and recommended push direction to deflect the green ball away from the basket. Arguments: None Usage: Action: compute_basket_analysis

[56] [88]

finish Description: Submit your final answer. Use this when you are confident in your solution. Arguments: x (float), y (float), radius (float) Usage: Action: finish Action Input: {"x": 0.5, "y": 4.0, "radius": 0.6} To solve this puzzle, you will reason step-by-step and use tools to test your ideas. At each step, you MUST follow this exact format: Thought...

[57] [89]

Arguments: None Usage: Action: get_level_state

get_level_state Description: Get the current level layout including all object positions, sizes, and properties. Arguments: None Usage: Action: get_level_state

[58] [93]

compute_cliffhanger_analysis Description: Analyse the cliffhanger geometry. Returns the green bar's centre, length, and the (x, y) coordinates of its bottom point (resting on the platform) and top point (opposite end); the platform's left/right extents and top-surface y; the ceiling y and purple-ground y; the bar's distance to each platform edge; which e...

[59] [94]

finish Description: Submit your final answer. Use this when you are confident in your solution. Arguments: x (float), y (float), radius (float) Usage: Action: finish Action Input: {"x": 0.5, "y": 4.0, "radius": 0.6} To solve this puzzle, you will reason step-by-step and use tools to test your ideas. At each step, you MUST follow this exact format: Thought...

[60] [95]

Arguments: None Usage: Action: get_level_state

get_level_state Description: Get the current level layout including all object positions, sizes, and properties. Arguments: None Usage: Action: get_level_state

[61] [96]

simulate_action Description: Place a red ball at (x, y) with the given radius and run the full physics simulation to completion. Returns whether the goal was achieved, final positions of all objects, and total simulation steps. If the placement is invalid (out of bounds or overlaps), returns a detailed error with how far to move the ball. Arguments: x (f...

[62] [97]

Arguments: None Usage: Action: get_contact_log

get_contact_log Description: After running a simulation, returns the contact events: which objects touched and when. Arguments: None Usage: Action: get_contact_log

[63] [98]

simulate_partial Description: Place a red ball and run the simulation only up to the specified step. Returns object positions and velocities at that point. Useful for observing mid-simulation dynamics. Arguments: x (float), y (float), radius (float), stop_step (int) Usage: Action: simulate_partial Action Input: {"x": 0.5, "y": 4.0, "radius": 0.6, "stop_step": 50}
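The `Action:` / `Action Input:` usage format shown in these tool descriptions lends itself to simple parsing on the harness side. The sketch below is one possible parser, assuming each field sits on its own line and the input is single-line JSON; the function name `parse_tool_call` and the regular expressions are illustrative, and the paper's actual parsing logic is not shown in this excerpt.

```python
import json
import re

# Assumption: "Action:" and "Action Input:" each occupy their own line.
ACTION_RE = re.compile(r"^Action:\s*(\w+)\s*$", re.MULTILINE)
INPUT_RE = re.compile(r"^Action Input:\s*(\{.*\})\s*$", re.MULTILINE)


def parse_tool_call(text):
    """Return (tool_name, arguments) extracted from one agent step."""
    action = ACTION_RE.search(text)
    if action is None:
        raise ValueError("no 'Action:' line found")
    args_match = INPUT_RE.search(text)
    # Tools documented with "Arguments: None" omit the Action Input line.
    args = json.loads(args_match.group(1)) if args_match else {}
    return action.group(1), args


step = (
    "Thought: check the state mid-run.\n"
    "Action: simulate_partial\n"
    'Action Input: {"x": 0.5, "y": 4.0, "radius": 0.6, "stop_step": 50}'
)
name, args = parse_tool_call(step)
```

Tools such as `get_contact_log` that take no arguments parse to an empty dictionary, so the dispatcher can treat every call uniformly.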

[64] [99]

compute_tipping_point_analysis Description: Analyse tipping_point geometry. Returns the green bar's centre, length, angle, and the (x, y) coordinates of its top and bottom endpoints; the basket's centre and floor; the purple wall's x position and its top/bottom y; the purple wall's side relative to the green bar (LEFT or RIGHT); the horizontal distance fr...

[65] [100]

finish Description: Submit your final answer. Use this when you are confident in your solution. Arguments: x (float), y (float), radius (float) Usage: Action: finish Action Input: {"x": 0.5, "y": 4.0, "radius": 0.6} To solve this puzzle, you will reason step-by-step and use tools to test your ideas. At each step, you MUST follow this exact format: Thought...

[66] [103]

good steps within bad trajectories

A concrete actionable fix **Part 2 -- Partial insights**: Even in failed trajectories, some individual steps show CORRECT physics reasoning or useful discoveries (e.g., the agent found a valid placement region but then abandoned it, or correctly identified a mechanism but applied it with wrong parameters). Extract 1-2 skills from these "good steps within ...

[67] [105]

Hard constraints: - Maximum {max_skills} total skills for this level - Maximum {max_mistakes} total mistakes for this level

[68] [106]

For each skill you include: - If it's a RETAINED skill from the existing bank: set "is_new": false - If it's a NEW skill extracted from the new trajectories: set "is_new": true - Include "source_seeds" listing seed numbers where this skill was observed (required for confidence calibration) - Include "confidence": a float in [0.1, 1.0] representing your co...

[69] [108]

For new skills, estimate confidence based on: - Success rate among source trajectories (high success = high confidence) - Universality (applies to multiple seed conditions = higher confidence) - Clarity and actionability of the principle

[70] [109]

If a new trajectory confirms an existing skill, keep the existing one (possibly with slightly higher confidence)

Do not include duplicate skills. If a new trajectory confirms an existing skill, keep the existing one (possibly with slightly higher confidence)

[71] [110]

- Contradicted by the new trajectories - Too specific or rarely applicable - Low confidence (< 0.3) and not directly observed in new trajectories

Remove skills that are: - Redundant or subsumed by other skills. - Contradicted by the new trajectories - Too specific or rarely applicable - Low confidence (< 0.3) and not directly observed in new trajectories
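In the prompt above, pruning is delegated to the language model. For readers who want a deterministic approximation, the sketch below implements only the two mechanical rules: the low-confidence rule (below 0.3 and not observed in new trajectories) and the `max_skills` cap stated in the hard constraints. The function name `prune_skill_bank`, the `observed_titles` argument, and the choice to rank by confidence when applying the cap are assumptions; the redundancy, contradiction, and specificity criteria require judgment and are not modeled.

```python
def prune_skill_bank(skills, observed_titles, max_skills, min_confidence=0.3):
    """Drop low-confidence unobserved skills, then keep the top max_skills.

    skills: list of dicts with at least "title" and "confidence".
    observed_titles: titles directly observed in the new trajectories.
    """
    kept = [
        s for s in skills
        if s["confidence"] >= min_confidence or s["title"] in observed_titles
    ]
    # Assumption: when over the cap, prefer the most confident skills.
    kept.sort(key=lambda s: s["confidence"], reverse=True)
    return kept[:max_skills]


bank = [
    {"title": "Hit green from the left", "confidence": 0.8},
    {"title": "Use a large radius", "confidence": 0.2},
    {"title": "Drop from maximum height", "confidence": 0.25},
    {"title": "Aim at the bar tip", "confidence": 0.5},
]
pruned = prune_skill_bank(
    bank, observed_titles={"Drop from maximum height"}, max_skills=3
)
```

Here "Use a large radius" is removed because it is both low-confidence and unobserved, while "Drop from maximum height" survives despite low confidence because it was seen again.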

[72] [111]

Do not remove mistakes unless the new trajectories show they're no longer common. OUTPUT JSON OBJECT: {{ "skills": [ {{ "title": "<short name of skill>", "principle": "<2-3 sentence physics insight>", "when_to_apply": "<condition for applicability>", "example": "<optional concrete coordinate example>", "source_seeds": [<seed numbers>], "confidence": <floa...

[73] [112]

What exactly the agent did wrong

[74] [113]

WHY the agent made this error (what broken causal belief led to it)

[75] [114]

good steps within bad trajectories

A concrete actionable fix **Part 2 -- Partial insights**: Even in failed trajectories, some individual steps show CORRECT physics reasoning or useful discoveries (e.g., the agent found a promising placement region but then abandoned it, or correctly identified a mechanism but applied it with wrong parameters). Extract 1-2 skills from these "good steps wit...

[76] [115]

Output the COMPLETE FINAL skill bank (not a diff) -- include both retained existing skills and any new ones

[77] [116]

- Maximum {max_mistakes} total mistakes for this level

Hard constraints: - Maximum {max_skills} total skills for this level. - Maximum {max_mistakes} total mistakes for this level

[78] [117]

For each skill you include: - If it's a RETAINED skill from the existing bank: set "is_new": false. - If it's a NEW skill extracted from the new trajectories: set "is_new": true. - Include "source_seeds" listing seed numbers where this skill was observed (required for confidence calibration). - Include "confidence": a float in [0.1, 1.0] representing your...
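Since the model returns each skill as a JSON record, a harness would plausibly check the record before merging it into the bank. The sketch below validates the fields named in the prompt ("is_new", "source_seeds", and "confidence" in [0.1, 1.0]) along with the title, principle, and applicability fields of the output schema. The function name `validate_skill` and the exact error messages are assumptions; the paper's own validation, if any, is not shown in this excerpt.

```python
REQUIRED_FIELDS = (
    "title", "principle", "when_to_apply",
    "source_seeds", "confidence", "is_new",
)


def validate_skill(skill):
    """Return a list of problems found in one skill record (empty if valid)."""
    errors = [f"missing field: {k}" for k in REQUIRED_FIELDS if k not in skill]

    conf = skill.get("confidence")
    is_number = isinstance(conf, (int, float)) and not isinstance(conf, bool)
    if not is_number or not 0.1 <= conf <= 1.0:
        errors.append("confidence must be a number in [0.1, 1.0]")

    if not isinstance(skill.get("is_new"), bool):
        errors.append("is_new must be true or false")

    seeds = skill.get("source_seeds")
    if not isinstance(seeds, list) or not all(isinstance(s, int) for s in seeds):
        errors.append("source_seeds must be a list of seed numbers")

    return errors


good = {
    "title": "Strike below center",
    "principle": "Hitting the green ball below its center lifts it.",
    "when_to_apply": "When the green ball must clear an obstacle.",
    "source_seeds": [3, 7],
    "confidence": 0.6,
    "is_new": True,
}
problems = validate_skill(good)
```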

[79] [118]

For retained skills, preserve their existing confidence values (they've been validated)

[80] [119]

- Universality (applies across multiple seed conditions = higher confidence)

For new skills, estimate confidence based on: - Success rate among source trajectories (high success = high confidence). - Universality (applies across multiple seed conditions = higher confidence). - Clarity and actionability of the principle