Recognition: 2 theorem links
· Lean Theorem
MaD Physics: Evaluating information seeking under constraints in physical environments
Pith reviewed 2026-05-12 03:48 UTC · model grok-4.3
The pith
The MaD Physics benchmark evaluates how well agents infer physical laws from budget-constrained measurements, using altered laws to limit recall of known physics.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
MaD Physics is a benchmark consisting of three environments based on distinct physical laws with alterations to reduce knowledge contamination. In each trial, an agent makes measurements until its budget is exhausted, then infers the underlying law to predict the system's future state. This setup directly evaluates the capabilities of inferring models from data and planning under constraints.
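The trial protocol described above can be sketched as a small loop. This is a minimal illustration under stated assumptions, not the benchmark's implementation: `ToyEnvironment`, `QUALITY_TIERS`, `run_trial`, the tier costs and noise levels, and the constant-acceleration system are all hypothetical stand-ins chosen to keep the example short.

```python
import random
from dataclasses import dataclass, field

# Hypothetical tiers: name -> (cost per measurement, noise standard deviation).
QUALITY_TIERS = {"low": (2.0, 0.5), "medium": (5.0, 0.2), "high": (10.0, 0.05)}


@dataclass
class ToyEnvironment:
    """Stand-in system x(t) = 0.5 * accel * t**2 with an accel hidden from the agent."""
    accel: float
    budget: float
    rng: random.Random = field(default_factory=lambda: random.Random(0))

    def measure(self, t, quality):
        """Return a noisy position at time t and charge the tier's cost."""
        cost, noise = QUALITY_TIERS[quality]
        if cost > self.budget:
            raise ValueError("measurement exceeds remaining budget")
        self.budget -= cost
        return self.true_state(t) + self.rng.gauss(0.0, noise)

    def true_state(self, t):
        return 0.5 * self.accel * t**2


def run_trial(env, quality, horizon):
    """Measure until the budget is exhausted, fit accel, then predict x(horizon)."""
    cost, _ = QUALITY_TIERS[quality]
    samples = []
    t = 0.0
    while env.budget >= cost:
        t += 1.0
        samples.append((t, env.measure(t, quality)))
    # Least-squares fit of x = accel * (0.5 * t**2) through the origin.
    num = sum(0.5 * ti**2 * xi for ti, xi in samples)
    den = sum((0.5 * ti**2) ** 2 for ti, _ in samples)
    accel_hat = num / den
    return 0.5 * accel_hat * horizon**2, len(samples)
```

For instance, `run_trial(ToyEnvironment(accel=3.0, budget=50.0), "high", horizon=10.0)` spends the whole budget on five high-quality measurements before predicting. A real agent would also choose when and what to measure; this sketch fixes both to isolate the budget accounting.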
What carries the argument
The MaD Physics benchmark with its measurement budget protocol and altered physical law environments.
If this is right
- MaD Physics allows systematic testing of how agents trade off measurement quality against quantity.
- It identifies specific shortcomings of current models in structured exploration.
- The benchmark can be used to evaluate multimodality and in-context learning in scientific tasks.
- Improvements in agents on this benchmark would indicate better handling of resource-constrained discovery.
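The quality-versus-quantity trade-off in the first point can be made concrete with a simple case: estimating one constant quantity by averaging repeated measurements, where n measurements with noise standard deviation sigma give a standard error of sigma / sqrt(n). The sketch below is illustrative only; the tier names, costs, and noise levels in `TIERS` are assumptions, not values taken from the paper.

```python
import math

# Hypothetical tiers: name -> (cost per measurement, noise standard deviation).
TIERS = {"low": (2.0, 0.5), "medium": (5.0, 0.2), "high": (10.0, 0.05)}


def standard_error(budget, cost, noise_std):
    """Standard error of the mean when the whole budget buys a single tier."""
    n = int(budget // cost)
    if n == 0:
        return math.inf  # The tier is unaffordable, so it yields no information.
    return noise_std / math.sqrt(n)


def rank_tiers(budget):
    """Return (tier, measurement count, standard error), most precise first."""
    rows = []
    for name, (cost, noise) in TIERS.items():
        n = int(budget // cost)
        rows.append((name, n, standard_error(budget, cost, noise)))
    return sorted(rows, key=lambda row: row[2])
```

With these assumed numbers, `rank_tiers(100.0)` puts the high-quality tier first, while `rank_tiers(8.0)` puts the medium tier first because no high-quality measurement is affordable. The ordering depends entirely on the assumed costs and noise levels, which is why the choice requires planning rather than a fixed rule.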
Where Pith is reading between the lines
- Such benchmarks could guide development of AI systems that assist in actual lab experiments with cost limits.
- Connections might exist to fields like active learning or experimental design optimization.
- Testing on more models or real-world proxies would validate its usefulness beyond the current evaluations.
Load-bearing premise
That using altered physical laws effectively prevents agents from relying on pre-existing knowledge and that the benchmark environments accurately represent the challenges of scientific discovery under physical constraints.
What would settle it
Evidence that models which excel at MaD Physics still fail to make discoveries in actual constrained physical experiments, or that the alterations are insufficient to block knowledge recall.
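One way to picture what blocking knowledge recall means is to compare a predictor that recalls a textbook law with one that fits the law from data. The sketch below is a hedged illustration, not the paper's method: it assumes a central-force system whose acceleration follows a = k / r^p, with a hypothetical altered exponent p = 1.5 in place of the familiar inverse-square p = 2. The function names and all numeric values are assumptions.

```python
import math
import random


def simulate(exponent, strength, radii, noise_frac, seed=0):
    """Noisy accelerations a = strength / r**exponent with multiplicative noise."""
    rng = random.Random(seed)
    return [strength / r**exponent * (1.0 + rng.gauss(0.0, noise_frac)) for r in radii]


def fit_power_law(radii, accels):
    """Least-squares line through (log r, log a); returns (exponent, strength)."""
    xs = [math.log(r) for r in radii]
    ys = [math.log(a) for a in accels]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    # log a = log k - p * log r, so the exponent is the negated slope.
    return -slope, math.exp(my - slope * mx)


def relative_error(exponent, strength, true_exponent, true_strength, r):
    """Relative prediction error at radius r for a candidate power law."""
    pred = strength / r**exponent
    truth = true_strength / r**true_exponent
    return abs(pred - truth) / truth


# Hypothetical altered law (p = 1.5) observed at a few radii with 1% noise.
radii = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 8.0, 10.0]
accels = simulate(1.5, 4.0, radii, noise_frac=0.01)
p_hat, k_hat = fit_power_law(radii, accels)
fitted_err = relative_error(p_hat, k_hat, 1.5, 4.0, r=20.0)
recalled_err = relative_error(2.0, 4.0, 1.5, 4.0, r=20.0)
```

In this toy setting the recalled inverse-square law extrapolates poorly to r = 20, while the fitted law stays close to the truth. If an agent scored equally well on altered and standard laws without measuring, that would suggest the alteration was not doing its job.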
Original abstract
Scientific discovery is fundamentally a resource-constrained process that requires navigating complex trade-offs between the quality and quantity of measurements due to physical and cost constraints. Measurements drive the scientific process by revealing novel phenomena to improve our understanding. Existing benchmarks for evaluating agents for scientific discovery focus on either static knowledge-based reasoning or unconstrained experimental design tasks, and do not capture the ability to make measurements and plan under constraints. To bridge this gap, we propose Measuring and Discovering Physics (MaD Physics), a benchmark to evaluate the ability of agents to make informative measurements and conclusions subject to constraints on the quality and quantity of measurements. The benchmark consists of three environments, each based on a distinct physical law. To mitigate contamination from existing knowledge, MaD Physics includes altered physical laws. In each trial, the agent makes measurements of the system until it exhausts an allotted budget and then the agent has to infer the underlying physical law to make predictions about the state of the system in the future. MaD Physics evaluates two fundamental capabilities of scientific agents: inferring models from data and planning under constraints. We also demonstrate how MaD Physics can be used to evaluate other capabilities such as multimodality and in-context learning. We benchmark agents on MaD Physics using four Gemini models (2.5 Flash Lite, 2.5 Flash, 2.5 Pro, and 3 Flash), identifying shortcomings in their structured exploration and data collection capabilities and highlighting directions to improve their scientific reasoning.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces MaD Physics, a benchmark with three environments based on distinct (altered) physical laws. Agents must make a budgeted sequence of measurements, infer the underlying law from the resulting data, and use the inferred model to predict future system states. The work evaluates four Gemini models (2.5 Flash Lite, 2.5 Flash, 2.5 Pro, 3 Flash) on this task, reports shortcomings in structured exploration and data collection, and claims the benchmark isolates two core capabilities: model inference from data and planning under measurement constraints.
Significance. A well-validated benchmark that enforces explicit measurement budgets and uses altered laws to reduce contamination would address a genuine gap between static knowledge benchmarks and unconstrained experimental-design tasks. The proposal to evaluate multimodality and in-context learning within the same framework is a constructive extension. However, the absence of any quantitative validation that the altered laws actually block prior-knowledge leakage, combined with missing implementation details and environment verification, prevents the current manuscript from establishing that the measured performance reflects the claimed capabilities rather than retrieval plus adaptation.
major comments (2)
- [Abstract and §3] Abstract and §3 (Benchmark Design): the central claim that agents must 'infer the underlying physical law purely from budgeted measurements' rests on the assertion that altered laws mitigate contamination from pre-training. No ablation, performance gap between altered vs. standard laws, or knowledge-free baseline is reported. Without such evidence the inference task cannot be shown to isolate data-driven model discovery rather than pattern retrieval plus adaptation.
- [§4 and §5] §4 (Experiments) and §5 (Results): the manuscript provides only high-level Gemini performance summaries and does not include environment validation (e.g., ground-truth measurement distributions, budget calibration, or checks that the prediction task requires the inferred law). Full implementation details are unavailable, so it is impossible to verify that the three environments actually enforce the claimed constraints or isolate the two target capabilities.
minor comments (2)
- [Abstract and §5] The abstract states that MaD Physics 'can be used to evaluate other capabilities such as multimodality and in-context learning,' yet the results section does not present separate metrics or controls for these capabilities; the reported Gemini numbers appear to conflate all factors.
- [§3] Notation for measurement budgets, quality/quantity trade-offs, and the exact form of the prediction task is introduced without a compact mathematical definition or pseudocode, making it difficult to replicate the exact interaction protocol.
Simulated Author's Rebuttal
We thank the referee for the constructive review and for identifying key areas where additional evidence and detail are needed to substantiate the benchmark's claims. We address each major comment below and will incorporate the suggested revisions.
Point-by-point responses
- Referee: [Abstract and §3] Abstract and §3 (Benchmark Design): the central claim that agents must 'infer the underlying physical law purely from budgeted measurements' rests on the assertion that altered laws mitigate contamination from pre-training. No ablation, performance gap between altered vs. standard laws, or knowledge-free baseline is reported. Without such evidence the inference task cannot be shown to isolate data-driven model discovery rather than pattern retrieval plus adaptation.
  Authors: We acknowledge that the manuscript currently lacks an ablation study, a direct performance comparison between altered and standard physical laws, or a knowledge-free baseline. This means the claim that altered laws successfully mitigate contamination is not yet quantitatively supported. In the revised manuscript we will add an ablation experiment reporting agent performance on both the altered laws and their standard-physics counterparts, together with a baseline that receives no measurement data. These additions will provide the missing evidence that the benchmark isolates data-driven model inference.
  Revision: yes
- Referee: [§4 and §5] §4 (Experiments) and §5 (Results): the manuscript provides only high-level Gemini performance summaries and does not include environment validation (e.g., ground-truth measurement distributions, budget calibration, or checks that the prediction task requires the inferred law). Full implementation details are unavailable, so it is impossible to verify that the three environments actually enforce the claimed constraints or isolate the two target capabilities.
  Authors: We agree that the current sections provide insufficient validation and implementation detail. The revised manuscript will expand §4 and §5 to include ground-truth measurement distributions, explicit budget-calibration procedures, and verification that accurate future-state prediction requires the inferred law rather than other cues. We will also release the full benchmark code and environment specifications to allow independent verification of the constraints and the isolation of the two target capabilities.
  Revision: yes
Circularity Check
No circularity: benchmark proposal with independent design choices
full rationale
The paper proposes MaD Physics, a new benchmark consisting of three environments based on altered physical laws, to evaluate agents on model inference from budgeted measurements and planning under constraints. No derivations, equations, or first-principles results are presented that reduce to fitted parameters, self-citations, or inputs by construction. The altered laws are an explicit design choice to address contamination, not a self-defined or renamed result. Agent benchmarking with Gemini models constitutes external evaluation rather than any internal prediction or uniqueness theorem. The work is self-contained as an evaluation framework without load-bearing self-referential steps.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: Altered physical laws can be used to test inference without prior knowledge contamination
- domain assumption: Agents can infer underlying laws from limited measurements and make future predictions
invented entities (1)
- MaD Physics benchmark environments (no independent evidence)
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean, reality_from_one_distinction (unclear): The benchmark consists of three environments, each based on a distinct physical law. To mitigate contamination from existing knowledge, MaD Physics includes altered physical laws.
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel (unclear): Every observation incurs a resource cost C(o_k, σ_k). This cost function is strictly increasing with fidelity...