The Value of Mechanistic Priors in Sequential Decision Making
Pith reviewed 2026-05-12 03:23 UTC · model grok-4.3
The pith
With a mechanistic prior, Bayesian regret scales with the residual entropy H_mech, giving a theoretical sample-complexity reduction of H(μ)/H_mech over an uninformed baseline in sequential decision making.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We introduce the mechanistic information of a model (the mutual information between the model's recommended policy π̂ and the true optimal policy π*), quantified via an occupancy-weighted bias B_μ. In the asymptotic regime (large N), matched bounds reveal that Bayesian regret scales with the residual entropy H_mech, delivering a theoretical sample complexity reduction of H(μ)/H_mech compared to an uninformed baseline. We also provide a model certificate to determine empirical sample efficiency. In the burn-in regime (small N) we establish a lower bound on the penalty incurred by confidently wrong priors, and demonstrate both sets of bounds on 5-FU dosing simulations motivated by published FOLFOX pharmacokinetic data.
What carries the argument
Mechanistic information: the mutual information between the model's recommended policy and the true optimal policy, computed via the occupancy-weighted bias B_μ. This quantity fixes the residual entropy H_mech and therefore the value of the prior.
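The entropy accounting behind this claim can be sketched in a few lines. The sketch below assumes, following the relation H_mech = H(μ) − R_mech quoted further down this page, that μ can be treated as a discrete distribution over K candidate policies and that the mechanistic information R_mech is already known. The function names and the example numbers are hypothetical and are not taken from the paper.

```python
import math

def entropy_bits(probs):
    """Shannon entropy, in bits, of a discrete distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def residual_entropy(h_mu, mech_info):
    """H_mech = H(mu) - R_mech, clipped at zero (clipping is an assumption)."""
    return max(h_mu - mech_info, 0.0)

def reduction_factor(h_mu, h_mech):
    """Theoretical sample-complexity reduction H(mu) / H_mech."""
    if h_mech <= 0:
        return math.inf  # a prior that removes all uncertainty
    return h_mu / h_mech

# Hypothetical example: 8 equally likely candidate policies and a prior
# that carries 2 bits of mechanistic information.
mu = [1 / 8] * 8
h_mu = entropy_bits(mu)
h_mech = residual_entropy(h_mu, 2.0)
factor = reduction_factor(h_mu, h_mech)
print(h_mu, h_mech, factor)
```

In this toy case a prior that resolves two of the three bits of uncertainty leaves one bit of residual entropy, so the predicted reduction factor is three.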
If this is right
- Once the mechanistic prior is applied, Bayesian regret scales with the residual entropy H_mech, at the rate √(K N H_mech / log K) given in the paper passage quoted below.
- A model certificate can be computed to certify empirical sample efficiency from observable quantities.
- Confidently incorrect priors incur a bounded but positive penalty in the small-sample regime.
- Physically grounded priors retain higher mechanistic information than LLM priors on the same task.
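The burn-in penalty for a confidently wrong prior can be illustrated with a toy experiment. The sketch below is not the paper's 5-FU simulation and does not reproduce its lower bound: it is a two-armed Beta-Bernoulli bandit run with Thompson sampling, and the arm means, prior parameters, horizon, and number of seeds are all hypothetical choices made for illustration.

```python
import random

def thompson_regret(true_means, priors, horizon, seed):
    """Cumulative pseudo-regret of Beta-Bernoulli Thompson sampling."""
    rng = random.Random(seed)
    params = [list(p) for p in priors]  # [alpha, beta] for each arm
    best = max(true_means)
    regret = 0.0
    for _ in range(horizon):
        # Sample a plausible mean for each arm and play the largest.
        samples = [rng.betavariate(a, b) for a, b in params]
        arm = samples.index(max(samples))
        reward = 1 if rng.random() < true_means[arm] else 0
        params[arm][0] += reward
        params[arm][1] += 1 - reward
        regret += best - true_means[arm]
    return regret

true_means = [0.7, 0.3]                 # arm 0 is actually better
uninformed = [(1, 1), (1, 1)]           # flat priors
confident_wrong = [(1, 50), (50, 1)]    # strongly favours the worse arm

horizon, seeds = 20, range(200)
avg_uninformed = sum(thompson_regret(true_means, uninformed, horizon, s) for s in seeds) / len(seeds)
avg_wrong = sum(thompson_regret(true_means, confident_wrong, horizon, s) for s in seeds) / len(seeds)
print(avg_uninformed, avg_wrong)
```

With a short horizon the misleading prior keeps selecting the worse arm, because twenty observations are too few to overturn it. That is the small-N effect the bullet on confidently incorrect priors describes.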
Where Pith is reading between the lines
- The occupancy-weighted definition suggests a general recipe for injecting domain knowledge into any sequential decision problem where visitation frequencies can be estimated.
- Safety-critical applications should prefer physically derived priors over broad generative models precisely because the latter can erase mechanistic structure.
- The framework could be used to rank candidate priors by their expected H_mech before any online interaction begins.
Load-bearing premise
The mechanistic prior is sufficiently accurate and the occupancy measure μ accurately reflects policy overlap without introducing unaccounted bias, especially when data are scarce.
What would settle it
Measure empirical Bayesian regret in the 5-FU dosing simulation for an increasing number of patients N and check whether, once N is large, it tracks the predicted scaling with the residual entropy H_mech.
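One way to run such a check is to compare log-log slopes. The sketch assumes the rate √(K N H_mech / log K) quoted in the Lean-theorem section of this page, which implies a slope of about 1/2 in both N and H_mech. The regret values below are placeholders generated from the formula itself, only to show the mechanics; a real check would substitute regret measured in simulation. All names are hypothetical.

```python
import math

def predicted_rate(K, N, h_mech):
    """Quoted rate sqrt(K * N * H_mech / log K), constants omitted."""
    return math.sqrt(K * N * h_mech / math.log(K))

def loglog_slope(xs, ys):
    """Least-squares slope of log(y) against log(x)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    mx = sum(lx) / len(lx)
    my = sum(ly) / len(ly)
    num = sum((a - mx) * (b - my) for a, b in zip(lx, ly))
    den = sum((a - mx) ** 2 for a in lx)
    return num / den

# Placeholder data: replace `regret` with measured Bayesian regret.
Ns = [100, 200, 400, 800, 1600]
regret = [predicted_rate(K=8, N=n, h_mech=1.0) for n in Ns]
slope_in_N = loglog_slope(Ns, regret)

# The same check across hypothetical priors with different residual entropy.
h_values = [0.25, 0.5, 1.0, 2.0]
regret_h = [predicted_rate(K=8, N=1000, h_mech=h) for h in h_values]
slope_in_h = loglog_slope(h_values, regret_h)
print(slope_in_N, slope_in_h)
```

A measured slope far from 1/2 at large N, in either variable, would count as evidence against the quoted rate.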
Original abstract
Hybrid mechanistic models, physical priors with learned residuals, promise to reduce the data required for good decisions, but have no computable criterion to test this. We characterize the value of mechanistic priors in sequential decision-making within both asymptotic and burn-in regimes. To formalize this, we introduce the mechanistic information of a model -- the mutual information between the model's recommended policy $\hat{\pi}$ and the true optimal policy $\pi^*$ -- quantified via an occupancy-weighted bias $B_\mu$. In the asymptotic regime (large $N$), matched bounds reveal that Bayesian regret scales with the residual entropy $H_{\mathrm{mech}}$, delivering a theoretical sample complexity reduction of $H(\mu)/H_{\mathrm{mech}}$ compared to an uninformed baseline. Furthermore, we provide a model certificate to determine empirical sample efficiency. Complementarily, in the clinically relevant burn-in regime (small $N$), we establish a lower bound on the penalty incurred by confidently wrong priors. We demonstrate both the asymptotic and burn-in bounds across 5-fluorouracil (5-FU) dosing simulations motivated by published FOLFOX pharmacokinetic data, where a hybrid prior yields large sample-efficiency gains in the burn-in regime. Finally, we contrast these grounded models with LLM priors, demonstrating that LLMs can suffer severe losses in mechanistic information, thereby motivating the exclusive use of physically-grounded priors for safety-critical applications.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript claims to characterize the value of mechanistic priors in sequential decision making by introducing 'mechanistic information', defined as the mutual information I(π̂; π*) and quantified by an occupancy-weighted bias B_μ. In the asymptotic large-N regime, it asserts matched bounds showing that Bayesian regret scales with the residual entropy H_mech, yielding a sample complexity reduction of H(μ)/H_mech versus an uninformed baseline. It also gives a lower bound on the penalty for wrong priors in the small-N burn-in regime, provides a model certificate for empirical efficiency, and demonstrates gains in 5-FU dosing simulations motivated by FOLFOX PK data, while noting that LLM priors can have low mechanistic information.
Significance. If the bounds are rigorously established without circularity, this provides a novel theoretical framework for assessing data-efficiency gains from hybrid mechanistic models in RL, with direct relevance to clinical applications. The inclusion of both asymptotic and burn-in analyses, plus empirical validation on pharmacokinetic simulations, adds practical value. The reproducible simulation setup motivated by published data and the attention to the risks of LLM priors in safety-critical settings deserve explicit credit.
major comments (2)
- [Definition of mechanistic information and B_μ] The occupancy measure μ is described as the state-action occupancy of the policy recommended by the mechanistic model. This choice risks making B_μ and thus H_mech dependent on the prior itself, potentially rendering the regret scaling with H_mech and the reduction factor H(μ)/H_mech tautological rather than an independent prediction. Please provide a formal definition (e.g., Eq. for B_μ) and show that the mutual information remains unbiased or that the bounds hold for a fixed reference μ independent of the model.
- [Asymptotic regime analysis] The claim of matched upper and lower bounds on Bayesian regret scaling with H_mech is central but lacks visible derivation steps or exact definitions of H_mech in the abstract. In the section presenting these bounds, include the key equations and proof outline to allow verification that the scaling is not an artifact of the definition.
minor comments (2)
- [Abstract] The abstract mentions a 'model certificate to determine empirical sample efficiency' but provides no details; expand briefly or reference the relevant section.
- [Simulations] For the 5-FU dosing simulations, specify the number of independent runs, exact controls for the uninformed baseline, and any hyperparameter choices to ensure reproducibility.
Simulated Author's Rebuttal
We thank the referee for their constructive feedback, which highlights important aspects of our theoretical framework. We address each major comment below and will incorporate revisions to enhance clarity and rigor.
- Referee: [Definition of mechanistic information and B_μ] The occupancy measure μ is described as the state-action occupancy of the policy recommended by the mechanistic model. This choice risks making B_μ and thus H_mech dependent on the prior itself, potentially rendering the regret scaling with H_mech and the reduction factor H(μ)/H_mech tautological rather than an independent prediction. Please provide a formal definition (e.g., Eq. for B_μ) and show that the mutual information remains unbiased or that the bounds hold for a fixed reference μ independent of the model.
Authors: We acknowledge the potential for circularity in the current presentation and agree that clarification is needed. In the revised manuscript, we will provide the formal definition of B_μ as the occupancy-weighted bias with respect to a fixed reference occupancy measure μ, chosen independently of the mechanistic model (e.g., the occupancy induced by the true optimal policy π* or a baseline policy). This ensures that the mechanistic information I(π̂; π*) is defined with respect to an external reference, avoiding dependence on the prior. We will also demonstrate that the regret bounds hold under this fixed μ, confirming they are not tautological. revision: yes
- Referee: [Asymptotic regime analysis] The claim of matched upper and lower bounds on Bayesian regret scaling with H_mech is central but lacks visible derivation steps or exact definitions of H_mech in the abstract. In the section presenting these bounds, include the key equations and proof outline to allow verification that the scaling is not an artifact of the definition.
Authors: We agree that additional detail on the derivations would improve verifiability. In the revised version, we will expand the section on the asymptotic regime to include the precise definition of the residual entropy H_mech (as the entropy of the residual uncertainty after incorporating the mechanistic prior) and outline the key steps in the proofs for both the upper and lower bounds on Bayesian regret. This will explicitly show how the scaling with H_mech arises from information-theoretic arguments and is independent of definitional artifacts. revision: yes
Circularity Check
Regret scaling with H_mech and reduction H(μ)/H_mech reduce to the definition of mechanistic information via model-induced μ
specific steps
- self-definitional
[Abstract (mechanistic information definition and asymptotic regime claim)]
"we introduce the mechanistic information of a model -- the mutual information between the model's recommended policy π̂ and the true optimal policy π* -- quantified via an occupancy-weighted bias B_μ. In the asymptotic regime (large N), matched bounds reveal that Bayesian regret scales with the residual entropy H_mech, delivering a theoretical sample complexity reduction of H(μ)/H_mech compared to an uninformed baseline."
H_mech is the residual entropy after subtracting the mechanistic information I(π̂; π*), which is itself quantified by B_μ using the occupancy measure μ induced by the model's recommended policy π̂. Substituting the model's own occupancy for a reference measure alters B_μ and therefore H_mech, so the scaling of regret with H_mech and the explicit reduction factor H(μ)/H_mech are obtained by construction from the prior's definition rather than derived independently.
full rationale
The paper introduces mechanistic information as I(π̂; π*) quantified by occupancy-weighted bias B_μ, then states that matched bounds show Bayesian regret scales with residual entropy H_mech (entropy after subtracting this I) and yields sample-complexity reduction H(μ)/H_mech. Because μ is the state-action occupancy of the policy recommended by the mechanistic model itself, both B_μ and the resulting H_mech are constructed from the prior's own output. The claimed asymptotic scaling and reduction factor are therefore equivalent to the amount of information the prior was defined to capture, rather than an independent first-principles prediction. The burn-in lower bound on wrong priors stands separately and does not exhibit this reduction.
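The dependence described here can be made concrete with a small numerical sketch. It uses the B_μ formula quoted in the Lean-theorem section of this page and compares two weightings: an occupancy concentrated where the model happens to agree with the truth, standing in for a model-induced μ, and a fixed uniform reference. The value gaps and both occupancy vectors are hypothetical numbers chosen for illustration, not results from the paper.

```python
import math

def occupancy_weighted_bias(mu, value_gap):
    """B_mu = sqrt(sum_k mu_k * gap_k**2), gap_k = J(pi_k; M*) - J(pi_k; M_hat)."""
    return math.sqrt(sum(m * g ** 2 for m, g in zip(mu, value_gap)))

# Hypothetical value gaps between the true and modelled return
# for three candidate policies.
value_gap = [0.0, 0.5, 1.0]

# Hypothetical occupancy induced by the model's own recommendation:
# most weight falls on the policy the model gets right.
mu_model = [0.8, 0.1, 0.1]

# Hypothetical fixed reference, chosen without consulting the model.
mu_reference = [1 / 3, 1 / 3, 1 / 3]

b_model = occupancy_weighted_bias(mu_model, value_gap)
b_reference = occupancy_weighted_bias(mu_reference, value_gap)
print(b_model, b_reference)
```

The same model errors yield a smaller bias under the model-induced weighting than under the fixed reference. This is why the choice of μ matters for whether H_mech is an independent quantity.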
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: The decision problem is a Markov decision process where the mechanistic component provides a structured prior over dynamics or rewards.
invented entities (3)
- mechanistic information: no independent evidence
- occupancy-weighted bias B_μ: no independent evidence
- residual entropy H_mech: no independent evidence
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
  unclear: Relation between the paper passage and the cited Recognition theorem.
  mechanistic information R_mech = I_μ(π*; π̂) quantified via occupancy-weighted bias B_μ ... residual entropy H_mech = H(μ) − R_mech ... Bayesian regret scales with √(K N H_mech / log K)
- IndisputableMonolith/Foundation/BranchSelection.lean · branch_selection · unclear
  unclear: Relation between the paper passage and the cited Recognition theorem.
  B_μ = √(Σ_k μ(π_k) (J(π_k; M*) − J(π_k; M̂))²) ... channel capacity C(B_μ) = (d_F/2) log(1 + κ_μ² σ_F² / (κ_μ² B_μ² + σ²))
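The two quoted formulas can be evaluated directly. The sketch below reads the garbled passage as B_μ = √(Σ_k μ(π_k)(J(π_k; M*) − J(π_k; M̂))²) and C(B_μ) = (d_F/2) log(1 + κ_μ² σ_F² / (κ_μ² B_μ² + σ²)). Both the subscript placement and the use of the natural logarithm are assumptions. The parameter names and example values are hypothetical.

```python
import math

def occupancy_weighted_bias(mu, j_true, j_model):
    """B_mu: root of the mu-weighted squared gap between true and modelled returns."""
    return math.sqrt(sum(m * (a - b) ** 2 for m, a, b in zip(mu, j_true, j_model)))

def channel_capacity(b_mu, d_f, kappa_mu, sigma_f, sigma):
    """C(B_mu) = (d_F / 2) * log(1 + k^2 s_F^2 / (k^2 B^2 + s^2)), natural log assumed."""
    k2 = kappa_mu ** 2
    return 0.5 * d_f * math.log(1.0 + k2 * sigma_f ** 2 / (k2 * b_mu ** 2 + sigma ** 2))

# Hypothetical two-policy example: the model is right about the first
# policy and overestimates the second by one unit of return.
mu = [0.5, 0.5]
j_true = [1.0, 0.0]
j_model = [1.0, 1.0]

b = occupancy_weighted_bias(mu, j_true, j_model)
cap_unbiased = channel_capacity(0.0, d_f=2, kappa_mu=1.0, sigma_f=1.0, sigma=1.0)
cap_biased = channel_capacity(b, d_f=2, kappa_mu=1.0, sigma_f=1.0, sigma=1.0)
print(b, cap_unbiased, cap_biased)
```

Under this reading the bias enters the denominator as extra noise, so a larger B_μ lowers the capacity and with it the information the model can convey about the optimal policy.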
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.