
arxiv: 2605.26627 · v2 · pith:2UANGOPJnew · submitted 2026-05-26 · 📡 eess.SY · cs.RO · cs.SY

Breaking the Epistemic Trap: Active Perception Under Compound Uncertainty

Pith reviewed 2026-06-30 11:15 UTC · model grok-4.3

classification 📡 eess.SY · cs.RO · cs.SY
keywords epistemic trap, compound uncertainty, reinforcement learning, active perception, safety critical systems, information seeking, adaptive constraints, mutual information

The pith

Reinforcement learning agents face an epistemic trap where state uncertainty and dynamics uncertainty reinforce each other, producing performance drops far larger than their separate effects would suggest.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that the main obstacle to safe reinforcement learning in unfamiliar conditions is not any single uncertainty but the way state estimation and dynamics learning block each other. Without accurate dynamics knowledge an agent cannot maintain a reliable state estimate, and without a reliable state estimate it cannot learn the dynamics. In simulated locomotion tasks the combined uncertainties produced a 77 percent degradation, well above the 46 percent expected from adding the two effects separately. Conventional passive methods leave the agent unable to escape this loop. The authors therefore treat safety as an information problem and supply an architecture that actively gathers the missing information.
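
To make the gap concrete, the arithmetic behind the super-additivity claim can be sketched as follows. Only the 77 percent and 46 percent figures come from the paper; the function name `synergy_gap` is illustrative, and the two isolated degradations that sum to the additive prediction are not reported in the text reviewed here.

```python
def synergy_gap(observed: float, additive: float) -> tuple[float, float]:
    """Return (excess, ratio) of observed degradation over the additive prediction."""
    excess = observed - additive  # absolute gap, as a fraction of baseline performance
    ratio = observed / additive   # observed loss relative to the additive prediction
    return excess, ratio


# Figures quoted in the paper: 77% observed degradation, 46% additive prediction.
excess, ratio = synergy_gap(0.77, 0.46)
print(f"excess: {excess:.2f}")  # excess: 0.31
print(f"ratio: {ratio:.2f}")    # ratio: 1.67
```

On these figures the combined condition loses 31 percentage points more than additive reasoning predicts, roughly 1.67 times the additive estimate.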

Core claim

The central claim is that state uncertainty and dynamics uncertainty interact synergistically to form an epistemic trap, in which the agent cannot resolve either source of ignorance without first resolving the other. Proof-of-concept experiments show that the resulting performance loss (77 percent degradation) substantially exceeds the additive prediction (46 percent). Conventional passive robustness approaches cannot break the coupling. The Adaptive Safety Architecture addresses the trap with three elements: a mutual-information metric called the Compound Uncertainty Coefficient that measures the strength of the coupling, information-seeking policies driven by a MaxInfoRL objective that actively probe system dynamics, and regime-adaptive safety constraints that tighten automatically as epistemic coupling rises.

What carries the argument

The Epistemic Trap, the mutual reinforcement between state estimation and dynamics learning that prevents resolution of either without the other.

If this is right

  • Treating safety as an information problem allows agents to move from passive robustness to active perception that deliberately reduces epistemic coupling.
  • The Compound Uncertainty Coefficient supplies a concrete scalar that can be monitored in real time to adjust safety margins.
  • MaxInfoRL policies can replace waiting for the environment to reveal itself with deliberate actions that resolve dynamics uncertainty.
  • Regime-adaptive constraints provide a mechanism that automatically increases conservatism exactly when epistemic coupling is high.
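
The last point, conservatism that rises with epistemic coupling, can be illustrated with a minimal sketch. The paper as reviewed here gives no functional form for the constraints, so the linear-with-clamp rule, the function names, and the parameters `base_margin`, `gain`, and `max_margin` below are all assumptions made for illustration, as is the premise that κ is normalized to [0, 1].

```python
def adaptive_margin(kappa: float,
                    base_margin: float = 0.1,
                    gain: float = 0.5,
                    max_margin: float = 1.0) -> float:
    """Hypothetical safety margin that grows monotonically with the coupling estimate."""
    kappa = min(max(kappa, 0.0), 1.0)  # assumption: kappa is normalized to [0, 1]
    return min(base_margin + gain * kappa, max_margin)


def is_action_safe(distance_to_limit: float, kappa: float) -> bool:
    """Accept an action only if it stays outside the current margin."""
    return distance_to_limit >= adaptive_margin(kappa)


# The same action passes when coupling is low and is rejected when coupling is high.
print(is_action_safe(0.3, kappa=0.0))  # True
print(is_action_safe(0.3, kappa=1.0))  # False
```

Any monotone map from κ to a margin would serve the same illustrative purpose; the clamp simply keeps the margin bounded if the coupling estimate is noisy.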

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same coupling metric could be used to decide when to switch between exploration and exploitation in other partially observable control problems.
  • Physical deployment would need to check whether the active probing actions themselves create transient instability not captured in simulation.
  • The architecture offers a template for other domains where model and state uncertainties are known to interact, such as medical decision systems or process control.

Load-bearing premise

The synergistic interaction between state and dynamics uncertainty can be quantified via mutual information and resolved through information-seeking policies and regime-adaptive constraints without introducing new unmodeled failure modes.

What would settle it

An experiment in which the Adaptive Safety Architecture is applied to a new locomotion or control task and the observed degradation under combined uncertainties falls to or below the additive baseline while no additional failure modes appear from the active probing.

Original abstract

Deploying reinforcement learning in safety critical domains, from autonomous vehicles to medical decision support, is constrained by failures arising when systems encounter unfamiliar conditions. We argue that the fundamental bottleneck is not individual challenges like changing dynamics or incomplete observations, but their synergistic interaction, which we term the Epistemic Trap: agents cannot estimate their state without knowing system dynamics, nor learn dynamics without accurate state information. Proof-of-concept experiments in simulated locomotion reveal that combining these uncertainties causes failures far worse than either challenge alone, a 77% observed degradation against the 46% additive prediction, demonstrating that compounding failure modes can emerge and, when they do, far exceed what additive reasoning would predict. Conventional approaches typically adopt a passive epistemic stance that cannot resolve this coupled uncertainty. We propose reframing safety as an information problem. We introduce an Adaptive Safety Architecture built around three contributions. First, the Compound Uncertainty Coefficient ($\kappa$), a mutual-information based metric that quantifies how tightly state and dynamics uncertainties are coupled. Second, information-seeking policies governed by a MaxInfoRL objective that actively probe system dynamics rather than waiting for the environment to reveal itself passively. Third, regime adaptive safety constraints that tighten automatically as epistemic coupling rises. Together, these constitute a paradigm shift from passive robustness to active perception, offering a principled path toward decision making systems that operate under uncertainty, recognize their own ignorance, and act strategically to resolve it.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript argues that the synergistic interaction between state uncertainty and dynamics uncertainty creates an 'Epistemic Trap' that cannot be resolved by passive robustness methods. Proof-of-concept locomotion experiments are reported to show a 77% performance degradation under combined uncertainties versus a 46% additive baseline, motivating three contributions: the mutual-information-based Compound Uncertainty Coefficient (κ), the MaxInfoRL objective for active information-seeking policies, and regime-adaptive safety constraints within an Adaptive Safety Architecture.

Significance. If the reported super-additive degradation is reproducible and the proposed architecture demonstrably resolves the trap without new failure modes, the work would usefully shift emphasis in safe RL from passive robustness to active perception. The κ metric and MaxInfoRL objective could provide concrete tools for quantifying and acting on epistemic coupling, provided they are shown to be well-defined and parameter-light.

major comments (3)
  1. [Proof-of-concept experiments (abstract and §4)] The central empirical claim (77% observed degradation vs. 46% additive prediction) is presented without any description of the locomotion simulator or task, the precise methods used to instantiate state uncertainty and dynamics uncertainty independently and jointly, the performance metric whose degradation is measured, or the calculation yielding the additive baseline. This information is load-bearing for the motivation of the entire Adaptive Safety Architecture.
  2. [Compound Uncertainty Coefficient (κ) definition] The Compound Uncertainty Coefficient (κ) is defined in terms of mutual information between state and dynamics uncertainties, yet no explicit formula, normalization, or derivation appears; it is therefore impossible to determine whether κ is a new quantity or reduces to a standard mutual-information expression by construction.
  3. [Adaptive Safety Architecture (§3)] The MaxInfoRL objective and the regime-adaptive safety constraints are introduced as solutions to the epistemic trap, but no derivation, convergence argument, or analysis of potential new failure modes introduced by the active probing policy is supplied.
minor comments (2)
  1. Notation for the mutual-information terms inside κ should be introduced with an explicit equation rather than prose description only.
  2. The manuscript would benefit from a short related-work subsection contrasting κ with existing measures of epistemic uncertainty (e.g., information gain in Bayesian RL) to clarify novelty.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. The comments correctly identify areas where additional clarity is needed to support the central claims. We address each major comment below and commit to revisions that will strengthen the manuscript without altering its core contributions.

Point-by-point responses
  1. Referee: [Proof-of-concept experiments (abstract and §4)] The central empirical claim (77% observed degradation vs. 46% additive prediction) is presented without any description of the locomotion simulator or task, the precise methods used to instantiate state uncertainty and dynamics uncertainty independently and jointly, the performance metric whose degradation is measured, or the calculation yielding the additive baseline. This information is load-bearing for the motivation of the entire Adaptive Safety Architecture.

    Authors: We agree that the experimental setup is under-specified in the current version and that this detail is essential for evaluating the motivation. In the revised manuscript we will expand §4 with a complete description of the locomotion simulator and task, the independent and joint instantiation of state uncertainty (via observation noise) and dynamics uncertainty (via parameter randomization), the performance metric (cumulative reward), and the exact additive baseline calculation (sum of the two isolated degradations). revision: yes

  2. Referee: [Compound Uncertainty Coefficient (κ) definition] The Compound Uncertainty Coefficient (κ) is defined in terms of mutual information between state and dynamics uncertainties, yet no explicit formula, normalization, or derivation appears; it is therefore impossible to determine whether κ is a new quantity or reduces to a standard mutual-information expression by construction.

    Authors: The referee is correct that the explicit definition is absent. We will insert the formal definition κ = I(U_s ; U_d) / H(U_s , U_d) together with its normalization and a short derivation showing that it quantifies synergistic coupling beyond additive mutual information. This will appear in the revised §3. revision: yes

  3. Referee: [Adaptive Safety Architecture (§3)] The MaxInfoRL objective and the regime-adaptive safety constraints are introduced as solutions to the epistemic trap, but no derivation, convergence argument, or analysis of potential new failure modes introduced by the active probing policy is supplied.

    Authors: We acknowledge the absence of these supporting arguments. The revision will add (i) the derivation of the MaxInfoRL objective as a standard RL reward augmented by an information-gain term, (ii) a sketch of convergence under bounded uncertainty and sufficient exploration, and (iii) a discussion of possible new failure modes (e.g., over-probing) together with how the adaptive constraints limit them. These additions will be placed in §3. revision: yes
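
The third response describes the MaxInfoRL objective only as a standard RL reward augmented by an information-gain term. A minimal sketch of that structure follows. It is not the paper's objective: the use of ensemble disagreement as a stand-in for information gain, the weight `beta`, and both function names are assumptions introduced here.

```python
from statistics import pvariance


def disagreement_bonus(ensemble_predictions: list[list[float]]) -> float:
    """Proxy for information gain: mean per-dimension variance across ensemble members.

    ensemble_predictions[m][k] is model m's prediction for state dimension k.
    """
    per_dimension = zip(*ensemble_predictions)  # group values by state dimension
    variances = [pvariance(values) for values in per_dimension]
    return sum(variances) / len(variances)


def augmented_reward(task_reward: float,
                     ensemble_predictions: list[list[float]],
                     beta: float = 1.0) -> float:
    """Task reward plus a weighted information-gain proxy (hypothetical form)."""
    return task_reward + beta * disagreement_bonus(ensemble_predictions)


# Two hypothetical dynamics models that disagree on the first state dimension only.
predictions = [[0.0, 0.0], [2.0, 0.0]]
print(augmented_reward(1.0, predictions, beta=2.0))  # 2.0
```

When the ensemble members agree, the bonus vanishes and the agent optimizes the task reward alone, which is one way the over-probing concern raised in the response could be bounded.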

Circularity Check

0 steps flagged

No significant circularity; empirical observation and standard MI definition remain independent of proposed architecture.

Full rationale

The paper reports an empirical result (77% degradation vs 46% additive) from simulated locomotion experiments as motivation, then describes κ as a mutual-information quantity between state and dynamics uncertainties (a standard, externally defined measure) and introduces MaxInfoRL plus regime-adaptive constraints as new proposals. No equations, self-citations, or fitted parameters are shown that reduce the central claim or the new components back to the inputs by construction. The derivation chain is therefore self-contained and does not match any of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 2 invented entities

Abstract-only review prevents identification of specific fitted parameters or axioms; the Compound Uncertainty Coefficient and Epistemic Trap are introduced as new constructs without independent evidence supplied in the provided text.

invented entities (2)
  • Epistemic Trap (no independent evidence)
    purpose: Names the synergistic interaction between state and dynamics uncertainty
    Introduced in the abstract as the fundamental bottleneck; no independent falsifiable prediction given.
  • Compound Uncertainty Coefficient (κ) (no independent evidence)
    purpose: Quantifies coupling between state and dynamics uncertainties via mutual information
    Defined as a new metric in the abstract; no derivation or validation details available.
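
For orientation, the normalized form given in the simulated rebuttal, κ = I(U_s ; U_d) / H(U_s , U_d), can be computed directly when both uncertainties are treated as discrete random variables. The sketch below assumes exactly that; the discretization, the function names, and the convention of returning 0 when the joint entropy is zero are choices made here, not details from the paper.

```python
import math


def entropy(probs: list[float]) -> float:
    """Shannon entropy in bits, skipping zero-probability outcomes."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def compound_uncertainty_coefficient(joint: list[list[float]]) -> float:
    """kappa = I(U_s; U_d) / H(U_s, U_d) for a joint table joint[i][j] = P(U_s=i, U_d=j)."""
    p_s = [sum(row) for row in joint]        # marginal over U_d
    p_d = [sum(col) for col in zip(*joint)]  # marginal over U_s
    h_joint = entropy([p for row in joint for p in row])
    if h_joint == 0.0:
        return 0.0  # assumed convention: a deterministic joint carries no coupling
    mutual_info = entropy(p_s) + entropy(p_d) - h_joint
    return mutual_info / h_joint


independent = [[0.25, 0.25], [0.25, 0.25]]
fully_coupled = [[0.5, 0.0], [0.0, 0.5]]
print(compound_uncertainty_coefficient(independent))    # 0.0
print(compound_uncertainty_coefficient(fully_coupled))  # 1.0
```

With this normalization κ lies in [0, 1], which would make it usable as a monitored scalar; whether the paper's own definition shares that range cannot be determined from the text reviewed here.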



Appendix A Information-Theoretic Grounding for the Tractable Approximation

We show that $\sigma_\theta + \sigma_s$ constitutes a provable, monotone upper bound on $\kappa = I(s; \theta \mid b_t)$, and is there...