Including the Cost of Irreducible Uncertainty in the Policy Compression Framework
Pith reviewed 2026-06-27 05:15 UTC · model grok-4.3
The pith
Extending the policy compression framework to include the cost of conditional entropy sharpens optimal policies while keeping their exponential form.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Redefining cognitive cost as the mutual information I(S;A) plus eta times the conditional entropy H(A|S) yields an optimal policy that keeps the standard exponential form and sharpens as eta increases. Policy precision can then vary more independently of reward sensitivity, which implies that the original framework may underestimate the cognitive cost of action selection under uncertainty.
What carries the argument
The augmented cognitive cost: mutual information between states and actions plus eta times conditional entropy of actions given states.
If this is right
- The optimal policy retains the exponential form but becomes sharper as eta increases.
- Policy precision can now vary more independently of reward sensitivity.
- The standard framework may underestimate the cognitive cost of selecting actions under uncertainty.
- The extension has potential to better account for observed biases in human decision-making.
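The sharpening effect in the list above can be illustrated numerically. The following is a minimal sketch, not the paper's implementation: it assumes the optimal policy takes the form pi(a|s) ∝ p(a)^(1/(1-eta)) * exp(r(s,a) / (lambda * (1-eta))) for 0 <= eta < 1, which is what a standard Lagrangian argument would suggest for this objective, and it iterates Blahut-Arimoto-style updates between the policy and the action marginal. The function names, the toy reward table, and the parameter values are all hypothetical. The alternating updates converge to a stationary point, which need not be the global optimum.

```python
import math


def optimal_policy(reward, p_s, lam, eta, n_iter=200):
    """Alternating updates for E[r] - lam * (I(S;A) + eta * H(A|S)).

    Assumed update (a sketch, not quoted from the paper):
        pi(a|s) ∝ p(a)**(1/(1-eta)) * exp(reward[s][a] / (lam * (1-eta)))
    """
    if lam <= 0.0 or not 0.0 <= eta < 1.0:
        raise ValueError("need lam > 0 and 0 <= eta < 1")
    n_s, n_a = len(reward), len(reward[0])
    scale = 1.0 / (1.0 - eta)
    p_a = [1.0 / n_a] * n_a  # start from a uniform action marginal
    pi = [[1.0 / n_a] * n_a for _ in range(n_s)]
    for _ in range(n_iter):
        for s in range(n_s):
            logits = [
                scale * (math.log(max(p_a[a], 1e-300)) + reward[s][a] / lam)
                for a in range(n_a)
            ]
            top = max(logits)  # subtract the max for numerical stability
            weights = [math.exp(x - top) for x in logits]
            total = sum(weights)
            pi[s] = [w / total for w in weights]
        # Refresh the marginal p(a) = sum_s p(s) * pi(a|s).
        p_a = [sum(p_s[s] * pi[s][a] for s in range(n_s)) for a in range(n_a)]
    return pi, p_a


def conditional_entropy(pi, p_s):
    """H(A|S) in nats, using the convention 0 * log 0 = 0."""
    return -sum(
        p_s[s] * p * math.log(p)
        for s in range(len(pi))
        for p in pi[s]
        if p > 0.0
    )


# Hypothetical toy task: three states, one rewarded action per state.
reward = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
p_s = [1.0 / 3.0] * 3

for eta in (0.0, 0.3, 0.6):
    pi, _ = optimal_policy(reward, p_s, lam=0.5, eta=eta)
    print(f"eta={eta:.1f}  H(A|S)={conditional_entropy(pi, p_s):.3f} nats")
```

On this symmetric toy task the printed conditional entropy falls as eta rises, which is the sense in which the policy becomes sharper. Setting eta to 0 recovers the usual policy compression update.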
Where Pith is reading between the lines
- Fitting the model to human data will require methods to estimate the extra parameter eta.
- The framework could generate new predictions linking conditional entropy directly to measured reaction times.
- Individual differences in eta might be testable against independent measures of cognitive capacity.
Load-bearing premise
Conditional entropy represents irreducible uncertainty that incurs its own cognitive cost because it modulates reaction times.
What would settle it
Behavioral data in which adding the eta-weighted conditional-entropy term fails to improve fits to human choices or reaction times relative to the original mutual-information cost alone.
Original abstract
AI decision-support systems can benefit from anticipating biases in human decision-making. Many such biases may arise from human cognitive limitations. The policy compression framework models decision-making as a trade-off between reward maximization and the cognitive cost of encoding state-dependent action policies, formalized as the mutual information between states and actions (policy complexity). We argue that this account is incomplete because it treats conditional entropy--the irreducible uncertainty about which action should be selected given a state--as costless, even though empirical evidence suggests that it modulates reaction times. We therefore extend the framework by defining cognitive cost as the sum of policy complexity and a weighted conditional-entropy term, governed by a new parameter, $\eta$. The resulting optimal policy retains the standard exponential form but becomes sharper as $\eta$ increases, allowing policy precision to vary more independently of reward sensitivity. This modification implies that the standard policy compression framework may underestimate the cognitive cost of action selection, and it has the potential to better account for biases in human decision-making. At the same time, it introduces additional complexity for fitting the model to human data, which future work will need to address.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript extends the policy compression framework by redefining cognitive cost as λ(I(S;A) + η H(A|S)) rather than λ I(S;A) alone, on the grounds that conditional entropy represents irreducible uncertainty with an empirically supported cognitive cost. It claims that the optimal policy under this objective retains the standard exponential (softmax) form while increasing in sharpness with η, thereby decoupling policy precision from reward sensitivity and potentially improving accounts of human decision biases. The work notes that the extension adds a free parameter η whose estimation will require future methodological attention.
Significance. If the derivation holds, the extension supplies a tractable mechanism for policy precision to vary independently of reward sensitivity, which could strengthen models of cognitive biases in decision support systems. A clear strength is the reported preservation of the closed-form exponential policy, which maintains analytical continuity with prior work in the framework. The significance is tempered by the ad-hoc status of the conditional-entropy cost axiom and the added fitting burden acknowledged in the abstract.
major comments (2)
- [Abstract] Abstract: the claim that the optimal policy 'retains the standard exponential form but becomes sharper as η increases' is load-bearing for the central contribution, yet the abstract provides neither the explicit policy equation nor the steps showing how the η H(A|S) term modifies the standard softmax without changing its functional class; verification requires the derivation to be exhibited.
- [Abstract] Abstract (and implied main-text derivation): introduction of the free parameter η directly modulates policy sharpness, creating a circularity risk when the model is fit to behavioral data on precision; the manuscript does not supply an independent identification strategy (e.g., from reaction-time measures) that would allow η to be fixed before testing policy predictions.
minor comments (1)
- [Abstract] The abstract states that the modification 'has the potential to better account for biases' but does not indicate which specific biases or data sets would be re-analyzed; a short illustrative example would clarify the intended empirical payoff.
Simulated Author's Rebuttal
We thank the referee for their constructive comments on our manuscript. We respond to each major comment below, indicating revisions where the manuscript will be updated to improve clarity and address limitations.
- Referee: [Abstract] Abstract: the claim that the optimal policy 'retains the standard exponential form but becomes sharper as η increases' is load-bearing for the central contribution, yet the abstract provides neither the explicit policy equation nor the steps showing how the η H(A|S) term modifies the standard softmax without changing its functional class; verification requires the derivation to be exhibited.
  Authors: We agree that the abstract would be strengthened by including the explicit policy form to make the central claim immediately verifiable. The main text derives the optimal policy by augmenting the standard Lagrangian with the η H(A|S) term, which enters as an additive state-dependent shift that preserves the exponential (softmax) functional class while rescaling the effective inverse temperature. We will revise the abstract to state the resulting policy equation explicitly and briefly indicate that the η term modifies the normalization without altering the exponential structure.
  Revision: yes
- Referee: [Abstract] Abstract (and implied main-text derivation): introduction of the free parameter η directly modulates policy sharpness, creating a circularity risk when the model is fit to behavioral data on precision; the manuscript does not supply an independent identification strategy (e.g., from reaction-time measures) that would allow η to be fixed before testing policy predictions.
  Authors: We acknowledge the identification challenge raised. The manuscript already notes in the abstract that the extension introduces additional complexity for fitting to human data, which future work must address. No independent identification strategy (such as fixing η from reaction-time data) is supplied in the present theoretical paper. We will expand the discussion section to explicitly flag this as a limitation and outline possible empirical approaches for separating η from λ in subsequent studies.
  Revision: partial
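The identification concern in this exchange can be made concrete with a small sketch. It rests on two assumptions that are not taken from the paper: that the optimal policy has the form pi(a|s) ∝ p(a)^(1/(1-eta)) * exp(r(s,a) / (lambda * (1-eta))), and that the task is symmetric enough for the action marginal p(a) to be uniform, so that the marginal drops out. Under those assumptions only the product lambda * (1-eta) affects choices. The function name and the numbers below are hypothetical.

```python
import math


def softmax_policy(reward_row, lam, eta):
    """pi(a|s) ∝ exp(r / (lam * (1 - eta))), assuming a uniform action marginal."""
    temperature = lam * (1.0 - eta)  # the only combination that enters the policy
    logits = [r / temperature for r in reward_row]
    top = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(x - top) for x in logits]
    total = sum(weights)
    return [w / total for w in weights]


reward_row = [1.0, 0.0, 0.0]  # hypothetical rewards for one state

# Two different (lam, eta) pairs sharing lam * (1 - eta) = 0.5.
pi_with_eta = softmax_policy(reward_row, lam=1.0, eta=0.5)
pi_without_eta = softmax_policy(reward_row, lam=0.5, eta=0.0)

print(pi_with_eta)
print(pi_without_eta)
```

The two printed policies coincide, so choice data from such a task could not separate η from λ. When the action marginal is not uniform, the exponent on p(a) might in principle help tell them apart, and an independent constraint such as reaction-time data would still be needed in the symmetric case.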
Circularity Check
No significant circularity; the derivation is a self-contained mathematical extension.
The paper modifies the objective to E[r] − λ(I(S;A) + η H(A|S)) and states that the resulting optimal policy retains the standard exponential form while becoming sharper with η. This is a direct consequence of the Lagrangian optimization under the new cost definition; the form follows from the same variational steps used in the original framework and does not reduce to a fit or self-citation by construction. No load-bearing self-citations, uniqueness theorems, or renamed empirical patterns are invoked. The parameter η is introduced explicitly as a modeling choice, and the paper defers empirical fitting to future work. The central claim is therefore independent of its inputs.
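For readers who want the variational step spelled out, the following is a sketch of how the exponential form could follow from the stated objective. It is a reconstruction under assumptions, not the paper's own derivation: the notation r(s,a) for reward, p(s) for the state distribution, p(a) for the action marginal, and the restriction 0 ≤ η < 1 are all introduced here for illustration.

```latex
\begin{align}
  % Objective as stated in the rationale, written out over the policy
  \mathcal{J}[\pi] &= \sum_{s} p(s) \sum_{a} \pi(a \mid s)\, r(s,a)
      - \lambda \bigl[ I(S;A) + \eta\, H(A \mid S) \bigr], \\
  % Using I(S;A) = H(A) - H(A|S)
  I(S;A) + \eta\, H(A \mid S) &= H(A) - (1-\eta)\, H(A \mid S), \\
  % Stationarity in \pi(a|s) with a normalization multiplier \mu(s)
  0 &= r(s,a) + \lambda \log p(a) - \lambda (1-\eta) \log \pi(a \mid s) + \mu(s), \\
  % Resulting policy, valid for 0 \le \eta < 1
  \pi^{*}(a \mid s) &\propto p(a)^{\frac{1}{1-\eta}}
      \exp\!\left( \frac{r(s,a)}{\lambda (1-\eta)} \right).
\end{align}
```

Under this sketch, η = 0 recovers the standard policy compression solution π*(a|s) ∝ p(a) exp(r(s,a)/λ), and larger η divides the effective temperature by a smaller factor (1 − η), which sharpens the policy without leaving the exponential family.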
Axiom & Free-Parameter Ledger
free parameters (1)
- η: weight on the conditional-entropy term H(A|S) in the cognitive cost
axioms (2)
- Domain assumption: Policy complexity is measured by mutual information between states and actions.
- Ad hoc to paper: Conditional entropy represents irreducible uncertainty that has a cognitive cost proportional to reaction times.