Including the Cost of Irreducible Uncertainty in the Policy Compression Framework

Álvaro Garrido-Pérez; Amrapali Pednekar; Pieter Simoens; Yara Khaluf

arxiv: 2606.13132 · v1 · pith:6K52SXXPnew · submitted 2026-06-11 · 🧬 q-bio.NC

Pith reviewed 2026-06-27 05:15 UTC · model grok-4.3

keywords policy compression · cognitive cost · conditional entropy · decision making · mutual information · human biases · reaction times

The pith

Extending the policy compression framework to include the cost of conditional entropy sharpens optimal policies while keeping their exponential form.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims the standard policy compression framework is incomplete because it treats conditional entropy as costless even though evidence links it to reaction times. It redefines total cognitive cost as the usual mutual information between states and actions plus a new weighted term eta times conditional entropy. The optimal policy that trades this cost against reward remains the familiar exponential form but grows sharper as eta rises. This change decouples how precisely a policy selects actions from how sensitive it is to reward differences. The authors note the extension may better capture human decision biases but adds a parameter that future fitting work must handle.

Core claim

Redefining cognitive cost as mutual information I(S;A) plus eta times conditional entropy H(A|S) yields an optimal policy of the standard exponential form whose sharpness increases with eta, allowing policy precision to vary independently of reward sensitivity and implying the original framework underestimates the cognitive cost of action selection under uncertainty.

What carries the argument

The augmented cognitive cost: mutual information between states and actions plus eta times conditional entropy of actions given states.
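As a rough illustration of this augmented cost, the sketch below computes I(S;A) and H(A|S) for a tabular policy and combines them as I(S;A) + η H(A|S). The function names `entropy_terms` and `augmented_cost`, the two-state example, and the choice of nats as the unit are assumptions made here for illustration; they are not taken from the paper.

```python
import numpy as np

def entropy_terms(p_s, pi):
    """Return (I(S;A), H(A|S)) in nats for a state prior p_s and policy pi[s, a]."""
    p_s = np.asarray(p_s, dtype=float)
    pi = np.asarray(pi, dtype=float)
    p_a = p_s @ pi  # marginal action distribution P(a)

    def h(p):
        # Shannon entropy, skipping zero-probability entries
        p = p[p > 0]
        return -np.sum(p * np.log(p))

    h_a = h(p_a)
    h_a_given_s = sum(p_s[s] * h(pi[s]) for s in range(len(p_s)))
    # I(S;A) = H(A) - H(A|S)
    return h_a - h_a_given_s, h_a_given_s

def augmented_cost(p_s, pi, eta):
    """Cognitive cost I(S;A) + eta * H(A|S); eta = 0 recovers the standard cost."""
    mi, h_cond = entropy_terms(p_s, pi)
    return mi + eta * h_cond

# Hypothetical example: two equiprobable states, two actions.
p_s = np.array([0.5, 0.5])
pi_uniform = np.array([[0.5, 0.5], [0.5, 0.5]])  # state-independent, random
pi_sharp = np.array([[1.0, 0.0], [0.0, 1.0]])    # state-dependent, deterministic

for name, pi in [("uniform", pi_uniform), ("sharp", pi_sharp)]:
    print(name, augmented_cost(p_s, pi, eta=0.0), augmented_cost(p_s, pi, eta=0.5))
```

In this toy case the uniform policy is free under the standard cost (η = 0) but costs 0.5 · ln 2 nats once η = 0.5, while the deterministic state-dependent policy costs ln 2 nats in both cases because its conditional entropy is zero.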

If this is right

The optimal policy retains the exponential form but becomes sharper as eta increases.
Policy precision can now be varied independently of reward sensitivity.
The standard framework may underestimate the cognitive cost of selecting actions under uncertainty.
The extension has potential to better account for observed biases in human decision-making.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Fitting the model to human data will require methods to estimate the extra parameter eta.
The framework could generate new predictions linking conditional entropy directly to measured reaction times.
Individual differences in eta might be testable against independent measures of cognitive capacity.

Load-bearing premise

Conditional entropy represents irreducible uncertainty that incurs its own cognitive cost because it modulates reaction times.

What would settle it

Behavioral data in which adding the eta-weighted conditional-entropy term fails to improve fits to human choices or reaction times relative to the original mutual-information cost alone.

Figures

Figures reproduced from arXiv: 2606.13132 by Álvaro Garrido-Pérez, Amrapali Pednekar, Pieter Simoens, Yara Khaluf.

**Figure 1.** Examples of three policies: π1 (A), π2 (B), and π3 (C). For any non-deterministic state distribution (i.e., any distribution other than one in which a single state has P(si) = 1 and all others have probability zero), policy π1 has non-zero policy complexity, whereas policies π2 and π3 always have zero policy complexity.
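The figure itself is not reproduced here, so the caption's claim can only be illustrated with stand-in policies. The sketch below uses three hypothetical policies (a state-dependent deterministic mapping, a constant action, and a state-independent uniform choice) that are assumed, not confirmed, to resemble π1, π2, and π3, and evaluates policy complexity I(S;A) under a mixed and a deterministic state distribution.

```python
import numpy as np

def policy_complexity(p_s, pi):
    """Mutual information I(S;A) in nats for state prior p_s and policy pi[s, a]."""
    p_s = np.asarray(p_s, dtype=float)
    pi = np.asarray(pi, dtype=float)
    p_a = p_s @ pi                       # marginal P(a)
    joint = p_s[:, None] * pi            # P(s, a)
    indep = p_s[:, None] * p_a[None, :]  # P(s) P(a)
    mask = joint > 0                     # zero-probability cells contribute nothing
    return float(np.sum(joint[mask] * np.log(joint[mask] / indep[mask])))

# Hypothetical stand-ins; the actual policies in Figure 1 may differ.
pi_dependent = np.array([[1.0, 0.0], [0.0, 1.0]])  # action depends on state
pi_constant = np.array([[1.0, 0.0], [1.0, 0.0]])   # always the same action
pi_uniform = np.array([[0.5, 0.5], [0.5, 0.5]])    # random, ignores the state

p_mixed = np.array([0.7, 0.3])          # non-deterministic state distribution
p_deterministic = np.array([1.0, 0.0])  # a single state has probability 1

for name, pi in [("dependent", pi_dependent),
                 ("constant", pi_constant),
                 ("uniform", pi_uniform)]:
    print(name, policy_complexity(p_mixed, pi), policy_complexity(p_deterministic, pi))
```

Only the state-dependent policy has non-zero complexity, and only when the state distribution is non-deterministic. The constant and uniform policies both have zero complexity but differ in conditional entropy, which is the quantity the extended framework assigns a cost to.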

AI decision-support systems can benefit from anticipating biases in human decision-making. Many such biases may arise from human cognitive limitations. The policy compression framework models decision-making as a trade-off between reward maximization and the cognitive cost of encoding state-dependent action policies, formalized as the mutual information between states and actions (policy complexity). We argue that this account is incomplete because it treats conditional entropy--the irreducible uncertainty about which action should be selected given a state--as costless, even though empirical evidence suggests that it modulates reaction times. We therefore extend the framework by defining cognitive cost as the sum of policy complexity and a weighted conditional-entropy term, governed by a new parameter, $\eta$. The resulting optimal policy retains the standard exponential form but becomes sharper as $\eta$ increases, allowing policy precision to vary more independently of reward sensitivity. This modification implies that the standard policy compression framework may underestimate the cognitive cost of action selection, and it has the potential to better account for biases in human decision-making. At the same time, it introduces additional complexity for fitting the model to human data, which future work will need to address.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

They add a new parameter η weighting conditional entropy as extra cognitive cost in the policy compression objective, keeping the exponential policy but letting precision vary separately from reward sensitivity.

The core move is straightforward: they change the cost from λ I(S;A) to λ (I(S;A) + η H(A|S)) and show the optimal policy stays softmax-shaped but gets sharper as η grows. That is the actual novelty—no prior version of the framework had this separate knob for irreducible uncertainty.
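The claim that the policy stays exponential can be checked with a short calculation. What follows is a sketch rather than the paper's own derivation: it assumes a reward function $Q(s,a)$, a state distribution $P(s)$, an inverse temperature $\beta = 1/\lambda$, a multiplier $\mu(s)$ enforcing $\sum_a \pi(a|s) = 1$, and $0 \le \eta < 1$ (all notation introduced here for illustration), and it uses the identity $I(S;A) = H(A) - H(A|S)$. The paper may parameterize the result differently.

```latex
\begin{align}
J[\pi] &= \sum_s P(s) \sum_a \pi(a|s)\, Q(s,a)
          - \lambda \big( I(S;A) + \eta\, H(A|S) \big) \\
       &= \sum_s P(s) \sum_a \pi(a|s)\, Q(s,a)
          - \lambda\, H(A) + \lambda (1-\eta)\, H(A|S), \\
0 &= P(s) \big[ Q(s,a) + \lambda \log P(a)
          - \lambda (1-\eta) \log \pi(a|s) \big]
          + \lambda \eta\, P(s) + \mu(s), \\
\pi^*(a|s) &\propto \exp\!\left( \frac{\beta\, Q(s,a) + \log P(a)}{1-\eta} \right),
\qquad P(a) = \sum_s P(s)\, \pi^*(a|s).
\end{align}
```

Under these assumptions, η = 0 recovers the standard form π*(a|s) ∝ P(a) exp(β Q(s,a)), and the factor 1/(1 − η) grows with η, so the same exponential family is retained while the policy sharpens. The third line is only a stationarity condition; it does not by itself establish that the stationary point is a global optimum.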

It is a clean theoretical tweak. The authors correctly note that the standard account treats H(A|S) as free even though reaction-time data suggest otherwise, and the math works out without breaking the exponential form. That part is solid and worth having on record.

The soft spots are exactly what the abstract flags. There is no new data or fit to existing datasets here, so the claim that this will better explain biases rests on the untested assumption that conditional entropy really functions as an additive cost. Adding η also increases the fitting burden, which they acknowledge but do not solve. The circularity concern the reader raised is real: policy sharpness now depends on a fitted parameter whose value is not independently constrained.

This is for people already working inside the policy-compression or bounded-rationality literature in neuroscience and AI. A reader who cares about those models will want to see the derivation and think about whether η is worth the extra parameter. It is not a major shift, but the extension is internally consistent and addresses a gap that was easy to overlook.

I would send it to peer review. The math is transparent, the motivation is clear, and the limitation is stated up front; referees can decide whether the empirical case for the new term holds up.

Referee Report

2 major / 1 minor

Summary. The manuscript extends the policy compression framework by redefining cognitive cost as λ(I(S;A) + η H(A|S)) rather than λ I(S;A) alone, on the grounds that conditional entropy represents irreducible uncertainty with an empirically supported cognitive cost. It claims that the optimal policy under this objective retains the standard exponential (softmax) form while increasing in sharpness with η, thereby decoupling policy precision from reward sensitivity and potentially improving accounts of human decision biases. The work notes that the extension adds a free parameter η whose estimation will require future methodological attention.

Significance. If the derivation holds, the extension supplies a tractable mechanism for policy precision to vary independently of reward sensitivity, which could strengthen models of cognitive biases in decision support systems. A clear strength is the reported preservation of the closed-form exponential policy, which maintains analytical continuity with prior work in the framework. The significance is tempered by the ad-hoc status of the conditional-entropy cost axiom and the added fitting burden acknowledged in the abstract.

major comments (2)

[Abstract] The claim that the optimal policy 'retains the standard exponential form but becomes sharper as η increases' is load-bearing for the central contribution, yet the abstract provides neither the explicit policy equation nor the steps showing how the η H(A|S) term modifies the standard softmax without changing its functional class; verification requires the derivation to be exhibited.
[Abstract, and implied main-text derivation] The free parameter η directly modulates policy sharpness, creating a circularity risk when the model is fit to behavioral data on precision; the manuscript does not supply an independent identification strategy (e.g., from reaction-time measures) that would allow η to be fixed before testing policy predictions.
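The identification concern in the second comment can be made concrete with a small numerical sketch. It assumes a policy of the form π(a|s) ∝ exp((Q(s,a)/λ + log P(a)) / (1 − η)) with 0 ≤ η < 1, which is one form consistent with the stated objective but is not confirmed to be the paper's equation. It also holds the action marginal P(a) fixed rather than solving for it self-consistently. The function name `policy_given_marginal` and all numbers are hypothetical.

```python
import numpy as np

def policy_given_marginal(q, p_a, lam, eta):
    """Assumed form: pi(a|s) proportional to exp((Q/lam + log P(a)) / (1 - eta))."""
    logits = (q / lam + np.log(p_a)) / (1.0 - eta)
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    pi = np.exp(logits)
    return pi / pi.sum(axis=1, keepdims=True)

q = np.eye(2)  # hypothetical rewards: action i is correct in state i

# With a uniform marginal, only the ratio (1/lam) / (1 - eta) matters.
uniform = np.array([0.5, 0.5])
pi_1 = policy_given_marginal(q, uniform, lam=1.0, eta=0.5)
pi_2 = policy_given_marginal(q, uniform, lam=0.5, eta=0.0)
print(np.allclose(pi_1, pi_2))

# With a skewed marginal, the same two parameter pairs give different policies.
skewed = np.array([0.8, 0.2])
pi_3 = policy_given_marginal(q, skewed, lam=1.0, eta=0.5)
pi_4 = policy_given_marginal(q, skewed, lam=0.5, eta=0.0)
print(np.allclose(pi_3, pi_4))
```

Under this assumed form, choice data from tasks with a uniform action marginal could not separate η from λ, while tasks with a skewed marginal might. Whether this carries over to the paper's actual policy equation would need to be checked against the derivation.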

minor comments (1)

[Abstract] The abstract states that the modification 'has the potential to better account for biases' but does not indicate which specific biases or data sets would be re-analyzed; a short illustrative example would clarify the intended empirical payoff.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments on our manuscript. We respond to each major comment below, indicating revisions where the manuscript will be updated to improve clarity and address limitations.

Referee: [Abstract] The claim that the optimal policy 'retains the standard exponential form but becomes sharper as η increases' is load-bearing for the central contribution, yet the abstract provides neither the explicit policy equation nor the steps showing how the η H(A|S) term modifies the standard softmax without changing its functional class; verification requires the derivation to be exhibited.

Authors: We agree that the abstract would be strengthened by including the explicit policy form to make the central claim immediately verifiable. The main text derives the optimal policy by augmenting the standard Lagrangian with the η H(A|S) term, which enters as an additive state-dependent shift that preserves the exponential (softmax) functional class while rescaling the effective inverse temperature. We will revise the abstract to state the resulting policy equation explicitly and briefly indicate that the η term modifies the normalization without altering the exponential structure.

revision: yes

Referee: [Abstract, and implied main-text derivation] The free parameter η directly modulates policy sharpness, creating a circularity risk when the model is fit to behavioral data on precision; the manuscript does not supply an independent identification strategy (e.g., from reaction-time measures) that would allow η to be fixed before testing policy predictions.

Authors: We acknowledge the identification challenge raised. The manuscript already notes in the abstract that the extension introduces additional complexity for fitting to human data, which future work must address. No independent identification strategy (such as fixing η from reaction-time data) is supplied in the present theoretical paper. We will expand the discussion section to explicitly flag this as a limitation and outline possible empirical approaches for separating η from λ in subsequent studies.

revision: partial

Circularity Check

0 steps flagged

No significant circularity; derivation is self-contained mathematical extension

The paper modifies the objective to E[r] − λ(I(S;A) + η H(A|S)) and states that the resulting optimal policy retains the standard exponential form while becoming sharper with η. This is a direct consequence of the Lagrangian optimization under the new cost definition; the form follows from the same variational steps used in the original framework and does not reduce to a fit or self-citation by construction. No load-bearing self-citations, uniqueness theorems, or renamed empirical patterns are invoked. The parameter η is introduced explicitly as a modeling choice, and the paper defers empirical fitting to future work. The central claim is therefore independent of its inputs.
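To make the stated objective E[r] − λ(I(S;A) + η H(A|S)) tangible, the sketch below runs a Blahut-Arimoto-style fixed-point iteration. It assumes, without confirmation from the paper, that the stationarity condition takes the form π(a|s) ∝ exp((Q(s,a)/λ + log P(a)) / (1 − η)) with 0 ≤ η < 1. The function names, the identity reward matrix, and the iteration count are illustrative choices.

```python
import numpy as np

def optimal_policy(q, p_s, lam, eta, n_iter=200):
    """Fixed-point iteration for the assumed eta-augmented policy form."""
    if not 0.0 <= eta < 1.0:
        raise ValueError("this sketch assumes 0 <= eta < 1")
    n_s, n_a = q.shape
    pi = np.full((n_s, n_a), 1.0 / n_a)  # start from the uniform policy
    for _ in range(n_iter):
        p_a = p_s @ pi  # update the action marginal P(a)
        logits = (q / lam + np.log(p_a + 1e-300)) / (1.0 - eta)
        logits -= logits.max(axis=1, keepdims=True)  # numerical stability
        pi = np.exp(logits)
        pi /= pi.sum(axis=1, keepdims=True)
    return pi

def conditional_entropy(p_s, pi):
    """H(A|S) in nats; zero-probability entries contribute nothing."""
    safe = np.where(pi > 0, pi, 1.0)
    return float(-np.sum(p_s[:, None] * pi * np.log(safe)))

q = np.eye(2)               # hypothetical rewards: action i is correct in state i
p_s = np.array([0.5, 0.5])  # equiprobable states

pi_eta0 = optimal_policy(q, p_s, lam=1.0, eta=0.0)
pi_eta5 = optimal_policy(q, p_s, lam=1.0, eta=0.5)
print(pi_eta0[0, 0], conditional_entropy(p_s, pi_eta0))
print(pi_eta5[0, 0], conditional_entropy(p_s, pi_eta5))
```

In this symmetric example the probability of the correct action rises from about 0.73 at η = 0 to about 0.88 at η = 0.5, and the conditional entropy falls accordingly. The iteration only finds a fixed point of the assumed stationarity condition; it is not guaranteed to be the global optimum, and for other parameter settings it may settle on a different fixed point.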

Axiom & Free-Parameter Ledger

1 free parameter · 2 axioms · 0 invented entities

The extension rests on one new free parameter η and the domain assumption that conditional entropy incurs cognitive cost, plus the standard mutual information definition.

free parameters (1)

η
Weighting parameter for the conditional entropy term in the total cognitive cost.

axioms (2)

domain assumption: Policy complexity is measured by mutual information between states and actions.
This is the standard assumption in the policy compression framework being extended.
ad hoc to paper: Conditional entropy represents irreducible uncertainty that has a cognitive cost proportional to reaction times.
This is the key new assumption introduced in the paper based on empirical evidence.

pith-pipeline@v0.9.1-grok · 5742 in / 1318 out tokens · 25227 ms · 2026-06-27T05:15:39.986922+00:00 · methodology

Reference graph

Works this paper leans on

27 extracted references

[1]

Resource-rational analysis: Understanding human cognition as the optimal use of limited computational resources

Lieder F, Griffiths TL. Resource-rational analysis: Understanding human cognition as the optimal use of limited computational resources. Behavioral and brain sciences. 2020;43:e1

2020
[2]

Policy compression: An information bottleneck in action selection

Lai L, Gershman SJ. Policy compression: An information bottleneck in action selection. In: Psychology of learning and motivation. vol. 74. Elsevier; 2021. p. 195-232

2021
[3]

Origin of perseveration in the trade-off between reward and complexity

Gershman SJ. Origin of perseveration in the trade-off between reward and complexity. Cognition. 2020;204:104394

2020
[4]

The reward-complexity trade-off in schizophrenia

Gershman SJ, Lai L. The reward-complexity trade-off in schizophrenia. Computational Psychiatry. 2021;5(1):38

2021
[5]

Value-complexity tradeoff explains mouse navigational learning

Amir N, Suliman-Lavie R, Tal M, Shifman S, Tishby N, Nelken I. Value-complexity tradeoff explains mouse navigational learning. PLOS Computational Biology. 2020;16(12):e1008497

2020
[6]

Time and memory costs jointly determine a speed–accuracy trade-off and set-size effects

Liu S, Lai L, Gershman SJ, Bari BA. Time and memory costs jointly determine a speed–accuracy trade-off and set-size effects. Journal of Experimental Psychology: General. 2025

2025
[7]

Policy complexity suppresses dopamine responses

Gershman SJ, Lak A. Policy complexity suppresses dopamine responses. Journal of Neuroscience. 2025;45(9)

2025
[8]

Rate distortion theory

Cover TM, Thomas JA. Rate distortion theory. Elements of information theory. 1991:336-73

1991
[9]

Bayesian reinforcement learning with limited cognitive load

Arumugam D, Ho MK, Goodman ND, Van Roy B. Bayesian reinforcement learning with limited cognitive load. Open Mind. 2024;8:395-438

2024
[10]

Dichotomous thinking and cognitive ability

Mieda T, Taku K, Oshio A. Dichotomous thinking and cognitive ability. Personality and Individual Differences. 2021;169:110008

2021
[11]

An information-theoretic perspective on the costs of cognition

Zenon A, Solopchuk O, Pezzulo G. An information-theoretic perspective on the costs of cognition. Neuropsychologia. 2019;123:5-18

2019
[12]

Cognitive effort and active inference

Parr T, Holmes E, Friston KJ, Pezzulo G. Cognitive effort and active inference. Neuropsychologia. 2023;184:108562

2023
[13]

A reinforcement learning diffusion decision model for value-based decisions

Fontanesi L, Gluth S, Spektor MS, Rieskamp J. A reinforcement learning diffusion decision model for value-based decisions. Psychonomic bulletin & review. 2019;26(4):1099-121

2019
[14]

Human decision making balances reward maximization and policy compression

Lai L, Gershman SJ. Human decision making balances reward maximization and policy compression. PLOS Computational Biology. 2024;20(4):e1012057

2024
[15]

Action chunking as conditional policy compression

Lai L, Huang AZ, Gershman SJ. Action chunking as conditional policy compression. Cognition. 2025;264:106201

2025
[16]

Cognitive Effort in the Two-Step Task: An Active Inference Drift-Diffusion Model Approach

Garrido-Pérez Á, Lemoine V, Pednekar A, Khaluf Y, Simoens P. Cognitive Effort in the Two-Step Task: An Active Inference Drift-Diffusion Model Approach. In: International Workshop on Active Inference. Springer; 2025. p. 24-44

2025
[17]

Neural correlates of span capacity during visual discrimination under varying cognitive demands

Yao ZF, Yang MH, Hsieh S. Neural correlates of span capacity during visual discrimination under varying cognitive demands. Scientific reports. 2025;15(1):31071

2025
[18]

Trading mental effort for confidence in the metacognitive control of value-based decision-making

Lee DG, Daunizeau J. Trading mental effort for confidence in the metacognitive control of value-based decision-making. elife. 2021;10:e63282

2021
[19]

An algorithm for computing the capacity of arbitrary discrete memoryless channels

Arimoto S. An algorithm for computing the capacity of arbitrary discrete memoryless channels. IEEE Transactions on Information Theory. 1972;18(1):14-20

1972
[20]

Computation of channel capacity and rate-distortion functions

Blahut RE, et al. Computation of channel capacity and rate-distortion functions. IEEE Trans Inf Theory. 1972;18(4):460-73

1972
[21]

Ten simple rules for the computational modeling of behavioral data

Wilson RC, Collins AG. Ten simple rules for the computational modeling of behavioral data. elife. 2019;8:e49547

2019
[22]

Modeling the influence of working memory, reinforcement, and action uncertainty on reaction time and choice during instrumental learning

McDougle SD, Collins AG. Modeling the influence of working memory, reinforcement, and action uncertainty on reaction time and choice during instrumental learning. Psychonomic bulletin & review. 2021;28(1):20-39

2021
[23]

A step-by-step tutorial on active inference and its application to empirical data

Smith R, Friston KJ, Whyte CJ. A step-by-step tutorial on active inference and its application to empirical data. Journal of mathematical psychology. 2022;107:102632

2022
[24]

Decision, Inference, and Information: Formal Equivalences Under Active Inference

Sweeney P, Ruiz-Serra J, Harré MS. Decision, Inference, and Information: Formal Equivalences Under Active Inference. Entropy. 2025;28(1):1

2025
[25]

Leveraging artificial intelligence to improve people’s planning strategies

Callaway F, Jain YR, van Opheusden B, Das P, Iwama G, Gul S, et al. Leveraging artificial intelligence to improve people’s planning strategies. Proceedings of the National Academy of Sciences. 2022;119(12):e2117432119

2022
[26]

The nature and development of cognitive offloading in children

Armitage KL, Gilbert SJ. The nature and development of cognitive offloading in children. Child Development Perspectives. 2025;19(2):108-15

2025
[27]

Cognitive offloading

Risko EF, Gilbert SJ. Cognitive offloading. Trends in cognitive sciences. 2016;20(9):676-88

2016
