arxiv: 2606.13132 · v1 · pith:6K52SXXPnew · submitted 2026-06-11 · 🧬 q-bio.NC

Including the Cost of Irreducible Uncertainty in the Policy Compression Framework

Pith reviewed 2026-06-27 05:15 UTC · model grok-4.3

classification 🧬 q-bio.NC
keywords policy compression · cognitive cost · conditional entropy · decision making · mutual information · human biases · reaction times

The pith

Extending the policy compression framework to include the cost of conditional entropy sharpens optimal policies while keeping their exponential form.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims the standard policy compression framework is incomplete because it treats conditional entropy as costless even though evidence links it to reaction times. It redefines total cognitive cost as the usual mutual information between states and actions plus a new term, conditional entropy weighted by eta. The optimal policy that trades this cost against reward keeps the familiar exponential form but grows sharper as eta rises. This change lets the precision with which a policy selects actions vary more independently of its sensitivity to reward differences. The authors note that the extension may better capture human decision biases but adds a parameter that future fitting work must handle.

Core claim

Redefining cognitive cost as mutual information I(S;A) plus eta times conditional entropy H(A|S) yields an optimal policy of the standard exponential form whose sharpness increases with eta, allowing policy precision to vary more independently of reward sensitivity and implying that the original framework may underestimate the cognitive cost of action selection under uncertainty.

What carries the argument

The augmented cognitive cost: mutual information between states and actions plus eta times conditional entropy of actions given states.
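As a worked identity, the augmented cost can be rewritten using the standard decomposition I(S;A) = H(A) - H(A|S). The symbol C_eta is a label introduced here for illustration and is not taken from the paper.

```latex
\begin{aligned}
C_\eta &= I(S;A) + \eta\, H(A \mid S) \\
       &= H(A) - H(A \mid S) + \eta\, H(A \mid S) \\
       &= H(A) - (1 - \eta)\, H(A \mid S)
\end{aligned}
```

Under this rewriting, eta = 0 recovers the original mutual-information cost, and eta = 1 would leave only the marginal action entropy H(A).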

If this is right

  • The optimal policy retains the exponential form but becomes sharper as eta increases.
  • Policy precision can vary more independently of reward sensitivity.
  • The standard framework may underestimate the cognitive cost of selecting actions under uncertainty.
  • The extension has potential to better account for observed biases in human decision-making.
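The sharpening effect in the first bullet can be illustrated numerically. The sketch below is not the paper's code: it assumes that the stationary policy of E[r] - lam * (I(S;A) + eta * H(A|S)) takes the form pi(a|s) proportional to [P(a) * exp(r(s,a) / lam)] ** (1 / (1 - eta)) for 0 <= eta < 1, and it iterates that update in the Blahut-Arimoto style. The function name `optimal_policy`, the reward table `r`, and the toy task are all hypothetical.

```python
import math

def optimal_policy(r, p_s, lam, eta, iters=200):
    """Fixed-point iteration for the ASSUMED form
    pi(a|s) ∝ [P(a) * exp(r[s][a] / lam)] ** (1 / (1 - eta)).
    eta = 0 gives the standard policy compression update."""
    if not 0.0 <= eta < 1.0:
        raise ValueError("this sketch assumes 0 <= eta < 1")
    n_s, n_a = len(r), len(r[0])
    p_a = [1.0 / n_a] * n_a          # initial marginal over actions
    k = 1.0 / (1.0 - eta)            # exponent that sharpens the policy
    pi = []
    for _ in range(iters):
        pi = []
        for s in range(n_s):
            # Floor the marginal to avoid log(0) if an action dies out.
            logits = [k * (math.log(max(p_a[a], 1e-300)) + r[s][a] / lam)
                      for a in range(n_a)]
            m = max(logits)          # subtract the max for numerical stability
            w = [math.exp(x - m) for x in logits]
            z = sum(w)
            pi.append([x / z for x in w])
        # Update the marginal so that it is consistent with the policy.
        p_a = [sum(p_s[s] * pi[s][a] for s in range(n_s)) for a in range(n_a)]
    return pi

# Hypothetical toy task: two equiprobable states, each rewarding one action.
r = [[1.0, 0.0], [0.0, 1.0]]
p_s = [0.5, 0.5]
for eta in (0.0, 0.25, 0.5):
    pi = optimal_policy(r, p_s, lam=1.0, eta=eta)
    print(eta, round(pi[0][0], 3))   # probability of the rewarded action in state 0
```

In this symmetric toy task the marginal P(a) stays uniform, so the printed probability reduces to a logistic function of 1 / (lam * (1 - eta)) and rises with eta while lam is held fixed.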

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Fitting the model to human data will require methods to estimate the extra parameter eta.
  • The framework could generate new predictions linking conditional entropy directly to measured reaction times.
  • Individual differences in eta might be testable against independent measures of cognitive capacity.

Load-bearing premise

Conditional entropy represents irreducible uncertainty that incurs its own cognitive cost because it modulates reaction times.

What would settle it

Behavioral data in which adding the eta-weighted conditional-entropy term fails to improve fits to human choices or reaction times relative to the original mutual-information cost alone.

Figures

Figures reproduced from arXiv: 2606.13132 by Álvaro Garrido-Pérez, Amrapali Pednekar, Pieter Simoens, Yara Khaluf.

Figure 1: Examples of three policies: π1 (A), π2 (B), and π3 (C). For any non-deterministic state distribution (i.e., any distribution other than one in which a single state has P(si) = 1 and all others have probability zero), policy π1 has non-zero policy complexity, whereas policies π2 and π3 always have zero policy complexity.
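The distinction the caption draws can be checked on a small example. The caption does not specify the three policies, so the ones below are hypothetical stand-ins: one state-dependent policy and two state-independent ones (a fixed action and a uniform random choice). The helper names `entropy` and `costs` are also illustrative.

```python
import math

def entropy(p):
    """Shannon entropy in bits, skipping zero-probability entries."""
    return -sum(x * math.log2(x) for x in p if x > 0)

def costs(pi, p_s):
    """Return (I(S;A), H(A|S)) in bits for policy rows pi[s] and state probabilities p_s."""
    n_s, n_a = len(pi), len(pi[0])
    p_a = [sum(p_s[s] * pi[s][a] for s in range(n_s)) for a in range(n_a)]
    h_cond = sum(p_s[s] * entropy(pi[s]) for s in range(n_s))
    return entropy(p_a) - h_cond, h_cond   # I(S;A) = H(A) - H(A|S)

p_s = [0.5, 0.5]                            # hypothetical state distribution
policies = {                                # hypothetical stand-in policies
    "state-dependent": [[1.0, 0.0], [0.0, 1.0]],
    "fixed action":    [[1.0, 0.0], [1.0, 0.0]],
    "uniform random":  [[0.5, 0.5], [0.5, 0.5]],
}
eta = 0.5                                   # illustrative weight, not from the paper
for name, pi in policies.items():
    mi, h_cond = costs(pi, p_s)
    print(name, mi, h_cond, mi + eta * h_cond)
```

Both state-independent policies have zero mutual information, but only the uniform random one has non-zero conditional entropy, so the eta-weighted term is what separates them in total cost.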
Abstract

AI decision-support systems can benefit from anticipating biases in human decision-making. Many such biases may arise from human cognitive limitations. The policy compression framework models decision-making as a trade-off between reward maximization and the cognitive cost of encoding state-dependent action policies, formalized as the mutual information between states and actions (policy complexity). We argue that this account is incomplete because it treats conditional entropy--the irreducible uncertainty about which action should be selected given a state--as costless, even though empirical evidence suggests that it modulates reaction times. We therefore extend the framework by defining cognitive cost as the sum of policy complexity and a weighted conditional-entropy term, governed by a new parameter, $\eta$. The resulting optimal policy retains the standard exponential form but becomes sharper as $\eta$ increases, allowing policy precision to vary more independently of reward sensitivity. This modification implies that the standard policy compression framework may underestimate the cognitive cost of action selection, and it has the potential to better account for biases in human decision-making. At the same time, it introduces additional complexity for fitting the model to human data, which future work will need to address.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript extends the policy compression framework by redefining cognitive cost as λ(I(S;A) + η H(A|S)) rather than λ I(S;A) alone, on the grounds that conditional entropy represents irreducible uncertainty with an empirically supported cognitive cost. It claims that the optimal policy under this objective retains the standard exponential (softmax) form while increasing in sharpness with η, thereby decoupling policy precision from reward sensitivity and potentially improving accounts of human decision biases. The work notes that the extension adds a free parameter η whose estimation will require future methodological attention.

Significance. If the derivation holds, the extension supplies a tractable mechanism for policy precision to vary independently of reward sensitivity, which could strengthen models of cognitive biases in decision support systems. A clear strength is the reported preservation of the closed-form exponential policy, which maintains analytical continuity with prior work in the framework. The significance is tempered by the ad-hoc status of the conditional-entropy cost axiom and the added fitting burden acknowledged in the abstract.

major comments (2)
  1. [Abstract] Abstract: the claim that the optimal policy 'retains the standard exponential form but becomes sharper as η increases' is load-bearing for the central contribution, yet the abstract provides neither the explicit policy equation nor the steps showing how the η H(A|S) term modifies the standard softmax without changing its functional class; verification requires the derivation to be exhibited.
  2. [Abstract] Abstract (and implied main-text derivation): introduction of the free parameter η directly modulates policy sharpness, creating a circularity risk when the model is fit to behavioral data on precision; the manuscript does not supply an independent identification strategy (e.g., from reaction-time measures) that would allow η to be fixed before testing policy predictions.
minor comments (1)
  1. [Abstract] The abstract states that the modification 'has the potential to better account for biases' but does not indicate which specific biases or data sets would be re-analyzed; a short illustrative example would clarify the intended empirical payoff.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments on our manuscript. We respond to each major comment below, indicating revisions where the manuscript will be updated to improve clarity and address limitations.

  1. Referee: [Abstract] Abstract: the claim that the optimal policy 'retains the standard exponential form but becomes sharper as η increases' is load-bearing for the central contribution, yet the abstract provides neither the explicit policy equation nor the steps showing how the η H(A|S) term modifies the standard softmax without changing its functional class; verification requires the derivation to be exhibited.

    Authors: We agree that the abstract would be strengthened by including the explicit policy form to make the central claim immediately verifiable. The main text derives the optimal policy by augmenting the standard Lagrangian with the η H(A|S) term, which enters as an additive state-dependent shift that preserves the exponential (softmax) functional class while rescaling the effective inverse temperature. We will revise the abstract to state the resulting policy equation explicitly and briefly indicate that the η term modifies the normalization without altering the exponential structure. revision: yes

  2. Referee: [Abstract] Abstract (and implied main-text derivation): introduction of the free parameter η directly modulates policy sharpness, creating a circularity risk when the model is fit to behavioral data on precision; the manuscript does not supply an independent identification strategy (e.g., from reaction-time measures) that would allow η to be fixed before testing policy predictions.

    Authors: We acknowledge the identification challenge raised. The manuscript already notes in the abstract that the extension introduces additional complexity for fitting to human data, which future work must address. No independent identification strategy (such as fixing η from reaction-time data) is supplied in the present theoretical paper. We will expand the discussion section to explicitly flag this as a limitation and outline possible empirical approaches for separating η from λ in subsequent studies. revision: partial

Circularity Check

0 steps flagged

No significant circularity; derivation is self-contained mathematical extension

The paper modifies the objective to E[r] − λ(I(S;A) + η H(A|S)) and states that the resulting optimal policy retains the standard exponential form while becoming sharper with η. This is a direct consequence of the Lagrangian optimization under the new cost definition; the form follows from the same variational steps used in the original framework and does not reduce to a fit or self-citation by construction. No load-bearing self-citations, uniqueness theorems, or renamed empirical patterns are invoked. The parameter η is introduced explicitly as a modeling choice, and the paper defers empirical fitting to future work. The central claim is therefore independent of its inputs.
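The variational step referred to above can be sketched as follows. This is a reconstruction under stated assumptions, not the paper's own derivation: r(s,a) denotes the reward, nu(s) is a normalization multiplier introduced here, 0 <= eta < 1 is assumed, and only the stationarity condition is shown, which is necessary but not by itself sufficient for optimality.

```latex
\begin{aligned}
J[\pi] &= \sum_{s} P(s) \sum_{a} \pi(a \mid s)\, r(s,a)
          - \lambda \bigl[ I(S;A) + \eta\, H(A \mid S) \bigr], \\
I(S;A) &= \sum_{s,a} P(s)\, \pi(a \mid s) \log \frac{\pi(a \mid s)}{P(a)}, \qquad
H(A \mid S) = -\sum_{s,a} P(s)\, \pi(a \mid s) \log \pi(a \mid s), \\
0 &= P(s) \Bigl[ r(s,a) - \lambda \log \frac{\pi(a \mid s)}{P(a)}
      + \lambda \eta \bigl( \log \pi(a \mid s) + 1 \bigr) \Bigr] + \nu(s), \\
\pi(a \mid s) &= \frac{1}{Z(s)} \Bigl[ P(a)\, \exp\bigl( r(s,a)/\lambda \bigr) \Bigr]^{\frac{1}{1-\eta}},
\qquad P(a) = \sum_{s} P(s)\, \pi(a \mid s).
\end{aligned}
```

Setting eta = 0 recovers the standard policy compression solution, and increasing eta toward 1 raises the exponent 1/(1 - eta), which is one way to read the claim that the policy keeps its exponential form while becoming sharper.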

Axiom & Free-Parameter Ledger

1 free parameter · 2 axioms · 0 invented entities

The extension rests on one new free parameter η and the domain assumption that conditional entropy incurs cognitive cost, plus the standard mutual information definition.

free parameters (1)
  • η
    Weighting parameter for the conditional entropy term in the total cognitive cost.
axioms (2)
  • domain assumption Policy complexity is measured by mutual information between states and actions.
    This is the standard assumption in the policy compression framework being extended.
  • ad hoc to paper Conditional entropy represents irreducible uncertainty that carries a cognitive cost, as suggested by evidence that it modulates reaction times.
    This is the key new assumption introduced in the paper based on empirical evidence.


