Humans Disengage, Reasoning Models Persist: Separating Difficulty Registration from Deliberation Allocation

Han-yu Wang

arxiv: 2606.26502 · v1 · pith:E3UU2UKJ · submitted 2026-06-25 · 💻 cs.AI · cs.CL

Pith reviewed 2026-06-26 05:33 UTC · model grok-4.3

keywords human-AI comparison · reasoning models · deliberation allocation · response time · item fixed effects · metareasoning · H-ARC · stopping policy

The pith

Humans spend less time on problems they fail while large reasoning models spend more, even though both track difficulty across problems.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper distinguishes two components of deliberation: registration, the alignment of effort with difficulty across different items, and allocation, whether an agent, with item identity held fixed, spends more effort on its own failures or on its own successes. Humans and the five tested thinking LRMs both reproduce the established cross-item correlation, but they diverge sharply once item identity is held fixed. Humans allocate less time to the trials they get wrong, consistent with disengagement from expected failures, while every LRM allocates substantially more tokens to its own errors. The pattern is isolated by item fixed-effects analysis on a matched corpus, is absent in a non-thinking baseline, and replicates across datasets. Because both policies produce the same cross-item correlation, earlier measures that ignored item identity could not detect the opposite control rules.

Core claim

On a public matched human-LRM corpus, humans and all five thinking LRMs reproduce the known cross-item alignment between deliberation effort and difficulty but diverge within items: every LRM shows a large wrong-versus-right effect while humans show the opposite sign. The comparison stays inside each agent's own scale and holds under item fixed effects.

What carries the argument

Item fixed-effects analysis that isolates within-item allocation (wrong-vs-right difference) from between-item registration (cross-item correlation with difficulty).
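To make the distinction concrete, here is a minimal sketch of how a within-item (fixed-effects) slope can be separated from a cross-item correlation. It is an illustration under stated assumptions, not the paper's analysis code: the function names, the toy data, and the choice of plain within-item demeaning with no further covariates are all hypothetical.

```python
import numpy as np

def within_item_slope(item_ids, wrong, log_effort):
    """Slope of log effort on a wrong-trial indicator after demeaning
    both variables within each item (equivalent to item fixed effects)."""
    item_ids = np.asarray(item_ids)
    x = np.asarray(wrong, dtype=float)
    y = np.asarray(log_effort, dtype=float)
    x_dm = np.empty_like(x)
    y_dm = np.empty_like(y)
    for item in np.unique(item_ids):
        mask = item_ids == item
        x_dm[mask] = x[mask] - x[mask].mean()
        y_dm[mask] = y[mask] - y[mask].mean()
    denom = float(np.sum(x_dm ** 2))
    if denom == 0.0:
        raise ValueError("no item has both wrong and right trials")
    return float(np.sum(x_dm * y_dm) / denom)

def cross_item_correlation(item_ids, wrong, log_effort):
    """Pearson correlation between per-item error rate and per-item mean log effort."""
    item_ids = np.asarray(item_ids)
    x = np.asarray(wrong, dtype=float)
    y = np.asarray(log_effort, dtype=float)
    items = np.unique(item_ids)
    difficulty = np.array([x[item_ids == i].mean() for i in items])
    effort = np.array([y[item_ids == i].mean() for i in items])
    return float(np.corrcoef(difficulty, effort)[0, 1])

# Hypothetical toy data: 3 items of increasing difficulty, 4 trials each.
items = np.repeat([0, 1, 2], 4)
wrong = np.array([0, 0, 0, 1,  0, 0, 1, 1,  0, 1, 1, 1])
base = np.array([1.0, 2.0, 3.0])[items]      # effort rises with item difficulty
persisting = base + 0.5 * wrong              # more effort on own errors
disengaging = base - 0.5 * wrong             # less effort on own errors

slope_persisting = within_item_slope(items, wrong, persisting)
slope_disengaging = within_item_slope(items, wrong, disengaging)
r_persisting = cross_item_correlation(items, wrong, persisting)
r_disengaging = cross_item_correlation(items, wrong, disengaging)
```

In this toy setup both agents show a strongly positive cross-item correlation, while the within-item slopes have opposite signs, which is the shape of the dissociation described above.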

If this is right

The same cross-item correlation can arise from opposite within-item stopping policies.
Trace length in LRMs tracks uncertainty but not the control decision to persist or abandon.
Under resource-rational metareasoning the split is between two stopping policies that share a difficulty signal but implement opposite control.
The dissociation is absent in a non-thinking baseline and replicates across datasets.
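The first implication in the list above can be checked with a small simulation. The sketch below is not drawn from the paper: the generative model, the parameter values, and the function names are assumptions chosen only to show that two opposite within-item rules can both yield a positive cross-item correlation.

```python
import numpy as np

def simulate(shift_when_wrong, n_items=200, n_trials=20, seed=0):
    """Simulate log effort that rises with item difficulty and is shifted
    up or down on wrong trials (hypothetical generative model)."""
    rng = np.random.default_rng(seed)
    difficulty = rng.uniform(0.1, 0.9, size=n_items)          # P(wrong) per item
    wrong = rng.random((n_items, n_trials)) < difficulty[:, None]
    noise = rng.normal(0.0, 0.1, size=(n_items, n_trials))
    log_effort = 2.0 * difficulty[:, None] + shift_when_wrong * wrong + noise
    return wrong, log_effort

def summarize(wrong, log_effort):
    """Return (cross-item correlation, mean within-item wrong-minus-right gap)."""
    error_rate = wrong.mean(axis=1)
    cross_item_r = np.corrcoef(error_rate, log_effort.mean(axis=1))[0, 1]
    # Only items with both outcomes contribute to the within-item contrast.
    mixed = (error_rate > 0) & (error_rate < 1)
    w, y = wrong[mixed], log_effort[mixed]
    mean_wrong = (y * w).sum(axis=1) / w.sum(axis=1)
    mean_right = (y * ~w).sum(axis=1) / (~w).sum(axis=1)
    return float(cross_item_r), float(np.mean(mean_wrong - mean_right))

# "Persist" policy: extra effort on failures. "Disengage" policy: less effort on failures.
r_persist, gap_persist = summarize(*simulate(shift_when_wrong=+1.0))
r_disengage, gap_disengage = summarize(*simulate(shift_when_wrong=-0.5))
```

Under these assumed parameters both policies register difficulty across items, and only the within-item gap reveals which stopping rule is in play.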

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Models may lack an explicit mechanism for abandoning items they expect to fail, leading to longer traces precisely on errors.
Training objectives that reward human-like disengagement on low-success items could reduce unnecessary computation.
The registration-allocation split offers a new diagnostic for comparing metacognitive policies across agents.

Load-bearing premise

The matched human-LRM corpus and item fixed-effects analysis successfully isolate within-item allocation from between-item difficulty differences without residual confounding from problem selection or measurement scale differences.

What would settle it

A replication on a new matched corpus, or with an alternative matching procedure, in which the within-item wrong-versus-right reversal between humans and LRMs disappears.

Figures

Figures reproduced from arXiv: 2606.26502 by Han-yu Wang.

**Figure 1.** The two-level diagnostic. (a) Schematic. An agent registers item difficulty (a perceptual or evaluative difficulty signal) and then allocates deliberation around that registered difficulty (a stopping or scheduling rule that decides whether to keep thinking on the item now in front of it). The cross-item alignment of de Varda et al. (2025) probes registration; the within-agent contrast introduced here pro…

**Figure 2.** The within-agent allocation gap on H-ARC. (a) Per-agent within-agent Cohen’s d on log deliberation (wrong − right). The asterisk on gpt-oss-20b flags its parser-limited status; Qwen3-235B-Thinking is omitted here and reported in Table S1. Failure-budget amplification (FBA) is the share of LRM compute spent on wrong trials, divided by the share of trials that are wrong. (b) Estimated marginal mean log deli…

**Figure 3.** Cross-paradigm allocation gap. Per-agent within-agent Cohen’s d on log deliberation (wrong − right) for the two non-saturated paradigms beyond H-ARC. INTUIT (intuitive physics) is the clean replication; Cortes (binary relational reasoning) is a paradigm-dependent boundary in which the agent-type interaction is preserved but the LRM-only within-item slope reverses sign (see Generalisation and the Cortes bou…

**Figure 4.** Two convergent behavioural probes of the proposed mechanism. The labels are mechanistic interpretations of behavioural signatures, not direct evidence of internal architecture. (a) Human engagement decomposition. The raw within-item correctness slope on H-ARC human log RT is β = +0.241; controlling for log(1 + actions) (the per-trial grid-action count) reduces it to β = +0.125, a 48% reduction. (b) LRM tra…

**Figure 5.** Illustrative truncation curves under two candidate principles (illustrative simulation, not a fit). A useful-search interpretation predicts a late, steep accuracy rise as the truncation budget grows; a length-on-uncertainty / padding interpretation predicts an earlier rise and a broad high-accuracy plateau under moderate truncation. The curves diverge most strongly between f = 0.5 and f = 0.8.
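The failure-budget amplification (FBA) defined in the Figure 2 caption is a simple ratio, sketched below. The function name and the token counts are hypothetical and are not taken from the paper's data; only the definition (share of compute on wrong trials divided by share of trials that are wrong) comes from the caption.

```python
def failure_budget_amplification(tokens, wrong):
    """FBA = (share of total compute spent on wrong trials)
             / (share of trials that are wrong)."""
    tokens = list(tokens)
    wrong = [bool(w) for w in wrong]
    if len(tokens) != len(wrong) or not tokens:
        raise ValueError("tokens and wrong must be non-empty and the same length")
    total = sum(tokens)
    n_wrong = sum(wrong)
    if total <= 0 or n_wrong == 0:
        raise ValueError("need positive total compute and at least one wrong trial")
    compute_share = sum(t for t, w in zip(tokens, wrong) if w) / total
    trial_share = n_wrong / len(wrong)
    return compute_share / trial_share

# Hypothetical example: one wrong trial out of four uses half of all tokens.
fba_skewed = failure_budget_amplification([100, 100, 100, 300],
                                          [False, False, False, True])
# Hypothetical example: compute spread evenly regardless of outcome.
fba_even = failure_budget_amplification([100, 100, 100, 100],
                                        [False, False, False, True])
```

An FBA of 1 means wrong trials consume compute in proportion to their frequency; values above 1 mean failures take a disproportionate share of the budget.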

Original abstract

Large reasoning models (LRMs) take longer on harder problems, just as humans do. This surface similarity hides an opposite pattern within items. When an LRM gets a problem wrong, it spends more tokens than when it gets the same problem right; humans do the reverse, spending less time on the trials they get wrong. We separate two levels of deliberation: how response time tracks difficulty across items (registration), and, with item identity held fixed, whether an agent spends more on its own failures or successes (allocation). On a public matched human-LRM corpus, humans and all five thinking LRMs reproduce the known cross-item alignment (registration) but diverge within items (allocation): every LRM shows a large wrong-vs-right effect (Cohen's d = 1.47-3.13 on H-ARC) while humans show the opposite sign. The comparison stays inside each agent's own scale; we never put seconds and tokens on one axis. The dissociation holds under item fixed effects, replicates across datasets, and is absent in a non-thinking baseline. We read the human pattern as engagement versus abandonment: people stay on items they expect to solve and give up on the rest. We read the LRM pattern as length driven by uncertainty: chains grow when the model is unsure, which is exactly when it tends to fail. Both policies produce the same cross-item correlation with difficulty, so they look aligned on the measure prior work has used; the divergence shows up only once item identity is fixed. Under resource-rational metareasoning, the split is between two stopping policies that share a difficulty signal but implement opposite control; trace length captures the signal and misses the control.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper cleanly separates cross-item difficulty tracking from within-item effort allocation and shows humans and LRMs move in opposite directions on the latter.

The core observation is straightforward: both humans and the five thinking LRMs increase effort with problem difficulty when you compare across items, but the pattern reverses once you hold the item fixed. Models emit more tokens on the problems they get wrong; humans spend less time on the ones they miss. The paper documents this with item fixed-effects regressions, replication across datasets, and a non-thinking baseline that lacks the within-item effect. All comparisons stay inside each agent's native units.

The separation of registration from allocation is the useful move. Prior work mostly looked at the cross-item correlation, which both populations share, so the divergence only appears when item identity is controlled. The matched corpus supplies the necessary within-item outcome variation, and the reported controls appear sufficient to rule out simple scale or selection artifacts.

The main limitation is that the abstract does not give the exact matching procedure or raw statistics on how many items had outcome variation in both populations. That detail matters for judging residual confounding, though the stress-test note indicates the fixed-effects and replication checks address the obvious risks. The engagement-versus-uncertainty interpretation is reasonable but not required for the empirical claim.

This is worth a referee's time for anyone working on metareasoning, effort measures in evaluation, or cognitive modeling of AI systems. The result is falsifiable with the public data and adds a concrete distinction that existing benchmarks do not capture. I would send it to review.

Referee Report

0 major / 2 minor

Summary. The paper claims that humans and large reasoning models (LRMs) both exhibit positive cross-item registration of deliberation effort with problem difficulty, but diverge sharply in within-item allocation: LRMs expend more tokens on items they answer incorrectly (Cohen's d = 1.47–3.13), while humans expend less time on items they get wrong. This dissociation is reported to survive item fixed-effects regression, replicate across datasets, and be absent in a non-thinking baseline; the authors interpret it as humans engaging on solvable items versus abandoning the rest, versus LRMs lengthening chains under uncertainty.

Significance. If the dissociation holds, the work usefully separates registration from allocation in metareasoning and shows that the same cross-item correlation can arise from opposite stopping policies. Credit is due for the public matched corpus, item fixed-effects controls, replication across datasets, and explicit non-thinking baseline, all of which address obvious confounds and keep comparisons within each agent's native scale.

minor comments (2)

The abstract and methods should explicitly state the exact number of items per dataset, the precise matching procedure used to create the human-LRM corpus, and the full specification of the item fixed-effects regression (including any additional covariates) so that readers can directly verify isolation of within-item allocation.
The figure or table reporting the per-LRM Cohen's d values and the human effect size should include confidence intervals and exact sample sizes per cell to allow assessment of precision.
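One way to supply the requested confidence intervals is a percentile bootstrap on Cohen's d, sketched below. This is an assumption about a reasonable procedure, not the paper's method: the function names and the synthetic data are hypothetical, and the resampling is at the trial level, so it ignores any clustering by item that a full analysis would need to respect.

```python
import numpy as np

def cohens_d(wrong_vals, right_vals):
    """Pooled-SD Cohen's d for (wrong - right) on log deliberation."""
    a = np.asarray(wrong_vals, dtype=float)
    b = np.asarray(right_vals, dtype=float)
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * a.var(ddof=1) + (nb - 1) * b.var(ddof=1)) / (na + nb - 2)
    return float((a.mean() - b.mean()) / np.sqrt(pooled_var))

def bootstrap_ci(wrong_vals, right_vals, n_boot=2000, level=0.95, seed=0):
    """Percentile bootstrap interval, resampling trials within each outcome group."""
    rng = np.random.default_rng(seed)
    a = np.asarray(wrong_vals, dtype=float)
    b = np.asarray(right_vals, dtype=float)
    stats = np.empty(n_boot)
    for k in range(n_boot):
        a_star = rng.choice(a, size=len(a), replace=True)
        b_star = rng.choice(b, size=len(b), replace=True)
        stats[k] = cohens_d(a_star, b_star)
    lo, hi = np.quantile(stats, [(1 - level) / 2, (1 + level) / 2])
    return float(lo), float(hi)

# Synthetic log-token counts (hypothetical, for illustration only).
data_rng = np.random.default_rng(1)
log_tokens_wrong = data_rng.normal(7.0, 0.5, size=40)
log_tokens_right = data_rng.normal(6.0, 0.5, size=60)

d_hat = cohens_d(log_tokens_wrong, log_tokens_right)
ci_low, ci_high = bootstrap_ci(log_tokens_wrong, log_tokens_right)
```

Reporting `d_hat` alongside `(ci_low, ci_high)` and the two group sizes would cover the precision information the comment asks for.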

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive summary, significance assessment, and recommendation of minor revision. No specific major comments were listed in the report, so we have no points requiring point-by-point response or manuscript changes.

Circularity Check

0 steps flagged

No significant circularity: purely empirical comparison

The paper reports an empirical dissociation between cross-item registration and within-item allocation using a public matched human-LRM corpus, item fixed-effects regression, and Cohen's d effect sizes computed separately within each agent's native scale. No equations, derivations, or model fits are present; the central claims rest on direct statistical contrasts that survive the stated controls and replicate across datasets. No self-citations are load-bearing for the dissociation result, and the analysis introduces no self-definitional, fitted-input, or ansatz-smuggling steps. The work is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Empirical study with no new theoretical parameters, axioms, or invented entities; relies on standard statistical methods and an existing public corpus.

axioms (1)

standard math: Standard assumptions underlying Cohen's d effect size and linear fixed-effects models hold for the response-time and token-count data.
Invoked when reporting effect sizes and fixed-effects results.
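
For reference, the two standard tools named in this axiom can be written out as follows. These are common textbook forms and an assumption about notation: the page does not give the paper's exact estimator or any additional covariates. Here $y_{ij}$ is log deliberation (log response time or log token count) on trial $j$ of item $i$, $\alpha_i$ is the item fixed effect, and $\beta$ is the within-item wrong-versus-right slope.

```latex
\[
d = \frac{\bar{y}_{\mathrm{wrong}} - \bar{y}_{\mathrm{right}}}{s_p},
\qquad
s_p = \sqrt{\frac{(n_w - 1)\,s_w^2 + (n_r - 1)\,s_r^2}{n_w + n_r - 2}}
\]

\[
y_{ij} = \alpha_i + \beta\,\mathrm{wrong}_{ij} + \varepsilon_{ij}
\]
```

Under this notation the reported dissociation corresponds to $\beta > 0$ for the thinking LRMs and $\beta < 0$ for humans, each estimated in the agent's own units.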

pith-pipeline@v0.9.1-grok · 5835 in / 1276 out tokens · 31091 ms · 2026-06-26T05:33:19.316555+00:00 · methodology

Reference graph

Works this paper leans on

34 extracted references · 2 linked inside Pith

[1] Lieder, Falk and Griffiths, Thomas L. Behavioral and Brain Sciences, 2020.
[2] Russell, Stuart and Wefald, Eric. Artificial Intelligence, 1991.
[3] Bogacz, Rafal and Brown, Eric and Moehlis, Jeff and Holmes, Philip and Cohen, Jonathan D. Psychological Review, 2006.
[4] de Varda, Andrea Gregor and D'Elia, Ferdinando Pio and Kean, Hope and Lampinen, Andrew and Fedorenko, Evelina. Proceedings of the National Academy of Sciences, 2025.
[5] de Varda, Andrea Gregor and D'Elia, Ferdinando Pio and Kean, Hope and Lampinen, Andrew and Fedorenko, Evelina. Proceedings of the National Academy of Sciences, 2026.
[6] Vankov, Ivan I. and Adolfi, Federico and Heaton, Rachel F. and Puebla, Guillermo and Bowers, Jeffrey S. Proceedings of the National Academy of Sciences, 2026.
[7] No deep insights into the alignment between human and deep learning reasoning processes: … Proceedings of the National Academy of Sciences, 2026.
[8] Hu, Yueqing. Proceedings of the National Academy of Sciences, 2026.
[9] LeGris, Solim and Vong, Wai Keen and Lake, Brenden M. and Gureckis, Todd M. Scientific Data, 2025.
[10] Prunty, J. and O'Flynn, A. and Quinn, P. and Cheke, L. G. 2025.
[11] What Makes Mental Modeling Difficult? Normative Data for the Multidimensional Relational Reasoning Task. 2021.
[12] arXiv preprint arXiv:2412.19437.
[13] Guo, Daya and Yang, Dejian and Zhang, Haowei and Song, Junxiao and Wang, Peiyi and Zhu, Qihao and Xu, Runxin and Zhang, Ruoyu and Ma, Shirong and Bi, Xiao and Zhang, Xiaokang and Yu, Xingkai and Wu, Yu and Wu, Z. F. and Gou, Zhibin and Shao, Zhihong and Li, Zhuoshu and Gao, Ziyi and Liu, Aixin and others and Zhang, Zhen. Nature, 2025.
[14] 2025.
[15] Samineni, Soumya Rani and Kalwar, Durgesh and Gangal, Vardaan and Bhambri, Siddhant and Kambhampati, Subbarao. arXiv preprint arXiv:2510.18176.
[16] Valmeekam, Karthik and Stechly, Kaya and Palod, Vardhan and Gundawar, Atharva and Kambhampati, Subbarao. 2025.
[17] Kambhampati, Subbarao and Valmeekam, Karthik and Bhambri, Siddhant and Palod, Vardhan and Saldyt, Lucas Paul and Stechly, Kaya and Samineni, Soumya Rani and Kalwar, Durgesh and Biswas, Upasana. arXiv preprint arXiv:2504.09762.
[18] Ratcliff, Roger and McKoon, Gail. Neural Computation, 2008.
[19] Heitz, Richard P. Frontiers in Neuroscience, 2014.
[20] Wei, Jason and Wang, Xuezhi and Schuurmans, Dale and Bosma, Maarten and Ichter, Brian and Xia, Fei and Chi, Ed H. and Le, Quoc V. and Zhou, Denny. 2022.
[21] Kojima, Takeshi and Gu, Shixiang Shane and Reid, Machel and Matsuo, Yutaka and Iwasawa, Yusuke. 2022.
[22] Snell, Charlie and Lee, Jaehoon and Xu, Kelvin and Kumar, Aviral. 2025.
[23] Chollet, François. On the measure of intelligence. 2019.
[24] Nelson, Thomas O. and Narens, Louis. Psychology of Learning and Motivation, 1990.
[25] Yeung, Nick and Summerfield, Christopher. Philosophical Transactions of the Royal Society, 2012.
[26] Simon, Herbert A. Psychological Review, 1956.
[27] Gershman, Samuel J. and Horvitz, Eric J. and Tenenbaum, Joshua B. Science, 2015.
[28] Efron, Bradley and Tibshirani, Robert J. 1993.
[29] Mundlak, Yair. Econometrica, 1978.
[30] Callaway, Frederick et al. Rational use of cognitive resources in human planning. 2022.
[31] Binz, Marcel and Schulz, Eric. Proceedings of the National Academy of Sciences, 2023.
[32] Nelson, Thomas O. and Leonesio, R. Jacob. Journal of Experimental Psychology: Learning, Memory, and Cognition, 1988.
[33] Zhu, Jian-Qiao and Griffiths, Thomas L. Psychological Review.
[34] Der Kiureghian, Armen and Ditlevsen, Ove. Structural Safety, 2009.
