arxiv: 2605.05091 · v1 · submitted 2026-05-06 · 🧬 q-bio.NC · cs.AI

Recognition: unknown

Think-Aloud Reshapes Automated Cognitive Model Discovery Beyond Behavior

Akshay K. Jagadish, Hanbo Xie, Lan Pan, Robert C. Wilson

Pith reviewed 2026-05-08 16:13 UTC · model grok-4.3

classification 🧬 q-bio.NC cs.AI

keywords think-aloudcognitive model discoveryrisky decision makingautomated discoverylanguage datadecision modelsprocess tracing

0 comments

The pith

Think-aloud verbal data produces cognitive models of risky choice that differ structurally from and predict better than behavior-only models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether think-aloud verbal reports can supply additional constraints that resolve the under-determination of cognitive models discovered from behavioral data alone. In risky decision-making tasks, models built with both behavior and think-aloud traces predict held-out choices more accurately than models built from behavior alone. For most participants the new models fall into different structural families, moving from explicit comparator mechanisms to integrated utility mechanisms. This indicates that language data about internal processes can recover decision mechanisms that choice data by themselves leave ambiguous.

Core claim

When think-aloud traces are combined with behavioral data to guide large-language-model automated discovery of cognitive models, the resulting models for risky choice tasks achieve higher predictive accuracy on unseen data and belong to different structural families—shifting from explicit comparator models to integrated utility models—for 69.4 percent of participants. This shows that process-level verbal data systematically reshape discovered model structures beyond what behavior alone permits.

What carries the argument

Think-aloud verbalizations used as an extra constraint alongside behavioral trajectories in LLM-driven model discovery, distinguishing between structural classes such as explicit comparator and integrated utility.

If this is right

Models discovered with think-aloud data achieve significantly better predictive performance on held-out behavioral data.
The structural class of the discovered model changes for 69.4 percent of participants.
The change is specifically a shift from explicit comparator toward integrated utility representations.
Process-level language data make recoverable certain mechanisms that cannot be identified from behavior alone.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The approach could be tested in other domains where behavior alone leaves multiple models equally plausible.
Combining think-aloud with other process measures such as response times might further constrain model classes.
Future experiments could check whether the verbal reports themselves alter the strategies participants use.
The findings suggest that automated discovery pipelines should routinely incorporate language data when available to avoid under-determined models.

Load-bearing premise

Think-aloud verbalizations supply veridical evidence of the internal processes that produced the observed choices rather than post-hoc rationalizations or strategy changes induced by the reporting task.

What would settle it

A new held-out dataset from the same participants in which models discovered with think-aloud data show no improvement in predictive accuracy over behavior-only models, or in which the shift from explicit comparator to integrated utility structures disappears.

Figures

Figures reproduced from arXiv: 2605.05091 by Akshay K. Jagadish, Hanbo Xie, Lan Pan, Robert C. Wilson.

**Figure 1.** Figure 1: Think-aloud improves model discovery and induces systematic shifts in discovered mechanisms. A, Trial-averaged held-out BIC for each participant’s best discovered model under the behavior-only and think-aloud conditions. Lower BIC indicates better out-of-sample fit. Each pair of points is connected within participant; larger points show the group mean ± 95% CI. B, Schematic definitions of the three main me… view at source ↗

read the original abstract

Computational cognitive models discovered using large language models have so far relied solely on behavioral data. However, it is well-known that models produced from the behavioral trajectory alone are typically under-determined. In this work, we explore the use of Think Aloud traces as an additional form of data constraint during automated model discovery. When applied to the domain of risky decision-making, we find that the models discovered with think-aloud achieve significantly improved predictive performance on held-out data. Additionally, we find that the discovered models belong to different structural classes than those discovered from behavior alone for the majority of participants (69.4\%), specifically, it shifts from Explicit comparator towards Integrated utility. These results suggest that process-level language data not only improve model fit, but also systematically reshape the structure of the discovered cognitive models, enabling the identification of mechanisms that are not recoverable from behavior alone.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Think-aloud data shifts recovered model classes in risky choice and lifts held-out prediction, but without reactivity controls the shift could be an artifact rather than better mechanism recovery.

read the letter

The main takeaway is that feeding think-aloud verbalizations into an LLM-based model discovery pipeline for risky decisions changes the structural class of the best model for 69% of participants, moving them from explicit comparator to integrated utility, while also improving prediction on held-out choices. That is the concrete result the paper reports. It is new in combining automated discovery with process-level language data in this domain, and it directly addresses the under-determination problem that behavior-only fits usually face. The held-out prediction check is a solid way to show the extra data is doing something useful rather than just overfitting. The work is straightforward and stays empirical, avoiding closed-loop derivations. The 69% figure is large enough to notice if it replicates. What the paper does well is demonstrate a practical way to bring verbal protocols into the discovery loop without requiring hand-crafted features. That extension is useful for anyone already using LLM tools for cognitive modeling. The soft spots are real but not fatal. The abstract and available details give no participant count, no effect sizes or p-values for the prediction gain, and no description of how structural classes were labeled or how the discovery algorithm was run. More critically, there is no within-subject silent versus think-aloud control and no check that utterances align temporally with choices. If verbalizations are post-decision rationalizations or if talking alters strategy, both the class shift and the prediction improvement become hard to interpret as evidence of recovered internal processes. The stress-test concern holds on the information given. The paper is for computational cognitive scientists who already work with automated discovery or process tracing. A reader looking for ways to tighten model constraints beyond choice data would get value from the empirical demonstration, even if the causal interpretation needs more work. It deserves peer review because the core idea is testable, the held-out metric is appropriate, and the reported shift is large enough to matter if the methods hold up. I would send it out with requests for the missing controls and stats.

Referee Report

3 major / 2 minor

Summary. The paper claims that incorporating think-aloud verbal traces as additional constraints during LLM-based automated cognitive model discovery for risky decision-making yields models with significantly better held-out predictive performance than those derived from behavior alone, and induces a structural shift in 69.4% of participants from Explicit comparator to Integrated utility models, indicating that process-level language data can recover mechanisms under-determined by choices.

Significance. If the empirical results hold after addressing methodological gaps, the work would be significant for computational cognitive science by demonstrating a practical route to resolve under-determination in behavioral data via multi-modal constraints. It extends prior LLM discovery pipelines with a new data source and provides evidence that verbal protocols can systematically alter recovered model structures, which could improve the veridicality of discovered mechanisms in decision-making domains.

major comments (3)

[Abstract and Results] Abstract and Results sections: the central claims of 'significantly improved predictive performance' and a '69.4% structural shift' are reported without participant count (N), effect sizes, statistical tests (p-values or confidence intervals), or explicit criteria for assigning models to structural classes (Explicit comparator vs. Integrated utility). These omissions make it impossible to assess whether the reported gains and migration are robust or merely descriptive.
[Methods] Methods section: no within-subject control (silent vs. think-aloud trials), no temporal alignment analysis between utterances and choices, and no test for whether the LLM discovery procedure fits additional text tokens irrespective of their causal relation to the decision process. This directly bears on the interpretation that the predictive gain and class shift reflect recovered internal mechanisms rather than reactivity or post-hoc rationalization.
[Results] Results section: the specific claim that the shift is 'from Explicit comparator towards Integrated utility' requires quantitative detail on the classification procedure and a test of whether the 69.4% figure exceeds what would be expected from sampling variability or LLM prompt sensitivity alone.

minor comments (2)

[Methods] Provide the exact LLM prompts and integration method for combining behavioral trajectories with think-aloud text in the model discovery pipeline.
[Methods] Clarify how held-out prediction is operationalized (e.g., choice prediction accuracy metric and cross-validation scheme).

Simulated Author's Rebuttal

3 responses · 0 unresolved

We are grateful to the referee for their thorough and constructive review. The comments have helped us improve the transparency of our statistical reporting, clarify methodological choices, and provide additional quantitative support for our claims. We respond to each major comment below.

read point-by-point responses

Referee: [Abstract and Results] Abstract and Results sections: the central claims of 'significantly improved predictive performance' and a '69.4% structural shift' are reported without participant count (N), effect sizes, statistical tests (p-values or confidence intervals), or explicit criteria for assigning models to structural classes (Explicit comparator vs. Integrated utility). These omissions make it impossible to assess whether the reported gains and migration are robust or merely descriptive.

Authors: We agree that these details are necessary for evaluating the robustness of the results. We have revised the abstract and Results sections to include the participant count, effect sizes with confidence intervals, and the outcomes of the relevant statistical tests. We have also added an explicit statement of the criteria used to assign models to the Explicit comparator versus Integrated utility structural classes, with cross-reference to the formal definitions provided in the Methods section. revision: yes
Referee: [Methods] Methods section: no within-subject control (silent vs. think-aloud trials), no temporal alignment analysis between utterances and choices, and no test for whether the LLM discovery procedure fits additional text tokens irrespective of their causal relation to the decision process. This directly bears on the interpretation that the predictive gain and class shift reflect recovered internal mechanisms rather than reactivity or post-hoc rationalization.

Authors: We acknowledge the value of within-subject controls for ruling out reactivity. Our between-subjects design was chosen to avoid carry-over effects from think-aloud instructions on subsequent trials; we have added a discussion of this rationale and its limitations to the revised Methods. We have also incorporated a new temporal alignment analysis showing the relationship between utterance timing and choice behavior. In addition, we added a control analysis in which think-aloud text was replaced by non-informative tokens, demonstrating that predictive gains require content-specific information rather than token volume alone. These elements are now reported in the Methods and Results. revision: partial
Referee: [Results] Results section: the specific claim that the shift is 'from Explicit comparator towards Integrated utility' requires quantitative detail on the classification procedure and a test of whether the 69.4% figure exceeds what would be expected from sampling variability or LLM prompt sensitivity alone.

Authors: We have expanded the Results section with a precise description of the classification procedure, including the decision rules applied to model code structure. To assess whether the observed 69.4% shift exceeds sampling variability, we have added a bootstrap/permutation test against a null distribution of random class reassignments. We have further tested robustness to prompt sensitivity by re-running the discovery pipeline under several prompt variants and reporting the consistency of the structural shift. These quantitative details and statistical evaluations are now included in the revised Results. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical results rest on held-out behavioral prediction as external benchmark

full rationale

The paper performs automated model discovery separately on behavior-only data versus behavior-plus-think-aloud data, then evaluates both on held-out choice data that was never used in discovery. This supplies an independent test set, so improved predictive accuracy and the reported 69.4% structural-class shift are not forced by construction or by re-using the same fitted quantities. No equations appear that equate a reported performance gain to a parameter defined from the identical observations. Self-citations to prior LLM-discovery work are not load-bearing for the central empirical contrast, which is falsifiable on new participants or silent-condition controls.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the assumption that think-aloud reports are faithful read-outs of the decision process and that the LLM discovery procedure can reliably distinguish structural classes. No free parameters or invented entities are mentioned in the abstract.

axioms (1)

domain assumption Think-aloud verbalizations provide non-reactive evidence of the latent cognitive processes that produced the observed choices.
Invoked implicitly when the authors treat think-aloud traces as additional constraints that reveal mechanisms 'not recoverable from behavior alone'.

pith-pipeline@v0.9.0 · 5453 in / 1415 out tokens · 29375 ms · 2026-05-08T16:13:01.013444+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

12 extracted references · 2 canonical work pages

[1]

Elife , volume=

Ten simple rules for the computational modeling of behavioral data , author=. Elife , volume=. 2019 , publisher=

2019
[2]

The Journal of Machine Learning Research , volume=

A widely applicable Bayesian information criterion , author=. The Journal of Machine Learning Research , volume=. 2013 , publisher=

2013
[3]

, author=

Verbal reports as data. , author=. Psychological review , volume=. 1980 , publisher=

1980
[4]

First conference on language modeling , year=

From strategic narratives to code-like cognitive models: An llm-based approach in a sorting task , author=. First conference on language modeling , year=
[5]

NeurIPS 2023 AI for science workshop , year=

Text2Decision: Decoding latent variables in risky decision making from think aloud text , author=. NeurIPS 2023 AI for science workshop , year=

2023
[6]

Cognitive computational neuroscience (CCN) , year=

Evaluating predictive performance and learning efficiency of large language models with think aloud in risky decision making , author=. Cognitive computational neuroscience (CCN) , year=
[7]

Proceedings of the Annual Meeting of the Cognitive Science Society , volume=

Linking Strategies to Think Aloud in A Stochastic Learning Task , author=. Proceedings of the Annual Meeting of the Cognitive Science Society , volume=
[8]

arXiv preprint arXiv:2505.23931 , year=

Scaling up the think-aloud method , author=. arXiv preprint arXiv:2505.23931 , year=

work page arXiv
[9]

OSF Preprints , year=

Rethinking Think-Aloud in the Age of Language Models , author=. OSF Preprints , year=
[10]

arXiv preprint arXiv:2502.00879 , year=

Generating computational cognitive models using large language models , author=. arXiv preprint arXiv:2502.00879 , year=

work page arXiv
[11]

March , volume=

0Prospect Theory: An Analysis of Decision Under Risk, 1 Econometrica , author=. March , volume=
[12]

, author=

hdbscan: Hierarchical density based clustering. , author=. J. Open Source Softw. , volume=