Hypothesis Generation and Inductive Inference in Children and Language Models

Jeffrey Qin; Jessica Sommerville; Kevin Ellis; Marta Kryven; Mia Radovanovic; Wasu Top Piriyakulkij; Zhuangfei Gao

arxiv: 2605.24528 · v2 · pith:Q2FVLVFSnew · submitted 2026-05-23 · cs.AI · cs.CL · cs.LG

Jeffrey Qin, Wasu Top Piriyakulkij, Zhuangfei Gao, Mia Radovanovic, Jessica Sommerville, Kevin Ellis, Marta Kryven

Pith reviewed 2026-06-30 13:13 UTC · model grok-4.3

classification cs.AI · cs.CL · cs.LG

keywords inductive inference, hypothesis generation, children, language models, Bayesian inference, program synthesis, evidence reliability, observability

The pith

Children and LLM agents both adapt inductive inference to evidence reliability and observability in a Box Task, though LLMs over-observe and over-comply.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper compares children and LLM-based agents on an inductive inference task where participants must infer a latent cause through sequential interactions with an uncertain environment. It formalizes the task as program induction with Bayesian particle-based inference, allowing two views: constraint satisfaction over hypotheses or executable program synthesis against evidence. Children's behavior is accounted for by subjective evidence reliability paired with online hypothesis generation, which explains their evidence-seeking and the split between finishing the task and generalizing the rule. LLM agents mirror children's adjustments to reliability and observability changes, such as discounting weak evidence and resolving partial information, but they observe more and comply more strictly than children do.

Core claim

Using the constraint-based formulation, children's behavior is best explained by a combination of subjective evidence reliability and online hypothesis generation, accounting for both their evidence-seeking patterns and their dissociation between task completion and rule generalization. Using the program synthesis formulation, LLM-based agents replicate children's responses to changes in evidence reliability and observability, including discounting unreliable evidence, seeking to resolve partial information, and dissociating between task completion and causal generalization, while tending to over-observe and over-comply relative to children.

What carries the argument

The Box Task formalized as program induction with Bayesian particle-based inference, viewed either as constraint satisfaction over hypotheses or as synthesis of executable programs evaluated against evidence.
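To make the two readings concrete, here is a minimal Python sketch, not the authors' implementation. It assumes a hypothetical three-rule hypothesis space (color, shape, or number matching between key and box), a single subjective `reliability` parameter, and two made-up observations. Each rule is an executable program (the program-synthesis reading) that is reweighted by how well its predictions agree with the evidence (the constraint reading). Resampling and online hypothesis generation are left out for brevity.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class Item:
    """A key or a box, described by three hypothetical features."""
    color: str
    shape: str
    number: int


# Each hypothesis is an executable program: does this key open this box?
Rule = Callable[[Item, Item], bool]
RULES: Dict[str, Rule] = {
    "color_match": lambda key, box: key.color == box.color,
    "shape_match": lambda key, box: key.shape == box.shape,
    "number_match": lambda key, box: key.number == box.number,
}

# One piece of evidence: (key tried, box tried, whether the box opened).
Evidence = Tuple[Item, Item, bool]


def likelihood(rule: Rule, evidence: Evidence, reliability: float) -> float:
    """Probability of the outcome under a rule.

    Assumption: the outcome agrees with the rule's prediction with
    probability `reliability` (strictly between 0 and 1).
    """
    key, box, opened = evidence
    return reliability if rule(key, box) == opened else 1.0 - reliability


def reweight(weights: Dict[str, float], evidence: Evidence,
             reliability: float) -> Dict[str, float]:
    """One Bayesian update of the weighted hypothesis set."""
    unnormalized = {
        name: w * likelihood(RULES[name], evidence, reliability)
        for name, w in weights.items()
    }
    total = sum(unnormalized.values())
    return {name: w / total for name, w in unnormalized.items()}


def posterior_over_rules(evidence_list: List[Evidence],
                         reliability: float) -> Dict[str, float]:
    """Start from a uniform prior and fold in the evidence sequentially."""
    weights = {name: 1.0 / len(RULES) for name in RULES}
    for evidence in evidence_list:
        weights = reweight(weights, evidence, reliability)
    return weights


# Made-up observations: a confounded success, then a failed color match.
EVIDENCE: List[Evidence] = [
    (Item("red", "circle", 1), Item("red", "square", 1), True),
    (Item("blue", "star", 3), Item("blue", "star", 2), False),
]

if __name__ == "__main__":
    for reliability in (0.9, 0.6):
        print(reliability, posterior_over_rules(EVIDENCE, reliability))
```

In this toy setup, lowering `reliability` toward 0.5 flattens the posterior. This is one simple way to express discounting of unreliable evidence: the same two observations move belief toward the number rule less when the learner trusts them less.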

If this is right

Children's evidence-seeking arises from subjective reliability judgments during ongoing hypothesis generation.
LLM agents can function as controllable model organisms for testing how inference changes with task conditions like observability.
Both groups separate completing the immediate task from achieving causal generalization under uncertainty.
Discounting of unreliable evidence occurs in both children and LLMs when reliability cues are present.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The similarity in adaptation patterns suggests LLMs could act as proxies for exploring developmental inference mechanisms under controlled manipulations.
Differences in observation volume point to distinct internal costs for information-seeking between children and current LLMs.
The dual formalization may allow the same task to probe other forms of uncertainty beyond evidence reliability.

Load-bearing premise

The Box Task and its formalization as program induction with Bayesian particle-based inference provide a faithful model of the underlying inductive processes used by both children and LLM agents.

What would settle it

If children in the Box Task fail to discount unreliable evidence or show no dissociation between task completion and rule generalization when evidence reliability is manipulated, the proposed explanation would not hold.

Figures

Figures reproduced from arXiv: 2605.24528 by Jeffrey Qin, Jessica Sommerville, Kevin Ellis, Marta Kryven, Mia Radovanovic, Wasu Top Piriyakulkij, Zhuangfei Gao.

**Figure 2.** A, Top row. Histograms showing the number of trials required to open all boxes for children

**Figure 3.** Left: Densities of number of attempts required to open all five boxes under each LLM-PS

**Figure 4.** Representative hypothesis trajectories generated by LLM-PS-S (left) and LLM-PS-P (right).

**Figure 5.** Behavioral patterns in the Box Task (N = 100). Note that the number of observations is not directly comparable between LLM agents and children, as we implemented reliable observations in LLM-based agents, which reveal full information about the box

**Figure 6.** Distribution of OBSERVE actions per simulation (N = 100), shown for LLM-PS-P (GPT5.2, low reasoning). Failure to complete the task consistently reflected reaching a time-out, rather than giving up.

**Figure 7.** Distribution of attempts at task termination (either all boxes opened or time elapsed) for

**Figure 8.** Empirical key reliability in the Box Task. Each bar shows the number of children falling

**Figure 9.** Left: Number of consecutive trials on which children (N = 100) repeated the same key–box pair. A repeat is defined as attempting the identical key–box combination on trial t + 1 as on trial t. Right: Proportion of consecutive trials on which children repeated the same key–box pair, aggregated across participants (N = 100). Repetitions above chance reflect persistent belief in a hypothesis despite failure, …

**Figure 10.** The frequency of first attempt per child across the full sample (

**Figure 11.** Top: SoC-Full parameter fits. NLL values are represented by color heatmap intensity

**Figure 12.** Best-fitting model variant per child. Top row: Bar plots showing the number of children

Original abstract

Real world decision-making requires constructing mental models under uncertainty over evidence, over the underlying causal rules, and over the state of the world itself. Which computational principles underpin human inference under such conditions, and do LLM-based agents exhibit similar behavior given matching constraints? We address these questions using an inductive inference Box Task in which participants, human children and LLM-based agents, infer a latent cause through sequential interaction with an uncertain environment. We formalize this task as program induction with Bayesian particle-based inference, admitting two complementary interpretations: (1) as a constraint satisfaction process over hypotheses, and (2) as a program synthesis problem in which hypotheses are executable programs evaluated against evidence. Using the constraint-based formulation, we show that children's behavior is best explained by a combination of subjective evidence reliability and online hypothesis generation, accounting for both their evidence-seeking patterns and their dissociation between task completion and rule generalization. Using the program synthesis formulation, we treat LLM-based agents as model organisms: controllable systems that allow systematic manipulation of task conditions. Across backends, LLM-based agents replicate children's responses to changes in evidence reliability and observability, including discounting unreliable evidence, seeking to resolve partial information, and dissociating between task completion and causal generalization. At the same time, LLM-based agents tend to over-observe and over-comply with instructions relative to children. These results suggest that while children and LLM-based agents adapt similarly to environmental structure, their information-seeking behavior exhibits distinct underlying costs and inductive biases.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper finds LLMs can track some child-like patterns on uncertain evidence in the Box Task, but the program-induction model is doing heavy lifting without clear validation.

The core takeaway is that both children and LLM agents discount unreliable evidence, seek to resolve partial observations, and separate task completion from causal generalization on this inductive inference task. The work treats LLMs as model organisms to test ideas about online hypothesis generation under uncertainty.

What stands out is the controlled manipulation of evidence reliability and observability in a sequential Box Task, plus the dual formalization as constraint satisfaction over hypotheses and as executable program synthesis. This produces a clean empirical bridge between developmental data and LLM behavior, and the reported dissociation patterns are a concrete result that prior work on either side has not directly paired.

The modeling choice is the soft spot. The claim that children's behavior is best explained by subjective reliability plus online generation rests on the particle-based Bayesian setup, yet the abstract gives no quantitative model comparisons, alternative baselines, or process measures to show the formalization fits better than simpler accounts. The same mapping is assumed for the LLMs, where over-observation and over-compliance could just as easily reflect prompt effects rather than shared inductive machinery. Without those checks the replication story is suggestive but not secured.

This is for researchers already working on Bayesian models of children's causal reasoning or on using LLMs to simulate human inference. A reader in either area would get a usable experimental template and some dissociation data to think with.

It deserves peer review. The task and comparison are novel enough that referees can usefully pressure the modeling assumptions and ask for the missing controls.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces the Box Task for studying inductive inference under uncertainty, formalizing it as program induction via Bayesian particle-based inference with dual interpretations as constraint satisfaction over hypotheses and as executable program synthesis. It claims that children's sequential evidence-seeking and generalization behavior is best explained by subjective evidence reliability combined with online hypothesis generation, and that LLM-based agents, treated as model organisms, replicate children's responses to manipulations of evidence reliability and observability (discounting unreliable evidence, resolving partial information, dissociating task completion from causal generalization) while over-observing and over-complying relative to children.

Significance. If the formalization and behavioral mappings hold, the work offers a computational account of inductive processes in children and establishes LLMs as controllable systems for testing cognitive hypotheses, with the program-synthesis view enabling systematic condition manipulations. The explicit dual formalization and focus on information-seeking costs versus inductive biases are strengths that could inform both developmental psychology and AI alignment research.

major comments (3)

Abstract and §3 (constraint-based formulation): The claim that children's behavior is 'best explained' by subjective evidence reliability plus online hypothesis generation is load-bearing for the central contribution, yet the manuscript provides no quantitative model-comparison results, likelihood ratios, or alternative baselines (e.g., standard Bayesian updating without online generation or fixed-reliability models) to establish superiority over other accounts.
Abstract and Results (program-synthesis formulation): The assertion that LLM agents replicate children's responses to evidence reliability and observability changes rests on the assumption that the particle-filter hypothesis space and evidence-weighting dynamics match the effective computation in prompted LLMs; without reported process-level validation, ablation of the particle filter, or comparison to non-Bayesian LLM prompting strategies, the replication claim does not yet follow from the observed behavioral matches.
Methods (Box Task formalization): The load-bearing modeling assumption that the Box Task and its Bayesian particle-based program-induction formalization faithfully capture the inductive processes of both children and LLMs lacks reported checks against confounds such as verbal-report alignment, eye-movement data, or alternative task decompositions; if this mapping does not hold, neither the 'best explanation' nor the cross-agent replication conclusions are secured.
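The likelihood-based comparison requested in the first major comment could look roughly like the Python sketch below. It is an illustration under stated assumptions, not the paper's analysis: `variant_a` and `variant_b` are hypothetical stand-ins (for example, a fixed-reliability baseline and a richer subjective-reliability variant), the per-action probabilities are made-up toy numbers rather than data from the study, and AIC is one choice of complexity penalty among several.

```python
import math
from typing import Dict, List, Tuple


def negative_log_likelihood(action_probs: List[float]) -> float:
    """NLL of one child's action sequence, given the probability a model
    variant assigned to each action the child actually took."""
    return -sum(math.log(p) for p in action_probs)


def aic(nll: float, n_params: int) -> float:
    """Akaike information criterion: 2k - 2 ln L, written in terms of NLL."""
    return 2.0 * n_params + 2.0 * nll


def compare(variants: Dict[str, Tuple[List[float], int]]) -> Dict[str, float]:
    """Map each variant name to its AIC score (lower is better)."""
    return {
        name: aic(negative_log_likelihood(probs), n_params)
        for name, (probs, n_params) in variants.items()
    }


# Toy numbers only: (probabilities of the observed actions, free parameters).
TOY_VARIANTS: Dict[str, Tuple[List[float], int]] = {
    "variant_a": ([0.5, 0.4, 0.6], 1),  # hypothetical simpler baseline
    "variant_b": ([0.8, 0.7, 0.8], 2),  # hypothetical richer variant
}

scores = compare(TOY_VARIANTS)
best = min(scores, key=scores.get)

if __name__ == "__main__":
    print(scores, best)
```

Running the comparison per child and then counting which variant wins most often would give a summary of the kind the Figure 12 caption describes. Whether the paper computes it this way is not stated in the material available here.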

minor comments (2)

Abstract: 'Across backends' is stated without naming the specific LLM families or versions used, which limits reproducibility assessment.
Results: The dissociation between task completion and causal generalization is described qualitatively; a table or figure quantifying the effect sizes across conditions would improve clarity.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive comments, which highlight important aspects of our modeling and claims. We respond to each major comment below and indicate where revisions will be made.

Referee: Abstract and §3 (constraint-based formulation): The claim that children's behavior is 'best explained' by subjective evidence reliability plus online hypothesis generation is load-bearing for the central contribution, yet the manuscript provides no quantitative model-comparison results, likelihood ratios, or alternative baselines (e.g., standard Bayesian updating without online generation or fixed-reliability models) to establish superiority over other accounts.

Authors: We agree that the 'best explained' phrasing would be strengthened by quantitative comparisons. The current results show that the model accounts for sequential evidence-seeking and the observed dissociation between task completion and generalization, patterns not predicted by standard Bayesian updating without online generation. In revision we will add likelihood-based model comparisons against the suggested baselines (fixed-reliability and non-online variants) to provide the requested quantitative support.

revision: yes

Referee: Abstract and Results (program-synthesis formulation): The assertion that LLM agents replicate children's responses to evidence reliability and observability changes rests on the assumption that the particle-filter hypothesis space and evidence-weighting dynamics match the effective computation in prompted LLMs; without reported process-level validation, ablation of the particle filter, or comparison to non-Bayesian LLM prompting strategies, the replication claim does not yet follow from the observed behavioral matches.

Authors: The replication claim is restricted to behavioral outcomes under identical task manipulations. We did not conduct process-level validation or ablations because prompted LLMs do not expose internal hypothesis spaces or particle-filter dynamics. We will revise the text to clarify that the reported matches are behavioral only, to discuss the limitations of this approach, and to note the absence of comparisons to non-Bayesian prompting strategies.

revision: partial

Referee: Methods (Box Task formalization): The load-bearing modeling assumption that the Box Task and its Bayesian particle-based program-induction formalization faithfully capture the inductive processes of both children and LLMs lacks reported checks against confounds such as verbal-report alignment, eye-movement data, or alternative task decompositions; if this mapping does not hold, neither the 'best explanation' nor the cross-agent replication conclusions are secured.

Authors: The formalization is supported by its ability to generate precise, testable predictions that match the key empirical dissociations in both populations. The study did not collect eye-movement data or additional verbal-report alignment measures. We will expand the Methods and Discussion sections to address potential confounds explicitly and to clarify the scope of the mapping, while acknowledging that direct process-level checks remain for future work.

revision: partial

Circularity Check

0 steps flagged

No circularity: modeling choice and behavioral comparison are independent of target claims

The paper selects a formalization (program induction + Bayesian particle filter) as an interpretive lens for the Box Task, then reports how children's and LLM agents' observed behavior aligns with predictions from that lens under manipulated evidence reliability and observability. This is a standard modeling assumption followed by empirical comparison, not a self-definitional loop, fitted parameter renamed as prediction, or load-bearing self-citation. No equations, uniqueness theorems, or ansatzes are shown reducing the central claims to their own inputs by construction. The derivation chain therefore remains self-contained against external behavioral data.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no explicit free parameters, axioms, or invented entities; the formalization as Bayesian particle-based inference is referenced but not detailed.

pith-pipeline@v0.9.1-grok · 6735 in / 1065 out tokens · 41512 ms · 2026-06-30T13:13:38.450378+00:00 · methodology

Reference graph

Works this paper leans on

46 extracted references · 7 canonical work pages · 2 internal anchors

[1] Kelsey Allen, Franziska Brändle, Matthew Botvinick, Judith E Fan, Samuel J Gershman, Alison Gopnik, Thomas L Griffiths, Joshua K Hartshorne, Tobias U Hauser, Mark K Ho, et al. Using games to understand the mind. Nature human behaviour, 8(6):1035–1043, 2024.
[2] Matej Balog, Alexander L Gaunt, Marc Brockschmidt, Sebastian Nowozin, and Daniel Tarlow. DeepCoder: Learning to write programs. arXiv preprint arXiv:1611.01989, 2016.
[3] Elizabeth Bonawitz, Stephanie Denison, Alison Gopnik, and Thomas L Griffiths. Win-stay, lose-sample: A simple sequential algorithm for approximating Bayesian inference. Cognitive psychology, 74:35–65, 2014.
[4] Elizabeth Bonawitz, Stephanie Denison, Thomas L Griffiths, and Alison Gopnik. Probabilistic models, learning algorithms, and response variability: sampling in cognitive development. Trends in cognitive sciences, 18(10):497–500, 2014.
[5] Neil R Bramley and Fei Xu. Active inductive inference in children and adults: A constructivist perspective. Cognition, 238:105471, 2023.
[6] Neil R Bramley, Peter Dayan, Thomas L Griffiths, and David A Lagnado. Formalizing Neurath’s ship: Approximate algorithms for online causal learning. Psychological review, 124(3):301, 2017.
[7] François Chollet. On the measure of intelligence. CoRR, abs/1911.01547, 2019. URL http://arxiv.org/abs/1911.01547
[8] Katherine M Collins, Ilia Sucholutsky, Umang Bhatt, Kartik Chandra, Lionel Wong, Mina Lee, Cedegao E Zhang, Tan Zhi-Xuan, Mark Ho, Vikash Mansinghka, et al. Building machines that learn and think with people. Nature human behaviour, 8(10):1851–1863, 2024.
[9] Nicola Dainese, Matteo Merler, Minttu Alakuijala, and Pekka Marttinen. Generating code world models with large language models guided by Monte Carlo tree search. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024. URL https://arxiv.org/abs/2405.15383
[10] Pierre Del Moral, Arnaud Doucet, and Ajay Jasra. Sequential Monte Carlo samplers. Journal of the Royal Statistical Society Series B: Statistical Methodology, 68(3):411–436, 05 2006. ISSN 1369-7412. doi: 10.1111/j.1467-9868.2006.00553.x. URL https://doi.org/10.1111/j.1467-9868.2006.00553.x
[11] Stephanie Denison, Elizabeth Bonawitz, Alison Gopnik, and Thomas L Griffiths. Rational variability in children’s causal inferences: The sampling hypothesis. Cognition, 126(2):285–300, 2013.
[12] Michael C Frank. Openly accessible LLMs can help us to understand human cognition. Nature Human Behaviour, 7(11):1825–1827, 2023.
[13] Noah D Goodman, Joshua B Tenenbaum, Jacob Feldman, and Thomas L Griffiths. A rational analysis of rule-based concept learning. Cognitive science, 32(1):108–154, 2008.
[14] Alison Gopnik. Childhood as a solution to explore–exploit tensions. Philosophical Transactions of the Royal Society B, 375(1803):20190502, 2020.
[15] Alison Gopnik, Clark Glymour, David M Sobel, Laura E Schulz, Tamar Kushnir, and David Danks. A theory of causal learning in children: causal maps and Bayes nets. Psychological review, 111(1):3, 2004.
[16] Thomas L Griffiths, Joshua B Tenenbaum, and Charles Kemp. Bayesian inference. The Oxford handbook of thinking and reasoning, pages 22–35, 2012.
[17] Sumit Gulwani. Automating string processing in spreadsheets using input-output examples. SIGPLAN Not., 46(1):317–330, January 2011. ISSN 0362-1340. doi: 10.1145/1925844.1926423. URL https://doi.org/10.1145/1925844.1926423
[18] Sumit Gulwani, José Hernández-Orallo, Emanuel Kitzelmann, Stephen H. Muggleton, Ute Schmid, and Benjamin Zorn. Inductive programming meets the real world. Commun. ACM, 58(11):90–99, October. ISSN 0001-0782. doi: 10.1145/2736282. URL https://doi.org/10.1145/2736282
[19] Philip Nicholas Johnson-Laird. Mental models: Towards a cognitive science of language, inference, and consciousness. Number 6. Harvard University Press, 1983.
[20] Leslie Pack Kaelbling, Michael L. Littman, and Anthony R. Cassandra. Planning and acting in partially observable stochastic domains. Artificial Intelligence, 101(1-2):99–134, 1998.
[21] Charles Kemp and Joshua B Tenenbaum. Structured statistical models of inductive reasoning. Psychological review, 116(1):20, 2009.
[22] Brenden M Lake, Ruslan Salakhutdinov, and Joshua B Tenenbaum. Human-level concept learning through probabilistic program induction. Science, 350(6266):1332–1338, 2015.
[23] Wen-Ding Li and Kevin Ellis. Is programming by example solved by LLMs? In Proceedings of the 38th International Conference on Neural Information Processing Systems, NIPS ’24, Red Hook, NY, USA. Curran Associates Inc. ISBN 9798331314385.
[24] Dennis V Lindley. On a measure of the information provided by an experiment. The Annals of Mathematical Statistics, 27(4):986–1005, 1956.
[25] David Marr. Vision: A Computational Investigation into the Human Representation and Processing of Visual Information. W. H. Freeman, 1982.
[26] Steven Thomas Piantadosi. Learning and the language of thought. PhD thesis, MIT, 2011.
[27] Top Piriyakulkij, Cassidy Langenfeld, Tuan Anh Le, and Kevin Ellis. Doing experiments and revising rules with natural language and probabilistic reasoning. Advances in Neural Information Processing Systems, 37:53102–53137, 2024.
[28] Wasu Top Piriyakulkij, Yichao Liang, Hao Tang, Adrian Weller, Marta Kryven, and Kevin Ellis. PoE-World: Compositional world modeling with products of programmatic experts. Advances in Neural Information Processing Systems (NeurIPS), 2025.
[29] Linlu Qiu, Liwei Jiang, Ximing Lu, Melanie Sclar, Valentina Pyatkin, Chandra Bhagavatula, Bailin Wang, Yoon Kim, Yejin Choi, Nouha Dziri, and Xiang Ren. Phenomenal yet puzzling: Testing inductive reasoning capabilities of language models with hypothesis refinement. In The Twelfth International Conference on Learning Representations, 2024. URL https://openre...
[30] Mia Radovanovic, Ece Yucer, and Jessica A Sommerville. Girls persist more but divest less from ineffective teaching than boys. Journal of Experimental Psychology: General, 2024.
[31] Tom Rainforth, Adam Foster, Desi R Ivanova, and Freddie Bickford Smith. Modern Bayesian experimental design. Statistical Science, 39(1):100–114, 2024.
[32] Joshua Stewart Rule. The child as hacker: building more human-like models of learning. PhD thesis, Massachusetts Institute of Technology, 2020.
[33] Herbert A Simon. A behavioral model of rational choice. The quarterly journal of economics, pages 99–118, 1955.
[34] Christopher Summerfield. Natural General Intelligence: How understanding the brain can help us build AI. Oxford University Press, 2023.
[35] Hao Tang, Darren Key, and Kevin Ellis. WorldCoder, a model-based LLM agent: Building world models by writing code and interacting with the environment. Advances in Neural Information Processing Systems, 37:70148–70212, 2024.
[36] Bas van Opheusden and Wei Ji Ma. Tasks for aligning human and machine planning. Current Opinion in Behavioral Sciences, 29:127–133, 2019.
[37] Edward Vul, Noah Goodman, Thomas L Griffiths, and Joshua B Tenenbaum. One and done? Optimal decisions from very few samples. Cognitive science, 38(4):599–637, 2014.
[38] Ruocheng Wang, Eric Zelikman, Gabriel Poesia, Yewen Pu, Nick Haber, and Noah Goodman. Hypothesis search: Inductive reasoning with language models. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=G7UtIGQmjm
[39] Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Irina Shafran, Karthik Narasimhan, and Yuan Cao. ReAct: Synergizing reasoning and acting in language models. In International Conference on Learning Representations, 2023.
[40] Lance Ying, Katherine M Collins, Prafull Sharma, Cedric Colas, Kaiya Ivy Zhao, Adrian Weller, Zenna Tavares, Phillip Isola, Samuel J Gershman, Jacob D Andreas, et al. Assessing adaptive world models in machines with novel games. URL https://arxiv.org/abs/2507.12821, 2025.

A The Box Task Setup

The Box Task environment is introduced in [30]. Below we summari...
Practice: Participants tried plain keys on a plain practice box to familiarize themselves with the key action, including both correct and incorrect keys.

Instruction: Children viewed an instructional video in which a teacher demonstrated an incorrect color-matching rule, using a red-fobbed key to open the red box. This demonstration was confounded: for the red box only, the correct key happened to match both color and number. For all other boxes, color-matched keys were incorrect. The teacher communicate...

Test: The five boxes were arranged in a fixed order in front of the participant. All 13 keys were placed in a single pile. Children were instructed to open all five boxes within a 5-minute time limit, working independently. The experimenter remained present but provided no feedback. Boxes could be picked up and examined. The test ended upon opening all fi...

Generalization: Four forced-choice trials using novel box images were presented on a tablet screen. For each box, four keys were presented: a color-matched key, a shape-matched key, a number-matched key (correct), and a number foil. Children selected which key they believed would open the box.

Box Task Environment. The boxes were uniquely colored physic...

[1] Kelsey Allen, Franziska Brändle, Matthew Botvinick, Judith E Fan, Samuel J Gershman, Alison Gopnik, Thomas L Griffiths, Joshua K Hartshorne, Tobias U Hauser, Mark K Ho, et al. Using games to understand the mind. Nature Human Behaviour, 8(6):1035–1043, 2024

[2] Matej Balog, Alexander L Gaunt, Marc Brockschmidt, Sebastian Nowozin, and Daniel Tarlow. DeepCoder: Learning to write programs. arXiv preprint arXiv:1611.01989, 2016

[3] Elizabeth Bonawitz, Stephanie Denison, Alison Gopnik, and Thomas L Griffiths. Win-stay, lose-sample: A simple sequential algorithm for approximating Bayesian inference. Cognitive Psychology, 74:35–65, 2014

[4] Elizabeth Bonawitz, Stephanie Denison, Thomas L Griffiths, and Alison Gopnik. Probabilistic models, learning algorithms, and response variability: Sampling in cognitive development. Trends in Cognitive Sciences, 18(10):497–500, 2014

[5] Neil R Bramley and Fei Xu. Active inductive inference in children and adults: A constructivist perspective. Cognition, 238:105471, 2023

[6] Neil R Bramley, Peter Dayan, Thomas L Griffiths, and David A Lagnado. Formalizing Neurath’s ship: Approximate algorithms for online causal learning. Psychological Review, 124(3):301, 2017

[7] François Chollet. On the measure of intelligence. CoRR, abs/1911.01547, 2019. URL http://arxiv.org/abs/1911.01547

[8] Katherine M Collins, Ilia Sucholutsky, Umang Bhatt, Kartik Chandra, Lionel Wong, Mina Lee, Cedegao E Zhang, Tan Zhi-Xuan, Mark Ho, Vikash Mansinghka, et al. Building machines that learn and think with people. Nature Human Behaviour, 8(10):1851–1863, 2024

[9] Nicola Dainese, Matteo Merler, Minttu Alakuijala, and Pekka Marttinen. Generating code world models with large language models guided by Monte Carlo tree search. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024. URL https://arxiv.org/abs/2405.15383

[10] Pierre Del Moral, Arnaud Doucet, and Ajay Jasra. Sequential Monte Carlo samplers. Journal of the Royal Statistical Society Series B: Statistical Methodology, 68(3):411–436, May 2006. ISSN 1369-7412. doi: 10.1111/j.1467-9868.2006.00553.x. URL https://doi.org/10.1111/j.1467-9868.2006.00553.x

[11] Stephanie Denison, Elizabeth Bonawitz, Alison Gopnik, and Thomas L Griffiths. Rational variability in children’s causal inferences: The sampling hypothesis. Cognition, 126(2):285–300, 2013

[12] Michael C Frank. Openly accessible LLMs can help us to understand human cognition. Nature Human Behaviour, 7(11):1825–1827, 2023

[13] Noah D Goodman, Joshua B Tenenbaum, Jacob Feldman, and Thomas L Griffiths. A rational analysis of rule-based concept learning. Cognitive Science, 32(1):108–154, 2008

[14] Alison Gopnik. Childhood as a solution to explore–exploit tensions. Philosophical Transactions of the Royal Society B, 375(1803):20190502, 2020

[15] Alison Gopnik, Clark Glymour, David M Sobel, Laura E Schulz, Tamar Kushnir, and David Danks. A theory of causal learning in children: Causal maps and Bayes nets. Psychological Review, 111(1):3, 2004

[16] Thomas L Griffiths, Joshua B Tenenbaum, and Charles Kemp. Bayesian inference. The Oxford Handbook of Thinking and Reasoning, pages 22–35, 2012

[17] Sumit Gulwani. Automating string processing in spreadsheets using input-output examples. SIGPLAN Not., 46(1):317–330, January 2011. ISSN 0362-1340. doi: 10.1145/1925844.1926423. URL https://doi.org/10.1145/1925844.1926423

[18] Sumit Gulwani, José Hernández-Orallo, Emanuel Kitzelmann, Stephen H. Muggleton, Ute Schmid, and Benjamin Zorn. Inductive programming meets the real world. Commun. ACM, 58(11):90–99, October 2015. ISSN 0001-0782. doi: 10.1145/2736282. URL https://doi.org/10.1145/2736282

[20] Philip Nicholas Johnson-Laird. Mental models: Towards a cognitive science of language, inference, and consciousness. Number 6. Harvard University Press, 1983

[21] Leslie Pack Kaelbling, Michael L. Littman, and Anthony R. Cassandra. Planning and acting in partially observable stochastic domains. Artificial Intelligence, 101(1-2):99–134, 1998

[22] Charles Kemp and Joshua B Tenenbaum. Structured statistical models of inductive reasoning. Psychological Review, 116(1):20, 2009

[23] Brenden M Lake, Ruslan Salakhutdinov, and Joshua B Tenenbaum. Human-level concept learning through probabilistic program induction. Science, 350(6266):1332–1338, 2015

[24] Wen-Ding Li and Kevin Ellis. Is programming by example solved by LLMs? In Proceedings of the 38th International Conference on Neural Information Processing Systems, NIPS ’24, Red Hook, NY, USA. Curran Associates Inc. ISBN 9798331314385

[26] Dennis V Lindley. On a measure of the information provided by an experiment. The Annals of Mathematical Statistics, 27(4):986–1005, 1956

[27] David Marr. Vision: A Computational Investigation into the Human Representation and Processing of Visual Information. W. H. Freeman, 1982

[28] Steven Thomas Piantadosi. Learning and the language of thought. PhD thesis, MIT, 2011

[29] Top Piriyakulkij, Cassidy Langenfeld, Tuan Anh Le, and Kevin Ellis. Doing experiments and revising rules with natural language and probabilistic reasoning. Advances in Neural Information Processing Systems, 37:53102–53137, 2024

[30] Wasu Top Piriyakulkij, Yichao Liang, Hao Tang, Adrian Weller, Marta Kryven, and Kevin Ellis. PoE-World: Compositional world modeling with products of programmatic experts. Advances in Neural Information Processing Systems (NeurIPS), 2025

[31] Linlu Qiu, Liwei Jiang, Ximing Lu, Melanie Sclar, Valentina Pyatkin, Chandra Bhagavatula, Bailin Wang, Yoon Kim, Yejin Choi, Nouha Dziri, and Xiang Ren. Phenomenal yet puzzling: Testing inductive reasoning capabilities of language models with hypothesis refinement. In The Twelfth International Conference on Learning Representations, 2024. URL https://openre...

[32] Mia Radovanovic, Ece Yucer, and Jessica A Sommerville. Girls persist more but divest less from ineffective teaching than boys. Journal of Experimental Psychology: General, 2024

[33] Tom Rainforth, Adam Foster, Desi R Ivanova, and Freddie Bickford Smith. Modern Bayesian experimental design. Statistical Science, 39(1):100–114, 2024

[34] Joshua Stewart Rule. The child as hacker: Building more human-like models of learning. PhD thesis, Massachusetts Institute of Technology, 2020

[35] Herbert A Simon. A behavioral model of rational choice. The Quarterly Journal of Economics, pages 99–118, 1955

[36] Christopher Summerfield. Natural General Intelligence: How understanding the brain can help us build AI. Oxford University Press, 2023

[37] Hao Tang, Darren Key, and Kevin Ellis. WorldCoder, a model-based LLM agent: Building world models by writing code and interacting with the environment. Advances in Neural Information Processing Systems, 37:70148–70212, 2024

[38] Bas van Opheusden and Wei Ji Ma. Tasks for aligning human and machine planning. Current Opinion in Behavioral Sciences, 29:127–133, 2019

[39] Edward Vul, Noah Goodman, Thomas L Griffiths, and Joshua B Tenenbaum. One and done? Optimal decisions from very few samples. Cognitive Science, 38(4):599–637, 2014

[40] Ruocheng Wang, Eric Zelikman, Gabriel Poesia, Yewen Pu, Nick Haber, and Noah Goodman. Hypothesis search: Inductive reasoning with language models. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=G7UtIGQmjm

[41] Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. ReAct: Synergizing reasoning and acting in language models. In International Conference on Learning Representations, 2023

[42] Lance Ying, Katherine M Collins, Prafull Sharma, Cedric Colas, Kaiya Ivy Zhao, Adrian Weller, Zenna Tavares, Phillip Isola, Samuel J Gershman, Jacob D Andreas, et al. Assessing adaptive world models in machines with novel games. URL https://arxiv.org/abs/2507.12821, 2025

A The Box Task Setup

The Box Task environment is introduced in [30]. Below we summari...

Practice: Participants tried plain keys on a plain practice box to familiarize themselves with the key action, including both correct and incorrect keys.

Instruction: Children viewed an instructional video in which a teacher demonstrated an incorrect color-matching rule, using a red-fobbed key to open the red box. This demonstration was confounded: for the red box only, the correct key happened to match both color and number. For all other boxes, color-matched keys were incorrect. The teacher communicate...

Test: The five boxes were arranged in a fixed order in front of the participant. All 13 keys were placed in a single pile. Children were instructed to open all five boxes within a 5-minute time limit, working independently. The experimenter remained present but provided no feedback. Boxes could be picked up and examined. The test ended upon opening all fi...

Generalization: Four forced-choice trials using novel box images were presented on a tablet screen. For each box, four keys were presented: a color-matched key, a shape-matched key, a number-matched key (correct), and a number foil. Children selected which key they believed would open the box.

Box Task Environment. The boxes were uniquely colored physic...
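The confound described in the Instruction step can be made concrete with a minimal sketch. It is not the paper's model: it assumes deterministic evidence and just three candidate rules (color, shape, number match), whereas the paper treats evidence reliability as subjective. The `Item` class, the `RULES` table, the `consistent_rules` helper, and all feature values are hypothetical names and data chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Item:
    """A box or a key, described by three features (values are illustrative)."""
    color: str
    shape: str
    number: int


# Hypothetical candidate rules: a key opens a box if the named feature matches.
RULES = {
    "color": lambda box, key: box.color == key.color,
    "shape": lambda box, key: box.shape == key.shape,
    "number": lambda box, key: box.number == key.number,
}


def consistent_rules(observations):
    """Return the rule names that predict every (box, key, opened) observation."""
    return [
        name
        for name, rule in RULES.items()
        if all(rule(box, key) == opened for box, key, opened in observations)
    ]


# Confounded demonstration: the key matches the red box on color AND number.
red_box = Item("red", "circle", 1)
demo_key = Item("red", "square", 1)
demo = [(red_box, demo_key, True)]

# A later attempt on another box: a color-matched key fails to open it.
blue_box = Item("blue", "star", 2)
blue_key = Item("blue", "triangle", 3)
followup = demo + [(blue_box, blue_key, False)]

print(consistent_rules(demo))      # color and number both survive
print(consistent_rules(followup))  # only number survives
```

Under these assumptions the demonstration alone cannot separate the color rule from the number rule. A single failed color-matched attempt on another box removes the color rule, which is why the remaining boxes carry the diagnostic evidence.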