Using Probabilistic Programs to Train Inductive Reasoning in Large Language Models

Akshay K. Jagadish; Brenden M. Lake; Liyi Zhang; Thomas L. Griffiths

arxiv: 2606.09856 · v1 · pith:KJITCPSEnew · submitted 2026-05-26 · 💻 cs.CL · cs.AI· cs.LG· stat.ML

Using Probabilistic Programs to Train Inductive Reasoning in Large Language Models

Liyi Zhang , Akshay K. Jagadish , Brenden M. Lake , Thomas L. Griffiths This is my paper

Pith reviewed 2026-06-29 18:41 UTC · model grok-4.3

classification 💻 cs.CL cs.AIcs.LGstat.ML

keywords inductive reasoningprobabilistic programslarge language modelsfine-tuningsoft labelsuncertainty calibrationposterior trainingdistributional targets

0 comments

The pith

Program-based Posterior Training uses probabilistic programs to generate soft labels that improve LLMs on inductive reasoning tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Program-based Posterior Training (PPT) to train large language models on inductive reasoning tasks. Standard fine-tuning struggles with inductive problems because they involve uncertain, distributional targets and lack large labeled datasets. PPT has an LLM generate diverse scenarios as probabilistic programs, performs inference to create soft labels, and fine-tunes the model on those. Evaluations show improved accuracy on held-out tasks, better match to human judgments, and gains on external benchmarks that go beyond simple temperature scaling for calibration.

Core claim

PPT substantially improves estimation accuracy on held-out inductive tasks, increases alignment with human judgments, and transfers to external benchmarks for estimation and calibration. The gains in raw calibration are not subsumed by post-hoc temperature scaling, showing that the models have more deeply internalized uncertainty compared to output rescaling.

What carries the argument

Program-based Posterior Training (PPT), which uses LLMs to generate open-world scenarios as probabilistic programs and runs probabilistic inference to produce distributional target responses for fine-tuning.

If this is right

Improved performance on held-out inductive reasoning motifs
Increased alignment with human inductive judgments
Transfer of gains to external estimation and calibration benchmarks
Calibration improvements that persist beyond post-hoc temperature scaling

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Hybrid approaches combining symbolic probabilistic programs with neural fine-tuning could address other forms of uncertain reasoning in AI systems.
This method might reduce the need for expensive human-labeled data in training for real-world decision making under uncertainty.
Further work could test if the approach generalizes to more complex or domain-specific inductive problems beyond the generated scenarios.

Load-bearing premise

The distributional targets produced by the LLM-generated probabilistic programs and inference are faithful representations of real-world inductive reasoning rather than artifacts of the generation process.

What would settle it

If fine-tuning on PPT labels shows no improvement over baselines when evaluated on a set of inductive tasks constructed entirely from human-authored scenarios without LLM generation.

Figures

Figures reproduced from arXiv: 2606.09856 by Akshay K. Jagadish, Brenden M. Lake, Liyi Zhang, Thomas L. Griffiths.

**Figure 2.** Figure 2: Effects of Program-based Posterior Training. (a) Example held-out scenario and queries. [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗

**Figure 3.** Figure 3: Mean absolute error (MAE) where colors indicate different models, base and fine-tuned. [PITH_FULL_IMAGE:figures/full_fig_p006_3.png] view at source ↗

**Figure 3.** Figure 3: The Figure reports mean absolute error (MAE) on three evaluation sets: sports training [PITH_FULL_IMAGE:figures/full_fig_p007_3.png] view at source ↗

**Figure 4.** Figure 4: Per-query mean response and standard-deviation between humans and LLMs on held-out [PITH_FULL_IMAGE:figures/full_fig_p008_4.png] view at source ↗

**Figure 5.** Figure 5: Mean estimates on Jamal’s strength on eight scenarios that progressively goes from weaker [PITH_FULL_IMAGE:figures/full_fig_p019_5.png] view at source ↗

read the original abstract

Post-training Large Language Models (LLMs) for reasoning typically focuses on deductive tasks such as mathematics and coding where correctness is verifiable. Yet, many real-world reasoning problems are inductive: agents must infer uncertain beliefs from sparse, ambiguous observations. There are challenges to using standard fine-tuning methods for inductive reasoning, including difficulties in curating large-scale, high-quality labeled datasets and in handling targets that are inherently distributional. In this work, we introduce a novel approach, called Program-based Posterior Training (PPT), to address these limitations: we use an LLM to generate diverse open-world scenarios as probabilistic programs, run probabilistic inference to produce distributional target responses to queries, and then fine-tune on these probabilistic soft labels. Using this approach, we fine-tune LLMs on 10,000 programmatically generated scenarios and evaluate on held-out motifs, human-labeled judgments, and external benchmarks. Overall, PPT substantially improves estimation accuracy on held-out inductive tasks, increases alignment with human judgments, and transfers to external benchmarks for estimation and calibration. Additionally, the gains in raw calibration are not subsumed by post-hoc temperature scaling, showing that the models have more deeply internalized uncertainty compared to output rescaling. Together, these results suggest that probabilistic-program-mediated fine-tuning is a promising approach for post-training LLMs to reliably perform approximate inductive inference.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

PPT offers a concrete pipeline for generating soft-label data on inductive tasks via LLM-written probabilistic programs, but the claim that these targets are faithful to real inductive reasoning rests on an unverified assumption.

read the letter

The paper's core contribution is a training loop that has an LLM write open-world probabilistic programs for 10k scenarios, runs inference to get distributional answers, and fine-tunes another model on those soft labels. This directly targets the data problem for inductive reasoning, where standard fine-tuning struggles because targets are not single correct answers.

What works is the framing. Deductive post-training has clear verifiers; inductive problems do not, and the authors correctly note that temperature scaling alone does not fix internal calibration. The evaluation plan—held-out motifs, human judgments, and external benchmarks—covers the right bases in principle.

The load-bearing weakness is the lack of evidence that the generated programs and their posteriors match actual human inductive reasoning rather than the generator LLM's own priors or syntactic habits. The abstract gives no description of how program correctness, coverage of ambiguity, or posterior calibration against independent human data was checked. If the programs systematically simplify uncertainty or inherit LLM biases on causal structure, then measured gains in accuracy and human alignment could be explained by the fine-tuned model learning the generator's idiosyncrasies.

Circularity is also not fully ruled out. Using the same model family to produce the programs and then train on their outputs creates dependence even with an inference step in between.

This is worth sending to referees. The idea addresses a genuine gap with a reproducible method, and the experiments are set up to test transfer. Reviewers can check the missing validation steps on the programs themselves and ask for the quantitative numbers that the abstract omits. Researchers working on LLM reasoning beyond math and code would get value from the details once those are filled in.

Referee Report

2 major / 1 minor

Summary. The paper introduces Program-based Posterior Training (PPT): an LLM generates diverse open-world scenarios as probabilistic programs, probabilistic inference produces distributional soft labels for queries, and the LLM is fine-tuned on 10k such labels. Evaluation on held-out motifs, human judgments, and external benchmarks is claimed to show substantial gains in estimation accuracy, human alignment, and calibration that are not explained by post-hoc temperature scaling.

Significance. If the central claims are substantiated, PPT offers a scalable route to distributional training targets for inductive reasoning without manual curation of labels, addressing a key limitation of standard fine-tuning on deductive tasks. The separation of program generation from inference and the demonstration that calibration gains exceed temperature scaling would be notable contributions to post-training methods for uncertainty-aware reasoning.

major comments (2)

[Abstract] Abstract: the claims that PPT 'substantially improves estimation accuracy on held-out inductive tasks' and 'increases alignment with human judgments' are presented without any quantitative results, confidence intervals, statistical tests, or ablation controls. This is load-bearing for the central claim because the abstract supplies the only evidence summary available for assessing whether the data support the reported gains.
[Abstract] Abstract (PPT pipeline description): the method assumes that inference over LLM-generated probabilistic programs yields distributional targets faithful to real-world inductive reasoning rather than artifacts of the generator LLM or program syntax. No validation of program correctness, coverage of ambiguity, or calibration against independent human data is described; if the programs systematically encode the generator's priors, measured improvements could reflect learning of those idiosyncrasies rather than general inductive inference.

minor comments (1)

[Abstract] The abstract refers to 'held-out motifs' and 'external benchmarks' without naming them or citing the specific datasets or prior work they come from.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive report. We address each major comment point by point below, indicating where revisions will be made to the manuscript.

read point-by-point responses

Referee: [Abstract] Abstract: the claims that PPT 'substantially improves estimation accuracy on held-out inductive tasks' and 'increases alignment with human judgments' are presented without any quantitative results, confidence intervals, statistical tests, or ablation controls. This is load-bearing for the central claim because the abstract supplies the only evidence summary available for assessing whether the data support the reported gains.

Authors: We agree that the abstract would be strengthened by including quantitative results. In the revised manuscript we will update the abstract to report specific metrics (e.g., accuracy gains, alignment scores) with confidence intervals and references to the statistical comparisons performed in the main text. revision: yes
Referee: [Abstract] Abstract (PPT pipeline description): the method assumes that inference over LLM-generated probabilistic programs yields distributional targets faithful to real-world inductive reasoning rather than artifacts of the generator LLM or program syntax. No validation of program correctness, coverage of ambiguity, or calibration against independent human data is described; if the programs systematically encode the generator's priors, measured improvements could reflect learning of those idiosyncrasies rather than general inductive inference.

Authors: The manuscript reports evaluation on held-out motifs and human-labeled judgments, which directly tests generalization and alignment with independent human data. These results provide evidence that performance gains are not limited to artifacts of the generator. We will add a brief discussion of program-generation limitations and any additional checks on program validity in the revision. revision: partial

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The paper describes generating 10k scenarios as probabilistic programs via LLM, running inference to obtain distributional soft labels, and fine-tuning on those labels, with evaluation on held-out motifs, human judgments, and external benchmarks. No equations, definitions, or steps in the abstract reduce any claimed prediction or result to the inputs by construction (e.g., no self-definitional terms, no fitted parameter renamed as prediction, no load-bearing self-citation chain). The probabilistic inference step and independent external evaluations keep the chain self-contained; any dependence on the generator LLM is not exhibited as a formal equivalence in the provided text.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

Review performed on abstract only; no equations or detailed methods available to enumerate free parameters or invented entities.

axioms (2)

domain assumption LLM can generate diverse open-world scenarios as probabilistic programs that are suitable for inference
Central to the PPT pipeline described in the abstract.
domain assumption Probabilistic inference on those programs yields distributional targets that improve inductive reasoning when used for fine-tuning
Required for the claim that soft labels transfer to held-out tasks and human judgments.

pith-pipeline@v0.9.1-grok · 5781 in / 1273 out tokens · 28486 ms · 2026-06-29T18:41:01.051035+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

42 extracted references · 5 canonical work pages · 4 internal anchors

[1]

Chi, Quoc V

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed H. Chi, Quoc V . Le, and Denny Zhou. Chain-of-thought prompting elicits reasoning in large language models. InAdvances in Neural Information Processing Systems, Red Hook, NY , USA, 2022

2022
[2]

Evaluating Large Language Models Trained on Code

Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde De Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. Evaluating large language models trained on code.arXiv preprint arXiv:2107.03374, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021
[3]

Bayesian models of cognition

Thomas L Griffiths, Charles Kemp, and Joshua B Tenenbaum. Bayesian models of cognition. InCambridge Handbook of Computational Cognitive Sciences. Cambridge University Press, 2008

2008
[4]

How to grow a mind: statistics, structure, and abstraction.Science, 331(6022):1279–1285, 2011

Joshua B Tenenbaum, Charles Kemp, Thomas L Griffiths, and Noah D Goodman. How to grow a mind: statistics, structure, and abstraction.Science, 331(6022):1279–1285, 2011

2011
[5]

David M. Blei. Build, compute, critique, repeat: Data analysis with latent variable models. In Annual Review of Statistics and Its Application, 2014

2014
[6]

Bishop.Pattern Recognition and Machine Learning (Information Science and Statistics)

Christopher M. Bishop.Pattern Recognition and Machine Learning (Information Science and Statistics). Springer, 2006

2006
[7]

Human-like category learning by injecting ecological priors from large language models into neural networks

Akshay Kumar Jagadish, Julian Coda-Forno, Mirko Thalmann, Eric Schulz, and Marcel Binz. Human-like category learning by injecting ecological priors from large language models into neural networks. InForty-first International Conference on Machine Learning
[8]

Meta-learning ecological priors from large language models explains human learning and decision making

Akshay K Jagadish, Mirko Thalmann, Julian Coda-Forno, Marcel Binz, and Eric Schulz. Meta- learning ecological priors from large language models explains human learning and decision making.arXiv preprint arXiv:2509.00116, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[9]

Mind the confidence gap: Overconfidence, calibration, and distractor effects in large language models.Transactions on Machine Learning Research (TMLR), 2025

Prateek Chhikara. Mind the confidence gap: Overconfidence, calibration, and distractor effects in large language models.Transactions on Machine Learning Research (TMLR), 2025

2025
[10]

Generating computational cognitive models using large language models.Advances in Neural Information Processing Systems, 38:87796–87833, 2025

Milena Rmus, Akshay Kumar Jagadish, Marvin Mathony, Tobias Ludwig, and Eric Schulz. Generating computational cognitive models using large language models.Advances in Neural Information Processing Systems, 38:87796–87833, 2025

2025
[11]

Modeling open-world cognition as on-demand synthesis of probabilistic models

Lionel Wong, Katherine M Collins, Lance Ying, Cedegao E Zhang, Adrian Weller, Tobias Ger- stenberg, Timothy O’Donnell, Alexander K Lew, Jacob D Andreas, Joshua B Tenenbaum, and Tyler Brooke-Wilson. Modeling open-world cognition as on-demand synthesis of probabilistic models. InProceedings of the Annual Meeting of the Cognitive Science Society, volume 47, 2025

2025
[12]

OpenEstimate: Evaluating LLMs on reasoning under uncertainty with real-world data

Alana Marzoev, Jillian Ross, Michael Cafarella, and Jacob Andreas. OpenEstimate: Evaluating LLMs on reasoning under uncertainty with real-world data. InThe Fourteenth International Conference on Learning Representations, 2026

2026
[13]

WinoGrande: an adversarial winograd schema challenge at scale.Communications of the ACM, 64(9):99–106, August 2021

Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. WinoGrande: an adversarial winograd schema challenge at scale.Communications of the ACM, 64(9):99–106, August 2021

2021
[14]

Measuring massive multitask language understanding

Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. InInternational Conference on Learning Representations, 2021

2021
[15]

TruthfulQA: Measuring how models mimic human falsehoods

Stephanie Lin, Jacob Hilton, and Owain Evans. TruthfulQA: Measuring how models mimic human falsehoods. InProceedings of the 60th Annual Meeting of the Association for Compu- tational Linguistics (Volume 1: Long Papers), pages 3214–3252, Dublin, Ireland, May 2022. Association for Computational Linguistics

2022
[16]

Hellaswag: Can a machine really finish your sentence? InProceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 2019

Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. Hellaswag: Can a machine really finish your sentence? InProceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 2019. 11

2019
[17]

Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge

Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have solved question answering? Try ARC, the AI2 reasoning challenge.arXiv preprint arXiv:1803.05457, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018
[18]

Bayesian teaching enables probabilistic reasoning in large language models.Nature Communications, 17(1):1238, 2026

Linlu Qiu, Fei Sha, Kelsey Allen, Yoon Kim, Tal Linzen, and Sjoerd van Steenkiste. Bayesian teaching enables probabilistic reasoning in large language models.Nature Communications, 17(1):1238, 2026

2026
[19]

On calibration of modern neural networks

Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q Weinberger. On calibration of modern neural networks. InProceedings of the 34th International Conference on Machine Learning, pages 1321–1330, 2017

2017
[20]

Lake, Ruslan Salakhutdinov, and Joshua B

Brenden M. Lake, Ruslan Salakhutdinov, and Joshua B. Tenenbaum. Human-level concept learning through probabilistic program induction.Science, 350(6266):1332–1338, 2015

2015
[21]

Lew, Noah D

Lionel Wong, Gabriel Grand, Alexander K. Lew, Noah D. Goodman, Vikash K. Mansinghka, Jacob Andreas, and Joshua B. Tenenbaum. From word models to world models: Translating from natural language to the probabilistic language of thought, 2023

2023
[22]

Concepts in a probabilistic language of thought

Noah D Goodman, Joshua B Tenenbaum, and Tobias Gerstenberg. Concepts in a probabilistic language of thought. InIn Proceedings of the 36th Annual Conference of the Cognitive Science Society, 2014

2014
[23]

Hypothesis search: Inductive reasoning with language models

Ruocheng Wang, Eric Zelikman, Gabriel Poesia, Yewen Pu, Nick Haber, and Noah Goodman. Hypothesis search: Inductive reasoning with language models. InThe Twelfth International Conference on Learning Representations, 2024

2024
[24]

Language models as inductive reasoners

Zonglin Yang, Li Dong, Xinya Du, Hao Cheng, Erik Cambria, Xiaodong Liu, Jianfeng Gao, and Furu Wei. Language models as inductive reasoners. In Yvette Graham and Matthew Purver, editors,Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers), pages 209–225. Association for Computat...

2024
[25]

Foundation posteriors for approximate probabilistic inference

Mike Wu and Noah Goodman. Foundation posteriors for approximate probabilistic inference. InAdvances in Neural Information Processing Systems, 2022

2022
[26]

The design and implementation of probabilistic programming languages.http://dippl.org, 2014

Noah D Goodman and Andreas Stuhlmüller. The design and implementation of probabilistic programming languages.http://dippl.org, 2014

2014
[27]

An explanation of in-context learning as implicit Bayesian inference

Sang Michael Xie, Aditi Raghunathan, Percy Liang, and Tengyu Ma. An explanation of in-context learning as implicit Bayesian inference. InInternational Conference on Learning Representations, 2022

2022
[28]

Liyi Zhang, Michael Y . Li, R. Thomas McCoy, Theodore Sumers, Jian-Qiao Zhu, and Thomas L. Griffiths. What should embeddings embed? Autoregressive models represent latent generating distributions.Transactions on Machine Learning Research, 2025. Featured Certification

2025
[29]

Amortizing intractable inference in large language models

Edward J Hu, Moksh Jain, Eric Elmoznino, Younesse Kaddar, Guillaume Lajoie, Yoshua Bengio, and Nikolay Malkin. Amortizing intractable inference in large language models. In International Conference on Learning Representations, 2024

2024
[30]

Accurate predictions on small data with a tabular foundation model.Nature, 637(8045):319–326, 2025

Noah Hollmann, Samuel Müller, Lennart Purucker, Arjun Krishnakumar, Max Körfer, Shi Bin Hoo, Robin Tibor Schirrmeister, and Frank Hutter. Accurate predictions on small data with a tabular foundation model.Nature, 637(8045):319–326, 2025

2025
[31]

Just ask for calibration: Strategies for eliciting calibrated confidence scores from language models fine-tuned with human feedback

Katherine Tian, Eric Mitchell, Allan Zhou, Archit Sharma, Rafael Rafailov, Huaxiu Yao, Chelsea Finn, and Christopher Manning. Just ask for calibration: Strategies for eliciting calibrated confidence scores from language models fine-tuned with human feedback. InEmpirical Methods in Natural Language Processing (EMNLP), 2023

2023
[32]

Wornell, and Soumya Ghosh

Maohao Shen, Subhro Das, Kristjan Greenewald, Prasanna Sattigeri, Gregory W. Wornell, and Soumya Ghosh. Thermometer: Towards universal calibration for large language models. In Ruslan Salakhutdinov, Zico Kolter, Katherine Heller, Adrian Weller, Nuria Oliver, Jonathan Scarlett, and Felix Berkenkamp, editors,Proceedings of the 41st International Conference ...

2024
[33]

Chen, Martin Jankowiak, Fritz Obermeyer, Neeraj Pradhan, Theofanis Karaletsos, Rohit Singh, Paul Szerlip, Paul Horsfall, and Noah D

Eli Bingham, Jonathan P. Chen, Martin Jankowiak, Fritz Obermeyer, Neeraj Pradhan, Theofanis Karaletsos, Rohit Singh, Paul Szerlip, Paul Horsfall, and Noah D. Goodman. Pyro: Deep Universal Probabilistic Programming.Journal of Machine Learning Research, 20:1–6, 2019

2019
[34]

A Study of LLMs' Preferences for Libraries and Programming Languages

Lukas Twist, Jie M Zhang, Mark Harman, Don Syme, Joost Noppen, and Detlef Nauck. LLMs love Python: A study of LLMs’ bias for programming languages and libraries.arXiv preprint arXiv:2503.17181, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[35]

LoRA: Low-rank adaptation of large language models

Edward J Hu, yelong shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models. In International Conference on Learning Representations, 2022

2022
[36]

The llama 3 herd of models, 2024

Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, Amy Yang, Angela Fan, Anirudh Goyal, Anthony Hartshorn, Aobo Yang, Archi Mitra, Archie Sravanku- mar, Artem Korenev, Arthur Hinsvark, Arun Rao, Aston Zhang, Aurelien Rodriguez, Austen Gregerson, Ava...

2024
[37]

Qwen2 technical report, 2024

An Yang, Baosong Yang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Zhou, Chengpeng Li, Chengyuan Li, Dayiheng Liu, Fei Huang, Guanting Dong, Haoran Wei, Huan Lin, Jialong Tang, Jialin Wang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Ma, Jianxin Yang, Jin Xu, Jingren 13 Zhou, Jinze Bai, Jinzheng He, Junyang Lin, Kai Dang, Keming Lu, Keqin Chen, Kexin Yang, ...

2024
[38]

Embers of autoregression: Understanding large language models through the problem they are trained to solve.arXiv preprint arXiv:2309.13638, 2023

R Thomas McCoy, Shunyu Yao, Dan Friedman, Matthew Hardy, and Thomas L Griffiths. Embers of autoregression: Understanding large language models through the problem they are trained to solve.arXiv preprint arXiv:2309.13638, 2023

work page arXiv 2023
[39]

A beat B

Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. InInternational Conference on Learning Representations, 2019. 14 A Technical appendices and supplementary material A.1 Implementational details HyperparametersHere we detail the experimental setup used in our experiments. We use PyTorch with Torchtune to fine-tune LLAMA-3-8B-INSTRUCT...

2019
[40]

out of 100 athletes, where do they rank?

Estimating an athlete’s intrinsic strength as a distribution or percentile rank (e.g., “out of 100 athletes, where do they rank?”)
[41]

Inferring a latent match factor on a 0–100 scale for a specific match (effort / teamwork / fatigue / etc.)
[42]

surprising

Predicting outcomes of 1–2 hypothetical future matches, including *who wins* and *by how much* qualitatively (or as a probability), while acknowledging uncertainty. Important: each query should be answerable by one number. Wrong examples: 1) explicitly asking for verbal justification; 2) asking about two players’ intrinsic strengths within one query. Howe...

[1] [1]

Chi, Quoc V

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed H. Chi, Quoc V . Le, and Denny Zhou. Chain-of-thought prompting elicits reasoning in large language models. InAdvances in Neural Information Processing Systems, Red Hook, NY , USA, 2022

2022

[2] [2]

Evaluating Large Language Models Trained on Code

Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde De Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. Evaluating large language models trained on code.arXiv preprint arXiv:2107.03374, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021

[3] [3]

Bayesian models of cognition

Thomas L Griffiths, Charles Kemp, and Joshua B Tenenbaum. Bayesian models of cognition. InCambridge Handbook of Computational Cognitive Sciences. Cambridge University Press, 2008

2008

[4] [4]

How to grow a mind: statistics, structure, and abstraction.Science, 331(6022):1279–1285, 2011

Joshua B Tenenbaum, Charles Kemp, Thomas L Griffiths, and Noah D Goodman. How to grow a mind: statistics, structure, and abstraction.Science, 331(6022):1279–1285, 2011

2011

[5] [5]

David M. Blei. Build, compute, critique, repeat: Data analysis with latent variable models. In Annual Review of Statistics and Its Application, 2014

2014

[6] [6]

Bishop.Pattern Recognition and Machine Learning (Information Science and Statistics)

Christopher M. Bishop.Pattern Recognition and Machine Learning (Information Science and Statistics). Springer, 2006

2006

[7] [7]

Human-like category learning by injecting ecological priors from large language models into neural networks

Akshay Kumar Jagadish, Julian Coda-Forno, Mirko Thalmann, Eric Schulz, and Marcel Binz. Human-like category learning by injecting ecological priors from large language models into neural networks. InForty-first International Conference on Machine Learning

[8] [8]

Meta-learning ecological priors from large language models explains human learning and decision making

Akshay K Jagadish, Mirko Thalmann, Julian Coda-Forno, Marcel Binz, and Eric Schulz. Meta- learning ecological priors from large language models explains human learning and decision making.arXiv preprint arXiv:2509.00116, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[9] [9]

Mind the confidence gap: Overconfidence, calibration, and distractor effects in large language models.Transactions on Machine Learning Research (TMLR), 2025

Prateek Chhikara. Mind the confidence gap: Overconfidence, calibration, and distractor effects in large language models.Transactions on Machine Learning Research (TMLR), 2025

2025

[10] [10]

Generating computational cognitive models using large language models.Advances in Neural Information Processing Systems, 38:87796–87833, 2025

Milena Rmus, Akshay Kumar Jagadish, Marvin Mathony, Tobias Ludwig, and Eric Schulz. Generating computational cognitive models using large language models.Advances in Neural Information Processing Systems, 38:87796–87833, 2025

2025

[11] [11]

Modeling open-world cognition as on-demand synthesis of probabilistic models

Lionel Wong, Katherine M Collins, Lance Ying, Cedegao E Zhang, Adrian Weller, Tobias Ger- stenberg, Timothy O’Donnell, Alexander K Lew, Jacob D Andreas, Joshua B Tenenbaum, and Tyler Brooke-Wilson. Modeling open-world cognition as on-demand synthesis of probabilistic models. InProceedings of the Annual Meeting of the Cognitive Science Society, volume 47, 2025

2025

[12] [12]

OpenEstimate: Evaluating LLMs on reasoning under uncertainty with real-world data

Alana Marzoev, Jillian Ross, Michael Cafarella, and Jacob Andreas. OpenEstimate: Evaluating LLMs on reasoning under uncertainty with real-world data. InThe Fourteenth International Conference on Learning Representations, 2026

2026

[13] [13]

WinoGrande: an adversarial winograd schema challenge at scale.Communications of the ACM, 64(9):99–106, August 2021

Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. WinoGrande: an adversarial winograd schema challenge at scale.Communications of the ACM, 64(9):99–106, August 2021

2021

[14] [14]

Measuring massive multitask language understanding

Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. InInternational Conference on Learning Representations, 2021

2021

[15] [15]

TruthfulQA: Measuring how models mimic human falsehoods

Stephanie Lin, Jacob Hilton, and Owain Evans. TruthfulQA: Measuring how models mimic human falsehoods. InProceedings of the 60th Annual Meeting of the Association for Compu- tational Linguistics (Volume 1: Long Papers), pages 3214–3252, Dublin, Ireland, May 2022. Association for Computational Linguistics

2022

[16] [16]

Hellaswag: Can a machine really finish your sentence? InProceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 2019

Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. Hellaswag: Can a machine really finish your sentence? InProceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 2019. 11

2019

[17] [17]

Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge

Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have solved question answering? Try ARC, the AI2 reasoning challenge.arXiv preprint arXiv:1803.05457, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018

[18] [18]

Bayesian teaching enables probabilistic reasoning in large language models.Nature Communications, 17(1):1238, 2026

Linlu Qiu, Fei Sha, Kelsey Allen, Yoon Kim, Tal Linzen, and Sjoerd van Steenkiste. Bayesian teaching enables probabilistic reasoning in large language models.Nature Communications, 17(1):1238, 2026

2026

[19] [19]

On calibration of modern neural networks

Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q Weinberger. On calibration of modern neural networks. InProceedings of the 34th International Conference on Machine Learning, pages 1321–1330, 2017

2017

[20] [20]

Lake, Ruslan Salakhutdinov, and Joshua B

Brenden M. Lake, Ruslan Salakhutdinov, and Joshua B. Tenenbaum. Human-level concept learning through probabilistic program induction.Science, 350(6266):1332–1338, 2015

2015

[21] [21]

Lew, Noah D

Lionel Wong, Gabriel Grand, Alexander K. Lew, Noah D. Goodman, Vikash K. Mansinghka, Jacob Andreas, and Joshua B. Tenenbaum. From word models to world models: Translating from natural language to the probabilistic language of thought, 2023

2023

[22] [22]

Concepts in a probabilistic language of thought

Noah D Goodman, Joshua B Tenenbaum, and Tobias Gerstenberg. Concepts in a probabilistic language of thought. InIn Proceedings of the 36th Annual Conference of the Cognitive Science Society, 2014

2014

[23] [23]

Hypothesis search: Inductive reasoning with language models

Ruocheng Wang, Eric Zelikman, Gabriel Poesia, Yewen Pu, Nick Haber, and Noah Goodman. Hypothesis search: Inductive reasoning with language models. InThe Twelfth International Conference on Learning Representations, 2024

2024

[24] [24]

Language models as inductive reasoners

Zonglin Yang, Li Dong, Xinya Du, Hao Cheng, Erik Cambria, Xiaodong Liu, Jianfeng Gao, and Furu Wei. Language models as inductive reasoners. In Yvette Graham and Matthew Purver, editors,Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers), pages 209–225. Association for Computat...

2024

[25] [25]

Foundation posteriors for approximate probabilistic inference

Mike Wu and Noah Goodman. Foundation posteriors for approximate probabilistic inference. InAdvances in Neural Information Processing Systems, 2022

2022

[26] [26]

The design and implementation of probabilistic programming languages.http://dippl.org, 2014

Noah D Goodman and Andreas Stuhlmüller. The design and implementation of probabilistic programming languages.http://dippl.org, 2014

2014

[27] [27]

An explanation of in-context learning as implicit Bayesian inference

Sang Michael Xie, Aditi Raghunathan, Percy Liang, and Tengyu Ma. An explanation of in-context learning as implicit Bayesian inference. InInternational Conference on Learning Representations, 2022

2022

[28] [28]

Liyi Zhang, Michael Y . Li, R. Thomas McCoy, Theodore Sumers, Jian-Qiao Zhu, and Thomas L. Griffiths. What should embeddings embed? Autoregressive models represent latent generating distributions.Transactions on Machine Learning Research, 2025. Featured Certification

2025

[29] [29]

Amortizing intractable inference in large language models

Edward J Hu, Moksh Jain, Eric Elmoznino, Younesse Kaddar, Guillaume Lajoie, Yoshua Bengio, and Nikolay Malkin. Amortizing intractable inference in large language models. In International Conference on Learning Representations, 2024

2024

[30] [30]

Accurate predictions on small data with a tabular foundation model.Nature, 637(8045):319–326, 2025

Noah Hollmann, Samuel Müller, Lennart Purucker, Arjun Krishnakumar, Max Körfer, Shi Bin Hoo, Robin Tibor Schirrmeister, and Frank Hutter. Accurate predictions on small data with a tabular foundation model.Nature, 637(8045):319–326, 2025

2025

[31] [31]

Just ask for calibration: Strategies for eliciting calibrated confidence scores from language models fine-tuned with human feedback

Katherine Tian, Eric Mitchell, Allan Zhou, Archit Sharma, Rafael Rafailov, Huaxiu Yao, Chelsea Finn, and Christopher Manning. Just ask for calibration: Strategies for eliciting calibrated confidence scores from language models fine-tuned with human feedback. InEmpirical Methods in Natural Language Processing (EMNLP), 2023

2023

[32] [32]

Wornell, and Soumya Ghosh

Maohao Shen, Subhro Das, Kristjan Greenewald, Prasanna Sattigeri, Gregory W. Wornell, and Soumya Ghosh. Thermometer: Towards universal calibration for large language models. In Ruslan Salakhutdinov, Zico Kolter, Katherine Heller, Adrian Weller, Nuria Oliver, Jonathan Scarlett, and Felix Berkenkamp, editors,Proceedings of the 41st International Conference ...

2024

[33] [33]

Chen, Martin Jankowiak, Fritz Obermeyer, Neeraj Pradhan, Theofanis Karaletsos, Rohit Singh, Paul Szerlip, Paul Horsfall, and Noah D

Eli Bingham, Jonathan P. Chen, Martin Jankowiak, Fritz Obermeyer, Neeraj Pradhan, Theofanis Karaletsos, Rohit Singh, Paul Szerlip, Paul Horsfall, and Noah D. Goodman. Pyro: Deep Universal Probabilistic Programming.Journal of Machine Learning Research, 20:1–6, 2019

2019

[34] [34]

A Study of LLMs' Preferences for Libraries and Programming Languages

Lukas Twist, Jie M Zhang, Mark Harman, Don Syme, Joost Noppen, and Detlef Nauck. LLMs love Python: A study of LLMs’ bias for programming languages and libraries.arXiv preprint arXiv:2503.17181, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[35] [35]

LoRA: Low-rank adaptation of large language models

Edward J Hu, yelong shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models. In International Conference on Learning Representations, 2022

2022

[36] [36]

The llama 3 herd of models, 2024

Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, Amy Yang, Angela Fan, Anirudh Goyal, Anthony Hartshorn, Aobo Yang, Archi Mitra, Archie Sravanku- mar, Artem Korenev, Arthur Hinsvark, Arun Rao, Aston Zhang, Aurelien Rodriguez, Austen Gregerson, Ava...

2024

[37] [37]

Qwen2 technical report, 2024

An Yang, Baosong Yang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Zhou, Chengpeng Li, Chengyuan Li, Dayiheng Liu, Fei Huang, Guanting Dong, Haoran Wei, Huan Lin, Jialong Tang, Jialin Wang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Ma, Jianxin Yang, Jin Xu, Jingren 13 Zhou, Jinze Bai, Jinzheng He, Junyang Lin, Kai Dang, Keming Lu, Keqin Chen, Kexin Yang, ...

2024

[38] [38]

Embers of autoregression: Understanding large language models through the problem they are trained to solve.arXiv preprint arXiv:2309.13638, 2023

R Thomas McCoy, Shunyu Yao, Dan Friedman, Matthew Hardy, and Thomas L Griffiths. Embers of autoregression: Understanding large language models through the problem they are trained to solve.arXiv preprint arXiv:2309.13638, 2023

work page arXiv 2023

[39] [39]

A beat B

Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. InInternational Conference on Learning Representations, 2019. 14 A Technical appendices and supplementary material A.1 Implementational details HyperparametersHere we detail the experimental setup used in our experiments. We use PyTorch with Torchtune to fine-tune LLAMA-3-8B-INSTRUCT...

2019

[40] [40]

out of 100 athletes, where do they rank?

Estimating an athlete’s intrinsic strength as a distribution or percentile rank (e.g., “out of 100 athletes, where do they rank?”)

[41] [41]

Inferring a latent match factor on a 0–100 scale for a specific match (effort / teamwork / fatigue / etc.)

[42] [42]

surprising

Predicting outcomes of 1–2 hypothetical future matches, including *who wins* and *by how much* qualitatively (or as a probability), while acknowledging uncertainty. Important: each query should be answerable by one number. Wrong examples: 1) explicitly asking for verbal justification; 2) asking about two players’ intrinsic strengths within one query. Howe...