pith. sign in

arxiv: 2606.07515 · v2 · pith:AR6UHUZYnew · submitted 2026-06-05 · 💻 cs.CL · cs.AI· cs.HC· math.PR

How reliable are LLMs when it comes to playing dice?

Pith reviewed 2026-06-27 21:54 UTC · model grok-4.3

classification 💻 cs.CL cs.AIcs.HCmath.PR
keywords large language modelsprobabilistic reasoningdice problemscounterintuitive problemstoken biasmisleading promptschain-of-thought promptingbenchmark evaluation
0
0 comments X

The pith

Large language models achieve 96 percent accuracy on standard dice probability problems but only 59 percent on counterintuitive ones and drop further with disguised wording or misleading prompts.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper evaluates eight current LLMs on two datasets of discrete probability problems involving dice rolls. Models perform near ceiling on ordinary exercises but fall to roughly 59 percent accuracy when the problems are framed to elicit common heuristic errors. Accuracy declines more than 20 percent when standard problem statements are rewritten in disguised forms, and it falls by as much as 34 percent when the prompt contains embedded misleading suggestions. Chain-of-thought prompting does not eliminate these drops. The pattern leads the authors to conclude that existing models are not yet genuine probabilistic reasoners even though they solve many advanced mathematical tasks.

Core claim

Current LLMs are not genuine probabilistic reasoners. On a benchmark of standard discrete probability exercises they reach an average accuracy of 0.96, yet the same models reach only 0.59 on counterintuitive exercises designed to trigger heuristic mistakes. Replacing canonical statements with disguised variants reduces performance by more than 20 percent, and the insertion of misleading suggestions into the prompt produces drops of up to 34 percent, with every tested model affected.

What carries the argument

A controlled benchmarking study that contrasts performance on standard versus counterintuitive discrete probability problems, augmented by tests that measure token bias through disguised reformulations and robustness through the addition of misleading prompt suggestions, each run both with and without chain-of-thought prompting.

If this is right

  • Success on advanced mathematical problems does not guarantee reliable handling of probabilistic uncertainty.
  • Models appear to rely on surface-level patterns rather than internal probabilistic models.
  • No current architecture tested is immune to the effects of disguised wording or misleading cues.
  • Real-world applications that require probabilistic judgment will need external verification or additional safeguards.
  • Evaluation protocols for LLMs should routinely include counterintuitive and adversarially worded items.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same surface-pattern reliance could affect performance in other domains that require reasoning under uncertainty, such as risk assessment or causal inference.
  • Targeted fine-tuning on counterintuitive probability examples might raise accuracy, but whether the improvement would generalize remains open.
  • If scaling laws alone do not close the gap, architectural changes that embed explicit probabilistic modules may become necessary.
  • The benchmark could be extended to continuous distributions or to multi-step probabilistic planning tasks to test the breadth of the limitation.

Load-bearing premise

That lower performance on the chosen counterintuitive exercises and on disguised or misleading prompts demonstrates absence of genuine probabilistic reasoning rather than limitations specific to the selected test items or prompting formats.

What would settle it

An LLM that maintains accuracy above 90 percent on both the standard and counterintuitive dice problems, shows no measurable drop when canonical statements are replaced by disguised variants, and remains unaffected by the insertion of misleading suggestions would falsify the central claim.

Figures

Figures reproduced from arXiv: 2606.07515 by Bernardo Busoni, Gianmarco Bet, Luca Avena.

Figure 1
Figure 1. Figure 1: Performance over Bias Dataset In the second case, the differences between models with and without CoT are modest, whereas in the counterintuitive dataset these differences are significantly more pronounced. This phenomenon can be interpreted in two ways, which are likely to be complementary. On the one hand, the greater complexity of the counterintuitive tasks places the models 3 [PITH_FULL_IMAGE:figures/… view at source ↗
Figure 2
Figure 2. Figure 2: Performance over Standard Dataset under greater strain, making the differences in capability between the various configurations more apparent. On the other hand, models with CoT, being able to articulate an explicit reasoning process before providing the answer, appear to show greater robustness against the cognitive biases that these exercises are designed to trigger, thanks to a more structured form of i… view at source ↗
Figure 3
Figure 3. Figure 3: Comparison between Open Models 4 [PITH_FULL_IMAGE:figures/full_fig_p004_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Comparison between Closed Models 2.2 Evidence of token bias: classic vs modified problems A further experiment compared the models’ performance on well-known classical paradoxes with modified versions that preserved the same probabilistic structure whilst masking their surface form. With the exception of the famous Simpson’s paradox – in which this pattern is nevertheless evident – all models perform very … view at source ↗
Figure 5
Figure 5. Figure 5: Classic vs Modified Comparison An important caveat is that masking was not always fully effective. In some cases, stronger models explicitly recognised the original source problem despite reformulation, indicating partial retrieval from training memory. Therefore, these measures should be interpreted as relative indicators rather than exact estimates of pure reasoning ability. Interestingly, token bias can… view at source ↗
Figure 6
Figure 6. Figure 6: Relative sycophancy decay 6 [PITH_FULL_IMAGE:figures/full_fig_p006_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Absolute sycophancy decay 3 Methods The methodological approach adopted in this study is informed by the academic background of the authors. In particular, the expertise of the authors in probability theory and mathematical modelling, together with experience in teaching these topics, has shaped both the construction of the problem sets and the criteria used for their analysis, with a focus on formal rigou… view at source ↗
read the original abstract

We investigate the probabilistic reasoning capabilities of large language models through a controlled benchmarking study on discrete probability problems. We constructed two datasets, respectively a set of standard exercises and a set of counterintuitive exercises, designed to trigger heuristic reasoning, and evaluated 8 state-of-the-art models, each tested with and without Chain-of-Thought prompting. Models achieve an average accuracy of 0.96 on standard problems but only 0.59 on counterintuitive ones. We further provide empirical evidence of token bias: performance drops by over 20% when canonical formulations are replaced by disguised variants. Embedding misleading suggestions in the prompt reduces performance by up to 34%, with no model proving immune. Taken together, the reported findings suggest that current LLMs are not yet genuine probabilistic reasoners, despite their success in advanced mathematical problems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper reports a benchmarking study evaluating 8 state-of-the-art LLMs on two datasets of discrete probability problems (standard exercises and counterintuitive exercises designed to trigger heuristics), with and without Chain-of-Thought prompting. It finds average accuracy of 0.96 on standard problems versus 0.59 on counterintuitive ones, with further drops of over 20% on disguised variants and up to 34% when misleading suggestions are embedded, and concludes that current LLMs are not yet genuine probabilistic reasoners.

Significance. If the performance gaps prove robust under replication, the work would usefully document prompt sensitivity and heuristic vulnerabilities in LLMs on probabilistic tasks, complementing existing results on advanced math performance. The empirical focus on token bias and misleading prompts offers concrete, falsifiable observations that could inform prompting strategies and evaluation protocols in the field.

major comments (3)
  1. [Abstract] Abstract: the reported aggregate accuracies (0.96 and 0.59) and percentage drops (>20%, up to 34%) are supplied without sample sizes, number of items per dataset, number of trials per model, or any statistical tests, so the central performance claims cannot be evaluated for reliability or effect size.
  2. [Abstract] Abstract / Methods (implied): exact prompt templates, model versions, temperature settings, and the precise construction criteria for the counterintuitive and disguised items are not described, preventing assessment of whether the observed drops isolate probabilistic competence or reflect surface-form or elicitation artifacts.
  3. [Abstract] Abstract / Discussion: the leap from task-specific accuracy drops on the authors' chosen counterintuitive and misleading items to the claim that LLMs are 'not yet genuine probabilistic reasoners' is load-bearing; the manuscript does not provide controls showing these items are representative of probabilistic reasoning in general or that equivalent failures occur under alternative elicitation methods.
minor comments (2)
  1. The manuscript would benefit from a table listing per-model accuracies, standard deviations, and exact dataset sizes to allow readers to inspect variance across the 8 models.
  2. Notation for accuracy and drop percentages should be defined consistently (e.g., whether 0.59 is mean across models or pooled).

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the careful reading and specific suggestions. We respond to each major comment below and note planned changes to the manuscript.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the reported aggregate accuracies (0.96 and 0.59) and percentage drops (>20%, up to 34%) are supplied without sample sizes, number of items per dataset, number of trials per model, or any statistical tests, so the central performance claims cannot be evaluated for reliability or effect size.

    Authors: We agree that the abstract should be self-contained on these points. The manuscript uses two datasets of 50 items each, evaluates each model over three independent trials per item, and reports paired statistical tests confirming the differences (p < 0.01). In revision we will add the dataset sizes, trial count, and a note on statistical testing directly to the abstract. revision: yes

  2. Referee: [Abstract] Abstract / Methods (implied): exact prompt templates, model versions, temperature settings, and the precise construction criteria for the counterintuitive and disguised items are not described, preventing assessment of whether the observed drops isolate probabilistic competence or reflect surface-form or elicitation artifacts.

    Authors: The full methods section and appendix already list the eight model versions, set temperature to 0.0, provide the complete prompt templates, and define construction criteria (counterintuitive items drawn from documented probability heuristics; disguised variants obtained by systematic lexical substitution while preserving logical structure). We will insert a concise methods paragraph and explicit appendix reference into the abstract to make these details immediately accessible. revision: yes

  3. Referee: [Abstract] Abstract / Discussion: the leap from task-specific accuracy drops on the authors' chosen counterintuitive and misleading items to the claim that LLMs are 'not yet genuine probabilistic reasoners' is load-bearing; the manuscript does not provide controls showing these items are representative of probabilistic reasoning in general or that equivalent failures occur under alternative elicitation methods.

    Authors: The items were deliberately sampled from the set of problems known in cognitive psychology to elicit heuristic errors; the observed drops are therefore diagnostic of heuristic reliance rather than incidental surface effects. The manuscript already includes within-prompt controls (standard vs. disguised vs. misleading-suggestion conditions) and tests both zero-shot and chain-of-thought elicitation. While we do not claim the items exhaust every possible probabilistic task, the pattern across models and conditions supports the qualified conclusion. We will expand the discussion section to articulate this rationale more explicitly but will not alter the central claim. revision: partial

Circularity Check

0 steps flagged

No circularity: direct empirical measurements with no derivations or self-referential definitions

full rationale

This is a pure benchmarking study that constructs two datasets of probability problems, runs eight LLMs on them under controlled prompting conditions, and reports observed accuracies (0.96 on standard items, 0.59 on counterintuitive items, plus drops under disguise and misleading prompts). No equations, fitted parameters, predictions derived from inputs, uniqueness theorems, or ansatzes appear anywhere in the text. The central claim is an interpretive summary of the measured performance gaps rather than a mathematical reduction; the reported numbers are obtained by direct evaluation and do not loop back to define themselves. Self-citations are absent from the load-bearing steps. The study is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the premise that the chosen counterintuitive exercises and prompt manipulations isolate genuine probabilistic reasoning deficits; this is a domain assumption rather than a derived quantity.

axioms (1)
  • domain assumption The counterintuitive exercises are designed to trigger heuristic reasoning
    Stated directly in the abstract as the purpose of the second dataset.

pith-pipeline@v0.9.1-grok · 5671 in / 1208 out tokens · 23355 ms · 2026-06-27T21:54:51.271951+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Counterintuitive problems in discrete probability

    math.PR 2026-06 unverdicted novelty 2.0

    A curated dataset of counterintuitive discrete probability problems with human solutions, built to benchmark LLM reasoning on bias-prone tasks.

Reference graph

Works this paper leans on

23 extracted references · 4 canonical work pages · cited by 1 Pith paper

  1. [1]

    Counterintuitive problems in discrete probability

    Luca Avena, Gianmarco Bet and Bernardo Busoni. Counterintuitive problems in discrete probability. 2026. arXiv: 2606.07516 [math.PR]. URL: https://arxiv.org/abs/2606.07516

  2. [2]

    ‘Beyond Benchmarks: MathArena as an Evaluation Platform for Mathematics with LLMs’

    Jasper Dekoninck, Nikola Jovanovi´c, Tim Gehrunger, Kári Rögnvaldsson, Ivo Petrov, Chenhao Sun and Martin Vechev. ‘Beyond Benchmarks: MathArena as an Evaluation Platform for Mathematics with LLMs’. In: (2026). arXiv: 2605.00674 [cs.CL]. URL: https://arxiv.org/abs/2605.00674

  3. [3]

    MathArena

    Jasper Dekoninck, Kári Rögnvaldsson, Chenhao Sun, Ivo Petrov and Martin Vechev.Proof, Not Bluff: LLMs Reach 95% on the 2026 USA Math Olympiad. MathArena. Mar. 2026. URL: https://matharena.ai/usamo/ (visited on 26/05/2026)

  4. [4]

    A Ramsey-style Problem on Hypergraphs

    Epoch AI. A Ramsey-style Problem on Hypergraphs. 2026. URL: https://epoch.ai/frontiermath/open- problems/ramsey-hypergraphs/

  5. [5]

    Gallegos, Ryan A

    Isabel O. Gallegos, Ryan A. Rossi, Joe Barrow, Md Mehrab Tanjim, Sungchul Kim, Franck Dernoncourt, Tong Yu, Ruiyi Zhang and Nesreen K. Ahmed. ‘Bias and Fairness in Large Language Models: A Survey’. In: Computational Linguistics 50.3 (Sept. 2024), pp. 1097–1179. DOI: 10 . 1162 / coli _ a _ 00524. URL: https://direct.mit.edu/coli/article/50/3/1097/121961/Bi...

  6. [6]

    Aha! Gotcha: Paradoxes to Puzzle and Delight

    Martin Gardner. Aha! Gotcha: Paradoxes to Puzzle and Delight. W. H. Freeman, 1982, p. 164. ISBN : 978-0- 7167-1361-6

  7. [7]

    Time Travel and Other Mathematical Bewilderments

    Martin Gardner. Time Travel and Other Mathematical Bewilderments. New York: W. H. Freeman, 1988, p. 295. ISBN : 978-0-7167-1925-0

  8. [8]

    Bogdan Georgiev, Javier Gómez-Serrano, Terence Tao and Adam Zsolt Wagner.Mathematical exploration and discovery at scale. 2025. arXiv: 2511.02864 [cs.NE]. URL: https://arxiv.org/abs/2511.02864. 9 A PREPRINT - 12 TH JUNE 2026

  9. [9]

    Elliot Glazer, Ege Erdil, Tamay Besiroglu, Diego Chicharro, Evan Chen, Alex Gunning, Caroline Falkman Olsson, Jean-Stanislas Denain, Anson Ho, Emily de Oliveira Santos, Olli Järviniemi, Matthew Barnett, Robert Sandler, Matej Vrzala, Jaime Sevilla, Qiuyu Ren, Elizabeth Pratt, Lionel Levine, Grant Barkley, Natalie Stewart, Bogdan Grechuk, Tetiana Grechuk, S...

  10. [10]

    Advanced version of Gemini with Deep Think officially achieves gold-medal standard at the International Mathematical Olympiad

    Google DeepMind. Advanced version of Gemini with Deep Think officially achieves gold-medal standard at the International Mathematical Olympiad . https://deepmind.google/blog/advanced- version- of - gemini - with - deep - think - officially - achieves - gold - medal - standard - at - the - international-mathematical-olympiad/. 2025

  11. [11]

    ‘Human-like intuitive behavior and reasoning biases emerged in large language models but disappeared in ChatGPT’

    Thilo Hagendorff, Sarah Fabi and Michal Kosinski. ‘Human-like intuitive behavior and reasoning biases emerged in large language models but disappeared in ChatGPT’. In:Nature Computational Science 3.10 (2023), pp. 833–

  12. [12]

    URL: https://doi.org/10.1038/s43588-023-00527-x

    DOI: 10.1038/s43588-023-00527-x . URL: https://doi.org/10.1038/s43588-023-00527-x

  13. [13]

    Su, Camillo J

    Bowen Jiang, Yangxinyu Xie, Zhuoqun Hao, Xiaomeng Wang, Tanwi Mallick, Weijie J. Su, Camillo J. Taylor and Dan Roth. A Peek into Token Bias: Large Language Models Are Not Yet Genuine Reasoners. 2024. arXiv: 2406.11050 [cs.CL]. URL: https://arxiv.org/abs/2406.11050

  14. [14]

    Claude’s Cycles

    Donald Knuth. Claude’s Cycles. 2026. URL: https://www-cs-faculty.stanford.edu/~knuth/papers/ claude-cycles.pdf

  15. [15]

    Various probability puzzles posted on Daniel Litt’s X profile @littmath

    Daniel Litt. Various probability puzzles posted on Daniel Litt’s X profile @littmath . 2024. URL: https : //x.com/littmath

  16. [16]

    ‘(Ir)rationality and cognitive biases in large language models’

    Olivia Macmillan-Scott and Mirco Musolesi. ‘(Ir)rationality and cognitive biases in large language models’. In: Royal Society Open Science 11.6 (June 2024), p. 240255. ISSN : 2054-5703. DOI: 10.1098/rsos.240255. eprint: https://royalsocietypublishing.org/rsos/article- pdf/doi/10.1098/rsos.240255/ 510543/rsos.240255.pdf. URL: https://doi.org/10.1098/rsos.240255

  17. [17]

    Planar Point Sets with Many Unit Distances

    OpenAI. Planar Point Sets with Many Unit Distances. https://cdn.openai.com/pdf/74c24085-19b0- 4534-9c90-465b8e29ad73/unit-distance-proof.pdf . Preprint. May 2026. (Visited on 26/05/2026)

  18. [18]

    Teoria della probabilità: variabili aleatorie e distribuzioni

    Andrea Pascucci. Teoria della probabilità: variabili aleatorie e distribuzioni. Springer, 2020. ISBN : 978-88-470- 3999-5

  19. [19]

    BrokenMath: A Benchmark for Sycophancy in Theorem Proving with LLMs

    Ivo Petrov, Jasper Dekoninck and Martin Vechev. BrokenMath: A Benchmark for Sycophancy in Theorem Proving with LLMs. 2025. arXiv: 2510.04721 [cs.AI]. URL: https://arxiv.org/abs/2510.04721

  20. [20]

    ‘A Problem in Probability’

    Steve Selvin. ‘A Problem in Probability’. In:The American Statistician 29.1 (Feb. 1975). Letter to the editor, p. 67. DOI: 10.1080/00031305.1975.10479121 . URL: https://www.tandfonline.com/doi/abs/10. 1080/00031305.1975.10479121

  21. [21]

    Bowman, New- ton Cheng, Esin Durmus, Zac Hatfield-Dodds, Scott R

    Mrinank Sharma, Meg Tong, Tomasz Korbak, David Duvenaud, Amanda Askell, Samuel R. Bowman, New- ton Cheng, Esin Durmus, Zac Hatfield-Dodds, Scott R. Johnston, Shauna Kravec, Timothy Maxwell, Sam McCandlish, Kamal Ndousse, Oliver Rausch, Nicholas Schiefer, Da Yan, Miranda Zhang and Ethan Perez. Towards Understanding Sycophancy in Language Models. 2025. arXi...

  22. [22]

    Slater, Ali Ziaee and Morgan Nguyen

    Gaurav Suri, Lily R. Slater, Ali Ziaee and Morgan Nguyen. Do Large Language Models Show Decision Heuristics Similar to Humans? A Case Study Using GPT-3.5 . 2023. arXiv: 2305 . 04400 [cs.AI]. URL: https://arxiv.org/abs/2305.04400

  23. [23]

    ‘Judgment under Uncertainty: Heuristics and Biases’

    Amos Tversky and Daniel Kahneman. ‘Judgment under Uncertainty: Heuristics and Biases’. In:Science 185.4157 (Sept. 1974), pp. 1124–1131. DOI: 10.1126/science.185.4157.1124. URL: https://www.science.org/ doi/10.1126/science.185.4157.1124. 10