Counterintuitive problems in discrete probability

Bernardo Busoni; Gianmarco Bet; Luca Avena

arxiv: 2606.07516 · v2 · pith:RUOQNAF4new · submitted 2026-06-05 · 🧮 math.PR

Pith reviewed 2026-06-27 21:01 UTC · model grok-4.3

classification 🧮 math.PR

keywords discrete probability · counterintuitive problems · cognitive biases · large language models · probabilistic paradoxes · reasoning evaluation · dataset

The pith

A collection of counterintuitive discrete probability problems with human solutions is released as a public dataset.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper assembles a dataset of discrete probability problems chosen because they reliably produce answers that feel right but are wrong. Some problems come from well-known paradoxes and bias studies, others from recreational sources, and a few were created for this work. Each problem includes a detailed solution written by humans. The explicit goal is to supply a transparent reference set that can be used to test whether large language models repeat the same kinds of errors that humans make under heuristic reasoning.

Core claim

We have gathered and solved a set of discrete probability problems that are constructed to expose the gap between intuitive answers and correct probability calculations, making the full list and the accompanying solutions available for direct use in experiments on reasoning.

What carries the argument

The dataset itself, consisting of selected problems that each target a specific heuristic error in probability reasoning.

If this is right

The problems supply a ready-made benchmark for measuring how often language models produce the same incorrect answers that humans reach through heuristics.
Researchers can now run controlled comparisons between human performance and model performance on the same fixed list of items.
The public release allows other groups to extend the collection or to add new variants while keeping the original solutions as a fixed reference.
The problems can be used directly in teaching or in experiments that study the persistence of specific probability misconceptions.
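The benchmark use described above can be sketched as a simple tally that separates correct answers, the specific intuitive-but-wrong answer, and any other error. This is a minimal illustration under stated assumptions: the item records, the field names `correct` and `intuitive`, and the model answers are hypothetical placeholders and are not taken from the dataset. The three reference values are the standard answers for the usual readings of each puzzle.

```python
from collections import Counter
from fractions import Fraction


def classify(model_answer, correct, intuitive):
    """Label one answer as 'correct', 'intuitive_error', or 'other_error'."""
    if model_answer == correct:
        return "correct"
    if model_answer == intuitive:
        return "intuitive_error"
    return "other_error"


def tally(items, answers):
    """Return the share of each label; `answers` maps item id -> model answer."""
    counts = Counter(
        classify(answers[item["id"]], item["correct"], item["intuitive"])
        for item in items
    )
    total = sum(counts.values())
    return {label: Fraction(n, total) for label, n in counts.items()}


# Hypothetical items (not copied from the dataset).
items = [
    {"id": "monty_hall_switch", "correct": Fraction(2, 3), "intuitive": Fraction(1, 2)},
    {"id": "two_children_boy", "correct": Fraction(1, 3), "intuitive": Fraction(1, 2)},
    {"id": "tuesday_boy", "correct": Fraction(13, 27), "intuitive": Fraction(1, 2)},
]

# Hypothetical model answers: one correct, one intuitive error, one other error.
answers = {
    "monty_hall_switch": Fraction(2, 3),
    "two_children_boy": Fraction(1, 2),
    "tuesday_boy": Fraction(1, 3),
}

rates = tally(items, answers)
```

Storing answers as exact fractions avoids floating-point comparisons; a real evaluation would also need a step that parses free-text model output into a number, which is omitted here.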

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same selection criteria could be applied to continuous probability or to problems involving conditional independence to test whether the same pattern of model errors appears.
Systematic logging of which problems cause the largest divergence between model and human answers might reveal clusters of related biases that current training data do not correct.
Because the solutions are written out in full, the collection could serve as training material for supervised fine-tuning aimed at reducing specific probability errors.

Load-bearing premise

The selected problems do trigger the intended errors and the written solutions contain no hidden mistakes.

What would settle it

Discovery of a mathematically incorrect solution in the provided answers would show that the reference set cannot be trusted as a benchmark.
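One way to look for such an error is to recompute a reference answer by exact enumeration. The sketch below does this for the Monty Hall problem, chosen because the reference list includes Selvin's 1975 letter; whether the dataset states the problem with exactly these rules is an assumption. The rules assumed here: the car is placed uniformly at random, the first pick is uniform, and the host always opens a goat door other than the pick, choosing uniformly when two such doors exist.

```python
from fractions import Fraction
from itertools import product


def monty_hall_win_probabilities():
    """Exact win probabilities for the 'stay' and 'switch' strategies."""
    doors = (0, 1, 2)
    stay = Fraction(0)
    switch = Fraction(0)
    # Car position and first pick are independent and uniform: 9 equally likely pairs.
    for car, pick in product(doors, repeat=2):
        base = Fraction(1, 9)
        # Doors the host may open: not the pick and not the car.
        openable = [d for d in doors if d != pick and d != car]
        for opened in openable:
            weight = base / len(openable)
            # The single door left after removing the pick and the opened door.
            other = next(d for d in doors if d != pick and d != opened)
            if pick == car:
                stay += weight
            if other == car:
                switch += weight
    return stay, switch


stay, switch = monty_hall_win_probabilities()
print(stay, switch)  # 1/3 2/3
```

Changing the host's behaviour (for example, a host who opens a door at random and may reveal the car) changes the answer, so a check of this kind is only as good as its match to the problem's exact wording.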

Original abstract

This manuscript contains a collection of counterintuitive problems in discrete probability, together with detailed solutions. The dataset was constructed as part of a broader research project investigating the capabilities of the latest-generation Large Language Models (LLMs) in solving discrete probability problems, in order to assess whether LLMs tend to make systematic reasoning errors associated with known cognitive biases. The problems collected here are specifically designed to challenge heuristic reasoning strategies that often lead to intuitively appealing but mathematically incorrect conclusions. The dataset combines several types of problems. Some are adapted from classical probabilistic paradoxes and cognitive-bias literature, while others originate from recreational mathematics sources or were developed by ourselves following similar principles. The primary purpose of this document is to provide a transparent and publicly accessible reference for the problems used in our experimental evaluation of language models, as well as providing detailed human-made solutions. At the same time, we believe that this collection may also prove useful for future research on probabilistic reasoning, cognitive biases, and the evaluation of reasoning capabilities in artificial intelligence systems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

This is a curated list of existing counterintuitive probability problems with solutions, assembled as a benchmark for LLMs rather than a source of new math.

This manuscript is a collection of counterintuitive problems in discrete probability along with their solutions. The authors created it as part of work on how large language models perform on these kinds of tasks, specifically to check for systematic errors that match known human cognitive biases.

The paper does a solid job of documenting the problems and providing the solutions in one place. It draws from classical paradoxes, cognitive bias research, recreational mathematics, and some problems developed by the authors. Having the full set with explanations available publicly is helpful for anyone who wants to run similar evaluations or build on this test suite.

Where it falls short is in originality. The problems are explicitly adaptations of existing material, and the paper introduces no new theorems, derivations, or empirical findings about probability itself. Its purpose is to serve as a reference dataset rather than to advance understanding of the underlying mathematics or biases. The abstract makes this positioning clear, so readers should not expect novel research results.

A small practical issue is the lack of any reported validation for the solutions. They are described as human-made, but without details on how they were checked or cross-verified, there is some risk that an error could slip through, especially in the self-developed problems. That said, for many of the classical cases the solutions are standard.

This work is aimed at people studying reasoning in artificial intelligence and those interested in using probability problems to probe model capabilities. It could be a convenient resource for that community.

I would not send this for peer review. It lacks the kind of scientific claim or new contribution that typically justifies the effort of referees. It might be appropriate as a supplementary document or in a venue focused on datasets and benchmarks, but as a standalone paper it is too thin.
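The validation gap raised in the letter could be narrowed by cross-checking each written solution against an independent simulation. The following is a hedged sketch for the "Tuesday Boy" puzzle listed among the references. It assumes the standard reading: each child is independently a boy with probability 1/2 and born on a uniformly random weekday, and we condition on at least one child being a boy born on a Tuesday. The function name, trial count, seed, and tolerance are illustrative choices, not part of the paper.

```python
import random


def estimate_tuesday_boy(trials=200_000, seed=0):
    """Monte Carlo estimate of P(both boys | at least one boy born on a Tuesday)."""
    rng = random.Random(seed)
    tuesday = 1  # arbitrary index for Tuesday among days 0..6
    kept = 0
    both_boys = 0
    for _ in range(trials):
        # Each child is a pair (is_boy, weekday), sampled independently.
        children = [(rng.random() < 0.5, rng.randrange(7)) for _ in range(2)]
        # Condition by rejection: keep only families matching the stated information.
        if any(is_boy and day == tuesday for is_boy, day in children):
            kept += 1
            if all(is_boy for is_boy, _ in children):
                both_boys += 1
    return both_boys / kept


estimate = estimate_tuesday_boy()
reference = 13 / 27  # standard answer under the assumptions stated above
agrees = abs(estimate - reference) < 0.02
```

A simulation cannot prove a solution correct, but a clear disagreement between the estimate and the written answer flags either an error in the solution or an ambiguity in how the problem is stated.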

Referee Report

0 major / 1 minor

Summary. The manuscript presents a curated collection of counterintuitive discrete probability problems accompanied by detailed human-made solutions. It is positioned as a transparent reference dataset for evaluating large language models on tasks designed to expose heuristic reasoning errors and cognitive biases, drawing from classical paradoxes, cognitive-bias literature, recreational mathematics, and original constructions.

Significance. If the supplied solutions are accurate, the collection provides a reusable benchmark resource for research on probabilistic reasoning in AI systems and for studies of cognitive biases. The explicit transparency in documenting the problem set and human solutions supports reproducibility in LLM evaluation experiments.

minor comments (1)

[Abstract] The description of the dataset construction mentions adaptation from multiple sources but does not state the total number of problems or their distribution across categories, which would help readers gauge the collection's scope and balance.

Simulated Authors' Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive summary of the manuscript, recognition of its potential utility as a benchmark resource, and recommendation for minor revision. The report contains no major comments to address.

Circularity Check

0 steps flagged

No circularity; descriptive problem collection with no derivations

The manuscript presents a curated list of counterintuitive discrete probability problems together with human solutions. It asserts no mathematical identities, scaling relations, predictions, or fitted parameters whose validity depends on unverified self-referential steps. No equations, uniqueness theorems, or ansatzes are introduced; the text is explicitly positioned as a transparent reference dataset for LLM evaluation rather than a vehicle for novel derivations. Self-citations are absent from the provided content, and the construction description (adapting classical paradoxes or developing new problems) is purely declarative with no load-bearing reduction to prior outputs. This is the normal case of a self-contained reference document.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No mathematical model or derivation is presented; the work is a curated list of existing problems.

pith-pipeline@v0.9.1-grok · 5696 in / 909 out tokens · 15770 ms · 2026-06-27T21:01:00.975738+00:00 · methodology

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

How reliable are LLMs when it comes to playing dice?
cs.CL 2026-06 unverdicted novelty 5.0

LLMs score 0.96 on standard probability exercises but 0.59 on counterintuitive ones and drop further with biased wording or misleading cues, indicating they are not genuine probabilistic reasoners.

Reference graph

Works this paper leans on

12 extracted references · 4 canonical work pages · cited by 1 Pith paper

[1]

How reliable are LLMs when it comes to playing dice? 2026

Luca Avena, Gianmarco Bet and Bernardo Busoni. How reliable are LLMs when it comes to playing dice? 2026. arXiv: 2606.07515 [cs.CL]. URL: https://arxiv.org/abs/2606.07515

Pith/arXiv arXiv 2026
[2]

Aha! Gotcha: Paradoxes to Puzzle and Delight

Martin Gardner. Aha! Gotcha: Paradoxes to Puzzle and Delight. W. H. Freeman, 1982, p. 164. ISBN: 978-0-7167-1361-6

1982
[3]

Time Travel and Other Mathematical Bewilderments

Martin Gardner. Time Travel and Other Mathematical Bewilderments. New York: W. H. Freeman, 1988, p. 295. ISBN: 978-0-7167-1925-0

1988
[4]

Alice and Bob on X: reversal, coupling, renewal

Geoffrey R. Grimmett. Alice and Bob on X: reversal, coupling, renewal. 2025. arXiv: 2409.00732 [math.PR]. URL: https://arxiv.org/abs/2409.00732

arXiv 2025
[5]

Absent-minded passengers

Norbert Henze and Günter Last. Absent-minded passengers. 2018. arXiv: 1809.10192 [math.PR]. URL: https://arxiv.org/abs/1809.10192

Pith/arXiv arXiv 2018
[6]

Various probability puzzles posted on Daniel Litt’s X profile @littmath

Daniel Litt. Various probability puzzles posted on Daniel Litt’s X profile @littmath. 2024. URL: https://x.com/littmath

2024
[7]

Tuesday Boy

Oliver Hawkins. Tuesday Boy. BBC News. Accessed: 2026-06-04. 2010. URL: http://news.bbc.co.uk/2/hi/programmes/more_or_less/8735812.stm

2010
[8]

Strategies for Rolling the Efron Dice

Christopher M. Rump. ‘Strategies for Rolling the Efron Dice’. In: Mathematics Magazine 74.3 (2001), pp. 212–. DOI: 10.1080/0025570X.2001.11953065. URL: https://www.jstor.org/stable/2690722

work page doi:10.1080/0025570x.2001.11953065 2001
[10]

‘A Problem in Probability’

Steve Selvin. ‘A Problem in Probability’. In: The American Statistician 29.1 (Feb. 1975). Letter to the editor, p. 67. DOI: 10.1080/00031305.1975.10479121. URL: https://www.tandfonline.com/doi/abs/10.1080/00031305.1975.10479121

work page doi:10.1080/00031305.1975.10479121 1975
[11]

E. H. Simpson. ‘The Interpretation of Interaction in Contingency Tables’. In: Journal of the Royal Statistical Society: Series B (Methodological) 13.2 (July 1951), pp. 238–241. ISSN: 0035-9246. DOI: 10.1111/j.2517-6161.1951.tb00088.x. eprint: https://academic.oup.com/jrsssb/article-pdf/13/2/238/49093972/jrsssb_13_2_238.pdf. UR...

work page doi:10.1111/j.2517- 1951
[12]

‘Judgment under Uncertainty: Heuristics and Biases’

Amos Tversky and Daniel Kahneman. ‘Judgment under Uncertainty: Heuristics and Biases’. In: Science 185.4157 (Sept. 1974), pp. 1124–1131. DOI: 10.1126/science.185.4157.1124. URL: https://www.science.org/doi/10.1126/science.185.4157.1124

work page doi:10.1126/science.185.4157.1124 1974
