Counterintuitive problems in discrete probability
Pith reviewed 2026-06-27 21:01 UTC · model grok-4.3
The pith
A collection of counterintuitive discrete probability problems with human solutions is released as a public dataset.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We have gathered and solved a set of discrete probability problems constructed to expose the gap between intuitive answers and correct probability calculations. The full list and the accompanying solutions are available for direct use in experiments on reasoning.
What carries the argument
The dataset itself, consisting of selected problems that each target a specific heuristic error in probability reasoning.
If this is right
- The problems supply a ready-made benchmark for measuring how often language models produce the same incorrect answers that humans reach through heuristics.
- Researchers can now run controlled comparisons between human performance and model performance on the same fixed list of items.
- The public release allows other groups to extend the collection or to add new variants while keeping the original solutions as a fixed reference.
- The problems can be used directly in teaching or in experiments that study the persistence of specific probability misconceptions.
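As a rough illustration of the benchmark use described above, the following minimal sketch scores a set of answers against fixed reference solutions and also counts how often an answer matches the intuitive but wrong value. The item identifiers, the two example items, and the `score` helper are hypothetical and are not taken from the released dataset; they only assume that each problem has a single exact numerical answer.

```python
from fractions import Fraction

# Hypothetical items (not taken from the dataset): each pairs the exact
# reference answer with the answer that the common heuristic produces.
ITEMS = {
    "monty_hall_switch": {
        "reference": Fraction(2, 3),
        "heuristic": Fraction(1, 2),
    },
    "two_children_both_boys_given_at_least_one_boy": {
        "reference": Fraction(1, 3),
        "heuristic": Fraction(1, 2),
    },
}


def score(answers):
    """Return (accuracy, heuristic_rate) for a dict of item id -> Fraction."""
    n = len(ITEMS)
    correct = sum(answers.get(key) == item["reference"] for key, item in ITEMS.items())
    trapped = sum(answers.get(key) == item["heuristic"] for key, item in ITEMS.items())
    return Fraction(correct, n), Fraction(trapped, n)


# Example: one correct answer, one answer that falls for the heuristic.
model_answers = {
    "monty_hall_switch": Fraction(2, 3),
    "two_children_both_boys_given_at_least_one_boy": Fraction(1, 2),
}
accuracy, heuristic_rate = score(model_answers)
print(accuracy, heuristic_rate)
```

Comparing exact fractions rather than floats avoids tolerance choices, at the cost of requiring answers to be parsed into rational form first.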
Where Pith is reading between the lines
- The same selection criteria could be applied to continuous probability or to problems involving conditional independence to test whether the same pattern of model errors appears.
- Systematic logging of which problems cause the largest divergence between model and human answers might reveal clusters of related biases that current training data do not correct.
- Because the solutions are written out in full, the collection could serve as training material for supervised fine-tuning aimed at reducing specific probability errors.
Load-bearing premise
The selected problems do trigger the intended errors and the written solutions contain no hidden mistakes.
What would settle it
Discovery of a mathematically incorrect solution in the provided answers would show that the reference set cannot be trusted as a benchmark.
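One way such a check could be carried out is by exact enumeration of a problem's sample space. The sketch below does this for the classical Monty Hall problem, used here only as a stand-in; whether and how that problem appears in the dataset is an assumption, as are the function names and the stated host behaviour.

```python
from fractions import Fraction
from itertools import product


def monty_hall_switch_win_probability():
    """Exact P(win by switching).

    Assumptions: car position and first pick are independent and uniform,
    the host always opens a goat door other than the pick, and chooses
    uniformly when two such doors are available.
    """
    doors = range(3)
    total = Fraction(0)
    for car, pick in product(doors, doors):  # each pair has probability 1/9
        options = [d for d in doors if d != pick and d != car]
        for opened in options:
            # The door the contestant ends up with after switching.
            switched = next(d for d in doors if d not in (pick, opened))
            if switched == car:
                total += Fraction(1, 9) * Fraction(1, len(options))
    return total


def matches_reference(computed, reference):
    """True when an independently computed value equals the written answer."""
    return computed == reference


p = monty_hall_switch_win_probability()
print(p, matches_reference(p, Fraction(2, 3)))
```

A mismatch between the enumerated value and a written solution would flag that item for manual review; it would not by itself say which of the two is wrong, since the enumeration encodes its own reading of the problem statement.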
Original abstract
This manuscript contains a collection of counterintuitive problems in discrete probability, together with detailed solutions. The dataset was constructed as part of a broader research project investigating the capabilities of the latest-generation Large Language Models (LLMs) in solving discrete probability problems, in order to assess whether LLMs tend to make systematic reasoning errors associated with known cognitive biases. The problems collected here are specifically designed to challenge heuristic reasoning strategies that often lead to intuitively appealing but mathematically incorrect conclusions. The dataset combines several types of problems. Some are adapted from classical probabilistic paradoxes and cognitive-bias literature, while others originate from recreational mathematics sources or were developed by ourselves following similar principles. The primary purpose of this document is to provide a transparent and publicly accessible reference for the problems used in our experimental evaluation of language models, as well as providing detailed human-made solutions. At the same time, we believe that this collection may also prove useful for future research on probabilistic reasoning, cognitive biases, and the evaluation of reasoning capabilities in artificial intelligence systems.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents a curated collection of counterintuitive discrete probability problems accompanied by detailed human-made solutions. It is positioned as a transparent reference dataset for evaluating large language models on tasks designed to expose heuristic reasoning errors and cognitive biases, drawing from classical paradoxes, cognitive-bias literature, recreational mathematics, and original constructions.
Significance. If the supplied solutions are accurate, the collection provides a reusable benchmark resource for research on probabilistic reasoning in AI systems and for studies of cognitive biases. The explicit transparency in documenting the problem set and human solutions supports reproducibility in LLM evaluation experiments.
minor comments (1)
- [Abstract] The description of the dataset construction mentions adaptation from multiple sources but does not state the total number of problems or their distribution across categories. Reporting both would help readers gauge the collection's scope and balance.
Simulated Author's Rebuttal
We thank the referee for their positive summary of the manuscript, recognition of its potential utility as a benchmark resource, and recommendation for minor revision. The report contains no major comments to address.
Circularity Check
No circularity; descriptive problem collection with no derivations
Full rationale
The manuscript presents a curated list of counterintuitive discrete probability problems together with human solutions. It asserts no mathematical identities, scaling relations, predictions, or fitted parameters whose validity depends on unverified self-referential steps. No equations, uniqueness theorems, or ansatzes are introduced; the text is explicitly positioned as a transparent reference dataset for LLM evaluation rather than a vehicle for novel derivations. Self-citations are absent from the provided content, and the construction description (adapting classical paradoxes or developing new problems) is purely declarative with no load-bearing reduction to prior outputs. This is the normal case of a self-contained reference document.
Forward citations
Cited by 1 Pith paper
- How reliable are LLMs when it comes to playing dice?
LLMs score 0.96 on standard probability exercises but 0.59 on counterintuitive ones and drop further with biased wording or misleading cues, indicating they are not genuine probabilistic reasoners.