
arxiv: 2606.06622 · v2 · pith:HP3YCFMOnew · submitted 2026-06-04 · 💻 cs.CL

UnpredictaBench: A Benchmark for Evaluating Distributional Randomness in LLMs

Pith reviewed 2026-06-28 01:41 UTC · model grok-4.3

classification 💻 cs.CL
keywords UnpredictaBench · distributional sampling · LLM evaluation · Kolmogorov-Smirnov test · stochastic simulation · output calibration · benchmark · randomness

The pith

No LLM exceeds 40 percent on KS@100 when sampling from target distributions in UnpredictaBench.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces UnpredictaBench to measure whether large language models can generate outputs that match the full variability of a target distribution rather than collapsing to one or a few plausible answers. This capability matters because models are already being used to stand in for humans or other stochastic entities in economic simulations and similar settings, where missing the spread of possible outcomes produces unrealistic results. The benchmark supplies 448 concrete problems drawn from statistical distributions, stochastic programs, and natural-language random processes, scored by the KS@N metric that applies the Kolmogorov-Smirnov test to compare model samples against ground-truth draws. Results across many open and closed models show a wide performance range but a hard ceiling below 40 percent at the standard KS@100 level, with only modest gains from added reasoning steps.

Core claim

UnpredictaBench isolates the task of sampling from individual target distributions and shows that current models cannot do so reliably: even the best systems fall short of 40 percent on the KS@100 metric, confirming that distributional simulation remains an open challenge even for simple cases.

What carries the argument

UnpredictaBench, a collection of 448 problems paired with the KS@N metric that counts the fraction of trials in which Kolmogorov-Smirnov tests fail to reject the hypothesis that model samples of size N come from the same distribution as ground-truth samples.
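The metric described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the significance level `alpha=0.05`, the number of trials, the equal ground-truth sample size, the asymptotic p-value formula, and the helper names `ks_statistic`, `ks_pvalue`, and `ks_at_n` are all assumptions made for the example. How the paper aggregates trials across its 448 problems is not stated on this page.

```python
import math
import random


def ks_statistic(xs, ys):
    """Two-sample KS statistic: largest gap between the two empirical CDFs."""
    xs, ys = sorted(xs), sorted(ys)
    n, m = len(xs), len(ys)
    i = j = 0
    d = 0.0
    while i < n and j < m:
        v = min(xs[i], ys[j])
        # Consume every copy of v in both samples so ties are stepped together.
        while i < n and xs[i] == v:
            i += 1
        while j < m and ys[j] == v:
            j += 1
        d = max(d, abs(i / n - j / m))
    return d


def ks_pvalue(d, n, m, terms=100):
    """Asymptotic p-value from the Kolmogorov distribution (an approximation)."""
    lam = math.sqrt(n * m / (n + m)) * d
    if lam < 0.2:  # the series is numerically unstable here; the true value is ~1
        return 1.0
    total = sum((-1) ** (k - 1) * math.exp(-2.0 * k * k * lam * lam)
                for k in range(1, terms + 1))
    return min(1.0, max(0.0, 2.0 * total))


def ks_at_n(sample_model, sample_truth, n, trials=200, alpha=0.05, seed=0):
    """Fraction of trials in which the KS test fails to reject (assumed alpha)."""
    rng = random.Random(seed)
    passes = 0
    for _ in range(trials):
        model = [sample_model(rng) for _ in range(n)]
        truth = [sample_truth(rng) for _ in range(n)]  # assumed: same size as model
        if ks_pvalue(ks_statistic(model, truth), n, n) >= alpha:
            passes += 1
    return passes / trials


if __name__ == "__main__":
    uniform = lambda rng: rng.random()
    collapsed = lambda rng: 0.5  # a "model" that always gives the single most plausible answer
    print("calibrated sampler:", ks_at_n(uniform, uniform, 100))
    print("collapsed sampler: ", ks_at_n(collapsed, uniform, 100))
```

The collapsed sampler illustrates the failure mode the benchmark targets: its answer is plausible on every draw, yet the sample as a whole is rejected because it carries none of the target's spread.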

If this is right

  • LLMs cannot yet serve as calibrated substitutes for stochastic agents in simulations without additional mechanisms for distributional matching.
  • Output-diversity techniques alone are insufficient because they do not guarantee calibration to a specific target distribution.
  • Reasoning enhancements yield only partial improvement and do not close the gap to acceptable performance.
  • Substantial room remains for new training or inference methods aimed at distributional sampling.
  • The benchmark supplies a concrete, quantitative signal for tracking progress on this capability.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same limitation likely constrains the reliability of LLM-based Monte Carlo methods or multi-agent economic models that rely on repeated random draws.
  • Extending the benchmark to joint distributions over several variables would test whether the observed shortfall scales to more realistic simulation settings.
  • Training objectives that directly penalize deviations measured by statistical tests could be explored as a targeted remedy.

Load-bearing premise

The 448 problems and the KS@N metric together provide a sufficient proxy for the distributional sampling demands that arise when LLMs are used as stand-ins for complex real-world systems.

What would settle it

A model that scores above 40 percent on KS@100 across the full set of 448 problems, or a demonstration that low KS@N scores do not predict failures in downstream simulation tasks that require calibrated randomness.

Figures

Figures reproduced from arXiv: 2606.06622 by Amirhossein Abaskohi, Amirhossein Dabiriaghdam, Ellie Dingqiao Wen, Giuseppe Carenini, Lele Wang, Liang Luo, Peter West.

Figure 5: Pearson correlation between UnpredictaBench KS@100 and metrics from NoveltyBench and CREATE across seven models. Each scatter plot compares one external benchmark metric against KS@100, with a fitted regression line. Panels recoverable from the source: CREATE Utility (p=0.7), r = 0.750; CREATE Utility (p=0.9), r = 0.779*; NoveltyBench Distinct10, r = -0.206; …
Figure 6: Mean and min-max range of KS@100 (%), Jensen-Shannon Divergence (JSD), and …
Figure 7: Per-run output diversity on the shuffling task, measured as the number of unique items …
Figure 8: Model sample distributions vs. ground truth for the …
Figure 9: Model sample distributions vs. ground truth for the …
Figure 10: Model sample distributions vs. ground truth for the …
Figure 11: Model sample distributions vs. ground truth for the …
Original abstract

We introduce UnpredictaBench, an evaluation that tests the ability of large language models (LLMs) to capture true underlying distributions. As LLMs are increasingly used as substitutes for other entities (e.g., for humans in economic simulations), the tendency of many models to collapse towards a single plausible answer means a failure to capture the unpredictability of real systems. Recent work on improving output diversity is insufficient for this setting: simulation requires samples that are calibrated to a target distribution, not merely varied outputs. UnpredictaBench isolates a simplified but fundamental version of this problem: sampling outcomes from individual target distributions, including canonical statistical distributions, distributions induced by stochastic programs, and natural-language scenarios that describe random processes. We introduce 448 such problems together with KS@N, a general-purpose evaluation metric that quantifies how well a model outputs approximate black-box target distributions via the Kolmogorov-Smirnov statistical test. This is the rate at which we fail to reject model samples of size N against ground-truth samples, with larger N indicating greater difficulty. Tested across open and proprietary models, we find a large spread in distributional capabilities. For instance, when models generate samples of size 100 (KS@100, our standard metric), scores range from near 0 to over 20%. No model is able to achieve over 40% at KS@100, showing significant headroom in distributional sampling as a capability. Although adding reasoning can somewhat increase scores, we find no immediate solution for this issue. UnpredictaBench shows that even simple distributional simulation remains challenging, making it a necessary first step toward using LLMs as stand-ins for complex systems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces UnpredictaBench, consisting of 448 tasks that require LLMs to sample from target distributions drawn from canonical statistical families, stochastic programs, and natural-language descriptions of random processes. It defines the KS@N metric as the fraction of trials in which a Kolmogorov-Smirnov test fails to reject the hypothesis that N model samples are drawn from the same distribution as ground-truth samples. Across open and proprietary models the reported KS@100 scores range from near zero to above 20 percent, with no model exceeding 40 percent; modest gains are observed when chain-of-thought reasoning is added, but the gap remains large.

Significance. If the empirical findings are robust, the work identifies a concrete capability gap in distributional calibration that is directly relevant to the growing use of LLMs as substitutes for agents or stochastic components in simulations. The benchmark supplies a reproducible, statistically grounded yardstick (KS@N) that is independent of any model-specific parameters, and the systematic coverage of three distinct task sources is a clear methodological strength.

major comments (2)
  1. [Abstract and §4 (Results)] The headline claim that the benchmark reveals 'significant headroom' for using LLMs as stand-ins for complex real-world systems rests on the untested assumption that success on isolated single-distribution sampling transfers to joint, conditional, and temporally extended sampling. No correlation study, ablation, or external validation is reported that links KS@N scores to calibration performance in multi-component settings (e.g., multi-agent economic models).
  2. [§3 (Benchmark Construction)] The manuscript supplies no quantitative details on how the 448 tasks were sampled, what criteria governed inclusion or difficulty calibration, or the statistical power of the KS tests at each N. Without these, it is impossible to determine whether the reported performance ceiling (no model above 40% at KS@100) is an artifact of task selection or a genuine capability limit.
minor comments (2)
  1. [§2 (Metric)] The definition of the KS@N failure-to-reject rate would benefit from an explicit equation or pseudocode in the methods section to avoid ambiguity about how ties and multiple trials are aggregated.
  2. [Figures 2-4] Figure captions and axis labels should explicitly state the number of trials per model-task pair so that readers can assess variance in the reported percentages.
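
As an illustration of the explicit definition that minor comment 1 asks for, one possible form is sketched here. It rests on assumptions this page does not confirm: a problem set P, T trials per problem, N model samples and M ground-truth samples per trial, and a significance level α. The symbols are illustrative and are not the paper's notation.

```latex
D_{p,t} = \sup_{x} \left| \hat{F}^{\,\mathrm{model}}_{p,t,N}(x) - \hat{F}^{\,\mathrm{truth}}_{p,t,M}(x) \right|,
\qquad
\mathrm{KS@}N = \frac{1}{|P|\,T} \sum_{p \in P} \sum_{t=1}^{T}
\mathbb{1}\!\left[ \operatorname{pval}\!\left(D_{p,t};\, N, M\right) \ge \alpha \right]
```

Under this form, tied samples are absorbed by the supremum over the empirical CDFs, so the open question in the comment reduces to how the indicator is averaged over problems and trials.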

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful review and constructive suggestions. We address each major comment below, indicating planned revisions where appropriate.

Point-by-point responses
  1. Referee: [Abstract and §4 (Results)] The headline claim that the benchmark reveals 'significant headroom' for using LLMs as stand-ins for complex real-world systems rests on the untested assumption that success on isolated single-distribution sampling transfers to joint, conditional, and temporally extended sampling. No correlation study, ablation, or external validation is reported that links KS@N scores to calibration performance in multi-component settings (e.g., multi-agent economic models).

    Authors: We agree that the manuscript does not provide empirical evidence linking single-distribution KS@N performance to multi-component or temporally extended settings. The paper explicitly frames UnpredictaBench as testing 'a simplified but fundamental version of this problem' (abstract) and positions it as 'a necessary first step' rather than a complete proxy. The headline claim of headroom is scoped to distributional sampling capability itself. To strengthen the manuscript we will revise the abstract and §4 to include an explicit limitations paragraph acknowledging the transfer assumption and outlining planned extensions to joint/conditional sampling. No new experiments are feasible within the current revision cycle.
    revision: partial

  2. Referee: [§3 (Benchmark Construction)] The manuscript supplies no quantitative details on how the 448 tasks were sampled, what criteria governed inclusion or difficulty calibration, or the statistical power of the KS tests at each N. Without these, it is impossible to determine whether the reported performance ceiling (no model >40 % at KS@100) is an artifact of task selection or a genuine capability limit.

    Authors: We will expand §3 with the requested details: the procedure for sampling parameter ranges from each statistical family, the inclusion criteria (coverage of 8 canonical families plus stochastic programs and NL scenarios, with parameter bounds chosen to avoid degenerate cases), and a power analysis of the KS test at N=10, 50, 100, and 500 using the ground-truth sample sizes. These additions will be placed in a new subsection on task generation and statistical considerations.
    revision: yes
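
A power analysis of the kind promised in this response could be approximated by simulation. The sketch below is a hypothetical illustration, not the authors' procedure: the unit-variance normal target, the mean shift of 0.5, the ground-truth sample size `m=1000`, `alpha=0.05`, the large-sample critical value, and the function names are all assumptions chosen for the example.

```python
import math
import random


def ks_statistic(xs, ys):
    """Two-sample KS statistic: largest gap between the two empirical CDFs."""
    xs, ys = sorted(xs), sorted(ys)
    n, m = len(xs), len(ys)
    i = j = 0
    d = 0.0
    while i < n and j < m:
        v = min(xs[i], ys[j])
        while i < n and xs[i] == v:
            i += 1
        while j < m and ys[j] == v:
            j += 1
        d = max(d, abs(i / n - j / m))
    return d


def ks_rejects(xs, ys, alpha=0.05):
    """Large-sample decision rule: reject when D exceeds c(alpha) * sqrt((n+m)/(n*m))."""
    n, m = len(xs), len(ys)
    c_alpha = math.sqrt(-0.5 * math.log(alpha / 2.0))
    return ks_statistic(xs, ys) > c_alpha * math.sqrt((n + m) / (n * m))


def estimate_power(n, shift, m=1000, trials=200, alpha=0.05, seed=0):
    """Monte Carlo rejection rate when the 'model' is N(shift, 1) and the truth is N(0, 1)."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(trials):
        model = [rng.gauss(shift, 1.0) for _ in range(n)]
        truth = [rng.gauss(0.0, 1.0) for _ in range(m)]
        if ks_rejects(model, truth, alpha):
            rejections += 1
    return rejections / trials


if __name__ == "__main__":
    # The N values match those named in the authors' response.
    for n in (10, 50, 100, 500):
        print(f"N={n:>3}  power vs. shift 0.5: {estimate_power(n, 0.5):.2f}  "
              f"false-reject rate: {estimate_power(n, 0.0):.2f}")
```

A table of this kind would show how much of a low KS@N score at small N reflects the test's limited ability to detect a mismatch, which bears on the abstract's statement that larger N indicates greater difficulty.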

Circularity Check

0 steps flagged

No circularity: benchmark and KS metric are externally grounded

full rationale

The paper defines UnpredictaBench (448 problems) and KS@N (failure-to-reject rate under Kolmogorov-Smirnov test) as new constructs, then reports empirical scores on open/proprietary models. The headline result (no model >40% at KS@100) is a direct measurement against ground-truth samples using the standard external KS procedure; it does not reduce to any fitted parameter, self-citation chain, or definitional equivalence inside the paper. No load-bearing step matches the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper introduces an evaluation framework whose central elements are the problem set and the KS@N metric; these rest on the standard statistical properties of the Kolmogorov-Smirnov test and on the design choice of the 448 tasks.

axioms (1)
  • Domain assumption: The Kolmogorov-Smirnov test provides a valid measure of whether model-generated samples approximate a target distribution.
    Directly invoked as the basis for the KS@N metric in the abstract.


