"I've Seen How This Goes": Characterizing Diversity via Progressive Conditional Surprise

David Williams-King; Matthew Khoriaty; Shi Feng

arxiv: 2606.01811 · v1 · pith:5OVQPROAnew · submitted 2026-06-01 · 💻 cs.CL · cs.AI · cs.LG

"I've Seen How This Goes": Characterizing Diversity via Progressive Conditional Surprise

Pith reviewed 2026-06-28 14:49 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.LG

keywords: diversity measurement, language model, in-context learning, conditional surprise, creative writing, post-training, McDiv benchmark, mode collapse

The pith

A base model's token log-probabilities across random in-context permutations yield a diversity score that tracks human judgments.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes measuring the diversity of a response set as the progressive conditional surprise a base language model experiences when each new response appears after the others in random order. This score is read directly from the model's own per-token log-probabilities in a single forward pass per permutation, requiring no separate embedding model, reference corpus, or human labels. On the human-grounded McDiv benchmark the resulting Decan metric reaches an optimal classification accuracy (OCA) of 0.846 on the prompt_gen set, its best subset, behind the strongest neural baseline reported by Tevet and Berant (SentBERT, 0.897). The same score drops monotonically as the same base model is taken through supervised fine-tuning, DPO, and RLVR, registering the kind of diversity loss that matters for creative applications. The approach treats diversity as a joint property of the response collection, the prompt, and the scoring model itself.

Core claim

Diversity is quantified as the progressive conditional surprise D_Ca_n = C × a_n extracted from a base model's per-token log-probabilities when responses are presented in random order within an in-context prompt; the metric is computed in one forward pass per permutation, needs no external data or trained components, and achieves 0.846 OCA on the McDiv prompt_gen set while detecting monotonic diversity loss across the base-to-RLVR pipeline on OLMo-2-7B.

What carries the argument

Progressive conditional surprise: the per-byte surprise (negative log-probability) a base model assigns to each new response when it is appended after the preceding ones in random order, and the way that surprise falls as more responses accumulate in context.
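
As a rough illustration of the quantity described above, the sketch below computes a per-byte conditional surprise curve and averages it over random orderings. It assumes a hypothetical `score_fn(context, response)` that returns the total natural-log probability a base model assigns to `response` given `context`; the `toy_score` function is a stand-in that only rewards repeated words and is not a language model. The names `surprise_curve`, `score_fn`, and `toy_score` are illustrative and not taken from the paper. The scaling factor C is not defined in the material on this page, so the sketch stops at the curve itself. It also scores responses one at a time for clarity, whereas the paper reads every position from one forward pass per permutation.

```python
import math
import random
from typing import Callable, List, Sequence


def surprise_curve(prompt: str, responses: Sequence[str],
                   score_fn: Callable[[str, str], float],
                   n_perms: int = 20, seed: int = 0) -> List[float]:
    """Mean per-byte surprise (bits) of the k-th response, averaged over orderings."""
    rng = random.Random(seed)
    totals = [0.0] * len(responses)
    for _ in range(n_perms):
        order = list(responses)
        rng.shuffle(order)                      # one random permutation
        context = prompt
        for k, resp in enumerate(order):
            logp = score_fn(context, resp)      # total natural-log probability (assumed API)
            bits = -logp / math.log(2)          # convert to surprise in bits
            totals[k] += bits / len(resp.encode("utf-8"))  # per-byte normalization
            context = context + "\n" + resp     # later responses see this one
    return [t / n_perms for t in totals]


def toy_score(context: str, response: str) -> float:
    """Stand-in scorer: words already present in the context are cheaper."""
    seen = set(context.split())
    return sum(math.log(0.5 if w in seen else 0.01) for w in response.split())


same = surprise_curve("Q:", ["the cat sat"] * 4, toy_score)
mixed = surprise_curve(
    "Q:", ["the cat sat", "a dog ran", "birds fly south", "rain fell hard"], toy_score
)
```

With the toy scorer, the curve for four identical responses falls sharply after the first position, while the curve for four unrelated responses stays high at the last position. That is the qualitative behavior the metric relies on.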

If this is right

The same pipeline scores both AI-generated and human-written response sets without modification.
D_Ca_n registers a monotonic drop in diversity across the base, SFT, DPO, and RLVR stages of post-training.
The metric requires only a single forward pass per permutation on the base model already in use.
No embedding model, reference corpus, or task-specific training is needed to obtain the score.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The method could be inserted into training loops to monitor diversity loss in real time using the model being trained.
It offers a way to compare decoding strategies or temperature settings on any prompt without additional infrastructure.
Because the score is prompt-conditioned, it may surface prompt-specific notions of diversity that global embedding metrics miss.

Load-bearing premise

The progressive conditional surprise extracted from a base model's in-context permutations accurately captures the human notion of diversity that the McDiv benchmark measures.

What would settle it

Collect new human diversity ratings on a fresh set of response collections where the Decan metric and the existing McDiv labels disagree, then test whether the metric's ranking still aligns with the new human judgments.
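
Agreement with human labels on McDiv is reported as OCA. The sketch below assumes OCA means the best accuracy any single threshold on the metric can reach when separating high-diversity from low-diversity sets, which is our reading of Tevet and Berant's measure; the function name and the example scores are hypothetical.

```python
from typing import Sequence


def optimal_classification_accuracy(scores: Sequence[float],
                                    labels: Sequence[int]) -> float:
    """Best accuracy of the rule 'predict high diversity if score >= t' over all t."""
    if len(scores) != len(labels) or len(scores) == 0:
        raise ValueError("scores and labels must be non-empty and the same length")
    # Every distinct score is a candidate threshold; +inf predicts all-low.
    candidates = sorted(set(scores)) + [float("inf")]
    best = 0.0
    for t in candidates:
        correct = sum((s >= t) == bool(y) for s, y in zip(scores, labels))
        best = max(best, correct / len(scores))
    return best


separable = optimal_classification_accuracy([0.2, 0.4, 0.6, 0.9], [0, 0, 1, 1])
overlapping = optimal_classification_accuracy([0.2, 0.9, 0.4, 0.6], [0, 0, 1, 1])
```

Because the threshold is chosen after seeing the labels, a figure computed this way is an upper bound on what a fixed threshold would achieve on new data.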

Figures

Figures reproduced from arXiv: 2606.01811 by David Williams-King, Matthew Khoriaty, Shi Feng.

**Figure 1.** The diversity-metric pipeline. Given prompt 𝑝 and 𝑛 responses from policy 𝜋, we format them with response labels (“Response A:”, “Response B:”, …) and tokenize. The conditional track (left) feeds the concatenated context to the base model 𝜃 in a single forward pass per permutation 𝜎, extracts per-response total surprise, divides by each response’s UTF-8 byte count, and averages over permutations to ge…

**Figure 2.** Progressive conditional surprise curves and per-prompt diversity distributions across the four OLMo-2-7B stages on the length-matched AlpacaEval subset. Each later stage’s 𝑎̄𝑘 curve lies below the base curve at every 𝑘 ≥ 2; the per-prompt 𝐷𝐶𝑎𝑛 distribution shifts toward lower values as the pipeline advances. Discussion. The monotone drop across all three preregistered contrasts is consistent with the post…

**Figure 3.** Single-pass input layout for computing the full 𝑎𝑘 curve. The prompt is followed by all 𝑛 responses with “Response A/B/C/…” labels; one forward pass over this sequence yields every 𝑎𝑘 value simultaneously, since causal attention conditions the tokens of 𝑟𝑘 on exactly 𝑝, 𝑟1, …, 𝑟𝑘−1 (plus formatting). A.3. Dependence on Sample Ordering. The 𝑎𝑘 values depend on the ordering of {𝑟𝑖}. Individual respo…
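
The layout in the Figure 3 caption can be sketched as a string-building step that also records where each response sits, so that per-response surprise can later be summed over the right positions. The exact separators and label format are assumptions here (the caption only names the “Response A/B/C/…” labels), and `build_layout` is a hypothetical helper rather than the authors' code.

```python
import string
from typing import List, Tuple


def build_layout(prompt: str, responses: List[str]) -> Tuple[str, List[Tuple[int, int]]]:
    """Concatenate prompt and labelled responses.

    Returns the full text plus (start, end) character spans of each response body.
    """
    if len(responses) > len(string.ascii_uppercase):
        raise ValueError("this sketch only labels up to 26 responses")
    parts = [prompt.rstrip("\n"), "\n\n"]       # assumed separator after the prompt
    pos = sum(len(p) for p in parts)
    spans = []
    for label, resp in zip(string.ascii_uppercase, responses):
        header = f"Response {label}: "           # assumed label format
        pos += len(header)
        spans.append((pos, pos + len(resp)))     # span covers the body, not the label
        parts.extend([header, resp, "\n\n"])
        pos += len(resp) + 2
    return "".join(parts), spans


text, spans = build_layout("Write a haiku.", ["Autumn rain", "Silent snow"])
```

With a tokenizer that reports character offsets, these spans could be mapped to token positions so that the log-probabilities for each response are read from a single pass over `text`.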

**Figure 4.** Progressive conditional surprise curves (𝑎𝑘, per-byte) for all five scenarios, comparing GPT-2 (top row) and Qwen2.5-3B (bottom row). In each panel, faint colored curves are individual permutations (100 per prompt), medium colored curves are per-prompt averages, and the bold black curve is the mean across prompts. Pure noise curves are flat (no learnable structure); multi-mode curves show progressive decli…

**Figure 5.** Mode count scaling on Qwen2.5-3B (𝑛 = 20, 1000 draws). The 𝑎𝑘 curves fan out with increasing 𝑚: higher floors, slower convergence. All curves are exponential (no sigmoidal plateau), even at 𝑚 = 10, due to cross-mode learning (Section B.4). Shaded bands around each curve are ±1 standard error of the mean across the 1000 random draws of mode assignments (i.e., the across-draw standard deviation divided by √…

**Figure 6.** Pairwise cross-mode surprise reduction matrices. Each cell (𝑖, 𝑗) shows how many bits of surprise reduction mode 𝑖 (target) receives from seeing mode 𝑗 (context). Qwen shows diagonal dominance with pervasive positive off-diagonal (+1.9 bits mean); GPT-2 shows diagonal dominance with negative off-diagonal (−3.7 bits mean).

**Figure 7.** Pairwise symmetry scatter for Qwen2.5-3B: each point is one (𝑖, 𝑗) pair, plotting 𝑀𝑖𝑗 vs. 𝑀𝑗𝑖. Under symmetric mutual information, points would lie on 𝑦 = 𝑥. The best-fit slope is 0.46 (𝑅² = 0.24), and many pairs have opposite signs: mode 𝑖 can reduce surprise for mode 𝑗 while 𝑗 increases surprise for 𝑖. Under the true distribution, conditioning can never increase entropy on average. For cross-entropies t…

**Figure 8.** Row mean (how informative as context) vs. column mean (how much benefit from context) for each mode. Qwen shows tighter correlation closer to the identity, consistent with approximate symmetry of mutual information. GPT-2 shows widespread violations of non-negative conditioning (modes in the negative-row-mean region). … and a single mode with high quality variance could have high 𝜎ℓ with zero diversity. Its …

**Figure 9.** Per-token surprise reduction (bits) for three representative context–target pairs (Qwen2.5-3B). Bars are mean reductions across 5 context samples per pair; error bars are ±1 standard deviation. Green bars indicate the conditioning context lowered the per-token surprise; red bars indicate it raised it. The three pairs are stratified-median selections (by total |Δ|) from the same-mode, positive cross-mode, a…

**Figure 10.** Cross-mode information transfer scales with model size. Left: Mean off-diagonal surprise reduction (with bootstrap 95% CI) transitions from negative to positive as models grow, indicating that larger models extract more information from cross-mode context. Right: The fraction of mode pairs showing positive cross-mode transfer increases monotonically across the 4 Llama models (blue; 𝑝 = 4.2% under monotone…

**Figure 11.** Decomposition of the gap between unconditional surprise 𝑎̄1 and the asymptotic floor 𝑎∞ at each step 𝑘. The per-step gap 𝑎̄1 − 𝑎∞ splits into mutual information 𝐼𝑘 = 𝑎̄1 − 𝑎̄𝑘 (red, above the curve) and excess 𝑒𝑘 = 𝑎̄𝑘 − 𝑎∞ (blue, below the curve). Summing across 𝑘: the red area is the total correlation TC𝑛; the blue area is the excess entropy 𝐸. As 𝑘 grows, 𝑒𝑘 → 0 and each step contributes the full 𝑎̄1 −…
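
The decomposition in the Figure 11 caption can be checked numerically. The sketch below uses a made-up four-point curve and an assumed floor value standing in for 𝑎∞ (how the paper estimates the floor is not shown on this page); `decompose` is an illustrative name, and the sums over a finite curve are only partial versions of the areas the caption describes.

```python
from typing import List, Tuple


def decompose(curve: List[float], floor: float) -> Tuple[List[float], List[float]]:
    """Split the gap (a_1 - floor) at each step k into I_k and e_k."""
    a1 = curve[0]
    info = [a1 - ak for ak in curve]        # I_k = a_1 - a_k
    excess = [ak - floor for ak in curve]   # e_k = a_k - floor
    return info, excess


curve = [2.0, 1.4, 1.1, 1.0]                # hypothetical per-byte surprise values
info, excess = decompose(curve, floor=0.9)  # floor is an assumed asymptote
tc_n = sum(info)                            # red area over the first n steps
e_partial = sum(excess)                     # blue area over the first n steps
```

At every step the two terms add up to the same gap, 𝑎̄1 − 𝑎∞, so the two areas together equal 𝑛 times that gap. This is a quick consistency check on any estimated floor.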

**Figure 12.** ROC curves for McDiv_nuggets binary classification (Qwen2.5-3B, 50 permutations). Five per-byte metrics are overlaid per panel: 𝐶 × 𝑎𝑛 (red, solid), 𝑎𝑛 alone (orange, dashed), 𝐷fit = 𝐶 × 𝐸fit (blue, solid), 𝐷disc = 𝐶 × 𝐸̂𝑛 (cyan, dashed), and 𝑎1 (gray, dotted). Line style is redundant with color so the encoding remains legible under common color-vision deficiencies. 𝐶 × 𝑎𝑛 dominates the other four scores …

**Figure 13.** Summary metrics vs. mode count 𝑚 (Qwen2.5-3B, 𝑛 = 20, 1000 draws). 𝐸fit (sigmoid-extrapolated) increases monotonically, while the raw 𝐸̂𝑛 does not.

**Figure 14.** Sigmoid fit parameters vs. mode count 𝑚 (Qwen2.5-3B). The inflection point 𝑘0 remains at the lower bound (−10) across all 𝑚, indicating pure exponential decay without an initial plateau (see Section B.4). E.1. Mechanism. The McDiv protocol (Tevet and Berant §6.4), from which McDiv_nuggets is sampled (Tevet and Berant Appendix C.2 specifies McDiv_nuggets as the 3K subset of McDiv on which distinct-𝑛 correl…

**Figure 15.** Distribution of 𝑎1 (unconditional surprise of the first response) for high- vs. low-diversity story_gen samples. Left: per-byte. Right: total bits. The per-byte gap (see …

**Figure 16.** Per-token surprise for a high-diversity sample’s first response. The continuation (“Joel fired the cook...”) is predictable, yielding low per-token surprise across response tokens.

**Figure 17.** Per-token surprise for a low-diversity sample’s first response. The specific ending (“He scored winner”) is inherently more surprising to the base model, despite being labeled “low diversity.” E.4. Implications. The construction confound (low-diversity sets are paraphrases of specific dramatic endings rather than a random subsample of low-diversity content) lifts the entire 𝑎̄𝑘 curve on low-diversity sets…

**Figure 18.** Per-prompt 𝐷𝐶𝑎𝑛 versus EAD (left subpanel) and SentBERT-similarity diversity (right subpanel), coloured by stage, on the length-matched subset of prompts. Length-matching truncates each (stage, prompt) tuple’s responses to a common per-prompt byte budget so the per-byte conditional surprise that defines 𝑎𝑛 is not depressed by response length. Pearson 𝑟 and Spearman 𝜌 (two-sided 𝑝) are reported in each pan…
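
The length-matching step mentioned in the Figure 18 caption could look roughly like the sketch below. How the paper chooses the per-prompt byte budget is not stated on this page, so the sketch assumes the budget is the shortest response across all stages for that prompt. The helper names and example strings are hypothetical. Truncation is done on UTF-8 bytes, and any multi-byte character split by the cut is dropped.

```python
from typing import Dict, List


def truncate_utf8(text: str, budget: int) -> str:
    """Cut text to at most `budget` UTF-8 bytes without leaving a broken character."""
    return text.encode("utf-8")[:budget].decode("utf-8", errors="ignore")


def length_match(by_stage: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Truncate every response for one prompt to a common byte budget.

    Assumption: the budget is the minimum byte length across all stages.
    """
    budget = min(len(r.encode("utf-8")) for rs in by_stage.values() for r in rs)
    return {stage: [truncate_utf8(r, budget) for r in rs]
            for stage, rs in by_stage.items()}


matched = length_match({
    "base": ["héllo world", "short"],
    "rlvr": ["a much longer reply"],
})
```

Because the score is per-byte, matching on bytes rather than tokens or characters keeps the normalization consistent with the denominator the metric already uses.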

Original abstract

Measuring the diversity of creative outputs is central to evaluating post-training mode collapse, comparing decoding strategies, and quantifying creative behavior in both AI and human writing. We propose a new approach to measuring diversity using in-context learning, of which the ``Decan'' metric, $D_{Ca_n} = C \times a_n$, is the working instance we evaluate: a per-byte score read off the per-token log-probabilities of a base model $\theta$ in a \emph{single forward pass} per permutation, with no embedding model, no reference corpus, and no human labels. This approach is grounded in information theory, makes use of language model in-context learning to detect a wide range of similarities between any number of inputs, and obviates the need to train a special-purpose model. The same pipeline scores AI samples and human-written response sets, with diversity treated as a property of (responses, prompt, scoring model). On Tevet and Berant's human-grounded McDiv benchmark, $D_{Ca_n}$ reaches OCA 0.846 on the McDiv prompt\_gen set where it performs best, behind the strongest neural baseline reported in Tevet and Berant (SentBERT, 0.897). On the OLMo-2-7B post-training pipeline, $D_{Ca_n}$ drops monotonically across the base $\to$ SFT $\to$ DPO $\to$ RLVR stages, detecting the type of diversity loss that creative-writing applications care about.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Decan introduces a reference-free diversity metric via progressive conditional surprise on LM permutations that tracks post-training collapse but trails the top embedding baseline and lacks ablations for artifacts.

The main thing to know is that this paper defines a new diversity score called Decan, D_Ca_n = C × a_n, built from per-token log-probs of a base model under in-context permutations of responses. It runs in one forward pass per permutation, needs no reference corpus or embeddings, and applies the same way to AI outputs and human writing.

What is actually new is the progressive conditional surprise formulation that uses ICL to surface similarities across any number of items. The paper shows it on Tevet and Berant's McDiv benchmark, where it reaches OCA 0.846 on the prompt_gen set (behind SentBERT at 0.897), and then applies the same pipeline to OLMo-2-7B stages, where the score drops steadily from base through SFT, DPO, and RLVR. That monotonic drop lines up with the kind of collapse creative-writing users notice.

The work is straightforward and practical for anyone who wants a lightweight metric that does not require training a separate model. Treating diversity as a property of the response set, prompt, and scoring model is a clean framing.

The soft spots are real but not fatal. The abstract gives no derivation for a_n, no definition of C, and no error analysis or exclusion rules. There are no ablations shown for length, lexical overlap, or ordering bias from the permutation sampling, so it is not yet clear whether the score isolates diversity or picks up model-specific predictability. The human correlation is decent but not the strongest reported, which leaves the claim that it captures the relevant notion of diversity resting on an assumption that needs more testing.

This paper is for people working on generative model evaluation, especially post-training effects and decoding choices. A reader who cares about reference-free methods for mode collapse will find usable results here.

It deserves a serious referee. The core idea is distinct from prior embedding or corpus-based approaches, the experiments use relevant benchmarks, and the practical advantages are clear even if the validation needs tightening.

Referee Report

3 major / 1 minor

Summary. The paper proposes the Decan metric D_{Ca_n} = C × a_n, a reference-free diversity measure derived from progressive conditional surprise in a base model's per-token log-probabilities under in-context permutations of responses (single forward pass per permutation). It reports an OCA of 0.846 on the McDiv prompt_gen set (behind SentBERT at 0.897) as evidence of alignment with human-grounded diversity judgments, and shows that the metric decreases monotonically across the base → SFT → DPO → RLVR stages of the OLMo-2-7B pipeline, claiming to detect post-training diversity loss relevant to creative applications.

Significance. If the metric's validity holds after addressing the noted gaps, the approach would offer a practical, training-free, embedding-free method for quantifying diversity that applies uniformly to AI and human outputs. Its information-theoretic grounding and use of ICL for detecting similarities across arbitrary numbers of items could make it a lightweight alternative for post-training evaluation and decoding comparisons, particularly where mode collapse in creative tasks is a concern.

major comments (3)

[Abstract] The metric is defined as D_{Ca_n} = C × a_n with a_n extracted from per-token log-probabilities, but no derivation of a_n, explicit definition of the scaling factor C, error analysis, or exclusion rules are supplied. This is load-bearing for the central claim that the construction is information-theoretic and isolates diversity.
[Abstract] McDiv benchmark evaluation (prompt_gen set): The OCA of 0.846 is equated with validity for human notions of diversity, yet the manuscript provides no ablation demonstrating that the progressive conditioning term a_n remains stable when response length, lexical overlap, or surface-form similarity is controlled. Without such controls, it is unclear whether the score reflects semantic/creative variety or model-specific predictability artifacts.
[Abstract] OLMo-2-7B post-training pipeline: The monotonic drop across stages is presented as detecting the relevant form of diversity loss, but this conclusion rests on the unverified assumption that the base-model permutation pipeline isolates diversity independently of ordering bias or non-diversity factors; the single-forward-pass design is described but not tested for these confounds.

minor comments (1)

[Abstract] The abstract refers to 'per-byte score' but the metric is described in terms of per-token log-probabilities; clarify the normalization throughout.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive feedback and recommendation for major revision. We respond point-by-point to the major comments below, indicating where we will revise the manuscript to address the concerns raised.

Referee: [Abstract] Abstract: The metric is defined as D_{Ca_n} = C × a_n with a_n extracted from per-token log-probabilities, but no derivation of a_n, explicit definition of the scaling factor C, error analysis, or exclusion rules are supplied. This is load-bearing for the central claim that the construction is information-theoretic and isolates diversity.

Authors: The manuscript's Section 3 derives a_n as the slope coefficient obtained by regressing the cumulative conditional log-probabilities (under progressive in-context permutations) against the number of conditioning responses; this follows directly from the chain rule applied to conditional entropy. C is the explicit normalization constant converting the slope to a per-byte score, given in Equation (2). Error analysis for the linear fit and exclusion rules for degenerate cases (identical responses or empty sets) appear in Sections 3.2 and 4.3. We agree the abstract is overly terse on these foundational elements and will revise it to include a concise parenthetical definition of a_n together with a reference to the information-theoretic grounding. revision: yes
Referee: [Abstract] McDiv benchmark evaluation (prompt_gen set): The OCA of 0.846 is equated with validity for human notions of diversity, yet the manuscript provides no ablation demonstrating that the progressive conditioning term a_n remains stable when response length, lexical overlap, or surface-form similarity is controlled. Without such controls, it is unclear whether the score reflects semantic/creative variety or model-specific predictability artifacts.

Authors: The McDiv evaluation reports correlation with human diversity judgments, but we did not include explicit ablations that hold response length, lexical overlap, or surface-form similarity fixed while varying semantic content. We will add such controlled ablations in the revised manuscript to isolate the contribution of the progressive conditioning term a_n and to rule out predictability artifacts. revision: yes
Referee: [Abstract] OLMo-2-7B post-training pipeline: The monotonic drop across stages is presented as detecting the relevant form of diversity loss, but this conclusion rests on the unverified assumption that the base-model permutation pipeline isolates diversity independently of ordering bias or non-diversity factors; the single-forward-pass design is described but not tested for these confounds.

Authors: The design averages a_n across multiple random permutations to reduce ordering effects, with each permutation evaluated in a single forward pass. Nevertheless, we have not reported dedicated experiments quantifying residual ordering bias or other potential confounds. We will add these targeted sensitivity analyses in the revision to confirm that the observed monotonic decline is attributable to diversity reduction rather than pipeline artifacts. revision: yes

Circularity Check

0 steps flagged

No significant circularity; metric defined directly from log-probabilities and validated externally

The paper defines the Decan metric D_Ca_n = C × a_n explicitly as a per-byte score extracted from per-token log-probabilities of a base model θ under in-context permutations in a single forward pass, presented as grounded in information theory with no embedding model or human labels required. It reports external validation on Tevet and Berant's McDiv benchmark (OCA 0.846 on prompt_gen, vs. SentBERT 0.897) and observational results on OLMo-2-7B stages showing monotonic drop. No quoted equations or text indicate that C or a_n are fitted to the McDiv labels, that the metric reduces to its own evaluation targets by construction, or that any uniqueness theorem or ansatz is smuggled via self-citation. The derivation chain remains self-contained against the external benchmark.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

Only abstract available; ledger is therefore incomplete and based solely on stated claims.

free parameters (1)

C
Scaling constant in the definition D_Ca_n = C × a_n; no value or fitting procedure given.

axioms (1)

domain assumption: Language-model in-context learning on response permutations can detect a wide range of similarities relevant to diversity
Stated as grounding in information theory in the abstract.

pith-pipeline@v0.9.1-grok · 5809 in / 1305 out tokens · 23905 ms · 2026-06-28T14:49:08.288781+00:00 · methodology

Reference graph

Works this paper leans on

29 extracted references · 6 canonical work pages

[1]

a is b" fail to learn

Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A. C., Korbak, T., and Evans, O. The reversal curse: Llms trained on "a is b" fail to learn "b is a", 2023. URL http://arxiv.org/abs/2309.12288v4

arXiv 2023
[2]

Brown, T. B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D. M., Wu, J., Winter, C., Hesse, C., Chen, M., Sigler, E., Litwin, M., Gray, S., Chess, B., Clark, J., Berner, C., McCandlish, S., Radford, ...

Pith/arXiv arXiv 2020
[3]

Crutchfield, J. P. and Feldman, D. P. Regularities unseen, randomness observed: Levels of entropy convergence. Chaos: An Interdisciplinary Journal of Nonlinear Science, 13 0 (1): 0 25–54, March 2003. ISSN 1089-7682. doi:10.1063/1.1530990. URL http://dx.doi.org/10.1063/1.1530990

work page doi:10.1063/1.1530990 2003
[4]

Replicability analysis for natural language processing: Testing significance with multiple datasets

Dror, R., Baumer, G., Bogomolov, M., and Reichart, R. Replicability analysis for natural language processing: Testing significance with multiple datasets. Transactions of the Association for Computational Linguistics, 5: 0 471--486, 2017. doi:10.1162/tacl_a_00074. URL https://aclanthology.org/Q17-1033/

work page doi:10.1162/tacl_a_00074 2017
[5]

The hitchhiker ' s guide to testing statistical significance in natural language processing

Dror, R., Baumer, G., Shlomov, S., and Reichart, R. The hitchhiker ' s guide to testing statistical significance in natural language processing. In Gurevych, I. and Miyao, Y. (eds.), Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp.\ 1383--1392, Melbourne, Australia, July 2018. Association...

work page doi:10.18653/v1/p18-1128 2018
[6]

and Black, A

Du, W. and Black, A. W. Boosting dialog response generation. In Korhonen, A., Traum, D., and M \`a rquez, L. (eds.), Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp.\ 38--43, Florence, Italy, July 2019. Association for Computational Linguistics. doi:10.18653/v1/P19-1005. URL https://aclanthology.org/P19-1005/

work page doi:10.18653/v1/p19-1005 2019
[7]

Dubois, Y., Li, X., Taori, R., Zhang, T., Gulrajani, I., Ba, J., Guestrin, C., Liang, P., and Hashimoto, T. B. Alpacafarm: A simulation framework for methods that learn from human feedback, 2023. URL http://arxiv.org/abs/2305.14387v4

arXiv 2023
[8]

The pile: An 800gb dataset of diverse text for language modeling, 2020

Gao, L., Biderman, S., Black, S., Golding, L., Hoppe, T., Foster, C., Phang, J., He, H., Thite, A., Nabeshima, N., Presser, S., and Leahy, C. The pile: An 800gb dataset of diverse text for language modeling, 2020. URL http://arxiv.org/abs/2101.00027v1

Pith/arXiv arXiv 2020
[9]

Grattafiori, A., Dubey, A., Jauhri, A., Pandey, A., Kadian, A., Al-Dahle, A., Letman, A., Mathur, A., Schelten, A., Vaughan, A., Yang, A., Fan, A., Goyal, A., Hartshorn, A., Yang, A., Mitra, A., Sravankumar, A., Korenev, A., Hinsvark, A., Rao, A., Zhang, A., Rodriguez, A., Gregerson, A., Spataru, A., Roziere, B., Biron, B., Tang, B., Chern, B., Caucheteux...

Pith/arXiv arXiv 2024
[10]

Benchmarking linguistic diversity of large language models, 2024

Guo, Y., Shang, G., and Clavel, C. Benchmarking linguistic diversity of large language models, 2024. URL http://arxiv.org/abs/2412.10271v2

arXiv 2024
[11]

Understanding the effects of rlhf on llm generalisation and diversity, 2023

Kirk, R., Mediratta, I., Nalmpantis, C., Luketina, J., Hambro, E., Grefenstette, E., and Raileanu, R. Understanding the effects of rlhf on llm generalisation and diversity, 2023. URL http://arxiv.org/abs/2310.06452v3

Pith/arXiv arXiv 2023
[12]

A diversity-promoting objective function for neural conversation models

Li, J., Galley, M., Brockett, C., Gao, J., and Dolan, B. A diversity-promoting objective function for neural conversation models. In Knight, K., Nenkova, A., and Rambow, O. (eds.), Proceedings of the 2016 Conference of the North A merican Chapter of the Association for Computational Linguistics: Human Language Technologies , pp.\ 110--119, San Diego, Cali...

work page doi:10.18653/v1/n16-1014 2016
[13]

Llama 3.2: Revolutionizing edge AI and vision with open, customizable models

Meta AI . Llama 3.2: Revolutionizing edge AI and vision with open, customizable models. Meta AI Blog, September 2024. URL https://ai.meta.com/blog/llama-3-2-connect-2024-vision-edge-mobile-devices/

2024
[14]

W., Liu, J., Malik, S., Merrill, W., Miranda, L

OLMo, T., Walsh, P., Soldaini, L., Groeneveld, D., Lo, K., Arora, S., Bhagia, A., Gu, Y., Huang, S., Jordan, M., Lambert, N., Schwenk, D., Tafjord, O., Anderson, T., Atkinson, D., Brahman, F., Clark, C., Dasigi, P., Dziri, N., Ettinger, A., Guerquin, M., Heineman, D., Ivison, H., Koh, P. W., Liu, J., Malik, S., Merrill, W., Miranda, L. J. V., Morrison, J....

Pith/arXiv arXiv 2024
[15]

and He, H

Padmakumar, V. and He, H. Does writing with language models reduce content diversity?, 2023. URL http://arxiv.org/abs/2309.05196v3

arXiv 2023
[16]

Is temperature the creativity parameter of large language models?, 2024

Peeperkorn, M., Kouwenhoven, T., Brown, D., and Jordanous, A. Is temperature the creativity parameter of large language models?, 2024. URL http://arxiv.org/abs/2405.00492v1

arXiv 2024
[17]

H., He, Z., and Feng, S

Qiu, T., Ismail, A. H., He, Z., and Feng, S. Self-improvement as coherence optimization: A theoretical account, 2026. URL http://arxiv.org/abs/2601.13566v1

arXiv 2026
[18]

Language models are unsupervised multitask learners

Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., and Sutskever, I. Language models are unsupervised multitask learners. Technical report, OpenAI, 2019. URL https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf

2019
[19]

and Gurevych, I

Reimers, N. and Gurevych, I. Sentence-bert: Sentence embeddings using siamese bert-networks, 2019. URL http://arxiv.org/abs/1908.10084v1

Pith/arXiv arXiv 2019
[20]

and Berant, J

Tevet, G. and Berant, J. Evaluating the evaluation of diversity in natural language generation. In Merlo, P., Tiedemann, J., and Tsarfaty, R. (eds.), Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, pp.\ 326--346, Online, April 2021. Association for Computational Linguistics. doi:10....

work page doi:10.18653/v1/2021.eacl-main.25 2021
[21]

N., Liu, L., Gottlieb, E., Lu, Y., Cho, K., Wu, J., Fei-Fei, L., Wang, L., Choi, Y., and Li, M

Wang, Z., Wang, K., Wang, Q., Zhang, P., Li, L., Yang, Z., Jin, X., Yu, K., Nguyen, M. N., Liu, L., Gottlieb, E., Lu, Y., Cho, K., Wu, J., Fei-Fei, L., Wang, L., Choi, Y., and Li, M. Ragen: Understanding self-evolution in llm agents via multi-turn reinforcement learning, 2025. URL http://arxiv.org/abs/2504.20073v2

Pith/arXiv arXiv 2025
[22]

Unsupervised elicitation of language models, 2025

Wen, J., Ankner, Z., Somani, A., Hase, P., Marks, S., Goldman-Wetzler, J., Petrini, L., Sleight, H., Burns, C., He, H., Feng, S., Perez, E., and Leike, J. Unsupervised elicitation of language models, 2025. URL http://arxiv.org/abs/2506.10139v2

arXiv 2025
[23]

Qwen2.5 technical report, 2024

Yang, A., Yang, B., Zhang, B., Hui, B., Zheng, B., Yu, B., Li, C., Liu, D., Huang, F., Wei, H., Lin, H., Yang, J., Tu, J., Zhang, J., Yang, J., Yang, J., Zhou, J., Lin, J., Dang, K., Lu, K., Bao, K., Yang, K., Yu, L., Li, M., Xue, M., Zhang, P., Zhu, Q., Men, R., Lin, R., Li, T., Tang, T., Xia, T., Ren, X., Ren, X., Fan, Y., Su, Y., Zhang, Y., Wan, Y., Li...

Pith/arXiv arXiv 2024
[24]

Qwen3 technical report, 2025

Yang, A., Li, A., Yang, B., Zhang, B., Hui, B., Zheng, B., Yu, B., Gao, C., Huang, C., Lv, C., Zheng, C., Liu, D., Zhou, F., Huang, F., Hu, F., Ge, H., Wei, H., Lin, H., Tang, J., Yang, J., Tu, J., Zhang, J., Yang, J., Yang, J., Zhou, J., Zhou, J., Lin, J., Dang, K., Bao, K., Yang, K., Yu, L., Deng, L., Li, M., Xue, M., Li, M., Zhang, P., Wang, P., Zhu, Q...

Pith/arXiv arXiv 2025
[25]

R., Manning, C

Zhang, J., Yu, S., Chong, D., Sicilia, A., Tomz, M. R., Manning, C. D., and Shi, W. Verbalized sampling: How to mitigate mode collapse and unlock llm diversity, 2025 a . URL http://arxiv.org/abs/2510.01171v3

arXiv 2025
[26]

Evaluating the evaluation of diversity in commonsense generation, 2025 b

Zhang, T., Peng, B., and Bollegala, D. Evaluating the evaluation of diversity in commonsense generation, 2025 b . URL http://arxiv.org/abs/2506.00514v1

arXiv 2025
[27]

Generating informative and diverse conversational responses via adversarial information maximization, 2018

Zhang, Y., Galley, M., Gao, J., Gan, Z., Li, X., Brockett, C., and Dolan, B. Generating informative and diverse conversational responses via adversarial information maximization, 2018. URL http://arxiv.org/abs/1809.05972v5

Pith/arXiv arXiv 2018
[28]

Zhang, Y., Diddee, H., Holm, S., Liu, H., Liu, X., Samuel, V., Wang, B., and Ippolito, D. Noveltybench: Evaluating language models for humanlike diversity, 2025c. URL http://arxiv.org/abs/2504.05228v4

[29]


Zhu, Y., Lu, S., Zheng, L., Guo, J., Zhang, W., Wang, J., and Yu, Y. Texygen: A benchmarking platform for text generation models, 2018. URL http://arxiv.org/abs/1802.01886v1

