pith. machine review for the scientific record.

arxiv: 2605.11845 · v1 · submitted 2026-05-12 · 💻 cs.CL

Recognition: no theorem link

Probabilistic Calibration Is a Trainable Capability in Language Models

Amal Zouaq, Davide Baldelli, Maryam Hashemzadeh, Sarath Chandar, Sruthi Kuriakose

Pith reviewed 2026-05-13 05:52 UTC · model grok-4.3

classification 💻 cs.CL
keywords probabilistic calibration · language model fine-tuning · stochastic generation · distribution sampling · structured sampling · randomness control · synthetic prompts

The pith

Fine-tuning language models on synthetic distribution prompts improves their ability to match user-specified randomness targets

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Language models frequently generate outputs whose randomness fails to match the precise constraints users request, such as sampling from a chosen probability distribution. The paper tests whether this calibration skill can be taught by fine-tuning on artificial prompts that require drawing samples from mathematical distributions. Two variants are compared: soft-target training that builds next-token probability targets from the desired distribution, and hard-target training that uses actual sampled completions. Across twelve models from four families, both methods raise sampling fidelity on new distribution families and unseen parameter values. A reader would care because reliable control over randomness would let models handle tasks like simulations or balanced answer sets more directly.

Core claim

Probabilistic calibration is a trainable capability in language models. Fine-tuning on synthetic prompts that require sampling from mathematical distributions raises structured-sampling fidelity on held-out distribution families and unseen parameter settings. Hard-target fine-tuning tends to perform best on numeric sampling tasks while soft-target fine-tuning transfers more strongly to broader stochastic benchmarks including open-ended random generation and answer-position balancing, though both can reduce performance on arithmetic reasoning.

What carries the argument

Calibration Fine-Tuning: fine-tuning language models on synthetic prompts that demand sampling from mathematical distributions, implemented either via soft-target trie-derived next-token probabilities or hard-target sampled completions from the target distribution.
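A toy sketch may help picture the soft-target construction: given a discretized target distribution over token sequences, each outcome's probability is accumulated along its prefixes and renormalized into per-prefix next-token targets. This is a minimal illustration of the trie idea under an assumed one-character tokenization, not the paper's implementation:

```python
from collections import defaultdict

def next_token_targets(outcomes):
    """Per-prefix next-token soft targets from a discretized target
    distribution, given as {token_sequence: probability}.

    For each prefix, a next token's target probability is the total mass
    of outcomes that continue the prefix with that token, renormalized.
    """
    mass = defaultdict(lambda: defaultdict(float))
    for tokens, p in outcomes.items():
        for i, tok in enumerate(tokens):
            mass[tokens[:i]][tok] += p
    return {
        prefix: {tok: m / sum(nxt.values()) for tok, m in nxt.items()}
        for prefix, nxt in mass.items()
    }

# Hypothetical discretized target: P("10")=0.5, P("12")=0.3, P("7")=0.2,
# tokenized one character per token.
soft = next_token_targets({("1", "0"): 0.5, ("1", "2"): 0.3, ("7",): 0.2})
# At the empty prefix, "1" carries mass 0.8 and "7" mass 0.2; after "1",
# "0" carries 0.625 and "2" carries 0.375.
```

Training on these per-prefix targets (instead of single sampled completions) is what distinguishes the soft-target variant from the hard-target one.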

If this is right

  • Both fine-tuning methods raise structured-sampling fidelity on held-out distribution families and unseen parameter settings.
  • Hard-target fine-tuning delivers stronger results on structured numeric sampling tasks.
  • Soft-target fine-tuning shows better transfer to open-ended random generation and multiple-choice answer-position balancing.
  • Calibration gains can reduce performance on downstream arithmetic reasoning tasks, with the size of the cost varying by model.
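The answer-position balancing in the third bullet can be quantified with, for example, the normalized entropy of the chosen answer letters (1.0 = perfectly uniform). The counts below are invented for illustration, and the metric is one plausible choice rather than necessarily the paper's:

```python
import math
from collections import Counter

# Hypothetical answer letters extracted from 400 multiple-choice
# responses of a calibrated model.
answers = ["A"] * 96 + ["B"] * 104 + ["C"] * 98 + ["D"] * 102

counts = Counter(answers)
total = sum(counts.values())
probs = [c / total for c in counts.values()]
# Shannon entropy divided by its maximum, log(#positions).
balance = -sum(p * math.log(p) for p in probs) / math.log(len(counts))
# balance is close to 1.0 here; a base model biased toward "A" would
# score noticeably lower.
```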

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same synthetic-prompt approach could be adapted to train control over other generation properties such as response length or stylistic variation.
  • Multi-objective fine-tuning that interleaves calibration examples with reasoning data might reduce the observed trade-offs with arithmetic performance.
  • Production systems could apply targeted calibration updates to improve reliability on user requests that explicitly call for controlled randomness.

Load-bearing premise

That gains measured on synthetic mathematical distribution prompts will transfer to real user prompts specifying randomness without unacceptable losses in other model capabilities.

What would settle it

Measure whether the fine-tuned models produce outputs whose empirical distributions match user-specified targets more closely than base models on a suite of natural-language prompts such as 'draw 100 samples from a normal distribution with mean 5 and standard deviation 2' or 'shuffle this list randomly', while also checking accuracy on separate arithmetic and reasoning benchmarks.
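The distribution-matching half of this check can be sketched with the one-dimensional Wasserstein-1 distance between numbers parsed from model outputs and reference draws from the target; for equal-size samples, W1 reduces to the mean absolute difference of sorted values. The simulated "model samples" and the normalization by the target standard deviation are illustrative assumptions:

```python
import numpy as np

def w1_1d(a, b):
    """W1 between two equal-size 1-D samples: mean |sorted differences|."""
    return float(np.mean(np.abs(np.sort(a) - np.sort(b))))

rng = np.random.default_rng(0)
target_mean, target_std = 5.0, 2.0
n = 1000

# Stand-ins for numbers parsed from model generations; in a real
# evaluation these would come from the fine-tuned model's outputs.
model_samples = rng.normal(5.2, 1.8, size=n)
reference = rng.normal(target_mean, target_std, size=n)

# Normalizing by the target's scale makes scores comparable across
# distributions (one plausible normalization choice).
score = w1_1d(model_samples, reference) / target_std
# A well-calibrated model drives this score down toward sampling noise.
```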

Figures

Figures reproduced from arXiv: 2605.11845 by Amal Zouaq, Davide Baldelli, Maryam Hashemzadeh, Sarath Chandar, Sruthi Kuriakose.

Figure 1: Calibration Fine-Tuning pipeline. A prompt defines target law …
Figure 2: Main structured-sampling result. Left: family-median normalized …
Figure 3: Qualitative OOD sampling examples. Each panel overlays the target with base and …
Figure 4: Per-distribution median normalized W1 on the held-out-family structured-sampling benchmark. Each panel corresponds to one held-out OOD distribution and compares sample fidelity for the base, soft-target, and hard-target models.
Figure 5: Per-distribution logit-level KL on the held-out-family structured-sampling benchmark …
Figure 6: Per-distribution logit-level KL on the unseen-parameter structured-sampling benchmark …
Figure 7: Per-distribution median normalized W1 on the unseen-parameter structured-sampling benchmark. Each panel corresponds to one unseen parameter setting and compares sample fidelity for the base, soft-target, and hard-target models.
Figure 8: Qualitative OOD sampling comparison for GPT-OSS-20B. Each panel overlays the target …
Figure 9: Per-prompt open-generation comparisons at the 90% probability-mass threshold …
Figure 10: Qualitative open-generation examples from repeated sampling. For selected prompts …
Figure 11: Per-task retention gp-IRT for base, soft-target, and hard-target models across the Tiny …
Figure 12: PALOMA perplexity by evaluation slice for the base, soft-target, and hard-target models.
Figure 13: PALOMA bits-per-byte by evaluation slice for the base, soft-target, and hard-target models.
Figure 14: Qualitative OOD sampling distributions for the output-discretization ablation. Colors …
Original abstract

Language models are increasingly used in settings where outputs must satisfy user-specified randomness constraints, yet their generation probabilities are often poorly calibrated to those targets. We study whether this capability can be improved directly through fine-tuning. Concretely, we fine-tune language models on synthetic prompts that require sampling from mathematical distributions, and compare two Calibration Fine-Tuning variants: a soft-target method that converts the desired output distribution into trie-derived next-token targets, and a hard-target method that trains on sampled completions from the same target distribution. Across 12 models spanning four families, both methods substantially improve structured-sampling fidelity on held-out distribution families and unseen parameter settings, showing that probabilistic calibration is a trainable capability. Under our selected training configurations, the two methods exhibit different empirical profiles: hard-target fine-tuning is often strongest on structured numeric sampling, while soft-target fine-tuning performs better on broader stochastic generation benchmarks, including open-ended random generation, multiple-choice answer-position balancing, and NoveltyBench. The gains sometimes reduce downstream capability, especially arithmetic reasoning, with costs varying by model. Overall, our results show that probabilistic calibration can be improved through fine-tuning, with our hard-target configuration favoring exact numeric fidelity and our soft-target configuration favoring broader stochastic transfer. Code is available at https://github.com/chandar-lab/calibration-finetuning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper claims that probabilistic calibration to user-specified randomness constraints is a trainable capability in language models. The authors fine-tune 12 models across four families on synthetic prompts requiring sampling from mathematical distributions, using two variants: soft-target fine-tuning (converting targets to trie-derived next-token probabilities) and hard-target fine-tuning (training on sampled completions from the target distribution). Both methods yield substantial gains in structured sampling fidelity on held-out distribution families, unseen parameters, and broader stochastic benchmarks (open-ended random generation, answer-position balancing, NoveltyBench), with differentiated profiles (hard-target stronger on numeric fidelity, soft-target on transfer) and some variable costs to downstream tasks like arithmetic reasoning.

Significance. If the empirical results hold, the work establishes that calibration is not fixed but can be directly improved via fine-tuning, with the public code release at https://github.com/chandar-lab/calibration-finetuning providing a reproducible basis for follow-up. The multi-model, multi-family evaluation and differentiation between methods add value for understanding trade-offs in stochastic generation. Significance is limited by the narrow evaluation scope, as gains may not extend to typical user prompts with implicit randomness constraints.

major comments (2)
  1. [Abstract and Evaluation] The central claim that fine-tuning installs a general 'probabilistic calibration capability' (Abstract) rests on the assumption that improvements transfer beyond the training distribution. However, all reported gains are measured on synthetic mathematical distribution prompts, held-out families, and closely related benchmarks; no experiments evaluate prompts in which randomness constraints are embedded in complex, open-ended natural-language instructions (e.g., conditional or attribute-based randomness without explicit syntax or scaffolds). This leaves open whether gains reflect better parsing of distribution syntax rather than internal probability calibration.
  2. [Results] While the abstract states 'consistent gains' and 'substantially improve' across 12 models, the manuscript does not report statistical significance tests, confidence intervals, or exact effect sizes for the fidelity improvements. Without these, it is difficult to assess whether the differentiated profiles between soft- and hard-target methods are robust or whether the observed costs to arithmetic reasoning are systematic.
minor comments (1)
  1. [Abstract] The abstract and introduction could more explicitly define the metrics used for 'structured-sampling fidelity' and 'broader stochastic generation benchmarks' to aid readers in interpreting the differentiated method profiles.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which help clarify the scope and presentation of our results. We respond to each major comment below and indicate the revisions we will make.

Point-by-point responses
  1. Referee: [Abstract and Evaluation] The central claim that fine-tuning installs a general 'probabilistic calibration capability' (Abstract) rests on the assumption that improvements transfer beyond the training distribution. However, all reported gains are measured on synthetic mathematical distribution prompts, held-out families, and closely related benchmarks; no experiments evaluate prompts in which randomness constraints are embedded in complex, open-ended natural-language instructions (e.g., conditional or attribute-based randomness without explicit syntax or scaffolds). This leaves open whether gains reflect better parsing of distribution syntax rather than internal probability calibration.

    Authors: We appreciate the referee's emphasis on distinguishing syntactic parsing from genuine calibration. Our experimental design uses controlled synthetic distributions precisely to isolate the calibration objective and measure exact fidelity on held-out families and unseen parameters. The observed transfer to broader stochastic benchmarks—including open-ended random generation, answer-position balancing, and NoveltyBench—indicates that improvements generalize beyond the specific training syntax. Nevertheless, we agree that prompts embedding implicit randomness constraints within complex natural-language instructions lie outside our current evaluation. We will add a dedicated paragraph in the Discussion section acknowledging this limitation and framing our results as evidence for a trainable capability within the tested regimes, while noting the need for future work on more naturalistic prompts. revision: partial

  2. Referee: [Results] While the abstract states 'consistent gains' and 'substantially improve' across 12 models, the manuscript does not report statistical significance tests, confidence intervals, or exact effect sizes for the fidelity improvements. Without these, it is difficult to assess whether the differentiated profiles between soft- and hard-target methods are robust or whether the observed costs to arithmetic reasoning are systematic.

    Authors: The referee correctly notes the absence of formal statistical reporting. In the revised manuscript we will include paired statistical tests (e.g., Wilcoxon signed-rank across models), 95% confidence intervals, and effect sizes (Cohen's d) for the primary fidelity metrics, the soft- versus hard-target differences, and the downstream arithmetic costs. These additions will be placed in the Results section and summarized in a new table to allow readers to evaluate robustness directly. revision: yes
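As a sketch of the promised statistics, a paired Wilcoxon signed-rank test plus Cohen's d on the paired differences could look like this; the per-model fidelity scores below are invented placeholders, not the paper's results:

```python
import numpy as np
from scipy.stats import wilcoxon

# Hypothetical normalized-W1 scores (lower is better) for 12 models,
# base vs. hard-target fine-tuned.
base = np.array([0.92, 0.85, 1.10, 0.78, 0.95, 0.88,
                 1.02, 0.91, 0.83, 0.99, 0.87, 0.93])
tuned = np.array([0.41, 0.38, 0.52, 0.35, 0.47, 0.40,
                  0.55, 0.44, 0.37, 0.50, 0.42, 0.45])

diff = base - tuned
stat, p = wilcoxon(base, tuned)       # paired, nonparametric test
d = diff.mean() / diff.std(ddof=1)    # Cohen's d on paired differences
# Every model improves in this toy data, so p is small and d is a large
# positive effect.
```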

Circularity Check

0 steps flagged

Purely empirical study with no derivation chain or self-referential reductions

Full rationale

The paper presents an empirical investigation of fine-tuning language models on synthetic mathematical distribution prompts, measuring improvements in sampling fidelity on held-out distribution families, unseen parameters, and stochastic benchmarks. No equations, derivations, or first-principles results are claimed; the central claim that probabilistic calibration is trainable rests entirely on experimental outcomes rather than any reduction of predictions to fitted inputs by construction. No self-citation load-bearing steps, uniqueness theorems, or ansatz smuggling appear in the load-bearing arguments. The study is self-contained against its own benchmarks and does not invoke external mathematical facts that collapse back to the authors' prior work.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

This is an empirical machine-learning study. The central claim rests on the experimental outcomes rather than new theoretical axioms or invented entities.

free parameters (1)
  • fine-tuning hyperparameters (learning rate, epochs, batch size, etc.)
    Chosen to achieve the reported gains; specific values not given in the abstract.
axioms (1)
  • domain assumption Synthetic prompts requiring sampling from mathematical distributions are representative of general user-specified randomness constraints.
    The paper trains and evaluates exclusively on synthetic math distributions and assumes transfer to broader stochastic generation tasks.

pith-pipeline@v0.9.0 · 5545 in / 1230 out tokens · 55124 ms · 2026-05-13T05:52:33.421891+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

103 extracted references · 103 canonical work pages · 2 internal anchors
