pith. machine review for the scientific record.

arxiv: 2605.11845 · v1 · submitted 2026-05-12 · 💻 cs.CL

Recognition: no theorem link

Probabilistic Calibration Is a Trainable Capability in Language Models

Amal Zouaq, Davide Baldelli, Maryam Hashemzadeh, Sarath Chandar, Sruthi Kuriakose

Pith reviewed 2026-05-13 05:52 UTC · model grok-4.3

classification 💻 cs.CL
keywords probabilistic calibration · language model fine-tuning · stochastic generation · distribution sampling · structured sampling · randomness control · synthetic prompts

The pith

Fine-tuning language models on synthetic distribution prompts improves their ability to match user-specified randomness targets

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Language models frequently generate outputs whose randomness fails to match the precise constraints users request, such as sampling from a chosen probability distribution. The paper tests whether this calibration skill can be taught by fine-tuning on artificial prompts that require drawing samples from mathematical distributions. Two variants are compared: soft-target training that builds next-token probability targets from the desired distribution, and hard-target training that uses actual sampled completions. Across twelve models from four families, both methods raise sampling fidelity on new distribution families and unseen parameter values. A reader would care because reliable control over randomness would let models handle tasks like simulations or balanced answer sets more directly.

Core claim

Probabilistic calibration is a trainable capability in language models. Fine-tuning on synthetic prompts that require sampling from mathematical distributions raises structured-sampling fidelity on held-out distribution families and unseen parameter settings. Hard-target fine-tuning tends to perform best on numeric sampling tasks while soft-target fine-tuning transfers more strongly to broader stochastic benchmarks including open-ended random generation and answer-position balancing, though both can reduce performance on arithmetic reasoning.

What carries the argument

Calibration Fine-Tuning: fine-tuning language models on synthetic prompts that demand sampling from mathematical distributions, implemented either via soft-target trie-derived next-token probabilities or hard-target sampled completions from the target distribution.
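A toy sketch may help picture the soft-target construction: given a discretized target distribution over token sequences, each outcome's probability is accumulated along its prefixes and renormalized into per-prefix next-token targets. This is a minimal illustration of the trie idea under an assumed one-character tokenization, not the paper's implementation:

```python
from collections import defaultdict

def next_token_targets(outcomes):
    """Per-prefix next-token soft targets from a discretized target
    distribution, given as {token_sequence: probability}.

    For each prefix, a next token's target probability is the total mass
    of outcomes that continue the prefix with that token, renormalized.
    """
    mass = defaultdict(lambda: defaultdict(float))
    for tokens, p in outcomes.items():
        for i, tok in enumerate(tokens):
            mass[tokens[:i]][tok] += p
    return {
        prefix: {tok: m / sum(nxt.values()) for tok, m in nxt.items()}
        for prefix, nxt in mass.items()
    }

# Hypothetical discretized target: P("10")=0.5, P("12")=0.3, P("7")=0.2,
# tokenized one character per token.
soft = next_token_targets({("1", "0"): 0.5, ("1", "2"): 0.3, ("7",): 0.2})
# At the empty prefix, "1" carries mass 0.8 and "7" mass 0.2; after "1",
# "0" carries 0.625 and "2" carries 0.375.
```

Training on these per-prefix targets (instead of single sampled completions) is what distinguishes the soft-target variant from the hard-target one.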

If this is right

  • Both fine-tuning methods raise structured-sampling fidelity on held-out distribution families and unseen parameter settings.
  • Hard-target fine-tuning delivers stronger results on structured numeric sampling tasks.
  • Soft-target fine-tuning shows better transfer to open-ended random generation and multiple-choice answer-position balancing.
  • Calibration gains can reduce performance on downstream arithmetic reasoning tasks, with the size of the cost varying by model.
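The answer-position balancing in the third bullet can be quantified with, for example, the normalized entropy of the chosen answer letters (1.0 = perfectly uniform). The counts below are invented for illustration, and the metric is one plausible choice rather than necessarily the paper's:

```python
import math
from collections import Counter

# Hypothetical answer letters extracted from 400 multiple-choice
# responses of a calibrated model.
answers = ["A"] * 96 + ["B"] * 104 + ["C"] * 98 + ["D"] * 102

counts = Counter(answers)
total = sum(counts.values())
probs = [c / total for c in counts.values()]
# Shannon entropy divided by its maximum, log(#positions).
balance = -sum(p * math.log(p) for p in probs) / math.log(len(counts))
# balance is close to 1.0 here; a base model biased toward "A" would
# score noticeably lower.
```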

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same synthetic-prompt approach could be adapted to train control over other generation properties such as response length or stylistic variation.
  • Multi-objective fine-tuning that interleaves calibration examples with reasoning data might reduce the observed trade-offs with arithmetic performance.
  • Production systems could apply targeted calibration updates to improve reliability on user requests that explicitly call for controlled randomness.

Load-bearing premise

That gains measured on synthetic mathematical distribution prompts will transfer to real user prompts specifying randomness without unacceptable losses in other model capabilities.

What would settle it

Measure whether the fine-tuned models produce outputs whose empirical distributions match user-specified targets more closely than base models on a suite of natural-language prompts such as 'draw 100 samples from a normal distribution with mean 5 and standard deviation 2' or 'shuffle this list randomly', while also checking accuracy on separate arithmetic and reasoning benchmarks.
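The distribution-matching half of this check can be sketched with the one-dimensional Wasserstein-1 distance between numbers parsed from model outputs and reference draws from the target; for equal-size samples, W1 reduces to the mean absolute difference of sorted values. The simulated "model samples" and the normalization by the target standard deviation are illustrative assumptions:

```python
import numpy as np

def w1_1d(a, b):
    """W1 between two equal-size 1-D samples: mean |sorted differences|."""
    return float(np.mean(np.abs(np.sort(a) - np.sort(b))))

rng = np.random.default_rng(0)
target_mean, target_std = 5.0, 2.0
n = 1000

# Stand-ins for numbers parsed from model generations; in a real
# evaluation these would come from the fine-tuned model's outputs.
model_samples = rng.normal(5.2, 1.8, size=n)
reference = rng.normal(target_mean, target_std, size=n)

# Normalizing by the target's scale makes scores comparable across
# distributions (one plausible normalization choice).
score = w1_1d(model_samples, reference) / target_std
# A well-calibrated model drives this score down toward sampling noise.
```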

Figures

Figures reproduced from arXiv: 2605.11845 by Amal Zouaq, Davide Baldelli, Maryam Hashemzadeh, Sarath Chandar, Sruthi Kuriakose.

Figure 1: Calibration Fine-Tuning pipeline. A prompt defines target law …
Figure 2: Main structured-sampling result. Left: family-median normalized …
Figure 3: Qualitative OOD sampling examples. Each panel overlays the target with base and …
Figure 4: Per-distribution median normalized W1 on the held-out-family structured-sampling benchmark. Each panel corresponds to one held-out OOD distribution and compares sample fidelity for the base, soft-target, and hard-target models.
Figure 5: Per-distribution logit-level KL on the held-out-family structured-sampling benchmark …
Figure 6: Per-distribution logit-level KL on the unseen-parameter structured-sampling benchmark …
Figure 7: Per-distribution median normalized W1 on the unseen-parameter structured-sampling benchmark. Each panel corresponds to one unseen parameter setting and compares sample fidelity for the base, soft-target, and hard-target models.
Figure 8: Qualitative OOD sampling comparison for GPT-OSS-20B. Each panel overlays the target …
Figure 9: Per-prompt open-generation comparisons at the 90% probability-mass threshold …
Figure 10: Qualitative open-generation examples from repeated sampling. For selected prompts …
Figure 11: Per-task retention gp-IRT for base, soft-target, and hard-target models across the Tiny …
Figure 12: PALOMA perplexity by evaluation slice for the base, soft-target, and hard-target models.
Figure 13: PALOMA bits-per-byte by evaluation slice for the base, soft-target, and hard-target models.
Figure 14: Qualitative OOD sampling distributions for the output-discretization ablation. Colors …
Original abstract

Language models are increasingly used in settings where outputs must satisfy user-specified randomness constraints, yet their generation probabilities are often poorly calibrated to those targets. We study whether this capability can be improved directly through fine-tuning. Concretely, we fine-tune language models on synthetic prompts that require sampling from mathematical distributions, and compare two Calibration Fine-Tuning variants: a soft-target method that converts the desired output distribution into trie-derived next-token targets, and a hard-target method that trains on sampled completions from the same target distribution. Across 12 models spanning four families, both methods substantially improve structured-sampling fidelity on held-out distribution families and unseen parameter settings, showing that probabilistic calibration is a trainable capability. Under our selected training configurations, the two methods exhibit different empirical profiles: hard-target fine-tuning is often strongest on structured numeric sampling, while soft-target fine-tuning performs better on broader stochastic generation benchmarks, including open-ended random generation, multiple-choice answer-position balancing, and NoveltyBench. The gains sometimes reduce downstream capability, especially arithmetic reasoning, with costs varying by model. Overall, our results show that probabilistic calibration can be improved through fine-tuning, with our hard-target configuration favoring exact numeric fidelity and our soft-target configuration favoring broader stochastic transfer. Code is available at https://github.com/chandar-lab/calibration-finetuning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper claims that probabilistic calibration to user-specified randomness constraints is a trainable capability in language models. The authors fine-tune 12 models across four families on synthetic prompts requiring sampling from mathematical distributions, using two variants: soft-target fine-tuning (converting targets to trie-derived next-token probabilities) and hard-target fine-tuning (training on sampled completions from the target distribution). Both methods yield substantial gains in structured sampling fidelity on held-out distribution families, unseen parameters, and broader stochastic benchmarks (open-ended random generation, answer-position balancing, NoveltyBench), with differentiated profiles (hard-target stronger on numeric fidelity, soft-target on transfer) and some variable costs to downstream tasks like arithmetic reasoning.

Significance. If the empirical results hold, the work establishes that calibration is not fixed but can be directly improved via fine-tuning, with the public code release at https://github.com/chandar-lab/calibration-finetuning providing a reproducible basis for follow-up. The multi-model, multi-family evaluation and differentiation between methods add value for understanding trade-offs in stochastic generation. Significance is limited by the narrow evaluation scope, as gains may not extend to typical user prompts with implicit randomness constraints.

major comments (2)
  1. [Abstract and Evaluation] The central claim that fine-tuning installs a general 'probabilistic calibration capability' (Abstract) rests on the assumption that improvements transfer beyond the training distribution. However, all reported gains are measured on synthetic mathematical distribution prompts, held-out families, and closely related benchmarks; no experiments evaluate prompts in which randomness constraints are embedded in complex, open-ended natural-language instructions (e.g., conditional or attribute-based randomness without explicit syntax or scaffolds). This leaves open whether gains reflect better parsing of distribution syntax rather than internal probability calibration.
  2. [Results] While the abstract states 'consistent gains' and 'substantially improve' across 12 models, the manuscript does not report statistical significance tests, confidence intervals, or exact effect sizes for the fidelity improvements. Without these, it is difficult to assess whether the differentiated profiles between soft- and hard-target methods are robust or whether the observed costs to arithmetic reasoning are systematic.
minor comments (1)
  1. [Abstract] The abstract and introduction could more explicitly define the metrics used for 'structured-sampling fidelity' and 'broader stochastic generation benchmarks' to aid readers in interpreting the differentiated method profiles.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which help clarify the scope and presentation of our results. We respond to each major comment below and indicate the revisions we will make.

Point-by-point responses
  1. Referee: [Abstract and Evaluation] The central claim that fine-tuning installs a general 'probabilistic calibration capability' (Abstract) rests on the assumption that improvements transfer beyond the training distribution. However, all reported gains are measured on synthetic mathematical distribution prompts, held-out families, and closely related benchmarks; no experiments evaluate prompts in which randomness constraints are embedded in complex, open-ended natural-language instructions (e.g., conditional or attribute-based randomness without explicit syntax or scaffolds). This leaves open whether gains reflect better parsing of distribution syntax rather than internal probability calibration.

    Authors: We appreciate the referee's emphasis on distinguishing syntactic parsing from genuine calibration. Our experimental design uses controlled synthetic distributions precisely to isolate the calibration objective and measure exact fidelity on held-out families and unseen parameters. The observed transfer to broader stochastic benchmarks—including open-ended random generation, answer-position balancing, and NoveltyBench—indicates that improvements generalize beyond the specific training syntax. Nevertheless, we agree that prompts embedding implicit randomness constraints within complex natural-language instructions lie outside our current evaluation. We will add a dedicated paragraph in the Discussion section acknowledging this limitation and framing our results as evidence for a trainable capability within the tested regimes, while noting the need for future work on more naturalistic prompts. revision: partial

  2. Referee: [Results] While the abstract states 'consistent gains' and 'substantially improve' across 12 models, the manuscript does not report statistical significance tests, confidence intervals, or exact effect sizes for the fidelity improvements. Without these, it is difficult to assess whether the differentiated profiles between soft- and hard-target methods are robust or whether the observed costs to arithmetic reasoning are systematic.

    Authors: The referee correctly notes the absence of formal statistical reporting. In the revised manuscript we will include paired statistical tests (e.g., Wilcoxon signed-rank across models), 95% confidence intervals, and effect sizes (Cohen's d) for the primary fidelity metrics, the soft- versus hard-target differences, and the downstream arithmetic costs. These additions will be placed in the Results section and summarized in a new table to allow readers to evaluate robustness directly. revision: yes
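As a sketch of the promised statistics, a paired Wilcoxon signed-rank test plus Cohen's d on the paired differences could look like this; the per-model fidelity scores below are invented placeholders, not the paper's results:

```python
import numpy as np
from scipy.stats import wilcoxon

# Hypothetical normalized-W1 scores (lower is better) for 12 models,
# base vs. hard-target fine-tuned.
base = np.array([0.92, 0.85, 1.10, 0.78, 0.95, 0.88,
                 1.02, 0.91, 0.83, 0.99, 0.87, 0.93])
tuned = np.array([0.41, 0.38, 0.52, 0.35, 0.47, 0.40,
                  0.55, 0.44, 0.37, 0.50, 0.42, 0.45])

diff = base - tuned
stat, p = wilcoxon(base, tuned)       # paired, nonparametric test
d = diff.mean() / diff.std(ddof=1)    # Cohen's d on paired differences
# Every model improves in this toy data, so p is small and d is a large
# positive effect.
```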

Circularity Check

0 steps flagged

Purely empirical study with no derivation chain or self-referential reductions

Full rationale

The paper presents an empirical investigation of fine-tuning language models on synthetic mathematical distribution prompts, measuring improvements in sampling fidelity on held-out distribution families, unseen parameters, and stochastic benchmarks. No equations, derivations, or first-principles results are claimed; the central claim that probabilistic calibration is trainable rests entirely on experimental outcomes rather than any reduction of predictions to fitted inputs by construction. No self-citation load-bearing steps, uniqueness theorems, or ansatz smuggling appear in the load-bearing arguments. The study is self-contained against its own benchmarks and does not invoke external mathematical facts that collapse back to the authors' prior work.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

This is an empirical machine-learning study. The central claim rests on the experimental outcomes rather than new theoretical axioms or invented entities.

free parameters (1)
  • fine-tuning hyperparameters (learning rate, epochs, batch size, etc.)
    Chosen to achieve the reported gains; specific values not given in the abstract.
axioms (1)
  • domain assumption Synthetic prompts requiring sampling from mathematical distributions are representative of general user-specified randomness constraints.
    The paper trains and evaluates exclusively on synthetic math distributions and assumes transfer to broader stochastic generation tasks.

pith-pipeline@v0.9.0 · 5545 in / 1230 out tokens · 55124 ms · 2026-05-13T05:52:33.421891+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

103 extracted references · 103 canonical work pages · 2 internal anchors
