pith. machine review for the scientific record.

arxiv: 2605.12906 · v1 · submitted 2026-05-13 · 💻 cs.LG · cs.AI

Recognition: unknown

Data Difficulty and the Generalization–Extrapolation Tradeoff in LLM Fine-Tuning

Authors on Pith: no claims yet

Pith reviewed 2026-05-14 19:56 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords data difficulty · supervised fine-tuning · LLM fine-tuning · generalization gap · extrapolation gap · PAC-Bayesian bounds · data selection · SFT

The pith

For any fixed data budget in LLM fine-tuning, an optimal difficulty level exists and moves toward harder examples as the budget grows.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines how the difficulty of selected training examples affects supervised fine-tuning of large language models. It finds that no single difficulty level works best for every dataset size. Instead, for a given number of examples there is a sweet spot in difficulty, and this sweet spot moves toward harder data once more examples are available. The pattern is traced to a tradeoff in which easy data narrows the in-distribution gap between training and test performance, while hard data improves performance outside that distribution. Controlled synthetic experiments and PAC-Bayesian bounds are used to isolate and quantify the two gaps.

Core claim

Data difficulty and dataset size interact through a generalization-extrapolation tradeoff. For small budgets, easier examples minimize the in-distribution generalization gap and raise performance. Larger budgets favor harder examples because they shrink the extrapolation gap to unseen cases. The location of the optimum is predicted by PAC-Bayesian bounds that depend on model capacity and data volume.

What carries the argument

The interplay between the in-distribution generalization gap and the extrapolation gap, formalized through PAC-Bayesian bounds.
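As a purely illustrative sketch (not the paper's actual model), the two-gap mechanism can be mocked up with a toy decomposition in which the generalization gap grows with difficulty but shrinks with data size, while the extrapolation gap shrinks with difficulty; the functional forms below are assumptions chosen only to reproduce the qualitative shift:

```python
import math

def toy_test_error(difficulty, n_examples):
    """Toy decomposition: irreducible error + generalization gap
    + extrapolation gap. Functional forms are illustrative
    assumptions, not taken from the paper."""
    gen_gap = difficulty / math.sqrt(n_examples)  # harder data widens the in-distribution gap; more data shrinks it
    extra_gap = 1.0 / (1.0 + difficulty)          # harder data shrinks the extrapolation gap
    return 0.1 + gen_gap + extra_gap

def optimal_difficulty(n_examples):
    """Grid-search the difficulty level minimizing toy test error."""
    grid = [0.25 * k for k in range(1, 41)]       # difficulties 0.25 .. 10.0
    return min(grid, key=lambda d: toy_test_error(d, n_examples))

for n in (100, 1_000, 10_000):
    print(n, optimal_difficulty(n))
```

Under these assumed forms the minimizer is roughly n^(1/4) − 1, so the optimal difficulty drifts upward as the budget grows, which is the qualitative pattern the paper reports.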

If this is right

  • For small fine-tuning sets, selecting easier data reduces the generalization gap and improves accuracy.
  • Once the data budget exceeds a threshold, selecting harder data improves extrapolation to out-of-distribution cases.
  • PAC-Bayesian bounds can be used to estimate the optimal difficulty level for given model size and data volume.
  • Difficulty-based data selection must be adjusted according to total budget rather than applied with a fixed threshold.
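The paper's exact bound is not reproduced in this review; for orientation, a standard McAllester-style PAC-Bayes bound has the following generic form, which shows why the complexity penalty fades as the budget n grows:

```latex
% Generic McAllester-style PAC-Bayes bound (the paper's exact statement
% may differ). With probability at least 1 - \delta over an i.i.d.
% sample S of size n, simultaneously for all posteriors Q over
% hypotheses, given a data-independent prior P:
\mathbb{E}_{h \sim Q}\, L(h)
  \;\le\;
\mathbb{E}_{h \sim Q}\, \hat{L}_S(h)
  \;+\;
\sqrt{\frac{\mathrm{KL}(Q \,\|\, P) + \ln \frac{2\sqrt{n}}{\delta}}{2n}}
```

As n increases the square-root term shrinks, so a larger data budget can absorb the higher complexity cost of harder examples, consistent with the budget-dependent optimum described above.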

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Curators of SFT datasets may need to measure difficulty distributions at different scales to locate the operating point.
  • The same tradeoff could inform data mixing strategies when difficulty is combined with other filters such as length or quality.
  • The mechanism suggests a way to decide when to add harder synthetic or augmented examples during scaling of fine-tuning runs.

Load-bearing premise

The controlled synthetic experiments and PAC-Bayesian analysis capture the dominant mechanism in real LLM fine-tuning on natural language data.

What would settle it

An experiment on real LLMs in which optimal difficulty does not shift toward harder data as the fine-tuning budget increases, or in which measured generalization and extrapolation gaps fail to track the observed performance changes, would falsify the account.

Figures

Figures reproduced from arXiv:2605.12906 by Jingzhao Zhang (IIIS, Shanghai Qi Zhi Institute), Siyuan Liu (IIIS), Tinghong Chen (College of AI, Tsinghua University), Xinghan Li (IIIS), Yifei Wang (Amazon AGI SF Lab).

Figure 1: Relationship between data difficulty mea…
Figure 3: Performance gains over different base models as a function of data size and difficulty, trained on …
Figure 4: One-dimensional slices of the 2D data size–difficulty experiment on Qwen-2.5-Math-7B from …
Figure 5: Performance gains over different base models on synthetic iGSM data as a function of data difficulty …
Figure 6: Decomposed test results for SFT experiments on the base model Ops[2–8]2k under data sizes of 5k …
Figure 7: Illustration of the two-gap decomposition in SFT. The generalization gap rises with difficulty, while …
Figure 8: DFT performance on synthetic iGSM data (base model Ops[2–8]2k) across various data difficulty …
Figure 9: An example from the iGSM dataset. In our work we fix the number of edges according to #edges = op · 4/3 + 1, so that difficulty is effectively controlled by op. Notice that in the iGSM setup, the problem length grows linearly with the number of operations, which is consistent with our length-based difficulty control discussed in previous sections. In the iGSM experiments, all models are trained with a b…
Figure 10: Performance gain over base model as a function of data size and difficulty, trained on the OpenMath …
Figure 11: Extension experiments on Llama models and science reasoning tasks. Data difficulty is measured …
Original abstract

Data selection during supervised fine-tuning (SFT) can critically change the behavior of large language models (LLMs). Although existing work has studied the effect of selecting data based on heuristics such as perplexity, difficulty, or length, the reported findings are often inconsistent or context-dependent. In this work, we systematically study the role of data difficulty in fine-tuning from both empirical and theoretical perspectives, and find that there is no universally optimal difficulty level; rather, its effectiveness depends on the dataset size. We show that for a fixed data budget, there exists an optimal data difficulty for SFT, and that this optimal difficulty shifts toward harder data as the data budget increases. To explain this phenomenon, we conduct controlled synthetic experiments that reveal a simple underlying mechanism: the interplay between the (in-distribution) generalization gap and the extrapolation gap. We further support this mechanism through a theoretical analysis using PAC-Bayesian generalization bounds. Overall, our results clarify how data size and difficulty jointly affect the trade-off between generalization and extrapolation in SFT, providing guidance for difficulty-based data selection under certain model and data conditions.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

1 major / 1 minor

Summary. The manuscript claims that there is no universally optimal data difficulty for supervised fine-tuning (SFT) of LLMs. For a fixed data budget, an optimal difficulty exists and shifts toward harder data as the budget increases. This is demonstrated empirically, explained mechanistically via controlled synthetic experiments that isolate the interplay between the in-distribution generalization gap and the extrapolation gap, and supported by PAC-Bayesian generalization bounds.

Significance. If the results hold, the work clarifies how data size and difficulty jointly determine the generalization-extrapolation tradeoff in SFT, offering concrete guidance for difficulty-based data selection under the studied model and data conditions. The combination of synthetic experiments and PAC-Bayesian analysis provides a mechanistic account that strengthens the empirical findings and distinguishes this contribution from heuristic-based prior work.

major comments (1)
  1. [Synthetic experiments] Synthetic experiments section: the difficulty binning threshold is identified as a free parameter; the central claim that an optimal difficulty exists and shifts with budget size would be strengthened by an explicit robustness check showing that the location of the optimum is insensitive to reasonable variations in this threshold.
minor comments (1)
  1. [Abstract] Abstract: the phrase 'under certain model and data conditions' is appropriately cautious but could be expanded by one sentence to indicate the scope (e.g., synthetic tasks or specific model scales) without lengthening the abstract excessively.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the constructive feedback and positive recommendation for minor revision. We address the single major comment below.

Point-by-point responses
  1. Referee: [Synthetic experiments] Synthetic experiments section: the difficulty binning threshold is identified as a free parameter; the central claim that an optimal difficulty exists and shifts with budget size would be strengthened by an explicit robustness check showing that the location of the optimum is insensitive to reasonable variations in this threshold.

    Authors: We agree that an explicit robustness check would strengthen the central claim. In the revised manuscript we will add a dedicated subsection to the synthetic experiments that varies the binning threshold over a range of reasonable values (e.g., the original threshold together with shifts of ±10% and ±20%). For each budget size we will report the location of the optimal difficulty bin and show that it remains stable across these threshold choices, confirming that the observed shift toward harder data is not an artifact of the particular binning parameter.

    revision: yes
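A robustness check of this kind can be sketched as follows; the difficulty scores, the base threshold, and the "optimum location" proxy are all hypothetical stand-ins for the paper's actual binning and evaluation procedure:

```python
def select_hard(examples, threshold):
    """Label examples above the threshold as 'hard' (illustrative rule)."""
    return [ex for ex in examples if ex["difficulty"] > threshold]

def optimum_location(examples, budget, threshold):
    """Stand-in for 'location of the optimal difficulty bin': here, the
    mean difficulty of the budget-many hardest selected examples."""
    pool = sorted(select_hard(examples, threshold),
                  key=lambda ex: ex["difficulty"], reverse=True)[:budget]
    return sum(ex["difficulty"] for ex in pool) / len(pool)

# Hypothetical difficulty scores in [0, 1].
examples = [{"difficulty": k / 100} for k in range(1, 101)]
base = 0.5
for shift in (-0.2, -0.1, 0.0, 0.1, 0.2):  # threshold varied by ±10% / ±20%
    loc = optimum_location(examples, budget=20, threshold=base * (1 + shift))
    print(f"threshold={base * (1 + shift):.2f} -> optimum location {loc:.3f}")
```

In a real check the "optimum location" would come from rerunning the SFT sweep per threshold; the point of the sketch is only the loop structure: perturb the free parameter, recompute the optimum, and report whether its location moves.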

Circularity Check

0 steps flagged

Minor self-citation risk but central claim remains independent

Full rationale

The paper grounds its main result in new controlled synthetic experiments isolating the generalization-extrapolation tradeoff plus standard PAC-Bayesian bounds. No equation or claim reduces by construction to a fitted parameter defined from the target quantity, nor does any load-bearing step rely on a self-citation chain that itself assumes the result. The derivation introduces an explanatory mechanism via fresh experiments rather than renaming known patterns or smuggling an ansatz through prior work. A low-level self-citation risk is noted but does not force the central claim, keeping the overall circularity low.

Axiom & Free-Parameter Ledger

1 free parameters · 2 axioms · 0 invented entities

The central claim rests on the assumption that the synthetic data distribution isolates generalization and extrapolation gaps in a manner representative of natural language, plus standard PAC-Bayesian assumptions on model priors and loss functions. No new entities are postulated. One free parameter is the precise definition of 'difficulty' used to bin examples, which is fitted or chosen per experiment.

free parameters (1)
  • difficulty binning threshold
    The cutoff used to label examples as easy or hard is chosen or fitted to produce the observed shift; its value is not derived from first principles.
axioms (2)
  • [standard math] PAC-Bayesian generalization bounds apply to the fine-tuned LLM under the chosen prior and loss
    Invoked to support the theoretical analysis of the generalization-extrapolation tradeoff.
  • [domain assumption] Synthetic task distributions faithfully reproduce the relevant generalization and extrapolation behavior of natural language data
    Required for the controlled experiments to explain real LLM fine-tuning.

pith-pipeline@v0.9.0 · 5537 in / 1584 out tokens · 41715 ms · 2026-05-14T19:56:54.951641+00:00 · methodology

discussion (0)

