Towards Efficient LLMs Annealing with Principled Sample Selection

Guang Zhang; Jianing Hao; Wanbo Zhang; Yuanjian Xu; Zhong Li

arxiv: 2605.31175 · v1 · pith:LH5DC6GMnew · submitted 2026-05-29 · 💻 cs.CL

Pith reviewed 2026-06-28 22:34 UTC · model grok-4.3

classification 💻 cs.CL

keywords LLM pre-training · annealing phase · sample selection · loss landscape · constrained optimization · Hessian · gradient constraints · DiReCT

The pith

DiReCT selects LLM annealing samples by aligning per-sample gradients with Hessian eigen-direction constraints for optimal convergence.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper characterizes the annealing phase of LLM pre-training through the spectral geometry of the loss landscape. It argues that optimal convergence requires gradient updates to satisfy heterogeneous directional constraints across different eigen-directions. Building on this, the work formulates sample selection as a constrained optimization problem. DiReCT solves it by imposing explicit directional constraints on gradients based on Hessian spectral properties to identify samples aligned with the curvature-aware descent path. Experiments across model scales show this approach delivers state-of-the-art results where prior heuristic methods fall short.

Core claim

DiReCT reformulates sample selection in the annealing stage as a constrained optimization problem. By imposing explicit directional constraints on per-sample gradients based on the spectral properties of the Hessian, DiReCT identifies samples that align with the optimal curvature-aware descent path. Extensive experiments across various model scales demonstrate that DiReCT consistently achieves state-of-the-art performance.
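
To make the idea of "directional constraints on per-sample gradients" concrete, here is a minimal sketch, assuming a small dense Hessian and explicitly available per-sample gradients. It is not the paper's implementation: the function names `split_subspaces` and `directional_energy`, the fixed cut-off `k_stiff`, and the toy numbers are all hypothetical. The sketch splits the Hessian eigenvectors into a high-curvature ("stiff") and a low-curvature ("flat") basis and measures how much of each sample's gradient lies in each.

```python
import numpy as np

def split_subspaces(hessian, k_stiff):
    """Eigen-decompose a symmetric Hessian and split it into stiff/flat bases."""
    eigvals, eigvecs = np.linalg.eigh(hessian)      # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1]               # re-sort: highest curvature first
    eigvecs = eigvecs[:, order]
    return eigvecs[:, :k_stiff], eigvecs[:, k_stiff:]

def directional_energy(per_sample_grads, stiff_basis, flat_basis):
    """Squared norm of each per-sample gradient inside each subspace."""
    stiff = np.sum((per_sample_grads @ stiff_basis) ** 2, axis=1)
    flat = np.sum((per_sample_grads @ flat_basis) ** 2, axis=1)
    return stiff, flat

# Toy example (made-up values): 4 parameters, one high-curvature direction.
hessian = np.diag([100.0, 1.0, 0.5, 0.1])
grads = np.array([
    [3.0, 0.0, 0.0, 0.0],   # sample 0: gradient lies only along the stiff direction
    [0.0, 1.0, 1.0, 1.0],   # sample 1: gradient lies only in the flat subspace
])
stiff_basis, flat_basis = split_subspaces(hessian, k_stiff=1)
stiff_energy, flat_energy = directional_energy(grads, stiff_basis, flat_basis)
```

Under this reading, a selector would favor samples like sample 1 (high flat energy, low stiff energy). How the paper weights or thresholds the two quantities is not stated in the text above.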

What carries the argument

DiReCT (Directionally-Restrained Constrained Training), a framework that reformulates data selection by imposing directional constraints on per-sample gradients drawn from the Hessian's eigen-directions.

Load-bearing premise

Optimal convergence in the annealing phase requires gradient updates to satisfy heterogeneous directional constraints across eigen-directions of the loss landscape.
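
A standard textbook illustration, not the paper's derivation, shows why different eigen-directions can call for different treatment. For gradient descent on a quadratic loss, each eigen-direction evolves independently and is scaled by (1 - lr * lambda_i) per step, so a single learning rate makes the stiff direction oscillate while the flat direction barely moves. The eigenvalues, learning rate, and step count below are made-up toy values.

```python
import numpy as np

def gd_trajectory(eigenvalues, x0, lr, steps):
    """Gradient descent on 0.5 * sum(lambda_i * x_i^2), tracked per eigen-direction."""
    lam = np.asarray(eigenvalues, dtype=float)
    x = np.asarray(x0, dtype=float).copy()
    path = [x.copy()]
    for _ in range(steps):
        x = x - lr * lam * x          # each coordinate is scaled by (1 - lr * lambda_i)
        path.append(x.copy())
    return np.array(path)

eigenvalues = [100.0, 0.1]            # one stiff and one flat direction (toy values)
path = gd_trajectory(eigenvalues, x0=[1.0, 1.0], lr=0.019, steps=50)
stiff_final, flat_final = path[-1]
```

Here the stiff coordinate flips sign at every step (its factor is about -0.9) while the flat coordinate shrinks by under 10% over 50 steps. Whether this picture carries over to the annealing regime of an LLM is exactly what the premise above asserts and what the review below questions.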

What would settle it

A controlled run on the same models and annealing data where random or heuristic selection matches or exceeds DiReCT performance on final model quality metrics would falsify the central claim.

Figures

Figures reproduced from arXiv: 2605.31175 by Guang Zhang, Jianing Hao, Wanbo Zhang, Yuanjian Xu, Zhong Li.

**Figure 1.** A geometric comparison of optimization dynamics across training phases. In the stable training stage (a), high learning rates and uncurated data lead to high-variance oscillations across the steep walls of the loss landscape, resulting in a weak descent signal. During the annealing phase (b), our proposed mechanism suppresses transverse noise while maximizing the longitudinal signal, allowing the model to …

**Figure 2.** Hessian Eigenvalue and Eigenvector Analysis for GPT-2-Medium. (a) The eigenvalue spectrum exhibits a sharp power-law decay, where the green region denotes the stiff subspace and the orange region represents the flat subspace. (b) PCA projection of eigenvectors, illustrating the directional concentration of high-energy components versus the diffusion of low-energy ones. (c) Spectral elbow detection based on…

**Figure 3.** Sample selection under two extreme regimes. We probe DiReCT's selection behavior on the constructed annealing dataset under two extremes: (a) the high-loss extreme, showing the pre-training loss distribution of selected versus unselected samples; and (b) the short-length extreme, showing the sequence length distribution. DiReCT prioritizes high-loss, long-sequence samples that align with the flat curvature…

**Figure 4.** Overview of the annealing-phase dataset. (a) Sample distribution across sources: MathPile, StarCoderData, and the Dolma 3 Longmino Mix. (b) Train/validation split (90%/10%). (c) On-disk file sizes for train, validation, and total corpus.

A.2. Implementation Details

Baselines. All baselines retain the top 80% of the training set D_train ranked by a method-specific score, except Uniform Sampling, which draws …
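
The top-80% retention rule described for the baselines can be sketched as follows. This is a minimal illustration, assuming that a higher score means "keep" and that the retained count is rounded down. Neither detail is confirmed by the excerpt, and `retain_top_fraction` is a hypothetical helper name. The Uniform Sampling baseline is left out because its description is cut off.

```python
import numpy as np

def retain_top_fraction(scores, fraction=0.8):
    """Indices of the highest-scoring samples, keeping `fraction` of the set."""
    scores = np.asarray(scores, dtype=float)
    n_keep = int(np.floor(fraction * len(scores)))  # assumption: round down
    ranked = np.argsort(-scores, kind="stable")     # highest score first
    return np.sort(ranked[:n_keep])                 # return in dataset order

# Made-up scores for ten samples; any method-specific score could be plugged in.
scores = np.array([0.2, 0.9, 0.5, 0.1, 0.7, 0.3, 0.8, 0.4, 0.6, 0.0])
kept = retain_top_fraction(scores, fraction=0.8)
```

With these toy scores the two lowest-scoring samples (indices 3 and 9) are dropped. For a score where lower is better, such as perplexity, the sign would need to be flipped before ranking.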

Original abstract

The annealing phase is a pivotal convergence stage in LLM pre-training that ultimately determines final model quality. However, effectively selecting training data during this phase remains a key challenge. Current strategies rely on empirical heuristics, such as domain filtering or context extension, which lack a principled grounding in optimization theory. In this work, we characterize the annealing phase through the lens of the loss landscape's spectral geometry. We argue that optimal convergence requires gradient updates to satisfy heterogeneous constraints across different eigen-directions. Building on this insight, we formulate data selection as a problem of satisfying these directional constraints. To this end, we propose DiReCT (Directionally-Restrained Constrained Training), a novel framework that reformulates sample selection in the annealing stage as a constrained optimization problem. By imposing explicit directional constraints on per-sample gradients based on the spectral properties of the Hessian, DiReCT identifies samples that align with the optimal curvature-aware descent path. Extensive experiments across various model scales demonstrate that DiReCT consistently achieves state-of-the-art performance. For future research, code is available at https://github.com/xuyj233/Direct.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

DiReCT's constrained-optimization framing for annealing data selection rests on an asserted link from Hessian geometry to per-eigen-direction constraints that is never derived.

The one thing to know is that the paper's central theoretical move—claiming optimal annealing requires gradient updates to meet heterogeneous directional constraints from the Hessian spectrum—gets stated as an argument rather than shown as a derivation or theorem. That step is load-bearing for the whole DiReCT claim.

What the work actually does is reformulate sample selection as a constrained optimization problem that picks examples whose per-sample gradients align with those directions. It then reports experiments across model scales where the method beats common heuristics like domain filtering. The experiments are the concrete part; if the gains hold up under the usual controls for compute and data volume, that is usable information for people doing the final stage of pre-training.

The soft spot is exactly the missing derivation. The abstract and the framing treat the need for per-eigen-direction constraints as obvious from spectral geometry, but nothing connects the loss-landscape properties to the specific optimization problem they solve. Without that, the method is closer to a new heuristic dressed in curvature language than a principled replacement. There is also no discussion of how the Hessian is estimated at scale or whether the constraints are computed on held-out data, which leaves the circularity question open.
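
For context on the scale question: the text here does not say how the paper estimates the Hessian, so the following is only one standard matrix-free approach, not the authors' method. It approximates Hessian-vector products by a central difference of gradients and runs power iteration to recover the dominant eigenpair without ever forming the Hessian. The helper names, the step size `eps`, and the toy quadratic are assumptions for illustration.

```python
import numpy as np

def hvp_finite_difference(grad_fn, params, vector, eps=1e-4):
    """Approximate H @ vector with a central difference of gradients."""
    return (grad_fn(params + eps * vector) - grad_fn(params - eps * vector)) / (2 * eps)

def top_eigenpair(grad_fn, params, n_iters=200, seed=0):
    """Power iteration using only Hessian-vector products.

    Converges to the eigenvalue of largest magnitude, which is the top
    curvature direction when the Hessian is positive semi-definite.
    """
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(params.shape)
    v /= np.linalg.norm(v)
    for _ in range(n_iters):
        hv = hvp_finite_difference(grad_fn, params, v)
        v = hv / np.linalg.norm(hv)
    eigenvalue = float(v @ hvp_finite_difference(grad_fn, params, v))  # Rayleigh quotient
    return eigenvalue, v

# Toy quadratic loss 0.5 * x^T A x, whose Hessian is A everywhere.
A = np.diag([50.0, 2.0, 0.5])
grad_fn = lambda x: A @ x
eigenvalue, direction = top_eigenpair(grad_fn, params=np.zeros(3))
```

On a real model, `grad_fn` would be a mini-batch gradient, and whether that batch is disjoint from the data being selected is the held-out question raised above.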

This is for labs already running large annealing runs and looking for better data filters. A reader who wants another empirical data-selection trick might find the results worth checking; someone looking for a clean theoretical reduction will not. The paper shows clear engagement with the practical problem and the optimization literature, so it is coherent on its own terms even if the main claim does not close.

I would send it to peer review. The empirical side is testable and the topic matters; referees can press on the derivation and the implementation details.

Referee Report

2 major / 0 minor

Summary. The manuscript characterizes the LLM annealing phase via the loss landscape's spectral geometry and argues that optimal convergence requires gradient updates satisfying heterogeneous directional constraints across eigen-directions. It formulates sample selection as a constrained optimization problem and introduces DiReCT, which imposes explicit directional constraints on per-sample gradients derived from Hessian spectral properties to identify samples aligning with an optimal curvature-aware descent path. Experiments across model scales are reported to show consistent state-of-the-art performance, with code released.

Significance. If the mapping from spectral geometry to the specific constrained formulation can be rigorously derived and the empirical gains hold under controlled comparisons, the work would supply a theory-grounded alternative to heuristic data-selection practices in the critical annealing stage, with potential impact on training efficiency and final model quality. The public code release supports reproducibility.

major comments (2)

[Abstract] (paragraph beginning 'We argue that optimal convergence...'): The central premise that optimal convergence requires gradient updates to satisfy heterogeneous directional constraints across eigen-directions is asserted without a derivation from loss-landscape geometry or a theorem establishing necessity or sufficiency of per-eigen-direction constraints in the annealing regime. This step is load-bearing for the subsequent constrained-optimization formulation and the claim that DiReCT identifies the 'optimal curvature-aware descent path'.
[Abstract] (final experimental claim): The statement that DiReCT 'consistently achieves state-of-the-art performance' across model scales is presented without reference to specific baselines, metrics, ablation controls, or statistical significance tests in the provided text, preventing assessment of whether gains are attributable to the directional constraints rather than implementation details.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address the two major comments below and will revise the manuscript to improve clarity and rigor.

Referee: [Abstract] (paragraph beginning 'We argue that optimal convergence...'): The central premise that optimal convergence requires gradient updates to satisfy heterogeneous directional constraints across eigen-directions is asserted without a derivation from loss-landscape geometry or a theorem establishing necessity or sufficiency of per-eigen-direction constraints in the annealing regime. This step is load-bearing for the subsequent constrained-optimization formulation and the claim that DiReCT identifies the 'optimal curvature-aware descent path'.

Authors: We acknowledge that the abstract asserts the premise without an explicit derivation. The full manuscript motivates the constraints via the loss landscape's spectral geometry, but a formal step-by-step derivation from the Hessian eigenspectrum to the per-eigen-direction constraints is not present. In the revision we will add a short derivation subsection (or paragraph in the introduction) establishing necessity under standard smoothness and curvature assumptions for the annealing regime.

revision: yes

Referee: [Abstract] (final experimental claim): The statement that DiReCT 'consistently achieves state-of-the-art performance' across model scales is presented without reference to specific baselines, metrics, ablation controls, or statistical significance tests in the provided text, preventing assessment of whether gains are attributable to the directional constraints rather than implementation details.

Authors: The abstract is intentionally concise. The full paper reports comparisons against standard baselines (random, perplexity, domain filtering), metrics (validation perplexity and downstream tasks), ablations on constraint parameters, and results from multiple random seeds with significance testing. We will revise the abstract to include a brief parenthetical reference to these elements and the relevant experimental section.

revision: yes

Circularity Check

1 step flagged

Optimal descent path defined by the directional constraints that DiReCT itself imposes

specific steps

self-definitional [Abstract]
"We argue that optimal convergence requires gradient updates to satisfy heterogeneous constraints across different eigen-directions. Building on this insight, we formulate data selection as a problem of satisfying these directional constraints. [...] By imposing explicit directional constraints on per-sample gradients based on the spectral properties of the Hessian, DiReCT identifies samples that align with the optimal curvature-aware descent path."

The 'optimal' path is defined as the trajectory obeying the heterogeneous directional constraints; DiReCT is then presented as identifying samples that satisfy those same constraints, rendering the alignment claim true by construction rather than by independent derivation from spectral geometry.

full rationale

The paper's central theoretical step asserts without derivation that optimal convergence in annealing requires gradient updates to obey heterogeneous eigen-direction constraints, then defines DiReCT as the method that enforces exactly those constraints and therefore aligns with the 'optimal curvature-aware descent path.' This reduces the claim of principled grounding to a self-definitional equivalence: the target optimality is characterized by the same per-sample Hessian-based directional restrictions the algorithm applies. No independent theorem or external loss-landscape result is shown to establish necessity or sufficiency of those constraints.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no concrete equations or experimental sections, so no specific free parameters, axioms, or invented entities can be identified.

[...] is run with maximum iteration count $T_{\max} = 20$ and convergence tolerance $\delta = 10^{-4}$.

B. Successive Convex Approximation Solver

Our goal is to learn a continuous selection vector $w \in [0,1]^N$ such that the selected samples produce a strong aggregate gradient in the flat subspace while remaining within a budget in the stiff subspace. Because the flat-subspace objec...
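
The excerpt breaks off before the solver is specified, so the following is only a hedged sketch of what a successive convex approximation loop over a relaxed selection vector can look like. It is not the paper's algorithm, and it departs from the stated formulation in one named way: the stiff-subspace budget is replaced by a quadratic penalty with weight `lam` rather than enforced as a hard constraint. In this sketch the flat-subspace term is a convex quadratic being maximized, so each outer round replaces it with its linearization at the current point (a lower bound), adds a proximal term `rho`, and maximizes the resulting concave surrogate by projected gradient ascent. The names `flat`, `stiff`, `lam`, `rho`, and `inner_steps` are hypothetical. The defaults of 20 iterations and tolerance 1e-4 are borrowed from the values quoted above, although the excerpt does not show which routine they belong to.

```python
import numpy as np

def objective(w, flat, stiff, lam):
    """Flat-subspace gradient energy minus a penalty on stiff-subspace energy."""
    return float(np.sum((flat @ w) ** 2) - lam * np.sum((stiff @ w) ** 2))

def sca_select(flat, stiff, lam=1.0, rho=1.0, t_max=20, delta=1e-4, inner_steps=200):
    """Successive convex approximation over the box [0, 1]^N (penalty variant)."""
    n = flat.shape[1]
    w = np.full(n, 0.5)
    # Lipschitz constant of the surrogate's gradient gives a safe inner step size.
    lipschitz = 2.0 * lam * np.linalg.norm(stiff, 2) ** 2 + rho
    history = [objective(w, flat, stiff, lam)]
    for _ in range(t_max):
        anchor = w.copy()
        linear = 2.0 * flat.T @ (flat @ anchor)     # gradient of ||flat @ w||^2 at anchor
        for _ in range(inner_steps):                # projected gradient ascent on surrogate
            grad = linear - 2.0 * lam * stiff.T @ (stiff @ w) - rho * (w - anchor)
            w = np.clip(w + grad / lipschitz, 0.0, 1.0)
        history.append(objective(w, flat, stiff, lam))
        if np.linalg.norm(w - anchor) < delta:      # outer convergence check
            break
    return w, history

# Toy data: columns are per-sample gradients projected onto each subspace.
rng = np.random.default_rng(0)
flat = rng.standard_normal((5, 30))
stiff = rng.standard_normal((3, 30))
w, history = sca_select(flat, stiff)
```

Because the surrogate never exceeds the true objective and matches it at the anchor point, the recorded objective values do not decrease from one outer round to the next. A hard budget, as described in the excerpt, would require a constrained inner solver instead of the penalty used here.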
