D$^3$: Dynamic Directional Graph-Constrained Data Scheduling for LLM Training

Guang Zhang; Jianing Hao; Yuanjian Xu; Zhong Li

arxiv: 2605.31164 · v1 · pith:MH2SB2Z7new · submitted 2026-05-29 · 💻 cs.CL · cs.AI

D³: Dynamic Directional Graph-Constrained Data Scheduling for LLM Training

Yuanjian Xu , Jianing Hao , Guang Zhang , Zhong Li This is my paper

Pith reviewed 2026-06-28 22:39 UTC · model grok-4.3

classification 💻 cs.CL cs.AI

keywords data schedulingLLM traininginfluence graphdynamic graphtraining orderconstrained optimizationpre-trainingpost-training

0 comments

The pith

A dynamic graph of loss-based directional dependencies between training samples can constrain their order to improve LLM learning efficiency.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that real data samples exert directional influences on one another during LLM training, so the sequence of presentation affects how efficiently the model learns. Existing schedulers adjust only the overall mix of data and therefore miss these pairwise effects. D³ constructs a dynamic influence graph whose edges encode loss-based dependencies, then solves a constrained optimization on that graph to produce a training order that follows the changing information flow. The resulting order yields consistent gains over prior methods in both pre-training and post-training while an approximation keeps the added cost bounded. A reader would care because any method that extracts more signal per token could reduce the scale of compute needed for capable models.

Core claim

D³ formulates the complex interactions among train-units as a dynamic influence graph, where edges represent loss-based dependencies. It then solves a constrained optimization problem over this graph to derive the training order, which ensures that the data sequence respects the evolving information flow throughout training. Our approach is theoretically motivated and yields consistent improvements over existing data scheduling methods across both pre-training and post-training phases. Furthermore, for scalability, D³ also employs an efficient approximation algorithm that keeps the additional computational overhead within a manageable range.

What carries the argument

The dynamic influence graph whose edges are loss-based dependencies, used as the constraint in an optimization that produces the training sequence.

If this is right

The constrained optimization produces a training order that improves learning efficiency relative to distribution-only baselines.
The same framework delivers measurable gains in both pre-training and post-training regimes.
An approximation algorithm keeps the added cost of building and solving the graph within a practical range.
The method rests on a theoretical motivation linking graph-constrained order to information flow.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the loss-based graph truly tracks influence, the same structure could be used to decide which samples to up-weight or discard mid-training rather than only their order.
The dynamic update of the graph opens the possibility of re-computing the schedule at regular checkpoints without restarting the entire run.
Hybrid systems that combine this ordering constraint with existing distribution re-balancing techniques may compound the observed gains.

Load-bearing premise

Real data samples have directional influences on each other that can be captured accurately enough by loss-based edges in a dynamic graph, and that reordering training to respect those edges improves outcomes.

What would settle it

Run identical pre-training and post-training experiments with the same total tokens but replace the D³-derived order by a random permutation of the same batches; if downstream metrics show no consistent difference, the central claim is falsified.

Figures

Figures reproduced from arXiv: 2605.31164 by Guang Zhang, Jianing Hao, Yuanjian Xu, Zhong Li.

**Figure 1.** Figure 1: Overview of the D 3 scheduling framework. Given B micro-batches and the current parameter state θ (t) , the scheduler evaluates pairwise interactions by computing batch gradients and second-order curvature terms. These interactions are formalized into a directional influence matrix S (t) , which induces a dynamic influence graph. A graph-based solver then optimizes the sequence over this graph to produce t… view at source ↗

**Figure 2.** Figure 2: Performance-efficiency trade-off of various influence estimation methods on GPT-2 Medium trained on SlimPajama. The plot depicts the perplexity (y-axis) versus the relative training computation overhead (x-axis) compared to the baseline. A significant gap is observed between the Fisher Information Matrix (FIM) (Xie et al., 2021) and the Diagonal Hessian approach. While the Fisher method introduces a modera… view at source ↗

**Figure 3.** Figure 3: Ablation studies on approximation hyper-parameters. (a) Effect of the gradient projection dimension k on model performance. The red star identifies the optimal dimensionality (k ≈ 3500) for maximizing the signal-to-noise ratio in influence estimation, resulting in an 8.1% reduction in perplexity. (b) Impact of the scheduling window size L on convergence quality. Increasing the scheduling resolution enables… view at source ↗

**Figure 4.** Figure 4: Data statistics of the SlimPajama validation set used for PPL evaluation. (a) Token Frequency Word Cloud: visualizing common terms within the corpus. (b) Token Length Distribution: the samples follow a right-skewed distribution with a mean length of 1010.2 and a median of 470.0 tokens [PITH_FULL_IMAGE:figures/full_fig_p012_4.png] view at source ↗

**Figure 5.** Figure 5: Mechanistic illustration of D 3 scheduling on a non-transitive influence graph. Nodes A, B, and C represent distinct training samples (or mini-batches), while directed edges quantify the pairwise directional advantages Sij derived from gradient alignment. The depicted cyclic dependency (A → B → C → A) represents a fundamental scheduling conflict; D 3 resolves this by identifying the permutation that minimi… view at source ↗

read the original abstract

Training data plays a central role in large language models (LLMs) optimization, motivating extensive research on data scheduling strategies. Most existing approaches concentrate on adjusting the overall data distribution but neglect the underlying interactions between samples during training. However, we argue that such interactions cannot be overlooked, as real-world data samples frequently exhibit directional influences on each other, making the training order crucial. Intuitively, we can prioritize train-units with greater influence to improves learning efficiency. In this work, we propose $D^3$, a Dynamic Directional graph-constrained Data scheduling framework. $D^3$ formulates the complex interactions among train-units as a dynamic influence graph, where edges represent loss-based dependencies. It then solves a constrained optimization problem over this graph to derive the training order, which ensures that the data sequence respects the evolving information flow throughout training. Our approach is theoretically motivated and yields consistent improvements over existing data scheduling methods across both pre-training and post-training phases. Furthermore, for scalability, $D^3$ also employs an efficient approximation algorithm that keeps the additional computational overhead within a manageable range. For future research, the code is available at https://github.com/xuyj233/D3.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

D3 sketches a directional graph scheduler for LLM data order but the abstract supplies no equations, experiments, or numbers, leaving the claims impossible to assess.

read the letter

The core claim here is a new scheduling method that builds a dynamic directional graph from loss-based sample dependencies and then runs constrained optimization to pick the training sequence. The paper positions this as filling a gap left by prior work that only tweaks overall data distributions.

The graph formulation with loss edges and the focus on directional influence is the clearest new piece. Treating training order as an evolving information-flow problem rather than a static distribution problem is a reasonable reframing, and the mention of an approximation algorithm shows some attention to making it usable at scale.

Beyond that the abstract is thin. No derivations appear, no experimental protocol is described, no baselines or metrics are given, and the “consistent improvements” are stated without numbers or error bars. The central assumption—that loss-based edges reliably encode directional influences that matter for learning—remains unexamined in the provided text. Without seeing how the graph is actually constructed or whether the optimization produces measurable gains over simpler heuristics, it is hard to know whether the method adds anything beyond added complexity.

This is aimed at researchers already working on data-ordering or curriculum methods for large models. Someone looking for fresh ideas on reducing training cost through ordering might skim it for the graph angle, but they would need the full paper and its results before deciding whether to pursue the approach.

I would send it to peer review. The idea is distinct enough from standard distribution-based schedulers that referees could usefully test whether the graph model delivers real gains or collapses to something simpler.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes D³, a Dynamic Directional graph-constrained Data scheduling framework for LLM training. It models interactions among training samples as a dynamic influence graph whose edges encode loss-based dependencies, then solves a constrained optimization problem over this graph to produce a training order that respects evolving information flow. The work claims the method is theoretically motivated, delivers consistent gains over prior data scheduling approaches in both pre-training and post-training, and includes an efficient approximation algorithm to control overhead. Code is released at a public repository.

Significance. If the empirical claims hold, the approach would constitute a substantive contribution to data scheduling for LLMs by explicitly incorporating directional, loss-mediated dependencies between samples rather than treating samples as independent or only adjusting aggregate distributions. This could improve training efficiency in a manner complementary to existing curriculum or importance-sampling techniques. The release of code is a clear positive for reproducibility.

major comments (2)

[Abstract] Abstract: the central claims of 'theoretical motivation' and 'consistent improvements' are asserted without any derivations, equations, experimental setup, baselines, or quantitative metrics. This absence is load-bearing because the soundness of the optimization and the magnitude of gains cannot be assessed from the provided text.
[Abstract] Abstract: the key modeling assumption that real-world samples exhibit directional influences accurately captured by loss-based edges in a dynamic graph is stated but neither derived nor empirically justified in the visible material, leaving the motivation for the constrained optimization unsupported.

minor comments (1)

[Abstract] Abstract: the description of the approximation algorithm is high-level; the manuscript should quantify its time complexity relative to the exact solver.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed review and the opportunity to clarify points about the abstract. The abstract is a concise summary by design; the full manuscript contains the requested derivations, setups, and justifications. We address each major comment below.

read point-by-point responses

Referee: [Abstract] Abstract: the central claims of 'theoretical motivation' and 'consistent improvements' are asserted without any derivations, equations, experimental setup, baselines, or quantitative metrics. This absence is load-bearing because the soundness of the optimization and the magnitude of gains cannot be assessed from the provided text.

Authors: Abstracts are intentionally limited in length and do not contain full derivations or tables. The theoretical motivation, including the loss-dependency graph formulation and constrained optimization, is derived with equations in Section 3. Experimental setups, baselines (curriculum learning, importance sampling, random ordering), and quantitative metrics (perplexity, downstream accuracy) appear in Sections 4 and 5, with tables showing consistent gains across pre-training and post-training. The full manuscript therefore supports assessment of soundness and effect sizes. revision: no
Referee: [Abstract] Abstract: the key modeling assumption that real-world samples exhibit directional influences accurately captured by loss-based edges in a dynamic graph is stated but neither derived nor empirically justified in the visible material, leaving the motivation for the constrained optimization unsupported.

Authors: The directional influence assumption is motivated in the Introduction and Section 2 by reference to observed training dynamics where earlier samples affect later loss. Section 3.1 formally defines the dynamic graph with loss-based edges and the optimization objective. Empirical support is given in Section 4.3 via graph visualizations, ablation on edge construction, and performance comparisons that validate the directional model over undirected or static alternatives. Theoretical justification for the optimization appears in the appendix. revision: no

Circularity Check

0 steps flagged

No significant circularity; derivation self-contained at described level

full rationale

The abstract and provided text describe D³ at a conceptual level: interactions among train-units are formulated as a dynamic influence graph with loss-based edges, followed by solving a constrained optimization for training order. No equations, derivations, fitted parameters, or self-citations appear in the given material. The claim of being 'theoretically motivated' is stated without any exhibited reduction to inputs by construction, self-definition, or load-bearing self-citation. Absent any mathematical chain that collapses to its own fitted quantities or prior author results, the approach remains self-contained against external benchmarks and receives the default non-circularity finding.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract supplies no explicit free parameters, axioms, or invented entities; full manuscript details unavailable for ledger construction.

pith-pipeline@v0.9.1-grok · 5746 in / 1048 out tokens · 30292 ms · 2026-06-28T22:39:27.353694+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

32 extracted references · 18 canonical work pages · 13 internal anchors

[1]

Database-friendly random projections

Achlioptas, D. Database-friendly random projections. In Buneman, P. (ed.),Proceedings of the Twentieth ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, May 21-23, 2001, Santa Barbara, Cal- ifornia, USA. ACM,

2001
[2]

A survey on data selection for language models.arXiv preprint arXiv:2402.16827,

Albalak, A., Elazar, Y ., Xie, S. M., Longpre, S., Lambert, N., Wang, X., Muennighoff, N., Hou, B., Pan, L., Jeong, H., Raffel, C., Chang, S., Hashimoto, T., and Wang, W. Y . A survey on data selection for language models.arXiv preprint arXiv:2402.16827,

work page arXiv
[3]

Program Synthesis with Large Language Models

Austin, J., Odena, A., Nye, M., Bosma, M., Michalewski, H., Dohan, D., Jiang, E., Cai, C., Terry, M., Le, Q., et al. Program synthesis with large language models.arXiv preprint arXiv:2108.07732,

work page internal anchor Pith review Pith/arXiv arXiv
[4]

Efficient pretraining data selection for language models via multi-actor collaboration

Bai, T., Yang, L., et al. Efficient pretraining data selection for language models via multi-actor collaboration. InPro- ceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2025,

2025
[5]

Automixer: Checkpoint artifacts as automatic data mixers

Chang, E., Li, Y ., et al. Automixer: Checkpoint artifacts as automatic data mixers. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2025,

2025
[6]

Evaluating Large Language Models Trained on Code

Chen, M., Tworek, J., et al. Evaluating large language models trained on code.arXiv preprint arXiv:2107.03374,

work page internal anchor Pith review Pith/arXiv arXiv
[7]

Training Verifiers to Solve Math Word Problems

Cobbe, K., Kosaraju, V ., Bavarian, M., Chen, M., Jun, H., Kaiser, L., Plappert, M., Tworek, J., Hilton, J., Nakano, R., Hesse, C., and Schulman, J. Training verifiers to solve math word problems.arXiv preprint arXiv:2110.14168, 2021a. Cobbe, K., Kosaraju, V ., Bavarian, M., Chen, M., Jun, H., Kaiser, L., Plappert, M., Tworek, J., Hilton, J., Nakano, R., ...

work page internal anchor Pith review Pith/arXiv arXiv
[8]

The Pile: An 800GB Dataset of Diverse Text for Language Modeling

Gao, L., Biderman, S., Black, S., Golding, L., Hoppe, T., Foster, C., Phang, J., He, H., Thite, A., Nabeshima, N., Presser, S., and Leahy, C. The pile: An 800gb dataset of diverse text for language modeling.arXiv preprint arXiv:2101.00027,

work page internal anchor Pith review Pith/arXiv arXiv
[9]

The Llama 3 Herd of Models

Grattafiori, Aaron, et al. The llama 3 herd of models.arXiv preprint arXiv:2407.21783,

work page internal anchor Pith review Pith/arXiv arXiv
[10]

Measuring Mathematical Problem Solving With the MATH Dataset

Hendrycks, D., Burns, C., Kadavath, S., Arora, A., Basart, S., Tang, E., Song, D., and Steinhardt, J. Measuring math- ematical problem solving with the math dataset.arXiv preprint arXiv:2103.03874,

work page internal anchor Pith review Pith/arXiv arXiv
[11]

Training Compute-Optimal Large Language Models

Hoffmann, J., Borgeaud, S., Mensch, A., et al. Training compute-optimal large language models.arXiv preprint arXiv:2203.15556,

work page internal anchor Pith review Pith/arXiv arXiv
[12]

Distill- ing step-by-step! outperforming larger language models with less training data and smaller model sizes

Hsieh, C.-Y ., Li, C.-L., Yeh, C.-K., Nakhost, H., Fujii, Y ., Ratner, A., Krishna, R., Lee, C.-Y ., and Pfister, T. Distill- ing step-by-step! outperforming larger language models with less training data and smaller model sizes. InFind- ings of the Association for Computational Linguistics: ACL 2023, pp. 8003–8017,

2023
[13]

What makes a good curriculum? disentan- gling the effects of data ordering on llm mathematical reasoning.arXiv preprint arXiv:2510.19099,

Jia, Y ., Zhang, C., Diao, X., Yuan, X., Ouyang, Z., and V osoughi, S. What makes a good curriculum? disentan- gling the effects of data ordering on llm mathematical reasoning.arXiv preprint arXiv:2510.19099,

work page arXiv
[14]

Scaling Laws for Neural Language Models

Kaplan, J., McCandlish, S., Henighan, T., Brown, T. B., Chess, B., Child, R., Gray, S., Radford, A., Wu, J., and Amodei, D. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361,

work page internal anchor Pith review Pith/arXiv arXiv 2001
[15]

and Lee, J

Kim, J. and Lee, J. Strategic data ordering: Enhancing large language model performance through curriculum learning.arXiv preprint arXiv:2405.07490,

work page arXiv
[16]

Boosting multi-domain fine-tuning of large language models through evolving interactions between samples

Liang, X., Yang, L., Wang, J., Lu, Y ., Wu, R., Chen, H., and Hao, J. Boosting multi-domain fine-tuning of large language models through evolving interactions between samples. InForty-second International Conference on Machine Learning, ICML 2025, Vancouver, BC, Canada, July 13-19,

2025
[17]

9 Title Suppressed Due to Excessive Size Liu, H., Li, Z., Hall, D. L. W., Liang, P., and Ma, T. Sophia: A scalable stochastic second-order optimizer for language model pre-training. InThe Twelfth International Confer- ence on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11,

2024
[18]

Velocitune: A velocity-based dynamic domain reweighting method for continual pre- training

Luo, Z., Zhang, X., et al. Velocitune: A velocity-based dynamic domain reweighting method for continual pre- training. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2025,

2025
[19]

Can a Suit of Armor Conduct Electricity? A New Dataset for Open Book Question Answering

Mihaylov, T., Clark, P., Khot, T., and Sabharwal, A. Can a Suit of Armor Conduct Electricity? A New Dataset for Open Book Question Answering.arXiv preprint arXiv:1809.02789,

work page internal anchor Pith review Pith/arXiv arXiv
[20]

Orca: Progressive Learning from Complex Explanation Traces of GPT-4

Mukherjee, S., Mitra, A., Jawahar, G., Agarwal, S., Palangi, H., and Awadallah, A. Orca: Progressive learning from complex explanation traces of gpt-4.arXiv preprint arXiv:2306.02707,

work page internal anchor Pith review Pith/arXiv arXiv
[21]

Instruction Tuning with GPT-4

Peng, B., Li, C., He, P., Galley, M., and Gao, J. Instruc- tion tuning with gpt-4.arXiv preprint arXiv:2304.03277,

work page internal anchor Pith review Pith/arXiv arXiv
[22]

Distributionally Robust Neural Networks for Group Shifts: On the Importance of Regularization for Worst-Case Generalization

Sagawa, S., Koh, P. W., Hashimoto, T., and Liang, P. Distri- butionally robust neural networks for group shifts.arXiv preprint arXiv:1911.08731,

work page internal anchor Pith review Pith/arXiv arXiv 1911
[23]

Scaling laws for optimal data mixtures.arXiv preprint arXiv:2507.09404,

Shukor, M., Bethune, L., Busbridge, D., Grangier, D., Fini, E., El-Nouby, A., and Ablin, P. Scaling laws for optimal data mixtures.arXiv preprint arXiv:2507.09404,

work page arXiv
[24]

Dynamic loss-based sample reweighting for improved large language model pretraining

Sow, D., Woisetschl¨ager, H., et al. Dynamic loss-based sample reweighting for improved large language model pretraining. InThe Thirteenth International Conference on Learning Representations, ICLR 2025, 2025a. Sow, D., Woisetschl¨ager, H., Bulusu, S., Wang, S., Jacob- sen, H.-A., and Liang, Y . Dynamic loss-based sample reweighting for improved large lan...

2025
[25]

T., Wu, T., Song, D., Mittal, P., and Jia, R

Wang, J. T., Wu, T., Song, D., Mittal, P., and Jia, R. GREATS: online selection of high-quality data for LLM training in every iteration. InAdvances in Neural Infor- mation Processing Systems 38, NeurIPS 2024,

2024
[26]

A diffusion theory for deep learning dynamics: Stochastic gradient descent exponentially favors flat minima

Xie, Z., Sato, I., and Sugiyama, M. A diffusion theory for deep learning dynamics: Stochastic gradient descent exponentially favors flat minima. In9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7,

2021
[27]

Data mixing laws: Optimizing data mixtures by pre- dicting language modeling performance.arXiv preprint arXiv:2403.16952,

Ye, J., Liu, P., Sun, T., Zhou, Y ., Zhan, J., and Qiu, X. Data mixing laws: Optimizing data mixtures by pre- dicting language modeling performance.arXiv preprint arXiv:2403.16952,

work page arXiv
[28]

HellaSwag: Can a Machine Really Finish Your Sentence?

Yu, Y ., Han, K., Zhou, H., Tang, Y ., Huang, K., Wang, Y ., and Tao, D. LLM data selection and utilization via dynamic bi-level optimization. InForty-second Inter- national Conference on Machine Learning, ICML 2025, 2025a. Yu, Z., Peng, F., Lei, J., Overwijk, A., tau Yih, W., and Xiong, C. Group-level data selection for efficient pretrain- ing. InThe Thi...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[29]

R., Zhao, S., Song, K., Xu, S., and Zhu, C

Zhou, W., Agrawal, R., Zhang, S., Indurthi, S. R., Zhao, S., Song, K., Xu, S., and Zhu, C. WPO: enhancing RLHF with weighted preference optimization. InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, EMNLP 2024, Miami, FL, USA, November 12-16, 2024, pp. 8328–8340. Association for Computational Linguistics,

2024
[30]

Complementing the internal PPL analysis, we further evaluate the model’s capabilities across several fundamental benchmarks for linguistic understanding and commonsense reasoning. Under a few-shot setting, we conduct evaluations on HellaSwag (Zellers et al., 2019), PIQA (Bisk et al., 2020), OpenBookQA (Mihaylov et al., 2018), COPA (Sarlin et al., 2020), a...

2019
[31]

Table 4 provides a comprehensive summary of the statistics and characteristics of these evaluation datasets

to evaluate functional correctness in programming. Table 4 provides a comprehensive summary of the statistics and characteristics of these evaluation datasets. (a) Token Frequency Word Cloud 0 500 1000 1500 2000 2500 3000 3500 Token Length (per sample) 0.00 0.25 0.50 0.75 1.00 1.25 1.50 1.75 2.00 Density ×10 3 Median: 470.0 (b) Token Length Distribution F...

2000
[32]

Table 6.Architectural configurations of the evaluated models

Our experiments cover two representative families: the GPT-2 series Medium, representing classic decoder-only Transformers with absolute positional embeddings, and the Llama series (1.1B), incorporating modern advancements such as Rotary Positional Embeddings (RoPE), SwiGLU activation functions, and Grouped-Query Attention. Table 6.Architectural configura...

2048

[1] [1]

Database-friendly random projections

Achlioptas, D. Database-friendly random projections. In Buneman, P. (ed.),Proceedings of the Twentieth ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, May 21-23, 2001, Santa Barbara, Cal- ifornia, USA. ACM,

2001

[2] [2]

A survey on data selection for language models.arXiv preprint arXiv:2402.16827,

Albalak, A., Elazar, Y ., Xie, S. M., Longpre, S., Lambert, N., Wang, X., Muennighoff, N., Hou, B., Pan, L., Jeong, H., Raffel, C., Chang, S., Hashimoto, T., and Wang, W. Y . A survey on data selection for language models.arXiv preprint arXiv:2402.16827,

work page arXiv

[3] [3]

Program Synthesis with Large Language Models

Austin, J., Odena, A., Nye, M., Bosma, M., Michalewski, H., Dohan, D., Jiang, E., Cai, C., Terry, M., Le, Q., et al. Program synthesis with large language models.arXiv preprint arXiv:2108.07732,

work page internal anchor Pith review Pith/arXiv arXiv

[4] [4]

Efficient pretraining data selection for language models via multi-actor collaboration

Bai, T., Yang, L., et al. Efficient pretraining data selection for language models via multi-actor collaboration. InPro- ceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2025,

2025

[5] [5]

Automixer: Checkpoint artifacts as automatic data mixers

Chang, E., Li, Y ., et al. Automixer: Checkpoint artifacts as automatic data mixers. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2025,

2025

[6] [6]

Evaluating Large Language Models Trained on Code

Chen, M., Tworek, J., et al. Evaluating large language models trained on code.arXiv preprint arXiv:2107.03374,

work page internal anchor Pith review Pith/arXiv arXiv

[7] [7]

Training Verifiers to Solve Math Word Problems

Cobbe, K., Kosaraju, V ., Bavarian, M., Chen, M., Jun, H., Kaiser, L., Plappert, M., Tworek, J., Hilton, J., Nakano, R., Hesse, C., and Schulman, J. Training verifiers to solve math word problems.arXiv preprint arXiv:2110.14168, 2021a. Cobbe, K., Kosaraju, V ., Bavarian, M., Chen, M., Jun, H., Kaiser, L., Plappert, M., Tworek, J., Hilton, J., Nakano, R., ...

work page internal anchor Pith review Pith/arXiv arXiv

[8] [8]

The Pile: An 800GB Dataset of Diverse Text for Language Modeling

Gao, L., Biderman, S., Black, S., Golding, L., Hoppe, T., Foster, C., Phang, J., He, H., Thite, A., Nabeshima, N., Presser, S., and Leahy, C. The pile: An 800gb dataset of diverse text for language modeling.arXiv preprint arXiv:2101.00027,

work page internal anchor Pith review Pith/arXiv arXiv

[9] [9]

The Llama 3 Herd of Models

Grattafiori, Aaron, et al. The llama 3 herd of models.arXiv preprint arXiv:2407.21783,

work page internal anchor Pith review Pith/arXiv arXiv

[10] [10]

Measuring Mathematical Problem Solving With the MATH Dataset

Hendrycks, D., Burns, C., Kadavath, S., Arora, A., Basart, S., Tang, E., Song, D., and Steinhardt, J. Measuring math- ematical problem solving with the math dataset.arXiv preprint arXiv:2103.03874,

work page internal anchor Pith review Pith/arXiv arXiv

[11] [11]

Training Compute-Optimal Large Language Models

Hoffmann, J., Borgeaud, S., Mensch, A., et al. Training compute-optimal large language models.arXiv preprint arXiv:2203.15556,

work page internal anchor Pith review Pith/arXiv arXiv

[12] [12]

Distill- ing step-by-step! outperforming larger language models with less training data and smaller model sizes

Hsieh, C.-Y ., Li, C.-L., Yeh, C.-K., Nakhost, H., Fujii, Y ., Ratner, A., Krishna, R., Lee, C.-Y ., and Pfister, T. Distill- ing step-by-step! outperforming larger language models with less training data and smaller model sizes. InFind- ings of the Association for Computational Linguistics: ACL 2023, pp. 8003–8017,

2023

[13] [13]

What makes a good curriculum? disentan- gling the effects of data ordering on llm mathematical reasoning.arXiv preprint arXiv:2510.19099,

Jia, Y ., Zhang, C., Diao, X., Yuan, X., Ouyang, Z., and V osoughi, S. What makes a good curriculum? disentan- gling the effects of data ordering on llm mathematical reasoning.arXiv preprint arXiv:2510.19099,

work page arXiv

[14] [14]

Scaling Laws for Neural Language Models

Kaplan, J., McCandlish, S., Henighan, T., Brown, T. B., Chess, B., Child, R., Gray, S., Radford, A., Wu, J., and Amodei, D. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361,

work page internal anchor Pith review Pith/arXiv arXiv 2001

[15] [15]

and Lee, J

Kim, J. and Lee, J. Strategic data ordering: Enhancing large language model performance through curriculum learning.arXiv preprint arXiv:2405.07490,

work page arXiv

[16] [16]

Boosting multi-domain fine-tuning of large language models through evolving interactions between samples

Liang, X., Yang, L., Wang, J., Lu, Y ., Wu, R., Chen, H., and Hao, J. Boosting multi-domain fine-tuning of large language models through evolving interactions between samples. InForty-second International Conference on Machine Learning, ICML 2025, Vancouver, BC, Canada, July 13-19,

2025

[17] [17]

9 Title Suppressed Due to Excessive Size Liu, H., Li, Z., Hall, D. L. W., Liang, P., and Ma, T. Sophia: A scalable stochastic second-order optimizer for language model pre-training. InThe Twelfth International Confer- ence on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11,

2024

[18] [18]

Velocitune: A velocity-based dynamic domain reweighting method for continual pre- training

Luo, Z., Zhang, X., et al. Velocitune: A velocity-based dynamic domain reweighting method for continual pre- training. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2025,

2025

[19] [19]

Can a Suit of Armor Conduct Electricity? A New Dataset for Open Book Question Answering

Mihaylov, T., Clark, P., Khot, T., and Sabharwal, A. Can a Suit of Armor Conduct Electricity? A New Dataset for Open Book Question Answering.arXiv preprint arXiv:1809.02789,

work page internal anchor Pith review Pith/arXiv arXiv

[20] [20]

Orca: Progressive Learning from Complex Explanation Traces of GPT-4

Mukherjee, S., Mitra, A., Jawahar, G., Agarwal, S., Palangi, H., and Awadallah, A. Orca: Progressive learning from complex explanation traces of gpt-4.arXiv preprint arXiv:2306.02707,

work page internal anchor Pith review Pith/arXiv arXiv

[21] [21]

Instruction Tuning with GPT-4

Peng, B., Li, C., He, P., Galley, M., and Gao, J. Instruc- tion tuning with gpt-4.arXiv preprint arXiv:2304.03277,

work page internal anchor Pith review Pith/arXiv arXiv

[22] [22]

Distributionally Robust Neural Networks for Group Shifts: On the Importance of Regularization for Worst-Case Generalization

Sagawa, S., Koh, P. W., Hashimoto, T., and Liang, P. Distri- butionally robust neural networks for group shifts.arXiv preprint arXiv:1911.08731,

work page internal anchor Pith review Pith/arXiv arXiv 1911

[23] [23]

Scaling laws for optimal data mixtures.arXiv preprint arXiv:2507.09404,

Shukor, M., Bethune, L., Busbridge, D., Grangier, D., Fini, E., El-Nouby, A., and Ablin, P. Scaling laws for optimal data mixtures.arXiv preprint arXiv:2507.09404,

work page arXiv

[24] [24]

Dynamic loss-based sample reweighting for improved large language model pretraining

Sow, D., Woisetschl¨ager, H., et al. Dynamic loss-based sample reweighting for improved large language model pretraining. InThe Thirteenth International Conference on Learning Representations, ICLR 2025, 2025a. Sow, D., Woisetschl¨ager, H., Bulusu, S., Wang, S., Jacob- sen, H.-A., and Liang, Y . Dynamic loss-based sample reweighting for improved large lan...

2025

[25] [25]

T., Wu, T., Song, D., Mittal, P., and Jia, R

Wang, J. T., Wu, T., Song, D., Mittal, P., and Jia, R. GREATS: online selection of high-quality data for LLM training in every iteration. InAdvances in Neural Infor- mation Processing Systems 38, NeurIPS 2024,

2024

[26] [26]

A diffusion theory for deep learning dynamics: Stochastic gradient descent exponentially favors flat minima

Xie, Z., Sato, I., and Sugiyama, M. A diffusion theory for deep learning dynamics: Stochastic gradient descent exponentially favors flat minima. In9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7,

2021

[27] [27]

Data mixing laws: Optimizing data mixtures by pre- dicting language modeling performance.arXiv preprint arXiv:2403.16952,

Ye, J., Liu, P., Sun, T., Zhou, Y ., Zhan, J., and Qiu, X. Data mixing laws: Optimizing data mixtures by pre- dicting language modeling performance.arXiv preprint arXiv:2403.16952,

work page arXiv

[28] [28]

HellaSwag: Can a Machine Really Finish Your Sentence?

Yu, Y ., Han, K., Zhou, H., Tang, Y ., Huang, K., Wang, Y ., and Tao, D. LLM data selection and utilization via dynamic bi-level optimization. InForty-second Inter- national Conference on Machine Learning, ICML 2025, 2025a. Yu, Z., Peng, F., Lei, J., Overwijk, A., tau Yih, W., and Xiong, C. Group-level data selection for efficient pretrain- ing. InThe Thi...

work page internal anchor Pith review Pith/arXiv arXiv 2025

[29] [29]

R., Zhao, S., Song, K., Xu, S., and Zhu, C

Zhou, W., Agrawal, R., Zhang, S., Indurthi, S. R., Zhao, S., Song, K., Xu, S., and Zhu, C. WPO: enhancing RLHF with weighted preference optimization. InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, EMNLP 2024, Miami, FL, USA, November 12-16, 2024, pp. 8328–8340. Association for Computational Linguistics,

2024

[30] [30]

Complementing the internal PPL analysis, we further evaluate the model’s capabilities across several fundamental benchmarks for linguistic understanding and commonsense reasoning. Under a few-shot setting, we conduct evaluations on HellaSwag (Zellers et al., 2019), PIQA (Bisk et al., 2020), OpenBookQA (Mihaylov et al., 2018), COPA (Sarlin et al., 2020), a...

2019

[31] [31]

Table 4 provides a comprehensive summary of the statistics and characteristics of these evaluation datasets

to evaluate functional correctness in programming. Table 4 provides a comprehensive summary of the statistics and characteristics of these evaluation datasets. (a) Token Frequency Word Cloud 0 500 1000 1500 2000 2500 3000 3500 Token Length (per sample) 0.00 0.25 0.50 0.75 1.00 1.25 1.50 1.75 2.00 Density ×10 3 Median: 470.0 (b) Token Length Distribution F...

2000

[32] [32]

Table 6.Architectural configurations of the evaluated models

Our experiments cover two representative families: the GPT-2 series Medium, representing classic decoder-only Transformers with absolute positional embeddings, and the Llama series (1.1B), incorporating modern advancements such as Rotary Positional Embeddings (RoPE), SwiGLU activation functions, and Grouped-Query Attention. Table 6.Architectural configura...

2048