D³: Dynamic Directional Graph-Constrained Data Scheduling for LLM Training
Pith reviewed 2026-06-28 22:39 UTC · model grok-4.3
The pith
A dynamic graph of loss-based directional dependencies between training samples can constrain their order to improve LLM learning efficiency.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
D³ formulates the complex interactions among train-units as a dynamic influence graph, where edges represent loss-based dependencies. It then solves a constrained optimization problem over this graph to derive the training order, which ensures that the data sequence respects the evolving information flow throughout training. Our approach is theoretically motivated and yields consistent improvements over existing data scheduling methods across both pre-training and post-training phases. Furthermore, for scalability, D³ also employs an efficient approximation algorithm that keeps the additional computational overhead within a manageable range.
What carries the argument
The dynamic influence graph whose edges are loss-based dependencies, used as the constraint in an optimization that produces the training sequence.
If this is right
- The constrained optimization produces a training order that improves learning efficiency relative to distribution-only baselines.
- The same framework delivers measurable gains in both pre-training and post-training regimes.
- An approximation algorithm keeps the added cost of building and solving the graph within a practical range.
- The method rests on a theoretical motivation linking graph-constrained order to information flow.
Where Pith is reading between the lines
- If the loss-based graph truly tracks influence, the same structure could be used to decide which samples to up-weight or discard mid-training rather than only their order.
- The dynamic update of the graph opens the possibility of re-computing the schedule at regular checkpoints without restarting the entire run.
- Hybrid systems that combine this ordering constraint with existing distribution re-balancing techniques may compound the observed gains.
Load-bearing premise
Real data samples have directional influences on each other that can be captured accurately enough by loss-based edges in a dynamic graph, and that reordering training to respect those edges improves outcomes.
What would settle it
Run identical pre-training and post-training experiments with the same total tokens but replace the D³-derived order by a random permutation of the same batches; if downstream metrics show no consistent difference, the central claim is falsified.
Figures
read the original abstract
Training data plays a central role in large language models (LLMs) optimization, motivating extensive research on data scheduling strategies. Most existing approaches concentrate on adjusting the overall data distribution but neglect the underlying interactions between samples during training. However, we argue that such interactions cannot be overlooked, as real-world data samples frequently exhibit directional influences on each other, making the training order crucial. Intuitively, we can prioritize train-units with greater influence to improves learning efficiency. In this work, we propose $D^3$, a Dynamic Directional graph-constrained Data scheduling framework. $D^3$ formulates the complex interactions among train-units as a dynamic influence graph, where edges represent loss-based dependencies. It then solves a constrained optimization problem over this graph to derive the training order, which ensures that the data sequence respects the evolving information flow throughout training. Our approach is theoretically motivated and yields consistent improvements over existing data scheduling methods across both pre-training and post-training phases. Furthermore, for scalability, $D^3$ also employs an efficient approximation algorithm that keeps the additional computational overhead within a manageable range. For future research, the code is available at https://github.com/xuyj233/D3.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes D³, a Dynamic Directional graph-constrained Data scheduling framework for LLM training. It models interactions among training samples as a dynamic influence graph whose edges encode loss-based dependencies, then solves a constrained optimization problem over this graph to produce a training order that respects evolving information flow. The work claims the method is theoretically motivated, delivers consistent gains over prior data scheduling approaches in both pre-training and post-training, and includes an efficient approximation algorithm to control overhead. Code is released at a public repository.
Significance. If the empirical claims hold, the approach would constitute a substantive contribution to data scheduling for LLMs by explicitly incorporating directional, loss-mediated dependencies between samples rather than treating samples as independent or only adjusting aggregate distributions. This could improve training efficiency in a manner complementary to existing curriculum or importance-sampling techniques. The release of code is a clear positive for reproducibility.
major comments (2)
- [Abstract] Abstract: the central claims of 'theoretical motivation' and 'consistent improvements' are asserted without any derivations, equations, experimental setup, baselines, or quantitative metrics. This absence is load-bearing because the soundness of the optimization and the magnitude of gains cannot be assessed from the provided text.
- [Abstract] Abstract: the key modeling assumption that real-world samples exhibit directional influences accurately captured by loss-based edges in a dynamic graph is stated but neither derived nor empirically justified in the visible material, leaving the motivation for the constrained optimization unsupported.
minor comments (1)
- [Abstract] Abstract: the description of the approximation algorithm is high-level; the manuscript should quantify its time complexity relative to the exact solver.
Simulated Author's Rebuttal
We thank the referee for the detailed review and the opportunity to clarify points about the abstract. The abstract is a concise summary by design; the full manuscript contains the requested derivations, setups, and justifications. We address each major comment below.
read point-by-point responses
-
Referee: [Abstract] Abstract: the central claims of 'theoretical motivation' and 'consistent improvements' are asserted without any derivations, equations, experimental setup, baselines, or quantitative metrics. This absence is load-bearing because the soundness of the optimization and the magnitude of gains cannot be assessed from the provided text.
Authors: Abstracts are intentionally limited in length and do not contain full derivations or tables. The theoretical motivation, including the loss-dependency graph formulation and constrained optimization, is derived with equations in Section 3. Experimental setups, baselines (curriculum learning, importance sampling, random ordering), and quantitative metrics (perplexity, downstream accuracy) appear in Sections 4 and 5, with tables showing consistent gains across pre-training and post-training. The full manuscript therefore supports assessment of soundness and effect sizes. revision: no
-
Referee: [Abstract] Abstract: the key modeling assumption that real-world samples exhibit directional influences accurately captured by loss-based edges in a dynamic graph is stated but neither derived nor empirically justified in the visible material, leaving the motivation for the constrained optimization unsupported.
Authors: The directional influence assumption is motivated in the Introduction and Section 2 by reference to observed training dynamics where earlier samples affect later loss. Section 3.1 formally defines the dynamic graph with loss-based edges and the optimization objective. Empirical support is given in Section 4.3 via graph visualizations, ablation on edge construction, and performance comparisons that validate the directional model over undirected or static alternatives. Theoretical justification for the optimization appears in the appendix. revision: no
Circularity Check
No significant circularity; derivation self-contained at described level
full rationale
The abstract and provided text describe D³ at a conceptual level: interactions among train-units are formulated as a dynamic influence graph with loss-based edges, followed by solving a constrained optimization for training order. No equations, derivations, fitted parameters, or self-citations appear in the given material. The claim of being 'theoretically motivated' is stated without any exhibited reduction to inputs by construction, self-definition, or load-bearing self-citation. Absent any mathematical chain that collapses to its own fitted quantities or prior author results, the approach remains self-contained against external benchmarks and receives the default non-circularity finding.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
Database-friendly random projections
Achlioptas, D. Database-friendly random projections. In Buneman, P. (ed.),Proceedings of the Twentieth ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, May 21-23, 2001, Santa Barbara, Cal- ifornia, USA. ACM,
2001
-
[2]
A survey on data selection for language models.arXiv preprint arXiv:2402.16827,
Albalak, A., Elazar, Y ., Xie, S. M., Longpre, S., Lambert, N., Wang, X., Muennighoff, N., Hou, B., Pan, L., Jeong, H., Raffel, C., Chang, S., Hashimoto, T., and Wang, W. Y . A survey on data selection for language models.arXiv preprint arXiv:2402.16827,
-
[3]
Program Synthesis with Large Language Models
Austin, J., Odena, A., Nye, M., Bosma, M., Michalewski, H., Dohan, D., Jiang, E., Cai, C., Terry, M., Le, Q., et al. Program synthesis with large language models.arXiv preprint arXiv:2108.07732,
work page internal anchor Pith review Pith/arXiv arXiv
-
[4]
Efficient pretraining data selection for language models via multi-actor collaboration
Bai, T., Yang, L., et al. Efficient pretraining data selection for language models via multi-actor collaboration. InPro- ceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2025,
2025
-
[5]
Automixer: Checkpoint artifacts as automatic data mixers
Chang, E., Li, Y ., et al. Automixer: Checkpoint artifacts as automatic data mixers. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2025,
2025
-
[6]
Evaluating Large Language Models Trained on Code
Chen, M., Tworek, J., et al. Evaluating large language models trained on code.arXiv preprint arXiv:2107.03374,
work page internal anchor Pith review Pith/arXiv arXiv
-
[7]
Training Verifiers to Solve Math Word Problems
Cobbe, K., Kosaraju, V ., Bavarian, M., Chen, M., Jun, H., Kaiser, L., Plappert, M., Tworek, J., Hilton, J., Nakano, R., Hesse, C., and Schulman, J. Training verifiers to solve math word problems.arXiv preprint arXiv:2110.14168, 2021a. Cobbe, K., Kosaraju, V ., Bavarian, M., Chen, M., Jun, H., Kaiser, L., Plappert, M., Tworek, J., Hilton, J., Nakano, R., ...
work page internal anchor Pith review Pith/arXiv arXiv
-
[8]
The Pile: An 800GB Dataset of Diverse Text for Language Modeling
Gao, L., Biderman, S., Black, S., Golding, L., Hoppe, T., Foster, C., Phang, J., He, H., Thite, A., Nabeshima, N., Presser, S., and Leahy, C. The pile: An 800gb dataset of diverse text for language modeling.arXiv preprint arXiv:2101.00027,
work page internal anchor Pith review Pith/arXiv arXiv
-
[9]
Grattafiori, Aaron, et al. The llama 3 herd of models.arXiv preprint arXiv:2407.21783,
work page internal anchor Pith review Pith/arXiv arXiv
-
[10]
Measuring Mathematical Problem Solving With the MATH Dataset
Hendrycks, D., Burns, C., Kadavath, S., Arora, A., Basart, S., Tang, E., Song, D., and Steinhardt, J. Measuring math- ematical problem solving with the math dataset.arXiv preprint arXiv:2103.03874,
work page internal anchor Pith review Pith/arXiv arXiv
-
[11]
Training Compute-Optimal Large Language Models
Hoffmann, J., Borgeaud, S., Mensch, A., et al. Training compute-optimal large language models.arXiv preprint arXiv:2203.15556,
work page internal anchor Pith review Pith/arXiv arXiv
-
[12]
Distill- ing step-by-step! outperforming larger language models with less training data and smaller model sizes
Hsieh, C.-Y ., Li, C.-L., Yeh, C.-K., Nakhost, H., Fujii, Y ., Ratner, A., Krishna, R., Lee, C.-Y ., and Pfister, T. Distill- ing step-by-step! outperforming larger language models with less training data and smaller model sizes. InFind- ings of the Association for Computational Linguistics: ACL 2023, pp. 8003–8017,
2023
-
[13]
Jia, Y ., Zhang, C., Diao, X., Yuan, X., Ouyang, Z., and V osoughi, S. What makes a good curriculum? disentan- gling the effects of data ordering on llm mathematical reasoning.arXiv preprint arXiv:2510.19099,
-
[14]
Scaling Laws for Neural Language Models
Kaplan, J., McCandlish, S., Henighan, T., Brown, T. B., Chess, B., Child, R., Gray, S., Radford, A., Wu, J., and Amodei, D. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361,
work page internal anchor Pith review Pith/arXiv arXiv 2001
-
[15]
Kim, J. and Lee, J. Strategic data ordering: Enhancing large language model performance through curriculum learning.arXiv preprint arXiv:2405.07490,
-
[16]
Boosting multi-domain fine-tuning of large language models through evolving interactions between samples
Liang, X., Yang, L., Wang, J., Lu, Y ., Wu, R., Chen, H., and Hao, J. Boosting multi-domain fine-tuning of large language models through evolving interactions between samples. InForty-second International Conference on Machine Learning, ICML 2025, Vancouver, BC, Canada, July 13-19,
2025
-
[17]
9 Title Suppressed Due to Excessive Size Liu, H., Li, Z., Hall, D. L. W., Liang, P., and Ma, T. Sophia: A scalable stochastic second-order optimizer for language model pre-training. InThe Twelfth International Confer- ence on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11,
2024
-
[18]
Velocitune: A velocity-based dynamic domain reweighting method for continual pre- training
Luo, Z., Zhang, X., et al. Velocitune: A velocity-based dynamic domain reweighting method for continual pre- training. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2025,
2025
-
[19]
Can a Suit of Armor Conduct Electricity? A New Dataset for Open Book Question Answering
Mihaylov, T., Clark, P., Khot, T., and Sabharwal, A. Can a Suit of Armor Conduct Electricity? A New Dataset for Open Book Question Answering.arXiv preprint arXiv:1809.02789,
work page internal anchor Pith review Pith/arXiv arXiv
-
[20]
Orca: Progressive Learning from Complex Explanation Traces of GPT-4
Mukherjee, S., Mitra, A., Jawahar, G., Agarwal, S., Palangi, H., and Awadallah, A. Orca: Progressive learning from complex explanation traces of gpt-4.arXiv preprint arXiv:2306.02707,
work page internal anchor Pith review Pith/arXiv arXiv
-
[21]
Peng, B., Li, C., He, P., Galley, M., and Gao, J. Instruc- tion tuning with gpt-4.arXiv preprint arXiv:2304.03277,
work page internal anchor Pith review Pith/arXiv arXiv
-
[22]
Sagawa, S., Koh, P. W., Hashimoto, T., and Liang, P. Distri- butionally robust neural networks for group shifts.arXiv preprint arXiv:1911.08731,
work page internal anchor Pith review Pith/arXiv arXiv 1911
-
[23]
Scaling laws for optimal data mixtures.arXiv preprint arXiv:2507.09404,
Shukor, M., Bethune, L., Busbridge, D., Grangier, D., Fini, E., El-Nouby, A., and Ablin, P. Scaling laws for optimal data mixtures.arXiv preprint arXiv:2507.09404,
-
[24]
Dynamic loss-based sample reweighting for improved large language model pretraining
Sow, D., Woisetschl¨ager, H., et al. Dynamic loss-based sample reweighting for improved large language model pretraining. InThe Thirteenth International Conference on Learning Representations, ICLR 2025, 2025a. Sow, D., Woisetschl¨ager, H., Bulusu, S., Wang, S., Jacob- sen, H.-A., and Liang, Y . Dynamic loss-based sample reweighting for improved large lan...
2025
-
[25]
T., Wu, T., Song, D., Mittal, P., and Jia, R
Wang, J. T., Wu, T., Song, D., Mittal, P., and Jia, R. GREATS: online selection of high-quality data for LLM training in every iteration. InAdvances in Neural Infor- mation Processing Systems 38, NeurIPS 2024,
2024
-
[26]
A diffusion theory for deep learning dynamics: Stochastic gradient descent exponentially favors flat minima
Xie, Z., Sato, I., and Sugiyama, M. A diffusion theory for deep learning dynamics: Stochastic gradient descent exponentially favors flat minima. In9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7,
2021
-
[27]
Ye, J., Liu, P., Sun, T., Zhou, Y ., Zhan, J., and Qiu, X. Data mixing laws: Optimizing data mixtures by pre- dicting language modeling performance.arXiv preprint arXiv:2403.16952,
-
[28]
HellaSwag: Can a Machine Really Finish Your Sentence?
Yu, Y ., Han, K., Zhou, H., Tang, Y ., Huang, K., Wang, Y ., and Tao, D. LLM data selection and utilization via dynamic bi-level optimization. InForty-second Inter- national Conference on Machine Learning, ICML 2025, 2025a. Yu, Z., Peng, F., Lei, J., Overwijk, A., tau Yih, W., and Xiong, C. Group-level data selection for efficient pretrain- ing. InThe Thi...
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[29]
R., Zhao, S., Song, K., Xu, S., and Zhu, C
Zhou, W., Agrawal, R., Zhang, S., Indurthi, S. R., Zhao, S., Song, K., Xu, S., and Zhu, C. WPO: enhancing RLHF with weighted preference optimization. InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, EMNLP 2024, Miami, FL, USA, November 12-16, 2024, pp. 8328–8340. Association for Computational Linguistics,
2024
-
[30]
Complementing the internal PPL analysis, we further evaluate the model’s capabilities across several fundamental benchmarks for linguistic understanding and commonsense reasoning. Under a few-shot setting, we conduct evaluations on HellaSwag (Zellers et al., 2019), PIQA (Bisk et al., 2020), OpenBookQA (Mihaylov et al., 2018), COPA (Sarlin et al., 2020), a...
2019
-
[31]
Table 4 provides a comprehensive summary of the statistics and characteristics of these evaluation datasets
to evaluate functional correctness in programming. Table 4 provides a comprehensive summary of the statistics and characteristics of these evaluation datasets. (a) Token Frequency Word Cloud 0 500 1000 1500 2000 2500 3000 3500 Token Length (per sample) 0.00 0.25 0.50 0.75 1.00 1.25 1.50 1.75 2.00 Density ×10 3 Median: 470.0 (b) Token Length Distribution F...
2000
-
[32]
Table 6.Architectural configurations of the evaluated models
Our experiments cover two representative families: the GPT-2 series Medium, representing classic decoder-only Transformers with absolute positional embeddings, and the Llama series (1.1B), incorporating modern advancements such as Rotary Positional Embeddings (RoPE), SwiGLU activation functions, and Grouped-Query Attention. Table 6.Architectural configura...
2048
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.