
arxiv: 2606.05186 · v1 · pith:W2F3KLRMnew · submitted 2026-04-27 · 💻 cs.LG · cs.CL

Staged Factorial Screening for Budget-Constrained Micro-Pretraining

Pith reviewed 2026-07-01 08:36 UTC · model grok-4.3

keywords micro-pretraining · factorial screening · budget-constrained search · staged experiments · effect estimation · anchor confirmation · hyperparameter triage

The pith

Staged fractional-factorial screens identify high-penalty directions early and support bridge anchors through 24 hours on two hosts in budget-constrained micro-pretraining.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether a staged workflow of short fractional-factorial screens can recover stable early effect estimates to triage candidate training recipes when accelerator time is limited. Hundreds of runs at 2 to 10 minutes on a fixed single-GPU loop show that penalties from total batch size, depth, and width are largest at the shortest durations and weaken as time increases. After within-budget Benjamini-Hochberg correction, factors D, A, B, and C retain non-zero estimates at 5 and 10 minutes while E does not. Random search finds competitive points, but they cluster in the same low-penalty region and come without factor attribution. A 60-minute bridge package and seeded 12- and 24-hour continuations on two hosts place the bridge anchor lowest in mean loss, though the ordering of non-bridge options varies by host. The result is a bounded recommendation to screen briefly, confirm anchors repeatedly, and refine inside the reduced space rather than claim hardware-invariant rankings or broad hyperparameter-optimization superiority.

Core claim

On a fixed autoresearch-derived single-GPU training loop, 613 experiments across pilot and follow-up screens at 2, 5, and 10 minutes, full 16-condition seeded reruns, targeted anchor checks, same-host baselines, a 60-minute bridge package, and bounded 12- and 24-hour three-anchor continuations on Windows A100 and Linux L40S hosts show that main penalties from total batch, depth, and width are largest at short budgets and relax as budget increases. Within the predeclared seeded full-screen families, factors D, A, B, and C retain non-zero estimates at 5 and 10 minutes after within-budget Benjamini-Hochberg correction while E does not. Random search reaches strong incumbents in the 32-condition

What carries the argument

A staged fractional-factorial workflow that runs short designed screens to estimate and remove high-penalty directions before committing longer budgets to confirmation and local refinement.
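
The 32-condition space and the 16-condition screens are consistent with five two-level factors and a half fraction of their full factorial. The sketch below builds both under that reading. The factor labels come from the paper, but the coded levels (-1/+1) and the defining relation I = ABCDE are assumptions: the text available here does not state which generator the author used.

```python
from itertools import product

FACTORS = ["A", "B", "C", "D", "E"]  # labels from the paper; actual levels are not listed


def full_factorial(factors):
    """Return every combination of coded levels -1/+1 (2^k runs)."""
    return [dict(zip(factors, levels))
            for levels in product((-1, 1), repeat=len(factors))]


def half_fraction(factors, sign=1):
    """Keep the runs whose level product equals `sign`.

    With five factors this is the fraction defined by I = ABCDE
    (equivalently E = ABCD). This generator is an assumption made
    for illustration, not the paper's documented design.
    """
    runs = []
    for run in full_factorial(factors):
        level_product = 1
        for name in factors:
            level_product *= run[name]
        if level_product == sign:
            runs.append(run)
    return runs


full = full_factorial(FACTORS)
screen = half_fraction(FACTORS)
print(len(full), len(screen))
```

Each factor appears at its high level in exactly half of the retained runs, which is what lets a main effect be estimated as a simple difference of means.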

If this is right

  • Main penalties from total batch, depth, and width are largest at short budgets and relax as budget increases.
  • Within predeclared seeded full-screen families, factors D, A, B, and C retain non-zero estimates at 5 and 10 minutes after within-budget Benjamini-Hochberg correction while E does not.
  • Random search reaches strong incumbents in the 32-condition space but repeatedly in the same low-penalty region and without factor attribution.
  • In the 60-minute bridge package the bridge anchor has the lowest mean, and in bounded 12- and 24-hour three-anchor continuations on both hosts the bridge has the lowest sample mean while non-bridge ordering stays host-sensitive.
  • The evidence supports a bridge-centered recommendation through 24 hours on two hosts, not hardware-invariant ranking or general hyperparameter-optimization superiority.
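
The bullets above depend on main-effect estimates from two-level screens. A minimal sketch of that estimate follows, assuming coded levels -1/+1 and final loss as the response. The loss values are synthetic, generated from a made-up additive model so that the answer is known in advance. They are not results from the paper.

```python
from itertools import product
from statistics import mean

FACTORS = ["A", "B", "C", "D", "E"]


def main_effects(runs, losses, factors):
    """Main effect of a factor = mean loss at +1 minus mean loss at -1."""
    effects = {}
    for name in factors:
        high = [y for run, y in zip(runs, losses) if run[name] == 1]
        low = [y for run, y in zip(runs, losses) if run[name] == -1]
        effects[name] = mean(high) - mean(low)
    return effects


# Synthetic 32-run screen: only A and D move the loss in this toy model.
runs = [dict(zip(FACTORS, levels)) for levels in product((-1, 1), repeat=5)]
losses = [3.0 + 0.2 * run["A"] + 0.05 * run["D"] for run in runs]

effects = main_effects(runs, losses, FACTORS)
for name, value in effects.items():
    print(name, round(value, 3))
```

A positive effect here means that moving the factor to its high level raises loss, which is the sense in which a direction carries a penalty. Repeating the estimate at each budget would show whether that penalty relaxes.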

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The observed relaxation of penalties with budget suggests that early screens could be reused across multiple model scales if the factor set is held constant.
  • Host-sensitive ordering at 24 hours implies that separate short screens per hardware class may be required for stable recommendations rather than a single universal ranking.
  • Because random search already locates strong points inside the low-penalty region, the added value of the factorial workflow lies mainly in the attribution step that guides later refinement.
  • Extending the same staged design to include interaction terms or additional factors such as optimizer choice would test whether the current main-effects focus remains sufficient at longer budgets.

Load-bearing premise

The autoresearch-derived single-GPU training loop and predeclared factor set recover stable early effect structure that remains informative when budget increases to 60 minutes and 24 hours.

What would settle it

If 24-hour repeated runs on both hosts show any non-bridge anchor achieving a lower sample mean than the bridge anchor, or if short-screen effect estimates fail to predict the ordering observed at 60 minutes and beyond, the bridge-centered recommendation would not hold.
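
The first half of this criterion can be written as a mechanical check. The sketch below assumes a mapping from host to anchor to a list of final losses from repeated runs. The host keys, the anchor names other than the bridge, and every number are hypothetical placeholders.

```python
from statistics import mean


def bridge_lowest_everywhere(results, bridge="bridge"):
    """Check whether the bridge anchor has the lowest sample mean on every host.

    results: host -> anchor -> list of final losses from repeated runs.
    Returns (overall_verdict, per_host_verdict).
    """
    per_host = {}
    for host, anchors in results.items():
        means = {name: mean(values) for name, values in anchors.items()}
        per_host[host] = min(means, key=means.get) == bridge
    return all(per_host.values()), per_host


# Hypothetical 24-hour losses; the non-bridge ordering differs by host on purpose.
results = {
    "windows_a100": {"bridge": [2.90, 2.92], "anchor_2": [2.97, 2.99], "anchor_3": [3.01, 3.00]},
    "linux_l40s": {"bridge": [2.91, 2.93], "anchor_2": [3.02, 3.00], "anchor_3": [2.96, 2.98]},
}

holds, per_host = bridge_lowest_everywhere(results)
print(holds, per_host)
```

The check compares sample means only, so it says nothing about whether the gaps exceed run-to-run noise.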

Figures

Figures reproduced from arXiv: 2606.05186 by Felipe Chavarro Polania.

Figure 1: Main-effect penalties relax with budget. The…
Figure 2: In the seeded confirmation subset, variance is dominated by budget and condition struc…
Figure 3: The 60-minute bridge package keeps the reduced-space bridge best and the predeclared control worst. Points show seed runs; black markers show means with 95% confidence intervals.
  Descriptive cross-seed win counts in the same package:
  • bridge_best < greedy: 16/16
  • screened_best < greedy: 1/16
  • bridge_best < control: 16/16
  • greedy < control: 16/16
  • bridge_best - greed…
Figure 4: Dual-host long-horizon anchor packages at…
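
The cross-seed win counts reported with Figure 3 are paired comparisons: for each seed, one arm wins if its loss is strictly lower than the other arm's. A minimal sketch of that tally follows, assuming each arm's results are stored as a seed-to-loss mapping. The four-seed values are synthetic and exist only to exercise the function.

```python
def win_count(losses_a, losses_b):
    """Count seeds where arm A has strictly lower loss than arm B.

    Both arguments map seed -> final loss; only shared seeds are compared.
    Returns (wins, number_of_shared_seeds).
    """
    shared = sorted(set(losses_a) & set(losses_b))
    wins = sum(1 for seed in shared if losses_a[seed] < losses_b[seed])
    return wins, len(shared)


# Synthetic losses for three arms over four seeds (not the paper's data).
bridge_best = {0: 3.10, 1: 3.12, 2: 3.08, 3: 3.11}
greedy = {0: 3.20, 1: 3.18, 2: 3.22, 3: 3.19}
screened_best = {0: 3.25, 1: 3.17, 2: 3.24, 3: 3.21}

print("bridge_best < greedy:", win_count(bridge_best, greedy))
print("screened_best < greedy:", win_count(screened_best, greedy))
```

Pairing by seed keeps the comparison descriptive and avoids mixing seed-to-seed variation into the count.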
Original abstract

Budget-constrained micro-pretraining often requires triaging many candidate recipes on a shared accelerator before larger search budgets are spent. We study whether a staged fractional-factorial workflow can recover stable early effect structure in this setting. On a fixed autoresearch-derived single-GPU training loop, we run 613 experiments across pilot and follow-up screens at 2, 5, and 10 minutes; full 16-condition seeded reruns at 5 and 10 minutes; targeted seeded anchor checks; same-host greedy and matched-cost random baselines; a 60-minute bridge package; and bounded Windows A100 and Linux L40S anchor continuations through 24 hours. Main penalties from total batch, depth, and width are largest at short budgets and relax as budget increases. Within the predeclared seeded full-screen families, D, A, B, and C retain non-zero estimates at 5 and 10 minutes after within-budget Benjamini-Hochberg correction, while E does not. Random search can reach strong incumbents in this 32-condition space, but repeatedly in the same low-penalty region and without factor attribution. The 60-minute bridge anchor has the lowest mean, although that package does not separate workflow refinement from the larger bridge model's capacity advantage. In bounded 12-hour and 24-hour three-anchor continuations on both hosts, the bridge has the lowest sample mean while the non-bridge ordering stays host-sensitive. We therefore present a bounded methods result: use short designed screens to identify high-penalty directions, confirm promising anchors under repeated runs, and refine locally inside the reduced space. The evidence supports a bridge-centered recommendation through 24 hours on two hosts, not hardware-invariant ranking or general hyperparameter-optimization superiority.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 3 minor

Summary. The paper claims that a staged fractional-factorial screening workflow can recover stable early effect structure for budget-constrained micro-pretraining on a fixed single-GPU autoresearch loop. Across 613 experiments (pilot/follow-up screens at 2/5/10 min, full 16-condition reruns, anchor checks, greedy/random baselines, 60-min bridge, and 12/24-hour continuations on Windows A100 and Linux L40S), main penalties from total batch, depth, and width are largest at short budgets and relax later; within predeclared seeded families, factors D/A/B/C retain non-zero estimates after within-budget BH correction while E does not. Random search reaches strong incumbents but without attribution. The 60-min bridge shows lowest mean (though confounded with capacity), and 24-hour three-anchor runs favor the bridge on both hosts while non-bridge ordering is host-sensitive. The central recommendation is therefore bounded: use short designed screens to identify high-penalty directions, confirm anchors under repetition, and refine locally in the reduced space.

Significance. If the bounded empirical findings hold, the work supplies a practical, statistically controlled method for triaging candidate recipes under tight accelerator budgets before committing larger resources. Strengths include the explicit framing as a methods result tied to a fixed training loop and predeclared factors, the use of repeated anchor runs and within-budget multiple-testing correction, and direct probing of effect stability via the 60-min bridge and 24-hour continuations. It does not claim hardware-invariant rankings or general HPO superiority.

minor comments (3)
  1. [Abstract, §3] Abstract and §3: the exact definitions and level settings for the five predeclared factors (A–E) are referenced but not enumerated in the provided text; a concise table or appendix listing them would improve reproducibility.
  2. [§4, Table 2] §4 and Table 2: the precise implementation of the within-budget Benjamini-Hochberg correction (e.g., how p-values are pooled across the 5- and 10-minute screens) is described at high level; an explicit formula or pseudocode step would clarify whether the reported non-zero estimates for D/A/B/C survive the exact procedure used.
  3. [Figure 4, §5.2] Figure 4 and §5.2: the 24-hour continuation plots show host-sensitive ordering for non-bridge anchors, but axis scaling and error-bar conventions differ slightly between hosts; uniform formatting would aid direct visual comparison.
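
On the second comment, one plausible reading of "within-budget" is that each budget forms its own family and the Benjamini-Hochberg step-up rule is applied to that family's five factor p-values separately. The sketch below implements that reading. The level q = 0.05, the per-budget family definition, and all p-values are assumptions chosen for illustration. The paper's actual p-values and pooling rule are not given in the text reviewed here.

```python
def benjamini_hochberg(pvalues, q=0.05):
    """Apply the BH step-up rule to one family of p-values.

    pvalues: name -> p-value. Returns name -> True if the null is rejected.
    """
    ordered = sorted(pvalues.items(), key=lambda item: item[1])
    m = len(ordered)
    cutoff_rank = 0
    for rank, (_, p) in enumerate(ordered, start=1):
        if p <= rank * q / m:
            cutoff_rank = rank  # keep the largest rank that passes
    return {name: rank <= cutoff_rank
            for rank, (name, _) in enumerate(ordered, start=1)}


# Hypothetical p-values, one family per budget (assumed, not from the paper).
pvalues_by_budget = {
    "5min": {"A": 0.001, "B": 0.012, "C": 0.020, "D": 0.0005, "E": 0.40},
    "10min": {"A": 0.004, "B": 0.025, "C": 0.035, "D": 0.002, "E": 0.62},
}

decisions = {budget: benjamini_hochberg(family)
             for budget, family in pvalues_by_budget.items()}
for budget, family in decisions.items():
    print(budget, sorted(name for name, rejected in family.items() if rejected))
```

Pooling both budgets into one ten-test family would change the thresholds, which is why the referee's request for the exact procedure matters.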

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the clear summary of the manuscript, the positive assessment of its scope and limitations, and the recommendation for minor revision. No specific major comments are enumerated in the report.

Circularity Check

0 steps flagged

No significant circularity; empirical results self-contained


The manuscript reports results from a fixed experimental loop of 613 runs using predeclared factors, short screens, BH correction, repeated anchors, and bounded continuations. No equations, derivations, or predictions appear; all claims reduce to direct sample means and within-budget corrections on the observed data. No self-citations are load-bearing for the central bounded-methods recommendation. The work is self-contained against the described training loop and does not invoke fitted parameters renamed as predictions or uniqueness theorems.

Axiom & Free-Parameter Ledger

1 free parameter · 2 axioms · 0 invented entities

Abstract-only; ledger populated from stated experimental assumptions. Relies on standard statistical assumptions for factorial designs and multiple testing. No invented entities. Free parameters are the five labeled factors whose levels are chosen by the authors.

free parameters (1)
  • Factor levels for A, B, C, D, E
    Specific values of batch, depth, width and two additional factors chosen for the 32-condition space; fitted or selected to define the screen.
axioms (2)
  • domain assumption: Fractional-factorial design recovers stable early effect structure in autoregressive pretraining loops
    Invoked when claiming that short-budget penalties remain informative at longer budgets.
  • standard math: Benjamini-Hochberg correction controls false discoveries within predeclared seeded families
    Used to declare non-zero estimates for D, A, B, C at 5 and 10 minutes.

pith-pipeline@v0.9.1-grok · 5845 in / 1553 out tokens · 36923 ms · 2026-07-01T08:36:58.839578+00:00 · methodology


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Small Experiments, Cheaper Decisions: A Case Study in Staged Promotion for Micro-Pretraining

    cs.CL 2026-06 unverdicted novelty 4.0

    Case study applies frozen staged budgets and promotion rules to twelve micro-pretraining configurations, identifying a top bridge condition at 12 hours with 169 GPU-hours total versus higher counterfactual costs.
