Staged Factorial Screening for Budget-Constrained Micro-Pretraining

Felipe Chavarro Polania

arxiv: 2606.05186 · v1 · pith:W2F3KLRMnew · submitted 2026-04-27 · 💻 cs.LG · cs.CL

Pith reviewed 2026-07-01 08:36 UTC · model grok-4.3

keywords: micro-pretraining · factorial screening · budget-constrained search · staged experiments · effect estimation · anchor confirmation · hyperparameter triage

The pith

Staged fractional-factorial screens identify high-penalty directions early and support bridge anchors through 24 hours on two hosts in budget-constrained micro-pretraining.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether a staged workflow of short fractional-factorial screens can recover stable early effect estimates to triage candidate training recipes when accelerator time is limited. Hundreds of runs at 2 to 10 minutes on a fixed single-GPU loop show that penalties from total batch size, depth, and width are largest at the shortest durations and weaken as time increases. After correction, several factors retain detectable effects at 5 and 10 minutes while one does not, and random search finds competitive points but clusters in low-penalty regions without explaining why. Longer bridge runs at 60 minutes and seeded 12- and 24-hour continuations on two hosts place the bridge package lowest in mean loss, though the ordering of non-bridge options varies by host. The result is a bounded recommendation to screen briefly, confirm anchors repeatedly, and refine inside the reduced space rather than claim hardware-invariant rankings or broad hyperparameter-optimization superiority.

Core claim

On a fixed autoresearch-derived single-GPU training loop, 613 experiments show that main penalties from total batch, depth, and width are largest at short budgets and relax as budget increases. The experiments span pilot and follow-up screens at 2, 5, and 10 minutes, full 16-condition seeded reruns, targeted anchor checks, same-host baselines, a 60-minute bridge package, and bounded 12- and 24-hour three-anchor continuations on Windows A100 and Linux L40S hosts. Within the predeclared seeded full-screen families, factors D, A, B, and C retain non-zero estimates at 5 and 10 minutes after within-budget Benjamini-Hochberg correction while E does not. Random search reaches strong incumbents in the 32-condition space, but repeatedly in the same low-penalty region and without factor attribution.

What carries the argument

A staged fractional-factorial workflow: short designed screens estimate and remove high-penalty directions before longer budgets are committed to confirmation and local refinement.
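To make the design idea concrete, here is a minimal sketch of how a 16-condition screen can be carved out of a 32-condition two-level space over five factors. It assumes the screen is a half fraction with the textbook generator E = ABCD; the summary does not state the paper's actual generator or what the factor levels mean, so the names and the defining relation below are illustrative only.

```python
from itertools import product

# Factor labels follow the summary; their level meanings are not given there.
FACTORS = ["A", "B", "C", "D", "E"]

def full_factorial(factors):
    """All 2^k runs, with levels coded as -1 (low) and +1 (high)."""
    return [dict(zip(factors, levels))
            for levels in product((-1, 1), repeat=len(factors))]

def half_fraction(factors, generated="E"):
    """Keep runs where the generated factor equals the product of the others.

    Assumption: defining relation I = ABCDE (E = ABCD), a common choice for
    a 2^(5-1) design. The paper's generator may differ.
    """
    base = [f for f in factors if f != generated]
    runs = []
    for run in full_factorial(factors):
        sign = 1
        for f in base:
            sign *= run[f]
        if run[generated] == sign:
            runs.append(run)
    return runs

full = full_factorial(FACTORS)
screen = half_fraction(FACTORS)
print(len(full), len(screen))  # 32 16
```

Under this generator every factor appears at each level in exactly half of the 16 runs, which is what lets a short screen estimate each main effect from the whole batch of runs rather than from a single pair.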

If this is right

Main penalties from total batch, depth, and width are largest at short budgets and relax as budget increases.
Within predeclared seeded full-screen families, factors D, A, B, and C retain non-zero estimates at 5 and 10 minutes after within-budget Benjamini-Hochberg correction while E does not.
Random search reaches strong incumbents in the 32-condition space but repeatedly in the same low-penalty region and without factor attribution.
The 60-minute bridge package has the lowest mean, and in bounded 12- and 24-hour three-anchor continuations on both hosts the bridge has the lowest sample mean while non-bridge ordering stays host-sensitive.
The evidence supports a bridge-centered recommendation through 24 hours on two hosts, not hardware-invariant ranking or general hyperparameter-optimization superiority.
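The "penalties" and "non-zero estimates" in these statements are main-effect estimates from two-level designs. As a hedged illustration, the sketch below computes a main effect as the mean loss at the high level minus the mean loss at the low level. The three-factor design and the loss values are synthetic and chosen only to show the arithmetic; they are not the paper's data or its analysis code.

```python
from itertools import product
from statistics import mean

def main_effects(runs, losses, factors):
    """Main effect per factor: mean loss at +1 minus mean loss at -1."""
    effects = {}
    for f in factors:
        high = [y for r, y in zip(runs, losses) if r[f] == 1]
        low = [y for r, y in zip(runs, losses) if r[f] == -1]
        effects[f] = mean(high) - mean(low)
    return effects

# Synthetic illustration only: a full 2^3 design with made-up losses.
factors = ["A", "B", "C"]
runs = [dict(zip(factors, lv)) for lv in product((-1, 1), repeat=3)]
# Loss rises with A (coefficient 0.20) and B (0.05); C has no effect.
losses = [3.0 + 0.20 * r["A"] + 0.05 * r["B"] for r in runs]

effects = main_effects(runs, losses, factors)
for name, value in effects.items():
    print(name, round(abs(value), 3))
```

With -1/+1 coding the effect is twice the underlying coefficient, so A comes out near 0.40 and B near 0.10, while C stays at zero. Repeating the same calculation at each budget is one way to see a penalty relax as training time grows.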

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The observed relaxation of penalties with budget suggests that early screens could be reused across multiple model scales if the factor set is held constant.
Host-sensitive ordering at 24 hours implies that separate short screens per hardware class may be required for stable recommendations rather than a single universal ranking.
Because random search already locates strong points inside the low-penalty region, the added value of the factorial workflow lies mainly in the attribution step that guides later refinement.
Extending the same staged design to include interaction terms or additional factors such as optimizer choice would test whether the current main-effects focus remains sufficient at longer budgets.

Load-bearing premise

The autoresearch-derived single-GPU training loop and predeclared factor set recover stable early effect structure that remains informative when budget increases to 60 minutes and 24 hours.

What would settle it

If 24-hour repeated runs on both hosts show any non-bridge anchor achieving a lower sample mean than the bridge anchor, or if short-screen effect estimates fail to predict the ordering observed at 60 minutes and beyond, the bridge-centered recommendation would not hold.
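The first half of this criterion is a per-host comparison of sample means, which can be sketched as follows. The host keys, anchor names, and loss values are hypothetical placeholders, not results from the paper.

```python
from statistics import mean

def bridge_is_lowest(losses_by_host, bridge="bridge"):
    """Per host, report whether the bridge anchor has the lowest sample mean."""
    verdict = {}
    for host, anchors in losses_by_host.items():
        means = {name: mean(values) for name, values in anchors.items()}
        verdict[host] = min(means, key=means.get) == bridge
    return verdict

# Placeholder 24-hour losses; every number here is invented for illustration.
runs_24h = {
    "windows_a100": {
        "bridge": [3.10, 3.12],
        "anchor_x": [3.20, 3.18],
        "anchor_y": [3.25, 3.22],
    },
    "linux_l40s": {
        "bridge": [3.15, 3.14],
        "anchor_x": [3.24, 3.26],
        "anchor_y": [3.19, 3.21],
    },
}

verdict = bridge_is_lowest(runs_24h)
print(verdict, all(verdict.values()))
```

In this toy input the non-bridge ordering flips between hosts while the bridge stays lowest on both, which is the pattern the recommendation depends on. A single host returning False would count against it under the stated criterion.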

Figures

Figures reproduced from arXiv: 2606.05186 by Felipe Chavarro Polania.

**Figure 2.** In the seeded confirmation subset, variance is dominated by budget and condition struc…

**Figure 3.** The 60-minute bridge package keeps the reduced-space bridge best and the predeclared control worst. Points show seed runs; black markers show means with 95% confidence intervals. Descriptive cross-seed win counts in the same package: bridge_best < greedy: 16/16; screened_best < greedy: 1/16; bridge_best < control: 16/16; greedy < control: 16/16; bridge_best - greed…

**Figure 4.** Dual-host long-horizon anchor packages at…

Original abstract

Budget-constrained micro-pretraining often requires triaging many candidate recipes on a shared accelerator before larger search budgets are spent. We study whether a staged fractional-factorial workflow can recover stable early effect structure in this setting. On a fixed autoresearch-derived single-GPU training loop, we run 613 experiments across pilot and follow-up screens at 2, 5, and 10 minutes; full 16-condition seeded reruns at 5 and 10 minutes; targeted seeded anchor checks; same-host greedy and matched-cost random baselines; a 60-minute bridge package; and bounded Windows A100 and Linux L40S anchor continuations through 24 hours. Main penalties from total batch, depth, and width are largest at short budgets and relax as budget increases. Within the predeclared seeded full-screen families, D, A, B, and C retain non-zero estimates at 5 and 10 minutes after within-budget Benjamini-Hochberg correction, while E does not. Random search can reach strong incumbents in this 32-condition space, but repeatedly in the same low-penalty region and without factor attribution. The 60-minute bridge anchor has the lowest mean, although that package does not separate workflow refinement from the larger bridge model's capacity advantage. In bounded 12-hour and 24-hour three-anchor continuations on both hosts, the bridge has the lowest sample mean while the non-bridge ordering stays host-sensitive. We therefore present a bounded methods result: use short designed screens to identify high-penalty directions, confirm promising anchors under repeated runs, and refine locally inside the reduced space. The evidence supports a bridge-centered recommendation through 24 hours on two hosts, not hardware-invariant ranking or general hyperparameter-optimization superiority.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper shows short factorial screens can flag penalties in one micro-pretraining loop but stays narrow and host-sensitive with no general claims.

The main thing here is a bounded empirical demonstration: staged fractional-factorial screens at short budgets (2 to 10 minutes) recover non-zero effects for some factors after Benjamini-Hochberg correction, and those signals line up with longer 60-minute and 24-hour anchor runs on two hosts. The author runs 613 experiments, includes random and greedy baselines, and states explicitly that the result does not claim hardware-invariant rankings or broad HPO superiority.

What stands out is the care in the experimental design. They predeclare factors, run pilot and full screens, repeat anchors, and track how penalties relax with budget. The within-budget correction and the comparison of ordering stability across Windows A100 and Linux L40S add concrete detail that is often missing in hyperparameter papers. The 24-hour continuations give the bounded recommendation some grounding.

The soft spots are real but proportionate. The top bridge anchor mixes the screening workflow with a larger model capacity, so the performance edge cannot be cleanly attributed to the factorial method. Factor ordering changes between hosts, which limits how much the specific recommendations travel. Everything is locked to one autoresearch-derived single-GPU loop, so recovery of early effect structure is shown only inside that setup. No new math or first-principles result appears.

This is for practitioners doing budget-constrained micro-pretraining on single GPUs who need a structured alternative to pure random search. A reader already working in that narrow regime will find the workflow and the host-sensitivity checks useful. The paper deserves peer review because the experiments are thorough for the stated scope, the limitations are flagged rather than hidden, and the central claim stays within what the data can support.

Referee Report

0 major / 3 minor

Summary. The paper claims that a staged fractional-factorial screening workflow can recover stable early effect structure for budget-constrained micro-pretraining on a fixed single-GPU autoresearch loop. Across 613 experiments (pilot/follow-up screens at 2/5/10 min, full 16-condition reruns, anchor checks, greedy/random baselines, 60-min bridge, and 12/24-hour continuations on Windows A100 and Linux L40S), main penalties from total batch, depth, and width are largest at short budgets and relax later; within predeclared seeded families, factors D/A/B/C retain non-zero estimates after within-budget BH correction while E does not. Random search reaches strong incumbents but without attribution. The 60-min bridge shows lowest mean (though confounded with capacity), and 24-hour three-anchor runs favor the bridge on both hosts while non-bridge ordering is host-sensitive. The central recommendation is therefore bounded: use short designed screens to identify high-penalty directions, confirm anchors under repetition, and refine locally in the reduced space.

Significance. If the bounded empirical findings hold, the work supplies a practical, statistically controlled method for triaging candidate recipes under tight accelerator budgets before committing larger resources. Strengths include the explicit framing as a methods result tied to a fixed training loop and predeclared factors, the use of repeated anchor runs and within-budget multiple-testing correction, and direct probing of effect stability via the 60-min bridge and 24-hour continuations. It does not claim hardware-invariant rankings or general HPO superiority.

minor comments (3)

[Abstract, §3] Abstract and §3: the exact definitions and level settings for the five predeclared factors (A–E) are referenced but not enumerated in the provided text; a concise table or appendix listing them would improve reproducibility.
[§4, Table 2] §4 and Table 2: the precise implementation of the within-budget Benjamini-Hochberg correction (e.g., how p-values are pooled across the 5- and 10-minute screens) is described at high level; an explicit formula or pseudocode step would clarify whether the reported non-zero estimates for D/A/B/C survive the exact procedure used.
[Figure 3, §5.2] Figure 3 and §5.2: the 24-hour continuation plots show host-sensitive ordering for non-bridge anchors, but axis scaling and error-bar conventions differ slightly between hosts; uniform formatting would aid direct visual comparison.
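For the second comment, the standard Benjamini-Hochberg step-up procedure can be sketched as below. This is the textbook algorithm, not the manuscript's implementation: the level q = 0.05, the reading of "within-budget" as one separate family per budget, and the p-values are all assumptions made for illustration.

```python
def benjamini_hochberg(pvalues, q=0.05):
    """Step-up BH: return one reject flag per hypothesis, in input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    # Find the largest 1-based rank k with p_(k) <= k * q / m.
    cutoff = 0
    for rank, idx in enumerate(order, start=1):
        if pvalues[idx] <= rank * q / m:
            cutoff = rank
    # Reject every hypothesis ranked at or below that cutoff.
    reject = [False] * m
    for idx in order[:cutoff]:
        reject[idx] = True
    return reject

# Made-up p-values for one budget's family of five main effects.
pvals = {"A": 0.004, "B": 0.012, "C": 0.030, "D": 0.001, "E": 0.400}
flags = benjamini_hochberg(list(pvals.values()), q=0.05)
print(dict(zip(pvals, flags)))
# {'A': True, 'B': True, 'C': True, 'D': True, 'E': False}
```

Under the within-budget reading, the function would be called once on the 5-minute family and once on the 10-minute family, with no pooling of p-values across budgets. Whether the manuscript does exactly this is the point the comment asks the author to make explicit.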

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the clear summary of the manuscript, the positive assessment of its scope and limitations, and the recommendation for minor revision. No specific major comments are enumerated in the report.

Circularity Check

0 steps flagged

No significant circularity; empirical results self-contained

The manuscript reports results from a fixed experimental loop of 613 runs using predeclared factors, short screens, BH correction, repeated anchors, and bounded continuations. No equations, derivations, or predictions appear; all claims reduce to direct sample means and within-budget corrections on the observed data. No self-citations are load-bearing for the central bounded-methods recommendation. The work is self-contained against the described training loop and does not invoke fitted parameters renamed as predictions or uniqueness theorems.

Axiom & Free-Parameter Ledger

1 free parameter · 2 axioms · 0 invented entities

Abstract-only; ledger populated from stated experimental assumptions. Relies on standard statistical assumptions for factorial designs and multiple testing. No invented entities. Free parameters are the five labeled factors whose levels are chosen by the authors.

free parameters (1)

Factor levels for A, B, C, D, E
Specific values of batch, depth, width and two additional factors chosen for the 32-condition space; fitted or selected to define the screen.

axioms (2)

domain assumption: Fractional factorial design recovers stable early effect structure in autoregressive pretraining loops
Invoked when claiming that short-budget penalties remain informative at longer budgets.
standard math: Benjamini-Hochberg correction controls false discoveries within predeclared seeded families
Used to declare non-zero estimates for D, A, B, C at 5 and 10 minutes.

pith-pipeline@v0.9.1-grok · 5845 in / 1553 out tokens · 36923 ms · 2026-07-01T08:36:58.839578+00:00 · methodology

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Small Experiments, Cheaper Decisions: A Case Study in Staged Promotion for Micro-Pretraining
cs.CL · 2026-06 · unverdicted · novelty 4.0

Case study applies frozen staged budgets and promotion rules to twelve micro-pretraining configurations, identifying a top bridge condition at 12 hours with 169 GPU-hours total versus higher counterfactual costs.

Reference graph

Works this paper leans on

18 extracted references · 8 canonical work pages · cited by 1 Pith paper · 5 internal anchors

[1] Andrej Karpathy. autoresearch. GitHub repository. Available at: https://github.com/karpathy/autoresearch

[2] SkyPilot documentation. Parallel autoresearch. Available at: https://docs.skypilot.co/en/latest/examples/agents/autoresearch.html [3] Predictable Scale: Part I, Step Law -- Optimal Hyperparameter Scaling Law in Large Language Model Pretraining. arXiv:2503.04715. https://arxiv.org/abs/2503.04715 [4] Principled Architecture-aware Scaling of Hyperparameters. a...

[3] James Bergstra and Yoshua Bengio. Random Search for Hyper-Parameter Optimization. Journal of Machine Learning Research, 13(10):281-305, 2012. https://jmlr.org/beta/papers/v13/bergstra12a.html

[4] Frank Hutter, Holger H. Hoos, and Kevin Leyton-Brown. An Efficient Approach for Assessing Hyperparameter Importance. Proceedings of the 31st International Conference on Machine Learning, PMLR 32, 2014. https://proceedings.mlr.press/v32/hutter14.html

[5] Lisha Li, Kevin Jamieson, Giulia DeSalvo, Afshin Rostamizadeh, and Ameet Talwalkar. Hyperband: A Novel Bandit-Based Approach to Hyperparameter Optimization. Journal of Machine Learning Research, 18(185):1-52, 2018. https://jmlr.org/beta/papers/v18/16-558.html

[6] Stefan Falkner, Aaron Klein, and Frank Hutter. BOHB: Robust and Efficient Hyperparameter Optimization at Scale. Proceedings of the 35th International Conference on Machine Learning, PMLR 80, 2018. https://proceedings.mlr.press/v80/falkner18a.html

[7] Jasper Snoek, Hugo Larochelle, and Ryan P. Adams. Practical Bayesian Optimization of Machine Learning Algorithms. Advances in Neural Information Processing Systems 25, 2012. https://proceedings.neurips.cc/paper/2012/hash/05311655a15b75fab86956663e1819cd-Abstract.html

[8] Douglas C. Montgomery. Design and Analysis of Experiments. Wiley, 10th edition, 2019.

[9] George E. P. Box, J. Stuart Hunter, and William G. Hunter. Statistics for Experimenters: Design, Innovation, and Discovery. Wiley, 2nd edition, 2005.

[10] C. F. Jeff Wu and Michael Hamada. Experiments: Planning, Analysis, and Optimization. Wiley, 2nd edition, 2009.

[11] George E. P. Box and K. B. Wilson. On the Experimental Attainment of Optimum Conditions. Journal of the Royal Statistical Society, Series B, 13(1):1-45, 1951.

[12] Raymond H. Myers, Douglas C. Montgomery, and Christine M. Anderson-Cook. Response Surface Methodology: Process and Product Optimization Using Designed Experiments. Wiley, 4th edition, 2016.

[13] Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, Tom Hennigan, Eric Noland, Katie Millican, George van den Driessche, Bogdan Damoc, Aurelia Guy, Simon Osindero, Karen Simonyan, Erich Elsen, Jack W. Rae, Oriol Vinyals, and Laurent Sifre... Training Compute-Optimal Large Language Models. 2022.

[14] Max Jaderberg, Valentin Dalibard, Simon Osindero, Wojciech M. Czarnecki, Jeff Donahue, Ali Razavi, Oriol Vinyals, Tim Green, Iain Dunning, Karen Simonyan, Chrisantha Fernando, and Koray Kavukcuoglu. Population Based Training of Neural Networks. arXiv:1711.09846, 2017. https://arxiv.org/abs/1711.09846

[15] Liam Li, Kevin Jamieson, Afshin Rostamizadeh, Ekaterina Gonina, Moritz Hardt, Benjamin Recht, and Ameet Talwalkar. A System for Massively Parallel Hyperparameter Tuning. arXiv:1810.05934, 2018. https://arxiv.org/abs/1810.05934

[16] Leslie N. Smith. Cyclical Learning Rates for Training Neural Networks. IEEE Winter Conference on Applications of Computer Vision, 2017. arXiv:1506.01186. https://arxiv.org/abs/1506.01186

[17] Leslie N. Smith and Nicholay Topin. Super-Convergence: Very Fast Training of Neural Networks Using Large Learning Rates. Proceedings of SPIE 11006, Artificial Intelligence and Machine Learning for Multi-Domain Operations Applications, 2019. arXiv:1708.07120. https://arxiv.org/abs/1708.07120

[18] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention Is All You Need. arXiv:1706.03762, 2017. https://arxiv.org/abs/1706.03762
