Scaling Laws for Neural Language Models
Pith reviewed 2026-05-24 15:29 UTC · model grok-4.3
The pith
Neural language model loss scales as a power law with model size, data size, and training compute.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Cross-entropy loss L follows power laws in model size N, dataset size D, and training compute C: L(N) ∝ N^(-α_N), L(D) ∝ D^(-α_D), and L(C) ∝ C^(-α_C), with some trends spanning more than seven orders of magnitude. Architectural details such as width and depth have minimal effect within a wide range. The same relations determine the optimal allocation of a fixed compute budget, the sample efficiency of a model, and the point at which training should stop.
What carries the argument
Empirical power-law fits that relate loss directly to model size, dataset size, and compute.
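As an illustration of what such a fit involves, the sketch below recovers a power-law exponent by linear regression in log-log space. It is a minimal sketch, not the paper's fitting procedure: the helper name `fit_power_law` is hypothetical, and the data are synthetic, noise-free points generated from the L(N) constants quoted further down this page (α_N ≈ 0.076, N_c ≈ 8.8×10^13), not measurements.

```python
import numpy as np

def fit_power_law(x, loss):
    """Fit loss ≈ (x_c / x) ** alpha by least squares in log-log space.

    log(loss) = alpha * log(x_c) - alpha * log(x), so the slope is -alpha
    and the intercept is alpha * log(x_c).
    """
    slope, intercept = np.polyfit(np.log(x), np.log(loss), 1)
    alpha = -slope
    x_c = np.exp(intercept / alpha)
    return alpha, x_c

# Synthetic data (assumption): generated from the constants quoted on this
# page for L(N); these are NOT measurements from the paper.
n_params = np.logspace(5, 9, 9)
synthetic_loss = (8.8e13 / n_params) ** 0.076

alpha_hat, n_c_hat = fit_power_law(n_params, synthetic_loss)
print(f"alpha ≈ {alpha_hat:.3f}, N_c ≈ {n_c_hat:.2e}")
```

On real measurements the points would scatter around the line, and the residuals of this regression are what a reader would inspect to judge where the power law stops holding.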
If this is right
- For any fixed compute budget the lowest loss is achieved by training a very large model on a relatively small dataset and stopping well before convergence.
- Larger models require fewer training examples to reach a given loss level.
- The amount of overfitting is governed by a simple function of model size and dataset size.
- Training speed itself follows a predictable dependence on model size alone.
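The overfitting bullet can be made concrete with the joint form L(N, D) quoted later on this page. The sketch below evaluates it for a fixed dataset and growing model size. The exponents (0.076, 0.103) and N_c come from this page; the value of `d_c` is a placeholder assumed for illustration and does not appear here, and the `overfitting_penalty` helper is a hypothetical name for one way of measuring the effect.

```python
def joint_loss(n, d, n_c=8.8e13, d_c=5.4e13, alpha_n=0.076, alpha_d=0.103):
    """L(N, D) = [(N_c / N)^(alpha_N / alpha_D) + D_c / D]^alpha_D.

    d_c is an assumed placeholder; the other constants are quoted on this page.
    """
    return ((n_c / n) ** (alpha_n / alpha_d) + d_c / d) ** alpha_d

def overfitting_penalty(n, d):
    """Fractional loss increase relative to the same model with unlimited data."""
    infinite_data_loss = joint_loss(n, float("inf"))
    return joint_loss(n, d) / infinite_data_loss - 1.0

dataset_tokens = 1e9  # arbitrary fixed dataset size for illustration
for n in (1e6, 1e7, 1e8, 1e9):
    print(f"N={n:.0e}  L={joint_loss(n, dataset_tokens):.3f}  "
          f"penalty={overfitting_penalty(n, dataset_tokens):.3f}")
```

Under this form the penalty grows as N increases at fixed D, which is the sense in which overfitting depends on model size and dataset size together.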
Where Pith is reading between the lines
- The same scaling relations could be tested on non-language tasks to check whether the exponents are domain-specific.
- If the laws remain accurate at still larger scales they would let researchers forecast the loss of a model before any training begins.
- The preference for large models on modest data shifts the economic trade-off between hardware and data collection.
Load-bearing premise
The power-law trends measured inside the tested range of sizes will continue to hold when models and datasets grow much larger.
What would settle it
Training a model whose parameter count lies an order of magnitude beyond the largest model studied and finding that its achieved loss lies well outside the band predicted by the fitted power laws.
Original abstract
We study empirical scaling laws for language model performance on the cross-entropy loss. The loss scales as a power-law with model size, dataset size, and the amount of compute used for training, with some trends spanning more than seven orders of magnitude. Other architectural details such as network width or depth have minimal effects within a wide range. Simple equations govern the dependence of overfitting on model/dataset size and the dependence of training speed on model size. These relationships allow us to determine the optimal allocation of a fixed compute budget. Larger models are significantly more sample-efficient, such that optimally compute-efficient training involves training very large models on a relatively modest amount of data and stopping significantly before convergence.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript reports empirical scaling laws showing that cross-entropy loss for neural language models follows power-law dependence on model size N, dataset size D, and compute C, with some trends spanning more than seven orders of magnitude. Architectural details such as width or depth have minimal effects within the tested range. Simple equations describe overfitting and training speed, which are used to derive optimal allocation of a fixed compute budget, favoring training of very large models on modest data and stopping before convergence.
Significance. If the observed power laws and derived allocations hold, the work supplies a quantitative basis for predicting performance and optimizing training efficiency across scales, with the broad empirical coverage (N up to ~10^9, C up to ~10^23 FLOPs) constituting a clear strength for guiding resource allocation in large-model development.
major comments (2)
- [§6, Eq. (6.3)–(6.5)] The optimal N*(C) and D*(C) are obtained by minimizing the fitted loss L(N,D) using the exponents reported in §3–4 (e.g., N^{-0.076}, D^{-0.103}, C^{-0.050}). Because these formulas are applied to budgets 10–100× beyond the measured range, the central claim that 'optimally compute-efficient training involves training very large models on a relatively modest amount of data' requires explicit bounds or sensitivity analysis on how deviations from power-law behavior (noted at low N/D) or a change of regime would shift the predicted minimum.
- [§3–4] The power-law fits are reported to be good within the observed range, yet the manuscript notes small deviations at low N/D. The load-bearing step of extrapolating these same functional forms to derive the compute-efficiency optimum in §6 would be strengthened by a quantitative propagation of fit residuals or by hold-out validation at the largest scales tested.
minor comments (1)
- The notation for the loss function and the precise definition of compute C should be introduced with an equation number in the main text before the scaling plots are presented.
Simulated Author's Rebuttal
We thank the referee for the positive assessment of the empirical coverage and for the constructive comments on extrapolation. We respond to each major comment below.
Point-by-point responses
-
Referee: [§6, Eq. (6.3)–(6.5)] The optimal N*(C) and D*(C) are obtained by minimizing the fitted loss L(N,D) using the exponents reported in §3–4 (e.g., N^{-0.076}, D^{-0.103}, C^{-0.050}). Because these formulas are applied to budgets 10–100× beyond the measured range, the central claim that 'optimally compute-efficient training involves training very large models on a relatively modest amount of data' requires explicit bounds or sensitivity analysis on how deviations from power-law behavior (noted at low N/D) or a change of regime would shift the predicted minimum.
Authors: We agree that the central claim in §6 rests on extrapolation. The noted deviations from power-law scaling occur at low N/D; the derived optima lie well outside that regime. In the revised manuscript we will add an explicit sensitivity analysis that varies the fitted exponents within their reported uncertainties and recomputes N*(C) and D*(C) to quantify how the location of the minimum shifts.
Revision: yes
-
Referee: [§3–4] The power-law fits are reported to be good within the observed range, yet the manuscript notes small deviations at low N/D. The load-bearing step of extrapolating these same functional forms to derive the compute-efficiency optimum in §6 would be strengthened by a quantitative propagation of fit residuals or by hold-out validation at the largest scales tested.
Authors: The manuscript already reports fit quality metrics and residuals for the power-law regimes in §3–4. To further support the extrapolation step, the revised version will include a quantitative propagation of the fit residuals into the uncertainty of the derived N*(C) and D*(C) curves.
Revision: yes
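A sensitivity analysis of the kind discussed in these responses could take roughly the following shape. This is a sketch under stated assumptions, not the authors' analysis: it uses the joint form L(N, D) quoted elsewhere on this page with the constraint C = 6ND, a placeholder `d_c` that does not appear here, an arbitrary compute budget, and hypothetical ±5% perturbations of the exponents, since the page does not report the actual fit uncertainties. The function names are illustrative.

```python
import numpy as np

def joint_loss(n, d, alpha_n, alpha_d, n_c=8.8e13, d_c=5.4e13):
    """L(N, D) = [(N_c/N)^(alpha_N/alpha_D) + D_c/D]^alpha_D (d_c is assumed)."""
    return ((n_c / n) ** (alpha_n / alpha_d) + d_c / d) ** alpha_d

def optimal_n(compute, alpha_n, alpha_d):
    """Grid-search the N that minimizes L(N, D) under the constraint C = 6 N D."""
    n_grid = np.logspace(3, 15, 4001)
    d_grid = compute / (6.0 * n_grid)  # D is fixed by the budget once N is chosen
    losses = joint_loss(n_grid, d_grid, alpha_n, alpha_d)
    return n_grid[np.argmin(losses)]

compute = 1e21  # FLOPs; arbitrary budget chosen for illustration
baseline = optimal_n(compute, 0.076, 0.103)

# Hypothetical +/-5% perturbations of each exponent, holding N_c and D_c fixed.
for scale_n in (0.95, 1.0, 1.05):
    for scale_d in (0.95, 1.0, 1.05):
        n_star = optimal_n(compute, 0.076 * scale_n, 0.103 * scale_d)
        print(f"alpha_N={0.076 * scale_n:.4f}  alpha_D={0.103 * scale_d:.4f}  "
              f"N*={n_star:.2e}  ratio_to_baseline={n_star / baseline:.2f}")
```

Holding N_c and D_c fixed while moving the exponents is a simplification; a fuller analysis would refit all constants jointly, because the fitted scales and exponents are correlated.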
Circularity Check
Empirical scaling laws from direct experimental fits; optimal allocation is a derived consequence, not a reduction to inputs.
Full rationale
The paper reports direct experimental measurements of cross-entropy loss across model sizes N, dataset sizes D, and compute C (spanning >7 orders of magnitude), then fits power-law forms L(N), L(D), and L(C) to those data points in sections 3-4. The optimal allocation rules in section 6 are obtained by analytically minimizing the fitted functional forms subject to a compute constraint C = 6ND; this is a straightforward mathematical consequence of the empirical fits rather than a self-definitional loop or a 'prediction' that is statistically forced to match the input data. No self-citations, imported uniqueness theorems, or ansatzes are invoked to justify the central claims. The work is therefore self-contained against its own experimental benchmarks within the measured regime.
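For reference, the constrained minimization described here can be worked through in a few lines. This is a sketch under the simplified reading given on this page (the joint form L(N, D) quoted in the Lean theorems section, with D eliminated via C = 6ND). It is not a reproduction of the paper's §6, whose derivation may rest on different fitted quantities and so need not give the same exponents. The shorthand p and the auxiliary function f are introduced here for convenience. Because x ↦ x^{α_D} is increasing, minimizing f also minimizes L.

```latex
\begin{align}
L(N, D) &= \left[ \left(\frac{N_c}{N}\right)^{p} + \frac{D_c}{D} \right]^{\alpha_D},
  \qquad p \equiv \frac{\alpha_N}{\alpha_D}, \\
f(N) &\equiv \left(\frac{N_c}{N}\right)^{p} + \frac{6 D_c N}{C}
  \qquad \text{(substituting } D = C / (6N) \text{)}, \\
\frac{df}{dN} &= -\frac{p\, N_c^{p}}{N^{p+1}} + \frac{6 D_c}{C} = 0
  \;\Longrightarrow\;
  N^{*}(C) = \left( \frac{p\, N_c^{p}\, C}{6 D_c} \right)^{\frac{1}{1+p}}, \\
N^{*}(C) &\propto C^{\frac{\alpha_D}{\alpha_N + \alpha_D}},
  \qquad
D^{*}(C) = \frac{C}{6 N^{*}(C)} \propto C^{\frac{\alpha_N}{\alpha_N + \alpha_D}} .
\end{align}
```

With the exponents quoted in the referee report (α_N = 0.076, α_D = 0.103), this simplified form gives N* ∝ C^0.58 and D* ∝ C^0.42, so the model-size share of the budget grows faster than the data share.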
Axiom & Free-Parameter Ledger
free parameters (4)
- power-law exponent for model size
- power-law exponent for dataset size
- power-law exponent for compute
- scaling equation coefficients
axioms (2)
- domain assumption: Loss follows a power-law functional form in model size, data, and compute
- domain assumption: Architectural details such as width and depth have minimal effects within wide ranges
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation · washburn_uniqueness_aczel · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
The loss scales as a power-law with model size, dataset size, and the amount of compute... L(N)=(N_c/N)^α_N ; α_N∼0.076, N_c∼8.8×10^13
-
IndisputableMonolith/Foundation/RealityFromDistinction · reality_from_one_distinction · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
L(N,D)=[(N_c/N)^(α_N/α_D)+D_c/D]^α_D
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 60 Pith papers
-
Chain-of-Thought Prompting Elicits Reasoning in Large Language Models
Chain-of-thought prompting, by including intermediate reasoning steps in few-shot examples, elicits strong reasoning abilities in large language models on arithmetic, commonsense, and symbolic tasks.
-
An Open-Source Training Dataset for Foundation Models for Black-box Optimization
BBO-Pile is the first large-scale open dataset of real optimization trajectories used to train and scale foundation models that imitate black-box optimization methods.
-
The Economics of Model Collapse: Equilibrium, Welfare, and Optimal Provenance Subsidies in Synthetic Data Markets
Introduces the Synthetic Data Contamination Equilibrium and derives closed-form optimal provenance subsidies s* = KL(q||p)/(2 kappa) plus watermark strengths to mitigate model collapse, validated by OLS matching struc...
-
Tokens-per-Parameter Coverage Is Critical for Robust LLM Scaling Law Extrapolation
Fixed tokens-per-parameter ratios in scaling law experiments induce ill-conditioned least-squares fits due to Jacobian geometry, making scale coefficients unidentifiable and extrapolations unreliable; diverse TPP cove...
-
Quantum-enhanced Large Language Models on Quantum Hardware via Cayley Unitary Adapters
Cayley unitary adapters executed on real quantum hardware improve LLM perplexity by 1.4% on Llama 3.1 8B with 6000 parameters and recover 83% of compression-induced degradation on SmolLM2.
-
Nearly Optimal Attention Coresets
ε-coresets for attention exist of size O(√d e^{ρ+o(ρ)}/ε) for unit-norm keys/values and queries of norm ≤ρ, nearly matching the Ω(√d e^ρ/ε) lower bound.
-
Efficient Training on Multiple Consumer GPUs with RoundPipe
RoundPipe achieves near-zero-bubble pipeline parallelism for LLM training on consumer GPUs by dynamically dispatching computation stages round-robin, yielding 1.48-2.16x speedups and enabling 235B model fine-tuning on...
-
The Query Channel: Information-Theoretic Limits of Masking-Based Explanations
Masking-based explanations are governed by the information capacity of the query channel, with reliable recovery achievable below capacity via sparse maximum-likelihood decoding but impossible above it.
-
The Spectral Lifecycle of Transformer Training: Transient Compression Waves, Persistent Spectral Gradients, and the Q/K--V Asymmetry
Transformer weight spectra exhibit transient compression waves that propagate layer-wise, persistent non-monotonic depth gradients in power-law exponents, and Q/K-V asymmetry, with the spectral exponent alpha predicti...
-
Low-Dimensional and Transversely Curved Optimization Dynamics in Grokking
Grokking reflects escape from a metastable low-dimensional regime where transverse curvature accumulates before generalization, with subspace motion necessary but curvature boost insufficient.
-
Evaluating Large Language Models in Scientific Discovery
The SDE benchmark shows LLMs lag on scientific discovery tasks relative to general science tests, with diminishing scaling returns and shared weaknesses across models.
-
Understanding In-Context Learning on Structured Manifolds: Bridging Attention to Kernel Methods
Transformers perform kernel-based prediction for Hölder regression on manifolds and achieve intrinsic-dimension-dependent minimax rates with sufficient training tasks.
-
Privacy Amplification in Differentially Private Zeroth-Order Optimization with Hidden States
Introduces hybrid noise and novel coupling analysis to achieve the first convergent hidden-state DP bound for zeroth-order optimization.
-
Large Language Diffusion Models
LLaDA is a scalable diffusion-based language model that matches autoregressive LLMs like LLaMA3 8B on tasks and surpasses GPT-4o on reversal poem completion.
-
ErrorRadar: Benchmarking Complex Mathematical Reasoning of Multimodal Large Language Models Via Error Detection
ErrorRadar is a new benchmark of 2,500 multimodal K-12 math problems for MLLM error step identification and categorization, where GPT-4o trails human experts by ~10%.
-
Learning to (Learn at Test Time): RNNs with Expressive Hidden States
TTT layers treat the hidden state as a trainable model updated at test time, allowing linear-complexity sequence models to scale perplexity reduction with context length unlike Mamba.
-
KAN: Kolmogorov-Arnold Networks
KANs with learnable univariate spline activations on edges achieve better accuracy than MLPs with fewer parameters, faster scaling, and direct visualization for scientific discovery.
-
TinyStories: How Small Can Language Models Be and Still Speak Coherent English?
Tiny language models under 10M parameters trained on a synthetic children's story dataset generate fluent, consistent, multi-paragraph English text with near-perfect grammar and reasoning.
-
Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling
Pythia releases 16 identically trained LLMs with full checkpoints and data tools to study training dynamics, scaling, memorization, and bias in language models.
-
Discovering Language Model Behaviors with Model-Written Evaluations
Language models can automatically generate high-quality evaluation datasets that reveal new cases of inverse scaling, sycophancy, and concerning goal-seeking behaviors, including some worsened by RLHF.
-
The Pile: An 800GB Dataset of Diverse Text for Language Modeling
The Pile is a newly constructed 825 GiB dataset from 22 diverse sources that enables language models to achieve better performance on academic, professional, and cross-domain tasks than models trained on Common Crawl ...
-
Tokenisation via Convex Relaxations
ConvexTok uses convex relaxation of tokenization to a linear program, improving intrinsic metrics, bits-per-byte, and some downstream tasks while certifying near-optimality within 1% at typical vocabulary sizes.
-
Forecasting Scientific Progress with Artificial Intelligence
Introduces the CUSP benchmark across 4760 events and finds frontier AI models can pick plausible directions but fail to predict whether or when scientific advances will occur, with performance varying by domain and in...
-
Is Capability a Liability? More Capable Language Models Make Worse Forecasts When It Matters Most
More capable LLMs produce worse distributional forecasts on superlinear growth time series with tail risks of regime change, with the error concentrated in the upper tail; this reverses on conventional threshold metrics.
-
Uniform-in-Time Weak Propagation-of-Chaos in Shallow Neural Networks
Finite-width shallow networks remain within poly(d) m^{-min(1,c/6)} of their mean-field limit uniformly in time when mean-field excess loss decays as t^{-c} under standard regularity and an integral condition on the loss.
-
Closed Loop Dynamic Driving Data Mixture for Real-Synthetic Co-Training
AutoScale is a closed-loop data engine using Graph-RAE for scene representation and Cluster-GA for importance-based retrieval to improve real-synthetic co-training for autonomous driving.
-
RankE: End-to-End Post-Training for Discrete Text-to-Image Generation with Decoder Co-Evolution
RankE co-evolves AR policy and decoder via alternating ranking optimization, improving both FID and CLIP scores on LlamaGen-XL and Janus-Pro where policy-only RL degrades FID.
-
Provable Joint Decontamination for Benchmarking Multiple Large Language Models
JECS aggregates per-model conformal p-values via their maximum and reconstructs a conservative envelope of the max-p null distribution to select benchmarks with global contamination rate control.
-
LOSCAR-SGD: Local SGD with Communication-Computation Overlap and Delay-Corrected Sparse Model Averaging
LOSCAR-SGD combines local updates, sparse model averaging, and communication-computation overlap with a delay-corrected merge rule, providing convergence rates for smooth non-convex objectives under worker heterogeneity.
-
Trusted Weights, Treacherous Optimizations? Optimization-Triggered Backdoor Attacks on LLMs
Compilation optimizations can be exploited to create stealthy backdoors in LLMs that remain dormant without optimization but achieve ~90% attack success while preserving clean accuracy near 100%.
-
PilotWiMAE: Pilot-Native Representation Learning for Wireless Channels
PilotWiMAE pretrains an encoder on noisy pilots with factorized attention, 99% masking, patch-normalized reconstruction, scale loss, and AWGN curriculum to outperform supervised baselines in cross-frequency beam selec...
-
The Economics of AI Inference: Inflation Dynamics, Welfare Costs, and Optimal Monetary Policy under the Inference-Cost Phillips Curve
Develops the Inference-Cost Phillips Curve linking AI inference costs to inflation dynamics, derives structural slopes and optimal monetary policy, and reports empirical estimates from US and G7 data that align with t...
-
JanusPipe: Efficient Pipeline Parallel Training for Machine Learning Interatomic Potentials
JanusPipe introduces SymFold and WaveK to enable efficient 3D-parallel training for conservative MLIPs, reporting 1.51x and 1.45x average throughput gains over 1F1B and Hanayo baselines on 32 GPUs.
-
Ringmaster LMO: Asynchronous Linear Minimization Oracle Momentum Method
Ringmaster LMO extends delay-thresholding from ASGD to LMO-based momentum updates, providing convergence guarantees under (L0, L1)-smoothness and time-complexity bounds that recover optimal rates in the Euclidean case.
-
Symmetry-Compatible Principle for Optimizer Design: Embeddings, LM Heads, SwiGLU MLPs, and MoE Routers
Proposes equivariant optimizers matched to the symmetry groups of embeddings, SwiGLU projections and MoE routers, with experiments showing consistent gains over AdamW on language model pre-training.
-
A Data-Efficient Path to Multilingual LLMs: Language Expansion via Post-training PARAMΔ Integration into Upcycled MoE
PARAMΔ upcycles dense models to MoE for per-language experts and grafts post-training deltas to enable data-efficient language expansion while preserving original capabilities.
-
SNLP: Layer-Parallel Inference via Structured Newton Corrections
SNLP enables layer-parallel Transformer inference by replacing sequential layer execution with structured Newton corrections and SNLP-aware training regularization, yielding up to 2.3x wall-clock speedup on 0.5B model...
-
PEIRA: Learning Predictive Encoders through Inter-View Regressor Alignment
PEIRA learns predictive encoders by optimizing the trace of the optimal inter-view linear regressor, with only nontrivial global minimizers as stable equilibria that recover leading nonlinear canonical correlation subspaces.
-
Scale-Dependent Collective Adaptation in Self-Amending LLM Societies: A Cross-Family Study of Emergent Governance
LLM societies in Nomic show non-monotonic collective adaptation peaking at mid-scales, with smaller models rule-inert and larger ones restrictive.
-
Olivia: Harmonizing Time Series Foundation Models with Power Spectral Density
Olivia harmonizes time series datasets via normalized power spectral density using a Harmonizer module and resonator-based HarmonicAttention, achieving state-of-the-art zero-shot, few-shot, and full-shot forecasting o...
-
Characterizing Learning in Deep Neural Networks using Tractable Algorithmic Complexity Analysis
QuBD extends algorithmic complexity estimation to quantized DNN weights, revealing that complexity decreases during learning, increases with overfitting, follows grokking patterns, and correlates with generalization.
-
How to Scale Mixture-of-Experts: From muP to the Maximally Scale-Stable Parameterization
The authors derive a Maximally Scale-Stable Parameterization (MSSP) for MoE models that achieves robust learning-rate transfer and monotonic performance gains with scale across co-scaling regimes of width, experts, an...
-
Do Language Models Align with Brains? Prediction Scores Are Not Enough
Language model representations fail all L-PACT alignment gates once controls explain the apparent predictive and relational effects.
-
Scaling Laws for Mixture Pretraining Under Data Constraints
Repetition-aware scaling laws show scarce target data in pretraining mixtures can be repeated 15-20 times optimally, with the best count depending on data size, compute, and model scale.
-
MindVLA-U1: VLA Beats VA with Unified Streaming Architecture for Autonomous Driving
MindVLA-U1 introduces a unified streaming VLA with shared backbone, framewise memory, and language-guided action diffusion that surpasses human drivers on WOD-E2E planning metrics.
-
AutoLLMResearch: Training Research Agents for Automating LLM Experiment Configuration - Learning from Cheap, Optimizing Expensive
AutoLLMResearch trains agents via a multi-fidelity environment and MDP pipeline to extrapolate configuration principles from inexpensive to costly LLM experiments.
-
Uniform Scaling Limits in AdamW-Trained Transformers
AdamW-trained transformer hidden states and backpropagated variables converge uniformly in L2 to a forward-backward ODE system (McKean-Vlasov when non-causal) at rate O(L^{-1}+L^{-1/3}H^{-1/2}) as depth L and heads H ...
-
Learning Less Is More: Premature Upper-Layer Attention Specialization Hurts Language Model Pretraining
Temporarily reducing the learning rate on upper-layer query and key projections during early GPT pretraining prevents premature attention specialization and improves model performance.
-
Sharp feature-learning transitions and Bayes-optimal neural scaling laws in extensive-width networks
In extensive-width networks, features are recovered sequentially through sharp phase transitions, yielding an effective width k_c that unifies Bayes-optimal generalization error scaling as Θ(k_c d / n).
-
GraphInstruct: A Progressive Benchmark for Diagnosing Capability Gaps in LLM Graph Generation
GraphInstruct introduces a six-level progressive benchmark with 800 instructions and 1,582 references to diagnose LLM graph generation gaps, plus a verification-guided iterative prompting framework that improves performance.
-
GraphInstruct: A Progressive Benchmark for Diagnosing Capability Gaps in LLM Graph Generation
GraphInstruct is a progressive benchmark with six complexity levels for LLM graph generation that identifies multi-constraint composition as the hardest point and shows a verification-guided iterative framework outper...
-
Urban-ImageNet: A Large-Scale Multi-Modal Dataset and Evaluation Framework for Urban Space Perception
Urban-ImageNet is a 2-million-image multi-modal dataset with HUSIC 10-class taxonomy enabling benchmarks for urban scene classification, cross-modal retrieval, and instance segmentation.
-
The Wittgensteinian Representation Hypothesis: Is Language the Attractor of Multimodal Convergence?
Language representations serve as the asymptotic attractor for convergence in independently trained multimodal neural networks due to feature density asymmetry.
-
How Much is Brain Data Worth for Machine Learning?
Brain data is worth a variable number of task samples depending on task-brain alignment, noise levels, and latent dimension, with conditions under which it also improves robustness to test distribution shift.
-
DUET: Optimize Token-Budget Allocation for Reinforcement Learning with Verifiable Rewards
DUET improves RLVR by allocating tokens across both prompt selection and rollout length, outperforming full-budget baselines even when using only half the tokens.
-
Spectral Dynamics in Deep Networks: Feature Learning, Outlier Escape, and Learning Rate Transfer
A two-level DMFT predicts width-consistent outlier escape and hyperparameter transfer under μP in deep networks, with bulk restructuring dominating for tasks with many outputs.
-
Curated Synthetic Data Doesn't Have to Collapse: A Theoretical Study of Generative Retraining with Pluralistic Preferences
Recursive generative retraining with pluralistic preferences converges to a stable diverse distribution that satisfies a weighted Nash bargaining solution.
-
On the Invariance and Generality of Neural Scaling Laws
Neural scaling laws are invariant under bijective data transformations and change predictably with information resolution ρ under non-bijective transformations, enabling cross-domain transport of fitted exponents.
-
Agentick: A Unified Benchmark for General Sequential Decision-Making Agents
Agentick is a new benchmark for sequential decision-making agents that evaluates RL, LLM, VLM, hybrid, and human approaches across 37 tasks and finds no single method dominates.
-
Can RL Teach Long-Horizon Reasoning to LLMs? Expressiveness Is Key
RL training on more expressive logical tasks follows a steeper power-law scaling with reasoning depth and transfers more efficiently to math and reasoning benchmarks.