pith. machine review for the scientific record.

arxiv: 1712.00409 · v1 · submitted 2017-12-01 · 💻 cs.LG · stat.ML

Recognition: 2 theorem links

Deep Learning Scaling is Predictable, Empirically

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 03:56 UTC · model grok-4.3

classification 💻 cs.LG stat.ML
keywords: deep learning scaling · power law · generalization error · training data size · model size · empirical study · machine translation · language modeling

The pith

Deep learning generalization error decreases as a power law of training set size across multiple domains.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper measures how error rates change when training data grows much larger, testing four domains: machine translation, language modeling, image processing, and speech recognition. It finds that error follows a power-law relationship with data size, so that each doubling of data produces a predictable fractional drop in error. Model changes or other improvements move the entire curve up or down but leave the scaling exponent unchanged. Optimal model size grows slower than linearly with data volume. These patterns let researchers forecast returns from adding data or compute without running every experiment at full scale.
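To make the forecasting claim concrete, here is a minimal sketch of such a fit: ordinary least squares on log error against log training-set size recovers the exponent and the per-doubling improvement factor. All numbers are hypothetical; this is not the authors' code or data.

```python
# Minimal sketch (not the authors' code): fit eps(m) = alpha * m**beta_g
# to hypothetical (training-set size, validation error) measurements by
# least squares in log-log space, then read off the per-doubling factor
# 2**beta_g. Every number below is an illustrative assumption.
import numpy as np

m = np.array([1e5, 2e5, 4e5, 8e5, 1.6e6, 3.2e6])          # training examples (hypothetical)
eps = np.array([0.42, 0.36, 0.31, 0.265, 0.228, 0.196])   # validation error (hypothetical)

beta_g, log_alpha = np.polyfit(np.log(m), np.log(eps), 1)
alpha = np.exp(log_alpha)

print(f"fitted exponent beta_g = {beta_g:.3f}")            # negative: error falls with data
print(f"each doubling of data multiplies error by {2**beta_g:.3f}")
print(f"predicted error at 10x current data: {alpha * (10 * m[-1])**beta_g:.3f}")
```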

Core claim

Empirical tests show that generalization error scales as a power law with training set size in each of the four domains examined. The exponent that sets the rate of improvement stays the same when architectures or other model improvements are introduced; those changes only shift the absolute error level. Model size needed for best performance scales sublinearly with data size. The measurements cover a wide range of data volumes and produce consistent scaling behavior within the tested regimes.
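In symbols (our notation, not the paper's), the three regularities can be written as follows; the exponents are fitted empirically per domain, not derived:

```latex
% Our notation, not the paper's; exponents are fitted per domain.
\varepsilon(m) \approx \alpha\, m^{\beta_g}, \quad \beta_g < 0
  \qquad \text{(generalization error vs. training-set size } m\text{)}

\varepsilon'(m) \approx \alpha'\, m^{\beta_g}
  \qquad \text{(model improvements change } \alpha\text{, not } \beta_g\text{)}

s^{*}(m) \propto m^{\beta_p}, \quad 0 < \beta_p < 1
  \qquad \text{(optimal model size grows sublinearly with } m\text{)}
```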

What carries the argument

Power-law scaling of generalization error with training set size

If this is right

  • Accuracy targets can be set by extrapolating the measured power law rather than by exhaustive trial runs (a sketch of this inversion follows the list).
  • Decisions on whether to collect more data can be guided by the expected error reduction per additional example.
  • Model architecture work can be assessed by how far it shifts the error curve rather than by any change in the scaling rate.
  • Hardware and systems planning can use the sublinear model-size relation to estimate compute needs as datasets grow.
  • Continued scaling of data and compute is expected to deliver steady, predictable gains within the domains studied.
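As a worked example of the first two points, the fitted power law can be inverted to estimate how much data a given error target would require. The values of alpha and beta_g below are assumed, reusing the hypothetical fit sketched earlier, not numbers from the paper.

```python
# Minimal sketch: invert eps(m) = alpha * m**beta_g to estimate the
# training-set size needed for a target error. alpha and beta_g are
# hypothetical fitted values, not the paper's.
def data_needed(eps_target: float, alpha: float, beta_g: float) -> float:
    """Return m such that alpha * m**beta_g == eps_target."""
    return (eps_target / alpha) ** (1.0 / beta_g)

alpha, beta_g = 3.1, -0.22
print(f"examples for 15% error: {data_needed(0.15, alpha, beta_g):.2e}")
print(f"examples for 10% error: {data_needed(0.10, alpha, beta_g):.2e}")
```

Because the magnitude of beta_g is small, each further cut in target error costs a large multiple of additional data, which is why forecasting from the fit is attractive compared with exhaustive trial runs.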

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the scaling persists, theoretical explanations should target why the observed exponents take their particular values rather than only proving existence of some scaling.
  • The invariance of the exponent under model changes suggests that data volume may dominate long-term progress more than incremental architectural tweaks.
  • Sublinear growth of model size with data implies that the number of parameters per training example shrinks as datasets enlarge, improving efficiency at scale (see the sketch after this list).
  • A break in the power law at extreme sizes would signal a new regime, such as exhaustion of useful information in the data source.
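A small numerical illustration of the sublinear model-size point, using a purely hypothetical exponent beta_p = 0.7; the paper fits this quantity per domain:

```python
# Minimal numerical illustration of sublinear model-size scaling. The
# exponent beta_p = 0.7 and the anchor point are hypothetical values.
# If optimal size s(m) ~ m**beta_p with beta_p < 1, then parameters per
# training example, s(m)/m ~ m**(beta_p - 1), shrink as data grows.
BETA_P = 0.7                               # hypothetical sublinear exponent
S0, M0 = 1e4, 1e6                          # hypothetical anchor point

for m in (1e6, 1e7, 1e8):
    s = S0 * (m / M0) ** BETA_P
    print(f"m = {m:.0e}: optimal size ~ {s:.2e}, params/example ~ {s / m:.5f}")
```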

Load-bearing premise

The power-law relationships seen in the tested range of data and model sizes will continue without breaks when both are made much larger.

What would settle it

A new experiment at ten times the largest data volume tested here that shows error deviating from the fitted power-law curve by more than the observed variation.
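A minimal sketch of how such a test could be scored, under our own construction and with hypothetical numbers: fit the power law in-range, predict error at 10x the largest data volume, and flag a regime break if the new measurement deviates from the prediction by more than the in-range residual scatter.

```python
# Minimal sketch (our construction, hypothetical data) of the proposed
# 10x extrapolation test for a break in the power law.
import numpy as np

m = np.array([1e5, 2e5, 4e5, 8e5, 1.6e6, 3.2e6])          # hypothetical
eps = np.array([0.42, 0.36, 0.31, 0.265, 0.228, 0.196])   # hypothetical

beta, log_alpha = np.polyfit(np.log(m), np.log(eps), 1)
resid = np.log(eps) - (beta * np.log(m) + log_alpha)
sigma = resid.std(ddof=2)                  # scatter around the in-range fit

m_new, eps_new = 3.2e7, 0.135              # hypothetical 10x measurement
z = (np.log(eps_new) - (beta * np.log(m_new) + log_alpha)) / sigma
print(f"deviation = {z:+.1f} sigma -> {'regime break' if abs(z) > 3.0 else 'power law holds'}")
```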

original abstract

Deep learning (DL) creates impactful advances following a virtuous recipe: model architecture search, creating large training data sets, and scaling computation. It is widely believed that growing training sets and models should improve accuracy and result in better products. As DL application domains grow, we would like a deeper understanding of the relationships between training set size, computational scale, and model accuracy improvements to advance the state-of-the-art. This paper presents a large scale empirical characterization of generalization error and model size growth as training sets grow. We introduce a methodology for this measurement and test four machine learning domains: machine translation, language modeling, image processing, and speech recognition. Our empirical results show power-law generalization error scaling across a breadth of factors, resulting in power-law exponents---the "steepness" of the learning curve---yet to be explained by theoretical work. Further, model improvements only shift the error but do not appear to affect the power-law exponent. We also show that model size scales sublinearly with data size. These scaling relationships have significant implications on deep learning research, practice, and systems. They can assist model debugging, setting accuracy targets, and decisions about data set growth. They can also guide computing system design and underscore the importance of continued computational scaling.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 3 minor

Summary. The manuscript presents a large-scale empirical study of scaling in deep learning across four domains (machine translation, language modeling, image processing, and speech recognition). It claims that generalization error follows a power-law dependence on training set size, that the power-law exponent is invariant to model architecture improvements (which only shift the prefactor), and that optimal model size grows sublinearly with data size. These relationships are positioned as making DL scaling predictable, with implications for research, practice, and systems design.

Significance. If the reported power-law relationships hold, the work provides a valuable empirical foundation for quantifying the benefits of scaling data and compute in deep learning. The cross-domain consistency and the observation that architecture changes primarily affect the constant term rather than the exponent are particularly useful for guiding practical decisions on data collection and model sizing. The study also highlights open theoretical questions about the origin of the exponents.

major comments (3)
  1. [§3 (Experimental Methodology)] The description of how training subsets of varying sizes were constructed lacks detail on sampling method (e.g., random vs. contiguous) and any controls to ensure distributional equivalence across scales; without this, it is difficult to rule out selection effects that could artifactually produce or alter the observed power-law exponents.
  2. [Results sections (e.g., §4.1–4.4)] No error bars, confidence intervals, or goodness-of-fit statistics (such as R² or residual analysis) are reported for the fitted power-law exponents, and there is no sensitivity analysis to the choice of fitting range; this weakens the ability to assess the robustness of the central scaling claims (a minimal version of such statistics is sketched after this report).
  3. [§5 (Discussion)] The claim that scaling is 'predictable' rests on the power-law form and exponents persisting beyond the tested regimes, yet the manuscript contains no analysis or discussion of possible breaks, saturation, or changes in effective exponent at substantially larger data volumes or model capacities.
minor comments (3)
  1. [Abstract and §1] The abstract and introduction would benefit from explicitly stating the numerical values of the observed exponents and the precise functional form used for the power-law fits.
  2. [Figures] Figures showing learning curves should overlay the fitted power-law curves and report the fitted parameters for direct visual assessment of fit quality.
  3. [§2 (Related Work)] A brief comparison to prior empirical scaling observations (e.g., in speech or vision) would help situate the novelty of the cross-domain results.
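For concreteness, a minimal sketch (our construction, hypothetical data) of the statistics major comment 2 requests: R² of the log-log fit plus the exponent's sensitivity to the choice of fitting range.

```python
# Minimal sketch of the requested fit statistics: R^2 of the log-log
# linear fit and exponent stability across fitting windows. Data are
# hypothetical, not from the paper.
import numpy as np

m = np.array([1e5, 2e5, 4e5, 8e5, 1.6e6, 3.2e6])          # hypothetical sizes
eps = np.array([0.42, 0.36, 0.31, 0.265, 0.228, 0.196])   # hypothetical errors
x, y = np.log(m), np.log(eps)

def fit_stats(x, y):
    """Slope (power-law exponent) and R^2 of a linear fit in log space."""
    beta, intercept = np.polyfit(x, y, 1)
    yhat = beta * x + intercept
    r2 = 1.0 - np.sum((y - yhat) ** 2) / np.sum((y - y.mean()) ** 2)
    return beta, r2

for lo, hi in [(0, 6), (0, 4), (1, 5), (2, 6)]:           # full range, then windows
    beta, r2 = fit_stats(x[lo:hi], y[lo:hi])
    print(f"points {lo}..{hi - 1}: beta = {beta:.3f}, R^2 = {r2:.4f}")
```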

Simulated Author's Rebuttal

3 responses · 0 unresolved

We are grateful for the referee's positive assessment and constructive suggestions for improving the manuscript. We address each of the major comments below.

point-by-point responses
  1. Referee: §3 (Experimental Methodology): The description of how training subsets of varying sizes were constructed lacks detail on sampling method (e.g., random vs. contiguous) and any controls to ensure distributional equivalence across scales; without this, it is difficult to rule out selection effects that could artifactually produce or alter the observed power-law exponents.

    Authors: We agree that more detail on subset construction is needed for reproducibility. The training subsets were constructed via random sampling (without replacement) from the full training set to maintain distributional properties. We will revise §3 to include a clear description of this sampling method and any verification steps for distributional equivalence (one such construction is sketched after these responses). revision: yes

  2. Referee: Results sections (e.g., §4.1–4.4): No error bars, confidence intervals, or goodness-of-fit statistics (such as R² or residual analysis) are reported for the fitted power-law exponents, and there is no sensitivity analysis to the choice of fitting range; this weakens the ability to assess the robustness of the central scaling claims.

    Authors: We acknowledge the value of these statistical measures. Although the fits were consistent across domains and visually robust, we will add error bars (from repeated trials where feasible), report R² and other fit statistics, and perform sensitivity analysis on the fitting range in the revised results sections. revision: yes

  3. Referee: §5 (Discussion): The claim that scaling is 'predictable' rests on the power-law form and exponents persisting beyond the tested regimes, yet the manuscript contains no analysis or discussion of possible breaks, saturation, or changes in effective exponent at substantially larger data volumes or model capacities.

    Authors: The claims are grounded in the empirical observations within the tested regimes. We will expand the Discussion section to address potential limitations at larger scales, including possible saturation or exponent changes, based on trends at the upper limits of our experiments and related literature. However, empirical analysis at substantially larger scales is beyond the scope of this work due to resource constraints. revision: partial
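One way to realize the subset construction described in response 1 is nested prefixes of a single random permutation, which gives sampling without replacement; the nesting is our added assumption, not something the rebuttal specifies.

```python
# Minimal sketch of one subset construction consistent with response 1:
# prefixes of a single random permutation give sampling without
# replacement, and (our assumption) nested subsets, so each larger run
# strictly extends the smaller one. Not the authors' code.
import numpy as np

rng = np.random.default_rng(0)
full_size = 3_200_000                      # hypothetical full training set
order = rng.permutation(full_size)         # one shared shuffle of example indices

subsets = {n: order[:n] for n in (100_000, 200_000, 400_000, 800_000)}
for n, idx in subsets.items():
    print(f"subset of {n:>7}: first indices {idx[:3]}")
```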

Circularity Check

0 steps flagged

No circularity: empirical scaling laws are direct observations, not reductions to fitted inputs

full rationale

The manuscript reports measured power-law relationships between generalization error and factors such as training-set size, model size, and compute across four domains. These relationships are obtained by fitting functional forms to experimental data points collected within the tested regimes; the paper does not derive the power-law exponents from prior equations, self-citations, or uniqueness theorems that would make the reported scaling equivalent to its own inputs by construction. Model-size sublinearity is likewise an observed trend, not a prediction forced by the fitting procedure itself. Because the central claims rest on reproducible empirical measurements rather than any self-referential derivation chain, the analysis is self-contained.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

Central claims rest on empirical observation and power-law fitting to measured error curves; no new theoretical entities or derivations are introduced beyond standard statistical fitting assumptions.

free parameters (1)
  • power-law exponent
    Fitted separately per domain to describe the rate of error reduction with data size.
axioms (1)
  • domain assumption: Power-law functional form adequately captures the scaling relationship over the measured range
    Invoked to summarize observed curves; no derivation from first principles is provided.

pith-pipeline@v0.9.0 · 5557 in / 1230 out tokens · 63530 ms · 2026-05-12T03:56:18.492891+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • HierarchyEmergence · hierarchy_emergence_forces_phi · tag: unclear

    Relation between the paper passage and the cited Recognition theorem.

    Our empirical results show power-law generalization error scaling across a breadth of factors, resulting in power-law exponents—the 'steepness' of the learning curve—yet to be explained by theoretical work. Further, model improvements only shift the error but do not appear to affect the power-law exponent. We also show that model size scales sublinearly with data size.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 37 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. KAN: Kolmogorov-Arnold Networks

    cs.LG 2024-04 conditional novelty 8.0

    KANs with learnable univariate spline activations on edges achieve better accuracy than MLPs with fewer parameters, faster scaling, and direct visualization for scientific discovery.

  2. Language Models are Few-Shot Learners

    cs.CL 2020-05 accept novelty 8.0

    GPT-3 shows that scaling an autoregressive language model to 175 billion parameters enables strong few-shot performance across diverse NLP tasks via in-context prompting without fine-tuning.

  3. Sharp feature-learning transitions and Bayes-optimal neural scaling laws in extensive-width networks

    stat.ML 2026-05 unverdicted novelty 7.0

    In extensive-width networks, features are recovered sequentially through sharp phase transitions, yielding an effective width k_c that unifies Bayes-optimal generalization error scaling as Θ(k_c d / n).

  4. Scalable Distributed Stochastic Optimization via Bidirectional Compression: Beyond Pessimistic Limits

    math.OC 2026-05 unverdicted novelty 7.0

    Inkheart SGD and M4 use bidirectional compression to achieve time complexities in distributed SGD that improve with worker count n and surpass prior lower bounds under a necessary structural assumption.

  5. Decision Boundary-aware Generation for Long-tailed Learning

    cs.CV 2026-05 unverdicted novelty 7.0

    DBG mitigates boundary overlap in long-tailed learning by generating near-boundary samples, leading to better tail class accuracy and more separable decision spaces.

  6. Robust and Clinically Reliable EEG Biomarkers: A Cross Population Framework for Generalizable Parkinson's Disease Detection

    cs.LG 2026-04 conditional novelty 7.0

    A cross-population framework for EEG Parkinson's detection using exhaustive 75 directional evaluations and nested validation shows asymmetric transfer and accuracy up to 94.1% when training diversity increases, suppor...

  7. Better and Worse with Scale: How Contextual Entrainment Diverges with Model Size

    cs.CL 2026-04 unverdicted novelty 7.0

    Contextual entrainment decreases for semantic contexts but increases for non-semantic ones as LLMs scale, following power-law trends with 4x better resistance to misinformation but 2x more copying of arbitrary tokens.

  8. DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model

    cs.CL 2024-05 unverdicted novelty 7.0

    DeepSeek-V2 delivers top-tier open-source LLM performance using only 21B active parameters by compressing the KV cache 93.3% and cutting training costs 42.5% via MLA and DeepSeekMoE.

  9. Scaling Laws for Autoregressive Generative Modeling

    cs.LG 2020-10 accept novelty 7.0

    Autoregressive transformers follow power-law scaling laws for cross-entropy loss with nearly universal exponents relating optimal model size to compute budget across four domains.

  10. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer

    cs.LG 2019-10 unverdicted novelty 7.0

    T5 casts all NLP tasks as text-to-text generation, systematically explores pre-training choices, and reaches strong performance on summarization, QA, classification and other tasks via large-scale training on the Colo...

  11. Practical Scaling Laws: Converting Compute into Performance in a Data-Constrained World

    cs.LG 2026-05 conditional novelty 6.0

    A new scaling law L(N, D, T) = E + (L0 - E) h/(1+h) with h = a/N^α + b/T^β + c N^γ/D^δ that decomposes loss into undercapacity, undertraining, and overfitting terms and saturates between E and L0.

  12. AIPO: Learning to Reason from Active Interaction

    cs.CL 2026-05 unverdicted novelty 6.0

    AIPO trains LLMs to expand their reasoning capability boundary via active multi-agent interaction with Verify, Knowledge, and Reasoning agents during RLVR, using importance sampling and clipping to handle feedback, th...

  13. A Qualitative Test-Risk Mechanism for Scaling Behavior in Normalized Residual Networks

    cs.LG 2026-05 unverdicted novelty 6.0

    Depth expansion in normalized residual networks yields provable test-risk improvement through representational, optimization, and generalization gains under first-order descent and norm-control conditions.

  14. Policy-Guided Stepwise Model Routing for Cost-Effective Reasoning

    cs.AI 2026-05 unverdicted novelty 6.0

    A small RL-trained policy for stepwise model routing between LLM sizes improves the accuracy-cost tradeoff on math benchmarks over handcrafted strategies and matches large process reward model methods.

  15. ZAYA1-8B Technical Report

    cs.AI 2026-05 unverdicted novelty 6.0

    ZAYA1-8B is a reasoning MoE model with 700M active parameters that matches larger models on math and coding benchmarks and reaches 91.9% on AIME'25 via Markovian RSA test-time compute.

  16. InfoLaw: Information Scaling Laws for Large Language Models with Quality-Weighted Mixture Data and Repetition

    cs.CL 2026-05 unverdicted novelty 6.0

    InfoLaw models pretraining as information accumulation where quality sets information density and repetition causes scale-dependent diminishing returns, predicting loss with low error on unseen mixtures and larger sca...

  17. A Limit Theory of Foundation Models: A Mathematical Approach to Understanding Emergent Intelligence and Scaling Laws

    cs.LG 2026-04 unverdicted novelty 6.0

    Emergent intelligence is recast as the existence of the limit of performance E(N,P,K) as N,P,K to infinity, with necessary and sufficient conditions derived via nonlinear Lipschitz operator theory and scaling laws obt...

  18. The Power of Power Law: Asymmetry Enables Compositional Reasoning

    cs.AI 2026-04 unverdicted novelty 6.0

    Power-law data sampling creates beneficial asymmetry in the loss landscape that lets models acquire high-frequency skill compositions first, enabling more efficient learning of rare long-tail skills than uniform distr...

  19. Large language model-enabled automated data extraction for concrete materials informatics

    cond-mat.mtrl-sci 2026-04 unverdicted novelty 6.0

    LLM pipeline extracts nearly 9,000 high-quality blended-cement concrete records from over 27,000 publications with F1 scores up to 0.97 and enables ML analyses showing benefits of large diverse datasets.

  20. Adaptive Test-Time Scaling for Zero-Shot Respiratory Audio Classification

    cs.SD 2026-04 unverdicted novelty 6.0

    TRIAGE adaptively scales test-time compute via tiered zero-shot stages for respiratory audio classification, reaching mean AUROC 0.744 across nine tasks while outperforming prior zero-shot methods.

  21. Large Language Monkeys: Scaling Inference Compute with Repeated Sampling

    cs.LG 2024-07 unverdicted novelty 6.0

    Repeated sampling scales problem coverage log-linearly with sample count, improving SWE-bench Lite performance from 15.9% to 56% using 250 samples.

  22. Textbooks Are All You Need

    cs.CL 2023-06 unverdicted novelty 6.0

    A 1.3B-parameter code model trained on 7B tokens of curated textbook and synthetic data achieves 50.6% on HumanEval, indicating data quality can enable strong performance at small scale.

  23. BLOOM: A 176B-Parameter Open-Access Multilingual Language Model

    cs.CL 2022-11 unverdicted novelty 6.0

    BLOOM is a 176B-parameter open-access multilingual language model trained on the ROOTS corpus that achieves competitive performance on benchmarks, with improved results after multitask prompted finetuning.

  24. eDiff-I: Text-to-Image Diffusion Models with an Ensemble of Expert Denoisers

    cs.CV 2022-11 unverdicted novelty 6.0

    An ensemble of stage-specialized text-to-image diffusion models improves prompt alignment over single shared-parameter models while preserving visual quality and inference speed.

  25. Language Models (Mostly) Know What They Know

    cs.CL 2022-07 unverdicted novelty 6.0

    Language models show good calibration when asked to estimate the probability that their own answers are correct, with performance improving as models get larger.

  26. A General Language Assistant as a Laboratory for Alignment

    cs.CL 2021-12 conditional novelty 6.0

    Ranked preference modeling outperforms imitation learning for language model alignment and scales more favorably with model size.

  27. Scaling Laws and Tradeoffs in Recurrent Networks of Expressive Neurons

    cs.LG 2026-05 unverdicted novelty 5.0

    Recurrent networks built from tunable expressive neurons reveal scaling laws with an optimal parameter split that shifts toward higher per-neuron complexity at larger scales.

  28. Rennala MVR: Improved Time Complexity for Parallel Stochastic Optimization via Momentum-Based Variance Reduction

    math.OC 2026-05 unverdicted novelty 5.0

    Rennala MVR improves time complexity over Rennala SGD for smooth nonconvex stochastic optimization in heterogeneous parallel systems under a mean-squared smoothness assumption.

  29. Physical Foundation Models: Fixed hardware implementations of large-scale neural networks

    cs.LG 2026-04 unverdicted novelty 5.0

    Physical Foundation Models are fixed physical hardware realizations of foundation-scale neural networks that compute via inherent material dynamics, potentially delivering orders-of-magnitude gains in energy efficienc...

  30. A Limit Theory of Foundation Models: A Mathematical Approach to Understanding Emergent Intelligence and Scaling Laws

    cs.LG 2026-04 unverdicted novelty 5.0

    Emergent intelligence corresponds to the limit of a performance function E(N,P,K) as N, P, K go to infinity, originating from a parameter-limit architecture whose existence is governed by Lipschitz conditions, with sc...

  31. Singularity Formation: Synergy in Theoretical, Numerical and Machine Learning Approaches

    math.NA 2026-04 unverdicted novelty 5.0

    The work introduces a modulation-based analytical method for singularity proofs in singular PDEs and refines ML techniques like PINNs and KANs to identify blowup solutions, with application to the open 3D Keller-Segel...

  32. Cooperate to Compete: Strategic Data Generation and Incentivization Framework for Coopetitive Cross-Silo Federated Learning

    cs.AI 2026-04 unverdicted novelty 5.0

    CoCoGen+ models each federated learning round as a weighted potential game with strategic synthetic data generation and payoff redistribution incentives, showing improved efficiency over baselines under non-IID data a...

  33. Towards Scaling Law Analysis For Spatiotemporal Weather Data

    cs.LG 2026-04 unverdicted novelty 5.0

    Scaling laws for weather models exhibit strong cross-channel and cross-horizon heterogeneity, where globally pooled metrics appear favorable while many individual channels degrade at longer leads.

  34. The Platonic Representation Hypothesis

    cs.LG 2024-05 unverdicted novelty 5.0

    Representations learned by large AI models are converging toward a shared statistical model of reality.

  35. StarCoder: may the source be with you!

    cs.CL 2023-05 accept novelty 5.0

    StarCoderBase matches or beats OpenAI's code-cushman-001 on multi-language code benchmarks; the Python-fine-tuned StarCoder reaches 40% pass@1 on HumanEval while retaining other-language performance.

  36. DeepSeek LLM: Scaling Open-Source Language Models with Longtermism

    cs.CL 2024-01 unverdicted novelty 4.0

    DeepSeek LLM 67B exceeds LLaMA-2 70B on code, mathematics and reasoning benchmarks after pre-training on 2 trillion tokens and alignment via SFT and DPO.

  37. Superposition Yields Robust Neural Scaling

    cs.LG 2025-05

Reference graph

Works this paper leans on

12 extracted references · 12 canonical work pages · cited by 36 Pith papers · 2 internal anchors

  1. [1] D. Bahdanau, J. Chorowski, D. Serdyuk, P. Brakel, and Y. Bengio. End-to-end Attention-based Large Vocabulary Speech Recognition. arXiv preprint arXiv:1508.04395v2.

  2. [2] E. Battenberg, J. Chen, R. Child, A. Coates, Y. Gaur, Y. Li, H. Liu, S. Satheesh, D. Seetapun, A. Sriram, and Z. Zhu. Exploring Neural Transducers for End-to-end Speech Recognition. arXiv preprint arXiv:1707.07413.

  3. [3] C. Chelba, T. Mikolov, M. Schuster, Q. Ge, T. Brants, P. Koehn, and T. Robinson. One Billion Word Benchmark for Measuring Progress in Statistical Language Modeling. arXiv preprint arXiv:1312.3005.

  4. [4] A. Hannun, C. Case, J. Casper, B. Catanzaro, G. Diamos, E. Elsen, R. Prenger, S. Satheesh, S. Sengupta, A. Coates, et al. Deep Speech: Scaling Up End-to-End Speech Recognition. arXiv preprint arXiv:1412.5567.

  5. [5] R. Jozefowicz, O. Vinyals, M. Schuster, N. Shazeer, and Y. Wu. Exploring the Limits of Language Modeling. arXiv preprint arXiv:1602.02410v2.

  6. [6] K. Kawaguchi, L. P. Kaelbling, and Y. Bengio. Generalization in Deep Learning. arXiv preprint arXiv:1710.05468v1.

  7. [7] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. arXiv preprint arXiv:1409.0575.

  8. [8] R. Sennrich, B. Haddow, and A. Birch. Neural Machine Translation of Rare Words with Subword Units. arXiv preprint arXiv:1508.07909, 2016a.

  9. [9] R. Sennrich, B. Haddow, and A. Birch. Edinburgh Neural Machine Translation Systems for WMT 16. arXiv preprint arXiv:1606.02891, 2016b.

  10. [10] H. S. Seung, H. Sompolinsky, and N. Tishby. Statistical Mechanics of Learning from Examples. Physical Review A, 45:6056–6091, April 1992.

  11. [11] S. L. Smith and Q. V. Le. A Bayesian Perspective on Generalization and Stochastic Gradient Descent. arXiv preprint arXiv:1710.06451v2.

  12. [12] C. Zhang, S. Bengio, M. Hardt, B. Recht, and O. Vinyals. Understanding Deep Learning Requires Rethinking Generalization. arXiv preprint arXiv:1611.03530v2.

(One extracted snippet was appendix text rather than a citation: the language-model tasks use normalized cross-entropy loss, −(1/N) Σᵢ ln p(wᵢ), where p(wᵢ) is the model's predicted probability of the ith token and N is either the number of sequences in a training batch or the number of predicted characters in the validation set.)
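For completeness, the normalized cross-entropy noted above is straightforward to compute; a minimal sketch with made-up probabilities:

```python
# Minimal sketch of normalized cross-entropy, -(1/N) * sum_i ln p(w_i),
# with hypothetical per-token probabilities.
import math

p = [0.9, 0.6, 0.25, 0.8]                  # hypothetical probabilities of observed tokens
loss = -sum(math.log(q) for q in p) / len(p)
print(f"normalized cross-entropy: {loss:.4f} nats per token")
```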