pith. machine review for the scientific record.

arxiv: 1712.00409 · v1 · submitted 2017-12-01 · 💻 cs.LG · stat.ML

Recognition: 2 theorem links

Deep Learning Scaling is Predictable, Empirically

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 03:56 UTC · model grok-4.3

classification 💻 cs.LG stat.ML
keywords: deep learning scaling · power law · generalization error · training data size · model size · empirical study · machine translation · language modeling

The pith

Deep learning generalization error decreases as a power law of training set size across multiple domains.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper measures how error rates change when training data grows much larger, testing four domains: machine translation, language modeling, image processing, and speech recognition. It finds that error follows a power-law relationship with data size, so that each doubling of data produces a predictable fractional drop in error. Model changes or other improvements move the entire curve up or down but leave the scaling exponent unchanged. Optimal model size grows slower than linearly with data volume. These patterns let researchers forecast returns from adding data or compute without running every experiment at full scale.
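To make the forecasting claim concrete, here is a minimal sketch of such a fit: ordinary least squares on log error against log training-set size recovers the exponent and the per-doubling improvement factor. All numbers are hypothetical; this is not the authors' code or data.

```python
# Minimal sketch (not the authors' code): fit eps(m) = alpha * m**beta_g
# to hypothetical (training-set size, validation error) measurements by
# least squares in log-log space, then read off the per-doubling factor
# 2**beta_g. Every number below is an illustrative assumption.
import numpy as np

m = np.array([1e5, 2e5, 4e5, 8e5, 1.6e6, 3.2e6])          # training examples (hypothetical)
eps = np.array([0.42, 0.36, 0.31, 0.265, 0.228, 0.196])   # validation error (hypothetical)

beta_g, log_alpha = np.polyfit(np.log(m), np.log(eps), 1)
alpha = np.exp(log_alpha)

print(f"fitted exponent beta_g = {beta_g:.3f}")            # negative: error falls with data
print(f"each doubling of data multiplies error by {2**beta_g:.3f}")
print(f"predicted error at 10x current data: {alpha * (10 * m[-1])**beta_g:.3f}")
```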

Core claim

Empirical tests show that generalization error scales as a power law with training set size in each of the four domains examined. The exponent that sets the rate of improvement stays the same when architectures or other model improvements are introduced; those changes only shift the absolute error level. Model size needed for best performance scales sublinearly with data size. The measurements cover a wide range of data volumes and produce consistent scaling behavior within the tested regimes.
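In symbols (our notation, not the paper's), the three regularities can be written as follows; the exponents are fitted empirically per domain, not derived:

```latex
% Our notation, not the paper's; exponents are fitted per domain.
\varepsilon(m) \approx \alpha\, m^{\beta_g}, \quad \beta_g < 0
  \qquad \text{(generalization error vs. training-set size } m\text{)}

\varepsilon'(m) \approx \alpha'\, m^{\beta_g}
  \qquad \text{(model improvements change } \alpha\text{, not } \beta_g\text{)}

s^{*}(m) \propto m^{\beta_p}, \quad 0 < \beta_p < 1
  \qquad \text{(optimal model size grows sublinearly with } m\text{)}
```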

What carries the argument

Power-law scaling of generalization error with training set size

If this is right

  • Accuracy targets can be set by extrapolating the measured power law rather than by exhaustive trial runs (a sketch of this inversion follows the list).
  • Decisions on whether to collect more data can be guided by the expected error reduction per additional example.
  • Model architecture work can be assessed by how far it shifts the error curve rather than by any change in the scaling rate.
  • Hardware and systems planning can use the sublinear model-size relation to estimate compute needs as datasets grow.
  • Continued scaling of data and compute is expected to deliver steady, predictable gains within the domains studied.
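As a worked example of the first two points, the fitted power law can be inverted to estimate how much data a given error target would require. The values of alpha and beta_g below are assumed, reusing the hypothetical fit sketched earlier, not numbers from the paper.

```python
# Minimal sketch: invert eps(m) = alpha * m**beta_g to estimate the
# training-set size needed for a target error. alpha and beta_g are
# hypothetical fitted values, not the paper's.
def data_needed(eps_target: float, alpha: float, beta_g: float) -> float:
    """Return m such that alpha * m**beta_g == eps_target."""
    return (eps_target / alpha) ** (1.0 / beta_g)

alpha, beta_g = 3.1, -0.22
print(f"examples for 15% error: {data_needed(0.15, alpha, beta_g):.2e}")
print(f"examples for 10% error: {data_needed(0.10, alpha, beta_g):.2e}")
```

Because the magnitude of beta_g is small, each further cut in target error costs a large multiple of additional data, which is why forecasting from the fit is attractive compared with exhaustive trial runs.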

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the scaling persists, theoretical explanations should target why the observed exponents take their particular values rather than only proving existence of some scaling.
  • The invariance of the exponent under model changes suggests that data volume may dominate long-term progress more than incremental architectural tweaks.
  • Sublinear growth of model size with data implies that the number of parameters per training example shrinks as datasets enlarge, improving efficiency at scale (see the sketch after this list).
  • A break in the power law at extreme sizes would signal a new regime, such as exhaustion of useful information in the data source.
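A small numerical illustration of the sublinear model-size point, using a purely hypothetical exponent beta_p = 0.7; the paper fits this quantity per domain:

```python
# Minimal numerical illustration of sublinear model-size scaling. The
# exponent beta_p = 0.7 and the anchor point are hypothetical values.
# If optimal size s(m) ~ m**beta_p with beta_p < 1, then parameters per
# training example, s(m)/m ~ m**(beta_p - 1), shrink as data grows.
BETA_P = 0.7                               # hypothetical sublinear exponent
S0, M0 = 1e4, 1e6                          # hypothetical anchor point

for m in (1e6, 1e7, 1e8):
    s = S0 * (m / M0) ** BETA_P
    print(f"m = {m:.0e}: optimal size ~ {s:.2e}, params/example ~ {s / m:.5f}")
```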

Load-bearing premise

The power-law relationships seen in the tested range of data and model sizes will continue without breaks when both are made much larger.

What would settle it

A new experiment at ten times the largest data volume tested here that shows error deviating from the fitted power-law curve by more than the observed variation.
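A minimal sketch of how such a test could be scored, under our own construction and with hypothetical numbers: fit the power law in-range, predict error at 10x the largest data volume, and flag a regime break if the new measurement deviates from the prediction by more than the in-range residual scatter.

```python
# Minimal sketch (our construction, hypothetical data) of the proposed
# 10x extrapolation test for a break in the power law.
import numpy as np

m = np.array([1e5, 2e5, 4e5, 8e5, 1.6e6, 3.2e6])          # hypothetical
eps = np.array([0.42, 0.36, 0.31, 0.265, 0.228, 0.196])   # hypothetical

beta, log_alpha = np.polyfit(np.log(m), np.log(eps), 1)
resid = np.log(eps) - (beta * np.log(m) + log_alpha)
sigma = resid.std(ddof=2)                  # scatter around the in-range fit

m_new, eps_new = 3.2e7, 0.135              # hypothetical 10x measurement
z = (np.log(eps_new) - (beta * np.log(m_new) + log_alpha)) / sigma
print(f"deviation = {z:+.1f} sigma -> {'regime break' if abs(z) > 3.0 else 'power law holds'}")
```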

original abstract

Deep learning (DL) creates impactful advances following a virtuous recipe: model architecture search, creating large training data sets, and scaling computation. It is widely believed that growing training sets and models should improve accuracy and result in better products. As DL application domains grow, we would like a deeper understanding of the relationships between training set size, computational scale, and model accuracy improvements to advance the state-of-the-art. This paper presents a large scale empirical characterization of generalization error and model size growth as training sets grow. We introduce a methodology for this measurement and test four machine learning domains: machine translation, language modeling, image processing, and speech recognition. Our empirical results show power-law generalization error scaling across a breadth of factors, resulting in power-law exponents---the "steepness" of the learning curve---yet to be explained by theoretical work. Further, model improvements only shift the error but do not appear to affect the power-law exponent. We also show that model size scales sublinearly with data size. These scaling relationships have significant implications on deep learning research, practice, and systems. They can assist model debugging, setting accuracy targets, and decisions about data set growth. They can also guide computing system design and underscore the importance of continued computational scaling.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 3 minor

Summary. The manuscript presents a large-scale empirical study of scaling in deep learning across four domains (machine translation, language modeling, image processing, and speech recognition). It claims that generalization error follows a power-law dependence on training set size, that the power-law exponent is invariant to model architecture improvements (which only shift the prefactor), and that optimal model size grows sublinearly with data size. These relationships are positioned as making DL scaling predictable, with implications for research, practice, and systems design.

Significance. If the reported power-law relationships hold, the work provides a valuable empirical foundation for quantifying the benefits of scaling data and compute in deep learning. The cross-domain consistency and the observation that architecture changes primarily affect the constant term rather than the exponent are particularly useful for guiding practical decisions on data collection and model sizing. The study also highlights open theoretical questions about the origin of the exponents.

major comments (3)
  1. [§3 (Experimental Methodology)] The description of how training subsets of varying sizes were constructed lacks detail on sampling method (e.g., random vs. contiguous) and any controls to ensure distributional equivalence across scales; without this, it is difficult to rule out selection effects that could artifactually produce or alter the observed power-law exponents.
  2. [Results sections (e.g., §4.1–4.4)] No error bars, confidence intervals, or goodness-of-fit statistics (such as R² or residual analysis) are reported for the fitted power-law exponents, and there is no sensitivity analysis to the choice of fitting range; this weakens the ability to assess the robustness of the central scaling claims (a minimal version of such statistics is sketched after this report).
  3. [§5 (Discussion)] The claim that scaling is 'predictable' rests on the power-law form and exponents persisting beyond the tested regimes, yet the manuscript contains no analysis or discussion of possible breaks, saturation, or changes in effective exponent at substantially larger data volumes or model capacities.
minor comments (3)
  1. [Abstract and §1] The abstract and introduction would benefit from explicitly stating the numerical values of the observed exponents and the precise functional form used for the power-law fits.
  2. [Figures] Figures showing learning curves should overlay the fitted power-law curves and report the fitted parameters for direct visual assessment of fit quality.
  3. [§2 (Related Work)] A brief comparison to prior empirical scaling observations (e.g., in speech or vision) would help situate the novelty of the cross-domain results.
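For concreteness, a minimal sketch (our construction, hypothetical data) of the statistics major comment 2 requests: R² of the log-log fit plus the exponent's sensitivity to the choice of fitting range.

```python
# Minimal sketch of the requested fit statistics: R^2 of the log-log
# linear fit and exponent stability across fitting windows. Data are
# hypothetical, not from the paper.
import numpy as np

m = np.array([1e5, 2e5, 4e5, 8e5, 1.6e6, 3.2e6])          # hypothetical sizes
eps = np.array([0.42, 0.36, 0.31, 0.265, 0.228, 0.196])   # hypothetical errors
x, y = np.log(m), np.log(eps)

def fit_stats(x, y):
    """Slope (power-law exponent) and R^2 of a linear fit in log space."""
    beta, intercept = np.polyfit(x, y, 1)
    yhat = beta * x + intercept
    r2 = 1.0 - np.sum((y - yhat) ** 2) / np.sum((y - y.mean()) ** 2)
    return beta, r2

for lo, hi in [(0, 6), (0, 4), (1, 5), (2, 6)]:           # full range, then windows
    beta, r2 = fit_stats(x[lo:hi], y[lo:hi])
    print(f"points {lo}..{hi - 1}: beta = {beta:.3f}, R^2 = {r2:.4f}")
```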

Simulated Author's Rebuttal

3 responses · 0 unresolved

We are grateful for the referee's positive assessment and constructive suggestions for improving the manuscript. We address each of the major comments below.

point-by-point responses
  1. Referee: §3 (Experimental Methodology): The description of how training subsets of varying sizes were constructed lacks detail on sampling method (e.g., random vs. contiguous) and any controls to ensure distributional equivalence across scales; without this, it is difficult to rule out selection effects that could artifactually produce or alter the observed power-law exponents.

    Authors: We agree that more detail on subset construction is needed for reproducibility. The training subsets were constructed via random sampling (without replacement) from the full training set to maintain distributional properties. We will revise §3 to include a clear description of this sampling method and any verification steps for distributional equivalence (one such construction is sketched after these responses). revision: yes

  2. Referee: Results sections (e.g., §4.1–4.4): No error bars, confidence intervals, or goodness-of-fit statistics (such as R² or residual analysis) are reported for the fitted power-law exponents, and there is no sensitivity analysis to the choice of fitting range; this weakens the ability to assess the robustness of the central scaling claims.

    Authors: We acknowledge the value of these statistical measures. Although the fits were consistent across domains and visually robust, we will add error bars (from repeated trials where feasible), report R² and other fit statistics, and perform sensitivity analysis on the fitting range in the revised results sections. revision: yes

  3. Referee: §5 (Discussion): The claim that scaling is 'predictable' rests on the power-law form and exponents persisting beyond the tested regimes, yet the manuscript contains no analysis or discussion of possible breaks, saturation, or changes in effective exponent at substantially larger data volumes or model capacities.

    Authors: The claims are grounded in the empirical observations within the tested regimes. We will expand the Discussion section to address potential limitations at larger scales, including possible saturation or exponent changes, based on trends at the upper limits of our experiments and related literature. However, empirical analysis at substantially larger scales is beyond the scope of this work due to resource constraints. revision: partial
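One way to realize the subset construction described in response 1 is nested prefixes of a single random permutation, which gives sampling without replacement; the nesting is our added assumption, not something the rebuttal specifies.

```python
# Minimal sketch of one subset construction consistent with response 1:
# prefixes of a single random permutation give sampling without
# replacement, and (our assumption) nested subsets, so each larger run
# strictly extends the smaller one. Not the authors' code.
import numpy as np

rng = np.random.default_rng(0)
full_size = 3_200_000                      # hypothetical full training set
order = rng.permutation(full_size)         # one shared shuffle of example indices

subsets = {n: order[:n] for n in (100_000, 200_000, 400_000, 800_000)}
for n, idx in subsets.items():
    print(f"subset of {n:>7}: first indices {idx[:3]}")
```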

Circularity Check

0 steps flagged

No circularity: empirical scaling laws are direct observations, not reductions to fitted inputs

full rationale

The manuscript reports measured power-law relationships between generalization error and factors such as training-set size, model size, and compute across four domains. These relationships are obtained by fitting functional forms to experimental data points collected within the tested regimes; the paper does not derive the power-law exponents from prior equations, self-citations, or uniqueness theorems that would make the reported scaling equivalent to its own inputs by construction. Model-size sublinearity is likewise an observed trend, not a prediction forced by the fitting procedure itself. Because the central claims rest on reproducible empirical measurements rather than any self-referential derivation chain, the analysis is self-contained.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

Central claims rest on empirical observation and power-law fitting to measured error curves; no new theoretical entities or derivations are introduced beyond standard statistical fitting assumptions.

free parameters (1)
  • power-law exponent
    Fitted separately per domain to describe the rate of error reduction with data size.
axioms (1)
  • domain assumption: Power-law functional form adequately captures the scaling relationship over the measured range
    Invoked to summarize observed curves; no derivation from first principles is provided.

pith-pipeline@v0.9.0 · 5557 in / 1230 out tokens · 63530 ms · 2026-05-12T03:56:18.492891+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • HierarchyEmergence · hierarchy_emergence_forces_phi · tag: unclear

    Relation between the paper passage and the cited Recognition theorem.

    Our empirical results show power-law generalization error scaling across a breadth of factors, resulting in power-law exponents—the 'steepness' of the learning curve—yet to be explained by theoretical work. Further, model improvements only shift the error but do not appear to affect the power-law exponent. We also show that model size scales sublinearly with data size.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 37 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. KAN: Kolmogorov-Arnold Networks

    cs.LG 2024-04 conditional novelty 8.0

    KANs with learnable univariate spline activations on edges achieve better accuracy than MLPs with fewer parameters, faster scaling, and direct visualization for scientific discovery.

  2. Language Models are Few-Shot Learners

    cs.CL 2020-05 accept novelty 8.0

    GPT-3 shows that scaling an autoregressive language model to 175 billion parameters enables strong few-shot performance across diverse NLP tasks via in-context prompting without fine-tuning.

  3. Sharp feature-learning transitions and Bayes-optimal neural scaling laws in extensive-width networks

    stat.ML 2026-05 unverdicted novelty 7.0

    In extensive-width networks, features are recovered sequentially through sharp phase transitions, yielding an effective width k_c that unifies Bayes-optimal generalization error scaling as Θ(k_c d / n).

  4. Scalable Distributed Stochastic Optimization via Bidirectional Compression: Beyond Pessimistic Limits

    math.OC 2026-05 unverdicted novelty 7.0

    Inkheart SGD and M4 use bidirectional compression to achieve time complexities in distributed SGD that improve with worker count n and surpass prior lower bounds under a necessary structural assumption.

  5. Decision Boundary-aware Generation for Long-tailed Learning

    cs.CV 2026-05 unverdicted novelty 7.0

    DBG mitigates boundary overlap in long-tailed learning by generating near-boundary samples, leading to better tail class accuracy and more separable decision spaces.

  6. Robust and Clinically Reliable EEG Biomarkers: A Cross Population Framework for Generalizable Parkinson's Disease Detection

    cs.LG 2026-04 conditional novelty 7.0

    A cross-population framework for EEG Parkinson's detection using exhaustive 75 directional evaluations and nested validation shows asymmetric transfer and accuracy up to 94.1% when training diversity increases, suppor...

  7. Better and Worse with Scale: How Contextual Entrainment Diverges with Model Size

    cs.CL 2026-04 unverdicted novelty 7.0

    Contextual entrainment decreases for semantic contexts but increases for non-semantic ones as LLMs scale, following power-law trends with 4x better resistance to misinformation but 2x more copying of arbitrary tokens.

  8. DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model

    cs.CL 2024-05 unverdicted novelty 7.0

    DeepSeek-V2 delivers top-tier open-source LLM performance using only 21B active parameters by compressing the KV cache 93.3% and cutting training costs 42.5% via MLA and DeepSeekMoE.

  9. Scaling Laws for Autoregressive Generative Modeling

    cs.LG 2020-10 accept novelty 7.0

    Autoregressive transformers follow power-law scaling laws for cross-entropy loss with nearly universal exponents relating optimal model size to compute budget across four domains.

  10. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer

    cs.LG 2019-10 unverdicted novelty 7.0

    T5 casts all NLP tasks as text-to-text generation, systematically explores pre-training choices, and reaches strong performance on summarization, QA, classification and other tasks via large-scale training on the Colo...

  11. Practical Scaling Laws: Converting Compute into Performance in a Data-Constrained World

    cs.LG 2026-05 conditional novelty 6.0

    A new scaling law L(N, D, T) = E + (L0 - E) h/(1+h) with h = a/N^α + b/T^β + c N^γ/D^δ that decomposes loss into undercapacity, undertraining, and overfitting terms and saturates between E and L0.

  12. AIPO: Learning to Reason from Active Interaction

    cs.CL 2026-05 unverdicted novelty 6.0

    AIPO trains LLMs to expand their reasoning capability boundary via active multi-agent interaction with Verify, Knowledge, and Reasoning agents during RLVR, using importance sampling and clipping to handle feedback, th...

  13. A Qualitative Test-Risk Mechanism for Scaling Behavior in Normalized Residual Networks

    cs.LG 2026-05 unverdicted novelty 6.0

    Depth expansion in normalized residual networks yields provable test-risk improvement through representational, optimization, and generalization gains under first-order descent and norm-control conditions.

  14. Policy-Guided Stepwise Model Routing for Cost-Effective Reasoning

    cs.AI 2026-05 unverdicted novelty 6.0

    A small RL-trained policy for stepwise model routing between LLM sizes improves the accuracy-cost tradeoff on math benchmarks over handcrafted strategies and matches large process reward model methods.

  15. ZAYA1-8B Technical Report

    cs.AI 2026-05 unverdicted novelty 6.0

    ZAYA1-8B is a reasoning MoE model with 700M active parameters that matches larger models on math and coding benchmarks and reaches 91.9% on AIME'25 via Markovian RSA test-time compute.

  16. InfoLaw: Information Scaling Laws for Large Language Models with Quality-Weighted Mixture Data and Repetition

    cs.CL 2026-05 unverdicted novelty 6.0

    InfoLaw models pretraining as information accumulation where quality sets information density and repetition causes scale-dependent diminishing returns, predicting loss with low error on unseen mixtures and larger sca...

  17. A Limit Theory of Foundation Models: A Mathematical Approach to Understanding Emergent Intelligence and Scaling Laws

    cs.LG 2026-04 unverdicted novelty 6.0

    Emergent intelligence is recast as the existence of the limit of performance E(N,P,K) as N,P,K to infinity, with necessary and sufficient conditions derived via nonlinear Lipschitz operator theory and scaling laws obt...

  18. The Power of Power Law: Asymmetry Enables Compositional Reasoning

    cs.AI 2026-04 unverdicted novelty 6.0

    Power-law data sampling creates beneficial asymmetry in the loss landscape that lets models acquire high-frequency skill compositions first, enabling more efficient learning of rare long-tail skills than uniform distr...

  19. Large language model-enabled automated data extraction for concrete materials informatics

    cond-mat.mtrl-sci 2026-04 unverdicted novelty 6.0

    LLM pipeline extracts nearly 9,000 high-quality blended-cement concrete records from over 27,000 publications with F1 scores up to 0.97 and enables ML analyses showing benefits of large diverse datasets.

  20. Adaptive Test-Time Scaling for Zero-Shot Respiratory Audio Classification

    cs.SD 2026-04 unverdicted novelty 6.0

    TRIAGE adaptively scales test-time compute via tiered zero-shot stages for respiratory audio classification, reaching mean AUROC 0.744 across nine tasks while outperforming prior zero-shot methods.

  21. Large Language Monkeys: Scaling Inference Compute with Repeated Sampling

    cs.LG 2024-07 unverdicted novelty 6.0

    Repeated sampling scales problem coverage log-linearly with sample count, improving SWE-bench Lite performance from 15.9% to 56% using 250 samples.

  22. Textbooks Are All You Need

    cs.CL 2023-06 unverdicted novelty 6.0

    A 1.3B-parameter code model trained on 7B tokens of curated textbook and synthetic data achieves 50.6% on HumanEval, indicating data quality can enable strong performance at small scale.

  23. BLOOM: A 176B-Parameter Open-Access Multilingual Language Model

    cs.CL 2022-11 unverdicted novelty 6.0

    BLOOM is a 176B-parameter open-access multilingual language model trained on the ROOTS corpus that achieves competitive performance on benchmarks, with improved results after multitask prompted finetuning.

  24. eDiff-I: Text-to-Image Diffusion Models with an Ensemble of Expert Denoisers

    cs.CV 2022-11 unverdicted novelty 6.0

    An ensemble of stage-specialized text-to-image diffusion models improves prompt alignment over single shared-parameter models while preserving visual quality and inference speed.

  25. Language Models (Mostly) Know What They Know

    cs.CL 2022-07 unverdicted novelty 6.0

    Language models show good calibration when asked to estimate the probability that their own answers are correct, with performance improving as models get larger.

  26. A General Language Assistant as a Laboratory for Alignment

    cs.CL 2021-12 conditional novelty 6.0

    Ranked preference modeling outperforms imitation learning for language model alignment and scales more favorably with model size.

  27. Scaling Laws and Tradeoffs in Recurrent Networks of Expressive Neurons

    cs.LG 2026-05 unverdicted novelty 5.0

    Recurrent networks built from tunable expressive neurons reveal scaling laws with an optimal parameter split that shifts toward higher per-neuron complexity at larger scales.

  28. Rennala MVR: Improved Time Complexity for Parallel Stochastic Optimization via Momentum-Based Variance Reduction

    math.OC 2026-05 unverdicted novelty 5.0

    Rennala MVR improves time complexity over Rennala SGD for smooth nonconvex stochastic optimization in heterogeneous parallel systems under a mean-squared smoothness assumption.

  29. Physical Foundation Models: Fixed hardware implementations of large-scale neural networks

    cs.LG 2026-04 unverdicted novelty 5.0

    Physical Foundation Models are fixed physical hardware realizations of foundation-scale neural networks that compute via inherent material dynamics, potentially delivering orders-of-magnitude gains in energy efficienc...

  30. A Limit Theory of Foundation Models: A Mathematical Approach to Understanding Emergent Intelligence and Scaling Laws

    cs.LG 2026-04 unverdicted novelty 5.0

    Emergent intelligence corresponds to the limit of a performance function E(N,P,K) as N, P, K go to infinity, originating from a parameter-limit architecture whose existence is governed by Lipschitz conditions, with sc...

  31. Singularity Formation: Synergy in Theoretical, Numerical and Machine Learning Approaches

    math.NA 2026-04 unverdicted novelty 5.0

    The work introduces a modulation-based analytical method for singularity proofs in singular PDEs and refines ML techniques like PINNs and KANs to identify blowup solutions, with application to the open 3D Keller-Segel...

  32. Cooperate to Compete: Strategic Data Generation and Incentivization Framework for Coopetitive Cross-Silo Federated Learning

    cs.AI 2026-04 unverdicted novelty 5.0

    CoCoGen+ models each federated learning round as a weighted potential game with strategic synthetic data generation and payoff redistribution incentives, showing improved efficiency over baselines under non-IID data a...

  33. Towards Scaling Law Analysis For Spatiotemporal Weather Data

    cs.LG 2026-04 unverdicted novelty 5.0

    Scaling laws for weather models exhibit strong cross-channel and cross-horizon heterogeneity, where globally pooled metrics appear favorable while many individual channels degrade at longer leads.

  34. The Platonic Representation Hypothesis

    cs.LG 2024-05 unverdicted novelty 5.0

    Representations learned by large AI models are converging toward a shared statistical model of reality.

  35. StarCoder: may the source be with you!

    cs.CL 2023-05 accept novelty 5.0

    StarCoderBase matches or beats OpenAI's code-cushman-001 on multi-language code benchmarks; the Python-fine-tuned StarCoder reaches 40% pass@1 on HumanEval while retaining other-language performance.

  36. DeepSeek LLM: Scaling Open-Source Language Models with Longtermism

    cs.CL 2024-01 unverdicted novelty 4.0

    DeepSeek LLM 67B exceeds LLaMA-2 70B on code, mathematics and reasoning benchmarks after pre-training on 2 trillion tokens and alignment via SFT and DPO.

  37. Superposition Yields Robust Neural Scaling

    cs.LG 2025-05

Reference graph

Works this paper leans on

12 extracted references · 12 canonical work pages · cited by 36 Pith papers · 2 internal anchors

  1. [1] D. Bahdanau, J. Chorowski, D. Serdyuk, P. Brakel, and Y. Bengio. End-to-end Attention-based Large Vocabulary Speech Recognition. arXiv preprint arXiv:1508.04395v2.

  2. [2] E. Battenberg, J. Chen, R. Child, A. Coates, Y. Gaur, Y. Li, H. Liu, S. Satheesh, D. Seetapun, A. Sriram, and Z. Zhu. Exploring Neural Transducers for End-to-end Speech Recognition. arXiv preprint arXiv:1707.07413.

  3. [3] C. Chelba, T. Mikolov, M. Schuster, Q. Ge, T. Brants, P. Koehn, and T. Robinson. One Billion Word Benchmark for Measuring Progress in Statistical Language Modeling. arXiv preprint arXiv:1312.3005.

  4. [4] A. Hannun, C. Case, J. Casper, B. Catanzaro, G. Diamos, E. Elsen, R. Prenger, S. Satheesh, S. Sengupta, A. Coates, et al. Deep Speech: Scaling Up End-to-End Speech Recognition. arXiv preprint arXiv:1412.5567.

  5. [5] R. Jozefowicz, O. Vinyals, M. Schuster, N. Shazeer, and Y. Wu. Exploring the Limits of Language Modeling. arXiv preprint arXiv:1602.02410v2.

  6. [6] K. Kawaguchi, L. P. Kaelbling, and Y. Bengio. Generalization in Deep Learning. arXiv preprint arXiv:1710.05468v1.

  7. [7] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. arXiv preprint arXiv:1409.0575.

  8. [8] R. Sennrich, B. Haddow, and A. Birch. Neural Machine Translation of Rare Words with Subword Units. arXiv preprint arXiv:1508.07909, 2016a.

  9. [9] R. Sennrich, B. Haddow, and A. Birch. Edinburgh Neural Machine Translation Systems for WMT 16. arXiv preprint arXiv:1606.02891, 2016b.

  10. [10] H. S. Seung, H. Sompolinsky, and N. Tishby. Statistical Mechanics of Learning from Examples. Physical Review A, 45:6056–6091, April 1992.

  11. [11] S. L. Smith and Q. V. Le. A Bayesian Perspective on Generalization and Stochastic Gradient Descent. arXiv preprint arXiv:1710.06451v2.

  12. [12] C. Zhang, S. Bengio, M. Hardt, B. Recht, and O. Vinyals. Understanding Deep Learning Requires Rethinking Generalization. arXiv preprint arXiv:1611.03530v2.

(One extracted snippet was appendix text rather than a citation: the language-model tasks use normalized cross-entropy loss, −(1/N) Σᵢ ln p(wᵢ), where p(wᵢ) is the model's predicted probability of the ith token and N is either the number of sequences in a training batch or the number of predicted characters in the validation set.)
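For completeness, the normalized cross-entropy noted above is straightforward to compute; a minimal sketch with made-up probabilities:

```python
# Minimal sketch of normalized cross-entropy, -(1/N) * sum_i ln p(w_i),
# with hypothetical per-token probabilities.
import math

p = [0.9, 0.6, 0.25, 0.8]                  # hypothetical probabilities of observed tokens
loss = -sum(math.log(q) for q in p) / len(p)
print(f"normalized cross-entropy: {loss:.4f} nats per token")
```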