Why Larger Models Learn More: Effects of Capacity, Interference, and Rare-Task Retention

Andrew Kyle Lampinen; Christopher Potts; Daniel Wurgaft; David Alvarez-Melis; Ekdeep Singh Lubana; Jing Huang; Laura Ruis; Naomi Saphra; Rachit Bansal

arxiv: 2605.29548 · v2 · pith:SNAI4SLSnew · submitted 2026-05-28 · 💻 cs.LG

Pith reviewed 2026-06-29 08:37 UTC · model grok-4.3

keywords: model scaling · task interference · rare-task retention · capacity allocation · gradient interference · synthetic task mixtures · OLMo pretraining

The pith

Larger models learn rare tasks because they allocate enough capacity to common tasks that the gradient updates on those tasks weaken and stop overwriting the rare-task features that accumulate slowly.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a phenomenological argument from power-law scaling that larger models can capture parts of the data distribution that smaller models miss, even with infinite data. It tests this in a controlled mixture of tasks that exhibit monotonic scaling, revealing a competition for neurons in which smaller models devote their resources to frequent or simple tasks and therefore fail on rare and complex ones. Larger models escape the bottleneck through reduced interference: once they dedicate sufficient capacity to common tasks, their gradients on those tasks become weak and no longer overwrite the slowly forming features needed for infrequent tasks. The same pattern appears when pretraining OLMo models from 4M to 4B parameters on tasks that vary in frequency and complexity.

Core claim

Larger models succeed on rare and complex tasks because they allocate enough neurons to high-frequency tasks that the gradient updates for those tasks weaken, leaving the slowly accumulating features of rare tasks intact; smaller models, lacking spare capacity, overwrite those features while optimizing for common tasks.

What carries the argument

The reduced interference mechanism arising from capacity allocation that weakens gradients on frequent tasks.
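The link between a small residual and a weak gradient can be seen in a toy squared-error setting. The sketch below is not the paper's model: the linear regression task, the function name `frequent_task_grad`, and all sizes are assumptions chosen for illustration. It only shows that, for a squared loss, the gradient on a task shrinks in proportion to how much of that task is still unexplained.

```python
import numpy as np

def frequent_task_grad(residual_scale, n=256, d=16, seed=0):
    """Gradient of a mean-squared-error loss for a linear model whose weights
    sit `residual_scale` away from the true solution (hypothetical toy task)."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n, d))           # inputs for the "frequent" task
    w_true = rng.normal(size=d)           # weights that solve the task exactly
    delta = rng.normal(size=d)
    delta /= np.linalg.norm(delta)        # unit-norm error direction
    w = w_true + residual_scale * delta   # current weights, off by the residual
    y = X @ w_true
    # d/dw of 0.5 * mean((X w - y)^2) is X^T (X w - y) / n
    return X.T @ (X @ w - y) / n

# Same data and error direction, three different amounts of remaining residual.
grad_norms = {
    scale: float(np.linalg.norm(frequent_task_grad(scale)))
    for scale in (1.0, 0.1, 0.01)
}
```

Under these assumptions the gradient norm falls linearly with the residual scale, which is the sense in which a well-fit common task stops pushing on shared parameters.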

If this is right

Only models above a certain size learn infrequent and complex tasks in both the synthetic mixture and in OLMo pretraining.
Larger models embed more task-specific features in their hidden representations.
Gradient interference between tasks decreases measurably with scale.
The data-centric competition for neurons explains why scaling improves performance on the tail of the task distribution.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Data-mixture design could deliberately boost the frequency of desired rare tasks when model capacity is limited.
The same interference dynamic may appear when scaling other architectures or training objectives beyond language models.
Practical decisions about model size should be informed by the frequency distribution of the tasks one ultimately wants to retain.

Load-bearing premise

The synthetic mixture of tasks with monotonic scaling curves reproduces the resource competition and interference dynamics that occur during real language-model pretraining on natural data.
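For concreteness, a mixture with power-law task frequencies (the Figure 2 caption mentions K = 32 regression tasks and an exponent β) could be sampled roughly as follows. This is a minimal sketch: the normalisation p_k ∝ k^(-β) and the helper names `task_probabilities` and `sample_task_ids` are assumptions, not the paper's code.

```python
import numpy as np

def task_probabilities(num_tasks=32, beta=1.0):
    """Assumed form: task k (1-indexed, most frequent first) has p_k ∝ k^(-beta)."""
    ranks = np.arange(1, num_tasks + 1, dtype=float)
    weights = ranks ** (-beta)
    return weights / weights.sum()

def sample_task_ids(batch_size, probs, rng):
    """Draw one task index per example in the batch (0-indexed)."""
    return rng.choice(len(probs), size=batch_size, p=probs)

probs = task_probabilities(num_tasks=32, beta=1.0)
rng = np.random.default_rng(0)
batch_tasks = sample_task_ids(batch_size=1024, probs=probs, rng=rng)
```

With β = 1.0 the most frequent task is drawn 32 times as often as the rarest one; a larger β makes the tail thinner, so rare tasks contribute even fewer gradient updates per batch.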

What would settle it

A measurement showing that gradient interference on rare-task features remains equally strong in large and small models even after the large models have allocated extra capacity to common tasks would falsify the account.
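One way such a measurement could be set up is sketched below, loosely following the Figure 9 caption: compare the batch gradient with a task reference direction, and split the batch gradient into the parts contributed by task tokens and by non-task tokens. The array layout (one flattened gradient per token, restricted to the parameters of interest), the function names, and the toy numbers are all assumptions for illustration.

```python
import numpy as np

def cosine(a, b, eps=1e-12):
    """Cosine similarity with a small epsilon to avoid division by zero."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + eps))

def alignment_report(per_token_grads, is_task_token, g_ref):
    """Alignment of full, task-only, and non-task-only gradients with g_ref.

    per_token_grads: (num_tokens, num_params) array, one gradient per token.
    is_task_token:   boolean mask of length num_tokens.
    g_ref:           (num_params,) task reference direction.
    """
    grads = np.asarray(per_token_grads, dtype=float)
    mask = np.asarray(is_task_token, dtype=bool)
    task_grad = grads[mask].sum(axis=0)
    other_grad = grads[~mask].sum(axis=0)
    full_grad = task_grad + other_grad
    return {
        "full": cosine(full_grad, g_ref),
        "task": cosine(task_grad, g_ref),
        "non_task": cosine(other_grad, g_ref),
    }

# Toy batch: two task tokens push along g_ref, one non-task token pushes against it.
report = alignment_report(
    per_token_grads=[[1.0, 0.0], [1.0, 0.0], [-3.0, 1.0]],
    is_task_token=[True, True, False],
    g_ref=np.array([1.0, 0.0]),
)
```

In this toy batch the task tokens align with the reference direction but the full-batch gradient points away from it, which is the signature of interference. The account would be in trouble if the `full` and `non_task` alignments looked the same in small and large models.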

Figures

Figures reproduced from arXiv: 2605.29548 by Andrew Kyle Lampinen, Christopher Potts, Daniel Wurgaft, David Alvarez-Melis, Ekdeep Singh Lubana, Jing Huang, Laura Ruis, Naomi Saphra, Rachit Bansal.

**Figure 1.** Learning a part of the distribution requires model scaling. Compare the loss curves for compute-optimal scaling with the one following an infinite resource regime (labeled asymptotic). The region labeled purple denotes the amount of loss both a smaller model with $N_s$ parameters and a larger model with $N_l$ parameters are able to achieve with respect to a random baseline under finite resources. We call loss r…

**Figure 2.** Feature Utility Predicts Learning Order. We train students of varying width on a mixture of K = 32 regression tasks with power-law task frequencies (β) and plot per-task loss (normalized by mean predictor). (a) Empirical phase diagram where task features (β = 1.0) are retained as a function of width and task frequency, matching our prediction. (b) Loss matches the analytic prediction from Theorem 3 across …

**Figure 3.** Residual Controls Learning. We plot signals encoded in model representations for most frequent and rarest tasks as a function of width $N$ and remaining residual $\delta_F$. In line with our predictions, we see larger models perfectly capture tasks of all frequencies, while smaller models do not. Meanwhile, even for the largest models, when the residual signal to be explained for frequent tasks is high, rarer task…

**Figure 4.** Rare-Task Retention by Larger Models. We isolate retention by training with a matched-frequency injection protocol: the rare task is withheld for G steps and then reintroduced in a batch such that its overall frequency is consistent across settings. (a) Training dynamics for G = 1280. We see small models briefly encode the rare task (normalized signal $\tilde{s}_r$: left y-axis) after each injection; specifically, $\Delta \tilde{s}_r$ …

**Figure 6.** Behavioral Evidence. (a) Tasks are learned in the order of frequency. Solid lines: We inject the same comparison task (TCMP) at different frequencies and measure the task training loss. Dashed lines: Reference arithmetic tasks observed from pre-training data. (b) With matched-frequency injection of the comparison task (TCMP), i.e., injecting N task instances every N batches, a larger injection gap N degrad…

**Figure 5.** Larger Models Learn Rare Tasks; Smaller Models Do Not. We visualize training loss and test accuracy for the (a) Comparison task (TCMP) and (b) Modular Addition task (TADD). Orange color indicates lower loss/higher accuracy. Overall, we see that increasing width enables learning of low-frequency tasks, in line with our prior claims. Tasks. We consider two special tasks T: comparison (TCMP) and modular a…

**Figure 7.** Representational Evidence. Scaling model size (width) and increasing task frequency lead to models learning more task-relevant features. Rows correspond to (a) the comparison task TCMP and (b) the modular addition task TADD. The first column shows feature geometry, visualizing the global token order features for TCMP and the Fourier-mode features for TADD. The last two columns quantify how these features s…

**Figure 9.** Gradient Interference. We inject 100 instances of the TCMP task every 100 batches and analyze how batch gradients align with a task reference direction $g_r$. We further decompose the batch gradient into contributions from task tokens and non-task tokens. Top: Cosine similarity between full-batch gradient direction and the task direction $g_r$. Middle: Cosine similarity between batch task gradient direction and …

**Figure 8.** Rare-Task Retention. Larger models can retain the injected task information better, i.e., larger task eval loss drop, when injecting task instances every 100 batches. We now connect the behavioral evidence (Sec. 4.2) and the internal representation account (Sec. 4.3) by analyzing how task gradients interfere with non-task gradients on a set of neurons that implement the task circuit. We focus on TCMP trai…

**Figure 10.** The first MLP layer has the strongest causal effects on the model’s logits prediction.

**Figure 11.** Competition Dynamics over Neurons. Rare task alignment over training for a softmax-gated model of 1 vs. 2 neurons. (a) Two orthogonal task directions $T_f$ (frequent, sampled with probability 0.9) and $T_r$ (rare, probability 0.1) compete for neurons. (b) With a single neuron, the frequent task dominates; with two neurons, one neuron specializes to each task, allowing rare task alignment to reach and sustain va…

**Figure 12.** Task Spectra. We use power-law task spectra to vary the complexity of a task in our experiments, i.e., the $j$-th feature contributes signal proportional to $j^{-\alpha_k}$ for the $k$-th task. While the main paper studies the setting with uniform values for $\alpha_k$, hence making frequency the core knob for varying utility, we now vary complexity by splitting a range of $\alpha$ values; this results in task spectra such that the n…

**Figure 13.** Learning Phases Under Varying Complexity. Reproduction of Fig. 2a under varying task spectra. We see increase in the complexity gap leads to higher emphasis on the top two modes’ learning, since under a power-law spectrum decay, the eigenvalue associated with larger modes will be small. More critically, learning order is now not monotonically predicted by frequency alone: this is most easily visible in th…

**Figure 14.** Feature Utilities Continue to Predict Learning. Reproduction of Fig. 2b under varying task spectra. While …

**Figure 15.** Complexity Residual. We plot the amount of signal encoded in model representation for most frequent and rarest tasks as a function of width $N$ and remaining residual $\delta_F$. In line with our predictions, we see larger models perfectly capture tasks of all frequencies, while smaller models do not. Meanwhile, even for the largest models, when the residual signal remaining to explain for frequent tasks is high, r…

**Figure 16.** Complexity Retention Phases. We isolate retention by training with a matched-frequency injection protocol: the lowest-total utility task is withheld for G steps and then reintroduced in a batch such that its overall frequency is consistent across settings. (a) Training dynamics for G = 1280. We see small models briefly encode the rare task (normalized signal $\tilde{s}_r$: left y-axis) after each injection; specifically…

**Figure 17.** Feature Utility Predicts Order of Learning. We extend results from …

**Figure 18.** Rank-1 Verification of Utility Predicting Learning Order. Left: Per-task subspace alignment $\|P_U b_k\|^2$ at the end of training as a function of width $N$ and task index $k$. By our account, we expect tasks $1, \dots, N$ to be retained, while tasks $N+1, \dots, K$ are not retained. The black step segments in the heatmap mark the predicted retention horizon $k = N$ per width. Results align well with our expectations. Rig…

**Figure 19.** Residual Controls Rare-Task Learning. We vary β ∈ {0.5, 1.0, 1.5, 2.0}. Each panel reports the normalized rare-task and most-frequent-task signal as a function of the frequent-task residual $\delta_F$, with width $N$ encoded by marker brightness (dark = small $N$, bright = large $N$; see grayscale colorbar on the right). Dashed vertical line marks the analytic threshold $\delta_F^{*}(N_r^{\mathrm{crit}})$ computed from Corollary 5 under…

**Figure 20.** Per-gap dynamics: Reproducing retention results across different injection gaps and widths. We vary the injection gap G in the set {64, 128, 256, 512, 1024, 1280} (top to bottom) and reproduce the results shown in Fig. 4a for widths N ∈ {32, 96, 128, 192, 256}. In each cell, the left y-axis reports the normalized rare-task signal $\tilde{s}_r(U_t)$, while the right y-axis (gray) is the gain / decay curves reporting …

**Figure 21.** Persistence of the multi-rank phase diagram at 1M steps. Per-task normalized loss $\ell_k / \ell_{k,\mathrm{baseline}}$ versus training step (log-x, linear-y) for six widths N ∈ {8, 16, 32, 64, 128, 256}; $\ell_{k,\mathrm{baseline}} = \|a_k\|^2 / D_t$ is the mean-predictor MSE per task. Tasks colored by index from orange (k = 1, most frequent) to purple (k = 32, rarest); vertical dotted line marks the training budget used in main paper, i.e., 100K st…

**Figure 22.** Task loss vs. general language modeling loss.

**Figure 23.** TCMP task eval loss vs. compute by model size. Dashed black line shows the compute-optimal frontier. In …
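Several captions above (Figures 4, 6, 8, 9, 16, and 20) refer to a matched-frequency injection protocol, in which the rare task is withheld for a gap of G steps and then reintroduced in one batch so that its overall frequency does not depend on G. A minimal sketch of such a schedule follows. The function name, the choice to inject at the end of each window, and the default rate of one instance per step are assumptions, not details confirmed by the captions.

```python
def injection_counts(num_steps, gap, rate_per_step=1):
    """Rare-task instances to inject at each training step.

    The task is withheld for `gap - 1` steps, then `gap * rate_per_step`
    instances arrive together, so the long-run average stays at
    `rate_per_step` instances per step for every gap.
    """
    return [
        gap * rate_per_step if (step + 1) % gap == 0 else 0
        for step in range(num_steps)
    ]

# Three gaps, same total exposure over 12,800 steps.
totals = {gap: sum(injection_counts(12800, gap)) for gap in (1, 100, 1280)}
```

Because total exposure is held fixed, any difference in what a model retains across gaps can be attributed to forgetting between injections and not to how often the task was seen.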

Original abstract

Larger models learn tasks smaller models do not. What drives this phenomenon? We develop a simple phenomenological argument that power-law scaling already suggests that a larger model will be able to learn a part of the data distribution that a smaller model fails to learn, even with infinite training data. To validate this claim and identify its causes, we study the effects of model scaling on a synthetic setup consisting of a mixture of tasks that show monotonic scaling curves. The results point to a data-induced competition over resources (neurons). Specifically, smaller models allocate their neurons to high frequency or low complexity tasks, and so they learn solutions that perform poorly on rare and complex tasks. Moreover, this happens even when solutions capable of expressing the desired task exist. We then assess how a larger model circumvents this data-centric bottleneck, finding that it traces to a reduced interference mechanism: larger models can allocate enough resources to common tasks that the gradient updates for those tasks become weak, which means that they do not overwrite rare-task features as they slowly accumulate. Finally, to further validate these claims, we pretrain OLMo models (4M to 4B parameters) on novel tasks of varying frequency and complexity. The results mirror those from our synthetic data experiments: only the larger OLMo models learn the infrequent and complex tasks, and these larger models embed more task features in their representations and show less gradient interference between tasks. Overall, we offer a data-centric account of why larger models learn tasks that smaller models fail to. This helps explain why larger models are better in practice, and it can inform practical questions concerning model sizing and training data mixtures.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper pins down a reduced gradient interference mechanism where larger models weaken common-task updates enough to let rare-task features accumulate, shown in both synthetic mixtures and OLMo runs, though the leap to natural entangled data remains the open question.

The main takeaway is that larger models learn rare and complex tasks because they can park enough capacity on common ones that the resulting gradients become weak and stop overwriting the slow progress on the tail. The synthetic mixtures and the OLMo pretraining runs (4M to 4B) are set up to make this visible through direct measurements of neuron allocation, feature embedding, and cross-task gradient interference.

What stands out as new is the concrete data-centric story: smaller models route resources to high-frequency or low-complexity tasks even when solutions for the rare ones exist, while larger models reduce that interference once capacity thresholds are crossed. The OLMo experiments add weight by inserting controlled novel tasks into real pretraining and showing the same pattern in representation quality and gradient behavior.

The experiments are the strongest part. They are direct, use monotonic scaling curves by design, and include checks on both frequency and complexity. That gives a testable account rather than another scaling observation.

The soft spot is the gap between the controlled setups and natural pretraining. The synthetic tasks are cleanly separable; web-scale data has rare phenomena statistically entangled with common tokens, which could alter the gradient structure and allocation dynamics. The opening power-law argument is mainly motivational and does not do heavy lifting.

This is worth bringing to a scaling or data-mixture reading group. Readers who care about why size helps with tail performance or how to adjust training mixtures will find the mechanism and the OLMo measurements useful. It deserves peer review because the experiments are reproducible and the central claim can be probed with similar controls, even if the generalization step needs more work.

Referee Report

2 major / 0 minor

Summary. The manuscript claims that larger models succeed on rare and complex tasks that smaller models fail to learn because they allocate sufficient capacity to common tasks, weakening the associated gradient updates and thereby reducing interference that would otherwise overwrite slowly accumulating rare-task features. This data-centric account is motivated by a phenomenological argument based on power-law scaling and is supported by synthetic experiments on task mixtures engineered to exhibit monotonic scaling curves, plus OLMo pretraining runs (4M to 4B parameters) that insert novel tasks of controlled frequency and complexity, where larger models are shown to embed more task features and exhibit less gradient interference.

Significance. If the reduced-interference mechanism holds, the work supplies a concrete explanation for why scaling improves performance on the tail of the data distribution and could inform practical choices about model size and data mixtures. Credit is due for the controlled synthetic setups that isolate resource competition and for the direct scaling experiments with OLMo models that provide reproducible evidence rather than relying solely on post-hoc analysis of existing checkpoints.

major comments (2)

[Synthetic experiments section] The claim that the observed monotonic scaling and reduced interference explain larger-model behavior on natural data rests on the assumption that tasks can be treated as discrete, separable components whose resource competition is directly measurable; this is load-bearing for the central extrapolation because natural web-scale distributions entangle rare phenomena with common tokens, which may produce different gradient structures and neuron allocations.
[OLMo pretraining experiments] Inserting novel tasks of controlled frequency/complexity into the training mixture demonstrates the pattern but does not establish that the same reduced-interference dynamic governs the gradient updates arising from the statistically entangled rare events already present in the base corpus; this distinction is load-bearing for the claim that the mechanism accounts for scaling success in practice.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and indicate where revisions will be made to clarify limitations.

Referee: [Synthetic experiments section] The claim that the observed monotonic scaling and reduced interference explain larger-model behavior on natural data rests on the assumption that tasks can be treated as discrete, separable components whose resource competition is directly measurable; this is load-bearing for the central extrapolation because natural web-scale distributions entangle rare phenomena with common tokens, which may produce different gradient structures and neuron allocations.

Authors: The synthetic experiments are designed to isolate the interference mechanism by using discrete, controllable tasks, enabling direct measurement of capacity allocation, gradient magnitudes, and feature accumulation that would be difficult to disentangle in natural data. This controlled isolation is what allows us to identify the reduced-interference dynamic as a candidate explanation. The OLMo experiments then test the same pattern when tasks are inserted into a real corpus. We will add a dedicated limitations paragraph in the discussion section acknowledging that entanglement in natural distributions may alter gradient structures and that the synthetic results therefore provide mechanistic insight rather than a complete account of all scaling behaviors on web data. revision: partial

Referee: [OLMo pretraining experiments] Inserting novel tasks of controlled frequency/complexity into the training mixture demonstrates the pattern but does not establish that the same reduced-interference dynamic governs the gradient updates arising from the statistically entangled rare events already present in the base corpus; this distinction is load-bearing for the claim that the mechanism accounts for scaling success in practice.

Authors: Inserting novel tasks provides the necessary experimental control to vary frequency and complexity independently while holding the base corpus fixed, allowing us to measure interference and feature embedding directly. This setup demonstrates that the mechanism operates even when tasks are added to a realistic mixture. We agree that it does not directly measure interference among already-entangled rare events native to the corpus. We will revise the abstract, introduction, and conclusion to qualify the claims as applying to controlled insertions and to note the open question of native rare-event dynamics as an important direction for future work. revision: partial

Circularity Check

0 steps flagged

No significant circularity detected

The paper advances a phenomenological argument from existing power-law scaling observations, then validates the proposed reduced-interference mechanism through independent synthetic task-mixture experiments and new OLMo pretraining runs (4M–4B) that insert controlled novel tasks. These empirical setups are constructed and measured separately from any prior fitted quantities or self-citations; the central claims rest on the outcomes of these fresh experiments rather than reducing by construction to inputs via the paper's equations or load-bearing self-references. No steps match the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that power-law scaling implies larger models capture previously unlearned distribution mass, plus the modeling choice that the synthetic task mixture reproduces real pretraining dynamics.

axioms (1)

Domain assumption: Power-law scaling of model performance already implies that larger models learn parts of the data distribution that smaller models fail to learn, even with infinite data.
Invoked in the opening phenomenological argument to motivate the capacity effect.
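
To make this axiom concrete, here is one way the argument can be written down. The parametric loss below is the familiar compute-optimal (Chinchilla-style) form and is an assumption made for illustration; this page does not state which functional form the paper itself uses. The symbols $E$, $A$, $B$, $\alpha$, and $\beta_D$ are likewise hypothetical fit constants, while $N_s$ and $N_l$ are the smaller and larger parameter counts named in the Figure 1 caption.

```latex
% Assumed loss as a function of parameters N and training tokens D
L(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta_D}},
\qquad A, B, \alpha, \beta_D > 0 .

% Infinite-data limit: the data term vanishes, the capacity term does not
L_{\infty}(N) = \lim_{D \to \infty} L(N, D) = E + \frac{A}{N^{\alpha}} .

% For N_s < N_l the asymptotic gap is strictly positive
L_{\infty}(N_s) - L_{\infty}(N_l)
  = A \left( N_s^{-\alpha} - N_l^{-\alpha} \right) > 0 .
```

Under this assumed form, the gap in the last line is loss that the smaller model cannot remove with any amount of data, which is one reading of the claim that some part of the distribution is learned only with more capacity.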

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Neuron Populations Exhibit Divergent Selectivity with Scale
cs.LG 2026-06 unverdicted novelty 5.0

Rosetta Neurons in language models up to 30B and vision models up to 5B parameters scale sublinearly with size while becoming more selective and monosemantic.

Reference graph

Works this paper leans on

146 extracted references · 40 canonical work pages · cited by 1 Pith paper · 22 internal anchors

[1]

OpenAI GPT-5 System Card

Aaditya Singh, Adam Fry, Adam Perelman, Adam Tart, Adi Ganesh, Ahmed El-Kishky, Aidan McLaughlin, Aiden Low, AJ Ostrow, Akhila Ananthram, et al. Openai gpt-5 system card. arXiv preprint arXiv:2601.03267, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[2]

System Card: Claude Mythos Preview, 2026

Anthropic. System Card: Claude Mythos Preview, 2026. https://www-cdn.anthropic. com/08ab9158070959f88f296514c21b7facce6f52bc.pdf

2026
[3]

Gemini 3 Pro - Model Card, 2026

Google DeepMind. Gemini 3 Pro - Model Card, 2026. https://storage.googleapis. com/deepmind-media/Model-Cards/Gemini-3-Pro-Model-Card.pdf

2026
[4]

DeepSeek-V3 Technical Report

Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, et al. Deepseek-v3 technical report.arXiv preprint arXiv:2412.19437, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[5]

DeepSeek-V4-Pro, 2026

DeepSeek-AI. DeepSeek-V4-Pro, 2026. https://huggingface.co/deepseek-ai/ DeepSeek-V4-Pro/blob/main/DeepSeek_V4.pdf

2026
[6]

Kimi Team, Tongtong Bai, Yifan Bai, Yiping Bao, SH Cai, Yuan Cao, Y Charles, HS Che, Cheng Chen, Guanduo Chen, et al. Kimi k2. 5: Visual agentic intelligence.arXiv preprint arXiv:2602.02276, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[7]

Measuring AI ability to complete long software tasks

Thomas Kwa, Ben West, Joel Becker, Amy Deng, Katharyn Garcia, Max Hasin, Sami Jawhar, Megan Kinniment, Nate Rush, Sydney V on Arx, et al. Measuring AI ability to complete long software tasks. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025

2025
[8]

FrontierMath: A Benchmark for Evaluating Advanced Mathematical Reasoning in AI

Elliot Glazer, Ege Erdil, Tamay Besiroglu, Diego Chicharro, Evan Chen, Alex Gunning, Caroline Falkman Olsson, Jean-Stanislas Denain, Anson Ho, Emily de Oliveira Santos, et al. Frontiermath: A benchmark for evaluating advanced mathematical reasoning in ai.arXiv preprint arXiv:2411.04872, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[9]

ARC-AGI-3, 2026.https://arcprize.org/arc-agi/3

ARC Prize Foundation. ARC-AGI-3, 2026.https://arcprize.org/arc-agi/3

2026
[10]

Terminal-bench: Bench- marking agents on hard, realistic tasks in command line interfaces

Mike A Merrill, Alexander G Shaw, Nicholas Carlini, Boxuan Li, Harsh Raj, Ivan Bercovich, Lin Shi, Jeong Yeon Shin, Thomas Walshe, E Kelly Buchanan, et al. Terminal-bench: Bench- marking agents on hard, realistic tasks in command line interfaces. InThe Fourteenth Inter- national Conference on Learning Representations, 2026. URL https://openreview.net/ for...

2026
[11]

Responsible Scaling Policy, 2026

Anthropic. Responsible Scaling Policy, 2026. https://www.anthropic.com/ responsible-scaling-policy

2026
[12]

Our Approach to Frontier Risk, 2023

OpenAI. Our Approach to Frontier Risk, 2023. https://openai.com/global-affairs/ our-approach-to-frontier-risk/

2023
[13]

Oriane Siméoni, Huy V V o, Maximilian Seitzer, Federico Baldassarre, Maxime Oquab, Cijo Jose, Vasil Khalidov, Marc Szafraniec, Seungeun Yi, Michaël Ramamonjisoa, et al. Dinov3. arXiv preprint arXiv:2508.10104, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[14]

Predicting emergent abilities with infinite resolution evaluation

Shengding Hu, Xin Liu, Xu Han, Xinrong Zhang, Chaoqun He, Weilin Zhao, Yankai Lin, Ning Ding, Zebin Ou, Guoyang Zeng, Zhiyuan Liu, and Maosong Sun. Predicting emergent abilities with infinite resolution evaluation. InThe Twelfth International Conference on Learning Representations, 2024. URLhttps://openreview.net/forum?id=lDbjooxLkD

2024
[15]

Chi, Tatsunori Hashimoto, Oriol Vinyals, Percy Liang, Jeff Dean, and William Fedus

Jason Wei, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph, Sebastian Borgeaud, Dani Yogatama, Maarten Bosma, Denny Zhou, Donald Metzler, Ed H. Chi, Tatsunori Hashimoto, Oriol Vinyals, Percy Liang, Jeff Dean, and William Fedus. Emergent abilities of large language models.Transactions on Machine Learning Research, 2022. ISSN 2835-8856. URL https://openr...

2022
[16]

137 emergent abilities of large language models, 2022

Jason Wei. 137 emergent abilities of large language models, 2022. https://www.jasonwei. net/blog/emergence. 11

2022
[17]

arXiv preprint arXiv:2307.15936 , year=

Sanjeev Arora and Anirudh Goyal. A theory for emergence of complex skills in language models.arXiv preprint arXiv:2307.15936, 2023

work page arXiv 2023
[18]

Understanding emergent abilities of language models from the loss perspective

Zhengxiao Du, Aohan Zeng, Yuxiao Dong, and Jie Tang. Understanding emergent abilities of language models from the loss perspective. InThe Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024. URL https://openreview.net/forum? id=35DAviqMFo

2024
[19]

Larger language models do in-context learning differently

Jerry Wei, Jason Wei, Yi Tay, Dustin Tran, Albert Webson, Yifeng Lu, Xinyun Chen, Hanxiao Liu, Da Huang, Denny Zhou, et al. Larger language models do in-context learning differently. arXiv preprint arXiv:2303.03846, 2023

work page arXiv 2023
[20]

Scaling Laws for Neural Language Models

Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models.arXiv preprint arXiv:2001.08361, 2020

work page internal anchor Pith review Pith/arXiv arXiv 2001
[21]

Deep Learning Scaling is Predictable, Empirically

Joel Hestness, Sharan Narang, Newsha Ardalani, Gregory Diamos, Heewoo Jun, Hassan Kianinejad, Md Mostofa Ali Patwary, Yang Yang, and Yanqi Zhou. Deep learning scaling is predictable, empirically.arXiv preprint arXiv:1712.00409, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017
[22]

Rosenfeld, Amir Rosenfeld, Yonatan Belinkov, and Nir Shavit

Jonathan S. Rosenfeld, Amir Rosenfeld, Yonatan Belinkov, and Nir Shavit. A constructive prediction of the generalization error across scales. InInternational Conference on Learning Representations, 2020. URLhttps://openreview.net/forum?id=ryenvpEKDr

2020
[23]

Scaling Laws for Autoregressive Generative Modeling

Tom Henighan, Jared Kaplan, Mor Katz, Mark Chen, Christopher Hesse, Jacob Jackson, Hee- woo Jun, Tom B Brown, Prafulla Dhariwal, Scott Gray, et al. Scaling laws for autoregressive generative modeling.arXiv preprint arXiv:2010.14701, 2020

work page internal anchor Pith review Pith/arXiv arXiv 2010
[24]

Scaling Laws for Transfer

Danny Hernandez, Jared Kaplan, Tom Henighan, and Sam McCandlish. Scaling laws for transfer.arXiv preprint arXiv:2102.01293, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021
[25]

Scaling Language Models: Methods, Analysis & Insights from Training Gopher

Jack W Rae, Sebastian Borgeaud, Trevor Cai, Katie Millican, Jordan Hoffmann, Francis Song, John Aslanides, Sarah Henderson, Roman Ring, Susannah Young, et al. Scaling language models: Methods, analysis & insights from training gopher.arXiv preprint arXiv:2112.11446, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021
[26]

Revisiting neural scaling laws in language and vision.Advances in Neural Information Processing Systems, 35:22300– 22312, 2022

Ibrahim M Alabdulmohsin, Behnam Neyshabur, and Xiaohua Zhai. Revisiting neural scaling laws in language and vision.Advances in Neural Information Processing Systems, 35:22300– 22312, 2022

2022
[27]

Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024
[28]

Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, et al. Training compute-optimal large language models. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho, editors, Advances in Neural Information Processing Systems, 2022. URL https://openreview.net/forum?id=iBBcRUlOAPR
[30]

Tim Pearce and Jinyeop Song. Reconciling kaplan and chinchilla scaling laws. Transactions on Machine Learning Research, 2024. ISSN 2835-8856. URL https://openreview.net/forum?id=NLoaLyuUUF
[31]

Blake Bordelon, Alexander Atanasov, and Cengiz Pehlevan. A dynamical model of neural scaling laws. In Forty-first International Conference on Machine Learning, 2024. URL https://openreview.net/forum?id=nbOY1OmtRc
[32]

Blake Bordelon, Alexander Atanasov, and Cengiz Pehlevan. How feature learning can improve neural scaling laws. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=dEypApI1MZ
[34]

Yasaman Bahri, Ethan Dyer, Jared Kaplan, Jaehoon Lee, and Utkarsh Sharma. Explaining neural scaling laws. Proceedings of the National Academy of Sciences, 121(27):e2311878121, 2024
[35]

Licong Lin, Jingfeng Wu, Sham M. Kakade, Peter L. Bartlett, and Jason Lee. Scaling laws in linear regression: Compute, parameters, and data. In A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. Tomczak, and C. Zhang, editors, Advances in Neural Information Processing Systems, volume 37, pages 60556–60606. Curran Associates, Inc., 2024. doi: 10. 5...
[36]

Eric Michaud, Ziming Liu, Uzay Girit, and Max Tegmark. The quantization model of neural scaling. In A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine, editors, Advances in Neural Information Processing Systems, volume 36, pages 28699–28722. Curran Associates, Inc., 2023. URL https://proceedings.neurips.cc/paper_files/paper/2023/file/5b6...
[37]

Alexander Maloney, Daniel A Roberts, and James Sully. A solvable model of neural scaling laws. arXiv preprint arXiv:2210.16859, 2022
[38]

Ekdeep Singh Lubana, Kyogo Kawaguchi, Robert P. Dick, and Hidenori Tanaka. A percolation model of emergence: Analyzing transformers trained on a formal language. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=0pLCDJVVRD
[39]

Francesco Cagnetta, Hyunmo Kang, and Matthieu Wyart. Learning curves theory for hierarchically compositional data with power-law distributed features. In Forty-second International Conference on Machine Learning, 2025. URL https://openreview.net/forum?id=Lw0kC75dY0
[40]

Francesco Cagnetta, Alessandro Favero, Antonio Sclocchi, and Matthieu Wyart. Scaling laws and representation learning in simple hierarchical languages: Transformers vs. convolutional architectures. arXiv preprint arXiv:2505.07070, 2025
[41]

Francesco Cagnetta, Allan Raventós, Surya Ganguli, and Matthieu Wyart. Deriving neural scaling laws from the statistics of natural language. arXiv preprint arXiv:2602.07488, 2026
[42]

Benjamin Edelman, Surbhi Goel, Sham Kakade, Eran Malach, and Cyril Zhang. Pareto frontiers in deep feature learning: Data, compute, width, and luck. Advances in Neural Information Processing Systems, 36:48021–48034, 2023
[43]

Nathan Lambert, Jacob Morrison, Valentina Pyatkin, Shengyi Huang, Hamish Ivison, Faeze Brahman, Lester James Validad Miranda, Alisa Liu, Nouha Dziri, Xinxi Lyu, et al. Tulu 3: Pushing frontiers in open language model post-training. In Second Conference on Language Modeling, 2025. URL https://openreview.net/forum?id=i1uGbfHHpH
[44]

Jonas Hübotter, Frederike Lübeck, Lejs Behric, Anton Baumann, Marco Bagatella, Daniel Marta, Ido Hakimi, Idan Shenfeld, Thomas Kleine Buening, Carlos Guestrin, et al. Reinforcement learning via self-distillation. arXiv preprint arXiv:2601.20802, 2026
[45]

Bloomberg. OpenAI Claims DeepSeek Distilled US Models to Gain an Edge, 2026. https://www.bloomberg.com/news/articles/2026-02-12/openai-accuses-deepseek-of-distilling-us-models-to-gain-an-edge?
[46]

Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024
[47]

Huajian Xin, Z.Z. Ren, Junxiao Song, Zhihong Shao, Wanjia Zhao, Haocheng Wang, Bo Liu, Liyue Zhang, Xuan Lu, Qiushi Du, et al. Deepseek-prover-v1.5: Harnessing proof assistant feedback for reinforcement learning and monte-carlo tree search. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=...
[48]

Rishabh Agarwal, Nino Vieillard, Yongchao Zhou, Piotr Stanczyk, Sabela Ramos Garea, Matthieu Geist, and Olivier Bachem. On-policy distillation of language models: Learning from self-generated mistakes. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=3zKtaqxLhW
[50]

Siyan Zhao, Zhihui Xie, Mengchen Liu, Jing Huang, Guan Pang, Feiyu Chen, and Aditya Grover. Self-distilled reasoner: On-policy self-distillation for large language models. arXiv preprint arXiv:2601.18734, 2026
[51]

Yunhao Tang, Sid Wang, Lovish Madaan, and Remi Munos. Beyond verifiable rewards: Scaling reinforcement learning in language models to unverifiable data. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2026. URL https://openreview.net/forum?id=pc6M9h3T9m
[52]

Kimi Team, Angang Du, Bofei Gao, Bowei Xing, Changjiu Jiang, Cheng Chen, Cheng Li, Chenjun Xiao, Chenzhuang Du, Chonghua Liao, et al. Kimi k1.5: Scaling reinforcement learning with llms. arXiv preprint arXiv:2501.12599, 2025
[53]

Cody Blakeney, Jessica Zosa Forde, Jonathan Frankle, Ziliang Zong, and Matthew L Leavitt. Reduce, reuse, recycle: Improving training efficiency with distillation. arXiv preprint arXiv:2211.00683, 2022
[54]

Shikai Qiu, Lechao Xiao, Andrew Gordon Wilson, Jeffrey Pennington, and Atish Agarwala. Scaling collapse reveals universal dynamics in compute-optimally trained neural networks. In Forty-second International Conference on Machine Learning, 2025. URL https://openreview.net/forum?id=Fvq9ogLnLN
[55]

Elliot Paquette, Courtney Paquette, Lechao Xiao, and Jeffrey Pennington. 4+3 phases of compute-optimal neural scaling laws. Advances in Neural Information Processing Systems, 37:16459–16537, 2024
[56]

Team OLMo, Pete Walsh, Luca Soldaini, Dirk Groeneveld, Kyle Lo, Shane Arora, Akshita Bhagia, Yuling Gu, Shengyi Huang, Matt Jordan, et al. 2 OLMo 2 Furious, 2024. URL https://arxiv.org/abs/2501.00656
[57]

Yedi Zhang, Andrew M Saxe, and Peter E. Latham. Saddle-to-saddle dynamics explains a simplicity bias across neural network architectures. In The Fourteenth International Conference on Learning Representations, 2026. URL https://openreview.net/forum?id=Vit5M0G5Gb
[58]

Emmanuel Abbe, Enric Boix Adsera, and Theodor Misiakiewicz. Sgd learning on neural networks: leap complexity and saddle-to-saddle dynamics. In The Thirty Sixth Annual Conference on Learning Theory, pages 2552–2623. PMLR, 2023
[59]

Arthur Jacot, François Ged, Berfin Şimşek, Clément Hongler, and Franck Gabriel. Saddle-to-saddle dynamics in deep linear networks: Small initialization training, symmetry, and sparsity. arXiv preprint arXiv:2106.15933, 2021
[60]

Daniel Kunin, Giovanni Luca Marchetti, Feng Chen, Dhruva Karkada, James B Simon, Michael R DeWeese, Surya Ganguli, and Nina Miolane. Alternating gradient flows: A theory of feature learning in two-layer neural networks. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2026. URL https://openreview.net/forum?id=t7LKc0MMW6
[61]

Matthew Jagielski, Om Thakkar, Florian Tramer, Daphne Ippolito, Katherine Lee, Nicholas Carlini, Eric Wallace, Shuang Song, Abhradeep Guha Thakurta, Nicolas Papernot, and Chiyuan Zhang. Measuring forgetting of memorized training examples. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=7bJizxLKrR
[62]

Nicholas Carlini, Chang Liu, Úlfar Erlingsson, Jernej Kos, and Dawn Song. The secret sharer: Evaluating and testing unintended memorization in neural networks. In 28th USENIX Security Symposium (USENIX Security 19), pages 267–284, Santa Clara, CA, August 2019. USENIX Association. ISBN 978-1-939133-06-9. URL https://www.usenix.org/conference/usenixsecur...
[63]

Jing Huang, Diyi Yang, and Christopher Potts. Demystifying verbatim memorization in large language models. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 10711–10732, 2024
[64]

Johnny Wei, Ameya Godbole, Mohammad Aflah Khan, Ryan Yixiang Wang, Xiaoyuan Zhu, James Flemings, Nitya Kashyap, Krishna P. Gummadi, Willie Neiswanger, and Robin Jia. Hubble: a model suite to advance the study of LLM memorization. In The Fourteenth International Conference on Learning Representations, 2026. URL https://openreview.net/forum?id=ZfdnZhOP0k
[65]

Hyeonbin Hwang and Yeachan Park. Intrinsic task symmetry drives generalization in algorithmic tasks, 2026. URL https://arxiv.org/abs/2603.01968
[66]

Luca Soldaini, Rodney Kinney, Akshita Bhagia, Dustin Schwenk, David Atkinson, Russell Authur, Ben Bogin, Khyathi Chandu, Jennifer Dumas, Yanai Elazar, et al. Dolma: An Open Corpus of Three Trillion Tokens for Language Model Pretraining Research. arXiv preprint, 2024. URL https://huggingface.co/datasets/allenai/dolma
[68]

Dirk Groeneveld, Iz Beltagy, Pete Walsh, Akshita Bhagia, Rodney Kinney, Oyvind Tafjord, Ananya Harsh Jha, Hamish Ivison, Ian Magnusson, Yizhong Wang, et al. OLMo: Accelerating the science of language models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 15789–15809, Bangkok, Thaila...
[69]

Alethea Power, Yuri Burda, Harri Edwards, Igor Babuschkin, and Vedant Misra. Grokking: Generalization beyond overfitting on small algorithmic datasets. arXiv preprint arXiv:2201.02177, 2022
[70]

Neel Nanda, Lawrence Chan, Tom Lieberum, Jess Smith, and Jacob Steinhardt. Progress measures for grokking via mechanistic interpretability. In The Eleventh International Conference on Learning Representations, sep 2022. URL https://openreview.net/forum?id=9XFSbDPmdW
[71]

Tianyi Zhou, Deqing Fu, Vatsal Sharan, and Robin Jia. Pre-trained large language models use fourier features to compute addition. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024. URL https://openreview.net/forum?id=i4MutM2TZb
[72]

Sheridan Feucht, Tal Haklay, Usha Bhalla, Daniel Wurgaft, Can Rager, Raphaël Sarfati, Jack Merullo, Thomas McGrath, Owen Lewis, Ekdeep Singh Lubana, Thomas Fel, and Atticus Geiger. Arithmetic in the wild: Llama uses base-10 addition to reason about cyclic concepts, 2026. URL https://arxiv.org/abs/2605.01148
[74]

Atticus Geiger, Zhengxuan Wu, Christopher Potts, Thomas Icard, and Noah Goodman. Finding alignments between interpretable causal variables and distributed neural representations. In Francesco Locatello and Vanessa Didelez, editors, Proceedings of the Third Conference on Causal Learning and Reasoning, volume 236 of Proceedings of Machine Learning Research, p...
[75]

Vitaly Feldman. Does learning require memorization? a short tale about a long tail. In Proceedings of the 52nd annual ACM SIGACT symposium on theory of computing, pages 954–959, 2020
[76]

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of machine learning research, 21(140):1–67, 2020
[77]

Sang Michael Xie, Shibani Santurkar, Tengyu Ma, and Percy S Liang. Data selection for language models via importance resampling. Advances in Neural Information Processing Systems, 36:34201–34227, 2023
[78]

Sang Michael Xie, Hieu Pham, Xuanyi Dong, Nan Du, Hanxiao Liu, Yifeng Lu, Percy S Liang, Quoc V Le, Tengyu Ma, and Adams Wei Yu. Doremi: Optimizing data mixtures speeds up language model pretraining. Advances in Neural Information Processing Systems, 36, 2024
[79]

Sang Michael Xie. Foundation Models from a Data-Distributional View. Stanford University, 2024
[80]

Rahul Ramesh, Jialin Mao, Itay Griniasty, Rubing Yang, Han Kheng Teoh, Mark Transtrum, James P Sethna, and Pratik Chaudhari. A picture of the space of typical learnable tasks. arXiv preprint arXiv:2210.17011, 2022


[1]

Aaditya Singh, Adam Fry, Adam Perelman, Adam Tart, Adi Ganesh, Ahmed El-Kishky, Aidan McLaughlin, Aiden Low, AJ Ostrow, Akhila Ananthram, et al. Openai gpt-5 system card. arXiv preprint arXiv:2601.03267, 2025

[2]

Anthropic. System Card: Claude Mythos Preview, 2026. https://www-cdn.anthropic.com/08ab9158070959f88f296514c21b7facce6f52bc.pdf

[3]

Google DeepMind. Gemini 3 Pro - Model Card, 2026. https://storage.googleapis.com/deepmind-media/Model-Cards/Gemini-3-Pro-Model-Card.pdf

[4]

Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, et al. Deepseek-v3 technical report. arXiv preprint arXiv:2412.19437, 2024

[5]

DeepSeek-AI. DeepSeek-V4-Pro, 2026. https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro/blob/main/DeepSeek_V4.pdf

[6]

Kimi Team, Tongtong Bai, Yifan Bai, Yiping Bao, SH Cai, Yuan Cao, Y Charles, HS Che, Cheng Chen, Guanduo Chen, et al. Kimi k2.5: Visual agentic intelligence. arXiv preprint arXiv:2602.02276, 2026

[7]

Thomas Kwa, Ben West, Joel Becker, Amy Deng, Katharyn Garcia, Max Hasin, Sami Jawhar, Megan Kinniment, Nate Rush, Sydney Von Arx, et al. Measuring AI ability to complete long software tasks. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025

[8]

Elliot Glazer, Ege Erdil, Tamay Besiroglu, Diego Chicharro, Evan Chen, Alex Gunning, Caroline Falkman Olsson, Jean-Stanislas Denain, Anson Ho, Emily de Oliveira Santos, et al. Frontiermath: A benchmark for evaluating advanced mathematical reasoning in ai. arXiv preprint arXiv:2411.04872, 2024

[9]

ARC Prize Foundation. ARC-AGI-3, 2026. https://arcprize.org/arc-agi/3

[10]

Mike A Merrill, Alexander G Shaw, Nicholas Carlini, Boxuan Li, Harsh Raj, Ivan Bercovich, Lin Shi, Jeong Yeon Shin, Thomas Walshe, E Kelly Buchanan, et al. Terminal-bench: Benchmarking agents on hard, realistic tasks in command line interfaces. In The Fourteenth International Conference on Learning Representations, 2026. URL https://openreview.net/for...

[11]

Anthropic. Responsible Scaling Policy, 2026. https://www.anthropic.com/responsible-scaling-policy

[12]

OpenAI. Our Approach to Frontier Risk, 2023. https://openai.com/global-affairs/our-approach-to-frontier-risk/

[13]

Oriane Siméoni, Huy V Vo, Maximilian Seitzer, Federico Baldassarre, Maxime Oquab, Cijo Jose, Vasil Khalidov, Marc Szafraniec, Seungeun Yi, Michaël Ramamonjisoa, et al. Dinov3. arXiv preprint arXiv:2508.10104, 2025

[14]

Shengding Hu, Xin Liu, Xu Han, Xinrong Zhang, Chaoqun He, Weilin Zhao, Yankai Lin, Ning Ding, Zebin Ou, Guoyang Zeng, Zhiyuan Liu, and Maosong Sun. Predicting emergent abilities with infinite resolution evaluation. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=lDbjooxLkD

[15]

Jason Wei, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph, Sebastian Borgeaud, Dani Yogatama, Maarten Bosma, Denny Zhou, Donald Metzler, Ed H. Chi, Tatsunori Hashimoto, Oriol Vinyals, Percy Liang, Jeff Dean, and William Fedus. Emergent abilities of large language models. Transactions on Machine Learning Research, 2022. ISSN 2835-8856. URL https://openr...

[16]

Jason Wei. 137 emergent abilities of large language models, 2022. https://www.jasonwei.net/blog/emergence

[17]

Sanjeev Arora and Anirudh Goyal. A theory for emergence of complex skills in language models. arXiv preprint arXiv:2307.15936, 2023

[18]

Zhengxiao Du, Aohan Zeng, Yuxiao Dong, and Jie Tang. Understanding emergent abilities of language models from the loss perspective. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024. URL https://openreview.net/forum?id=35DAviqMFo

[19]

Jerry Wei, Jason Wei, Yi Tay, Dustin Tran, Albert Webson, Yifeng Lu, Xinyun Chen, Hanxiao Liu, Da Huang, Denny Zhou, et al. Larger language models do in-context learning differently. arXiv preprint arXiv:2303.03846, 2023

[20]

Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361, 2020

[21] [21]

Deep Learning Scaling is Predictable, Empirically

Joel Hestness, Sharan Narang, Newsha Ardalani, Gregory Diamos, Heewoo Jun, Hassan Kianinejad, Md Mostofa Ali Patwary, Yang Yang, and Yanqi Zhou. Deep learning scaling is predictable, empirically.arXiv preprint arXiv:1712.00409, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017

[22] [22]

Rosenfeld, Amir Rosenfeld, Yonatan Belinkov, and Nir Shavit

Jonathan S. Rosenfeld, Amir Rosenfeld, Yonatan Belinkov, and Nir Shavit. A constructive prediction of the generalization error across scales. InInternational Conference on Learning Representations, 2020. URLhttps://openreview.net/forum?id=ryenvpEKDr

2020

[23] [23]

Scaling Laws for Autoregressive Generative Modeling

Tom Henighan, Jared Kaplan, Mor Katz, Mark Chen, Christopher Hesse, Jacob Jackson, Hee- woo Jun, Tom B Brown, Prafulla Dhariwal, Scott Gray, et al. Scaling laws for autoregressive generative modeling.arXiv preprint arXiv:2010.14701, 2020

work page internal anchor Pith review Pith/arXiv arXiv 2010

[24] [24]

Scaling Laws for Transfer

Danny Hernandez, Jared Kaplan, Tom Henighan, and Sam McCandlish. Scaling laws for transfer.arXiv preprint arXiv:2102.01293, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021

[25] [25]

Scaling Language Models: Methods, Analysis & Insights from Training Gopher

Jack W Rae, Sebastian Borgeaud, Trevor Cai, Katie Millican, Jordan Hoffmann, Francis Song, John Aslanides, Sarah Henderson, Roman Ring, Susannah Young, et al. Scaling language models: Methods, analysis & insights from training gopher.arXiv preprint arXiv:2112.11446, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021

[26] [26]

Revisiting neural scaling laws in language and vision.Advances in Neural Information Processing Systems, 35:22300– 22312, 2022

Ibrahim M Alabdulmohsin, Behnam Neyshabur, and Xiaohua Zhai. Revisiting neural scaling laws in language and vision.Advances in Neural Information Processing Systems, 35:22300– 22312, 2022

2022

[27] [27]

The Llama 3 Herd of Models

Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The llama 3 herd of models.arXiv preprint arXiv:2407.21783, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[28] [28]

Training compute-optimal large language models

Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, et al. Training compute-optimal large language models. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho, editors,Advances in Neural Information Processing Systems,

[29] [29]

URLhttps://openreview.net/forum?id=iBBcRUlOAPR

[30] [30]

Reconciling kaplan and chinchilla scaling laws.Transactions on Machine Learning Research, 2024

Tim Pearce and Jinyeop Song. Reconciling kaplan and chinchilla scaling laws.Transactions on Machine Learning Research, 2024. ISSN 2835-8856. URL https://openreview.net/ forum?id=NLoaLyuUUF

2024

[31] [31]

A dynamical model of neural scaling laws

Blake Bordelon, Alexander Atanasov, and Cengiz Pehlevan. A dynamical model of neural scaling laws. InForty-first International Conference on Machine Learning, 2024. URL https://openreview.net/forum?id=nbOY1OmtRc

2024

[32] [32]

How feature learning can improve neural scaling laws

Blake Bordelon, Alexander Atanasov, and Cengiz Pehlevan. How feature learning can improve neural scaling laws. InThe Thirteenth International Conference on Learning Representations,

[33] [33]

URLhttps://openreview.net/forum?id=dEypApI1MZ. 12

[34] [34]

Explaining neural scaling laws.Proceedings of the National Academy of Sciences, 121(27):e2311878121, 2024

Yasaman Bahri, Ethan Dyer, Jared Kaplan, Jaehoon Lee, and Utkarsh Sharma. Explaining neural scaling laws.Proceedings of the National Academy of Sciences, 121(27):e2311878121, 2024

2024

[35] [35]

Kakade, Peter L

Licong Lin, Jingfeng Wu, Sham M. Kakade, Peter L. Bartlett, and Jason Lee. Scaling laws in linear regression: Compute, parameters, and data. In A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. Tomczak, and C. Zhang, editors,Advances in Neural Information Processing Systems, volume 37, pages 60556–60606. Curran Associates, Inc., 2024. doi: 10. 5...

2024

[36] [36]

The quantization model of neural scaling

Eric Michaud, Ziming Liu, Uzay Girit, and Max Tegmark. The quantization model of neural scaling. In A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine, editors, Advances in Neural Information Processing Systems, volume 36, pages 28699–28722. Curran Associates, Inc., 2023. URL https://proceedings.neurips.cc/paper_files/paper/ 2023/file/5b6...

2023

[37] [37]

A solvable model of neural scaling laws.arXiv preprint arXiv:2210.16859, 2022

Alexander Maloney, Daniel A Roberts, and James Sully. A solvable model of neural scaling laws.arXiv preprint arXiv:2210.16859, 2022

work page arXiv 2022

[38] [38]

Dick, and Hidenori Tanaka

Ekdeep Singh Lubana, Kyogo Kawaguchi, Robert P. Dick, and Hidenori Tanaka. A percolation model of emergence: Analyzing transformers trained on a formal language. InThe Thirteenth International Conference on Learning Representations, 2025. URL https://openreview. net/forum?id=0pLCDJVVRD

2025

[39] [39]

Learning curves theory for hierar- chically compositional data with power-law distributed features

Francesco Cagnetta, Hyunmo Kang, and Matthieu Wyart. Learning curves theory for hierar- chically compositional data with power-law distributed features. InForty-second International Conference on Machine Learning, 2025. URL https://openreview.net/forum?id= Lw0kC75dY0

2025

[40] [40]

Scaling laws and representation learning in simple hierarchical languages: Transformers vs

Francesco Cagnetta, Alessandro Favero, Antonio Sclocchi, and Matthieu Wyart. Scaling laws and representation learning in simple hierarchical languages: Transformers vs. convolutional architectures.arXiv preprint arXiv:2505.07070, 2025

work page arXiv 2025

[41] [41]

Deriving neural scaling laws from the statistics of natural language.arXiv preprint arXiv:2602.07488, 2026

Francesco Cagnetta, Allan Raventós, Surya Ganguli, and Matthieu Wyart. Deriving neural scaling laws from the statistics of natural language.arXiv preprint arXiv:2602.07488, 2026

work page arXiv 2026

[42] [42]

Pareto frontiers in deep feature learning: Data, compute, width, and luck.Advances in Neural Information Processing Systems, 36:48021–48034, 2023

Benjamin Edelman, Surbhi Goel, Sham Kakade, Eran Malach, and Cyril Zhang. Pareto frontiers in deep feature learning: Data, compute, width, and luck.Advances in Neural Information Processing Systems, 36:48021–48034, 2023

2023

[43] [43]

Tulu 3: Pushing frontiers in open language model post-training

Nathan Lambert, Jacob Morrison, Valentina Pyatkin, Shengyi Huang, Hamish Ivison, Faeze Brahman, Lester James Validad Miranda, Alisa Liu, Nouha Dziri, Xinxi Lyu, et al. Tulu 3: Pushing frontiers in open language model post-training. InSecond Conference on Language Modeling, 2025. URLhttps://openreview.net/forum?id=i1uGbfHHpH

2025

[44] [44]

Reinforcement Learning via Self-Distillation

Jonas Hübotter, Frederike Lübeck, Lejs Behric, Anton Baumann, Marco Bagatella, Daniel Marta, Ido Hakimi, Idan Shenfeld, Thomas Kleine Buening, Carlos Guestrin, et al. Reinforce- ment learning via self-distillation.arXiv preprint arXiv:2601.20802, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[45] [45]

OpenAI Claims DeepSeek Distilled US Models to Gain an Edge, 2026

Bloomberg. OpenAI Claims DeepSeek Distilled US Models to Gain an Edge, 2026. https://www.bloomberg.com/news/articles/2026-02-12/ openai-accuses-deepseek-of-distilling-us-models-to-gain-an-edge?

2026

[46] [46]

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models.arXiv preprint arXiv:2402.03300, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[47] [47]

Ren, Junxiao Song, Zhihong Shao, Wanjia Zhao, Haocheng Wang, Bo Liu, Liyue Zhang, Xuan Lu, Qiushi Du, , et al

Huajian Xin, Z.Z. Ren, Junxiao Song, Zhihong Shao, Wanjia Zhao, Haocheng Wang, Bo Liu, Liyue Zhang, Xuan Lu, Qiushi Du, , et al. Deepseek-prover-v1.5: Harnessing proof assistant feedback for reinforcement learning and monte-carlo tree search. InThe Thirteenth International Conference on Learning Representations, 2025. URL https: //openreview.net/forum?id=...

2025

[48] Rishabh Agarwal, Nino Vieillard, Yongchao Zhou, Piotr Stanczyk, Sabela Ramos Garea, Matthieu Geist, and Olivier Bachem. On-policy distillation of language models: Learning from self-generated mistakes. In The Twelfth International Conference on Learning Representations. URL https://openreview.net/forum?id=3zKtaqxLhW

[50] Siyan Zhao, Zhihui Xie, Mengchen Liu, Jing Huang, Guan Pang, Feiyu Chen, and Aditya Grover. Self-distilled reasoner: On-policy self-distillation for large language models. arXiv preprint arXiv:2601.18734, 2026

[51] Yunhao Tang, Sid Wang, Lovish Madaan, and Remi Munos. Beyond verifiable rewards: Scaling reinforcement learning in language models to unverifiable data. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2026. URL https://openreview.net/forum?id=pc6M9h3T9m

[52] Kimi Team, Angang Du, Bofei Gao, Bowei Xing, Changjiu Jiang, Cheng Chen, Cheng Li, Chenjun Xiao, Chenzhuang Du, Chonghua Liao, et al. Kimi k1.5: Scaling reinforcement learning with llms. arXiv preprint arXiv:2501.12599, 2025

[53] Cody Blakeney, Jessica Zosa Forde, Jonathan Frankle, Ziliang Zong, and Matthew L Leavitt. Reduce, reuse, recycle: Improving training efficiency with distillation. arXiv preprint arXiv:2211.00683, 2022

[54] Shikai Qiu, Lechao Xiao, Andrew Gordon Wilson, Jeffrey Pennington, and Atish Agarwala. Scaling collapse reveals universal dynamics in compute-optimally trained neural networks. In Forty-second International Conference on Machine Learning, 2025. URL https://openreview.net/forum?id=Fvq9ogLnLN

[55] Elliot Paquette, Courtney Paquette, Lechao Xiao, and Jeffrey Pennington. 4+3 phases of compute-optimal neural scaling laws. Advances in Neural Information Processing Systems, 37:16459–16537, 2024

[56] Team OLMo, Pete Walsh, Luca Soldaini, Dirk Groeneveld, Kyle Lo, Shane Arora, Akshita Bhagia, Yuling Gu, Shengyi Huang, Matt Jordan, et al. 2 OLMo 2 Furious, 2024. URL https://arxiv.org/abs/2501.00656

[57] Yedi Zhang, Andrew M Saxe, and Peter E. Latham. Saddle-to-saddle dynamics explains a simplicity bias across neural network architectures. In The Fourteenth International Conference on Learning Representations, 2026. URL https://openreview.net/forum?id=Vit5M0G5Gb

[58] Emmanuel Abbe, Enric Boix Adsera, and Theodor Misiakiewicz. SGD learning on neural networks: leap complexity and saddle-to-saddle dynamics. In The Thirty Sixth Annual Conference on Learning Theory, pages 2552–2623. PMLR, 2023

[59] Arthur Jacot, François Ged, Berfin Şimşek, Clément Hongler, and Franck Gabriel. Saddle-to-saddle dynamics in deep linear networks: Small initialization training, symmetry, and sparsity. arXiv preprint arXiv:2106.15933, 2021

[60] Daniel Kunin, Giovanni Luca Marchetti, Feng Chen, Dhruva Karkada, James B Simon, Michael R DeWeese, Surya Ganguli, and Nina Miolane. Alternating gradient flows: A theory of feature learning in two-layer neural networks. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2026. URL https://openreview.net/forum?id=t7LKc0MMW6

[61] Matthew Jagielski, Om Thakkar, Florian Tramer, Daphne Ippolito, Katherine Lee, Nicholas Carlini, Eric Wallace, Shuang Song, Abhradeep Guha Thakurta, Nicolas Papernot, and Chiyuan Zhang. Measuring forgetting of memorized training examples. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=7bJizxLKrR

[62] Nicholas Carlini, Chang Liu, Úlfar Erlingsson, Jernej Kos, and Dawn Song. The secret sharer: Evaluating and testing unintended memorization in neural networks. In 28th USENIX Security Symposium (USENIX Security 19), pages 267–284, Santa Clara, CA, August 2019. USENIX Association. ISBN 978-1-939133-06-9. URL https://www.usenix.org/conference/usenixsecur...

[63] Jing Huang, Diyi Yang, and Christopher Potts. Demystifying verbatim memorization in large language models. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 10711–10732, 2024

[64] Johnny Wei, Ameya Godbole, Mohammad Aflah Khan, Ryan Yixiang Wang, Xiaoyuan Zhu, James Flemings, Nitya Kashyap, Krishna P. Gummadi, Willie Neiswanger, and Robin Jia. Hubble: a model suite to advance the study of LLM memorization. In The Fourteenth International Conference on Learning Representations, 2026. URL https://openreview.net/forum?id=ZfdnZhOP0k

[65] Hyeonbin Hwang and Yeachan Park. Intrinsic task symmetry drives generalization in algorithmic tasks, 2026. URL https://arxiv.org/abs/2603.01968

[66] Luca Soldaini, Rodney Kinney, Akshita Bhagia, Dustin Schwenk, David Atkinson, Russell Authur, Ben Bogin, Khyathi Chandu, Jennifer Dumas, Yanai Elazar, et al. Dolma: An Open Corpus of Three Trillion Tokens for Language Model Pretraining Research. arXiv preprint. URL https://huggingface.co/datasets/allenai/dolma

[68] Dirk Groeneveld, Iz Beltagy, Pete Walsh, Akshita Bhagia, Rodney Kinney, Oyvind Tafjord, Ananya Harsh Jha, Hamish Ivison, Ian Magnusson, Yizhong Wang, et al. OLMo: Accelerating the science of language models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 15789–15809, Bangkok, Thaila...

[69] Alethea Power, Yuri Burda, Harri Edwards, Igor Babuschkin, and Vedant Misra. Grokking: Generalization beyond overfitting on small algorithmic datasets. arXiv preprint arXiv:2201.02177, 2022

[70] Neel Nanda, Lawrence Chan, Tom Lieberum, Jess Smith, and Jacob Steinhardt. Progress measures for grokking via mechanistic interpretability. In The Eleventh International Conference on Learning Representations, sep 2022. URL https://openreview.net/forum?id=9XFSbDPmdW

[71] Tianyi Zhou, Deqing Fu, Vatsal Sharan, and Robin Jia. Pre-trained large language models use fourier features to compute addition. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024. URL https://openreview.net/forum?id=i4MutM2TZb

[72] Sheridan Feucht, Tal Haklay, Usha Bhalla, Daniel Wurgaft, Can Rager, Raphaël Sarfati, Jack Merullo, Thomas McGrath, Owen Lewis, Ekdeep Singh Lubana, Thomas Fel, and Atticus Geiger. Arithmetic in the wild: Llama uses base-10 addition to reason about cyclic concepts. URL https://arxiv.org/abs/2605.01148

[74] Atticus Geiger, Zhengxuan Wu, Christopher Potts, Thomas Icard, and Noah Goodman. Finding alignments between interpretable causal variables and distributed neural representations. In Francesco Locatello and Vanessa Didelez, editors, Proceedings of the Third Conference on Causal Learning and Reasoning, volume 236 of Proceedings of Machine Learning Research, p...

[75] Vitaly Feldman. Does learning require memorization? a short tale about a long tail. In Proceedings of the 52nd annual ACM SIGACT symposium on theory of computing, pages 954–959, 2020

[76] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of machine learning research, 21(140):1–67, 2020

[77] Sang Michael Xie, Shibani Santurkar, Tengyu Ma, and Percy S Liang. Data selection for language models via importance resampling. Advances in Neural Information Processing Systems, 36:34201–34227, 2023

[78] Sang Michael Xie, Hieu Pham, Xuanyi Dong, Nan Du, Hanxiao Liu, Yifeng Lu, Percy S Liang, Quoc V Le, Tengyu Ma, and Adams Wei Yu. Doremi: Optimizing data mixtures speeds up language model pretraining. Advances in Neural Information Processing Systems, 36, 2024

[79] Sang Michael Xie. Foundation Models from a Data-Distributional View. Stanford University, 2024

[80] Rahul Ramesh, Jialin Mao, Itay Griniasty, Rubing Yang, Han Kheng Teoh, Mark Transtrum, James P Sethna, and Pratik Chaudhari. A picture of the space of typical learnable tasks. arXiv preprint arXiv:2210.17011, 2022