arxiv: 2605.10237 · v1 · submitted 2026-05-11 · 💻 cs.LG

Recognition: 2 theorem links

· Lean Theorem

The Benefits of Temporal Correlations: SGD Learns k-Juntas from Random Walks Efficiently

Dan Mikulincer, Elchanan Mossel, Elisabetta Cornacchia

Pith reviewed 2026-05-12 05:22 UTC · model grok-4.3

classification 💻 cs.LG

keywords k-juntastemporal correlationsstochastic gradient descentrandom walks on hypercubesparse learningtwo-layer ReLU networksBoolean function learning

0 comments

The pith

Lazy random walk samples let SGD learn Boolean k-juntas with sample complexity linear in the ambient dimension.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that Boolean k-juntas, sparse functions depending on only k relevant input bits, resist efficient learning by gradient methods when samples are drawn independently. When instead the samples follow a lazy random walk on the hypercube, consecutive examples become temporally correlated. A two-layer ReLU network trained by stylized SGD can exploit those correlations through a loss that penalizes mismatches in the change from one sample to the next. For any fixed k the required number of samples grows only linearly with the total dimension d. Standard large-batch gradient descent using ordinary pointwise losses gains no comparable benefit from the same correlations.

Core claim

A two-layer ReLU network trained using stylized-SGD with a temporal-difference loss, which compares target and predicted increments across consecutive samples generated by a lazy random walk on the hypercube, learns every fixed k-junta with sample complexity essentially linear in the ambient dimension d. By contrast, large-batch gradient methods using standard convex pointwise losses do not obtain the same advantage from temporal correlations.

What carries the argument

The temporal-difference loss that compares target and predicted increments across consecutive samples from the lazy random walk, allowing the optimizer to use the built-in dependencies between successive points.

Load-bearing premise

The input samples are generated exactly by a lazy random walk on the hypercube and training uses stylized-SGD with the specific temporal-difference loss that compares increments across consecutive samples.

What would settle it

Replace the lazy random walk with independent uniform samples while keeping the same network, loss, and optimizer; the sample complexity should jump from linear in d to super-linear or exponential in d for fixed k.

Figures

Figures reproduced from arXiv: 2605.10237 by Dan Mikulincer, Elchanan Mossel, Elisabetta Cornacchia.

**Figure 1.** Figure 1: 5-parity with d = 50. We train a 4-layer MLP with batch size 1, learning rate 0.005, TD parameter α = 0.9, and random-walk flip probability p = 0.9. Left: test accuracy for random-walk data with TD loss, compared with i.i.d. data using either TD loss or square loss. Right: selected Fourier-Walsh coefficients of the learned predictor for random-walk data with TD loss. 9 [PITH_FULL_IMAGE:figures/full_fig_p0… view at source ↗

**Figure 2.** Figure 2: f(x) = 1 2 x1x2...x5(1 + x6 + x7 − x6x7) with d = 50. We train a 4-layer MLP with batch size 1, learning rate 0.005, TD parameter α = 0.9, and random-walk flip probability p = 0.9. Left: test accuracy for random-walk data with TD loss, compared with i.i.d. data using square loss. Right: selected Fourier-Walsh coefficients of the learned predictor for random-walk data with TD loss. 0 50000 100000 150000 200… view at source ↗

**Figure 3.** Figure 3: 5-parity with d = 50. We train a 4-layer MLP with batch size 1, learning rate 0.005, and random-walk flip probability p = 0.9. Left: test accuracy for random-walk data using TD loss versus square loss. Right: selected Fourier-Walsh coefficients of the learned predictor for random-walk data with square loss. 7 Conclusion In this work, we studied how temporal correlations can fundamentally change the complex… view at source ↗

read the original abstract

We study how temporal correlations in the data can make certain sparse learning problems efficiently learnable by gradient-based methods. Our focus is on Boolean k-juntas, a canonical sparse learning problem known to pose barriers for gradient-based methods under independent uniform samples. We show that this picture changes when the samples are generated by a lazy random walk on the hypercube. In this setting, the temporal dependencies can be exploited by a two-layer ReLU network trained using stylized-SGD with a temporal-difference loss, which compares target and predicted increments across consecutive samples. For every fixed k, the resulting sample complexity is essentially linear in the ambient dimension d. By contrast, we show that for large-batch gradient methods using standard convex pointwise losses, temporal correlations do not provide the same advantage.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Temporal correlations from lazy random walks help SGD learn fixed-k juntas with linear-in-d samples, but only when using a custom temporal-difference loss instead of standard pointwise losses.

read the letter

The main point is that this paper identifies a concrete setting where temporal structure in the data lets gradient methods bypass the usual hardness for learning k-juntas. With samples from a lazy random walk on the hypercube and a two-layer ReLU network trained by stylized SGD on a temporal-difference loss that compares increments, they get sample complexity linear in d for any fixed k. The same data distribution gives no such gain under large-batch gradient descent with ordinary convex losses that treat each point independently.

Referee Report

2 major / 2 minor

Summary. The manuscript claims that Boolean k-juntas, which are hard for gradient methods under i.i.d. uniform samples, become efficiently learnable when data is generated by a lazy random walk on the hypercube. Specifically, a two-layer ReLU network trained by stylized-SGD on a temporal-difference loss that compares predicted and observed increments (y_{t+1} - y_t) between consecutive walk steps achieves sample complexity essentially linear in ambient dimension d for any fixed k. By contrast, the same data distribution yields no such advantage for large-batch gradient descent on standard convex pointwise losses.

Significance. If the central claims hold, the work isolates a concrete mechanism by which temporal correlations can be exploited to bypass known barriers for sparse learning, while explicitly demonstrating that standard pointwise losses do not capture the same benefit. This provides a positive example of loss design tailored to data structure and could inform algorithm development for correlated or sequential data settings.

major comments (2)

[Abstract and standard-losses section] Abstract and § on standard losses: the claim that temporal correlations provide no advantage for large-batch GD on convex pointwise losses is load-bearing for the paper's contrast; the manuscript must specify the exact batch sizes, the precise convex losses considered, and the lower-bound argument showing that the single-coordinate flip structure cannot be exploited without the TD increment comparison.
[Main theorem] Main theorem on linear sample complexity: the result is stated for stylized-SGD with the specific TD loss; the proof must make explicit any assumptions on the walk's laziness parameter, the network width, and the step-size schedule, because these directly determine whether the linear-in-d scaling survives when k is fixed but the constants are tracked.

minor comments (2)

[Introduction] The definition of the temporal-difference loss should be stated with equation number in the introduction so that the distinction from pointwise losses is immediate.
[Preliminaries] Notation for the lazy random walk (transition probabilities, laziness parameter) should be fixed early and used consistently in all statements.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive report and for highlighting points that will improve the clarity and rigor of the manuscript. We address each major comment below and have revised the paper to incorporate the requested details.

read point-by-point responses

Referee: [Abstract and standard-losses section] Abstract and § on standard losses: the claim that temporal correlations provide no advantage for large-batch GD on convex pointwise losses is load-bearing for the paper's contrast; the manuscript must specify the exact batch sizes, the precise convex losses considered, and the lower-bound argument showing that the single-coordinate flip structure cannot be exploited without the TD increment comparison.

Authors: We agree that the contrast with standard losses is central and that the current presentation leaves the precise setting underspecified. In the revised manuscript we explicitly state: (i) batch sizes of size Θ(d log d) for the large-batch regime, (ii) the convex losses considered (squared loss and logistic loss), and (iii) a self-contained lower-bound argument showing that any gradient method on these pointwise losses cannot exploit the single-coordinate flip structure of the lazy walk without the temporal-difference comparison. The argument proceeds by exhibiting a distribution over k-juntas for which the expected gradient on any fixed coordinate remains O(1/d) even after polynomially many walk steps, precluding linear-in-d sample complexity. These additions appear in the updated abstract and in a new subsection of the standard-losses section. revision: yes
Referee: [Main theorem] Main theorem on linear sample complexity: the result is stated for stylized-SGD with the specific TD loss; the proof must make explicit any assumptions on the walk's laziness parameter, the network width, and the step-size schedule, because these directly determine whether the linear-in-d scaling survives when k is fixed but the constants are tracked.

Authors: We thank the referee for this observation. The revised proof section now states the assumptions explicitly: the laziness parameter is fixed at 1−1/d, the two-layer ReLU network has width O(d), and the step-size schedule is η_t = Θ(1/(t + d)). Under these choices the sample complexity remains O(d · poly(k, log d)) with constants depending only on k (not on d). We also include a short remark showing that the linear-in-d scaling is preserved for any constant laziness parameter bounded away from 1 and for any width polynomial in d, provided the step-size is appropriately rescaled. These clarifications are added to the statement of the main theorem and to the proof of the convergence lemma. revision: yes

Circularity Check

0 steps flagged

No circularity; explicit TD loss and random-walk dynamics yield linear complexity via direct analysis

full rationale

The paper defines a concrete training procedure (two-layer ReLU net + stylized-SGD on temporal-difference loss comparing increments y_{t+1}-y_t) and derives the linear-in-d sample complexity for fixed-k juntas directly from the lazy random-walk transition structure on the hypercube. The same data distribution is shown not to yield the advantage under standard pointwise convex losses, confirming that the result is not smuggled in by definition or self-citation but follows from the explicit coupling of loss and data process. No fitted parameters are renamed as predictions, no uniqueness theorems are imported from prior self-work, and the derivation chain remains self-contained against the stated assumptions.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the domain assumption that data follows a lazy random walk and on the choice of the temporal-difference loss; no free parameters or invented entities are mentioned in the abstract.

axioms (1)

domain assumption Samples are generated by a lazy random walk on the hypercube
This is the data-generating process stated in the abstract as the setting that creates exploitable temporal correlations.

pith-pipeline@v0.9.0 · 5437 in / 1209 out tokens · 33800 ms · 2026-05-12T05:22:35.164359+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/Breath1024.lean neutral8 / flipAt512 echoes
temporal-difference loss ... compares target and predicted increments across consecutive samples ... second moment distinguishes relevant and irrelevant coordinates
IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear
For every fixed k, the resulting sample complexity is essentially linear in the ambient dimension d

Reference graph

Works this paper leans on

164 extracted references · 164 canonical work pages · 4 internal anchors

[1]

IEEE Transactions on Information Theory , year =

Bresler, Guy and Gamarnik, David and Shah, Devavrat , title =. IEEE Transactions on Information Theory , year =

work page
[2]

arXiv , year =

Gaitonde, Jason and Moitra, Ankur and Mossel, Elchanan , title =. arXiv , year =

work page
[3]

and Suzuki, Taiji and Wang, Zhichao and Wu, Denny and Yang, Greg , keywords =

Ba, Jimmy and Erdogdu, Murat A. and Suzuki, Taiji and Wang, Zhichao and Wu, Denny and Yang, Greg , keywords =. High-dimensional Asymptotics of Feature Learning: How One Gradient Step Improves the Representation , publisher =. 2022 , copyright =. doi:10.48550/ARXIV.2205.01445 , url =

work page doi:10.48550/arxiv.2205.01445 2022
[4]

What can linearized neural networks actually say about generalization? , publisher =

Ortiz-Jiménez, Guillermo and Moosavi-Dezfooli, Seyed-Mohsen and Frossard, Pascal , keywords =. What can linearized neural networks actually say about generalization? , publisher =. 2021 , copyright =. doi:10.48550/ARXIV.2106.06770 , url =

work page doi:10.48550/arxiv.2106.06770 2021
[5]

An Intriguing Failing of Convolutional Neural Networks and the CoordConv Solution , publisher =

Liu, Rosanne and Lehman, Joel and Molino, Piero and Such, Felipe Petroski and Frank, Eric and Sergeev, Alex and Yosinski, Jason , keywords =. An Intriguing Failing of Convolutional Neural Networks and the CoordConv Solution , publisher =. 2018 , copyright =. doi:10.48550/ARXIV.1807.03247 , url =

work page doi:10.48550/arxiv.1807.03247 2018
[6]

International Conference on Learning Representations , year=

Why Are Convolutional Nets More Sample-Efficient than Fully-Connected Nets? , author=. International Conference on Learning Representations , year=

work page
[7]

arXiv preprint arXiv:2010.01369 , year=

Computational separation between convolutional and fully-connected networks , author=. arXiv preprint arXiv:2010.01369 , year=

work page arXiv 2010
[8]

Demystifying the Neural Tangent Kernel from a Practical Perspective: Can it be trusted for Neural Architecture Search without training? , publisher =

Mok, Jisoo and Na, Byunggook and Kim, Ji-Hoon and Han, Dongyoon and Yoon, Sungroh , keywords =. Demystifying the Neural Tangent Kernel from a Practical Perspective: Can it be trusted for Neural Architecture Search without training? , publisher =. 2022 , copyright =. doi:10.48550/ARXIV.2203.14577 , url =

work page doi:10.48550/arxiv.2203.14577 2022
[9]

arXiv preprint arXiv:2011.06006 , year=

Park, Daniel S. and Lee, Jaehoon and Peng, Daiyi and Cao, Yuan and Sohl-Dickstein, Jascha , keywords =. Towards NNGP-guided Neural Architecture Search , publisher =. 2020 , copyright =. doi:10.48550/ARXIV.2011.06006 , url =

work page doi:10.48550/arxiv.2011.06006 2020
[10]

S. J. Montgomery-Smith , journal =. The Distribution of Rademacher Sums , volume =

work page
[11]

Analysis of Boolean Functions , DOI=

O'Donnell, Ryan , year=. Analysis of Boolean Functions , DOI=

work page
[12]

On the universality of deep learning , volume =

Abbe, Emmanuel and Sandon, Colin , booktitle =. On the universality of deep learning , volume =

work page
[13]

ArXiv , year=

Moments and Absolute Moments of the Normal Distribution , author=. ArXiv , year=

work page
[14]

Advances in Neural Information Processing Systems , volume=

The staircase property: How hierarchical structure can guide deep learning , author=. Advances in Neural Information Processing Systems , volume=

work page
[15]

Is Deeper Better only when Shallow is Good? , volume =

Malach, Eran and Shalev-Shwartz, Shai , booktitle =. Is Deeper Better only when Shallow is Good? , volume =

work page
[16]

Backward feature correction:

Allen-Zhu, Zeyuan and Li, Yuanzhi , note=. Backward feature correction:

work page
[17]

Weakly learning

Blum, Avrim and Furst, Merrick and Jackson, Jeffrey and Kearns, Michael and Mansour, Yishay and Rudich, Steven , booktitle=. Weakly learning

work page
[18]

On the Power of Differentiable Learning versus

Abbe, Emmanuel and Kamath, Pritish and Malach, Eran and Sandon, Colin and Srebro, Nathan , booktitle=. On the Power of Differentiable Learning versus

work page
[19]

An intriguing failing of convolutional neural networks and the

Liu, Rosanne and Lehman, Joel and Molino, Piero and Such, Felipe Petroski and Frank, Eric and Sergeev, Alex and Yosinski, Jason , biburl =. An intriguing failing of convolutional neural networks and the. NeurIPS , ee =

work page
[20]

Proceedings of the 38th International Conference on Machine Learning , pages =

Quantifying the Benefit of Using Differentiable Learning over Tangent Kernels , author =. Proceedings of the 38th International Conference on Machine Learning , pages =. 2021 , editor =

work page 2021
[21]

International Conference on Learning Representations (ICLR) , year=

Computational Separation Between Convolutional and Fully-Connected Networks , author=. International Conference on Learning Representations (ICLR) , year=

work page
[22]

Poly-time universality and limitations of deep learning , author=

work page
[23]

ArXiv , year=

Pointer Value Retrieval: A new benchmark for understanding the limits of neural network generalization , author=. ArXiv , year=

work page
[24]

Enric Boix-Adsera , note=

work page
[25]

Proceedings of the Thirty-First Annual ACM-SIAM Symposium on Discrete Algorithms , pages =

Das, Abhimanyu and Gollapudi, Sreenivas and Kumar, Ravi and Panigrahy, Rina , title =. Proceedings of the Thirty-First Annual ACM-SIAM Symposium on Discrete Algorithms , pages =. 2020 , publisher =

work page 2020
[26]

A cautionary tale on fitting decision trees to data from additive models: generalization lower bounds , publisher =

Tan, Yan Shuo and Agarwal, Abhineet and Yu, Bin , keywords =. A cautionary tale on fitting decision trees to data from additive models: generalization lower bounds , publisher =. 2021 , copyright =. doi:10.48550/ARXIV.2110.09626 , url =

work page doi:10.48550/arxiv.2110.09626 2021
[27]

On the Learnability of Deep Random Networks , publisher =

Das, Abhimanyu and Gollapudi, Sreenivas and Kumar, Ravi and Panigrahy, Rina , keywords =. On the Learnability of Deep Random Networks , publisher =. 2019 , copyright =. doi:10.48550/ARXIV.1904.03866 , url =

work page doi:10.48550/arxiv.1904.03866 2019
[28]

International Conference on Machine Learning , pages=

An initial alignment between neural network and target is needed for gradient descent to learn , author=. International Conference on Machine Learning , pages=. 2022 , organization=

work page 2022
[29]

Deep Equals Shallow for ReLU Networks in Kernel Regimes , publisher =

Bietti, Alberto and Bach, Francis , keywords =. Deep Equals Shallow for ReLU Networks in Kernel Regimes , publisher =. 2020 , copyright =. doi:10.48550/ARXIV.2009.14397 , url =

work page doi:10.48550/arxiv.2009.14397 2020
[30]

Learning with convolution and pooling operations in kernel methods , publisher =

Misiakiewicz, Theodor and Mei, Song , keywords =. Learning with convolution and pooling operations in kernel methods , publisher =. 2021 , copyright =. doi:10.48550/ARXIV.2111.08308 , url =

work page doi:10.48550/arxiv.2111.08308 2021
[31]

Approximation and Learning with Deep Convolutional Models: a Kernel Perspective , publisher =

Bietti, Alberto , keywords =. Approximation and Learning with Deep Convolutional Models: a Kernel Perspective , publisher =. 2021 , copyright =. doi:10.48550/ARXIV.2102.10032 , url =

work page doi:10.48550/arxiv.2102.10032 2021
[32]

Distribution-Specific Hardness of Learning Neural Networks , publisher =

Shamir, Ohad , keywords =. Distribution-Specific Hardness of Learning Neural Networks , publisher =. 2016 , copyright =. doi:10.48550/ARXIV.1609.01037 , url =

work page doi:10.48550/arxiv.1609.01037 2016
[33]

When Hardness of Approximation Meets Hardness of Learning , publisher =

Malach, Eran and Shalev-Shwartz, Shai , keywords =. When Hardness of Approximation Meets Hardness of Learning , publisher =. 2020 , copyright =. doi:10.48550/ARXIV.2008.08059 , url =

work page doi:10.48550/arxiv.2008.08059 2020
[34]

arXiv preprint arXiv:1711.00165 , year=

Deep neural networks as gaussian processes , author=. arXiv preprint arXiv:1711.00165 , year=

work page arXiv
[35]

International Conference on Machine Learning , pages=

Quantifying the benefit of using differentiable learning over tangent kernels , author=. International Conference on Machine Learning , pages=. 2021 , organization=

work page 2021
[36]

arXiv preprint arXiv:2206.10011 , year=

When Does Re-initialization Work? , author=. arXiv preprint arXiv:2206.10011 , year=

work page arXiv
[37]

Proceedings of the IEEE conference on computer vision and pattern recognition , pages=

Clevr: A diagnostic dataset for compositional language and elementary visual reasoning , author=. Proceedings of the IEEE conference on computer vision and pattern recognition , pages=

work page
[38]

and Pennington, Jeffrey and Adlam, Ben and Xiao, Lechao and Novak, Roman and Sohl-Dickstein, Jascha , keywords =

Lee, Jaehoon and Schoenholz, Samuel S. and Pennington, Jeffrey and Adlam, Ben and Xiao, Lechao and Novak, Roman and Sohl-Dickstein, Jascha , keywords =. Finite Versus Infinite Neural Networks: an Empirical Study , publisher =. 2020 , copyright =. doi:10.48550/ARXIV.2007.15801 , url =

work page doi:10.48550/arxiv.2007.15801 2020
[39]

arXiv preprint arXiv:2010.08515 , year=

Why are convolutional nets more sample-efficient than fully-connected nets? , author=. arXiv preprint arXiv:2010.08515 , year=

work page arXiv 2010
[40]

The Journal of Machine Learning Research , volume=

Neural architecture search: A survey , author=. The Journal of Machine Learning Research , volume=. 2019 , publisher=

work page 2019
[41]

Advances in neural information processing systems , volume=

Neural tangent kernel: Convergence and generalization in neural networks , author=. Advances in neural information processing systems , volume=

work page
[42]

International Conference on Machine Learning , pages=

KNAS: green neural architecture search , author=. International Conference on Machine Learning , pages=. 2021 , organization=

work page 2021
[43]

arXiv preprint arXiv:2102.11535 , year=

Neural architecture search on imagenet in four gpu hours: A theoretically inspired perspective , author=. arXiv preprint arXiv:2102.11535 , year=

work page arXiv
[44]

arXiv preprint arXiv:2202.06438 , year=

Learning from Randomly Initialized Neural Network Features , author=. arXiv preprint arXiv:2202.06438 , year=

work page arXiv
[45]

arXiv preprint arXiv:2011.06006 , year=

Towards nngp-guided neural architecture search , author=. arXiv preprint arXiv:2011.06006 , year=

work page arXiv 2011
[46]

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

Demystifying the Neural Tangent Kernel from a Practical Perspective: Can it be trusted for Neural Architecture Search without training? , author=. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

work page
[47]

Advances in Neural Information Processing Systems , volume=

What can linearized neural networks actually say about generalization? , author=. Advances in Neural Information Processing Systems , volume=

work page
[48]

Conference on Learning Theory , pages=

Learning with invariances in random features and kernel models , author=. Conference on Learning Theory , pages=. 2021 , organization=

work page 2021
[49]

Advances in Neural Information Processing Systems , volume=

On the sample complexity of learning under geometric stability , author=. Advances in Neural Information Processing Systems , volume=

work page
[50]

International Conference on Machine Learning , pages=

Provably strict generalisation benefit for equivariant models , author=. International Conference on Machine Learning , pages=. 2021 , organization=

work page 2021
[51]

arXiv preprint arXiv:1810.02054 , year=

Gradient descent provably optimizes over-parameterized neural networks , author=. arXiv preprint arXiv:1810.02054 , year=

work page arXiv
[52]

Advances in Neural Information Processing Systems , volume=

On the non-universality of deep learning: quantifying the cost of symmetry , author=. Advances in Neural Information Processing Systems , volume=

work page
[53]

Accepted at NeurIPS 2022 , year=

Learning to Reason with Neural Networks: Generalization, Unseen Data and Boolean Measures , author=. Accepted at NeurIPS 2022 , year=

work page 2022
[54]

Journal of the ACM (JACM) , volume=

Noise-tolerant learning, the parity problem, and the statistical query model , author=. Journal of the ACM (JACM) , volume=. 2003 , publisher=

work page 2003
[55]

International Conference on Machine Learning , pages=

Failures of gradient-based deep learning , author=. International Conference on Machine Learning , pages=. 2017 , organization=

work page 2017
[56]

arXiv preprint arXiv:2207.08799 , year=

Hidden progress in deep learning: Sgd learns parities near the computational limit , author=. arXiv preprint arXiv:2207.08799 , year=

work page arXiv
[57]

arXiv preprint arXiv:2202.07626 , year=

Random feature amplification: Feature learning and generalization in neural networks , author=. arXiv preprint arXiv:2202.07626 , year=

work page arXiv
[58]

Advances in Neural Information Processing Systems , volume=

Learning parities with neural networks , author=. Advances in Neural Information Processing Systems , volume=

work page
[59]

Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets

Grokking: Generalization beyond overfitting on small algorithmic datasets , author=. arXiv preprint arXiv:2201.02177 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[60]

Acta Mathematica , volume=

The Fourier spectrum of critical percolation , author=. Acta Mathematica , volume=. 2010 , publisher=

work page 2010
[61]

Selected Works of Oded Schramm , pages=

Quantitative noise sensitivity and exceptional times for percolation , author=. Selected Works of Oded Schramm , pages=. 2011 , publisher=

work page 2011
[62]

arXiv preprint arXiv:2211.11567 , year=

Neural networks trained with SGD learn distributions of increasing complexity , author=. arXiv preprint arXiv:2211.11567 , year=

work page arXiv
[63]

arXiv preprint arXiv:2205.15809 , year=

Feature Learning in L\_ \ 2 \ -regularized DNNs: Attraction/Repulsion and Sparsity , author=. arXiv preprint arXiv:2205.15809 , year=

work page arXiv
[64]

Advances in neural information processing systems , volume=

Sgd on neural networks learns functions of increasing complexity , author=. Advances in neural information processing systems , volume=

work page
[65]

Planar random-cluster model: scaling relations , volume=

Duminil-Copin, Hugo and Manolescu, Ioan , year=. Planar random-cluster model: scaling relations , volume=. doi:10.1017/fmp.2022.16 , journal=

work page doi:10.1017/fmp.2022.16 2022
[66]

Journal of Statistical Mechanics: Theory and Experiment , volume=

An analytical theory of curriculum learning in teacher--student networks , author=. Journal of Statistical Mechanics: Theory and Experiment , volume=. 2022 , publisher=

work page 2022
[67]

Proceedings of the 26th annual international conference on machine learning , pages=

Curriculum learning , author=. Proceedings of the 26th annual international conference on machine learning , pages=

work page
[68]

IEEE Transactions on Pattern Analysis and Machine Intelligence , year=

A survey on curriculum learning , author=. IEEE Transactions on Pattern Analysis and Machine Intelligence , year=

work page
[69]

Twenty-ninth AAAI conference on artificial intelligence , year=

Self-paced curriculum learning , author=. Twenty-ninth AAAI conference on artificial intelligence , year=

work page
[70]

An empirical study of example forgetting during deep neural network learning.arXiv preprint arXiv:1812.05159, 2018

An empirical study of example forgetting during deep neural network learning , author=. arXiv preprint arXiv:1812.05159 , year=

work page arXiv
[71]

Journal of Experimental Psychology: Learning, Memory, and Cognition , volume=

When does fading enhance perceptual category learning? , author=. Journal of Experimental Psychology: Learning, Memory, and Cognition , volume=. 2013 , publisher=

work page 2013
[72]

Memory & cognition , volume=

The effects of information order and learning mode on schema abstraction , author=. Memory & cognition , volume=. 1984 , publisher=

work page 1984
[73]

Cognitive psychology , volume=

A rational account of pedagogical reasoning: Teaching by, and learning from, examples , author=. Cognitive psychology , volume=. 2014 , publisher=

work page 2014
[74]

, author=

Generalizing from the use of earlier examples in problem solving. , author=. Journal of Experimental Psychology: Learning, Memory, and Cognition , volume=. 1990 , publisher=

work page 1990
[75]

2003 , publisher=

Introduction to numerical continuation methods , author=. 2003 , publisher=

work page 2003
[76]

Cognition , volume=

Learning and development in neural networks: The importance of starting small , author=. Cognition , volume=. 1993 , publisher=

work page 1993
[77]

Cognition , volume=

Flexible shaping: How learning in small steps helps , author=. Cognition , volume=. 2009 , publisher=

work page 2009
[78]

Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition , pages=

Curriculum learning of multiple tasks , author=. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition , pages=

work page
[79]

Proceedings of the IEEE International Conference on Computer Vision Workshops , pages=

Curriculum learning for multi-task classification of visual attributes , author=. Proceedings of the IEEE International Conference on Computer Vision Workshops , pages=

work page
[80]

Nucleic acids research , volume=

Modeling multi-species RNA modification through multi-task curriculum learning , author=. Nucleic acids research , volume=. 2021 , publisher=

work page 2021

Showing first 80 references.