Grokking as Structural Inference: Transformers Need Bayesian Lottery Tickets

Joseph An; Kai Hidajat; Solden Stoll

arxiv: 2605.15787 · v1 · pith:6SAZDN77new · submitted 2026-05-15 · 💻 cs.LG · cs.AI


Pith reviewed 2026-05-20 21:07 UTC · model grok-4.3

classification 💻 cs.LG cs.AI

keywords: grokking, transformers, attention, generalization, Bayesian inference, lottery tickets, structural learning, delayed generalization


The pith

Transformers generalize only after attention performs Bayesian inference over the full task dependency graph, separate from MLP memorization.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that the long delay before a transformer generalizes, called grokking, occurs because attention and the feed-forward layers solve two different problems that become decoupled early in training. Attention must learn to place enough probability mass on every token that carries task-relevant information, which the authors treat as inferring a hidden dependency graph in a Bayesian way. Once the MLP drives loss near zero by memorizing examples without this structure, attention receives almost no further gradient signal, so weight decay has to first undo the memorization before the missing dependencies can be discovered. This structural waiting time produces the observed inverse dependence on weight decay strength. The account also predicts that an explicit KL penalty pushing attention toward the right structure can shorten the delay according to an inverse scaling law.
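The starvation step can be illustrated with a standard identity rather than anything specific to the paper: the gradient of cross-entropy with respect to the output logits is `softmax(z) - one_hot(y)`, and its L1 norm equals `2 * (1 - exp(-loss))`. Every gradient that reaches attention passes through this term by the chain rule, so it shrinks as the loss does. The sketch below is a toy, not the authors' setup; the function names, the five-class example, and the margin parameter are assumptions chosen for illustration.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D logit vector."""
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()

def ce_loss_and_logit_grad(logits, label):
    """Cross-entropy loss and its gradient with respect to the logits."""
    p = softmax(logits)
    loss = -np.log(p[label])
    grad = p.copy()
    grad[label] -= 1.0  # dL/dz = softmax(z) - one_hot(label)
    return loss, grad

def starvation_curve(margins, num_classes=5, label=0):
    """For each logit margin on the correct class, return (margin, loss, L1 grad norm).

    A larger margin stands in for a more confident memorizing solution
    (an illustrative assumption, not the paper's training dynamics).
    """
    rows = []
    for m in margins:
        logits = np.zeros(num_classes)
        logits[label] = m
        loss, grad = ce_loss_and_logit_grad(logits, label)
        rows.append((m, float(loss), float(np.abs(grad).sum())))
    return rows

if __name__ == "__main__":
    for margin, loss, gnorm in starvation_curve([0.0, 2.0, 5.0, 10.0]):
        print(f"margin={margin:5.1f}  loss={loss:.6f}  |grad|_1={gnorm:.6f}")
```

How much of this small signal survives the Jacobians between the logits and the attention scores is exactly what the paper's bound would have to control; the toy only shows that the upstream factor vanishes with the loss.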

Core claim

We formalize attention as an implicit Bayesian posterior over the task dependency graph and prove that generalization requires two separable conditions: a familiar Goldilocks bound on MLP capacity, coinciding with norm-based theories of grokking, and a novel Bayesian structural condition requiring attention to place sufficient mass on every informative token. This decoupling explains delayed generalization as delayed structural inference. Early in training, the MLP memorizes through unaligned features, drives the cross-entropy loss near zero, and thereby starves attention of structural gradient. Weight decay must then erode memorization before the missing graph becomes learnable, yielding an

What carries the argument

The implicit Bayesian posterior over the task dependency graph that attention must learn to place sufficient mass on every informative token.
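One way to make "sufficient mass on every informative token" concrete is a per-token threshold check on the attention weights. The sketch below assumes a scalar threshold `gamma` and a known index list of informative positions; both are hypothetical stand-ins, and the paper's actual structural condition may be stated differently.

```python
import numpy as np

def min_informative_mass(attn, informative):
    """Smallest attention weight on any informative token, per query row.

    attn: array of shape (..., num_tokens) whose last axis sums to 1.
    informative: list of token indices assumed to carry task information.
    """
    attn = np.asarray(attn, dtype=float)
    return attn[..., informative].min(axis=-1)

def satisfies_structural_condition(attn, informative, gamma):
    """True if every query row gives each informative token at least gamma mass."""
    return bool(np.all(min_informative_mass(attn, informative) >= gamma))

if __name__ == "__main__":
    # Two query rows over three tokens; tokens 0 and 2 are taken to be informative.
    attn = [[0.45, 0.10, 0.45],
            [0.98, 0.01, 0.01]]
    print(min_informative_mass(attn, [0, 2]))
    print(satisfies_structural_condition(attn, [0, 2], gamma=0.05))
```

The second row attends almost entirely to one informative token and nearly drops the other, which is the failure mode the pith describes: the output can still be memorized, but the discarded token cannot be recovered downstream.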

If this is right

Generalization separates into an MLP capacity condition and an attention structure condition.
The grokking delay equals the explaining-away waiting time after memorization is eroded by weight decay.
A KL-based structural intervention produces an inverse-intervention-strength scaling law for grokking time.
Bayesian lottery tickets achieve generalization performance matching or exceeding standard lottery-ticket transfer on algorithmic tasks.
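The KL-based structural intervention named in the third consequence could be sketched as an extra loss term that pulls attention toward a target distribution. The direction of the divergence (`KL(prior || attn)`), the names `prior` and `beta`, and the additive form of the objective are assumptions made for this illustration; the summary does not specify them.

```python
import numpy as np

def kl_divergence(prior, attn, eps=1e-12):
    """KL(prior || attn) for two distributions over the same token positions.

    Terms where prior is zero contribute nothing; eps guards against log(0)
    when attention has collapsed on a token the prior still supports.
    """
    prior = np.asarray(prior, dtype=float)
    attn = np.asarray(attn, dtype=float)
    support = prior > 0
    return float(np.sum(prior[support] *
                        (np.log(prior[support]) - np.log(attn[support] + eps))))

def regularized_loss(ce_loss, attn, prior, beta):
    """Task loss plus a structural penalty of strength beta (hypothetical form)."""
    return ce_loss + beta * kl_divergence(prior, attn)

if __name__ == "__main__":
    prior = [0.5, 0.0, 0.5]          # assumed oracle routing: tokens 0 and 2 matter
    attn = [1 / 3, 1 / 3, 1 / 3]     # uniform attention early in training
    print(regularized_loss(ce_loss=0.0, attn=attn, prior=prior, beta=0.1))
```

The design point is that the penalty stays nonzero even when the cross-entropy term is zero, so attention keeps receiving a structural signal after the MLP has memorized the training set.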

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Attention layers in larger models may exhibit similar structural delays on natural-language tasks if no explicit pressure is applied to keep the dependency graph visible.
Architectures that maintain gradient flow to attention throughout training could eliminate grokking without relying on weight decay.
The same separation of concerns might appear in other attention-based sequence models whenever informative tokens can be ignored without immediate loss penalty.

Load-bearing premise

Attention behaves like a Bayesian update over task dependencies whose gradient signal disappears once the MLP has driven training loss to zero through memorization.

What would settle it

Running the proposed KL structural intervention on the algorithmic sequence tasks and finding that grokking time does not scale inversely with intervention strength would falsify the structural-inference account.
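A check of this kind could be run as a simple regression of measured grokking delay on the reciprocal of the intervention strength. The helper below is a sketch under stated assumptions: `betas` and `delays` are placeholders for values an experimenter would measure, and using an ordinary least-squares line with R² as the diagnostic is one reasonable choice, not the paper's protocol.

```python
import numpy as np

def fit_inverse_scaling(betas, delays):
    """Least-squares fit of delay = slope * (1 / beta) + intercept.

    Returns (slope, intercept, r_squared). An R^2 near 1 is consistent with
    inverse scaling; a low R^2 would count against it.
    """
    x = 1.0 / np.asarray(betas, dtype=float)
    y = np.asarray(delays, dtype=float)
    slope, intercept = np.polyfit(x, y, 1)  # highest-degree coefficient first
    residuals = y - (slope * x + intercept)
    ss_res = float(np.sum(residuals ** 2))
    ss_tot = float(np.sum((y - y.mean()) ** 2))
    r_squared = 1.0 - ss_res / ss_tot
    return float(slope), float(intercept), r_squared
```

The function needs at least two distinct `beta` values and delays that are not all identical, since a constant `delays` vector makes the total sum of squares zero.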

Figures

Figures reproduced from arXiv: 2605.15787 by Joseph An, Kai Hidajat, Solden Stoll.

**Figure 1.** Four Phases of Grokking. (Left) Test accuracy rises only after structural divergence D_KL (red, dashed) has largely collapsed. (Right) The attention task gradient norm falls by orders of magnitude after memorization, producing the Explaining-Away Plateau.

**Figure 2.** Isolating Structure from Capacity. Training trajectories under independent control of the Norm Condition (N) and Structural Condition (B_γ). Adversarial routing prevents generalization even under norm control, while oracle routing without norm control gives only partial generalization.

**Figure 3.** KL Acceleration. (Left) Injecting the structural prior β accelerates generalization. (Right) The grokking delay Δt_grok scales linearly with 1/β.

**Figure 4.** Transferring the Bayesian Ticket. (Left) Regularizing with the structural prior (“Ours”) matches or outpaces transferring a full Lottery Ticket. (Right) The Bayesian Ticket matches or beats the Lottery Ticket across random initializations.

**Figure 5.** Explaining-away bound. The empirical attention-logit gradient remains below the theoretical bound from Lemma 5.1 in both baseline and KL-regularized training. The bound is conservative, but the qualitative statement is sharp: once cross-entropy is small, the task gradient into attention is tiny.

**Figure 6.** Structural priors. (Left) Generic sparsity pressures help, but the KL prior reaches the generalizing solution fastest because it specifies where sparse mass should go. (Right) A learned teacher attention map transfers nearly the same acceleration as the oracle prior.

**Figure 7.** Timing the structural intervention. (Left) A short early KL intervention almost matches an always-on prior, suggesting that the routing ticket persists after the prior is removed. (Right) Late injection still accelerates grokking, but the transition moves later with the activation epoch.

**Figure 8.** Routing under distractors. (Left) With informative positions randomized per sequence, a sequence-dependent structural prior α*(s) still bypasses the plateau. (Right) Under added distractors, KL-trained models extrapolate better to longer contexts, although the longest lengths remain imperfect.

**Figure 9.** Geometry and distributed attention. (Left) Larger embedding dimension accelerates generalization, consistent with the geometric role of representation separation and subspace incoherence. (Right) In a 4-head model, aggregate oracle attention mass rapidly approaches one; KL keeps this distributed mass more tightly aligned.

**Figure 10.** Task diversity. Modular addition gives the cleanest delayed transition, while sparse parity and permutation composition show noisier or faster transitions. Across all three tasks, the attention-gradient diagnostic remains bounded, and KL accelerates the structural route when a plateau is present.

**Figure 11.** Decoupled and optimizer controls. (Left) Once attention is frozen to oracle routing, the downstream network generalizes far earlier than the fully coupled baseline. (Right) MLP norm decay persists across batch-size and Adam momentum variants, supporting the qualitative norm-contraction assumption.

Abstract

Why does a Transformer that has memorized its training set wait thousands of steps before it generalizes? Existing accounts locate this delay in norm minimization, feature emergence, or the late discovery of sparse subnetworks. These explanations capture important parts of the transition, but ignore a constraint unique to attention-based models: if attention discards an informative token, no bounded downstream computation can recover it. We formalize attention as an implicit Bayesian posterior over the task dependency graph and prove that generalization requires two separable conditions: a familiar Goldilocks bound on MLP capacity, coinciding with norm-based theories of grokking, and a novel Bayesian structural condition requiring attention to place sufficient mass on every informative token. This decoupling explains delayed generalization as delayed structural inference. Early in training, the MLP memorizes through unaligned features, drives the cross-entropy loss near zero, and thereby starves attention of structural gradient. Weight decay must then erode memorization before the missing graph becomes learnable, yielding the known inverse-weight-decay delay, which we derive as a structural waiting time. We then prove that this explaining-away delay can be bypassed by a KL-based structural intervention, yielding an inverse-intervention-strength scaling law for the grokking time. Experiments on algorithmic sequence tasks isolate structure from capacity and show that this Bayesian ticket matches or outperforms lottery-ticket transfer.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper frames grokking as a delay in attention learning task dependencies via a Bayesian posterior, with a KL intervention that alters timing, but the gradient-starvation step lacks explicit derivation.


The main thing to know is that this paper treats grokking as the time needed for attention to infer the right token dependencies after the MLP has already memorized the data. They model attention as an implicit Bayesian posterior over the task graph, separate this structural requirement from the usual MLP capacity bound, and derive both the inverse weight-decay delay and a KL-based bypass with its own scaling law.

Referee Report

3 major / 2 minor

Summary. The paper claims that grokking in Transformers results from delayed structural inference in attention. It formalizes attention as an implicit Bayesian posterior over the task dependency graph and proves that generalization requires two separable conditions: a Goldilocks bound on MLP capacity (aligning with norm-based accounts) and a Bayesian structural condition ensuring sufficient attention mass on every informative token. Early memorization drives cross-entropy near zero, starving attention of structural gradient; weight decay must then erode this memorization, yielding the observed inverse-weight-decay delay, which the authors derive as a structural waiting time. A KL-based structural intervention is shown to bypass the delay with an inverse-intervention-strength scaling law. Experiments on algorithmic sequence tasks isolate structure from capacity and indicate that the proposed Bayesian ticket matches or outperforms lottery-ticket transfer.

Significance. If the central derivations hold, the work provides a useful decoupling of capacity and structural conditions that could unify existing grokking explanations with a Bayesian view of attention. The explicit derivation of the waiting time as an explaining-away effect and the intervention scaling law are potentially valuable, as are the experiments that attempt to separate structural from capacity effects. The manuscript ships a falsifiable prediction (inverse-intervention-strength scaling) and reproducible experimental controls on algorithmic tasks.

major comments (3)

[Formalization of attention] Formalization paragraph (beginning 'We formalize attention as an implicit Bayesian posterior'): the mapping from attention logits to an implicit posterior over the task dependency graph is asserted but not derived. Without the explicit posterior expression and the resulting gradient with respect to attention parameters, it remains unclear whether cross-entropy minimization produces strict gradient starvation once loss is small but nonzero, especially under multi-head or multi-layer interactions.
[Derivation of structural waiting time] Derivation of structural waiting time (section presenting the inverse-weight-decay delay): the claim that the delay is a derived structural waiting time requires showing that the waiting-time expression is independent of the same fitted parameters used to define the model itself. The current presentation leaves open the possibility that the derived quantity reduces by construction to a reparameterization of the fitted weight-decay schedule.
[Proofs of the two conditions] Proof of the two separable conditions (section asserting proofs of Goldilocks bound and Bayesian structural condition): the abstract states that both conditions are proved, yet the manuscript supplies no lemmas, equations, or explicit bounds. A load-bearing claim of separability therefore rests on an unshown argument; the experimental isolation of structure from capacity cannot substitute for the missing derivation.
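The gradient requested in the first major comment can be written down for the simplest case, a single head and a single query, using only the standard softmax Jacobian. This is a textbook derivation offered as a sketch, not the paper's Lemma 5.1; the symbols $s_j$ (attention scores), $\alpha_j$ (attention weights), $v_j$ (value vectors), $o$ (head output) and $g$ (upstream gradient) are notation introduced here for illustration.

```latex
% Single-head, single-query attention (illustrative notation):
%   \alpha = \mathrm{softmax}(s), \qquad o = \sum_k \alpha_k v_k, \qquad g = \partial L / \partial o .
\begin{align}
\frac{\partial \alpha_k}{\partial s_j} &= \alpha_k \left( \delta_{kj} - \alpha_j \right), \\
\frac{\partial L}{\partial s_j}
  &= \sum_k \left( g^\top v_k \right) \alpha_k \left( \delta_{kj} - \alpha_j \right)
   = \alpha_j \, g^\top \left( v_j - o \right), \\
\left| \frac{\partial L}{\partial s_j} \right|
  &\le \alpha_j \, \lVert g \rVert \, \lVert v_j - o \rVert .
\end{align}
```

The bound shows two separate ways the score gradient for token $j$ can vanish: the upstream gradient $g$ is small, which is the starvation the paper attributes to near-zero cross-entropy, or the token already receives little mass $\alpha_j$. Whether small loss implies small $\lVert g \rVert$ under multi-head or multi-layer interactions is the part the referee asks the authors to derive.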

minor comments (2)

[Introduction / Related work] Notation for the Bayesian lottery ticket is introduced without a direct comparison table to the standard lottery-ticket hypothesis; a short side-by-side would clarify the claimed novelty.
[Experiments] Figure captions for the algorithmic-task experiments should explicitly state the number of random seeds and whether error bars reflect standard deviation or standard error.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed review. The comments highlight important areas where the formal arguments can be strengthened, and we will incorporate revisions to address each point explicitly while preserving the core contributions on decoupling capacity and structural conditions.


Referee: Formalization paragraph (beginning 'We formalize attention as an implicit Bayesian posterior'): the mapping from attention logits to an implicit posterior over the task dependency graph is asserted but not derived. Without the explicit posterior expression and the resulting gradient with respect to attention parameters, it remains unclear whether cross-entropy minimization produces strict gradient starvation once loss is small but nonzero, especially under multi-head or multi-layer interactions.

Authors: We agree that the current presentation would benefit from greater explicitness. In the revised manuscript we will derive the mapping from attention logits to the implicit posterior over the task dependency graph, obtain the corresponding gradient with respect to attention parameters, and show that cross-entropy minimization produces gradient starvation for structural learning once the loss falls below a quantifiable threshold. The derivation will be extended to multi-head and multi-layer settings via appropriate product bounds on attention mass.
Revision: yes

Referee: Derivation of structural waiting time (section presenting the inverse-weight-decay delay): the claim that the delay is a derived structural waiting time requires showing that the waiting-time expression is independent of the same fitted parameters used to define the model itself. The current presentation leaves open the possibility that the derived quantity reduces by construction to a reparameterization of the fitted weight-decay schedule.

Authors: The structural waiting time is obtained from the explaining-away dynamics of the Bayesian posterior over the dependency graph and depends only on the structural parameters of that graph together with the weight-decay coefficient. We will add an explicit subsection that isolates these structural parameters from the learned MLP weights, thereby demonstrating that the waiting-time expression is not a reparameterization of the training schedule but a direct consequence of the attention mechanism's inference process.
Revision: yes

Referee: Proof of the two separable conditions (section asserting proofs of Goldilocks bound and Bayesian structural condition): the abstract states that both conditions are proved, yet the manuscript supplies no lemmas, equations, or explicit bounds. A load-bearing claim of separability therefore rests on an unshown argument; the experimental isolation of structure from capacity cannot substitute for the missing derivation.

Authors: We acknowledge that the initial submission presented the proofs at a high level. The Goldilocks bound on MLP capacity recovers known norm-based results, while the Bayesian structural condition follows from a lower bound on attention mass required for every informative token. In the revision we will supply the missing lemmas and explicit bounds that establish separability of the two conditions. The algorithmic-task experiments remain as empirical corroboration but will no longer be asked to stand in for the theoretical argument.
Revision: yes

Circularity Check

0 steps flagged

Derivation chain self-contained with independent Bayesian formalization


The paper formalizes attention as an implicit Bayesian posterior over the task dependency graph, proves two separable conditions (Goldilocks MLP capacity bound plus Bayesian structural mass requirement), and derives the inverse-weight-decay delay as an explaining-away structural waiting time from early cross-entropy minimization starving structural gradients. No quoted equations or steps reduce the derived waiting time to a fitted parameter by construction, nor does any load-bearing premise collapse to a self-citation or ansatz smuggled from prior work. The central decoupling supplies independent content beyond norm-based or lottery-ticket accounts, rendering the chain self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

Central claim rests on treating attention as Bayesian posterior inference over a task graph and on the assumption that early memorization removes gradient for that inference.

axioms (1)

Domain assumption: Attention implements an implicit Bayesian posterior over the task dependency graph.
Invoked to prove the structural condition and the explaining-away delay.

invented entities (1)

Bayesian lottery ticket (no independent evidence)
Purpose: Structural subnetwork that places mass on informative tokens.
Introduced to explain generalization after the memorization phase; no independent evidence supplied in abstract.



Reference graph

Works this paper leans on

100 extracted references · 100 canonical work pages

[1]

2024 , publisher =

Abbe, Emmanuel and Boix-Adsera, Enric and Misiakiewicz, Theodor , title =. 2024 , publisher =

work page 2024
[2]

2024 , publisher =

Ahn, Kwangjun and Cheng, Xiang and Song, Minhak and Yun, Chulhee and Jadbabaie, Ali and Sra, Suvrit , title =. 2024 , publisher =

work page 2024
[3]

2019 , publisher =

Allen-Zhu, Zeyuan and Li, Yuanzhi and Song, Zhao , title =. 2019 , publisher =

work page 2019
[4]

The Annals of Statistics , volume =

Anandkumar, Animashree and Valluvan, Ragupathyraj , title =. The Annals of Statistics , volume =. 2013 , url =

work page 2013
[5]

2023 , publisher =

Bai, Yu and Chen, Fan and Wang, Huan and Xiong, Caiming and Mei, Song , title =. 2023 , publisher =

work page 2023
[6]

and Goel, Surbhi and Kakade, Sham and Malach, Eran and Zhang, Cyril , title =

Barak, Boaz and Edelman, Benjamin L. and Goel, Surbhi and Kakade, Sham and Malach, Eran and Zhang, Cyril , title =. 2023 , publisher =

work page 2023
[7]

, title =

Barnfield, Nicholas and Cui, Hugo and Lu, Yue M. , title =. 2025 , publisher =

work page 2025
[8]

2023 , publisher =

Battiloro, Claudio and Spinelli, Indro and Telyatnikov, Lev and Bronstein, Michael and Scardapane, Simone and Lorenzo, Paolo Di , title =. 2023 , publisher =

work page 2023
[9]

and Kucukelbir, Alp and McAuliffe, Jon D

Blei, David M. and Kucukelbir, Alp and McAuliffe, Jon D. , title =. Journal of the American Statistical Association , volume =. 2017 , pages =

work page 2017
[10]

2025 , publisher =

Borde, Haitz Sáez de Ocáriz and Kratsios, Anastasis , title =. 2025 , publisher =

work page 2025
[11]

2025 , publisher =

Boursier, Etienne and Pesme, Scott and Dragomir, Radu-Alexandru , title =. 2025 , publisher =

work page 2025
[12]

2025 , publisher =

Chen, Zheng-An and Luo, Tao , title =. 2025 , publisher =

work page 2025
[13]

2018 , publisher =

Chizat, Lenaic and Bach, Francis , title =. 2018 , publisher =

work page 2018
[14]

Choi, Myung Jin and Tan, Vincent Y. F. and Anandkumar, Animashree and Willsky, Alan S. , title =. 2010 , publisher =

work page 2010
[15]

2024 , publisher =

Clauw, Kenzo and Stramaglia, Sebastiano and Marinazzo, Daniele , title =. 2024 , publisher =

work page 2024
[16]

Transactions on Machine Learning Research , year =

Darvariu, Victor-Alexandru and Hailes, Stephen and Musolesi, Mirco , title =. Transactions on Machine Learning Research , year =

work page
[17]

2023 , publisher =

Davies, Xander and Langosco, Lauro and Krueger, David , title =. 2023 , publisher =

work page 2023
[18]

2025 , publisher =

Deng, Yichuan and Song, Zhao and Xiong, Jing and Yang, Chiwun , title =. 2025 , publisher =

work page 2025
[19]

IEEE Signal Processing Magazine , volume =

Dong, Xiaowen and Thanou, Dorina and Rabbat, Michael and Frossard, Pascal , title =. IEEE Signal Processing Magazine , volume =. 2019 , pages =

work page 2019
[20]

and Lee, Jason D

Du, Simon S. and Lee, Jason D. and Li, Haochuan and Wang, Liwei and Zhai, Xiyu , title =. 2019 , publisher =

work page 2019
[21]

2024 , publisher =

DuSell, Brian and Chiang, David , title =. 2024 , publisher =

work page 2024
[22]

2020 , publisher =

Ebli, Stefania and Defferrard, Michaël and Spreemann, Gard , title =. 2020 , publisher =

work page 2020
[23]

2021 , publisher =

Fatemi, Bahare and Asri, Layla El and Kazemi, Seyed Mehran , title =. 2021 , publisher =

work page 2021
[24]

2020 , publisher =

Franceschi, Luca and Niepert, Mathias and Pontil, Massimiliano and He, Xiao , title =. 2020 , publisher =

work page 2020
[25]

2019 , publisher =

Frankle, Jonathan and Carbin, Michael , title =. 2019 , publisher =

work page 2019
[26]

2024 , publisher =

Golechha, Satvik , title =. 2024 , publisher =

work page 2024
[27]

2019 , publisher =

Grover, Aditya and Zweig, Aaron and Ermon, Stefano , title =. 2019 , publisher =

work page 2019
[28]

2023 , publisher =

Gu, Ming and Yang, Gaoming and Zhou, Sheng and Ma, Ning and Chen, Jiawei and Tan, Qiaoyu and Liu, Meihan and Bu, Jiajun , title =. 2023 , publisher =

work page 2023
[29]

2023 , url =

Gurugubelli, Sravanthi and Chepuri, Sundeep Prabhakar , title =. 2023 , url =

work page 2023
[30]

2022 , publisher =

Hu, Xiaoling and Samaras, Dimitris and Chen, Chao , title =. 2022 , publisher =

work page 2022
[31]

2025 , publisher =

Jeffares, Alan and Schaar, Mihaela van der , title =. 2025 , publisher =

work page 2025
[32]

2020 , publisher =

Katharopoulos, Angelos and Vyas, Apoorv and Pappas, Nikolaos and Fleuret, François , title =. 2020 , publisher =

work page 2020
[33]

IEEE Transactions on Pattern Analysis and Machine Intelligence , volume =

Kazi, Anees and Cosmo, Luca and Ahmadi, Seyed-Ahmad and Navab, Nassir and Bronstein, Michael , title =. IEEE Transactions on Pattern Analysis and Machine Intelligence , volume =. 2023 , pages =

work page 2023
[34]

2026 , publisher =

Khanh, Truong Xuan and Hoa, Truong Quynh and Trung, Luu Duc and Duc, Phan Thanh , title =. 2026 , publisher =

work page 2026
[35]

2018 , publisher =

Kipf, Thomas and Fetaya, Ethan and Wang, Kuan-Chieh and Welling, Max and Zemel, Richard , title =. 2018 , publisher =

work page 2018
[36]

, title =

Korbak, Tomasz and Perez, Ethan and Buckley, Christopher L. , title =. 2022 , publisher =

work page 2022
[37]

and Palomar, Daniel P

Kumar, Sandeep and Ying, Jiaxi and Cardoso, José Vinícius de M. and Palomar, Daniel P. , title =. Journal of Machine Learning Research , volume =. 2020 , pages =

work page 2020
[38]

Advances in Neural Information Processing Systems , volume =

Kumar, Sandeep and Ying, Jiaxi and de Miranda Cardoso, Jose Vinicius and Palomar, Daniel , title =. Advances in Neural Information Processing Systems , volume =. 2019 , publisher =

work page 2019
[39]

and Pehlevan, Cengiz , title =

Kumar, Tanishq and Bordelon, Blake and Gershman, Samuel J. and Pehlevan, Cengiz , title =. 2024 , publisher =

work page 2024
[40]

2020 , publisher =

Lachapelle, Sébastien and Brouillard, Philippe and Deleu, Tristan and Lacoste-Julien, Simon , title =. 2020 , publisher =

work page 2020
[41]

2025 , publisher =

Lapenna, Michela and Bacco, Caterina De , title =. 2025 , publisher =

work page 2025
[42]

2024 , publisher =

Lee, Jaerin and Kang, Bong Gyun and Kim, Kihoon and Lee, Kyoung Mu , title =. 2024 , publisher =

work page 2024
[43]

2020 , publisher =

Li, Zongyi and Kovachki, Nikola and Azizzadenesheli, Kamyar and Liu, Burigede and Bhattacharya, Kaushik and Stuart, Andrew and Anandkumar, Anima , title =. 2020 , publisher =

work page 2020
[44]

and Tegmark, Max and Williams, Mike , title =

Liu, Ziming and Kitouni, Ouail and Nolte, Niklas and Michaud, Eric J. and Tegmark, Max and Williams, Mike , title =. 2022 , publisher =

work page 2022
[45]

and Tegmark, Max , title =

Liu, Ziming and Michaud, Eric J. and Tegmark, Max , title =. 2023 , publisher =

work page 2023
[46]

2021 , publisher =

Lorch, Lars and Rothfuss, Jonas and Schölkopf, Bernhard and Krause, Andreas , title =. 2021 , publisher =

work page 2021
[47]

2019 , publisher =

Loshchilov, Ilya and Hutter, Frank , title =. 2019 , publisher =

work page 2019
[48]

2023 , publisher =

Lu, Jianglin and Xu, Yi and Wang, Huan and Bai, Yue and Fu, Yun , title =. 2023 , publisher =

work page 2023
[49]

and Lee, Jason D

Lyu, Kaifeng and Jin, Jikai and Li, Zhiyuan and Du, Simon S. and Lee, Jason D. and Hu, Wei , title =. 2024 , publisher =

work page 2024
[50]

2020 , publisher =

Lyu, Kaifeng and Li, Jian , title =. 2020 , publisher =

work page 2020
[51]

2025 , publisher =

Maasch, Jacqueline and Neiswanger, Willie and Ermon, Stefano and Kuleshov, Volodymyr , title =. 2025 , publisher =

work page 2025
[52]

2025 , publisher =

Manenti, Alessandro and Zambon, Daniele and Alippi, Cesare , title =. 2025 , publisher =

work page 2025
[53]

2012 , publisher =

Mansinghka, Vikash and Kemp, Charles and Griffiths, Thomas and Tenenbaum, Joshua , title =. 2012 , publisher =

work page 2012
[54]

2025 , publisher =

Marinucci, Lorenzo and Nino, Leonardo Di and D’Acunto, Gabriele and Pandolfo, Mario Edoardo and Lorenzo, Paolo Di and Barbarossa, Sergio , title =. 2025 , publisher =

work page 2025
[55]

and Ribeiro, Alejandro , title =

Mateos, Gonzalo and Segarra, Santiago and Marques, Antonio G. and Ribeiro, Alejandro , title =. IEEE Signal Processing Magazine , volume =. 2019 , pages =

work page 2019
[56]

2019 , publisher =

McKenna, Ryan and Sheldon, Daniel and Miklau, Gerome , title =. 2019 , publisher =

work page 2019
[57]

Proceedings of the National Academy of Sciences , volume =

Mei, Song and Montanari, Andrea and Nguyen, Phan-Minh , title =. Proceedings of the National Academy of Sciences , volume =. 2018 , url =

work page 2018
[58]

2023 , publisher =

Merrill, William and Tsilivis, Nikolaos and Shukla, Aman , title =. 2023 , publisher =

work page 2023
[59]

2025 , publisher =

Minegishi, Gouki and Iwasawa, Yusuke and Matsuo, Yutaka , title =. 2025 , publisher =

work page 2025
[60]

, title =

Mousavi-Hosseini, Alireza and Sanford, Clayton and Wu, Denny and Erdogdu, Murat A. , title =. 2025 , publisher =

work page 2025
[61]

, title =

Murty, Shikhar and Sharma, Pratyusha and Andreas, Jacob and Manning, Christopher D. , title =. 2023 , publisher =

work page 2023
[62]

2026 , publisher =

Musat, Tiberiu , title =. 2026 , publisher =

work page 2026
[63]

2023 , publisher =

Nanda, Neel and Chan, Lawrence and Lieberum, Tom and Smith, Jess and Steinhardt, Jacob , title =. 2023 , publisher =

work page 2023
[64]

2025 , publisher =

Notsawo, Pascal Jr Tikeng and Dumas, Guillaume and Rabusseau, Guillaume , title =. 2025 , publisher =

work page 2025
[65]

2023 , publisher =

Oymak, Samet and Rawat, Ankit Singh and Soltanolkotabi, Mahdi and Thrampoulidis, Christos , title =. 2023 , publisher =

work page 2023
[66]

IDMT-Traffic: An Open Bench- mark Dataset for Acoustic Traffic Monitoring Research

Pastorino, Martina and Moser, Gabriele and Serpico, Sebastiano B. and Zerubia, Josiane , title =. 2021 29th European Signal Processing Conference (EUSIPCO) , year =. doi:10.23919/EUSIPCO54536.2021.9616179 , address =

work page doi:10.23919/eusipco54536.2021.9616179 2021
[67]

1988 , publisher =

Pearl, Judea , title =. 1988 , publisher =

work page 1988
[68]

2009 , publisher =

Pearl, Judea , title =. 2009 , publisher =

work page 2009
[69]

2021 , publisher =

Pezeshki, Mohammad and Kaba, Sékou-Oumar and Bengio, Yoshua and Courville, Aaron and Precup, Doina and Lajoie, Guillaume , title =. 2021 , publisher =

work page 2021
[70]

2022 , publisher =

Phuong, Mary and Hutter, Marcus , title =. 2022 , publisher =

work page 2022
[71]

2022 , publisher =

Power, Alethea and Burda, Yuri and Edwards, Harri and Babuschkin, Igor and Misra, Vedant , title =. 2022 , publisher =

work page 2022
[72]

Prieto, Lucas and Barsbey, Melih and Mediano, Pedro A. M. and Birdal, Tolga , title =. 2025 , publisher =

work page 2025
[73]

2021 , publisher =

Pu, Xingyue and Cao, Tianyue and Zhang, Xiaoyun and Dong, Xiaowen and Chen, Siheng , title =. 2021 , publisher =

work page 2021
[74]

IEEE Access , volume =

Ryu, Junseung and Cho, Namkyeong and Hwang, Hyung Ju , title =. IEEE Access , volume =. 2025 , pages =

work page 2025
[75]

, title =

Sanchez-Lengeling, Benjamin and Reif, Emily and Pearce, Adam and Wiltschko, Alexander B. , title =. Distill , volume =. 2021 , pages =

work page 2021
[76]

2023 , publisher =

Sanford, Clayton and Hsu, Daniel and Telgarsky, Matus , title =. 2023 , publisher =

work page 2023
[77]

and Gori, M

Scarselli, F. and Gori, M. and Ah Chung Tsoi and Hagenbuchner, M. and Monfardini, G. , title =. IEEE Transactions on Neural Networks , volume =. 2009 , pages =

work page 2009
[78]

Schölkopf, Bernhard and Locatello, Francesco and Bauer, Stefan and Ke, Nan Rosemary and Kalchbrenner, Nal and Goyal, Anirudh and Bengio, Yoshua. 2021.
[79]

Segarra, Santiago and Marques, Antonio G. and Mateos, Gonzalo and Ribeiro, Alejandro. 2016.
[80]

Si, Chongjie and Zhang, Debing and Shen, Wei. 2025.

