
arxiv: 2605.25949 · v1 · pith:QARXCA5Znew · submitted 2026-05-25 · 💻 cs.LG · cs.AI · physics.comp-ph

Small Models, Strong Priors: Architectural Inductive Bias for Parameter-Efficient Neural PDE Solvers

Pith reviewed 2026-06-29 22:12 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · physics.comp-ph
keywords neural PDE solvers, wavelet transform, parameter-efficient learning, inductive bias, multiscale modeling, TheWell benchmarks, foundation models, architectural priors

The pith

Wavelet-multiscale priors enable 1-10M-parameter PDE solvers to match 100-1000× larger foundation models on wave and acoustic benchmarks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that architectural inductive bias is more effective than model scale for neural PDE solvers. It introduces WaveLiT, which uses a discrete wavelet transform for multi-resolution tokenization together with a shared-weight multiscale pyramid and auxiliary loss to embed physical structure. Bespoke small WaveLiT models achieve competitive accuracy against much larger foundation models across eight TheWell benchmarks. The largest improvements occur on wave and acoustic problems where the prior aligns with the dynamics and errors do not compound under rollout. A jointly trained 10M-parameter variant displays interpretable transfer patterns that track how well the wavelet-multiscale structure matches each task.

Core claim

WaveLiT combines a discrete wavelet transform for lossless multi-resolution tokenization, an augmented linear attention block, a shared-weight multiscale feature pyramid, and a wavelet-domain auxiliary loss. Bespoke 1-10M-parameter WaveLiT models compete with foundation models of 100-1000× their size across eight TheWell benchmarks, with the largest gains on wave and acoustic-dominated benchmarks where the wavelet-multiscale prior fits the dominant dynamical structure and small per-step errors do not compound geometrically under rollout. Trained jointly across all eight benchmarks, a 10M-parameter foundation variant exhibits a structured, physically interpretable transfer pattern -- strongest where the wavelet-multiscale prior matches the dynamics, weakest on chaotic advection-dominated flows.
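
To make the "lossless tokenization" idea concrete, here is a minimal NumPy sketch of a single-level 2D wavelet tokenizer. It assumes the orthonormal Haar wavelet, which this summary does not specify, and the names `haar_dwt2`, `haar_idwt2`, `tokenize`, and `detokenize` are illustrative rather than taken from the paper's code. The sketch only shows that the transform halves the spatial resolution, quadruples the channel count, and can be inverted exactly.

```python
import numpy as np

def haar_dwt2(x):
    """Single-level orthonormal Haar DWT of an (H, W) array with even H and W."""
    a = x[0::2, 0::2]  # top-left entry of each 2x2 block
    b = x[0::2, 1::2]  # top-right
    c = x[1::2, 0::2]  # bottom-left
    d = x[1::2, 1::2]  # bottom-right
    ll = (a + b + c + d) / 2.0  # scaled local average (coarse approximation)
    lh = (a - b + c - d) / 2.0  # difference between the two columns
    hl = (a + b - c - d) / 2.0  # difference between the two rows
    hh = (a - b - c + d) / 2.0  # diagonal difference
    return ll, lh, hl, hh

def haar_idwt2(ll, lh, hl, hh):
    """Exact inverse of haar_dwt2."""
    h, w = ll.shape
    x = np.empty((2 * h, 2 * w), dtype=ll.dtype)
    x[0::2, 0::2] = (ll + lh + hl + hh) / 2.0
    x[0::2, 1::2] = (ll - lh + hl - hh) / 2.0
    x[1::2, 0::2] = (ll + lh - hl - hh) / 2.0
    x[1::2, 1::2] = (ll - lh - hl + hh) / 2.0
    return x

def tokenize(x):
    """Turn an (H, W) field into (H/2 * W/2) tokens with 4 channels each."""
    return np.stack(haar_dwt2(x), axis=-1).reshape(-1, 4)

def detokenize(tokens, h, w):
    """Rebuild the (h, w) field from the tokens produced by tokenize."""
    bands = tokens.reshape(h // 2, w // 2, 4)
    return haar_idwt2(bands[..., 0], bands[..., 1], bands[..., 2], bands[..., 3])

rng = np.random.default_rng(0)
field = rng.standard_normal((8, 8))
tokens = tokenize(field)
print(tokens.shape)                                   # 16 tokens, 4 channels
print(np.allclose(detokenize(tokens, 8, 8), field))   # reconstruction check
```

Because this transform is orthonormal, the tokens carry the same total energy as the input field, and the inverse recovers the field up to floating-point rounding. A patch embedding with a learned projection generally does not come with that guarantee.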

What carries the argument

WaveLiT architecture that performs lossless tokenization via discrete wavelet transform and enforces multiscale structure through a shared-weight feature pyramid.
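
As a rough illustration of what "shared-weight" could mean for a multiscale pyramid, the sketch below applies one and the same weight matrix to a feature map at several resolutions and sums the results at full resolution. Everything here is an assumption made for illustration: the average-pool downsampling, the nearest-neighbour upsampling, the `tanh` stand-in for the mixing block, and the names `mixer` and `shared_pyramid`. The paper's actual block is not described in enough detail on this page to reproduce.

```python
import numpy as np

def avg_pool2(x):
    """Downsample an (H, W, C) feature map by 2 with 2x2 average pooling."""
    h, w, c = x.shape
    return x.reshape(h // 2, 2, w // 2, 2, c).mean(axis=(1, 3))

def upsample2(x):
    """Nearest-neighbour upsampling of an (H, W, C) feature map by 2."""
    return x.repeat(2, axis=0).repeat(2, axis=1)

def mixer(x, weight):
    """Hypothetical stand-in for a mixing block: per-token linear map + tanh."""
    return np.tanh(x @ weight)

def shared_pyramid(x, weight, levels=3):
    """Apply the same `weight` at every scale and sum outputs at full resolution."""
    out = np.zeros_like(x)
    level_in = x
    for level in range(levels):
        y = mixer(level_in, weight)      # same parameters at every level
        for _ in range(level):
            y = upsample2(y)             # bring coarse output back to full size
        out += y
        if level < levels - 1:
            level_in = avg_pool2(level_in)
    return out

rng = np.random.default_rng(0)
features = rng.standard_normal((8, 8, 4))
weight = rng.standard_normal((4, 4)) * 0.5
print(shared_pyramid(features, weight, levels=3).shape)
print(weight.size)  # parameter count does not depend on `levels`
```

The design point is that adding pyramid levels widens the receptive field without adding parameters, which is one plausible way a 1-10M-parameter budget can still cover several spatial scales.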

If this is right

  • Parameter-efficient models can reach foundation-model accuracy on PDE tasks when the inductive bias matches the dominant physics.
  • Gains concentrate on wave and acoustic problems because small per-step errors remain stable under long rollouts.
  • Joint training across benchmarks produces transfer that is strongest for dynamics aligned with the wavelet-multiscale prior.
  • The pattern of where a prior succeeds or fails supplies an empirical map of what structure it encodes.
  • Single-GPU training becomes feasible once model size is reduced by two to three orders of magnitude.
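
A toy calculation shows why geometric compounding matters for long rollouts. The sketch assumes a simple linear error recurrence `e_{t+1} = rho * e_t + eps`, where `eps` is a fixed per-step error and `rho` is an amplification factor. Both the recurrence and the numbers are illustrative assumptions, not the paper's analysis.

```python
def rollout_error(eps, rho, steps):
    """Accumulated error after `steps` steps of e_{t+1} = rho * e_t + eps, e_0 = 0."""
    err = 0.0
    for _ in range(steps):
        err = rho * err + eps
    return err

eps = 1e-3  # hypothetical per-step error
for rho in (1.0, 1.1):
    errors = [rollout_error(eps, rho, t) for t in (10, 50, 100)]
    print(f"rho={rho}: " + ", ".join(f"{e:.4g}" for e in errors))
```

With `rho = 1` the accumulated error grows linearly in the number of steps. With `rho > 1` it grows like `rho**steps`, so even a modest amplification factor dominates at long horizons.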

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Error-pattern analysis across tasks could be used to select or combine multiple priors for a given physical regime.
  • The same wavelet tokenization approach might extend to other scientific domains that exhibit clear scale separation, such as turbulence or climate fields.
  • Parameter budgets could be allocated preferentially to problems whose dynamics fit an existing strong prior rather than to uniform scaling.
  • Failure signatures might serve as a diagnostic for identifying missing physical mechanisms in a learned solver.

Load-bearing premise

Observed performance gaps are caused by the wavelet-multiscale architectural choices rather than differences in training data, optimizer, or preprocessing between the small models and the large baselines.

What would settle it

Retraining the large foundation models under identical data splits, preprocessing, optimizer settings, and training schedules as the WaveLiT models and finding that the accuracy gap on wave benchmarks disappears.

Figures

Figures reproduced from arXiv: 2605.25949 by Hanwen Wang, Paris Perdikaris, Shyam Sankaran.

Figure 1: Parameter efficiency on PDEArena Navier-Stokes [13]. Each point is a model; lower-left is better. WaveLiT (red stars) outperforms models with 100× more parameters.
Neural surrogates for partial differential equations are increasingly central to scientific computing, with applications spanning weather forecasting [5], fluid dynamics, and materials science [45]. The promise is straightforward: where classica…
Figure 2: The WaveLiT architecture: (a) Input fields are tokenized via a single-level 2D discrete wavelet transform (DWT), projected to embedding dimension via a linear layer, and processed by NL multiscale mixing blocks. The output is recovered by a final linear projection and inverse DWT, restoring the original spatial resolution. (b) Each multiscale mixing block applies a shared-weight linear attention mixer at L…
Figure 3: Foundation model design: (i) input standardization via a per-channel lifting matrix without …
Figure 4: WaveLiT bespoke models vs. foundation model baselines across eight TheWell benchmarks …
Figure 5: Foundation model diagnostic. Relative VRMSE (normalized per dataset to the best model; …
Figure 6: MILA models (triangles) demonstrate significantly lower compute costs for comparable …
Figure 7: Comparative performance of Dot-Product Attention (DPA) and Enhanced Linear Attention …
Figure 8: Radially Averaged Power Spectral Density (RAPSD) of the prediction error for the WaveLiT-9.5M model (1 wavelet level), comparing training with both MSE and L1 wavelet loss terms (“With Wavelet Loss”) versus training with MSE loss alone (“Without Wavelet Loss”). The inclusion of the wavelet loss demonstrably reduces error power across all frequencies. This benefit of the wavelet loss is further demonstra…
Figure 9: Median rollout VRMSE as a function of prediction step (TRL2D). All finetuned methods reduce error accumulation relative to the teacher-forced baseline, with gains widening at longer horizons. Pushforward degrades at both short and long steps as K increases, while Scheduled Sampling and CausalBPTT maintain low error throughout. Shaded bands show the interquartile range over test trajectories.
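
The Figure 8 caption describes training with an MSE term plus an L1 wavelet term. A minimal sketch of such a combined loss follows. It assumes a single-level Haar transform, an L1 penalty averaged over all four sub-bands, and a hypothetical weight `lam`; none of these choices is confirmed by the captions on this page.

```python
import numpy as np

def haar_dwt2(x):
    """Single-level orthonormal Haar DWT; returns a (4, H/2, W/2) array of sub-bands."""
    a, b = x[0::2, 0::2], x[0::2, 1::2]
    c, d = x[1::2, 0::2], x[1::2, 1::2]
    return np.stack([
        (a + b + c + d) / 2.0,  # approximation band
        (a - b + c - d) / 2.0,  # column-difference detail
        (a + b - c - d) / 2.0,  # row-difference detail
        (a - b - c + d) / 2.0,  # diagonal detail
    ])

def wavelet_aux_loss(pred, target, lam=0.1):
    """MSE in physical space plus lam * L1 distance between wavelet coefficients."""
    mse = np.mean((pred - target) ** 2)
    l1_wavelet = np.mean(np.abs(haar_dwt2(pred) - haar_dwt2(target)))
    return mse + lam * l1_wavelet

rng = np.random.default_rng(0)
target = rng.standard_normal((16, 16))
pred = target + 0.05 * rng.standard_normal((16, 16))
print(wavelet_aux_loss(pred, target))
```

An L1 penalty on wavelet coefficients weights small detail-band errors more heavily than a squared penalty would, which is one plausible reason such a term could reduce high-frequency error power.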
original abstract

Neural PDE solvers have followed the scaling trajectory of vision and language, with recent foundation models reaching billions of parameters. We argue that scale is a poor substitute for architectural inductive bias in this domain: structured priors deliver outsized parameter efficiency, and the pattern of where they succeed and fail is itself informative about what they capture. We instantiate this argument in WaveLiT, an architecture combining a discrete wavelet transform for lossless multi-resolution tokenization, an augmented linear attention block, a shared-weight multiscale feature pyramid, and a wavelet-domain auxiliary loss. Bespoke 1-10M-parameter WaveLiT models compete with foundation models of 100-1000$\times$ their size across eight TheWell benchmarks, with the largest gains on wave and acoustic-dominated benchmarks where the wavelet-multiscale prior fits the dominant dynamical structure and small per-step errors do not compound geometrically under rollout. Trained jointly across all eight benchmarks, a 10M-parameter foundation variant exhibits a structured, physically interpretable transfer pattern -- strongest where the wavelet-multiscale prior matches the dynamics, weakest on chaotic advection-dominated flows. The entire pipeline trains on a single GPU. The results suggest that small-model PDE performance is shaped by architectural inductive bias rather than scale, and that the structure of a prior's failures is a useful empirical signal about its content.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces WaveLiT, a 1-10M-parameter architecture for neural PDE solvers that combines discrete wavelet transform tokenization, augmented linear attention, a shared-weight multiscale feature pyramid, and a wavelet-domain auxiliary loss. It claims these models achieve competitive performance against foundation models 100-1000× larger across eight TheWell benchmarks, with largest gains on wave- and acoustic-dominated tasks where the prior aligns with the dynamics; a jointly trained 10M-parameter variant exhibits a structured, physically interpretable transfer pattern, and the full pipeline trains on a single GPU. The results are presented as evidence that architectural inductive bias outperforms scale for this domain.

Significance. If the comparisons hold under controlled conditions, the work provides concrete evidence that domain-specific priors can yield substantial parameter efficiency in scientific ML, with practical benefits for single-GPU training. The structured transfer pattern across benchmarks offers a useful empirical diagnostic for what a prior captures, which is a novel contribution beyond raw performance numbers. The emphasis on where the model succeeds and fails is a methodological strength that could inform future architecture choices.

major comments (2)
  1. [Experimental section / abstract] The central claim that performance gains are attributable to the wavelet-multiscale inductive bias (rather than training disparities) is load-bearing and requires explicit confirmation that foundation-model baselines were trained under identical protocols. The abstract attributes gains to 'where the wavelet-multiscale prior fits the dominant dynamical structure' without stating that baselines were re-trained with the same optimizer, learning-rate schedule, batch size, data normalization, number of steps, or rollout length; this must be addressed in the experimental section with a dedicated protocol-matching table or statement.
  2. [Results / Tables] Quantitative support for the competitive performance claim is referenced in the abstract (eight benchmarks, largest gains on wave/acoustic tasks) but the provided text lacks tables, error bars, or rollout lengths; the full experimental results must include these details (e.g., per-benchmark MSE or rollout error curves) to allow verification that post-hoc selection or training differences do not drive the reported gaps.
minor comments (2)
  1. [Abstract] The abstract would be strengthened by a brief parenthetical reference to the specific table or figure containing the main quantitative comparison.
  2. [Introduction] Clarify whether 'TheWell benchmarks' are drawn from a prior public dataset and provide the citation in the introduction or methods.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback emphasizing experimental rigor. We agree that explicit protocol details and full quantitative results are essential to support the central claims and will revise the manuscript accordingly.

point-by-point responses
  1. Referee: [Experimental section / abstract] The central claim that performance gains are attributable to the wavelet-multiscale inductive bias (rather than training disparities) is load-bearing and requires explicit confirmation that foundation-model baselines were trained under identical protocols. The abstract attributes gains to 'where the wavelet-multiscale prior fits the dominant dynamical structure' without stating that baselines were re-trained with the same optimizer, learning-rate schedule, batch size, data normalization, number of steps, or rollout length; this must be addressed in the experimental section with a dedicated protocol-matching table or statement.

    Authors: We agree this confirmation is required for the claim to hold. The original submission's experimental section described our training setup but did not explicitly compare protocols with the cited foundation-model baselines. In the revision we will add a dedicated 'Training Protocol Equivalence' subsection containing a table that lists optimizer, learning-rate schedule, batch size, data normalization, number of steps, and rollout length for both WaveLiT and the re-trained baselines, confirming they are identical. revision: yes

  2. Referee: [Results / Tables] Quantitative support for the competitive performance claim is referenced in the abstract (eight benchmarks, largest gains on wave/acoustic tasks) but the provided text lacks tables, error bars, or rollout lengths; the full experimental results must include these details (e.g., per-benchmark MSE or rollout error curves) to allow verification that post-hoc selection or training differences do not drive the reported gaps.

    Authors: We acknowledge that the submitted manuscript text did not contain the requested tables or error bars. The revision will expand the results section with a new table reporting per-benchmark MSE (with standard deviations across three seeds), rollout error curves for all eight TheWell tasks, and explicit rollout lengths. This will allow direct verification of the performance gaps and the pattern of gains on wave/acoustic-dominated benchmarks. revision: yes

Circularity Check

0 steps flagged

No circularity; empirical benchmark comparisons with no derivation reducing to inputs by construction

full rationale

The paper presents an architecture (WaveLiT) and reports direct empirical results on eight TheWell benchmarks, attributing relative performance to the wavelet-multiscale prior based on observed patterns. No mathematical derivation chain, first-principles prediction, or fitted parameter is invoked that reduces to its own inputs. No self-citations, uniqueness theorems, or ansatzes are used to justify core claims. The transfer pattern is described as an observed outcome of joint training, not a prediction forced by the model equations. This is a standard empirical comparison paper whose central claims remain independent of any self-referential reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Based on the abstract alone, no explicit free parameters, axioms, or invented entities are described; the architecture itself encodes the inductive bias but the abstract does not enumerate fitted constants or new postulated quantities.


Reference graph

Works this paper leans on

51 extracted references · 29 canonical work pages · 11 internal anchors

  1. [1]

    Phi-4 Technical Report

    Marah Abdin, Jyoti Aneja, Harkirat Behl, Sébastien Bubeck, Ronen Eldan, Suriya Gunasekar, Michael Harrison, Russell J Hewett, Mojan Javaheripi, Piero Kauffmann, et al. Phi-4 technical report. arXiv preprint arXiv:2412.08905, 2024

  2. [2]

    JPEG2000 standard for image compression: concepts, algorithms and VLSI architectures

    Tinku Acharya and Ping-Sing Tsai. JPEG2000 standard for image compression: concepts, algorithms and VLSI architectures. John Wiley & Sons, 2004

  3. [3]

    It’s all connected: A journey through test-time memorization, attentional bias, retention, and online optimization

    Ali Behrouz, Meisam Razaviyayn, Peilin Zhong, and Vahab Mirrokni. It’s all connected: A journey through test-time memorization, attentional bias, retention, and online optimization. arXiv preprint arXiv:2504.13173, 2025

  4. [4]

    Small Language Models are the Future of Agentic AI

    Peter Belcak, Greg Heinrich, Shizhe Diao, Yonggan Fu, Xin Dong, Saurav Muralidharan, Yingyan Celine Lin, and Pavlo Molchanov. Small language models are the future of agentic ai. arXiv preprint arXiv:2506.02153, 2025

  5. [5]

    Aurora: A foundation model of the atmosphere

    Cristian Bodnar, Wessel P Bruinsma, Ana Lucic, Megan Stanley, Johannes Brandstetter, Patrick Garvan, Maik Riechert, Jonathan Weyn, Haiyu Dong, Anna Vaughan, et al. Aurora: A foundation model of the atmosphere. arXiv preprint arXiv:2405.13063, 2024

  6. [6]

    JAX: composable transformations of Python+NumPy programs

    James Bradbury, Roy Frostig, Peter Hawkins, Matthew James Johnson, Chris Leary, Dougal Maclaurin, George Necula, Adam Paszke, Jake VanderPlas, Skye Wanderman-Milne, and Qiao Zhang. JAX: composable transformations of Python+NumPy programs, 2018

  7. [7]

    Conditional positional encodings for vision transformers

    Xiangxiang Chu, Zhi Tian, Bo Zhang, Xinlong Wang, and Chunhua Shen. Conditional positional encodings for vision transformers. arXiv preprint arXiv:2102.10882, 2021

  8. [8]

    Linear attention with global context: A multipole attention mechanism for vision and physics

    Alex Colagrande, Paul Caillon, Eva Feillet, and Alexandre Allauzen. Linear attention with global context: A multipole attention mechanism for vision and physics. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3099–3108, 2025

  9. [9]

    Unsupervised cross-lingual representation learning at scale

    Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer, and Veselin Stoyanov. Unsupervised cross-lingual representation learning at scale. In Proceedings of the 58th annual meeting of the association for computational linguistics, pages 8440–8451, 2020

  10. [10]

    jax-wavelets: The 2D discrete wavelet transform for JAX

    Katherine Crowson. jax-wavelets: The 2D discrete wavelet transform for JAX, 2022

  11. [11]

    Cswin transformer: A general vision transformer backbone with cross-shaped windows

    Xiaoyi Dong, Jianmin Bao, Dongdong Chen, Weiming Zhang, Nenghai Yu, Lu Yuan, Dong Chen, and Baining Guo. Cswin transformer: A general vision transformer backbone with cross-shaped windows. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 12124–12134, 2022

  12. [12]

    An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

    Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020

  13. [13]

    Towards multi-spatiotemporal-scale generalized pde modeling

    Jayesh K Gupta and Johannes Brandstetter. Towards multi-spatiotemporal-scale generalized pde modeling. arXiv preprint arXiv:2209.15616, 2022

  14. [14]

    Demystify mamba in vision: A linear attention perspective

    Dongchen Han, Ziyi Wang, Zhuofan Xia, Yizeng Han, Yifan Pu, Chunjiang Ge, Jun Song, Shiji Song, Bo Zheng, and Gao Huang. Demystify mamba in vision: A linear attention perspective. arXiv preprint arXiv:2405.16605, 2024

  15. [15]

    Dpot: Auto-regressive denoising operator transformer for large-scale pde pre-training

    Zhongkai Hao, Chang Su, Songming Liu, Julius Berner, Chengyang Ying, Hang Su, Anima Anandkumar, Jian Song, and Jun Zhu. Dpot: Auto-regressive denoising operator transformer for large-scale pde pre-training. arXiv preprint arXiv:2403.03542, 2024

  16. [16]

    Flax: A neural network library and ecosystem for JAX

    Jonathan Heek, Anselm Levskaya, Avital Oliver, Marvin Ritter, Bertrand Rondepierre, Andreas Steiner, and Marc van Zee. Flax: A neural network library and ecosystem for JAX, 2024

  17. [17]

    Poseidon: Efficient foundation models for pdes

    Maximilian Herde, Bogdan Raonic, Tobias Rohner, Roger Käppeli, Roberto Molinaro, Emmanuel de Bézenac, and Siddhartha Mishra. Poseidon: Efficient foundation models for pdes. Advances in Neural Information Processing Systems, 37:72525–72624, 2024

  18. [18]

    A first course on wavelets

    Eugenio Hernández and Guido Weiss. A first course on wavelets. CRC press, 1996

  19. [19]

    Parallel models of associative memory: updated edition

    Geoffrey E Hinton and James A Anderson. Parallel models of associative memory: updated edition. Psychology press, 2014

  20. [20]

    simple diffusion: End-to-end diffusion for high resolution images

    Emiel Hoogeboom, Jonathan Heek, and Tim Salimans. simple diffusion: End-to-end diffusion for high resolution images. In International Conference on Machine Learning, pages 13213–13232. PMLR, 2023

  21. [21]

    Wavelet diffusion neural operator

    Peiyan Hu, Rui Wang, Xiang Zheng, Tao Zhang, Haodong Feng, Ruiqi Feng, Long Wei, Yue Wang, Zhi-Ming Ma, and Tailin Wu. Wavelet diffusion neural operator. arXiv preprint arXiv:2412.04833, 2024

  22. [22]

    Less is More: Recursive Reasoning with Tiny Networks

    Alexia Jolicoeur-Martineau. Less is more: Recursive reasoning with tiny networks. arXiv preprint arXiv:2510.04871, 2025

  23. [23]

    Scaling Laws for Neural Language Models

    Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361, 2020

  24. [24]

    Transformers are rnns: Fast autoregressive transformers with linear attention

    Angelos Katharopoulos, Apoorv Vyas, Nikolaos Pappas, and François Fleuret. Transformers are rnns: Fast autoregressive transformers with linear attention. In International conference on machine learning, pages 5156–5165. PMLR, 2020

  25. [25]

    Neural operator: Learning maps between function spaces with applications to pdes

    Nikola Kovachki, Zongyi Li, Burigede Liu, Kamyar Azizzadenesheli, Kaushik Bhattacharya, Andrew Stuart, and Anima Anandkumar. Neural operator: Learning maps between function spaces with applications to pdes. Journal of Machine Learning Research, 24(89):1–97, 2023

  26. [26]

    Operator learning: Algorithms and analysis

    Nikola B Kovachki, Samuel Lanthaler, and Andrew M Stuart. Operator learning: Algorithms and analysis. arXiv preprint arXiv:2402.15715, 2024

  27. [27]

    Fourier Neural Operator for Parametric Partial Differential Equations

    Zongyi Li, Nikola Kovachki, Kamyar Azizzadenesheli, Burigede Liu, Kaushik Bhattacharya, Andrew Stuart, and Anima Anandkumar. Fourier neural operator for parametric partial differential equations. arXiv preprint arXiv:2010.08895, 2020

  28. [28]

    Neural Operator: Graph Kernel Network for Partial Differential Equations

    Zongyi Li, Nikola Kovachki, Kamyar Azizzadenesheli, Burigede Liu, Kaushik Bhattacharya, Andrew Stuart, and Anima Anandkumar. Neural operator: Graph kernel network for partial differential equations. arXiv preprint arXiv:2003.03485, 2020

  29. [29]

    Feature pyramid networks for object detection

    Tsung-Yi Lin, Piotr Dollár, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie. Feature pyramid networks for object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2117–2125, 2017

  30. [30]

    Decoupled Weight Decay Regularization

    Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017

  31. [31]

    Learning nonlinear operators via deeponet based on the universal approximation theorem of operators

    Lu Lu, Pengzhan Jin, Guofei Pang, Zhongqiang Zhang, and George Em Karniadakis. Learning nonlinear operators via deeponet based on the universal approximation theorem of operators. Nature machine intelligence, 3(3):218–229, 2021

  32. [32]

    A theory for multiresolution signal decomposition: the wavelet representation

    Stephane G Mallat. A theory for multiresolution signal decomposition: the wavelet representation. IEEE transactions on pattern analysis and machine intelligence, 11(7):674–693, 1989

  33. [33]

    Multiple physics pretraining for physical surrogate models

    Michael McCabe, Bruno Régaldo-Saint Blancard, Liam Holden Parker, Ruben Ohana, Miles Cranmer, Alberto Bietti, Michael Eickenberg, Siavash Golkar, Geraud Krawezik, Francois Lanusse, et al. Multiple physics pretraining for physical surrogate models. arXiv preprint arXiv:2310.02994, 2023

  34. [34]

    Walrus: A cross-domain foundation model for continuum dynamics

    Michael McCabe, Payel Mukhopadhyay, Tanya Marwah, Bruno Regaldo-Saint Blancard, Francois Rozet, Cristiana Diaconu, Lucas Meyer, Kaze WK Wong, Hadi Sotoudeh, Alberto Bietti, et al. Walrus: A cross-domain foundation model for continuum dynamics. arXiv preprint arXiv:2511.15684, 2025

  35. [35]

    Physix: A foundation model for physics simulations

    Tung Nguyen, Arsh Koneru, Shufan Li, and Aditya Grover. Physix: A foundation model for physics simulations. arXiv preprint arXiv:2506.17774, 2025

  36. [36]

    The well: a large-scale collection of diverse physics simulations for machine learning

    Ruben Ohana, Michael McCabe, Lucas Meyer, Rudy Morel, Fruzsina Agocs, Miguel Beneitez, Marsha Berger, Blakesly Burkhart, Stuart Dalziel, Drummond Fielding, et al. The well: a large-scale collection of diverse physics simulations for machine learning. Advances in Neural Information Processing Systems, 37:44989–45037, 2024

  37. [37]

    Nomad: Nonlinear manifold decoders for operator learning

    Jacob Seidman, Georgios Kissas, Paris Perdikaris, and George J Pappas. Nomad: Nonlinear manifold decoders for operator learning. Advances in Neural Information Processing Systems, 35:5601–5613, 2022

  38. [38]

    Roformer: Enhanced transformer with rotary position embedding

    Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. Roformer: Enhanced transformer with rotary position embedding. Neurocomputing, 568:127063, 2024

  39. [39]

    The lifting scheme: A construction of second generation wavelets

    Wim Sweldens. The lifting scheme: A construction of second generation wavelets. SIAM journal on mathematical analysis, 29(2):511–546, 1998

  40. [40]

    Wavelet neural operator: a neural operator for parametric partial differential equations

    Tapas Tripura and Souvik Chakraborty. Wavelet neural operator: a neural operator for parametric partial differential equations. arXiv preprint arXiv:2205.02191, 2022

  41. [41]

    MesaNet: Sequence Modeling by Locally Optimal Test-Time Training

    Johannes von Oswald, Nino Scherrer, Seijin Kobayashi, Luca Versari, Songlin Yang, Maximilian Schlegel, Kaitlin Maile, Yanick Schimpf, Oliver Sieberling, Alexander Meulemans, et al. Mesanet: Sequence modeling by locally optimal test-time training. arXiv preprint arXiv:2506.05233, 2025

  42. [42]

    Scaling laws in patchification: An image is worth 50,176 tokens and more

    Feng Wang, Yaodong Yu, Guoyizhe Wei, Wei Shao, Yuyin Zhou, Alan Yuille, and Cihang Xie. Scaling laws in patchification: An image is worth 50,176 tokens and more. arXiv preprint arXiv:2502.03738, 2025

  43. [43]

    Hierarchical Reasoning Model

    Guan Wang, Jin Li, Yuhao Sun, Xing Chen, Changling Liu, Yue Wu, Meng Lu, Sen Song, and Yasin Abbasi Yadkori. Hierarchical reasoning model. arXiv preprint arXiv:2506.21734, 2025

  44. [44]

    Test-time regression: a unifying framework for designing sequence models with associative memory

    Ke Alexander Wang, Jiaxin Shi, and Emily B Fox. Test-time regression: a unifying framework for designing sequence models with associative memory. arXiv preprint arXiv:2501.12352, 2025

  45. [45]

    Micrometer: Micromechanics transformer for predicting mechanical responses of heterogeneous materials

    Sifan Wang, Tong-Rui Liu, Shyam Sankaran, and Paris Perdikaris. Micrometer: Micromechanics transformer for predicting mechanical responses of heterogeneous materials. arXiv preprint arXiv:2410.05281, 2024

  46. [46]

    Respecting causality is all you need for training physics-informed neural networks

    Sifan Wang, Shyam Sankaran, and Paris Perdikaris. Respecting causality is all you need for training physics-informed neural networks. arXiv preprint arXiv:2203.07404, 2022

  47. [47]

    Cvit: Continuous vision transformer for operator learning

    Sifan Wang, Jacob H Seidman, Shyam Sankaran, Hanwen Wang, George J Pappas, and Paris Perdikaris. Cvit: Continuous vision transformer for operator learning. arXiv preprint arXiv:2405.13998, 2024

  48. [48]

    mt5: A massively multilingual pre-trained text-to-text transformer

    Linting Xue, Noah Constant, Adam Roberts, Mihir Kale, Rami Al-Rfou, Aditya Siddhant, Aditya Barua, and Colin Raffel. mt5: A massively multilingual pre-trained text-to-text transformer. In Proceedings of the 2021 conference of the North American chapter of the association for computational linguistics: Human language technologies, pages 483–498, 2021

  49. [49]

    Gated Delta Networks: Improving Mamba2 with Delta Rule

    Songlin Yang, Jan Kautz, and Ali Hatamizadeh. Gated delta networks: Improving mamba2 with delta rule. arXiv preprint arXiv:2412.06464, 2024

  50. [50]

    Scaling vision transformers

    Xiaohua Zhai, Alexander Kolesnikov, Neil Houlsby, and Lucas Beyer. Scaling vision transformers. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 12104–12113, 2022

  51. [51]

    Wavelet-based image tokenizer for vision transformers

    Zhenhai Zhu and Radu Soricut. Wavelet-based image tokenizer for vision transformers. arXiv preprint arXiv:2405.18616, 2024

A Linear Attention and the Ridge Regression View

From softmax to linear attention. Standard attention computes Attn(Q, K, V) = softmax(QK⊤/√d)V, whose QK⊤ matrix is N×N and makes both memory and compute scale quadratically with...
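
The quadratic cost comes from materializing the N×N score matrix. As a hedged illustration of the general linear-attention idea, the sketch below follows the kernel formulation of Katharopoulos et al. [24], which is not necessarily the augmented block used in WaveLiT. It replaces softmax with a positive feature map and reorders the matrix products so that no N×N matrix is formed. The feature map `elu(x) + 1` and all function names are assumptions made for this example.

```python
import numpy as np

def softmax_attention(q, k, v):
    """Standard attention; builds an (N, N) weight matrix."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v

def feature_map(x):
    """elu(x) + 1: strictly positive, so the normalizer below is never zero."""
    return np.where(x > 0, x + 1.0, np.exp(x))

def linear_attention_quadratic(q, k, v):
    """Kernelized attention written the naive way, with an explicit (N, N) matrix."""
    a = feature_map(q) @ feature_map(k).T
    return (a @ v) / a.sum(axis=-1, keepdims=True)

def linear_attention(q, k, v):
    """Same result, reordered: cost is linear in the number of tokens N."""
    fq, fk = feature_map(q), feature_map(k)
    kv = fk.T @ v          # (d, d_v) summary of all keys and values
    z = fk.sum(axis=0)     # (d,) normalizer summary
    return (fq @ kv) / (fq @ z)[:, None]

rng = np.random.default_rng(0)
q, k = rng.standard_normal((2, 64, 8))
v = rng.standard_normal((64, 4))
print(np.allclose(linear_attention(q, k, v), linear_attention_quadratic(q, k, v)))
print(softmax_attention(q, k, v).shape)
```

The two kernelized versions agree because matrix multiplication is associative: `(φ(Q) φ(K)⊤) V` equals `φ(Q) (φ(K)⊤ V)`. Only the second ordering avoids the N×N intermediate, which matters when every wavelet coefficient location is a token.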