Scalable Spatiotemporal Inference with Biased Scan Attention Transformer Neural Processes

arxiv: 2506.09163 · v3 · submitted 2025-06-10 · 💻 cs.LG · stat.ML

Daniel Jenson, Jhonathan Navott, Piotr Grynfelder, Mengyan Zhang, Makkunda Sharma, Elizaveta Semenova, Seth Flaxman

Pith reviewed 2026-05-19 09:55 UTC · model grok-4.3

keywords neural processes · spatiotemporal inference · transformer attention · translation invariance · scalable modeling · kernel regression · biased scan attention

The pith

The Biased Scan Attention Transformer Neural Process matches or exceeds leading model accuracy on spatiotemporal tasks while training faster and scaling to over a million points.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper sets out to demonstrate that the usual accuracy-versus-scalability trade-off in neural processes is often unnecessary when the underlying stochastic process has translation invariance. The authors introduce the Biased Scan Attention Transformer Neural Process, which combines kernel regression blocks with group-invariant attention biases and a memory-efficient biased scan attention mechanism. These choices let the model learn at multiple resolutions at once, model joint space-time evolution directly, incorporate high-dimensional inputs, and run large inference tasks quickly on modest hardware. A reader would care because applications in climate, epidemiology, and similar domains routinely need exactly this combination of fidelity and speed on big datasets.

Core claim

The Biased Scan Attention Transformer Neural Process, built from Kernel Regression Blocks, group-invariant attention biases, and memory-efficient Biased Scan Attention, matches or exceeds the accuracy of the best existing models while often training in a fraction of the time. It exhibits translation invariance that supports simultaneous learning at multiple resolutions, models processes that evolve in both space and time, supports high-dimensional fixed effects, and performs inference on over one million test points together with one hundred thousand context points in under a minute on a single 24 GB GPU.

What carries the argument

Biased Scan Attention together with group-invariant attention biases and Kernel Regression Blocks, which enforce translation invariance and enable efficient attention over large context and target sets in a neural process.
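
To make the invariance idea concrete, here is a minimal sketch of a distance-based attention bias in the RBF form that this page quotes from the paper (Eq. 4), namely a sum over basis functions of `a_f * exp(-b_f * ||q_i - k_j||^2)`. The function names, the coefficient values, and the sample locations are all hypothetical, and this is not the `dl4bi` implementation. The point it illustrates is only that a bias built from differences of locations is unchanged when every location is shifted by the same vector.

```python
import math

def rbf_bias(q_loc, k_loc, a, b):
    """Bias matrix with entries sum_f a_f * exp(-b_f * ||q_i - k_j||^2)."""
    bias = []
    for qi in q_loc:
        row = []
        for kj in k_loc:
            # Squared Euclidean distance depends only on the difference qi - kj.
            sq_dist = sum((u - v) ** 2 for u, v in zip(qi, kj))
            row.append(sum(af * math.exp(-bf * sq_dist) for af, bf in zip(a, b)))
        bias.append(row)
    return bias

def translate(locs, shift):
    """Shift every location by the same vector."""
    return [tuple(u + s for u, s in zip(p, shift)) for p in locs]

# Hypothetical 2D query/key locations and basis coefficients (illustration only).
queries = [(0.0, 0.0), (1.0, 2.0)]
keys = [(0.5, -1.0), (2.0, 2.0), (-1.0, 0.0)]
a = [1.0, 0.5]
b = [0.1, 2.0]

shift = (10.0, -3.0)
original = rbf_bias(queries, keys, a, b)
shifted = rbf_bias(translate(queries, shift), translate(keys, shift), a, b)

# Largest entrywise change in the bias after translating all inputs together.
max_diff = max(abs(x - y)
               for r0, r1 in zip(original, shifted)
               for x, y in zip(r0, r1))
print(max_diff)
```

Because the bias sees only pairwise differences, the printed difference should be zero up to floating-point rounding. A bias that depended on absolute coordinates would not pass this check.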

If this is right

Inference becomes practical on over one million test points and one hundred thousand context points in under a minute on a single GPU.
Multi-resolution learning occurs automatically because of the built-in translation invariance.
High-dimensional fixed effects can be included without breaking scalability.
Space and time evolution can be modeled transparently inside the same neural process framework.
Training often finishes in a fraction of the time needed by prior high-accuracy models.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same bias technique could be added to other attention-based models to improve their spatial awareness.
Domains such as robotics and geology that already use large spatiotemporal datasets may gain from the reduced training time.
Testing on processes that lack translation invariance would map the precise limits of the architecture's strengths.
Hybrid combinations with existing Gaussian process approximations might produce even more flexible stochastic models.

Load-bearing premise

The processes being modeled are fully or partially translation-invariant, so that the group-invariant biases and multi-resolution learning deliver their reported advantages.
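
One standard way to state this premise formally is sketched below. The notation is an assumption of this note, not the paper's: $x_c^{(i)}, y_c^{(i)}$ are context inputs and outputs, $x_t, y_t$ are a target input and output, and $\tau$ is a shift. "Partially" translation-invariant would then presumably mean that the identity holds only for shifts along some input dimensions.

```latex
% Translation invariance of the predictive distribution (assumed notation):
p\bigl(y_t \mid x_t + \tau,\ \{(x_c^{(i)} + \tau,\ y_c^{(i)})\}_{i=1}^{n_c}\bigr)
  = p\bigl(y_t \mid x_t,\ \{(x_c^{(i)},\ y_c^{(i)})\}_{i=1}^{n_c}\bigr)
  \quad \text{for all shifts } \tau .

% A sufficient condition for a constant-mean Gaussian process: a stationary kernel,
k(x, x') = \kappa(x - x') .
```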

What would settle it

A direct comparison on data generated from a clearly non-translation-invariant process, in which the new model loses its accuracy edge or its speed advantage over standard neural processes, would falsify the central claim.

Figures

Figures reproduced from arXiv: 2506.09163 by Daniel Jenson, Elizaveta Semenova, Jhonathan Navott, Makkunda Sharma, Mengyan Zhang, Piotr Grynfelder, Seth Flaxman.

**Figure 2.** Susceptible-Infected-Recovered (SIR) tasks. Susceptible individuals are represented by

**Figure 3.** ERA5 ground surface temperature from a sample in northern Europe

**Figure 4.** An example of mean and uncertainty predictions of the Geodesic, RBF, and Embed variants

**Figure 5.** Validation NLL on the SIR benchmark for seed 91. All seeds exhibited the same pattern.

**Figure 6.** An example of a batch of multiresolution 2D GP tasks. Predictions are from BSA-TNP.

Original abstract

Neural Processes (NPs) are a rapidly evolving class of models designed to directly model the posterior predictive distribution of stochastic processes. While early architectures were developed primarily as a scalable alternative to Gaussian Processes (GPs), modern NPs tackle far more complex and data-hungry applications spanning geology, epidemiology, climate, and robotics. These applications have placed increasing pressure on the scalability of these models, with many architectures compromising accuracy for scalability. In this paper, we demonstrate that this trade-off is often unnecessary, particularly when modeling fully or partially translation-invariant processes. We propose a versatile new architecture, the Biased Scan Attention Transformer Neural Process (BSA-TNP), which introduces Kernel Regression Blocks (KRBlocks), group-invariant attention biases, and memory-efficient Biased Scan Attention (BSA). BSA-TNP is able to: (1) match or exceed the accuracy of the best models while often training in a fraction of the time, (2) exhibit translation invariance, enabling learning at multiple resolutions simultaneously, (3) transparently model processes that evolve in both space and time, (4) support high-dimensional fixed effects, and (5) scale gracefully, running inference on over 1M test points and 100K context points in under a minute on a single 24GB GPU. Code is provided as part of the `dl4bi` package.
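
The abstract does not spell out how Biased Scan Attention saves memory. A plausible reading, offered here as an assumption and not as the authors' method, is the well-known online-softmax approach to memory-efficient attention: scan over the keys in chunks, keep a running maximum, a running normalizer, and a running weighted sum, and add the bias to each chunk of scores as it is computed. The sketch below uses plain Python lists. The function names, the `bias(i, j)` callback, and all numeric values are hypothetical.

```python
import math

def naive_biased_attention(q, k, v, bias):
    """Reference: softmax(q.k / sqrt(d) + bias) applied to v, full score row at once."""
    d = len(q[0])
    out = []
    for i, qi in enumerate(q):
        scores = [sum(x * y for x, y in zip(qi, kj)) / math.sqrt(d) + bias(i, j)
                  for j, kj in enumerate(k)]
        m = max(scores)
        w = [math.exp(s - m) for s in scores]
        z = sum(w)
        out.append([sum(wj * vj[c] for wj, vj in zip(w, v)) / z
                    for c in range(len(v[0]))])
    return out

def scan_biased_attention(q, k, v, bias, chunk=2):
    """Same result, but only `chunk` scores per query are held at any time."""
    d = len(q[0])
    out = []
    for i, qi in enumerate(q):
        m = -math.inf               # running maximum of the scores seen so far
        z = 0.0                     # running softmax normalizer
        acc = [0.0] * len(v[0])     # running unnormalized weighted sum of values
        for start in range(0, len(k), chunk):
            idx = range(start, min(start + chunk, len(k)))
            scores = [sum(x * y for x, y in zip(qi, k[j])) / math.sqrt(d) + bias(i, j)
                      for j in idx]
            m_new = max(m, max(scores))
            scale = math.exp(m - m_new)   # equals 0.0 on the first chunk
            z *= scale
            acc = [x * scale for x in acc]
            for s, j in zip(scores, idx):
                w = math.exp(s - m_new)
                z += w
                acc = [x + w * vc for x, vc in zip(acc, v[j])]
            m = m_new
        out.append([x / z for x in acc])
    return out

# Hypothetical toy inputs: 2 queries, 5 keys, 1-dimensional values.
q = [[1.0, 0.0], [0.5, -1.0]]
k = [[1.0, 2.0], [0.0, 1.0], [-1.0, 0.5], [2.0, -1.0], [0.3, 0.3]]
v = [[1.0], [2.0], [3.0], [4.0], [5.0]]
q_loc = [0.0, 1.0]
k_loc = [0.0, 0.5, 1.0, 1.5, 2.0]

def bias(i, j):
    # Distance-based bias computed on the fly, so no full bias matrix is stored.
    return -((q_loc[i] - k_loc[j]) ** 2)

full = naive_biased_attention(q, k, v, bias)
chunked = scan_biased_attention(q, k, v, bias, chunk=2)
```

The two functions agree up to rounding for any chunk size. The design choice that matters for scale is that both the scores and the bias are produced chunk by chunk, so peak memory grows with the chunk size, not with the number of keys.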

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

BSA-TNP adds a scan-based transformer variant to neural processes that targets large spatiotemporal datasets, but the translation-invariance claim looks fragile once scan ordering is considered.

The core contribution is a new neural process architecture called BSA-TNP that combines Kernel Regression Blocks with biased scan attention and group-invariant attention biases. The goal is to keep accuracy competitive while scaling inference to over a million points on one GPU and supporting multi-resolution learning through claimed translation invariance. The paper also releases code in the dl4bi package, which helps with checking the implementation directly. These pieces address a real pain point in climate and epidemiology work where standard NPs or GPs hit memory walls too quickly. The design choices around space-time modeling and high-dimensional fixed effects are laid out clearly enough to follow the intended flow.

The stress-test concern about scan ordering is worth taking seriously. Fixed scan order can introduce positional dependencies that group-invariant biases may not fully remove, especially when both space and time are involved; without an explicit equivariant ordering or a short derivation showing cancellation, the invariance claim stays approximate rather than guaranteed. The abstract states strong accuracy and speed numbers, yet the provided text gives no concrete baselines, error bars, or ablation tables, so those performance assertions cannot be weighed yet.

This paper is aimed at researchers who already use neural processes or transformer variants for stochastic processes and need something that runs on bigger grids without custom kernels. A reader who cares about practical scaling in spatiotemporal settings will find the architecture description useful even if the invariance argument needs tightening. The work shows enough coherent thinking and engagement with the NP literature to merit referee time, though the experiments will decide how much revision is required. I would send it for peer review.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces the Biased Scan Attention Transformer Neural Process (BSA-TNP), a Neural Process architecture that incorporates Kernel Regression Blocks (KRBlocks), group-invariant attention biases, and memory-efficient Biased Scan Attention (BSA). It claims to match or exceed the accuracy of leading models while training faster, exhibit translation invariance that enables simultaneous multi-resolution learning, transparently model processes evolving in both space and time, support high-dimensional fixed effects, and scale inference to over 1M test points and 100K context points in under a minute on a single 24GB GPU. Code is released in the dl4bi package.

Significance. If the central claims hold, the work would be a meaningful contribution to scalable spatiotemporal modeling in Neural Processes, potentially reducing the accuracy-scalability trade-off for translation-invariant processes common in climate, epidemiology, and robotics applications. The release of code supports reproducibility and is a clear strength.

major comments (2)

[Methods (architecture description)] The claim that BSA-TNP exhibits translation invariance (enabling multi-resolution learning) rests on the group-invariant attention biases in the KRBlocks and BSA mechanism, yet no derivation or proof is provided showing that scan ordering preserves equivariance under combined space-time translations. This is load-bearing for claims (2) and (5) in the abstract.
[Experiments] The abstract asserts strong performance numbers (accuracy matching or exceeding baselines, training in a fraction of the time, scaling to 1M+ points), but the manuscript provides insufficient experimental details, specific baselines, error bars, or ablation studies on the invariance and multi-resolution properties to support these.

minor comments (2)

Notation for the biased scan attention and kernel regression blocks could be introduced more gradually with a small illustrative example to aid readers new to scan-based attention.
Figure captions should explicitly state the number of runs or seeds used for reported metrics.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback and for identifying areas where additional rigor and detail would strengthen the manuscript. We address each major comment below and commit to revisions that directly respond to the concerns.

Referee: [Methods (architecture description)] The claim that BSA-TNP exhibits translation invariance (enabling multi-resolution learning) rests on the group-invariant attention biases in the KRBlocks and BSA mechanism, yet no derivation or proof is provided showing that scan ordering preserves equivariance under combined space-time translations. This is load-bearing for claims (2) and (5) in the abstract.

Authors: We agree that a formal derivation is needed to rigorously establish that the biased scan ordering preserves the desired equivariance properties under space-time translations. In the revised manuscript we will add an appendix containing a step-by-step proof that the combination of group-invariant attention biases and the fixed but translation-equivariant scan ordering maintains the required invariance. This addition will directly support claims (2) and (5). revision: yes
Referee: [Experiments] The abstract asserts strong performance numbers (accuracy matching or exceeding baselines, training in a fraction of the time, scaling to 1M+ points), but the manuscript provides insufficient experimental details, specific baselines, error bars, or ablation studies on the invariance and multi-resolution properties to support these.

Authors: We acknowledge that the current experimental section would benefit from greater transparency. In the revision we will expand the experiments to: (i) list all baselines with precise citations and implementation details, (ii) report error bars from at least five independent runs with different random seeds, (iii) include dedicated ablation studies that isolate the contribution of the translation-invariance mechanisms and the multi-resolution training regime, and (iv) provide additional hardware, timing, and dataset statistics for the large-scale inference experiments. These changes will better substantiate the performance claims in the abstract. revision: yes

Circularity Check

0 steps flagged

No circularity: architecture claims rest on independent design choices, not self-definition or fitted inputs.

The abstract and design summary present KRBlocks, group-invariant attention biases, and Biased Scan Attention as explicit architectural innovations whose properties (translation invariance, multi-resolution learning, spatiotemporal modeling) are asserted to follow from the construction rather than being presupposed by it. No equations, predictions, or central claims are shown to reduce by definition to fitted parameters or prior self-citations; performance and scaling results are framed as empirical outcomes. The derivation chain therefore remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Abstract-only review limits visibility into internal parameters; translation invariance is the main domain assumption stated.

axioms (1)

Domain assumption: Target processes are fully or partially translation-invariant.
The architecture is introduced specifically for such processes to enable multi-resolution learning and invariance properties.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: Group-invariant attention biases... RBF-networks: B(h)_ij = sum_f a_f exp(-b_f ||q_i^omega - k_j^omega||^2) (Eq. 4); Theorem 2 on G-invariance of BSA-TNP

IndisputableMonolith/Foundation/DimensionForcing.lean · alexander_duality_circle_linking · unclear
Paper passage: KRBlock... iterative kernel regression... O(n_c^2 + n_c n_t) complexity
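
As a rough worked example of what the quoted O(n_c^2 + n_c n_t) complexity means at the scale named in the abstract (100K context points, 1M test points), the sketch below counts pairwise interactions only. It ignores constants, heads, layers, and feature dimensions. The comparison against full self-attention over the concatenated context and target set is an illustration added here, not a figure from the paper, and the function names are hypothetical.

```python
def krblock_pairs(n_c, n_t):
    """Pairwise interactions under the quoted O(n_c^2 + n_c * n_t) cost."""
    return n_c * n_c + n_c * n_t

def full_attention_pairs(n_c, n_t):
    """Pairwise interactions if all points attended to all points (assumed baseline)."""
    return (n_c + n_t) ** 2

n_c, n_t = 100_000, 1_000_000   # sizes taken from the abstract's scaling claim

kr = krblock_pairs(n_c, n_t)            # 1e10 + 1e11 = 1.1e11
full = full_attention_pairs(n_c, n_t)   # (1.1e6)^2 = 1.21e12
ratio = full / kr                       # 11.0

print(f"KRBlock-style pairs: {kr:.3e}")
print(f"Full-attention pairs: {full:.3e}")
print(f"Ratio: {ratio:.1f}x")
```

Under these assumptions the cost is dominated by the context-to-target term, and targets never attend to each other, which is where the factor of 11 at this scale comes from.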

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Spectral Transformer Neural Processes
cs.LG 2026-05 unverdicted novelty 6.0

STNPs extend TNPs with a spectral aggregator that estimates context spectra, forms spectral mixtures, and injects task-adaptive frequency features to better handle periodicity.


[1] [1]

Andersson, Wessel P

Tom R. Andersson, Wessel P. Bruinsma, Stratis Markou, James Requeima, Alejandro Coca- Castro, Anna Vaughan, Anna-Louise Ellis, Matthew A. Lazzara, Dani Jones, Scott Hosking, and et al. Environmental sensor placement with convolutional gaussian neural processes. Environmental Data Science, 2:e32, 2023

work page 2023

[2] [2]

Matthew Ashman, Cristiana Diaconu, Junhyuck Kim, Lakee Sivaraya, Stratis Markou, James Requeima, Wessel P Bruinsma, and Richard E. Turner. Translation equivariant transformer neural processes. In Ruslan Salakhutdinov, Zico Kolter, Katherine Heller, Adrian Weller, Nuria Oliver, Jonathan Scarlett, and Felix Berkenkamp, editors, Proceedings of the 41st Inter...

work page 1924

[3] [3]

Matthew Ashman, Cristiana Diaconu, Eric Langezaal, Adrian Weller, and Richard E. Turner. Gridded transformer neural processes for large unstructured spatio-temporal data, 2024

work page 2024

[4] [4]

Self-supervised learning from images with a joint- embedding predictive architecture

Mahmoud Assran, Quentin Duval, Ishan Misra, Piotr Bojanowski, Pascal Vincent, Michael Rabbat, Yann LeCun, and Nicolas Ballas. Self-supervised learning from images with a joint- embedding predictive architecture. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023

work page 2023

[5] [5]

Longformer: The Long-Document Transformer

Iz Beltagy, Matthew E. Peters, and Arman Cohan. Longformer: The long-document transformer. arXiv:2004.05150, 2020

work page internal anchor Pith review Pith/arXiv arXiv 2004

[6] Srinadh Bhojanapalli, Ayan Chakrabarti, Andreas Veit, Michal Lukasik, Himanshu Jain, Frederick Liu, Yin-Wen Chang, and Sanjiv Kumar. Leveraging redundancy in attention with reuse transformers, 2022.

[7] James Bradbury, Roy Frostig, Peter Hawkins, Matthew James Johnson, Chris Leary, Dougal Maclaurin, George Necula, Adam Paszke, Jake VanderPlas, Skye Wanderman-Milne, and Qiao Zhang. JAX: composable transformations of Python+NumPy programs, 2018.

[8] Song Chen. Beijing Multi-Site Air Quality [dataset]. UCI Machine Learning Repository, 2017.

[9] Rewon Child, Scott Gray, Alec Radford, and Ilya Sutskever. Generating long sequences with sparse transformers, 2019.

[10] Krzysztof Marcin Choromanski, Valerii Likhosherstov, David Dohan, Xingyou Song, Andreea Gane, Tamas Sarlos, Peter Hawkins, Jared Quincy Davis, Afroz Mohiuddin, Lukasz Kaiser, David Benjamin Belanger, Lucy J. Colwell, and Adrian Weller. Rethinking attention with performers. In International Conference on Learning Representations, 2021.

[11] Copernicus Climate Change Service. Near surface meteorological variables from 1979 to 2018 derived from bias-corrected reanalysis, 2020. Dataset.

[12] Tri Dao. FlashAttention-2: Faster attention with better parallelism and work partitioning. In International Conference on Learning Representations (ICLR), 2024.

[13] Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. FlashAttention: Fast and memory-efficient exact attention with IO-awareness. In Advances in Neural Information Processing Systems (NeurIPS), 2022.

[14] Juechu Dong, Boyuan Feng, Driss Guessous, Yanbo Liang, and Horace He. Flex attention: A programming model for generating optimized attention kernels, 2024.

[15] Leo Feng, Hossein Hajimirsadeghi, Yoshua Bengio, and Mohamed Osama Ahmed. Latent bottlenecked attentive neural processes. In The Eleventh International Conference on Learning Representations, 2023.

[16] Leo Feng, Frederick Tung, Hossein Hajimirsadeghi, Yoshua Bengio, and Mohamed Osama Ahmed. Memory efficient neural processes via constant memory attention block, 2024.

[17] Marta Garnelo, Dan Rosenbaum, Christopher Maddison, Tiago Ramalho, David Saxton, Murray Shanahan, Yee Whye Teh, Danilo Rezende, and S. M. Ali Eslami. Conditional neural processes. In Jennifer Dy and Andreas Krause, editors, Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 170...

[18] Marta Garnelo, Jonathan Schwarz, Dan Rosenbaum, Fabio Viola, Danilo J. Rezende, S. M. Ali Eslami, and Yee Whye Teh. Neural processes, 2018.

[19] Jonathan Gordon, Wessel P. Bruinsma, Andrew Y. K. Foong, James Requeima, Yann Dubois, and Richard E. Turner. Convolutional conditional neural processes. In International Conference on Learning Representations, 2020.

[20] Noah Hollmann, Samuel Müller, Lennart Purucker, Arjun Krishnakumar, Max Körfer, Shi Bin Hoo, Robin Tibor Schirrmeister, and Frank Hutter. Accurate predictions on small data with a tabular foundation model. 637(8045):319–326. Publisher: Nature Publishing Group.

[21] Junfeng Hu, Yuxuan Liang, Zhencheng Fan, Hongyang Chen, Yu Zheng, and Roger Zimmermann. Graph neural processes for spatio-temporal extrapolation. In Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, KDD ’23, pages 752–763. ACM, August 2023.

[22] Andrew Jaegle, Felix Gimeno, Andy Brock, Oriol Vinyals, Andrew Zisserman, and Joao Carreira. Perceiver: General perception with iterative attention. In Marina Meila and Tong Zhang, editors, Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pages 4651–4664. PMLR, 18–24 Jul 2021.

[23] Atharv Jain, Seiji Shaw, and Nicholas Roy. Learning attentive neural processes for planning with pushing actions, 2025.

[24] Nikita Kitaev, Lukasz Kaiser, and Anselm Levskaya. Reformer: The efficient transformer. In International Conference on Learning Representations, 2020.

[25] Achim Klenke. Probability Theory: A Comprehensive Course. Universitext. Springer Nature, third edition, 2020.

[26] Juho Lee, Yoonho Lee, Jungtaek Kim, Adam Kosiorek, Seungjin Choi, and Yee Whye Teh. Set transformer: A framework for attention-based permutation-invariant neural networks. In Proceedings of the 36th International Conference on Machine Learning, pages 3744–3753, 2019.

[27] Guiye Li and Guofeng Cao. Neural process for uncertainty-aware geospatial modeling. In Proceedings of the 7th ACM SIGSPATIAL International Workshop on AI for Geographic Knowledge Discovery, GeoAI ’24, pages 106–109. Association for Computing Machinery, 2024.

[28] E. A. Nadaraya. On estimating regression. Theory of Probability & Its Applications, 9(1):141–142, 1964.

[29] Tung Nguyen and Aditya Grover. Transformer neural processes: Uncertainty-aware meta learning via sequence modeling. arXiv preprint arXiv:2207.04179, 2022.

[30] Du Phan, Neeraj Pradhan, and Martin Jankowiak. Composable effects for flexible and accelerated probabilistic programming in NumPyro. arXiv preprint arXiv:1912.11554, 2019.

[31] Ofir Press, Noah Smith, and Mike Lewis. Train short, test long: Attention with linear biases enables input length extrapolation. In International Conference on Learning Representations, 2022.

[32] Markus N. Rabe and Charles Staats. Self-attention does not need $O(n^2)$ memory, 2022.

[33] Jay Shah, Ganesh Bikshandi, Ying Zhang, Vijay Thakkar, Pradeep Ramani, and Tri Dao. FlashAttention-3: Fast and accurate attention with asynchrony and low-precision. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024.

[34] Zhuoran Shen, Mingyuan Zhang, Haiyu Zhao, Shuai Yi, and Hongsheng Li. Efficient attention: Attention with linear complexities. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pages 3531–3539, 2021.

[35] Yi Tay, Dara Bahri, Liu Yang, Donald Metzler, and Da-Cheng Juan. Sparse Sinkhorn attention, 2020.

[36] Waldo R. Tobler. A computer movie simulating urban growth in the Detroit region. Economic Geography, 46(sup1):234–240, 1970.

[37] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In I. Guyon, U. Von Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017.

[38] A. Vaughan, W. Tebbutt, J. S. Hosking, and R. E. Turner. Convolutional conditional neural processes for local climate downscaling. Geoscientific Model Development, 15(1):251–268, 2022.

[39] Anna Vaughan, Stratis Markou, Will Tebbutt, James Requeima, Wessel P. Bruinsma, Tom R. Andersson, Michael Herzog, Nicholas D. Lane, Matthew Chantry, J. Scott Hosking, and Richard E. Turner. Aardvark weather: end-to-end data-driven weather forecasting, 2024.

[40] Sinong Wang, Belinda Z. Li, Madian Khabsa, Han Fang, and Hao Ma. Linformer: Self-attention with linear complexity, 2020.

[41] Chengxuan Ying, Tianle Cai, Shengjie Luo, Shuxin Zheng, Guolin Ke, Di He, Yanming Shen, and Tie-Yan Liu. Do transformers really perform badly for graph representation? In Thirty-Fifth Conference on Neural Information Processing Systems, 2021.

[42] Chengxuan Ying, Guolin Ke, Di He, and Tie-Yan Liu. Lazyformer: Self attention with lazy update, 2021.

[43] Manzil Zaheer, Guru Guruganesh, Kumar Avinava Dubey, Joshua Ainslie, Chris Alberti, Santiago Ontanon, Philip Pham, Anirudh Ravula, Qifan Wang, Li Yang, et al. Big bird: Transformers for longer sequences. Advances in Neural Information Processing Systems, 33, 2020.

A Model Parameterizations

ConvCNP: We use the “off-grid” version of ConvCNP since we trai...

