Scalable Spatiotemporal Inference with Biased Scan Attention Transformer Neural Processes
Pith reviewed 2026-05-19 09:55 UTC · model grok-4.3
The pith
The Biased Scan Attention Transformer Neural Process matches or exceeds the accuracy of leading models on spatiotemporal tasks while training faster and scaling to over a million points.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The Biased Scan Attention Transformer Neural Process, built from Kernel Regression Blocks, group-invariant attention biases, and memory-efficient Biased Scan Attention, matches or exceeds the accuracy of the best existing models while often training in a fraction of the time. It exhibits translation invariance that supports simultaneous learning at multiple resolutions, models processes that evolve in both space and time, supports high-dimensional fixed effects, and performs inference on over one million test points together with one hundred thousand context points in under a minute on a single 24 GB GPU.
What carries the argument
Biased Scan Attention together with group-invariant attention biases and Kernel Regression Blocks, which enforce translation invariance and enable efficient attention over large context and target sets in a neural process.
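To make the efficiency idea concrete, here is a minimal sketch of attention with an additive bias, computed by scanning over the keys in chunks while keeping a running softmax. It is a generic online-softmax construction and only an assumption about what "memory-efficient" means here; it is not taken from the paper's BSA implementation. The names `full_attention`, `scan_attention`, and `chunk` are hypothetical, and the usual 1/sqrt(d) score scaling is omitted for brevity.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def full_attention(q, keys, values, bias):
    """Reference: build every score for one query, then apply softmax."""
    scores = [dot(q, k) + b for k, b in zip(keys, bias)]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total for d in range(dim)]

def scan_attention(q, keys, values, bias, chunk=2):
    """Same output, but only `chunk` scores exist at any one time."""
    dim = len(values[0])
    run_max = -math.inf   # largest score seen so far
    run_sum = 0.0         # softmax normaliser, relative to run_max
    acc = [0.0] * dim     # weighted value sum, relative to run_max
    for start in range(0, len(keys), chunk):
        stop = start + chunk
        scores = [dot(q, k) + b for k, b in zip(keys[start:stop], bias[start:stop])]
        new_max = max(run_max, max(scores))
        rescale = math.exp(run_max - new_max)  # evaluates to 0.0 on the first chunk
        weights = [math.exp(s - new_max) for s in scores]
        run_sum = run_sum * rescale + sum(weights)
        acc = [a * rescale + sum(w * v[d] for w, v in zip(weights, values[start:stop]))
               for d, a in enumerate(acc)]
        run_max = new_max
    return [a / run_sum for a in acc]

# Hypothetical toy inputs: one query, five keys/values, one bias entry per key.
q = [0.5, -1.0]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.5], [0.3, 0.3]]
values = [[1.0], [2.0], [3.0], [4.0], [5.0]]
bias = [0.0, -0.5, -2.0, 0.1, -1.0]
dense = full_attention(q, keys, values, bias)
scanned = scan_attention(q, keys, values, bias, chunk=2)
```

The two functions agree up to floating-point rounding for any chunk size, since the rescaling step only changes the reference point of the exponentials. The scanned version trades a little extra arithmetic for a working set that no longer grows with the number of keys.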
If this is right
- Inference becomes practical on over one million test points and one hundred thousand context points in under a minute on a single GPU.
- Multi-resolution learning occurs automatically because of the built-in translation invariance.
- High-dimensional fixed effects can be included without breaking scalability.
- Space and time evolution can be modeled transparently inside the same neural process framework.
- Training often finishes in a fraction of the time needed by prior high-accuracy models.
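A back-of-the-envelope calculation suggests why the inference claim depends on never materialising the full attention scores. The sketch below counts pairwise scores using the O(n_c^2 + n_c n_t) cost quoted for a KRBlock further down this page, read here as context self-attention plus target-to-context cross-attention, which is an interpretation. It also assumes 4-byte float32 scores, a single head, and a single layer; those assumptions are illustrative and not from the paper.

```python
n_context = 100_000      # context points from the headline claim
n_target = 1_000_000     # test points from the headline claim
bytes_per_score = 4      # assumption: float32 scores

# Context self-attention pairs plus target-to-context cross-attention pairs.
n_scores = n_context**2 + n_context * n_target
score_bytes = n_scores * bytes_per_score
gpu_bytes = 24 * 10**9   # 24 GB card, counted in decimal gigabytes

print(f"pairwise scores: {n_scores:.2e}")
print(f"dense score memory: {score_bytes / 1e9:.0f} GB vs {gpu_bytes / 1e9:.0f} GB on the GPU")
```

Under these assumptions the dense score matrices alone would need about 440 GB, roughly 18 times the card's memory, so the scores have to be produced and consumed in pieces.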
Where Pith is reading between the lines
- The same bias technique could be added to other attention-based models to improve their spatial awareness.
- Domains such as robotics and geology that already use large spatiotemporal datasets may gain from the reduced training time.
- Testing on processes that lack translation invariance would map the precise limits of the architecture's strengths.
- Hybrid combinations with existing Gaussian process approximations might produce even more flexible stochastic models.
Load-bearing premise
The processes being modeled are fully or partially translation-invariant, so that the group-invariant biases and multi-resolution learning deliver their reported advantages.
What would settle it
The central claim would be falsified by a direct comparison, on data generated from clearly non-translation-invariant processes, in which the new model loses its accuracy edge or its speed advantage over standard neural processes.
Original abstract
Neural Processes (NPs) are a rapidly evolving class of models designed to directly model the posterior predictive distribution of stochastic processes. While early architectures were developed primarily as a scalable alternative to Gaussian Processes (GPs), modern NPs tackle far more complex and data-hungry applications spanning geology, epidemiology, climate, and robotics. These applications have placed increasing pressure on the scalability of these models, with many architectures compromising accuracy for scalability. In this paper, we demonstrate that this trade-off is often unnecessary, particularly when modeling fully or partially translation-invariant processes. We propose a versatile new architecture, the Biased Scan Attention Transformer Neural Process (BSA-TNP), which introduces Kernel Regression Blocks (KRBlocks), group-invariant attention biases, and memory-efficient Biased Scan Attention (BSA). BSA-TNP is able to: (1) match or exceed the accuracy of the best models while often training in a fraction of the time, (2) exhibit translation invariance, enabling learning at multiple resolutions simultaneously, (3) transparently model processes that evolve in both space and time, (4) support high-dimensional fixed effects, and (5) scale gracefully, running inference on over 1M test points and 100K context points in under a minute on a single 24GB GPU. Code is provided as part of the `dl4bi` package.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces the Biased Scan Attention Transformer Neural Process (BSA-TNP), a Neural Process architecture that incorporates Kernel Regression Blocks (KRBlocks), group-invariant attention biases, and memory-efficient Biased Scan Attention (BSA). It claims to match or exceed the accuracy of leading models while training faster, exhibit translation invariance that enables simultaneous multi-resolution learning, transparently model processes evolving in both space and time, support high-dimensional fixed effects, and scale inference to over 1M test points and 100K context points in under a minute on a single 24GB GPU. Code is released in the dl4bi package.
Significance. If the central claims hold, the work would be a meaningful contribution to scalable spatiotemporal modeling in Neural Processes, potentially reducing the accuracy-scalability trade-off for translation-invariant processes common in climate, epidemiology, and robotics applications. The release of code supports reproducibility and is a clear strength.
major comments (2)
- [Methods (architecture description)] The claim that BSA-TNP exhibits translation invariance (enabling multi-resolution learning) rests on the group-invariant attention biases in the KRBlocks and BSA mechanism, yet no derivation or proof is provided showing that scan ordering preserves equivariance under combined space-time translations. This is load-bearing for claims (2) and (5) in the abstract.
- [Experiments] The abstract asserts strong performance numbers (accuracy matching or exceeding baselines, training in a fraction of the time, scaling to 1M+ points), but the manuscript provides insufficient experimental details, specific baselines, error bars, or ablation studies on the invariance and multi-resolution properties to support these.
minor comments (2)
- Notation for the biased scan attention and kernel regression blocks could be introduced more gradually with a small illustrative example to aid readers new to scan-based attention.
- Figure captions should explicitly state the number of runs or seeds used for reported metrics.
Simulated Author's Rebuttal
We thank the referee for their constructive feedback and for identifying areas where additional rigor and detail would strengthen the manuscript. We address each major comment below and commit to revisions that directly respond to the concerns.
Point-by-point responses
- Referee: [Methods (architecture description)] The claim that BSA-TNP exhibits translation invariance (enabling multi-resolution learning) rests on the group-invariant attention biases in the KRBlocks and BSA mechanism, yet no derivation or proof is provided showing that scan ordering preserves equivariance under combined space-time translations. This is load-bearing for claims (2) and (5) in the abstract.
  Authors: We agree that a formal derivation is needed to rigorously establish that the biased scan ordering preserves the desired equivariance properties under space-time translations. In the revised manuscript we will add an appendix containing a step-by-step proof that the combination of group-invariant attention biases and the fixed but translation-equivariant scan ordering maintains the required invariance. This addition will directly support claims (2) and (5).
  Revision: yes
- Referee: [Experiments] The abstract asserts strong performance numbers (accuracy matching or exceeding baselines, training in a fraction of the time, scaling to 1M+ points), but the manuscript provides insufficient experimental details, specific baselines, error bars, or ablation studies on the invariance and multi-resolution properties to support these.
  Authors: We acknowledge that the current experimental section would benefit from greater transparency. In the revision we will expand the experiments to: (i) list all baselines with precise citations and implementation details, (ii) report error bars from at least five independent runs with different random seeds, (iii) include dedicated ablation studies that isolate the contribution of the translation-invariance mechanisms and the multi-resolution training regime, and (iv) provide additional hardware, timing, and dataset statistics for the large-scale inference experiments. These changes will better substantiate the performance claims in the abstract.
  Revision: yes
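The error-bar reporting promised in (ii) could be as simple as the sketch below, which summarises one metric across seeds as a mean and a standard error. The function name `summarize_runs` and the five scores are hypothetical placeholders, not results from the paper.

```python
import math
import statistics

def summarize_runs(scores):
    """Return (mean, standard error of the mean) over independent runs."""
    if len(scores) < 2:
        raise ValueError("need at least two runs to estimate spread")
    mean = statistics.fmean(scores)
    # statistics.stdev is the sample standard deviation (n - 1 denominator).
    sem = statistics.stdev(scores) / math.sqrt(len(scores))
    return mean, sem

# Placeholder metric values for five seeds; not taken from the paper.
placeholder_scores = [0.50, 0.52, 0.48, 0.51, 0.49]
mean, sem = summarize_runs(placeholder_scores)
print(f"{mean:.3f} ± {sem:.3f} (n={len(placeholder_scores)})")
```

Stating the number of runs next to each figure, as the referee's minor comment requests, lets readers judge how much weight the interval can bear.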
Circularity Check
No circularity: architecture claims rest on independent design choices, not self-definition or fitted inputs.
The abstract and design summary present KRBlocks, group-invariant attention biases, and Biased Scan Attention as explicit architectural innovations whose properties (translation invariance, multi-resolution learning, spatiotemporal modeling) are asserted to follow from the construction rather than being presupposed by it. No equations, predictions, or central claims are shown to reduce by definition to fitted parameters or prior self-citations; performance and scaling results are framed as empirical outcomes. The derivation chain therefore remains self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: target processes are fully or partially translation-invariant.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Group-invariant attention biases... RBF-networks: B(h)_ij = sum_f a_f exp(-b_f ||q_i^omega - k_j^omega||^2) (Eq. 4); Theorem 2 on G-invariance of BSA-TNP
- IndisputableMonolith/Foundation/DimensionForcing.lean · alexander_duality_circle_linking
  Relation between the paper passage and the cited Recognition theorem: unclear.
  KRBlock... iterative kernel regression... O(n_c^2 + n_c n_t) complexity
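The bias in the quoted Eq. 4 is a sum of radial basis functions of the squared distance between two vectors. A minimal sketch follows, assuming q_i^omega and k_j^omega can be treated as plain coordinate vectors and that a_f and b_f are per-basis amplitudes and decay rates. The function name `rbf_bias` and all parameter values are hypothetical.

```python
import math

def rbf_bias(q_loc, k_loc, amplitudes, rates):
    """B_ij = sum_f a_f * exp(-b_f * ||q_loc - k_loc||^2) for one (i, j) pair."""
    sq_dist = sum((q - k) ** 2 for q, k in zip(q_loc, k_loc))
    return sum(a * math.exp(-b * sq_dist) for a, b in zip(amplitudes, rates))

amplitudes = [1.0, 0.5]   # hypothetical a_f
rates = [0.5, 4.0]        # hypothetical b_f
queries = [[0.0, 0.0], [1.0, 2.0]]
keys = [[0.0, 0.0], [3.0, 0.0], [1.0, 2.5]]

# Bias matrix with one row per query location and one column per key location.
bias = [[rbf_bias(q, k, amplitudes, rates) for k in keys] for q in queries]
```

Because the bias depends on the two locations only through their difference, adding the same offset to every location leaves each entry unchanged, which is the property the translation-invariance argument relies on.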
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 1 Pith paper
- Spectral Transformer Neural Processes: STNPs extend TNPs with a spectral aggregator that estimates context spectra, forms spectral mixtures, and injects task-adaptive frequency features to better handle periodicity.
A Model Parameterizations
ConvCNP: We use the “off-grid” version of ConvCNP since we trai...
- for µ-almost-every x, f(·, x), g(·, x) are probability measures,
- for all A, f(A, ·), g(A, ·) are measurable functions.
If for all A, for µ-almost-every x, f(A, x) = g(A, x), then f(·, x) = g(·, x) for µ-almost-every x.
Proof. First consider sets of the form A = (a_1, b_1) × ... × (a_k, b_k) with a_i, b_i ∈ Q for all i = 1, ..., k, that is, open rectangles in R^k with rational coordinates, and enumerate this countable ...
- By the assumption of the theorem, Embed_φ is G-invariant in d_c, d_t.
- The kernels used in the attention bias are G-invariant. As no other calculation within the attention mechanism involves d_{c,t}, we get that BSA is G-invariant. Thus, BSA(qs, dq, ks, dk) = BSA(qs, g · dq, ks, g · dk). Consequently, e'_c = BSA(e_c, g · d_c, e_c, g · d_c) = BSA(e_c, d_c, e_c, d_c) and e'_t = BSA(e_t, g · d_t, e_c, g · d_c) = BSA(e_t, d_t, e_c, d_c). As a...
- The projection head only takes the encoding e'_t output by the final KRBlock, and is thus agnostic to d_c, d_t.
By the above, BSA-TNP consists only of G-invariant operations: Embed_φ, followed by an arbitrary number of KRBlocks, and the projection head, and therefore stacking these operations results in a G-invariant model in d_c, d_t.
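The identity BSA(qs, dq, ks, dk) = BSA(qs, g · dq, ks, g · dk) can be spot-checked numerically on a toy stand-in. The sketch below is not the paper's BSA: `toy_bsa` is a hypothetical dense attention that reuses the keys as values and adds a single RBF term of the location difference as its bias, and g is assumed to be a translation applied by the hypothetical helper `translate`.

```python
import math

def toy_bsa(qs, dq, ks, dk):
    """Dense attention whose bias depends only on dq[i] - dk[j]; keys double as values."""
    out = []
    for q, loc_q in zip(qs, dq):
        scores = []
        for k, loc_k in zip(ks, dk):
            sq_dist = sum((a - b) ** 2 for a, b in zip(loc_q, loc_k))
            scores.append(sum(a * b for a, b in zip(q, k)) + math.exp(-sq_dist))
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        out.append([sum(w * k[d] for w, k in zip(weights, ks)) / total
                    for d in range(len(ks[0]))])
    return out

def translate(locs, shift):
    """Apply the group element g: add the same shift to every location."""
    return [[x + s for x, s in zip(loc, shift)] for loc in locs]

# Hypothetical toy embeddings (qs, ks) and locations (dq, dk).
qs = [[0.2, -0.4], [1.0, 0.5]]
ks = [[0.1, 0.3], [-0.7, 0.9], [0.5, 0.5]]
dq = [[0.0, 0.0], [1.0, 2.0]]
dk = [[0.5, 0.5], [2.0, -1.0], [1.0, 1.0]]
shift = [10.0, -3.0]

base = toy_bsa(qs, dq, ks, dk)
moved = toy_bsa(qs, translate(dq, shift), ks, translate(dk, shift))
max_gap = max(abs(a - b) for r1, r2 in zip(base, moved) for a, b in zip(r1, r2))
```

Here `max_gap` stays at the level of floating-point rounding because the shift cancels inside every location difference. Shifting only the query locations would change the bias and, in general, the output, so a check of this kind only supports the invariance argument when the same g is applied to both sets of locations.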