MeshTok: Efficient Multi-Scale Tokenization for Scalable PDE Transformers
Pith reviewed 2026-06-28 07:04 UTC · model grok-4.3
The pith
MeshTok generates heterogeneous multiscale tokens by refining sharp regions on a fixed grid so a single Transformer sequence can handle both global context and local PDE details more efficiently than uniform patches.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
MeshTok produces a heterogeneous collection of multiscale tokens on a fixed simulation grid by refining regions with sharp gradients or multiscale structures, then processes the entire collection inside one unified Transformer sequence; this targeted allocation of tokens improves the efficiency-accuracy trade-off over uniform spatial partitions across several PDE benchmarks.
What carries the argument
Heterogeneous multiscale tokens generated by selective refinement on a fixed grid and fed into a single Transformer sequence.
If this is right
- PDE solutions with localized sharp features can be modeled with fewer total tokens while preserving accuracy.
- The same Transformer backbone works for both smooth and multiscale problems without redesign.
- Token count grows only where needed, offering a practical way to scale to larger domains.
- The method supplies an inductive bias that favors physically informative regions rather than uniform effort.
Where Pith is reading between the lines
- The fixed-grid constraint may limit applicability to problems where the underlying mesh itself must move or deform.
- Extending the refinement criterion beyond gradient magnitude to other indicators such as residual error could further improve results.
- Because tokens remain on a fixed underlying grid, post-processing to recover a continuous field may require additional interpolation steps not detailed in the work.
Load-bearing premise
A standard Transformer can process the mixed-size tokens from the adaptive scheme without extra architectural machinery and still extract both coarse global and fine local information.
What would settle it
Run the same PDE benchmarks with uniform-grid tokenization and with MeshTok; if the uniform version matches or exceeds MeshTok on accuracy per token or per FLOPs across the test set, the claimed improvement disappears.
Figures
read the original abstract
Conventional patchified Transformers operate on uniform spatial partitions, distributing computational effort evenly across the domain irrespective of local features. This inflexible tokenization scheme is inherently limited in its ability to efficiently represent and process solutions to complex PDEs. To address this, we propose MeshTok, an adaptive mesh refinement (AMR)-inspired tokenization and sequence modeling framework. This method selectively refines spatial regions exhibiting sharp gradients, transient features, or multiscale structures, generating a heterogeneous set of multiscale tokens defined on a fixed simulation grid. These tokens are processed within a unified Transformer sequence, enabling the model to simultaneously capture coarse-grained global context and fine-grained local details without requiring specialized architectural components. Although adaptive refinement moderately increases token count, it promotes a more targeted allocation of computational resources to physically informative regions, which we view as a practical inductive bias rather than a formal optimality guarantee. Experimental evaluations across multiple PDE families and benchmark datasets demonstrate that MeshTok consistently improves the efficiency-accuracy trade-off compared to uniform-grid baselines. This suggests adaptive multiscale tokenization as a scalable and generalizable design principle for neural PDE modeling. Code is available at https://github.com/SCAILab-USTC/MeshTok.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes MeshTok, an AMR-inspired adaptive tokenization scheme for PDE-solving Transformers. It generates heterogeneous multiscale tokens on a fixed simulation grid, selectively refining regions with sharp gradients or transient features, and feeds these tokens into a single unified Transformer sequence to capture both coarse global context and fine local details. The central claim is that this yields a better efficiency-accuracy trade-off than uniform-grid baselines across multiple PDE families, without requiring specialized architectural components, and the authors release public code.
Significance. If the central claim holds, MeshTok supplies a practical inductive bias for allocating compute to physically relevant regions in neural PDE models, which could improve scalability for multiscale problems. The public code release is a clear strength that enables direct verification and extension.
major comments (2)
- [Abstract, Section 3] Abstract and Section 3 (Method): The claim that the heterogeneous multiscale tokens are processed "without requiring specialized architectural components" is load-bearing for attributing any observed gains to the tokenization scheme alone. The manuscript must explicitly state (with pseudocode or architecture diagram) whether standard Transformer components—token embeddings, positional encodings, and attention—are used unmodified or whether scale-specific projections, padding, or attention masks are introduced to accommodate variable token sizes.
- [Section 4] Section 4 (Experiments): The abstract asserts "consistent improvements" in the efficiency-accuracy trade-off, yet no quantitative metrics, error bars, dataset sizes, or ablation tables are referenced in the provided text. Without these, it is impossible to assess whether the reported gains survive controls for token count or whether they are driven by the adaptive refinement itself.
minor comments (1)
- [Abstract] The abstract would benefit from a single sentence summarizing the magnitude of the reported gains (e.g., relative error reduction or FLOPs savings) even if full tables appear later.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on clarifying our architectural claims and strengthening the experimental reporting. We respond to each major comment below.
read point-by-point responses
-
Referee: [Abstract, Section 3] Abstract and Section 3 (Method): The claim that the heterogeneous multiscale tokens are processed "without requiring specialized architectural components" is load-bearing for attributing any observed gains to the tokenization scheme alone. The manuscript must explicitly state (with pseudocode or architecture diagram) whether standard Transformer components—token embeddings, positional encodings, and attention—are used unmodified or whether scale-specific projections, padding, or attention masks are introduced to accommodate variable token sizes.
Authors: We agree that the claim requires explicit support. In the revision we will add both an architecture diagram and pseudocode to Section 3. These will show that (i) all tokens receive the same linear embedding projection, (ii) positional encodings are computed from token-center coordinates using the standard sinusoidal formulation, and (iii) a vanilla multi-head self-attention layer is applied to the concatenated sequence with no scale-specific projections, padding tokens, or attention masks. The only heterogeneity resides in the tokenization step itself; the Transformer treats every token identically. revision: yes
-
Referee: [Section 4] Section 4 (Experiments): The abstract asserts "consistent improvements" in the efficiency-accuracy trade-off, yet no quantitative metrics, error bars, dataset sizes, or ablation tables are referenced in the provided text. Without these, it is impossible to assess whether the reported gains survive controls for token count or whether they are driven by the adaptive refinement itself.
Authors: Section 4 of the full manuscript already contains the requested elements: relative L2 errors with standard deviations over five random seeds, explicit dataset cardinalities, and ablations that match total token count between MeshTok and uniform baselines. We will revise the abstract to cite these results directly (e.g., “12 % lower relative error at matched FLOPs, Table 3”) so that the quantitative support is visible without reading the full experimental section. revision: partial
Circularity Check
No circularity: empirical design choice evaluated on benchmarks
full rationale
The paper proposes MeshTok as an AMR-inspired adaptive tokenization method for PDE Transformers. It frames the approach as a practical inductive bias for better efficiency-accuracy trade-off, demonstrates gains via experiments across PDE families and datasets, and releases public code. No derivation chain, first-principles predictions, fitted parameters renamed as outputs, or load-bearing self-citations exist. The central claim rests on empirical results rather than any reduction to inputs by construction.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Adaptive refinement moderately increases token count but promotes targeted allocation as a practical inductive bias rather than a formal optimality guarantee.
Reference graph
Works this paper leans on
-
[1]
Journal of Computational Physics , keywords =
URL https://api.semanticscholar. org/CorpusID:119491298. Bar-Sinai, Y ., Hoyer, S., Hickey, J., and Brenner, M. P. Learning data-driven discretizations for partial differen- tial equations.Proceedings of the National Academy of Sciences, 116(31):15344–15349, 2019. Bengio, Y ., Ducharme, R., and Vincent, P. A neural prob- abilistic language model. InProcee...
-
[2]
org/CorpusID:218971783
URL https://api.semanticscholar. org/CorpusID:218971783. Cao, S. Choose a transformer: Fourier or galerkin. In Ranzato, M., Beygelzimer, A., Dauphin, Y ., Liang, P., and Vaughan, J. W. (eds.),Advances in Neural Information Processing Systems, vol- ume 34, pp. 24924–24940. Curran Associates, Inc.,
-
[3]
In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)
URL https://proceedings.neurips. cc/paper_files/paper/2021/file/ d0921d442ee91b896ad95059d13df618-Paper. pdf. Chen, C.-F. R., Fan, Q., and Panda, R. Crossvit: Cross- attention multi-scale vision transformer for image clas- sification. In2021 IEEE/CVF International Conference on Computer Vision (ICCV), pp. 347–356, 2021. doi: 10.1109/ICCV48922.2021.00041. ...
-
[4]
URL https://www.sciencedirect.com/ science/article/pii/S002199912300476X
doi: https://doi.org/10.1016/j.jcp.2023.112381. URL https://www.sciencedirect.com/ science/article/pii/S002199912300476X. Freymuth, N., Dahlinger, P., W¨urth, T., Reisch, S., K¨arger, L., and Neumann, G. Swarm reinforcement learning for adaptive mesh refinement.Advances in neural informa- tion processing systems, 36:73312–73347, 2023. Gillette, A., Keith,...
-
[5]
URL https://api.semanticscholar. org/CorpusID:13905106. Guibas, J., Mardani, M., Li, Z., Tao, A., Anandkumar, A., and Catanzaro, B. Adaptive fourier neural operators: Efficient token mixers for transformers.arXiv preprint arXiv:2111.13587, 2021. Guo, X., Li, W., and Iorio, F. Convolutional neural networks for steady flow approximation. InProceed- ings of ...
-
[6]
Kevrekidis, Lu Lu, Paris Perdikaris, Sifan Wang, and Liu Yang
URL https://proceedings.neurips. cc/paper_files/paper/2021/file/ 854d9fca60b4bd07f9bb215d59ef5561-Paper. pdf. Hao, Z., Su, C., Liu, S., Berner, J., Ying, C., Su, H., Anand- kumar, A., Song, J., and Zhu, J. Dpot: auto-regressive denoising operator transformer for large-scale pde pre- training. InProceedings of the 41st International Confer- ence on Machine...
-
[7]
cc/paper_files/paper/2020/file/ 4b21cf96d4cf612f239a6c322b10c8fe-Paper
URL https://proceedings.neurips. cc/paper_files/paper/2020/file/ 4b21cf96d4cf612f239a6c322b10c8fe-Paper. pdf. Li, Z., Kovachki, N. B., Azizzadenesheli, K., Liu, B., Bhattacharya, K., Stuart, A. M., and Anandkumar, A. Fourier neural operator for parametric partial differential equations. In9th International Conference on Learn- ing Representations, ICLR 20...
arXiv 2020
-
[8]
URL https://api.semanticscholar. org/CorpusID:14337532. Loshchilov, I. and Hutter, F. Fixing weight decay regularization in Adam.ArXiv, abs/1711.05101,
-
[9]
URL https://api.semanticscholar. org/CorpusID:3312944. Lu, L., Jin, P., and Karniadakis, G. E. Deeponet: Learning nonlinear operators for identifying differential equations based on the universal approximation theorem of opera- tors.arXiv preprint arXiv:1910.03193, 2019. Masliaev, M., Gusarov, D., Markov, I., and Hvatov, A. To- wards universal neural oper...
Pith/arXiv arXiv 1910
-
[10]
URL https://www.sciencedirect.com/ science/article/pii/S0045782524003657
doi: https://doi.org/10.1016/j.cma.2024.117109. URL https://www.sciencedirect.com/ science/article/pii/S0045782524003657. 12 MeshTok: Efficient Multi-Scale Tokenization for Scalable PDE Transformers Perez, E., Strub, F., De Vries, H., Dumoulin, V ., and Courville, A. Film: Visual reasoning with a general con- ditioning layer. InProceedings of the AAAI con...
-
[11]
URL https://www.sciencedirect.com/ science/article/pii/S0021999118307125. Rao, Y ., Zhao, W., Liu, B., Lu, J., Zhou, J., and Hsieh, C.-J. Dynamicvit: Efficient vision transformers with dynamic token sparsification.Advances in neural information processing systems, 34:13937–13949, 2021. Roohi, E. and Mahdavi, A. Shock-aware physics-guided fusion-deeponet o...
-
[12]
International Joint Conferences on Artificial Intel- ligence Organization, 8 2024. doi: 10.24963/ijcai.2024/
-
[13]
URL https://doi.org/10.24963/ijcai. 2024/573. Main Track. Wu, H., Luo, H., Wang, H., Wang, J., and Long, M. Tran- solver: A fast transformer solver for pdes on general geometries. InInternational Conference on Machine Learning, 2024. Xu, Z., Liu, J., Chen, K., Chen, Y ., Hu, Z., and Ni, B. Amr- transformer: Enabling efficient long-range interaction for co...
-
[14]
Zhu, Y ., Zabaras, N., Koutsourelakis, P.-S., and Perdikaris, P
URL https://www.sciencedirect.com/ science/article/pii/S0021999118302341. Zhu, Y ., Zabaras, N., Koutsourelakis, P.-S., and Perdikaris, P. Physics-constrained deep learning for high-dimensional surrogate modeling and uncer- tainty quantification without labeled data.Journal of Computational Physics, 394:56–81, 2019. ISSN 0021-9991. doi: https://doi.org/10...
-
[15]
14 MeshTok: Efficient Multi-Scale Tokenization for Scalable PDE Transformers A
URL https://www.sciencedirect.com/ science/article/pii/S0021999119303559. 14 MeshTok: Efficient Multi-Scale Tokenization for Scalable PDE Transformers A. Theoretical Analysis A.1. Theoretical Analysis of AMR A substantial body of prior work has demonstrated that modern neural architectures—including convolutional encoder– decoder surrogates and neural-ope...
2018
-
[16]
IfC qo = 1, then E(P2;g) =E ⋆(N(P2);g)≤ E(P 1;g)
-
[17]
Proof.We decompose the argument into four conceptual steps
More generally, for arbitraryC qo ≥1, E(P2;g)≤C qo E ⋆(N(P2);g)≤C qo E(P1;g). Proof.We decompose the argument into four conceptual steps. Since refinement never reduces the token count, we have N(P1)≤N(P 2). Therefore, the uniform partition P1 is a feasible candidate in the definition ofE ⋆(N(P2);g). By definition of the infimum, E ⋆(N(P2);g) = inf ˜P:N( ...
1978
-
[18]
The model consists of 8 stacked spectral convolution layers and operates directly on the full spatial grid, without relying on patch-based tokenization. • ViTThe Vision Transformer baseline follows a standard encoder-only Transformer architecture. It uses an embedding dimension of 512 and a feed-forward network dimension of 2048. The input field is tokeni...
arXiv 2048
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.