
arxiv: 2606.04485 · v2 · pith:MXS674B6new · submitted 2026-06-03 · 💻 cs.LG

LimiX-2M: Mitigating Low-Rank Collapse and Attention Bottlenecks in Tabular Foundation Models

Pith reviewed 2026-06-28 06:58 UTC · model grok-4.3

classification 💻 cs.LG
keywords: tabular foundation models, RaBEL tokenization, low-rank collapse, attention bottlenecks, S->N->F reordering, model efficiency, TabPFN

The pith

RaBEL tokenization and S->N->F reordering enable a 2M-parameter tabular model to outperform larger baselines at lower training and inference cost.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Standard affine scalar tokenization in tabular foundation models channels each feature through a one-dimensional path, producing weak value sensitivity and redundant hidden states in early layers. RaBEL expands scalars into compact localized RBF features to raise effective rank and conditioning, while the S->N->F block reorders computation to aggregate cross-sample context before feature mixing and applies attention pooling. These changes produce LimiX-2M, a 2M-parameter model that beats larger TabPFN-v2 and TabICL baselines on common tabular benchmarks at reduced training and inference cost. The results indicate that value-aware tokenization and readout-aligned routing can improve the accuracy-efficiency trade-off. Readers care because the approach offers a path to stronger tabular models without proportional increases in scale or compute.

Core claim

Low-rank collapse and attention bottlenecks in TFMs arise from affine scalar tokenization that injects value variation through an essentially one-dimensional channel and from routing that fails to align with readout. RaBEL expands each scalar into compact localized RBF features, optionally exponent-gated, to improve conditioning and shallow-layer effective rank. The reordered bidirectional block S->N->F aggregates cross-sample context before feature mixing and uses attention pooling. Together these produce LimiX-2M, a 2M-parameter model that outperforms larger TabPFN-v2 and TabICL baselines on widely used tabular benchmarks while reducing training and inference costs.
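
The one-dimensional-channel argument can be checked numerically. The sketch below is an illustration only, not the paper's RaBEL implementation: the function names, the evenly spaced centers, the shared bandwidth `gamma`, and the random projection are all assumptions, and the optional exponent gating is omitted. It compares the rank of the token matrix one feature produces across samples under affine tokenization (at most 2 by construction, since every token is `x * w + b`) and under a localized RBF expansion followed by a linear projection (up to the number of centers `K`).

```python
import numpy as np

def affine_tokens(x, w, b):
    """Affine scalar tokenization: each value x_i maps to x_i * w + b."""
    return x[:, None] * w[None, :] + b[None, :]          # (n, p)

def rbf_tokens(x, centers, gamma, proj):
    """Hypothetical RBF tokenizer: localized Gaussian bumps, then a projection."""
    phi = np.exp(-gamma * (x[:, None] - centers[None, :]) ** 2)  # (n, K)
    return phi @ proj                                     # (n, p)

rng = np.random.default_rng(0)
n, p, K = 256, 32, 8                     # samples, hidden width, RBF centers (assumed)
x = rng.standard_normal(n)               # one standardized feature column
w, b = rng.standard_normal(p), rng.standard_normal(p)
centers = np.linspace(-2.0, 2.0, K)      # assumed: evenly spaced centers
proj = rng.standard_normal((K, p))       # assumed: random stand-in for learned weights

affine_rank = np.linalg.matrix_rank(affine_tokens(x, w, b))
rbf_rank = np.linalg.matrix_rank(rbf_tokens(x, centers, gamma=4.0, proj=proj))
print("affine token rank:", affine_rank)
print("RBF token rank:   ", rbf_rank)
```

The affine bound holds for any `w` and `b`, which is why adding feature IDs or positional signals does not help: they shift the tokens but add no within-feature degrees of freedom.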

What carries the argument

RaBEL tokenization expands each scalar into compact localized RBF features (optionally exponent-gated) to raise value sensitivity and effective rank; paired with S->N->F reordered bidirectional blocks that aggregate cross-sample context before feature mixing and apply attention pooling.
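
A minimal sketch of the routing order, under stated assumptions, may help fix ideas. This summary does not define the letters in S->N->F; the code below assumes S is attention across samples, N is a position-wise feed-forward sublayer, and F is attention across features. It also drops learned query/key/value projections, normalization, and any train/test masking, so every function here is hypothetical rather than the authors' block.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(h):
    """Single-head attention over the second-to-last axis, no learned weights."""
    scores = h @ np.swapaxes(h, -1, -2) / np.sqrt(h.shape[-1])
    return softmax(scores, axis=-1) @ h

def sample_attention(h):
    # S (assumed): for each feature, attend across samples. h: (samples, features, dim)
    ht = np.swapaxes(h, 0, 1)
    return h + np.swapaxes(self_attention(ht), 0, 1)

def feed_forward(h, w1, w2):
    # N (assumed): position-wise two-layer MLP with a residual connection.
    return h + np.maximum(h @ w1, 0.0) @ w2

def feature_attention(h):
    # F (assumed): within each sample, attend across features.
    return h + self_attention(h)

def snf_block(h, w1, w2):
    # Cross-sample context is aggregated first; feature mixing comes last.
    return feature_attention(feed_forward(sample_attention(h), w1, w2))

def attention_pool(h, query):
    # Readout: weight features by similarity to a query vector, then average.
    weights = softmax(h @ query / np.sqrt(h.shape[-1]), axis=-1)  # (samples, features)
    return np.einsum("sf,sfd->sd", weights, h)

rng = np.random.default_rng(0)
h = rng.standard_normal((16, 5, 8))       # 16 samples, 5 features, width 8
w1 = 0.1 * rng.standard_normal((8, 16))
w2 = 0.1 * rng.standard_normal((16, 8))
query = rng.standard_normal(8)

out = snf_block(h, w1, w2)
pooled = attention_pool(out, query)
print(out.shape, pooled.shape)
```

Placing feature mixing last means the step that feeds the pooled readout operates on tokens that already carry cross-sample context, which is one reading of "readout-aligned routing".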

If this is right

  • LimiX-2M outperforms larger TabPFN-v2 and TabICL baselines on widely used tabular benchmarks.
  • Training and inference costs decrease relative to the larger baselines.
  • Shallow-layer effective rank rises because each feature now carries richer localized value variation.
  • Redundant hidden states decrease once computation is reordered to align with readout.
  • Value-aware tokenization and readout-aligned routing become key levers for the accuracy-efficiency trade-off in TFMs.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same tokenization change could be tested on time-series or graph data where scalar inputs similarly constrain early-layer expressivity.
  • Combining these components with larger parameter counts might produce further gains beyond the 2M scale demonstrated.
  • Designs for other foundation models could adopt localized RBF-style expansions when feature values are the bottleneck rather than sequence length.

Load-bearing premise

Low-rank collapse and attention bottlenecks are the main performance limiters in current TFMs and are directly resolved by RaBEL tokenization plus S->N->F reordering without other unaccounted factors driving the gains.

What would settle it

A controlled ablation where standard tokenization plus matched compute matches or exceeds LimiX-2M accuracy, or where removing the RBF expansion or the S->N->F reordering eliminates the reported gains.

Figures

Figures reproduced from arXiv: 2606.04485 by Chun Yuan, Gang Ren, Han Yu, Hao Yuan, Li Mao, Mingchao Hao, Peng Cui, Xingxuan Zhang, Yuanrui Wang, Yunjia Zhang.

Figure 1. Rank comparison across layers of LimiX-2M and TabPFN-v2. The metrics Rank@99% and Rank@95% denote the minimum number of SVD components required to capture 99% or 95% of the energy, as measured by singular values. The current embedding strategy adopted by TabPFN-v2 simply maps each cell, i.e. each scalar, to the high-dimensional hidden space via a 1×p linear projection. Such a straightforward strategy implici…
Figure 2. Visualization of attention scores. (a) DAGs of the generated datasets. (b) Feature attention heatmap of FSN. (c) Feature attention heatmap of SNF. While FSN is dominated by self-attention, SNF demonstrates a broader attentional span that effectively targets neighboring features. 5.1.2. Transformer-based methods: We integrate different embedding methods into a 2M-parameter transformer backbone with the sam…
Figure 3. Comprehensive ablation studies on the TabArena (top) and TabZilla (bottom) benchmarks. The visualization is divided into two distinct analyses: the Module Ablation (left) demonstrates the incremental performance gains (AUC) attributed to the integration of RaBEL and RBA modules into the baseline; the Structural Ablation (right) evaluates the impact of different topological orderings of components (Feature int…
Figure 4. Cumulative Percentage of Optimal Achievements of Baseline2M on TabArena. C.7. Fine-grained dataset-level comparison: We conducted a fine-grained dataset-level comparison on the TabArena (classification) and CTR23 (regression) benchmarks to evaluate the number of datasets where each model achieves the leading performance. Our comparison set includes established baselines such as TabPFN-v2, TabICL, Mitra, XGB…
Figure 5. Cumulative Percentage of Optimal Achievements of LimiX-2M on TabArena.
Figure 6. Cumulative Percentage of Optimal Achievements of Baseline2M on CTR23.
Figure 7. Cumulative Percentage of Optimal Achievements of LimiX-2M on CTR23.
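
The Rank@99% and Rank@95% metrics described in the Figure 1 caption can be sketched as follows. This is a hedged illustration, not the authors' evaluation code: the function name `rank_at` is hypothetical, and the caption does not say whether "energy" means singular values or squared singular values, so the sketch defaults to squared values and exposes a flag for the other reading. Whether hidden states are centered first is also unspecified and is not done here.

```python
import numpy as np

def rank_at(hidden, energy=0.99, squared=True):
    """Smallest number of SVD components whose cumulative energy reaches `energy`.

    hidden: 2-D array of hidden states, shape (tokens, dim).
    squared: assumed convention; True uses s**2 as energy, False uses s.
    """
    s = np.linalg.svd(hidden, compute_uv=False)   # singular values, descending
    e = s ** 2 if squared else s
    cum = np.cumsum(e) / e.sum()
    k = int(np.searchsorted(cum, energy)) + 1     # first index with cum >= energy
    return min(k, len(s))                         # guard against rounding at energy=1.0

rng = np.random.default_rng(0)
low = rng.standard_normal((128, 1)) @ rng.standard_normal((1, 64))  # rank-1 states
full = rng.standard_normal((128, 64))                               # generic states

print("rank-1 states: ", rank_at(low, 0.99), rank_at(low, 0.95))
print("generic states:", rank_at(full, 0.99), rank_at(full, 0.95))
```

A layer whose hidden states are dominated by a few directions yields a small value, which is the collapse the figure is meant to expose.
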
Abstract

Tabular foundation models (TFMs) increasingly rival tree ensembles, but their performance is often compute-inefficient: with standard affine scalar tokenization, each feature injects value variation through an essentially one-dimensional channel, and feature IDs/positional signals cannot increase within-feature value degrees of freedom, yielding weak early-layer value sensitivity and redundant hidden states. We present a unified tokenize-and-route framework for strong TFMs: RaBEL expands each scalar into compact localized RBF features (optionally exponent-gated) to improve conditioning and shallow-layer effective rank, while a reordered bidirectional block S->N->F aligns computation with the readout by aggregating cross-sample context before feature mixing and using attention pooling. Together, these changes yield LimiX-2M, a 2M-parameter model that outperforms larger TabPFN-v2 and TabICL baselines on widely used tabular benchmarks while reducing training and inference costs. These results highlight value-aware tokenization and readout-aligned routing as key levers for improving the accuracy-efficiency trade-off in TFMs. Model checkpoints and inference code are available at https://github.com/limix-ldm-ai/LimiX.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 1 minor

Summary. The manuscript introduces a tokenize-and-route framework for tabular foundation models to address low-rank collapse arising from standard affine scalar tokenization (where each feature injects variation through a one-dimensional channel) and attention bottlenecks. It proposes RaBEL tokenization, which expands each scalar into compact localized RBF features (optionally exponent-gated) to improve conditioning and shallow-layer effective rank, together with a reordered bidirectional block using S->N->F routing that aggregates cross-sample context before feature mixing and employs attention pooling to align with readout. These modifications produce the LimiX-2M model (2M parameters) claimed to outperform larger TabPFN-v2 and TabICL baselines on standard tabular benchmarks while lowering training and inference costs. Checkpoints and inference code are released.

Significance. If the empirical results hold, the work identifies value-aware tokenization and readout-aligned routing as practical levers for improving the accuracy-efficiency trade-off in TFMs. The public release of model checkpoints and inference code is a clear strength that supports reproducibility and enables independent verification or extension.

minor comments (1)
  1. Abstract: the phrase 'widely used tabular benchmarks' is used without naming the specific datasets or providing a forward reference to the experimental section or table that lists them; this reduces immediate clarity for readers assessing the scope of the superiority claim.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive summary, significance assessment, and recommendation of minor revision. No major comments appear in the provided report.

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper's central claims rest on an empirical comparison: RaBEL tokenization and S->N->F reordering are proposed as architectural modifications that increase effective rank and align with readout, with the resulting 2M-parameter model shown to outperform baselines on tabular benchmarks. No derivation chain reduces a claimed prediction or first-principles result to its own fitted inputs by construction, nor does any load-bearing step rely on self-citation of an unverified uniqueness theorem or ansatz. The abstract and provided text present the improvements as measured outcomes rather than tautological redefinitions or statistically forced predictions. The derivation is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

Based solely on the abstract, the paper introduces new architectural components whose internal design choices (number of RBF centers, gating decisions) likely involve free parameters; no explicit axioms or invented physical entities are described.

free parameters (1)
  • RBF feature count per scalar
    The number of compact localized RBF features used to expand each scalar is a design hyperparameter that affects conditioning and rank.
axioms (1)
  • [domain assumption] Reordered bidirectional attention (S->N->F) aligns computation with readout and improves cross-sample aggregation before feature mixing.
    The abstract presents this reordering as a key lever without deriving it from first principles.
invented entities (1)
  • RaBEL tokenization (no independent evidence)
    purpose: Expands scalar values into compact localized RBF features to improve early-layer effective rank.
    New method introduced by the paper; no independent evidence outside the work is provided.

pith-pipeline@v0.9.1-grok · 5766 in / 1250 out tokens · 28364 ms · 2026-06-28T06:58:32.215832+00:00 · methodology

