pith. machine review for the scientific record. sign in

arxiv: 2602.20223 · v3 · submitted 2026-02-23 · 💻 cs.LG · cs.AI

Recognition: 1 theorem link

· Lean Theorem

MultiModalPFN: Extending Prior-Data Fitted Networks for Multimodal Tabular Learning

Authors on Pith no claims yet

Pith reviewed 2026-05-15 20:08 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords multimodal learningtabular learningTabPFNprior-data fitted networksmodality projectorscross-attentionfoundation models
0
0 comments X

The pith

MMPFN extends TabPFN to handle images and text with tabular data through embedding projectors.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper proposes MultiModalPFN to extend the TabPFN foundation model for multimodal tabular learning. It adds components to incorporate non-tabular modalities like images and text by converting their embeddings into a form the model can use directly. A sympathetic reader would care because many practical datasets mix structured numbers with unstructured content, and unified models could simplify and improve results in areas such as medical diagnosis. The design centers on modality projectors that employ a multi-head gated MLP and cross-attention pooler to handle context and attention issues. Tests across datasets confirm better results than current leading approaches.

Core claim

The Multi-Modal Prior-data Fitted Network (MMPFN) extends TabPFN to handle tabular and non-tabular modalities in a unified manner through per-modality encoders, modality projectors, and pre-trained foundation models. The modality projectors transform non-tabular embeddings into tabular-compatible tokens. A multi-head gated MLP and a cross-attention pooler are introduced to extract richer context from non-tabular inputs while mitigating attention imbalance. Experiments on medical and general-purpose multimodal datasets show consistent outperformance over state-of-the-art methods and effective use of non-tabular modalities alongside tabular features.

What carries the argument

The modality projectors that bridge non-tabular and tabular data by transforming embeddings into compatible tokens using a multi-head gated MLP and cross-attention pooler.

If this is right

  • Multimodal data from healthcare and marketing can be processed more effectively in a single model.
  • Non-tabular modalities are exploited to enhance predictions without separate processing streams.
  • The framework provides a scalable way to extend prior-data fitted networks to heterogeneous inputs.
  • Attention imbalance common in multimodal setups is addressed through the pooler design.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Extensions to other modalities could follow similar projection strategies for broader foundation model use.
  • Comparisons with other multimodal architectures might show efficiency gains from the tabular token approach.
  • Applications in new domains like autonomous systems combining sensor data and images could be explored.

Load-bearing premise

The modality projectors and the multi-head gated MLP plus cross-attention pooler can reliably convert non-tabular embeddings into tabular-compatible tokens without substantial information loss or attention imbalance.

What would settle it

Running MMPFN on a multimodal dataset and finding its performance no better than or worse than competitive methods, or detecting high information loss via embedding similarity measures before and after projection.

Figures

Figures reproduced from arXiv: 2602.20223 by Chaeyoung Song, Hanul Kim, Wall Kim.

Figure 1
Figure 1. Figure 1: An overview of MMPFN. MMPFN extends TabPFN by incorporating per-modality encoders and a modality projector to extract features from non-tabular data. Newly developed components are highlighted in color, while existing ones appear in gray. Layers marked as ‘frozen’ remain fixed during fine-tuning, whereas all others are trainable. Encoded target labels are part of the training inputs but are omitted from th… view at source ↗
Figure 2
Figure 2. Figure 2: Performance on PU20 versus the number of non￾tabular tokens. (a) Image-only results with DINOv2 and an MLP baseline. (b) Multimodal results under token imbalance. The y￾axis shows accuracy and the x-axis shows the number of MGM heads. In (b), MGM+CAP uses 24 CAP heads. 2 4 8 16 32 64 0.8 0.6 0.4 0.2 PU20 Salary Tabular Image Tabular Text # Non-tabular tokens Attention mass (ratio) 2 4 8 16 32 64 128 256 [… view at source ↗
Figure 3
Figure 3. Figure 3: Token count and attention mass. Attention mass for tabular and non-tabular tokens is measured as the number of non￾tabular tokens varies, without CAP. Values are averaged over 12 self-attention layers in TabPFN. PU20 and Salary use 11 and 4 tabular tokens. The x-axis is in log scale. In [PITH_FULL_IMAGE:figures/full_fig_p006_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Cosine similarity between multimodal feature em￾beddings. Axes denote all tabular and text/image features. From left to right and top to bottom, it shows the correlations between features in the experiments on the PU20, Calc, Cloth, Mass, Petfinder, and Airbnb datasets. Cross-Modal Correlation [PITH_FULL_IMAGE:figures/full_fig_p007_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Accuracy of AutoGluon vs. MMPFN on PetFinder under different modality combinations: tabular, +text, +image, +image+text. pretrained foundation models and fine-tuning lightweight components, MMPFN achieved strong accuracy with sub￾stantially lower training costs. Across medical and general￾purpose benchmarks, it consistently outperformed competi￾tive state-of-the-art methods, scaled positively as modalities… view at source ↗
read the original abstract

Recently, TabPFN has gained attention as a foundation model for tabular data. However, it struggles to integrate heterogeneous modalities such as images and text, which are common in domains like healthcare and marketing, thereby limiting its applicability. To address this, we present the Multi-Modal Prior-data Fitted Network (MMPFN), which extends TabPFN to handle tabular and non-tabular modalities in a unified manner. MMPFN comprises per-modality encoders, modality projectors, and pre-trained foundation models. The modality projectors serve as the critical bridge, transforming non-tabular embeddings into tabular-compatible tokens for unified processing. To this end, we introduce a multi-head gated MLP and a cross-attention pooler that extract richer context from non-tabular inputs while mitigates attention imbalance issue in multimodal learning. Extensive experiments on medical and general-purpose multimodal datasets demonstrate that MMPFN consistently outperforms competitive state-of-the-art methods and effectively exploits non-tabular modalities alongside tabular features. These results highlight the promise of extending prior-data fitted networks to the multimodal setting, offering a scalable and effective framework for heterogeneous data learning. The source code is available at https://github.com/too-z/MultiModalPFN.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces MultiModalPFN (MMPFN), an extension of TabPFN for multimodal tabular learning. It adds per-modality encoders and modality projectors (a multi-head gated MLP combined with a cross-attention pooler) that map non-tabular embeddings from images and text into tabular-compatible tokens, enabling unified processing within the prior-data fitted network framework. Experiments on medical and general-purpose multimodal datasets are reported to show consistent outperformance over competitive state-of-the-art baselines while addressing attention imbalance.

Significance. If the empirical results hold under rigorous verification, the work is significant because it provides a practical, scalable route to extend prior-data fitted networks to heterogeneous multimodal settings that arise in healthcare and marketing. The release of source code supports reproducibility and allows independent validation of the modality projectors' effectiveness.

major comments (2)
  1. [§4] §4 (Experiments): the central claim of consistent outperformance over SOTA baselines is load-bearing for the paper's contribution, yet the manuscript does not report the number of independent runs, dataset sizes, or statistical significance tests (e.g., paired t-tests or Wilcoxon tests with p-values); without these, the reliability of the reported gains cannot be assessed.
  2. [§3.2] §3.2 (Modality Projectors): the description of the cross-attention pooler states that it mitigates attention imbalance, but no quantitative ablation isolates its contribution versus the multi-head gated MLP alone; this is needed to confirm that the design choice is responsible for the observed gains rather than other factors.
minor comments (2)
  1. The abstract would be strengthened by naming the specific baselines, key metrics (e.g., accuracy, AUC), and number of datasets used.
  2. [Figure 1] Figure 1 (architecture diagram): ensure all components of the modality projectors are explicitly labeled and that the flow from non-tabular embeddings to tabular tokens is visually unambiguous.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their detailed and constructive comments on our work. We address each of the major comments point by point below. We have made revisions to the manuscript to incorporate additional experimental details and ablations as suggested, which we believe strengthen the paper.

read point-by-point responses
  1. Referee: [§4] §4 (Experiments): the central claim of consistent outperformance over SOTA baselines is load-bearing for the paper's contribution, yet the manuscript does not report the number of independent runs, dataset sizes, or statistical significance tests (e.g., paired t-tests or Wilcoxon tests with p-values); without these, the reliability of the reported gains cannot be assessed.

    Authors: We agree with the referee that providing the number of independent runs, dataset sizes, and statistical significance is essential for validating the empirical claims. In the revised version, we have added a new table summarizing dataset statistics including sizes, specified that all results are averaged over 5 independent runs with different random seeds, and included p-values from paired t-tests comparing MMPFN to baselines in the main results tables (Section 4). These additions confirm the statistical significance of the observed improvements. revision: yes

  2. Referee: [§3.2] §3.2 (Modality Projectors): the description of the cross-attention pooler states that it mitigates attention imbalance, but no quantitative ablation isolates its contribution versus the multi-head gated MLP alone; this is needed to confirm that the design choice is responsible for the observed gains rather than other factors.

    Authors: We acknowledge that an ablation study isolating the cross-attention pooler would provide stronger evidence for its specific contribution. We have performed this ablation and added the results to Section 3.2 and the supplementary material. The ablation shows that adding the cross-attention pooler to the multi-head gated MLP yields an average improvement of 2.3% in AUC across the medical datasets, supporting its role in addressing attention imbalance. We have also included visualizations of attention weights to further illustrate the effect. revision: yes

Circularity Check

0 steps flagged

No significant circularity

full rationale

The manuscript describes an empirical architecture extension of TabPFN via modality projectors (multi-head gated MLP and cross-attention pooler) and reports performance on external multimodal datasets. No equations, derivations, or predictions are presented that reduce by construction to fitted inputs or self-citations. Central claims rest on experimental comparisons to SOTA baselines and released code, which constitute independent external benchmarks. Minor self-citations to TabPFN are not load-bearing for the new multimodal components.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The abstract provides no explicit free parameters, axioms, or invented entities; the work relies on standard neural network components and pre-trained foundation models without detailing any ad-hoc fitted values or new postulates.

pith-pipeline@v0.9.0 · 5513 in / 1007 out tokens · 25400 ms · 2026-05-15T20:08:25.420187+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear
    ?
    unclear

    Relation between the paper passage and the cited Recognition theorem.

    The modality projectors serve as the critical bridge, transforming non-tabular embeddings into tabular-compatible tokens... we introduce a multi-head gated MLP and a cross-attention pooler that extract richer context from non-tabular inputs while mitigates attention imbalance issue

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. TabPFN-3: Technical Report

    cs.LG 2026-05 unverdicted novelty 6.0

    TabPFN-3 delivers state-of-the-art tabular prediction performance on benchmarks up to 1M rows, is up to 20x faster than prior versions, and introduces test-time scaling that beats non-TabPFN models by hundreds of Elo points.

Reference graph

Works this paper leans on

65 extracted references · 65 canonical work pages · cited by 1 Pith paper · 2 internal anchors

  1. [1]

    Tabstar: A foundation tabular model with semantically target-aware rep- resentations

    Alan Arazi, Eilam Shapira, and Roi Reichart. Tabstar: A foundation tabular model with semantically target-aware rep- resentations. InNeurIPS, 2025. 6

  2. [2]

    Scarf: Self-supervised contrastive learning using random feature corruption

    Dara Bahri, Heinrich Jiang, Yi Tay, and Donald Metzler. Scarf: Self-supervised contrastive learning using random feature corruption. InICLR, 2022. 1

  3. [3]

    Revisiting multimodal transformers for tabular data with text fields

    Thomas Bonnier. Revisiting multimodal transformers for tabular data with text fields. InFindings of ACL, pages 1481– 1500, 2024. 2, 4, 5, 6

  4. [4]

    M3-embedding: Multi- linguality, multi-functionality, multi-granularity text embed- dings through self-knowledge distillation

    Jianlyu Chen, Shitao Xiao, Peitian Zhang, Kun Luo, Defu Lian, and Zheng Liu. M3-embedding: Multi- linguality, multi-functionality, multi-granularity text embed- dings through self-knowledge distillation. InFindings of ACL, pages 2318–2335, Bangkok, Thailand, 2024. ACL. 2

  5. [5]

    Xgboost: A scalable tree boosting system

    Tianqi Chen and Carlos Guestrin. Xgboost: A scalable tree boosting system. InACM SIGKDD Int. Conf. Knowl. Discov. Data Min., pages 785–794, 2016. 1

  6. [6]

    Uniter: Universal image-text representation learning

    Yen-Chun Chen, Linjie Li, Licheng Yu, Ahmed El Kholy, Faisal Ahmed, Zhe Gan, Yu Cheng, and Jingjing Liu. Uniter: Universal image-text representation learning. InECCV, pages 104–120, 2020. 2

  7. [7]

    Le, and Christo- pher D

    Kevin Clark, Minh-Thang Luong, Quoc V . Le, and Christo- pher D. Manning. ELECTRA: Pre-training text encoders as discriminators rather than generators. InICLR, 2020. 2, 4, 6, 3

  8. [8]

    Deep multimodal fusion of im- age and non-image data in disease diagnosis and prognosis: a review.Progress in Biomedical Engineering, 5(2):022001,

    Can Cui, Haichun Yang, Yaohong Wang, Shilin Zhao, Zuhayr Asad, Lori A Coburn, Keith T Wilson, Bennett A Landman, and Yuankai Huo. Deep multimodal fusion of im- age and non-image data in disease diagnosis and prognosis: a review.Progress in Biomedical Engineering, 5(2):022001,

  9. [9]

    Towards the development of an explainable e-commerce fake review index: An at- tribute analytics approach.European Journal of Operational Research, 317(2):382–400, 2024

    Ronnie Das, Wasim Ahmed, Kshitij Sharma, Mariann Hardey, Yogesh K Dwivedi, Ziqi Zhang, Chrysostomos Apostolidis, and Raffaele Filieri. Towards the development of an explainable e-commerce fake review index: An at- tribute analytics approach.European Journal of Operational Research, 317(2):382–400, 2024. 1

  10. [10]

    Language modeling with gated convolutional net- works

    Yann N Dauphin, Angela Fan, Michael Auli, and David Grangier. Language modeling with gated convolutional net- works. InICML, pages 933–941, 2017. 4

  11. [11]

    The road less scheduled

    Aaron Defazio, Xingyu Yang, Ahmed Khaled, Konstantin Mishchenko, Harsh Mehta, and Ashok Cutkosky. The road less scheduled. InNeurIPS, pages 9974–10007, 2024. 4

  12. [12]

    An image is worth 16x16 words: Trans- formers for image recognition at scale

    Alexey Dosovitskiy. An image is worth 16x16 words: Trans- formers for image recognition at scale. InICLR, 2021. 4

  13. [13]

    Tip: Tabular-image pre- training for multimodal classification with incomplete data

    Siyi Du, Shaoming Zheng, Yinsong Wang, Wenjia Bai, De- clan P O’Regan, and Chen Qin. Tip: Tabular-image pre- training for multimodal classification with incomplete data. InECCV, pages 478–496, 2024. 2, 5, 7, 8, 4

  14. [14]

    Stil: Semi-supervised tabular-image learning for comprehensive task-relevant information exploration in multimodal classifi- cation

    Siyi Du, Xinzhe Luo, Declan P O’Regan, and Chen Qin. Stil: Semi-supervised tabular-image learning for comprehensive task-relevant information exploration in multimodal classifi- cation. InCVPR, pages 15549–15559, 2025. 2, 7, 4

  15. [15]

    Lanistr: Multimodal learning from structured and un- structured data.arXiv preprint arXiv:2305.16556, 2023

    Sayna Ebrahimi, Sercan O Arik, Yihe Dong, and Tomas Pfis- ter. Lanistr: Multimodal learning from structured and un- structured data.arXiv preprint arXiv:2305.16556, 2023. 2

  16. [16]

    Eva: Exploring the limits of masked visual representa- tion learning at scale

    Yuxin Fang, Wen Wang, Binhui Xie, Quan Sun, Ledell Wu, Xinggang Wang, Tiejun Huang, Xinlong Wang, and Yue Cao. Eva: Exploring the limits of masked visual representa- tion learning at scale. InCVPR, pages 19358–19369, 2023. 2

  17. [17]

    Eva-02: A visual representation for neon genesis.Image and Vision Computing, 149:105171,

    Yuxin Fang, Quan Sun, Xinggang Wang, Tiejun Huang, Xin- long Wang, and Yue Cao. Eva-02: A visual representation for neon genesis.Image and Vision Computing, 149:105171,

  18. [18]

    A kernel two-sample test.The journal of machine learning research, 13(1):723– 773, 2012

    Arthur Gretton, Karsten M Borgwardt, Malte J Rasch, Bern- hard Sch¨olkopf, and Alexander Smola. A kernel two-sample test.The journal of machine learning research, 13(1):723– 773, 2012. 1

  19. [19]

    A package for learning on tabular and text data with transformers

    Ken Gu and Akshay Budhkar. A package for learning on tabular and text data with transformers. InProceedings of the Third Workshop on Multimodal Artificial Intelligence, pages 69–73, 2021. 2

  20. [20]

    Best of both worlds: Multimodal contrastive learning with tabular and imaging data

    Paul Hager, Martin J Menten, and Daniel Rueckert. Best of both worlds: Multimodal contrastive learning with tabular and imaging data. InCVPR, pages 23924–23935, 2023. 1, 2, 5, 6, 4

  21. [21]

    Deberta: Decoding-enhanced bert with disentangled attention

    Pengcheng He, Xiaodong Liu, Jianfeng Gao, and Weizhu Chen. Deberta: Decoding-enhanced bert with disentangled attention. InICLR, 2021. 2, 4

  22. [22]

    De- BERTav3: Improving deBERTa using ELECTRA-style pre- training with gradient-disentangled embedding sharing

    Pengcheng He, Jianfeng Gao, and Weizhu Chen. De- BERTav3: Improving deBERTa using ELECTRA-style pre- training with gradient-disentangled embedding sharing. In ICLR, 2023. 2

  23. [23]

    Tabllm: Few- shot classification of tabular data with large language mod- els

    Stefan Hegselmann, Alejandro Buendia, Hunter Lang, Mon- ica Agrawal, Xiaoyi Jiang, and David Sontag. Tabllm: Few- shot classification of tabular data with large language mod- els. InAISTATS, pages 5549–5581, 2023. 2

  24. [24]

    Healnet: multimodal fusion for heterogeneous biomed- ical data

    Konstantin Hemker, Nikola Simidjievski, and Mateja Jam- nik. Healnet: multimodal fusion for heterogeneous biomed- ical data. InNeurIPS, 2024. 2, 5

  25. [25]

    Tabpfn: A transformer that solves small tabular classification problems in a second

    Noah Hollmann, Samuel M ¨uller, Katharina Eggensperger, and Frank Hutter. Tabpfn: A transformer that solves small tabular classification problems in a second. InICLR, 2023. 1, 2, 5, 6

  26. [26]

    Accurate predictions on small data with a tabular foundation model.Nature, 637 (8045):319–326, 2025

    Noah Hollmann, Samuel M ¨uller, Lennart Purucker, Ar- jun Krishnakumar, Max K ¨orfer, Shi Bin Hoo, Robin Tibor Schirrmeister, and Frank Hutter. Accurate predictions on small data with a tabular foundation model.Nature, 637 (8045):319–326, 2025. 1, 2, 4, 5

  27. [27]

    Fusion of medical imag- ing and electronic health records using deep learning: a sys- tematic review and implementation guidelines.NPJ digital medicine, 3(1):136, 2020

    Shih-Cheng Huang, Anuj Pareek, Saeed Seyyedi, Imon Banerjee, and Matthew P Lungren. Fusion of medical imag- ing and electronic health records using deep learning: a sys- tematic review and implementation guidelines.NPJ digital medicine, 3(1):136, 2020. 1

  28. [28]

    Tabular insights, visual impacts: transferring expertise from tables to images

    Jun-Peng Jiang, Han-Jia Ye, Leye Wang, Yang Yang, Yuan Jiang, and De-Chuan Zhan. Tabular insights, visual impacts: transferring expertise from tables to images. InICML, 2024. 4

  29. [29]

    Multimodal tabular reasoning with privileged structured information

    Jun-Peng Jiang, Yu Xia, Hai-Long Sun, Shiyin Lu, Qing- Guo Chen, Weihua Luo, Kaifu Zhang, De-Chuan Zhan, and Han-Jia Ye. Multimodal tabular reasoning with privileged structured information. InNeurIPS, 2025. 2

  30. [30]

    Melbourne airbnb open data.kaggle.com/ datasets / tylerx / melbourne - airbnb - open - data, 2018

    Kaggle. Melbourne airbnb open data.kaggle.com/ datasets / tylerx / melbourne - airbnb - open - data, 2018. Accessed: September 24, 2025. 2, 4, 5

  31. [31]

    Women’s e-commerce clothing reviews

    Kaggle. Women’s e-commerce clothing reviews. kaggle . com / datasets / nicapotato / womens - ecommerce- clothing- reviews, 2019. Accessed: September 24, 2025. 5

  32. [32]

    Petfinder.my adoption prediction.https : //www.kaggle.com/competitions/petfinder- adoption- prediction, 2019

    Kaggle. Petfinder.my adoption prediction.https : //www.kaggle.com/competitions/petfinder- adoption- prediction, 2019. Accessed: September 24, 2025. 5

  33. [33]

    Predict the data scientist’s salary in india.kaggle

    Kaggle. Predict the data scientist’s salary in india.kaggle. com / datasets / ankitkalauni / predict - the - data-scientists-salary-in-india, 2021. Ac- cessed: September 24, 2025. 2, 5

  34. [34]

    Light- gbm: A highly efficient gradient boosting decision tree

    Guolin Ke, Qi Meng, Thomas Finley, Taifeng Wang, Wei Chen, Weidong Ma, Qiwei Ye, and Tie-Yan Liu. Light- gbm: A highly efficient gradient boosting decision tree. In NeurIPS, 2017. 1

  35. [35]

    Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models

    Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. InICML, pages 19730–19742, 2023. 2

  36. [36]

    VisualBERT: A Simple and Performant Baseline for Vision and Language

    Liunian Harold Li, Mark Yatskar, Da Yin, Cho-Jui Hsieh, and Kai-Wei Chang. Visualbert: A simple and perfor- mant baseline for vision and language.arXiv preprint arXiv:1908.03557, 2019. 2

  37. [37]

    Tab2Text - a framework for deep learning with tabular data

    Tong Lin, Jason Yan, David Jurgens, and Sabina J Tomkins. Tab2Text - a framework for deep learning with tabular data. InFindings of EMNLP, pages 12925–12935, 2024. 2

  38. [38]

    Visual instruction tuning

    Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. InNeurIPS, pages 34892–34916,

  39. [39]

    Deep transfer learning with joint adaptation net- works

    Mingsheng Long, Han Zhu, Jianmin Wang, and Michael I Jordan. Deep transfer learning with joint adaptation net- works. InICML, pages 2208–2217, 2017. 2

  40. [40]

    Decoupled weight decay regularization

    Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. InICLR, 2019. 5

  41. [41]

    Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks

    Jiasen Lu, Dhruv Batra, Devi Parikh, and Stefan Lee. Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. InNeurIPS, 2019. 2

  42. [42]

    Time: Tabpfn- integrated multimodal engine for robust tabular-image learn- ing.arXiv preprint arXiv:2506.00813, 2025

    Jiaqi Luo, Yuan Yuan, and Shixin Xu. Time: Tabpfn- integrated multimodal engine for robust tabular-image learn- ing.arXiv preprint arXiv:2506.00813, 2025. 2, 5

  43. [43]

    Towards benchmarking foundation models for tabular data with text

    Martin Mr ´az, Breenda Das, Anshul Gupta, Lennart Purucker, and Frank Hutter. Towards benchmarking foundation models for tabular data with text. InICML Workshop, 2025. 4

  44. [44]

    DINOv2: Learning robust visual features without su- pervision.Trans

    Maxime Oquab, Timoth ´ee Darcet, Th´eo Moutakanni, Huy V V o, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel HAZIZA, Francisco Massa, Alaaeldin El-Nouby, et al. DINOv2: Learning robust visual features without su- pervision.Trans. Mach. Learn. Res., 2024. 2, 4, 6, 3

  45. [45]

    Pad-ufes-20: A skin lesion dataset composed of patient data and clinical images collected from smartphones.Data in Brief, 32:106221, 2020

    Andre GC Pacheco, Gustavo R Lima, Amanda S Salo- mao, Breno Krohling, Igor P Biral, Gabriel G De Angelo, F´abio CR Alves Jr, Jos ´e GM Esgario, Alana C Simora, Pe- dro BC Castro, et al. Pad-ufes-20: A skin lesion dataset composed of patient data and clinical images collected from smartphones.Data in Brief, 32:106221, 2020. 2, 4, 5

  46. [46]

    Film: Visual reasoning with a general conditioning layer

    Ethan Perez, Florian Strub, Harm De Vries, Vincent Du- moulin, and Aaron Courville. Film: Visual reasoning with a general conditioning layer. InAAAI, 2018. 2, 7, 8

  47. [47]

    Catboost: Un- biased boosting with categorical features

    Liudmila Prokhorenkova, Gleb Gusev, Aleksandr V orobev, Anna Veronika Dorogush, and Andrey Gulin. Catboost: Un- biased boosting with categorical features. InNeurIPS, 2018. 1, 5, 6

  48. [48]

    Learn- ing transferable visual models from natural language super- vision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learn- ing transferable visual models from natural language super- vision. InICML, pages 8748–8763, 2021. 2

  49. [49]

    Sawyer-Lee, F

    R. Sawyer-Lee, F. Gimenez, A. Hoogi, and D. Rubin. Cu- rated breast imaging subset of digital database for screening mammography (cbis-ddsm). The Cancer Imaging Archive,

  50. [50]

    J ¨org Schilcher, Alva Nilsson, Oliver Andlid, and Anders Eklund. Fusion of electronic health records and radio- graphic images for a multimodal deep learning prediction model of atypical femur fractures.Computers in Biology and Medicine, 168:107704, 2024. 1

  51. [51]

    DINOv3

    Oriane Sim ´eoni, Huy V V o, Maximilian Seitzer, Federico Baldassarre, Maxime Oquab, Cijo Jose, Vasil Khalidov, Marc Szafraniec, Seungeun Yi, Micha ¨el Ramamonjisoa, et al. Dinov3.arXiv preprint arXiv:2508.10104, 2025. 2

  52. [52]

    Saint: Improved neural networks for tabular data via row attention and contrastive pre-training.arXiv preprint arXiv:2106.01342, 2021

    Gowthami Somepalli, Micah Goldblum, Avi Schwarzschild, C Bayan Bruss, and Tom Goldstein. Saint: Improved neural networks for tabular data via row attention and contrastive pre-training.arXiv preprint arXiv:2106.01342, 2021. 1

  53. [53]

    Vl-bert: Pre-training of generic visual- linguistic representations

    Weijie Su, Xizhou Zhu, Yue Cao, Bin Li, Lewei Lu, Furu Wei, and Jifeng Dai. Vl-bert: Pre-training of generic visual- linguistic representations. InICLR, 2020. 2

  54. [54]

    Mul- timodal temporal fusion transformers are good product de- mand forecasters.IEEE Trans

    Maarten Sukel, Stevan Rudinac, and Marcel Worring. Mul- timodal temporal fusion transformers are good product de- mand forecasters.IEEE Trans. Multimedia, 31(2):48–60,

  55. [55]

    How to fine-tune bert for text classification? InChina national con- ference on Chinese computational linguistics, pages 194– 206, 2019

    Chi Sun, Xipeng Qiu, Yige Xu, and Xuanjing Huang. How to fine-tune bert for text classification? InChina national con- ference on Chinese computational linguistics, pages 194– 206, 2019. 6

  56. [56]

    LXMERT: Learning cross- modality encoder representations from transformers

    Hao Tan and Mohit Bansal. LXMERT: Learning cross- modality encoder representations from transformers. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP- IJCNLP), pages 5100–5111, 2019. 2

  57. [57]

    Autogluon-multimodal (automm): Supercharging multimodal automl with foundation models

    Zhiqiang Tang, Haoyang Fang, Su Zhou, Taojiannan Yang, Zihan Zhong, Cuixiong Hu, Katrin Kirchhoff, and George Karypis. Autogluon-multimodal (automm): Supercharging multimodal automl with foundation models. InICAML, pages 15/1–35, 2024. 2, 4, 5, 6

  58. [58]

    Bag of tricks for multimodal automl with image, text, and tabular data.arXiv preprint arXiv:2412.16243, 2024

    Zhiqiang Tang, Zihan Zhong, Tong He, and Gerald Fried- land. Bag of tricks for multimodal automl with image, text, and tabular data.arXiv preprint arXiv:2412.16243, 2024. 2, 4

  59. [59]

    Zico Kolter, Louis-Philippe Morency, and Ruslan Salakhutdinov

    Yao-Hung Hubert Tsai, Shaojie Bai, Paul Pu Liang, J. Zico Kolter, Louis-Philippe Morency, and Ruslan Salakhutdinov. Multimodal transformer for unaligned multimodal language sequences. InACL, 2019. 5, 6

  60. [60]

    Smarter, better, faster, longer: A modern bidirectional en- coder for fast, memory efficient, and long context finetuning and inference

    Benjamin Warner, Antoine Chaffin, Benjamin Clavi ´e, Orion Weller, Oskar Hallstr ¨om, Said Taghadouini, Alexis Gal- lagher, Raja Biswas, Faisal Ladhak, Tom Aarsen, Grif- fin Thomas Adams, Jeremy Howard, and Iacopo Poli. Smarter, better, faster, longer: A modern bidirectional en- coder for fast, memory efficient, and long context finetuning and inference. ...

  61. [61]

    Association for Computational Linguistics. 2

  62. [62]

    Tensor fusion network for mul- timodal sentiment analysis

    Amir Zadeh, Minghai Chen, Soujanya Poria, Erik Cambria, and Louis-Philippe Morency. Tensor fusion network for mul- timodal sentiment analysis. InEMNLP, pages 1103–1114,

  63. [63]

    Deep mul- timodal data fusion.ACM Comput

    Fei Zhao, Chengcui Zhang, and Baocheng Geng. Deep mul- timodal data fusion.ACM Comput. Surv., 56(9):1–36, 2024. 1

  64. [64]

    ibot: Image bert pre-training with online tokenizer

    Jinghao Zhou, Chen Wei, Huiyu Wang, Wei Shen, Cihang Xie, Alan Yuille, and Tao Kong. ibot: Image bert pre-training with online tokenizer. InICLR, 2022. 2

  65. [65]

    Cross-modal domain adaptation in brain disease diagnosis: Maximum mean discrepancy-based convolutional neural networks

    Xuran Zhu. Cross-modal domain adaptation in brain disease diagnosis: Maximum mean discrepancy-based convolutional neural networks. InInt. Conf. Commun., Inf. Syst. Comput. Eng., pages 1515–1519, 2024. 1 MultiModalPFN: Extending Prior-Data Fitted Networks for Multimodal Tabular Learning Supplementary Material S1. Additional Analysis Attention Imbalance.Fig...