pith. machine review for the scientific record. sign in

arxiv: 2605.07075 · v1 · submitted 2026-05-08 · 💻 cs.LG

Recognition: no theorem link

ModelLens: Finding the Best for Your Task from Myriads of Models

Muhao Chen, Qiyao Ma, Rui Cai, Weijie Jacky Mo, Wenhui Zhu, Xiaofei Wen, Xiwen Chen, Zhe Zhao

Authors on Pith no claims yet

Pith reviewed 2026-05-11 01:52 UTC · model grok-4.3

classification 💻 cs.LG
keywords model selectionpretrained modelslatent spaceleaderboard dataperformance predictionmodel recommendationtransferabilityrouting
0
0 comments X

The pith

ModelLens learns a shared latent space from public leaderboard data to rank unseen models on unseen datasets without any target evaluations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Public leaderboard records, though incomplete and noisy, contain enough consistent patterns for a model to predict which pretrained networks will perform well on entirely new tasks and data. ModelLens embeds models, datasets, and metrics into one performance-aware latent space so that proximity in that space indicates likely success on a fresh evaluation setting. The method therefore produces ranked recommendations for any new dataset using only the learned embeddings rather than running candidate models. It beats both metadata-only baselines and methods that require expensive forward passes on the target data. Top-K pools from ModelLens also raise the accuracy of several existing model-routing systems by as much as 81 percent on QA benchmarks.

Core claim

ModelLens learns a performance-aware latent space over model–dataset–metric tuples drawn from scattered public leaderboard interactions. This space lets the system rank previously unseen models on previously unseen datasets by predicted performance without executing any candidate on the target data. On a new benchmark containing 1.62 million evaluation records across 47K models and 9.6K datasets, the approach surpasses baselines that rely solely on metadata or that must run every candidate model.

What carries the argument

A performance-aware latent space over model–dataset–metric tuples that encodes capability patterns from heterogeneous leaderboard entries.

If this is right

  • Model selection for new tasks becomes possible without any per-model inference cost on the target dataset.
  • Recommended top-K model pools raise accuracy of downstream routing methods by up to 81 percent across QA benchmarks.
  • The same latent space supports generalization checks on both text-only and vision-language tasks from recently released benchmarks.
  • Continuous emergence of new models and datasets can be handled without maintaining exhaustive per-dataset records.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Large-scale collective evaluation data may serve as a practical substitute for exhaustive per-task benchmarking.
  • Model developers could gain from publishing standardized evaluation vectors that further enrich the latent atlas.
  • Extending the tuple representation to include training data statistics or architecture descriptors might sharpen predictions for related tasks.
  • The approach suggests a path toward automated model portfolios that adapt as the open-source ecosystem grows.

Load-bearing premise

Scattered and noisy public leaderboard interactions still contain a usable collective signal about which models succeed on which kinds of tasks.

What would settle it

Collect a fresh set of recently released models and datasets absent from the training records, obtain ModelLens rankings for them, then measure the actual performance of the top-ranked models on those datasets; large disagreement between predicted and measured order would falsify the claim.

Figures

Figures reproduced from arXiv: 2605.07075 by Muhao Chen, Qiyao Ma, Rui Cai, Weijie Jacky Mo, Wenhui Zhu, Xiaofei Wen, Xiwen Chen, Zhe Zhao.

Figure 1
Figure 1. Figure 1: Model recommendation in the wild. (Left) Atlas of ∼47K models (dots) and ∼9.6K datasets (⋆) laid out by a force-directed projection of our interaction-trained ecosystem structure rather than surface-level description similarity. The dashed circle marks the example dataset MMMU. (Right) Magnified view around MMMU: our framework retrieves the top-5 candidate models in this learned space (green numbered badge… view at source ↗
Figure 2
Figure 2. Figure 2: Learned size and family priors from model–dataset interactions. (Left) Model performance [PITH_FULL_IMAGE:figures/full_fig_p009_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Case studies on unseen datasets across domains. (Left) On NGQA, different tasks favor [PITH_FULL_IMAGE:figures/full_fig_p010_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Visualization of the learned model–dataset embedding space trained on interaction data.Each [PITH_FULL_IMAGE:figures/full_fig_p017_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Visualization of the model–dataset embedding space constructed using semantic (content [PITH_FULL_IMAGE:figures/full_fig_p018_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Comparison of ablation results and feature importance analysis. [PITH_FULL_IMAGE:figures/full_fig_p028_6.png] view at source ↗
read the original abstract

The open-source model ecosystem now contains hundreds of thousands of pretrained models, yet picking the best model for a new dataset is increasingly infeasible: new models and unbenchmarked datasets emerge continuously, leaving practitioners with no prior records on either side. Existing approaches handle only fragments of this in-the-wild setting: AutoML and transferability estimation select models from small predefined pools or require expensive per-model forward passes on the target dataset, while model routing presupposes a given candidate pool. We introduce ModelLens, a unified framework for model recommendation in the wild. Our key insight is that public leaderboard interactions, though scattered and noisy, collectively trace out an implicit atlas of model capabilities across heterogeneous evaluation settings, a signal rich enough to learn from directly. By learning a performance-aware latent space over model--dataset--metric tuples, ModelLens ranks unseen models on unseen datasets without running candidates on the target dataset. On a new benchmark of 1.62M evaluation records spanning 47K models and 9.6K datasets, ModelLens surpasses baselines that either rely on metadata alone or require running each candidate on the target dataset. Its recommended Top-K pools further improve multiple representative routing methods by up to 81% across diverse QA benchmarks. Case studies on recently released benchmarks further confirm generalization to both text and vision-language tasks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces ModelLens, a framework that learns a joint performance-aware latent space over model–dataset–metric tuples extracted from public leaderboards. It claims this space enables ranking of completely unseen models on unseen datasets without any forward passes on the target data. Experiments on a newly constructed benchmark of 1.62 M records (47 K models, 9.6 K datasets) report outperformance over metadata-only and per-model-evaluation baselines, plus up to 81 % gains when the recommended Top-K pools are fed to existing routing methods; case studies on recent text and vision-language benchmarks are also presented.

Significance. If the generalization claims hold under rigorous splits, the work would provide a scalable, zero-shot model-selection primitive that exploits the collective signal in heterogeneous leaderboards, potentially reducing the computational cost of model search in the open ecosystem and improving downstream routing pipelines.

major comments (3)
  1. [Abstract and §4] Abstract and §4 (Benchmark): the reported superiority over baselines is impossible to assess without an explicit description of the train/test split construction. In particular, it is unclear whether any model–dataset pair that appears in the training tuples is allowed to appear (even with a different metric) in the test set, which would constitute leakage and undermine the central claim of generalization to unseen pairs.
  2. [Abstract] Abstract: the statement that ModelLens 'surpasses baselines' is given without any quantitative deltas, confidence intervals, or statistical significance tests. Because the central contribution is an empirical improvement on a large but noisy dataset, the absence of these numbers makes it impossible to judge whether the gains are load-bearing or within noise.
  3. [§5] §5 (Routing experiments): the claim that Top-K pools improve routing methods by up to 81 % requires the exact definition of the routing baselines, the size of the candidate pools before and after ModelLens filtering, and whether the improvement is measured on the same held-out datasets used for the main benchmark or on additional data.
minor comments (2)
  1. [Abstract] The abstract mentions '1.62 M evaluation records' but does not state how many unique (model, dataset, metric) triples are involved after deduplication; this number should be reported for reproducibility.
  2. [§3] Notation for the latent-space model (e.g., how model, dataset, and metric embeddings are combined) is introduced without a clear equation or diagram in the provided text; a single equation or figure would improve clarity.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback, which highlights important areas for improving clarity and rigor in our experimental descriptions. We address each major comment below and will incorporate revisions to strengthen the manuscript.

read point-by-point responses
  1. Referee: [Abstract and §4] Abstract and §4 (Benchmark): the reported superiority over baselines is impossible to assess without an explicit description of the train/test split construction. In particular, it is unclear whether any model–dataset pair that appears in the training tuples is allowed to appear (even with a different metric) in the test set, which would constitute leakage and undermine the central claim of generalization to unseen pairs.

    Authors: We agree that an explicit description of the train/test split construction is necessary for assessing the validity of our generalization claims. In the revised version, we will add a dedicated paragraph in §4 detailing the split procedure: we partition at the level of unique model–dataset pairs to ensure that no pair (regardless of metric) from the training set appears in the test set. Models and datasets are held out entirely where possible to simulate the unseen setting, with the 1.62M records divided such that test tuples involve completely novel combinations. This prevents the leakage scenario described. revision: yes

  2. Referee: [Abstract] Abstract: the statement that ModelLens 'surpasses baselines' is given without any quantitative deltas, confidence intervals, or statistical significance tests. Because the central contribution is an empirical improvement on a large but noisy dataset, the absence of these numbers makes it impossible to judge whether the gains are load-bearing or within noise.

    Authors: We acknowledge that the abstract should include quantitative support. In the revision, we will update the abstract to report specific deltas (e.g., relative improvements over metadata-only and per-model baselines), along with references to confidence intervals and statistical significance tests (e.g., paired t-tests or Wilcoxon tests with p-values) that are already computed in §4 but not summarized in the abstract. This will allow readers to assess the practical significance of the results. revision: yes

  3. Referee: [§5] §5 (Routing experiments): the claim that Top-K pools improve routing methods by up to 81 % requires the exact definition of the routing baselines, the size of the candidate pools before and after ModelLens filtering, and whether the improvement is measured on the same held-out datasets used for the main benchmark or on additional data.

    Authors: We agree that additional details are required for reproducibility and interpretation. In the revised §5, we will explicitly define the routing baselines (including their original implementations and hyperparameters), state the pre- and post-filtering pool sizes (e.g., full candidate pool of size N reduced to Top-K), and clarify that all routing improvements are measured on the identical held-out datasets from the main 1.62M-record benchmark to maintain consistency with the core evaluation. revision: yes

Circularity Check

0 steps flagged

No significant circularity

full rationale

The derivation learns embeddings over observed model-dataset-metric tuples from public leaderboards and evaluates ranking performance on explicitly held-out unseen pairs (1.62 M records, 47 K models, 9.6 K datasets). No equation or claim reduces a prediction to a fitted input by construction, no self-citation is invoked as a uniqueness theorem or load-bearing premise, and no ansatz is smuggled via prior work. The held-out splits ensure that the reported gains test generalization rather than restating the training distribution.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no explicit list of free parameters, axioms, or invented entities; the latent-space model presumably contains standard embedding dimensions and regularization choices that are not detailed here.

pith-pipeline@v0.9.0 · 5561 in / 1148 out tokens · 42974 ms · 2026-05-11T01:52:45.644351+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

73 extracted references · 73 canonical work pages · 10 internal anchors

  1. [1]

    Axcell: Automatic extraction of results from machine learning papers

    Marcin Kardas, Piotr Czapla, Pontus Stenetorp, Sebastian Ruder, Sebastian Riedel, Ross Taylor, and Robert Stojnic. Axcell: Automatic extraction of results from machine learning papers. InProceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 8580–8594, 2020

  2. [2]

    Open llm leader- board v2

    Clémentine Fourrier, Nathan Habib, Alina Lozovskaya, Konrad Szafer, and Thomas Wolf. Open llm leader- board v2. https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard, 2024

  3. [3]

    HuggingFace's Transformers: State-of-the-art Natural Language Processing

    Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, et al. Huggingface’s transformers: State-of-the-art natural language processing.arXiv preprint arXiv:1910.03771, 2019

  4. [4]

    Automl: A survey of the state-of-the-art.Knowledge-based systems, 212:106622, 2021

    Xin He, Kaiyong Zhao, and Xiaowen Chu. Automl: A survey of the state-of-the-art.Knowledge-based systems, 212:106622, 2021

  5. [5]

    Holistic Evaluation of Language Models

    Percy Liang, Rishi Bommasani, Tony Lee, Dimitris Tsipras, Dilara Soylu, Michihiro Yasunaga, Yian Zhang, Deepak Narayanan, Yuhuai Wu, Ananya Kumar, et al. Holistic evaluation of language models. arXiv preprint arXiv:2211.09110, 2022

  6. [6]

    Zero-shot automl with pretrained models

    Ekrem Öztürk, Fabio Ferreira, Hadi Jomaa, Lars Schmidt-Thieme, Josif Grabocka, and Frank Hutter. Zero-shot automl with pretrained models. InInternational Conference on Machine Learning, pages 17138–17155. PMLR, 2022

  7. [7]

    Tabpfn: A transformer that solves small tabu- lar classification problems in a second,

    Noah Hollmann, Samuel Müller, Katharina Eggensperger, and Frank Hutter. Tabpfn: A transformer that solves small tabular classification problems in a second.arXiv preprint arXiv:2207.01848, 2022

  8. [8]

    Optimus: Optimization modeling using MIP solvers and large language models

    Ali AhmadiTeshnizi, Wenzhi Gao, and Madeleine Udell. Optimus: Optimization modeling using mip solvers and large language models.arXiv preprint arXiv:2310.06116, 2023

  9. [9]

    Logme: Practical assessment of pre-trained models for transfer learning

    Kaichao You, Yong Liu, Jianmin Wang, and Mingsheng Long. Logme: Practical assessment of pre-trained models for transfer learning. InInternational conference on machine learning, pages 12133–12143. PMLR, 2021

  10. [10]

    Model spider: Learning to rank pre-trained models efficiently.Advances in Neural Information Processing Systems, 36:13692– 13719, 2023

    Yi-Kai Zhang, Ting-Ji Huang, Yao-Xiang Ding, De-Chuan Zhan, and Han-Jia Ye. Model spider: Learning to rank pre-trained models efficiently.Advances in Neural Information Processing Systems, 36:13692– 13719, 2023

  11. [11]

    Know2vec: A black-box proxy for neural network retrieval

    Zhuoyi Shang, Yanwei Liu, Jinxia Liu, Xiaoyan Gu, Ying Ding, and Xiangyang Ji. Know2vec: A black-box proxy for neural network retrieval. InProceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 20346–20353, 2025

  12. [12]

    Routerdc: Query-based router by dual contrastive learning for assembling large language models.Advances in Neural Information Processing Systems, 37:66305–66328, 2024

    Shuhao Chen, Weisen Jiang, Baijiong Lin, James Kwok, and Yu Zhang. Routerdc: Query-based router by dual contrastive learning for assembling large language models.Advances in Neural Information Processing Systems, 37:66305–66328, 2024

  13. [13]

    Office Products

    Richard Zhuang, Tianhao Wu, Zhaojin Wen, Andrew Li, Jiantao Jiao, and Kannan Ramchandran. Em- bedllm: Learning compact representations of large language models.arXiv preprint arXiv:2410.02223, 2024

  14. [14]

    Router-r1: Teaching llms multi-round routing and aggregation via reinforcement learning

    Haozhen Zhang, Tao Feng, and Jiaxuan You. Router-r1: Teaching llms multi-round routing and aggregation via reinforcement learning. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025

  15. [15]

    Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for expert agi

    Xiang Yue, Yuansheng Ni, Kai Zhang, Tianyu Zheng, Ruoqi Liu, Ge Zhang, Samuel Stevens, Dongfu Jiang, Weiming Ren, Yuxuan Sun, et al. Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for expert agi. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 9556–9567, 2024

  16. [16]

    An information-theoretic approach to transferability in task transfer learning

    Yajie Bao, Yang Li, Shao-Lun Huang, Lin Zhang, Lizhong Zheng, Amir Zamir, and Leonidas Guibas. An information-theoretic approach to transferability in task transfer learning. In2019 IEEE international conference on image processing (ICIP), pages 2309–2313. IEEE, 2019

  17. [17]

    Transferability and hardness of supervised classification tasks

    Anh T Tran, Cuong V Nguyen, and Tal Hassner. Transferability and hardness of supervised classification tasks. InProceedings of the IEEE/CVF international conference on computer vision, pages 1395–1405, 2019. 11

  18. [18]

    Leep: A new measure to evaluate transferability of learned representations

    Cuong Nguyen, Tal Hassner, Matthias Seeger, and Cedric Archambeau. Leep: A new measure to evaluate transferability of learned representations. InInternational conference on machine learning, pages 7294–

  19. [19]

    Ranking neural checkpoints

    Yandong Li, Xuhui Jia, Ruoxin Sang, Yukun Zhu, Bradley Green, Liqiang Wang, and Boqing Gong. Ranking neural checkpoints. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2663–2673, 2021

  20. [20]

    Pactran: Pac-bayesian met- rics for estimating the transferability of pretrained models to classification tasks

    Nan Ding, Xi Chen, Tomer Levinboim, Soravit Changpinyo, and Radu Soricut. Pactran: Pac-bayesian met- rics for estimating the transferability of pretrained models to classification tasks. InEuropean Conference on Computer Vision, pages 252–268. Springer, 2022

  21. [21]

    Otce: A transferability metric for cross-domain cross-task representations

    Yang Tan, Yang Li, and Shao-Lun Huang. Otce: A transferability metric for cross-domain cross-task representations. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 15779–15788, 2021

  22. [22]

    A linearized framework and a new benchmark for model selection for fine-tuning.arXiv preprint arXiv:2102.00084, 2021

    Aditya Deshpande, Alessandro Achille, Avinash Ravichandran, Hao Li, Luca Zancato, Charless Fowlkes, Rahul Bhotika, Stefano Soatto, and Pietro Perona. A linearized framework and a new benchmark for model selection for fine-tuning.arXiv preprint arXiv:2102.00084, 2021

  23. [23]

    Transferability estimation using bhattacharyya class separability

    Michal Pándy, Andrea Agostinelli, Jasper Uijlings, Vittorio Ferrari, and Thomas Mensink. Transferability estimation using bhattacharyya class separability. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9172–9182, 2022

  24. [24]

    Huang, L., Yu, W., Ma, W., Zhong, W., Feng, Z., Wang, H., Chen, Q., Peng, W., Feng, X., Qin, B., and Liu, T

    Qitian Jason Hu, Jacob Bieker, Xiuyu Li, Nan Jiang, Benjamin Keigwin, Gaurav Ranganath, Kurt Keutzer, and Shriyash Kaustubh Upadhyay. Routerbench: A benchmark for multi-llm routing system.arXiv preprint arXiv:2403.12031, 2024

  25. [25]

    RouteLLM: Learning to Route LLMs with Preference Data

    Isaac Ong, Amjad Almahairi, Vincent Wu, Wei-Lin Chiang, Tianhao Wu, Joseph E Gonzalez, M Waleed Kadous, and Ion Stoica. Routellm: Learning to route llms with preference data.arXiv preprint arXiv:2406.18665, 2024

  26. [26]

    arXiv preprint arXiv:2410.03834 , year=

    Tao Feng, Yanzhen Shen, and Jiaxuan You. Graphrouter: A graph-based router for llm selections.arXiv preprint arXiv:2410.03834, 2024

  27. [27]

    Routereval: A comprehensive benchmark for routing llms to explore model-level scaling up in llms

    Zhongzhan Huang, Guoming Ling, Yupei Lin, Yandong Chen, Shanshan Zhong, Hefeng Wu, and Liang Lin. Routereval: A comprehensive benchmark for routing llms to explore model-level scaling up in llms. arXiv preprint arXiv:2503.10657, 2025

  28. [28]

    Scaling Laws for Neural Language Models

    Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models.arXiv preprint arXiv:2001.08361, 2020

  29. [29]

    BPR: Bayesian Personalized Ranking from Implicit Feedback

    Steffen Rendle, Christoph Freudenthaler, Zeno Gantner, and Lars Schmidt-Thieme. Bpr: Bayesian personalized ranking from implicit feedback.arXiv preprint arXiv:1205.2618, 2012

  30. [30]

    The analysis of permutations.Journal of the Royal Statistical Society Series C: Applied Statistics, 24(2):193–202, 1975

    Robin L Plackett. The analysis of permutations.Journal of the Royal Statistical Society Series C: Applied Statistics, 24(2):193–202, 1975

  31. [31]

    Wiley New York, 1959

    R Duncan Luce et al.Individual choice behavior, volume 4. Wiley New York, 1959

  32. [32]

    Task2vec: Task embedding for meta-learning

    Alessandro Achille, Michael Lam, Rahul Tewari, Avinash Ravichandran, Subhransu Maji, Charless C Fowlkes, Stefano Soatto, and Pietro Perona. Task2vec: Task embedding for meta-learning. InProceedings of the IEEE/CVF international conference on computer vision, pages 6430–6439, 2019

  33. [33]

    A new measure of rank correlation.Biometrika, 30(1-2):81–93, 1938

    Maurice G Kendall. A new measure of rank correlation.Biometrika, 30(1-2):81–93, 1938

  34. [34]

    Fine-Grained Visual Classification of Aircraft

    Subhransu Maji, Esa Rahtu, Juho Kannala, Matthew Blaschko, and Andrea Vedaldi. Fine-grained visual classification of aircraft.arXiv preprint arXiv:1306.5151, 2013

  35. [35]

    3d object representations for fine-grained categorization

    Jonathan Krause, Michael Stark, Jia Deng, and Li Fei-Fei. 3d object representations for fine-grained categorization. InProceedings of the IEEE international conference on computer vision workshops, pages 554–561, 2013

  36. [36]

    Describing textures in the wild

    Mircea Cimpoi, Subhransu Maji, Iasonas Kokkinos, Sammy Mohamed, and Andrea Vedaldi. Describing textures in the wild. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 3606–3613, 2014. 12

  37. [37]

    Cats and dogs

    Omkar M Parkhi, Andrea Vedaldi, Andrew Zisserman, and CV Jawahar. Cats and dogs. In2012 IEEE conference on computer vision and pattern recognition, pages 3498–3505. IEEE, 2012

  38. [38]

    Automated flower classification over a large number of classes

    Maria-Elena Nilsback and Andrew Zisserman. Automated flower classification over a large number of classes. In2008 Sixth Indian conference on computer vision, graphics & image processing, pages 722–729. IEEE, 2008

  39. [39]

    Food-101–mining discriminative components with random forests

    Lukas Bossard, Matthieu Guillaumin, and Luc Van Gool. Food-101–mining discriminative components with random forests. InEuropean conference on computer vision, pages 446–461. Springer, 2014

  40. [40]

    Learning transferable visual models from natural language supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. InInternational conference on machine learning, pages 8748–8763. PmLR, 2021

  41. [41]

    Patrick Helber, Benjamin Bischke, Andreas Dengel, and Damian Borth. Eurosat: A novel dataset and deep learning benchmark for land use and land cover classification.IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 12(7):2217–2226, 2019

  42. [42]

    Llama 3.1 model card

    Meta AI. Llama 3.1 model card. https://github.com/meta-llama/llama-models/blob/main/ models/llama3_1/MODEL_CARD.md, 2024. Accessed: 2026-04-13

  43. [43]

    Llama 3.3 model card

    Meta AI. Llama 3.3 model card. https://github.com/meta-llama/llama-models/blob/main/ models/llama3_3/MODEL_CARD.md, 2024. Accessed: 2026-04-13

  44. [44]

    Albert Q. Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, Chris Bamford, Devendra Singh Chaplot, Diego de Las Casas, Emma Bou Hanna, Florian Bressand, Gianna Lengyel, Guillaume Bour, Guillaume Lample, Lélio Renard Lavaud, Lucile Saulnier, Marie-Anne Lachaux, Pierre Stock, Sandeep Subramanian, Sophia Yang, Szymon Antoniak, Teven...

  45. [45]

    Gpt-oss-20b.https://huggingface.co/openai/gpt-oss-20b, 2025

    OpenAI. Gpt-oss-20b.https://huggingface.co/openai/gpt-oss-20b, 2025. Accessed: 2026-05

  46. [46]

    Gemma: Open Models Based on Gemini Research and Technology

    Gemma Team, Thomas Mesnard, Cassidy Hardin, Robert Dadashi, Surya Bhupatiraju, Shreya Pathak, Laurent Sifre, Morgane Rivière, Mihir Sanjay Kale, Juliette Love, et al. Gemma: Open models based on gemini research and technology.arXiv preprint arXiv:2403.08295, 2024

  47. [47]

    Llama 4 model card

    Meta AI. Llama 4 model card. https://www.llama.com/docs/ model-cards-and-prompt-formats/llama4/, 2025. Accessed: 2026-04-13

  48. [48]

    Nemotron-h: A family of accurate and efficient hybrid mamba-transformer models

    Aaron Blakeman, Aarti Basant, Abhinav Khattar, Adithya Renduchintala, Akhiad Bercovich, Aleksander Ficek, Alexis Bjorlin, Ali Taghibakhshi, Amala Sanjay Deshmukh, Ameya Sunil Mahabaleshwarkar, et al. Nemotron-h: A family of accurate and efficient hybrid mamba-transformer models.arXiv preprint arXiv:2504.03624, 2025

  49. [49]

    Qwen2.5-1m technical report.ArXiv, abs/2501.15383, 2025

    An Yang, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoyan Huang, Jiandong Jiang, Jianhong Tu, Jianwei Zhang, Jingren Zhou, Junyang Lin, Kai Dang, Kexin Yang, Le Yu, Mei Li, Minmin Sun, Qin Zhu, Rui Men, Tao He, Weijia Xu, Wenbiao Yin, Wenyuan Yu, Xiafei Qiu, Xingzhang Ren, Xinlong Yang, Yong Li, Zhiying Xu, and Zipeng Zhang. Qwen2.5-1m technical re...

  50. [50]

    Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de Las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, Lélio Renard Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, and William El Sayed. Mistral 7b.CoRR, abs/2310.06825, 2023

  51. [51]

    Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Jacob Devlin, Kenton Lee, Kristina Toutanova, Llion Jones, Matthew Kelcey, Ming-Wei Chang, Andrew M

    Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur P. Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Jacob Devlin, Kenton Lee, Kristina Toutanova, Llion Jones, Matthew Kelcey, Ming-Wei Chang, Andrew M. Dai, Jakob Uszkoreit, Quoc Le, and Slav Petrov. Natural questions: a benchmark for question answering research.Trans...

  52. [52]

    When not to trust language models: Investigating effectiveness of parametric and non-parametric memories

    Alex Mallen, Akari Asai, Victor Zhong, Rajarshi Das, Daniel Khashabi, and Hannaneh Hajishirzi. When not to trust language models: Investigating effectiveness of parametric and non-parametric memories. In Anna Rogers, Jordan L. Boyd-Graber, and Naoaki Okazaki, editors,Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (...

  53. [53]

    Cohen, Ruslan Salakhutdinov, and Christopher D

    Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William W. Cohen, Ruslan Salakhutdinov, and Christopher D. Manning. Hotpotqa: A dataset for diverse, explainable multi-hop question answering. In Ellen Riloff, David Chiang, Julia Hockenmaier, and Jun’ichi Tsujii, editors,Proceedings of the 2018 Conference on Empirical Methods in Natural Language Proces...

  54. [54]

    Musique: Multihop questions via single-hop question composition.Trans

    Harsh Trivedi, Niranjan Balasubramanian, Tushar Khot, and Ashish Sabharwal. Musique: Multihop questions via single-hop question composition.Trans. Assoc. Comput. Linguistics, 10:539–554, 2022

  55. [55]

    Smith, and Mike Lewis

    Ofir Press, Muru Zhang, Sewon Min, Ludwig Schmidt, Noah A. Smith, and Mike Lewis. Measuring and narrowing the compositionality gap in language models. In Houda Bouamor, Juan Pino, and Kalika Bali, editors,Findings of the Association for Computational Linguistics: EMNLP 2023, Singapore, December 6-10, 2023, Findings of ACL, pages 5687–5711. Association for...

  56. [56]

    Chawla, Chuxu Zhang, and Yanfang Ye

    Zheyuan Zhang, Yiyang Li, Nhi Ha Lan Le, Zehong Wang, Tianyi Ma, Vincent Galassi, Keerthiram Murugesan, Nuno Moniz, Werner Geyer, Nitesh V . Chawla, Chuxu Zhang, and Yanfang Ye. NGQA: A nutritional graph question answering benchmark for personalized health-aware nutritional reasoning. In Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Mohammad Taher P...

  57. [57]

    RSVLM-QA: A benchmark dataset for remote sensing vision language model-based question answering

    Xing Zi, Jinghao Xiao, Yunxiao Shi, Xian Tao, Jun Li, Ali Braytee, and Mukesh Prasad. RSVLM-QA: A benchmark dataset for remote sensing vision language model-based question answering. In Cathal Gurrin, Klaus Schoeffmann, Min Zhang, Luca Rossetto, Stevan Rudinac, Duc-Tien Dang-Nguyen, Wen-Huang Cheng, Phoebe Chen, and Jenny Benois-Pineau, editors,Proceeding...

  58. [58]

    Predicting neural network accuracy from weights, 2021

    Thomas Unterthiner, Daniel Keysers, Sylvain Gelly, Olivier Bousquet, and Ilya Tolstikhin. Predicting neural network accuracy from weights.arXiv preprint arXiv:2002.11448, 2020

  59. [59]

    Predicting trends in the quality of state-of- the-art neural networks without access to training or testing data.Nature Communications, 12(1):4122, 2021

    Charles H Martin, Tongsu Peng, and Michael W Mahoney. Predicting trends in the quality of state-of- the-art neural networks without access to training or testing data.Nature Communications, 12(1):4122, 2021

  60. [60]

    Hyper-representations as generative models: Sampling unseen neural network weights.Advances in Neural Information Processing Systems, 35:27906–27920, 2022

    Konstantin Schürholt, Boris Knyazev, Xavier Giró-i Nieto, and Damian Borth. Hyper-representations as generative models: Sampling unseen neural network weights.Advances in Neural Information Processing Systems, 35:27906–27920, 2022

  61. [61]

    Learning model representations using publicly available model hubs.arXiv preprint arXiv:2510.02096, 2025

    Damian Falk, Konstantin Schürholt, Konstantinos Tzevelekakis, Léo Meynent, and Damian Borth. Learning model representations using publicly available model hubs.arXiv preprint arXiv:2510.02096, 2025

  62. [62]

    LLM DNA: Tracing Model Evolution via Functional Representations

    Zhaomin Wu, Haodong Zhao, Ziyang Wang, Jizhou Guo, Qian Wang, and Bingsheng He. Llm dna: Tracing model evolution via functional representations.arXiv preprint arXiv:2509.24496, 2025

  63. [63]

    We should chart an atlas of all the world’s models.arXiv preprint arXiv:2503.10633, 2025

    Eliahu Horwitz, Nitzan Kurer, Jonathan Kahana, Liel Amar, and Yedid Hoshen. We should chart an atlas of all the world’s models.arXiv preprint arXiv:2503.10633, 2025

  64. [64]

    ThinkGuard: Deliberative slow thinking leads to cautious guardrails

    Xiaofei Wen, Wenxuan Zhou, Wenjie Jacky Mo, and Muhao Chen. ThinkGuard: Deliberative slow thinking leads to cautious guardrails. In Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Mohammad Taher Pilehvar, editors,Findings of the Association for Computational Linguistics: ACL 2025, pages 13698– 13713, Vienna, Austria, July 2025. Association for Computa...

  65. [65]

    Diagnosing and mitigating modality interference in multimodal large language models.arXiv preprint arXiv:2505.19616, 2025

    Rui Cai, Bangzheng Li, Xiaofei Wen, Muhao Chen, and Zhe Zhao. Diagnosing and mitigating modality interference in multimodal large language models.arXiv preprint arXiv:2505.19616, 2025

  66. [66]

    Omniguard: Unified omni-modal guardrails with deliberate reasoning.CoRR, abs/2512.02306, 2025

    Boyu Zhu, Xiaofei Wen, Wenjie Jacky Mo, Tinghui Zhu, Yanan Xie, Peng Qi, and Muhao Chen. Omniguard: Unified omni-modal guardrails with deliberate reasoning.CoRR, abs/2512.02306, 2025

  67. [67]

    Towards fundamentally scalable model selection: Asymptotically fast update and selection.arXiv preprint arXiv:2406.07536, 2024

    Wenxiao Wang, Weiming Zhuang, and Lingjuan Lyu. Towards fundamentally scalable model selection: Asymptotically fast update and selection.arXiv preprint arXiv:2406.07536, 2024

  68. [68]

    Matrix factorization techniques for recommender systems

    Yehuda Koren, Robert Bell, and Chris V olinsky. Matrix factorization techniques for recommender systems. Computer, 42(8):30–37, 2009

  69. [69]

    Provable inductive matrix completion.arXiv preprint arXiv:1306.0626, 2013

    Prateek Jain and Inderjit S Dhillon. Provable inductive matrix completion.arXiv preprint arXiv:1306.0626, 2013. 14

  70. [70]

    Goal-directed inductive matrix completion

    Si Si, Kai-Yang Chiang, Cho-Jui Hsieh, Nikhil Rao, and Inderjit S Dhillon. Goal-directed inductive matrix completion. InProceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining, pages 1165–1174, 2016

  71. [71]

    Inductive matrix completion based on graph neural networks.arXiv preprint arXiv:1904.12058, 2019

    Muhan Zhang and Yixin Chen. Inductive matrix completion based on graph neural networks.arXiv preprint arXiv:1904.12058, 2019

  72. [72]

    Dropoutnet: Addressing cold start in recommender systems.Advances in neural information processing systems, 30, 2017

    Maksims V olkovs, Guangwei Yu, and Tomi Poutanen. Dropoutnet: Addressing cold start in recommender systems.Advances in neural information processing systems, 30, 2017

  73. [73]

    upstream

    Lloyd S Shapley et al. A value for n-person games. 1953. 15 A Appendix A.1 Appendix Overview This appendix provides additional details, analyses, and reproducibility information for MODELLENS. We organize the supplementary material as follows. A. Learned Embedding SpaceFigure 4–Figure 5 Visualizations of the interaction-trained and semantic-only model–dat...