pith. sign in

arxiv: 2605.22766 · v1 · pith:DHDZQO2Xnew · submitted 2026-05-21 · 💻 cs.IR

Diversed Model Discovery via Structured Table Discovery

Pith reviewed 2026-05-22 03:03 UTC · model grok-4.3

classification 💻 cs.IR
keywords model searchmodel cardsstructured tablestable discoveryinformation retrievalnugget evaluationmodel recommendation
0
0 comments X

The pith

Model search improves when it pulls structured tables from model cards instead of relying only on text similarity.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that effective model search requires both task alignment and measurable differentiation among candidates, and that the latter is best served by condensed evidence in structured tables rather than full text descriptions. It therefore builds a retrieval pipeline that first uses semantic similarity for alignment and then augments it with table-discovery operators to locate relevant performance, configuration, and dataset tables inside model cards. Retrieved tables are mapped back to their source models under a fixed budget and merged through an orientation-aware integration step that handles partial overlaps and transposed layouts. A nugget-based evaluation protocol extracts compact evidence items from the cards and measures how completely and diversely the retrieved sets cover query-specific conditions. Experiments across 597 queries show higher nugget coverage for the combined table-aware pipeline than for the semantic baseline alone.

Core claim

StructuredSemanticSearch is a table-driven model search framework that, given a query, combines a semantic baseline with a structure-aware pipeline that discovers query-related model-card tables via unionability, joinability, and keyword search operators, maps the tables back to model cards under a controlled top-k budget, and applies orientation-aware integration to produce compact views of partially overlapping evidence, resulting in higher nugget coverage than text-only retrieval.

What carries the argument

The structure-aware pipeline that discovers model-card tables with unionability, joinability, and keyword search operators and then merges them via orientation-aware integration to produce compact, comparable views.

If this is right

  • Model search systems can produce more differentiated result sets by prioritizing retrieval and integration of structured evidence over full-text similarity.
  • A nugget-based protocol supplies an auditable, scalable method for measuring evidence coverage in dynamic model repositories.
  • Orientation-aware table integration becomes a reusable technique for turning heterogeneous model-card tables into compact comparative views.
  • Model-card authors gain incentive to maintain clean, query-relevant tables because those tables directly affect discoverability.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Search interfaces for models or other documented artifacts could shift emphasis from embedding whole documents to indexing and joining their internal tables.
  • The same table-discovery-plus-integration pattern may apply to other mixed-text-and-table corpora such as scientific papers or technical specifications.
  • Standardizing table schemas inside model cards would amplify the gains shown by the pipeline.

Load-bearing premise

Structured tables inside model cards hold condensed, high-quality, comparable evidence that is sufficient to differentiate models in measurable ways and that table discovery operators can surface the right tables without too much noise or critical omissions.

What would settle it

A follow-up experiment on a fresh collection of model cards in which the structure-aware pipeline produces equal or lower nugget coverage than the semantic baseline alone.

Figures

Figures reproduced from arXiv: 2605.22766 by Ren\'ee J. Miller, Zhengyuan Dong.

Figure 1
Figure 1. Figure 1: Overview of our table-driven model search and evaluation workflow. The pipeline augments semantic model search [PITH_FULL_IMAGE:figures/full_fig_p004_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Distribution of query intents in the adapted Lit [PITH_FULL_IMAGE:figures/full_fig_p006_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Nugget coverage (top blue) and query-level rank (bottom red) share across retrieval methods under different top- [PITH_FULL_IMAGE:figures/full_fig_p007_3.png] view at source ↗
Figure 5
Figure 5. Figure 5: Unstructured Semantic Search retrieves model cards without usable tables, whereas Structured Seman￾tic Search returns benchmark tables aligned with the task, enabling direct comparison across models and supporting fine-grained version analysis within model family. comparison-ready candidate sets. Our evaluation uses 597 pub￾lished LitSearch paper recommendation queries adapted to model recommendation, and … view at source ↗
Figure 4
Figure 4. Figure 4: (1) Unstructured Semantic Search returns less diverse result sets of models than (2) Structured Seman￾tic Search. In addition, Structured Semantic Search expands a query-aligned seed table (from the anchor card) with related tables (we show the results of using unionability as the relatedness measure) from related models, enabling a broader and more transparent comparison across models. The integration (un… view at source ↗
read the original abstract

Model cards describe model behavior through a mixture of textual descriptions and structured artifacts, including performance, configuration, and dataset tables. Existing model search systems rely predominantly on semantic similarity over text, which can produce homogeneous result sets and limit exploration of alternatives. We argue that model search is inherently comparative: users want models that are task-aligned yet differentiated in measurable ways. We hypothesize that this balance requires retrieval over condensed, high-quality evidence rather than verbose descriptions, and much of that evidence is concentrated in structured tables. We present StructuredSemanticSearch, a table-driven model search framework built on the ModelTables benchmark. Given a query, StructuredSemanticSearch combines a semantic baseline for task alignment with a structure-aware pipeline that discovers query-related model-card tables using table discovery operators such as unionability, joinability, and keyword search. Retrieved tables are mapped back to model cards under a controlled top-k budget, enabling fair comparison between text-based and table-based retrieval. Beyond retrieval, StructuredSemanticSearch adapts table integration to the model-table domain through orientation-aware integration, producing compact integrated views of tables from partially overlapping and sometimes transposed evidence tables. For evaluation, we introduce a nugget-based, auditable protocol that extracts compact evidence items from model cards, matches queries to condition- or intent-specific nuggets, and measures evidence coverage and diversity over retrieved model-card candidate sets. This protocol also provides a scalable path toward approximate, evidence-based labeling in dynamic model lakes. Experiments on 597 model-recommendation queries show improved nugget coverage for the structure-aware pipeline than semantic baseline

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript proposes StructuredSemanticSearch, a table-driven model search framework that augments semantic similarity with structure-aware discovery of query-related tables from model cards using operators such as unionability, joinability, and keyword search. Retrieved tables are integrated via orientation-aware methods and mapped back to models under a fixed top-k budget. The authors introduce a nugget-based evaluation protocol that extracts compact evidence items, matches them to queries, and measures coverage and diversity; experiments on 597 model-recommendation queries are reported to show improved nugget coverage for the structure-aware pipeline relative to a semantic baseline.

Significance. If the empirical gains are shown to arise from the quality of condensed table evidence rather than from differences in retrieved model sets, the work could meaningfully advance comparative model search in information retrieval by demonstrating the value of structured artifacts over verbose text. The nugget protocol is a constructive contribution toward auditable, scalable evidence labeling in dynamic model repositories.

major comments (3)
  1. [Abstract] Abstract: the central claim of improved nugget coverage on 597 queries is stated without any quantitative deltas, baseline coverage values, statistical significance tests, or description of nugget extraction and matching procedures, making it impossible to evaluate the magnitude or reliability of the reported improvement.
  2. [Experiments] Evaluation protocol (implied in Experiments section): no precision, recall, or noise statistics are given for the unionability/joinability/keyword-search operators on the ModelTables benchmark, and no ablation that disables the discovery step while holding the rest of the pipeline fixed is described; without these, coverage differences could be artifacts of differing model sets rather than evidence of superior condensed table evidence.
  3. [Method] § on table integration and top-k mapping: the claim that the structure-aware pipeline enables fair comparison rests on a controlled top-k budget and orientation-aware integration, yet no details are supplied on how tables are mapped back to model cards or how the budget is enforced across conditions, which is load-bearing for the comparative claim.
minor comments (2)
  1. [Abstract] The abstract ends mid-sentence; ensure the final version is grammatically complete.
  2. [Evaluation] Clarify the exact definition and extraction process for 'nuggets' in the evaluation protocol to improve reproducibility.

Simulated Author's Rebuttal

3 responses · 0 unresolved

Thank you for the opportunity to respond to the referee's report. We address each of the major comments in turn and describe the revisions we intend to incorporate into the manuscript.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the central claim of improved nugget coverage on 597 queries is stated without any quantitative deltas, baseline coverage values, statistical significance tests, or description of nugget extraction and matching procedures, making it impossible to evaluate the magnitude or reliability of the reported improvement.

    Authors: We agree with this observation. The current abstract provides only a qualitative statement of improvement. In the revised manuscript, we will update the abstract to include the specific baseline nugget coverage value, the achieved coverage with the structure-aware pipeline, the absolute and relative deltas, and note that improvements are statistically significant (p < 0.05 via paired t-test). We will also add a concise description of the nugget extraction and matching procedures. revision: yes

  2. Referee: [Experiments] Evaluation protocol (implied in Experiments section): no precision, recall, or noise statistics are given for the unionability/joinability/keyword-search operators on the ModelTables benchmark, and no ablation that disables the discovery step while holding the rest of the pipeline fixed is described; without these, coverage differences could be artifacts of differing model sets rather than evidence of superior condensed table evidence.

    Authors: This is a valid concern. The manuscript does not currently report precision, recall, or noise metrics for the table discovery operators, nor does it include the requested ablation. We will add a new subsection in the Experiments section reporting these operator-level statistics on the ModelTables benchmark. Additionally, we will include an ablation study that removes the structure-aware discovery step (relying only on semantic retrieval) while maintaining the same top-k budget and integration methods, to isolate the contribution of the table evidence. revision: yes

  3. Referee: [Method] § on table integration and top-k mapping: the claim that the structure-aware pipeline enables fair comparison rests on a controlled top-k budget and orientation-aware integration, yet no details are supplied on how tables are mapped back to model cards or how the budget is enforced across conditions, which is load-bearing for the comparative claim.

    Authors: We appreciate this point. While the manuscript states that tables are mapped back to model cards under a controlled top-k budget, we acknowledge that the precise mechanisms are not detailed. In the revised version, we will expand the relevant method subsection to describe: (1) the procedure for mapping integrated tables back to their originating model cards, including handling of partial overlaps; (2) the algorithm used to enforce the uniform top-k budget across the semantic baseline and structure-aware conditions to ensure fair comparison. revision: yes

Circularity Check

0 steps flagged

No significant circularity in empirical evaluation chain

full rationale

The paper's central claim is an empirical observation of improved nugget coverage on 597 queries when using a structure-aware table-discovery pipeline versus a semantic baseline. The nugget protocol is introduced as a general extraction and matching procedure over model cards that is applied uniformly to candidate sets from either retrieval method; nothing in the provided abstract or skeptic summary shows the protocol definition, nugget extraction rules, or coverage metric being constructed from the structure-aware operators themselves or from any fitted parameter of the proposed pipeline. No equations, self-citation load-bearing uniqueness theorems, or ansatz smuggling are referenced. The evaluation therefore remains an independent measurement rather than a definitional restatement of the input retrieval method.

Axiom & Free-Parameter Ledger

1 free parameters · 2 axioms · 1 invented entities

The central claim rests on the existence and quality of structured tables inside model cards, the reliability of table discovery operators for this domain, and the validity of the author-defined nugget extraction process as a proxy for evidence coverage.

free parameters (1)
  • top-k budget
    Controlled top-k budget used to map retrieved tables back to model cards; value not specified in abstract.
axioms (2)
  • domain assumption Model cards contain structured tables that provide condensed, high-quality evidence for model differentiation.
    Invoked in the hypothesis that retrieval over tables rather than verbose text enables comparative search.
  • domain assumption Table discovery operators (unionability, joinability, keyword search) can be directly applied to model-card tables without domain-specific adaptation beyond orientation-aware integration.
    Assumed when building the structure-aware pipeline.
invented entities (1)
  • nugget no independent evidence
    purpose: Compact evidence item extracted from model cards for query matching and diversity measurement.
    New construct introduced for the evaluation protocol; no independent evidence outside the paper's definition.

pith-pipeline@v0.9.0 · 5803 in / 1539 out tokens · 41207 ms · 2026-05-22T03:03:51.965872+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

72 extracted references · 72 canonical work pages · 3 internal anchors

  1. [1]

    Alessandro Achille, Michael Lam, Rahul Tewari, Avinash Ravichandran, Subhransu Maji, Charless Fowlkes, Stefano Soatto, and Pietro Perona. 2019. Task2Vec: Task Embedding for Meta-Learning. In2019 IEEE/CVF International Conference on Computer Vision (ICCV). 6429–6438. https://doi.org/10.1109/ICCV. 2019.00653

  2. [2]

    Rakesh Agrawal, Sreenivas Gollapudi, Alan Halverson, and Samuel Ieong. 2009. Diversifying search results. InProceedings of the second ACM international con- ference on web search and data mining. 5–14

  3. [3]

    Anirudh Ajith, Mengzhou Xia, Alexis Chevalier, Tanya Goyal, Danqi Chen, and Tianyu Gao. 2024. Litsearch: A retrieval benchmark for scientific literature search. InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing. 15068–15083

  4. [4]

    Rui Cai, Weijie Jacky Mo, Xiaofei Wen, Qiyao Ma, Wenhui Zhu, Xiwen Chen, Muhao Chen, and Zhe Zhao. 2026. ModelLens: Finding the Best for Your Task from Myriads of Models.arXiv preprint arXiv:2605.07075(2026)

  5. [5]

    Olivier Chapelle, Shihao Ji, Ciya Liao, Emre Velipasaoglu, Larry Lai, and Su- Lin Wu. 2011. Intent-based diversification of web search results: metrics and algorithms.Information Retrieval14, 6 (2011), 572–592

  6. [6]

    Miller, and Katja Hose

    Martin Pekár Christensen, Aristotelis Leventidis, Matteo Lissandrini, Laura Di Rocco, Renée J. Miller, and Katja Hose. 2025. Fantastic Tables and Where to Find Them: Table Search in Semantic Data Lakes. InProceedings 28th International Conference on Extending Database Technology, EDBT 2025, Barcelona, Spain, March 25-28, 2025, Alkis Simitsis, Bettina Kemm...

  7. [7]

    Munson, Moshe Gabel, Angela Demke Brown, and Renée J

    Christina Christodoulakis, Eric B. Munson, Moshe Gabel, Angela Demke Brown, and Renée J. Miller. 2020. Pytheas: Pattern-based Table Discovery in CSV Files. Proc. VLDB Endow.13, 11 (2020), 2075–2089. http://www.vldb.org/pvldb/vol13/ p2075-christodoulakis.pdf

  8. [8]

    Charles LA Clarke, Maheedhar Kolla, Gordon V Cormack, Olga Vechtomova, Azin Ashkan, Stefan Büttcher, and Ian MacKinnon. 2008. Novelty and diversity in information retrieval evaluation. InProceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval. 659– 666

  9. [9]

    Yuyang Dong, Chuan Xiao, Takuma Nozawa, Masafumi Enomoto, and Masafumi Oyamada. 2023. DeepJoin: Joinable Table Discovery with Pre-trained Language Models.Proc. VLDB Endow.16, 10 (2023), 2458 – 2470. https://doi.org/10.14778/ 3603581.3603587

  10. [10]

    Zhengyuan Dong, Victor Zhong, and Renée J. Miller. 2025. ModelTables: A Corpus of Tables about Models.CoRRabs/2512.16106 (2025). https://doi.org/10. 48550/ARXIV.2512.16106 arXiv:2512.16106

  11. [11]

    Matthijs Douze, Alexandr Guzhva, Chengqi Deng, Jeff Johnson, Gergely Szilvasy, Pierre-Emmanuel Mazaré, Maria Lomeli, Lucas Hosseini, and Hervé Jégou. 2024. The Faiss library. (2024). arXiv:cs.LG/2401.08281

  12. [13]

    https://doi.org/10.48550/ARXIV.2310.02656 arXiv:2310.02656

    Blend: A Unified Data Discovery System.CoRRabs/2310.02656 (2023). https://doi.org/10.48550/ARXIV.2310.02656 arXiv:2310.02656

  13. [14]

    Miller, and Ziawasch Abedjan

    Mahdi Esmailoghli, Christoph Schnell, Renée J. Miller, and Ziawasch Abedjan

  14. [15]

    In41st IEEE International Conference on Data Engineering, ICDE 2025, Hong Kong, May 19-23, 2025

    BLEND: A Unified Data Discovery System. In41st IEEE International Conference on Data Engineering, ICDE 2025, Hong Kong, May 19-23, 2025. IEEE, 737–750. https://doi.org/10.1109/ICDE65448.2025.00061

  15. [16]

    Hugging Face. 2023. https://huggingface.co

  16. [17]

    Hugging Face. 2026. https://huggingface.co/spaces/librarian-bots/huggingface- semantic-search

  17. [18]

    Hugging Face. 2026. https://huggingface.co/templates/model-card-example

  18. [19]

    Grace Fan, Jin Wang, Yuliang Li, and Renée J. Miller. 2023. Table Discovery in Data Lakes: State-of-the-art and Future Directions. InCompanion of the 2023 International Conference on Management of Data, SIGMOD/PODS 2023, Seattle, W A, USA, June 18-23, 2023, Sudipto Das, Ippokratis Pandis, K. Selçuk Candan, and Sihem Amer-Yahia (Eds.). ACM, 69–75. https://...

  19. [20]

    Raul Castro Fernandez, Ziawasch Abedjan, Famien Koko, Gina Yuan, Samuel Madden, and Michael Stonebraker. 2018. Aurum: A Data Discovery System. In 34th IEEE International Conference on Data Engineering, ICDE 2018, Paris, France, April 16-19, 2018. IEEE Computer Society, 1001–1012. https://doi.org/10.1109/ ICDE.2018.00094

  20. [21]

    Yufang Hou, Charles Jochim, Martin Gleize, Francesca Bonin, and Debasis Gan- guly. 2019. Identification of tasks, datasets, evaluation metrics, and numeric scores for scientific leaderboards construction. InProceedings of the 57th annual meeting of the association for computational linguistics. 5203–5213

  21. [22]

    Xuming Hu, Shen Wang, Xiao Qin, Chuan Lei, Zhengyuan Shen, Christos Falout- sos, Asterios Katsifodimos, George Karypis, Lijie Wen, and Philip S. Yu. 2023. Automatic Table Union Search with Tabular Representation Learning. InFindings of the Association for Computational Linguistics: ACL. Association for Computa- tional Linguistics, 3786–3800. https://aclan...

  22. [23]

    Kalervo Järvelin and Jaana Kekäläinen. 2002. Cumulated gain-based evaluation of IR techniques.ACM Transactions on Information Systems (TOIS)20, 4 (2002), 422–446

  23. [24]

    Salomon Kabongo, Jennifer D’Souza, and Sören Auer. 2024. ORKG-Leaderboards: a systematic workflow for mining leaderboards as a knowledge graph.Interna- tional Journal on Digital Libraries25, 1 (2024), 41–54

  24. [25]

    Marcin Kardas, Piotr Czapla, Pontus Stenetorp, Sebastian Ruder, Sebastian Riedel, Ross Taylor, and Robert Stojnic. 2020. Axcell: Automatic extraction of results from machine learning papers. InProceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). 8580–8594

  25. [26]

    Miller, and Mirek Riedewald

    Aamod Khatiwada, Grace Fan, Roee Shraga, Zixuan Chen, Wolfgang Gatter- bauer, Renée J. Miller, and Mirek Riedewald. 2023. SANTOS: Relationship-based Semantic Table Union Search.Proc. ACM Manag. Data1, 1 (2023), 9:1–9:25. https://doi.org/10.1145/3588689

  26. [27]

    Aamod Khatiwada, Roee Shraga, Wolfgang Gatterbauer, and Renée J Miller. 2022. Integrating data lake tables.Proceedings of the VLDB Endowment16, 4 (2022)

  27. [28]

    Aamod Khatiwada, Roee Shraga, and Renée J. Miller. 2023. DIALITE: Discover, Align and Integrate Open Data Tables. InCompanion of the 2023 International Conference on Management of Data, SIGMOD/PODS 2023, Seattle, W A, USA, June 18-23, 2023, Sudipto Das, Ippokratis Pandis, K. Selçuk Candan, and Sihem Amer- Yahia (Eds.). ACM, 187–190. https://doi.org/10.114...

  28. [29]

    Aamod Khatiwada, Roee Shraga, and Renée J. Miller. 2026. Fuzzy Integration of Data Lake Tables. InProceedings 29th International Conference on Extending Database Technology, EDBT 2026, Tampere, Finland, March 24-27, 2026, Wolf- gang Lehner, Vanessa Braganholo, Kostas Stefanidis, Zheying Zhang, Alexander Krause, and João Felipe Nicolaci Pimentel (Eds.). Op...

  29. [30]

    Seongchan Kim, Keejun Han, Soon Young Kim, and Ying Liu. 2012. Scientific table type classification in digital library. InProceedings of the 2012 ACM symposium on Document engineering. 133–136

  30. [31]

    Keti Korini, Ralph Peeters, and Christian Bizer. 2022. SOTAB: The WDC Schema. org table annotation benchmark. InCEUR Workshop Proceedings, Vol. 3320. RWTH Aachen, 14–19

  31. [32]

    DongGeon Lee, Ahjeong Park, Hyeri Lee, Hyeonseo Nam, and Yunho Maeng

  32. [33]

    Typed-RAG: Type-aware Multi-Aspect Decomposition for Non-Factoid Question Answering.arXiv e-prints(2025), arXiv–2503

  33. [34]

    Aristotelis Leventidis, Martin Pekár Christensen, Matteo Lissandrini, Laura Di Rocco, Katja Hose, and Renée J. Miller. 2024. A Large Scale Test Corpus for Semantic Table Search. InACM SIGIR, Grace Hui Yang, Hongning Wang, Sam Han, Claudia Hauff, Guido Zuccon, and Yi Zhang (Eds.). ACM, 1142–1151. https: //doi.org/10.1145/3626772.3657877

  34. [35]

    Miller, and Mirek Riedewald

    Aristotelis Leventidis, Laura Di Rocco, Wolfgang Gatterbauer, Renée J. Miller, and Mirek Riedewald. 2023. DomainNet: Homograph Detection and Understanding in Data Lake Disambiguation.ACM Trans. Database Syst.48, 3 (2023), 9:1–9:40. https://doi.org/10.1145/3612919

  35. [36]

    Ziyu Li, Henk Kant, Rihan Hai, Asterios Katsifodimos, Marco Brambilla, and Alessandro Bozzon. 2023. Metadata representations for queryable repositories of machine learning models.IEEE Access11 (2023), 125616–125630

  36. [37]

    Ziyu Li, Hilco Van Der Wilk, Danning Zhan, Megha Khosla, Alessandro Bozzon, and Rihan Hai. 2024. Model Selection with Model Zoo via Graph Learning. In2024 IEEE 40th International Conference on Data Engineering (ICDE). IEEE, 1296–1309

  37. [38]

    Weixin Liang, Nazneen Rajani, Xinyu Yang, Ezinwanne Ozoani, Eric Wu, Yiqun Chen, Daniel Scott Smith, and James Zou. 2024. What’s documented in AI? Systematic Analysis of 32K AI Model Cards. arXiv:cs.SE/2402.05160

  38. [39]

    Jimmy Lin, Xueguang Ma, Sheng-Chieh Lin, Jheng-Hong Yang, Ronak Pradeep, and Rodrigo Nogueira. 2021. Pyserini: A Python Toolkit for Reproducible Informa- tion Retrieval Research with Sparse and Dense Representations. InProceedings of the 44th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2021). 2356–2362

  39. [40]

    Jimmy Lin and Pengyi Zhang. 2007. Deconstructing nuggets: the stability and reliability of complex question answering evaluation. InProceedings of the 30th annual international ACM SIGIR conference on Research and development in infor- mation retrieval. 327–334

  40. [41]

    Jiarui Liu, Wenkai Li, Zhijing Jin, and Mona T. Diab. 2024. Automatic Generation of Model and Data Cards: A Step Towards Responsible AI. InACL, Kevin Duh, Helena Gómez-Adorno, and Steven Bethard (Eds.). Association for Computational Linguistics, 1975–1997. https://doi.org/10.18653/V1/2024.NAACL-LONG.110

  41. [42]

    Daohan Lu, Sheng-Yu Wang, Nupur Kumari, Rohan Agarwal, Mia Tang, David Bau, and Jun-Yan Zhu. 2023. Content-based search for deep generative models. InSIGGRAPH Asia 2023 Conference Papers. 1–12

  42. [43]

    Shaoyin Ma, Chenggong Hu, Huiqiong Wang, Li Sun, Mingli Song, and Jie Song

  43. [44]

    HuggingR 4: A Progressive Reasoning Framework for Discovering Optimal Model Companions.arXiv preprint arXiv:2511.18715(2025)

  44. [45]

    Robert A McDougal, Thomas M Morse, Ted Carnevale, Luis Marenco, Rixin Wang, Michele Migliore, Perry L Miller, Gordon M Shepherd, and Michael L Hines. 2017. Twenty years of ModelDB and beyond: building essential modeling tools for the future of neuroscience.Journal of computational neuroscience42, 1 (2017), 1–10. Conference’17, July 2017, Washington, DC, USA

  45. [46]

    Songzhu Mei, Cong Liu, Qinglin Wang, and Huayou Su. 2022. Model Provenance Management in MLOps Pipeline. InProceedings of the 2022 8th International Conference on Computing and Data Engineering (ICCDE ’22). Association for Computing Machinery, New York, NY, USA, 45–50. https://doi.org/10.1145/ 3512850.3512861

  46. [47]

    Renée J. Miller. 1998. Using Schematically Heterogeneous Structures. InSIGMOD 1998, Proceedings ACM SIGMOD International Conference on Management of Data, June 2-4, 1998, Seattle, Washington, USA, Laura M. Haas and Ashutosh Tiwary (Eds.). ACM Press, 189–200. https://doi.org/10.1145/276304.276322

  47. [48]

    Margaret Mitchell, Simone Wu, Andrew Zaldivar, Parker Barnes, Lucy Vasserman, Ben Hutchinson, Elena Spitzer, Inioluwa Deborah Raji, and Timnit Gebru. 2019. Model Cards for Model Reporting. InProceedings of the Conference on Fairness, Ac- countability, and Transparency (FAT* ’19). Association for Computing Machinery, New York, NY, USA, 220–229. https://doi...

  48. [49]

    Alistair Moffat and Justin Zobel. 2008. Rank-biased precision for measurement of retrieval effectiveness.ACM Transactions on Information Systems (TOIS)27, 1 (2008), 1–27

  49. [50]

    Xin Mu, Yu Wang, Yehong Zhang, Jiaqi Zhang, Hui Wang, Yang Xiang, and Yue Yu. 2023. Model Provenance via Model DNA. arXiv:cs.LG/2308.02121

  50. [51]

    Pu, and Renée J

    Fatemeh Nargesian, Erkang Zhu, Ken Q. Pu, and Renée J. Miller. 2018. Table Union Search on Open Data.Proc. VLDB Endow.11, 7 (2018), 813–825. https: //doi.org/10.14778/3192965.3192973

  51. [52]

    Ani Nenkova and Rebecca J Passonneau. 2004. Evaluating content selection in summarization: The pyramid method. InProceedings of the human language tech- nology conference of the north american chapter of the association for computational linguistics: Hlt-naacl 2004. 145–152

  52. [53]

    Koyena Pal, David Bau, and Renée J. Miller. 2025. Model Lakes. InEDBT. Open- Proceedings.org, 985–995

  53. [54]

    Xiangyu Peng, Can Qin, Zeyuan Chen, Ran Xu, Caiming Xiong, and Chien-Sheng Wu. 2025. Unidoc-bench: A unified benchmark for document-centric multimodal rag.arXiv preprint arXiv:2510.03663(2025)

  54. [55]

    Ronak Pradeep, Nandan Thakur, Shivani Upadhyay, Daniel Campos, Nick Craswell, and Jimmy Lin. 2024. Initial nugget evaluation results for the trec 2024 rag track with the autonuggetizer framework.arXiv preprint arXiv:2411.09607 (2024)

  55. [56]

    Ronak Pradeep, Nandan Thakur, Shivani Upadhyay, Daniel Campos, Nick Craswell, Ian Soboroff, Hoa Trang Dang, and Jimmy Lin. 2025. The great nugget recall: Automating fact extraction and rag evaluation with large language models. InProceedings of the 48th International ACM SIGIR Conference on Research and Development in Information Retrieval. 180–190

  56. [57]

    Furkan Şahinuç, Thy Thy Tran, Yulia Grishina, Yufang Hou, Bei Chen, and Iryna Gurevych. 2024. Efficient performance tracking: Leveraging large language models for automated construction of scientific leaderboards. InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing. 7963–7977

  57. [58]

    Saron Samuel, Andrew Yates, Dawn Lawrie, Ian Soboroff, Trevor Adriaanse, Benjamin Van Durme, and Eugene Yang. 2026. CoverageBench: Evaluating Information Coverage across Tasks and Domains.arXiv preprint arXiv:2603.20034 (2026)

  58. [59]

    2008.Intro- duction to information retrieval

    Hinrich Schütze, Christopher D Manning, and Prabhakar Raghavan. 2008.Intro- duction to information retrieval. Vol. 39. Cambridge University Press Cambridge

  59. [60]

    Yongliang Shen, Kaitao Song, Xu Tan, Dongsheng Li, Weiming Lu, and Yueting Zhuang. 2023. Hugginggpt: Solving ai tasks with chatgpt and its friends in hugging face.Advances in Neural Information Processing Systems36 (2023), 38154–38180

  60. [61]

    Roee Shraga and Renée J. Miller. 2023. Explaining Dataset Changes for Semantic Data Versioning with Explain-Da-V.Proc. VLDB Endow.16, 6 (2023), 1587–1600. https://doi.org/10.14778/3583140.3583169

  61. [62]

    Shruti Singh, Shoaib Alam, Husain Malwat, and Mayank Singh. 2024. Legobench: Scientific leaderboard generation benchmark. InFindings of the Association for Computational Linguistics: EMNLP 2024. 14598–14613

  62. [63]

    The ModelScope Team. 2023. ModelScope: bring the notion of Model-as-a-Service to life. https://github.com/modelscope/modelscope

  63. [64]

    Tajkia Rahman Toma, Balreet Grewal, and Cor-Paul Bezemer. 2025. Answering user questions about machine learning models through standardized model cards. In2025 IEEE/ACM 47th International Conference on Software Engineering (ICSE). IEEE, 1488–1500

  64. [65]

    Ellen M Voorhees et al. 1999. The TREC-8 question answering track report. In Trec, Vol. 99. 77–82

  65. [66]

    Keyu Wang, Abdullah Norozi Iranzad, Scott Schaffter, Doina Precup, and Jonathan Lebensold. 2024. Mitigating Downstream Model Risks via Model Provenance. arXiv:cs.LG/2410.02230 https://arxiv.org/abs/2410.02230

  66. [67]

    Jian Wu, Jiayu Zhang, Dongyuan Li, Linyi Yang, Aoxiao Zhong, Renhe Jiang, Qingsong Wen, and Yue Zhang. 2025. League: Leaderboard Generation on De- mand.arXiv preprint arXiv:2502.18209(2025)

  67. [68]

    Sean Yang, Chris Tensmeyer, and Curtis Wigington. 2022. Telin: Table entity linker for extracting leaderboards from machine learning publications. InProceed- ings of the first Workshop on Information Extraction from Scientific Publications. 20–25

  68. [69]

    Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William Cohen, Ruslan Salakhutdinov, and Christopher D Manning. 2018. HotpotQA: A dataset for diverse, explainable multi-hop question answering. InProceedings of the 2018 conference on empirical methods in natural language processing. 2369–2380

  69. [70]

    ChengXiang Zhai, William W Cohen, and John Lafferty. 2015. Beyond indepen- dent relevance: methods and evaluation metrics for subtopic retrieval. InAcm sigir forum, Vol. 49. ACM New York, NY, USA, 2–9

  70. [71]

    Suifeng Zhao, Zhuoran Jin, Sujian Li, and Jun Gao. 2025. Finragbench-v: A benchmark for multimodal rag with visual citation in the financial domain. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing. 4215–4249

  71. [72]

    LSH Ensemble: Internet-Scale Domain Search

    Erkang Zhu, Fatemeh Nargesian, Ken Q. Pu, and Renée J. Miller. 2016. LSH Ensem- ble: Internet Scale Domain Search.CoRRabs/1603.07410 (2016). arXiv:1603.07410 http://arxiv.org/abs/1603.07410

  72. [73]

    Cai-Nicolas Ziegler, Sean M McNee, Joseph A Konstan, and Georg Lausen. 2005. Improving recommendation lists through topic diversification. InProceedings of the 14th international conference on World Wide Web. 22–32