ArtifactLinker: Linking Scientific Artifacts for Automatic State-of-the-Art Discovery

arxiv: 2605.16902 · v1 · pith:4DIWYDSKnew · submitted 2026-05-16 · 💻 cs.LG

ArtifactLinker: Linking Scientific Artifacts for Automatic State-of-the-Art Discovery

Haofei Yu , Jiaxuan You , Peter Clark , Bodhisattwa Prasad Majumder , Kyle Richardson This is my paper

Pith reviewed 2026-05-19 20:25 UTC · model grok-4.3

classification 💻 cs.LG

keywords artifact graphSOTA discoverymissing link predictiongraph neural networksLLM verification agentsHugging Facemodel-dataset evaluationautomatic benchmarking

0 comments p. Extension

pith:4DIWYDSK Add to your LaTeX paper

What is a Pith Number?

\usepackage{pith}
\pithnumber{4DIWYDSK}

Prints a linked pith:4DIWYDSK badge after your title and writes the identifier into PDF metadata. Compiles on arXiv with no extra files. Learn more

The pith

An artifact graph of models and datasets lets graph methods rank untested performance links to find new SOTA results.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper establishes that scientific artifacts like models and datasets form a graph connected by published evaluation results. Graph neural networks or graph-augmented language models can then rank promising unobserved model-dataset pairs based on that structure. A second stage uses LLM coding agents to run verification experiments on the top-ranked candidates. A sympathetic reader would care because exhaustive manual testing across thousands of models is impossible, so automating discovery from existing data could speed up finding better systems and surfacing new research directions.

Core claim

ArtifactLinker models Hugging Face as an artifact graph with models and datasets as nodes and evaluations as edges. The framework ranks missing links via GNNs or graph-augmented LLMs, then verifies top candidates through LLM-based agents that execute coding experiments. On the new ArtifactBench benchmark containing 14,053 artifacts and 51,337 relations, graph structure proves effective for missing link prediction, and the full pipeline identifies potential SOTA results together with research insights.

What carries the argument

The artifact graph, with models and datasets as nodes and known evaluations as edges, which carries the argument by enabling structure-based ranking of unobserved performance links.

If this is right

Graph structures alone can predict which models will work well on which datasets without direct testing.
Ranking candidates followed by automated verification surfaces new high-performing combinations.
The method scales to large artifact collections such as those hosted on Hugging Face.
End-to-end use generates both SOTA candidates and additional research insights from existing evaluations.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same graph approach could extend to other artifact types such as papers or codebases to link related ideas across fields.
As more evaluations are published the graph becomes denser, which should improve future prediction accuracy without changing the method.
Automated systems could run continuously, monitoring new publications and suggesting the next experiments to run.
Adding node features like model size or architecture details might further strengthen the link predictions beyond pure graph structure.

Load-bearing premise

Existing published evaluations already form a connected graph whose patterns allow accurate ranking of which unobserved model-dataset pairs would perform well.

What would settle it

Experimentally running the top-ranked links and finding that their actual performance falls below several already-published results on the same datasets would show the ranking step does not work.

Figures

Figures reproduced from arXiv: 2605.16902 by Bodhisattwa Prasad Majumder, Haofei Yu, Jiaxuan You, Kyle Richardson, Peter Clark.

**Figure 1.** Figure 1: Artifact graph structure and SOTA discovery task formulation. (a) Example graph. A visualization demonstrating the graph structure, highlighting its inherent sparsity and the significant number of missing links between different artifact types. (b) Node statistics. Detailed breakdown showing the distribution of node counts across different artifact categories. (c) Edge statistics. Breakdown illustrating th… view at source ↗

**Figure 2.** Figure 2: Overview of ARTIFACTLINKER and its evaluation framework. (Left) The twostage rank-and-verify pipeline. A GNN-based ranking model first estimates the ranking score for unobserved model–dataset pairs. The top-ranked candidates are then selected for execution in the verification stage. (Right) Ranking evaluation tasks. We evaluate the system under both transductive (nodes observed during training) and induct… view at source ↗

**Figure 4.** Figure 4: Error distribution of reproduced verification results. We show the error distribution across datasets in our reproduced evaluation. The number after each dataset name denotes the number of evaluated models, and discriminative and generative models are shown separately. Initial Embedding Training Method Graph Structure 0.45 0.50 0.55 0.60 0.65 0.70 S p e a r m a n ( T r a n s d u c tiv e ) Voyage Rand… view at source ↗

**Figure 6.** Figure 6: Degree analysis of attribution prediction results. We ablate on LLMs, LLMs with 1-hop neighborhood context, and GNNbased methods. We split the test set based on the node degrees of the datasets. Gray bars indicate the degree distribution of dataset nodes. 10 0 10 1 10 2 # Models Verified (K) 0.0 0.2 0.4 0.6 0.8 1.0 Best Found / Oracle Random Link head only Attr head only Joint (link × attr) [PITH_FUL… view at source ↗

**Figure 10.** Figure 10: Ablation study on GNN layer numbers (attribute ranking and prediction). GATv2 as the backbone model. MAE is the lower the better while Spearman is the higher the better. 1 2 3 4 5 6 7 8 9 10 11 Rank k 0.1 0.5 1 Sin g ula r v alu e k 1 = 1.47 0.0 0.2 0.4 0.6 0.8 1.0 Cumulative energy 90% [PITH_FULL_IMAGE:figures/full_fig_p010_10.png] view at source ↗

**Figure 12.** Figure 12: Verified accuracy matrix for NLI tasks. We show the accuracy verification results conducted by the ARTIFACTLINKER with 45 models and 12 NLI datasets. We use ST_SE split for RobustNLI. Cells filled with "–" are because these models are two-way pretrained models, while the evaluated datasets are 3-way NLI tasks. Therefore, these models are skipped for typical datasets. Models and datasets details are in App… view at source ↗

**Figure 13.** Figure 13: shows the full-size visualization of all nodes and edges we included in our collected artifact graph. Model (9827) Paper (1702) Codebase (1295) Dataset (1205) [PITH_FULL_IMAGE:figures/full_fig_p016_13.png] view at source ↗

read the original abstract

Scientific artifacts such as models and datasets are foundations for research. With the rapid growth of platforms like HuggingFace, researchers now have access to a large number of artifacts. Yet, a key challenge remains: how can we automatically discover the state-of-the-art (SOTA) model for a given dataset by fully leveraging existing artifacts? We formalize this task as automatic SOTA discovery by modeling HuggingFace as an artifact graph, where nodes are models/datasets and edges represent evaluations. We propose ArtifactLinker, a two-stage framework: (1) ranking promising unobserved model--dataset links using Graph Neural Networks (GNNs) or graph-augmented Large Language Models (LLMs), and (2) verifying top-ranked links via coding experiments with LLM-based agents. We further introduce a benchmark named ArtifactBench with 14,053 artifacts and 51,337 relations to evaluate the performance of both stages. Results show that (1) graph structures between existing artifacts are effective for missing link prediction; (2) end-to-end ranking and verification with ArtifactLinker help discover potential SOTA results and research insights.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript presents ArtifactLinker, a two-stage framework for automatic state-of-the-art (SOTA) discovery. It models Hugging Face as an artifact graph with models and datasets as nodes and evaluations as edges. Stage 1 ranks unobserved model-dataset links via GNNs or graph-augmented LLMs for missing-link prediction. Stage 2 verifies top-ranked links by having LLM-based coding agents execute experiments. The authors introduce ArtifactBench (14,053 artifacts, 51,337 relations) to evaluate both stages and claim that graph structure aids link prediction while the end-to-end pipeline surfaces potential SOTA results and insights.

Significance. If the central claims hold after addressing verification reliability, the work could meaningfully advance automated SOTA identification by exploiting existing evaluation graphs rather than exhaustive search. ArtifactBench is a concrete, reusable contribution that enables standardized testing of artifact-link prediction methods. The graph-based ranking approach is a natural and defensible extension of link-prediction techniques to this domain.

major comments (2)

[Verification stage (method description following ranking)] The headline claim that 'end-to-end ranking and verification with ArtifactLinker help discover potential SOTA results' rests on the verification stage, yet no quantitative evaluation of LLM-agent accuracy (error rates, code-correctness metrics, agreement with human-run benchmarks, or sensitivity to metric-implementation errors) is reported. This is load-bearing because systematic over- or under-estimation by the agents would invalidate the discovered SOTA links.
[Abstract and experimental results section] The abstract asserts that 'results show that graph structures between existing artifacts are effective for missing link prediction,' but supplies no concrete metrics, baselines, ablation controls, or error bars. The full experimental section must include these quantities (e.g., AUC, Hits@K, comparison to non-graph baselines) on ArtifactBench to substantiate the effectiveness claim.

minor comments (2)

[Benchmark construction] Clarify the exact definition of 'unobserved' links and how the train/validation/test splits on ArtifactBench avoid leakage from the same model or dataset families.
[Abstract] The abstract would benefit from a one-sentence statement of the strongest quantitative result (e.g., 'GNN achieves X% Hits@10 on ArtifactBench') to give readers immediate context.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback. We address each major comment below and have revised the manuscript to strengthen the presentation of results and the verification stage.

read point-by-point responses

Referee: [Verification stage (method description following ranking)] The headline claim that 'end-to-end ranking and verification with ArtifactLinker help discover potential SOTA results' rests on the verification stage, yet no quantitative evaluation of LLM-agent accuracy (error rates, code-correctness metrics, agreement with human-run benchmarks, or sensitivity to metric-implementation errors) is reported. This is load-bearing because systematic over- or under-estimation by the agents would invalidate the discovered SOTA links.

Authors: We agree this is a substantive point. The original manuscript emphasized qualitative case studies of discovered SOTA results and insights from the verification stage. To address the concern directly, we have added a new quantitative evaluation subsection that measures LLM-agent accuracy on a held-out set of known model-dataset links. This includes code-execution success rates, agreement with human-run benchmark scores, and an analysis of sensitivity to common metric-implementation variations. These additions are now reported in the revised experimental section. revision: yes
Referee: [Abstract and experimental results section] The abstract asserts that 'results show that graph structures between existing artifacts are effective for missing link prediction,' but supplies no concrete metrics, baselines, ablation controls, or error bars. The full experimental section must include these quantities (e.g., AUC, Hits@K, comparison to non-graph baselines) on ArtifactBench to substantiate the effectiveness claim.

Authors: The full experimental section already reports AUC-ROC, Hits@K (including Hits@10), direct comparisons to non-graph baselines (random, MLP, and collaborative filtering), graph-structure ablations, and error bars from five independent runs on ArtifactBench. However, we acknowledge that the abstract remained too high-level. We have revised the abstract to include key quantitative results (e.g., 'GNN link prediction achieves AUC 0.87 and Hits@10 of 0.62, outperforming non-graph baselines by 12-18%') while preserving brevity. This makes the effectiveness claim concrete and directly supported by the experiments. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical pipeline with new benchmark and standard link-prediction methods

full rationale

The paper introduces ArtifactBench as a new graph of artifacts and evaluations, then applies standard GNN or graph-augmented LLM link prediction followed by LLM-agent verification. No equations, fitted parameters, or self-citations are shown to reduce the claimed link-prediction performance or SOTA discoveries to quantities defined or fitted on the same evaluation data. The central results are presented as experimental outcomes on held-out or unobserved links within the introduced benchmark, keeping the derivation chain independent of its own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 2 invented entities

Only the abstract is available, so the ledger is necessarily incomplete; the central claim rests on the untested assumption that graph structure carries predictive signal for unobserved evaluations.

axioms (1)

domain assumption Graph neural networks or graph-augmented LLMs can rank unobserved model-dataset links from existing evaluation edges.
Invoked in the first stage of ArtifactLinker.

invented entities (2)

Artifact graph no independent evidence
purpose: Represent models, datasets, and their evaluations as nodes and edges for link prediction.
Core modeling choice introduced in the paper.
ArtifactBench no independent evidence
purpose: Benchmark dataset for evaluating ranking and verification stages.
New resource released with the paper.

pith-pipeline@v0.9.0 · 5742 in / 1318 out tokens · 45494 ms · 2026-05-19T20:25:21.390548+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

We formalize this task as automatic SOTA discovery by modeling HuggingFace as an artifact graph, where nodes are models/datasets and edges represent evaluations. We propose ArtifactLinker, a two-stage framework: (1) ranking promising unobserved model–dataset links using Graph Neural Networks (GNNs) or graph-augmented Large Language Models (LLMs), and (2) verifying top-ranked links via coding experiments with LLM-based agents.
IndisputableMonolith/Foundation/BranchSelection.lean branch_selection unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

The ranking stage addresses the search volume by using graph-based priors to prune the vast majority of unlikely links... ˆf(m,d)=P(Smd=1)·E[Ymd|Smd=1]

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

33 extracted references · 33 canonical work pages · 5 internal anchors

[1]

On the suitability of hug- ging face hub for empirical studies.ArXiv, abs/2307.14841,

Adem Ait, Javier Luis Cánovas Izquierdo, and Jordi Cabot. On the suitability of hug- ging face hub for empirical studies.ArXiv, abs/2307.14841,

work page arXiv
[2]

semanticscholar.org/CorpusId:260203268

URL https://api. semanticscholar.org/CorpusId:260203268. Joeran Beel, Min-Yen Kan, and Moritz Baumgart. Evaluating sakana’s ai scientist for autonomous research: Wishful thinking or an emerging reality towards ’artificial research intelligence’ (ari)?ArXiv, abs/2502.14297,

work page arXiv
[3]

A large annotated corpus for learning natural language inference

Samuel Bowman, Gabor Angeli, Christopher Potts, and Christopher D Manning. A large annotated corpus for learning natural language inference. InProceedings of the 2015 conference on empirical methods in natural language processing, pp. 632–642,

work page 2015
[4]

Analyzing the evolution and maintenance of ml models on hugging face.2024 IEEE/ACM 21st International Conference on Mining Software Repositories (MSR), pp

Joel Castaño, Silverio Martínez-Fernández, Xavier Franch, and Justus Bogner. Analyzing the evolution and maintenance of ml models on hugging face.2024 IEEE/ACM 21st International Conference on Mining Software Repositories (MSR), pp. 607–618,

work page 2024
[5]

Joel Castaño, Rafael Cabañas, Antonio Salmer’on, David Lo, and Silverio Mart’inez- Fern’andez

URL https://api.semanticscholar.org/CorpusId:265351447. Joel Castaño, Rafael Cabañas, Antonio Salmer’on, David Lo, and Silverio Mart’inez- Fern’andez. How do machine learning models change?ArXiv, abs/2411.09645,

work page arXiv
[6]

Benjamin Paul Chamberlain, Sergey Shirobokov, Emanuele Rossi, Fabrizio Frasca, Thomas Markovich, Nils Hammerla, Michael M Bronstein, and Max Hansmire

URLhttps://api.semanticscholar.org/CorpusId:274023512. Benjamin Paul Chamberlain, Sergey Shirobokov, Emanuele Rossi, Fabrizio Frasca, Thomas Markovich, Nils Hammerla, Michael M Bronstein, and Max Hansmire. Graph neural networks for link prediction with subgraph sketching.arXiv preprint arXiv:2209.15486,

work page arXiv
[7]

Xnli: Evaluating cross-lingual sentence representations

Alexis Conneau, Ruty Rinott, Guillaume Lample, Adina Williams, Samuel Bowman, Holger Schwenk, and Veselin Stoyanov. Xnli: Evaluating cross-lingual sentence representations. InProceedings of the 2018 conference on empirical methods in natural language processing, pp. 2475–2485,

work page 2018
[8]

Horne, Gillian R

Ned Cooper, Tiffanie N. Horne, Gillian R. Hayes, Courtney Heldreth, Michal Lahav, Jess Holbrook, and Lauren Wilcox. A systematic review and thematic analysis of community- collaborative approaches to computing research.Proceedings of the 2022 CHI Conference on Human Factors in Computing Systems,

work page 2022
[9]

On the origin of llms: An evolutionary tree and graph for 15, 821 large language models.ArXiv, abs/2307.09793,

Sarah Gao and Andrew Gao. On the origin of llms: An evolutionary tree and graph for 15, 821 large language models.ArXiv, abs/2307.09793,

work page arXiv
[10]

DISCOVERYWORLD: A virtual environment for developing and evaluating automated scientific discovery agents.arXiv preprint arXiv:2406.06769,

URL https://api.semanticscholar.org/CorpusId:220070385. Peter Alexander Jansen, Marc-Alexandre Côté, Tushar Khot, Erin Bransom, Bhavana Dalvi, Bodhisattwa Prasad Majumder, Oyvind Tafjord, and Peter Clark. Discoveryworld: A virtual environment for developing and evaluating automated scientific discovery agents.ArXiv, abs/2406.06769,

work page arXiv
[11]

Peter Alexander Jansen, Oyvind Tafjord, Marissa Radensky, Pao Siangliulue, Tom Hope, Bhavana Dalvi, Bodhisattwa Prasad Majumder, D

URL https://api.semanticscholar.org/CorpusId: 270380311. Peter Alexander Jansen, Oyvind Tafjord, Marissa Radensky, Pao Siangliulue, Tom Hope, Bhavana Dalvi, Bodhisattwa Prasad Majumder, D. S. Weld, and Peter Clark. Codescientist: End-to-end semi-automated scientific discovery with code-based experimentation.ArXiv, abs/2503.22708,

work page arXiv
[12]

Gyeongwon James Kim, Alex Wilf, Louis philippe Morency, and Daniel Fried

URL https: //api.semanticscholar.org/CorpusId:199001020. Gyeongwon James Kim, Alex Wilf, Louis philippe Morency, and Daniel Fried. From repro- duction to replication: Evaluating research agents with progressive code masking.ArXiv, abs/2506.19724,

work page arXiv
[13]

Adam: A Method for Stochastic Optimization

URLhttps://api.semanticscholar.org/CorpusId:280000499. Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization.arXiv preprint arXiv:1412.6980,

work page internal anchor Pith review Pith/arXiv arXiv
[14]

Anatomy of a machine learning ecosystem: 2 million models on hugging face.ArXiv, abs/2508.06811,

Benjamin Laufer, Hamidah Oderinwale, and Jon Kleinberg. Anatomy of a machine learning ecosystem: 2 million models on hugging face.ArXiv, abs/2508.06811,

work page arXiv
[15]

The AI Scientist: Towards Fully Automated Open-Ended Scientific Discovery

URL https://api.semanticscholar.org/CorpusId: 234357246. Chris Lu, Cong Lu, R. T. Lange, J. Foerster, Jeff Clune, and David Ha. The ai scientist: Towards fully automated open-ended scientific discovery.ArXiv, abs/2408.06292,

work page internal anchor Pith review Pith/arXiv arXiv
[16]

URLhttps://api.semanticscholar.org/CorpusId:271854887. T. Mari´ c, Dennis Gläser, Jan-Patrick Lehr, Ioannis Papagiannidis, B. Lambie, Christian H. Bischof, and Dieter Bothe. A pragmatic workflow for research software engineering in computational science.ArXiv, abs/2310.00960,

work page arXiv
[17]

org/CorpusId:263605602

URL https://api.semanticscholar. org/CorpusId:263605602. Santiago Miret and Nandan M Krishnan. Are llms ready for real-world materials discovery? arXiv preprint arXiv:2402.05200,

work page arXiv
[18]

Accessed: 2026-01-06

URL https://openai.com/index/ introducing-gpt-5-2/. Accessed: 2026-01-06. Mohammad Shahedur Rahman, Peng Gao, and Yuede Ji. Hugginggraph: Understanding the supply chain of llm ecosystem.ArXiv, abs/2507.14240,

work page arXiv 2026
[19]

Paper2code: Automating code generation from scientific papers in machine learning.ArXiv, abs/2504.17192,

Minju Seo, Jinheon Baek, Seongyun Lee, and Sung Ju Hwang. Paper2code: Automating code generation from scientific papers in machine learning.ArXiv, abs/2504.17192,

work page arXiv
[20]

Yongliang Shen, Kaitao Song, Xu Tan, Wenqi Zhang, Kan Ren, Siyu Yuan, Weiming Lu, Dongsheng Li, and Y

URLhttps://api.semanticscholar.org/CorpusId:273366814. Yongliang Shen, Kaitao Song, Xu Tan, Wenqi Zhang, Kan Ren, Siyu Yuan, Weiming Lu, Dongsheng Li, and Y. Zhuang. Taskbench: Benchmarking large language models for task automation.ArXiv, abs/2311.18760,

work page arXiv
[21]

Zachary S

URL https://api.semanticscholar.org/ CorpusId:265506220. Zachary S. Siegel, Sayash Kapoor, Nitya Nagdir, Benedikt Stroebl, and Arvind Narayanan. Core-bench: Fostering the credibility of published research through a computational reproducibility agent benchmark.Trans. Mach. Learn. Res., 2024,

work page 2024
[22]

Kanishka Silva, Marcel R

URL https: //arxiv.org/pdf/2409.11363.pdf. Kanishka Silva, Marcel R. Ackermann, Heike Fliegl, G. Gesese, Fidan Limani, Philipp Mayr, Peter Mutschke, A. Oelen, Muhammad Asif Suryani, Sharmila Upadhyaya, Benjamin Zapilko, Harald Sack, and Stefan Dietze. Research knowledge graphs in nfdi4datascience: Key activities, achievements, and future directions.ArXiv,...

work page arXiv
[23]

PaperBench: Evaluating AI's Ability to Replicate AI Research

URL https://api.semanticscholar.org/CorpusId:280421789. Giulio Starace, Oliver Jaffe, Dane Sherburn, James Aung, Jun Shern Chan, Leon Maksin, Rachel Dias, Evan Mays, Benjamin Kinsella, Wyatt Thompson, et al. Paperbench: Evalu- ating ai’s ability to replicate ai research.arXiv preprint arXiv:2504.01848,

work page internal anchor Pith review Pith/arXiv arXiv
[24]

Testing the generalization power of neural net- work models across nli benchmarks

Aarne Talman and Stergios Chatzikyriakidis. Testing the generalization power of neural net- work models across nli benchmarks. InProceedings of the 2019 ACL workshop BlackboxNLP: Analyzing and interpreting neural networks for NLP, pp. 85–94,

work page 2019
[25]

Graph Attention Networks

URL https://api.semanticscholar.org/CorpusId:250048789. Petar Veliˇ ckovi´ c, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Lio, and Yoshua Bengio. Graph attention networks.arXiv preprint arXiv:1710.10903,

work page internal anchor Pith review Pith/arXiv arXiv
[26]

Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel Bowman

URL https://api.semanticscholar.org/ CorpusId:267570328. Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel Bowman. Glue: A multi-task benchmark and analysis platform for natural language understanding. InProceedings of the 2018 EMNLP workshop BlackboxNLP: Analyzing and interpreting neural networks for NLP, pp. 353–355,

work page 2018
[27]

Neural common neighbor with comple- tion for link prediction.arXiv preprint arXiv:2302.00890,

Xiyuan Wang, Haotong Yang, and Muhan Zhang. Neural common neighbor with comple- tion for link prediction.arXiv preprint arXiv:2302.00890,

work page arXiv
[28]

A broad-coverage challenge corpus for sentence understanding through inference

Adina Williams, Nikita Nangia, and Samuel Bowman. A broad-coverage challenge corpus for sentence understanding through inference. InProceedings of the 2018 conference of the North American chapter of the association for computational linguistics: human language technologies, volume 1 (long papers), pp. 1112–1122,

work page 2018
[29]

Scireplicate-bench: Benchmarking llms in agent-driven algorithmic reproduction from research papers.ArXiv, abs/2504.00255,

Yanzheng Xiang, Hanqi Yan, Shuyin Ouyang, Lin Gui, and Yulan He. Scireplicate-bench: Benchmarking llms in agent-driven algorithmic reproduction from research papers.ArXiv, abs/2504.00255,

work page arXiv
[30]

Qwen3 Technical Report

An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report.arXiv preprint arXiv:2505.09388,

work page internal anchor Pith review Pith/arXiv arXiv
[31]

Navigating dataset documentations in ai: A large-scale analysis of dataset cards on hugging face.ArXiv, abs/2401.13822,

Xinyu Yang, Weixin Liang, and James Zou. Navigating dataset documentations in ai: A large-scale analysis of dataset cards on hugging face.ArXiv, abs/2401.13822,

work page arXiv
[32]

Haofei Yu, Keyang Xuan, Fenghai Li, Kunlun Zhu, Zijie Lei, Jiaxun Zhang, Ziheng Qi, Kyle Richardson, and Jiaxuan You

URLhttps://doi.org/10.1038/s41392-022-00994-0. Haofei Yu, Keyang Xuan, Fenghai Li, Kunlun Zhu, Zijie Lei, Jiaxun Zhang, Ziheng Qi, Kyle Richardson, and Jiaxuan You. Tinyscientist: An interactive, extensible, and controllable framework for building research agents. InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: Sy...

work page doi:10.1038/s41392-022-00994-0 2025
[33]

Model (9827) Paper (1702) Codebase (1295) Dataset (1205) Figure 13:Visualization of collected artifact graph

15 Preprint A Visualization of Artifact Graph Figure 13 shows the full-size visualization of all nodes and edges we included in our collected artifact graph. Model (9827) Paper (1702) Codebase (1295) Dataset (1205) Figure 13:Visualization of collected artifact graph. B Limitations Computational considerations in verificationWhile our two-stage framework e...

work page 2025

[1] [1]

On the suitability of hug- ging face hub for empirical studies.ArXiv, abs/2307.14841,

Adem Ait, Javier Luis Cánovas Izquierdo, and Jordi Cabot. On the suitability of hug- ging face hub for empirical studies.ArXiv, abs/2307.14841,

work page arXiv

[2] [2]

semanticscholar.org/CorpusId:260203268

URL https://api. semanticscholar.org/CorpusId:260203268. Joeran Beel, Min-Yen Kan, and Moritz Baumgart. Evaluating sakana’s ai scientist for autonomous research: Wishful thinking or an emerging reality towards ’artificial research intelligence’ (ari)?ArXiv, abs/2502.14297,

work page arXiv

[3] [3]

A large annotated corpus for learning natural language inference

Samuel Bowman, Gabor Angeli, Christopher Potts, and Christopher D Manning. A large annotated corpus for learning natural language inference. InProceedings of the 2015 conference on empirical methods in natural language processing, pp. 632–642,

work page 2015

[4] [4]

Analyzing the evolution and maintenance of ml models on hugging face.2024 IEEE/ACM 21st International Conference on Mining Software Repositories (MSR), pp

Joel Castaño, Silverio Martínez-Fernández, Xavier Franch, and Justus Bogner. Analyzing the evolution and maintenance of ml models on hugging face.2024 IEEE/ACM 21st International Conference on Mining Software Repositories (MSR), pp. 607–618,

work page 2024

[5] [5]

Joel Castaño, Rafael Cabañas, Antonio Salmer’on, David Lo, and Silverio Mart’inez- Fern’andez

URL https://api.semanticscholar.org/CorpusId:265351447. Joel Castaño, Rafael Cabañas, Antonio Salmer’on, David Lo, and Silverio Mart’inez- Fern’andez. How do machine learning models change?ArXiv, abs/2411.09645,

work page arXiv

[6] [6]

Benjamin Paul Chamberlain, Sergey Shirobokov, Emanuele Rossi, Fabrizio Frasca, Thomas Markovich, Nils Hammerla, Michael M Bronstein, and Max Hansmire

URLhttps://api.semanticscholar.org/CorpusId:274023512. Benjamin Paul Chamberlain, Sergey Shirobokov, Emanuele Rossi, Fabrizio Frasca, Thomas Markovich, Nils Hammerla, Michael M Bronstein, and Max Hansmire. Graph neural networks for link prediction with subgraph sketching.arXiv preprint arXiv:2209.15486,

work page arXiv

[7] [7]

Xnli: Evaluating cross-lingual sentence representations

Alexis Conneau, Ruty Rinott, Guillaume Lample, Adina Williams, Samuel Bowman, Holger Schwenk, and Veselin Stoyanov. Xnli: Evaluating cross-lingual sentence representations. InProceedings of the 2018 conference on empirical methods in natural language processing, pp. 2475–2485,

work page 2018

[8] [8]

Horne, Gillian R

Ned Cooper, Tiffanie N. Horne, Gillian R. Hayes, Courtney Heldreth, Michal Lahav, Jess Holbrook, and Lauren Wilcox. A systematic review and thematic analysis of community- collaborative approaches to computing research.Proceedings of the 2022 CHI Conference on Human Factors in Computing Systems,

work page 2022

[9] [9]

On the origin of llms: An evolutionary tree and graph for 15, 821 large language models.ArXiv, abs/2307.09793,

Sarah Gao and Andrew Gao. On the origin of llms: An evolutionary tree and graph for 15, 821 large language models.ArXiv, abs/2307.09793,

work page arXiv

[10] [10]

DISCOVERYWORLD: A virtual environment for developing and evaluating automated scientific discovery agents.arXiv preprint arXiv:2406.06769,

URL https://api.semanticscholar.org/CorpusId:220070385. Peter Alexander Jansen, Marc-Alexandre Côté, Tushar Khot, Erin Bransom, Bhavana Dalvi, Bodhisattwa Prasad Majumder, Oyvind Tafjord, and Peter Clark. Discoveryworld: A virtual environment for developing and evaluating automated scientific discovery agents.ArXiv, abs/2406.06769,

work page arXiv

[11] [11]

Peter Alexander Jansen, Oyvind Tafjord, Marissa Radensky, Pao Siangliulue, Tom Hope, Bhavana Dalvi, Bodhisattwa Prasad Majumder, D

URL https://api.semanticscholar.org/CorpusId: 270380311. Peter Alexander Jansen, Oyvind Tafjord, Marissa Radensky, Pao Siangliulue, Tom Hope, Bhavana Dalvi, Bodhisattwa Prasad Majumder, D. S. Weld, and Peter Clark. Codescientist: End-to-end semi-automated scientific discovery with code-based experimentation.ArXiv, abs/2503.22708,

work page arXiv

[12] [12]

Gyeongwon James Kim, Alex Wilf, Louis philippe Morency, and Daniel Fried

URL https: //api.semanticscholar.org/CorpusId:199001020. Gyeongwon James Kim, Alex Wilf, Louis philippe Morency, and Daniel Fried. From repro- duction to replication: Evaluating research agents with progressive code masking.ArXiv, abs/2506.19724,

work page arXiv

[13] [13]

Adam: A Method for Stochastic Optimization

URLhttps://api.semanticscholar.org/CorpusId:280000499. Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization.arXiv preprint arXiv:1412.6980,

work page internal anchor Pith review Pith/arXiv arXiv

[14] [14]

Anatomy of a machine learning ecosystem: 2 million models on hugging face.ArXiv, abs/2508.06811,

Benjamin Laufer, Hamidah Oderinwale, and Jon Kleinberg. Anatomy of a machine learning ecosystem: 2 million models on hugging face.ArXiv, abs/2508.06811,

work page arXiv

[15] [15]

The AI Scientist: Towards Fully Automated Open-Ended Scientific Discovery

URL https://api.semanticscholar.org/CorpusId: 234357246. Chris Lu, Cong Lu, R. T. Lange, J. Foerster, Jeff Clune, and David Ha. The ai scientist: Towards fully automated open-ended scientific discovery.ArXiv, abs/2408.06292,

work page internal anchor Pith review Pith/arXiv arXiv

[16] [16]

URLhttps://api.semanticscholar.org/CorpusId:271854887. T. Mari´ c, Dennis Gläser, Jan-Patrick Lehr, Ioannis Papagiannidis, B. Lambie, Christian H. Bischof, and Dieter Bothe. A pragmatic workflow for research software engineering in computational science.ArXiv, abs/2310.00960,

work page arXiv

[17] [17]

org/CorpusId:263605602

URL https://api.semanticscholar. org/CorpusId:263605602. Santiago Miret and Nandan M Krishnan. Are llms ready for real-world materials discovery? arXiv preprint arXiv:2402.05200,

work page arXiv

[18] [18]

Accessed: 2026-01-06

URL https://openai.com/index/ introducing-gpt-5-2/. Accessed: 2026-01-06. Mohammad Shahedur Rahman, Peng Gao, and Yuede Ji. Hugginggraph: Understanding the supply chain of llm ecosystem.ArXiv, abs/2507.14240,

work page arXiv 2026

[19] [19]

Paper2code: Automating code generation from scientific papers in machine learning.ArXiv, abs/2504.17192,

Minju Seo, Jinheon Baek, Seongyun Lee, and Sung Ju Hwang. Paper2code: Automating code generation from scientific papers in machine learning.ArXiv, abs/2504.17192,

work page arXiv

[20] [20]

Yongliang Shen, Kaitao Song, Xu Tan, Wenqi Zhang, Kan Ren, Siyu Yuan, Weiming Lu, Dongsheng Li, and Y

URLhttps://api.semanticscholar.org/CorpusId:273366814. Yongliang Shen, Kaitao Song, Xu Tan, Wenqi Zhang, Kan Ren, Siyu Yuan, Weiming Lu, Dongsheng Li, and Y. Zhuang. Taskbench: Benchmarking large language models for task automation.ArXiv, abs/2311.18760,

work page arXiv

[21] [21]

Zachary S

URL https://api.semanticscholar.org/ CorpusId:265506220. Zachary S. Siegel, Sayash Kapoor, Nitya Nagdir, Benedikt Stroebl, and Arvind Narayanan. Core-bench: Fostering the credibility of published research through a computational reproducibility agent benchmark.Trans. Mach. Learn. Res., 2024,

work page 2024

[22] [22]

Kanishka Silva, Marcel R

URL https: //arxiv.org/pdf/2409.11363.pdf. Kanishka Silva, Marcel R. Ackermann, Heike Fliegl, G. Gesese, Fidan Limani, Philipp Mayr, Peter Mutschke, A. Oelen, Muhammad Asif Suryani, Sharmila Upadhyaya, Benjamin Zapilko, Harald Sack, and Stefan Dietze. Research knowledge graphs in nfdi4datascience: Key activities, achievements, and future directions.ArXiv,...

work page arXiv

[23] [23]

PaperBench: Evaluating AI's Ability to Replicate AI Research

URL https://api.semanticscholar.org/CorpusId:280421789. Giulio Starace, Oliver Jaffe, Dane Sherburn, James Aung, Jun Shern Chan, Leon Maksin, Rachel Dias, Evan Mays, Benjamin Kinsella, Wyatt Thompson, et al. Paperbench: Evalu- ating ai’s ability to replicate ai research.arXiv preprint arXiv:2504.01848,

work page internal anchor Pith review Pith/arXiv arXiv

[24] [24]

Testing the generalization power of neural net- work models across nli benchmarks

Aarne Talman and Stergios Chatzikyriakidis. Testing the generalization power of neural net- work models across nli benchmarks. InProceedings of the 2019 ACL workshop BlackboxNLP: Analyzing and interpreting neural networks for NLP, pp. 85–94,

work page 2019

[25] [25]

Graph Attention Networks

URL https://api.semanticscholar.org/CorpusId:250048789. Petar Veliˇ ckovi´ c, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Lio, and Yoshua Bengio. Graph attention networks.arXiv preprint arXiv:1710.10903,

work page internal anchor Pith review Pith/arXiv arXiv

[26] [26]

Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel Bowman

URL https://api.semanticscholar.org/ CorpusId:267570328. Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel Bowman. Glue: A multi-task benchmark and analysis platform for natural language understanding. InProceedings of the 2018 EMNLP workshop BlackboxNLP: Analyzing and interpreting neural networks for NLP, pp. 353–355,

work page 2018

[27] [27]

Neural common neighbor with comple- tion for link prediction.arXiv preprint arXiv:2302.00890,

Xiyuan Wang, Haotong Yang, and Muhan Zhang. Neural common neighbor with comple- tion for link prediction.arXiv preprint arXiv:2302.00890,

work page arXiv

[28] [28]

A broad-coverage challenge corpus for sentence understanding through inference

Adina Williams, Nikita Nangia, and Samuel Bowman. A broad-coverage challenge corpus for sentence understanding through inference. InProceedings of the 2018 conference of the North American chapter of the association for computational linguistics: human language technologies, volume 1 (long papers), pp. 1112–1122,

work page 2018

[29] [29]

Scireplicate-bench: Benchmarking llms in agent-driven algorithmic reproduction from research papers.ArXiv, abs/2504.00255,

Yanzheng Xiang, Hanqi Yan, Shuyin Ouyang, Lin Gui, and Yulan He. Scireplicate-bench: Benchmarking llms in agent-driven algorithmic reproduction from research papers.ArXiv, abs/2504.00255,

work page arXiv

[30] [30]

Qwen3 Technical Report

An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report.arXiv preprint arXiv:2505.09388,

work page internal anchor Pith review Pith/arXiv arXiv

[31] [31]

Navigating dataset documentations in ai: A large-scale analysis of dataset cards on hugging face.ArXiv, abs/2401.13822,

Xinyu Yang, Weixin Liang, and James Zou. Navigating dataset documentations in ai: A large-scale analysis of dataset cards on hugging face.ArXiv, abs/2401.13822,

work page arXiv

[32] [32]

Haofei Yu, Keyang Xuan, Fenghai Li, Kunlun Zhu, Zijie Lei, Jiaxun Zhang, Ziheng Qi, Kyle Richardson, and Jiaxuan You

URLhttps://doi.org/10.1038/s41392-022-00994-0. Haofei Yu, Keyang Xuan, Fenghai Li, Kunlun Zhu, Zijie Lei, Jiaxun Zhang, Ziheng Qi, Kyle Richardson, and Jiaxuan You. Tinyscientist: An interactive, extensible, and controllable framework for building research agents. InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: Sy...

work page doi:10.1038/s41392-022-00994-0 2025

[33] [33]

Model (9827) Paper (1702) Codebase (1295) Dataset (1205) Figure 13:Visualization of collected artifact graph

15 Preprint A Visualization of Artifact Graph Figure 13 shows the full-size visualization of all nodes and edges we included in our collected artifact graph. Model (9827) Paper (1702) Codebase (1295) Dataset (1205) Figure 13:Visualization of collected artifact graph. B Limitations Computational considerations in verificationWhile our two-stage framework e...

work page 2025