
arxiv: 2606.18237 · v1 · pith:UPRPS3TTnew · submitted 2026-06-16 · 💻 cs.CL · cs.AI · cs.LG

ReproRepo: Scaling Reproducibility Audits with GitHub Repository Issues

Pith reviewed 2026-06-27 00:58 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.LG
keywords reproducibility · LLM agents · GitHub issues · machine learning papers · reproduction blockers · scalable evaluation · auditing

The pith

LLM agents surface at least one human-reported reproducibility blocker for roughly 90 percent of machine learning papers from paper and repository text alone.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents ReproRepo as a framework that treats existing GitHub issues on paper repositories as ready-made labels for real reproduction problems. This replaces the manual curation required by prior benchmarks and lets evaluation reach 1,149 recent machine learning papers. Four frontier agent setups are tested; the strongest (Codex with GPT-5.5) matches at least one semantically related human issue in about 90 percent of cases even though the agents never execute code. Agents prove good at locating visible failures and the right semantic area but weaker at pinning down exact details. The same framework can be reused to keep testing new agents on fresh papers.

Core claim

ReproRepo treats human-raised GitHub issues on paper repositories as naturally occurring supervision that marks genuine reproduction blockers. On a corpus of 1,149 recent machine learning papers, LLM agents that receive only the paper text and repository contents (no code execution) identify at least one semantically related blocker for approximately 90 percent of the papers, with the Codex-plus-GPT-5.5 configuration performing best. The agents are especially reliable at surfacing visible failures and identifying the correct semantic region yet remain limited in exact localization.
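The headline number is a paper-level rate: a paper counts as a hit if at least one agent finding was judged semantically related to a human-raised issue. A minimal sketch of that aggregation follows. The function name, the input layout (one list of boolean judgments per paper), and the example data are assumptions made for illustration; they are not taken from the ReproRepo code.

```python
from typing import Dict, List


def paper_level_hit_rate(matches: Dict[str, List[bool]]) -> float:
    """Fraction of papers with at least one finding judged related to a human issue.

    `matches` maps a paper id to the relatedness judgments for that paper
    (hypothetical layout; how the judgments are produced is out of scope here).
    """
    if not matches:
        raise ValueError("no papers to score")
    hits = sum(1 for flags in matches.values() if any(flags))
    return hits / len(matches)


# Hypothetical judgments for four papers.
example = {
    "paper_a": [False, True, False],  # one related finding -> hit
    "paper_b": [False, False],        # nothing related -> miss
    "paper_c": [True],                # hit
    "paper_d": [],                    # agent reported nothing -> miss
}
print(paper_level_hit_rate(example))  # 0.5
```

Because a single related finding is enough, this rate says nothing about how many issues per paper were recovered or how precisely each was localized, which is the distinction the claim itself draws.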

What carries the argument

ReproRepo, the framework that converts human-raised GitHub issues into scalable, naturally occurring labels for evaluating LLM agents on paper-repository pairs.

If this is right

  • Reproducibility checks can be run at the scale of thousands of papers using only existing issue data.
  • LLM agents supply a practical first filter that catches most visible blockers before any code is run.
  • Evaluation effort can shift from labeling new examples to refining how agents localize issues more precisely.
  • ReproRepo itself becomes a reusable testbed for comparing future agent versions on the same real-world task.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Combining the current text-only agents with lightweight code-execution steps could close the remaining gap in exact localization.
  • The same GitHub-issue approach might transfer to other fields that maintain public code repositories with issue trackers.
  • Patterns in the issues that agents consistently miss could guide targeted improvements in agent prompting or retrieval.
  • Over time the growing set of agent outputs could itself become a dataset for training more specialized reproduction checkers.

Load-bearing premise

Human-raised GitHub issues accurately represent the true reproducibility blockers, and semantic relatedness between an agent's output and those issues is a sufficient signal that the agent has found the problem.

What would settle it

An independent expert review on a fresh sample of papers showing that the issues agents flag are not the actual blockers that prevent reproduction, or a new run on held-out papers where the semantic-match rate falls well below 90 percent.
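One way to make "well below 90 percent" concrete is to put a confidence interval around the match rate observed on the held-out papers and check whether its upper bound stays under 0.90. The sketch below uses the Wilson score interval for a binomial proportion. The choice of interval, the 95 percent level, and the example counts are assumptions for illustration; the paper does not specify such a test.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0 <= successes <= n:
        raise ValueError("successes must lie between 0 and n")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


# Hypothetical held-out run: 150 of 200 papers have at least one related match.
low, high = wilson_interval(150, 200)
falls_below_claim = high < 0.90  # the whole interval sits under the claimed rate
print(falls_below_claim)
```

If the interval still overlaps 0.90, the held-out run would not by itself contradict the reported figure.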

Original abstract

Reproducing research results from papers and released code is central to scientific progress. Existing works have introduced benchmarks to evaluate whether LLM agents can assist with reproducibility, but they are difficult to scale due to their reliance on substantial manual effort for data curation and evaluation. We introduce ReproRepo, a scalable framework for reproducibility evaluation that leverages human-raised GitHub issues as naturally occurring supervision on realistic reproduction blockers. We instantiate ReproRepo on 1,149 recent machine learning papers from major conferences and evaluate four frontier model-agent configurations. Our results show that LLM agents, even without executing code, can identify many real-world reproducibility problems from paper-repository pairs: the best agent in our study, namely Codex with GPT-5.5, surfaces at least one semantically related human-reported blocker for ~90% of papers in the study. Further analysis shows that agents are particularly effective for surfacing visible failures and identifying the right semantic region, but may still be insufficient in exact localization. ReproRepo can serve as a reusable, scalable framework for future evaluations of LLM agents on real-world reproducibility auditing. Our code is released at https://github.com/LithiumDA/ReproRepo.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The paper introduces ReproRepo, a scalable framework that treats human-raised GitHub issues on paper repositories as naturally occurring labels for reproducibility blockers. It evaluates four LLM agent configurations on 1,149 recent ML papers and reports that the strongest configuration (Codex with GPT-5.5) surfaces at least one semantically related human-reported blocker for ~90% of papers, even without code execution. The work positions this approach as a reusable alternative to manually curated reproducibility benchmarks and releases the associated code.

Significance. If the assumptions about issue validity and semantic relatedness hold, the framework provides a low-cost, scalable method for auditing LLM agents on realistic reproducibility tasks, which could accelerate evaluation beyond the small-scale manual benchmarks common in the field. The public code release is a concrete strength that supports future reuse and extension.

major comments (3)
  1. [Abstract] Abstract and results paragraph: The central quantitative claim (~90% of papers have at least one semantically related blocker surfaced) rests on an unvalidated proxy; the manuscript provides no description of how semantic relatedness is operationalized (e.g., embedding similarity threshold, LLM judge prompt, or human annotation protocol) nor any inter-annotator agreement or manual validation that the matched issues actually describe reproducibility failures rather than installation queries or feature requests.
  2. [Abstract] Dataset construction (implied in abstract and methods): The 1,149-paper corpus is restricted to repositories that already contain GitHub issues; no statistics or filtering criteria are reported to confirm that the retained issues predominantly concern reproducibility blockers, which directly affects whether the 90% figure can be interpreted as evidence that agents identify real-world reproducibility problems.
  3. [Results paragraph] Evaluation design: The claim that agents are 'particularly effective for surfacing visible failures' but 'insufficient in exact localization' is presented without accompanying quantitative breakdowns, example agent outputs, or error analysis that would allow readers to assess the distinction between semantic-region identification and actionable diagnosis.
minor comments (1)
  1. [Abstract] The model name 'Codex with GPT-5.5' is non-standard and should be clarified with exact API identifiers or version numbers used.
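Major comment 1 lists a similarity threshold as one possible way to operationalize semantic relatedness. The sketch below shows what such a rule could look like, using a bag-of-words cosine as a deliberately crude stand-in for an embedding model. The tokenizer, the 0.3 threshold, and the example texts are all assumptions for illustration; the manuscript's actual procedure is not described in the material reviewed here.

```python
import math
import re
from collections import Counter
from typing import List


def bag_of_words(text: str) -> Counter:
    """Lowercased token counts; a toy substitute for a real embedding."""
    return Counter(re.findall(r"[a-z0-9_]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def related_issues(finding: str, issues: List[str], threshold: float = 0.3) -> List[int]:
    """Indices of issues whose similarity to the finding meets the threshold."""
    f = bag_of_words(finding)
    return [i for i, issue in enumerate(issues) if cosine(f, bag_of_words(issue)) >= threshold]


# Hypothetical agent finding and issue texts.
finding = "requirements.txt does not pin the torch version"
issues = [
    "Which torch version should I use? requirements.txt does not pin it",
    "Please add support for multi-GPU training",
]
print(related_issues(finding, issues))  # [0]
```

The second issue is a feature request and is correctly left unmatched here, but a threshold rule gives no such guarantee in general, which is why the comment also asks for human validation of the matches.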

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the thoughtful and detailed comments. We address each major point below and commit to revisions that strengthen the clarity and rigor of the manuscript.

Point-by-point responses
  1. Referee: [Abstract] Abstract and results paragraph: The central quantitative claim (~90% of papers have at least one semantically related blocker surfaced) rests on an unvalidated proxy; the manuscript provides no description of how semantic relatedness is operationalized (e.g., embedding similarity threshold, LLM judge prompt, or human annotation protocol) nor any inter-annotator agreement or manual validation that the matched issues actually describe reproducibility failures rather than installation queries or feature requests.

    Authors: We agree that additional detail is required. The current manuscript describes the matching procedure at a high level but does not provide the precise operationalization or validation statistics. In the revision we will expand the Methods section with the exact procedure (embedding model and threshold or LLM judge prompt), report inter-annotator agreement from a human validation study on a sampled subset, and clarify the criteria used to confirm that matched issues describe reproducibility blockers. revision: yes

  2. Referee: [Abstract] Dataset construction (implied in abstract and methods): The 1,149-paper corpus is restricted to repositories that already contain GitHub issues; no statistics or filtering criteria are reported to confirm that the retained issues predominantly concern reproducibility blockers, which directly affects whether the 90% figure can be interpreted as evidence that agents identify real-world reproducibility problems.

    Authors: We will add an explicit subsection on dataset construction that reports the repository and issue selection criteria together with summary statistics (e.g., proportion of issues manually categorized as reproducibility-related versus installation queries or feature requests) on a representative sample. This will allow readers to assess the composition of the supervision signal. revision: yes

  3. Referee: [Results paragraph] Evaluation design: The claim that agents are 'particularly effective for surfacing visible failures' but 'insufficient in exact localization' is presented without accompanying quantitative breakdowns, example agent outputs, or error analysis that would allow readers to assess the distinction between semantic-region identification and actionable diagnosis.

    Authors: We accept that the current presentation lacks supporting detail. The revision will include (i) quantitative breakdowns of success rates stratified by failure visibility and localization granularity, (ii) representative agent output examples, and (iii) a dedicated error-analysis subsection that distinguishes semantic-region matches from precise localization failures. revision: yes
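Two of the quantities promised in these responses are simple to compute once annotations exist: inter-annotator agreement (response 1) and success rates stratified by category (response 3). The sketch below uses Cohen's kappa for two annotators and a plain per-stratum match rate. The label names, stratum names, and data are hypothetical, and the authors may report a different agreement statistic.

```python
from collections import Counter, defaultdict
from typing import Dict, Hashable, List, Sequence, Tuple


def cohens_kappa(rater_a: Sequence[Hashable], rater_b: Sequence[Hashable]) -> float:
    """Cohen's kappa: agreement between two raters, corrected for chance."""
    if len(rater_a) != len(rater_b) or len(rater_a) == 0:
        raise ValueError("raters must label the same non-empty set of items")
    n = len(rater_a)
    observed = sum(x == y for x, y in zip(rater_a, rater_b)) / n
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:  # both raters used one identical label throughout
        return 1.0
    return (observed - expected) / (1 - expected)


def stratified_rates(records: List[Tuple[str, bool]]) -> Dict[str, float]:
    """Match rate per stratum from (stratum, matched) records."""
    totals: Dict[str, int] = defaultdict(int)
    hits: Dict[str, int] = defaultdict(int)
    for stratum, matched in records:
        totals[stratum] += 1
        hits[stratum] += int(matched)
    return {s: hits[s] / totals[s] for s in totals}


# Hypothetical issue-type labels from two annotators on six issues.
a = ["repro", "repro", "install", "feature", "repro", "install"]
b = ["repro", "repro", "install", "repro", "repro", "feature"]
kappa = cohens_kappa(a, b)

# Hypothetical match outcomes grouped by failure visibility.
records = [
    ("visible", True), ("visible", True), ("visible", False),
    ("silent", True), ("silent", False), ("silent", False), ("silent", False),
]
rates = stratified_rates(records)
print(round(kappa, 3), rates)
```

Reporting the per-stratum rates next to the overall figure would let readers see how much of the roughly 90 percent comes from visible failures.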

Circularity Check

0 steps flagged

No significant circularity; evaluation uses external human-generated labels as independent ground truth

Full rationale

The paper's central evaluation compares LLM agent outputs against pre-existing human-raised GitHub issues on paper repositories, treating those issues as naturally occurring external supervision. No derivation step reduces a claimed result to a fitted parameter, self-citation chain, or input by construction; the reported ~90% figure is a direct empirical match rate against an independent dataset. The evaluation rests on these external labels alone, with no load-bearing self-citations or ansatzes that collapse the claim into its own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No free parameters, axioms, or invented entities are described in the abstract; the contribution is an empirical framework built on existing public data.

pith-pipeline@v0.9.1-grok · 5764 in / 1152 out tokens · 50612 ms · 2026-06-27T00:58:28.008852+00:00 · methodology

