
arxiv: 2606.25879 · v1 · pith:P26OQXFYnew · submitted 2026-06-24 · 💻 cs.DC · cs.AI

AI-Assisted Computational Reproducibility on the FABRIC Testbed

Pith reviewed 2026-06-25 19:17 UTC · model grok-4.3

classification 💻 cs.DC cs.AI
keywords computational reproducibility · AI coding assistants · FABRIC testbed · experiment reproduction · BBR congestion control · LAMMPS molecular dynamics · genomics pipelines

The pith

AI coding assistants on the FABRIC testbed cut reproduction effort for published experiments by a factor of four to six.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows how large language model coding assistants paired with the FABRIC testbed can reproduce experiments across congestion control, molecular dynamics, and genomics. The reproduced runs supported the same scientific conclusions as the originals even when numerical outputs were not matched exactly. The AI handled environment setup, code adaptation, and debugging effectively, but analysis steps with unclear workflows still required human input to define order and dependencies. This combination produced a measured reduction in overall reproduction effort.

Core claim

Across the three case studies, the AI-assisted workflow on FABRIC reduced reproduction effort by roughly 4–6 times while the reproduced experiments supported the same scientific conclusions as the original studies.

What carries the argument

An AI-assisted workflow that uses large language model coding assistants through LoomAI on the FABRIC testbed to automate setup, code adaptation, and debugging.

If this is right

  • Reproduction success is judged by whether the same scientific conclusions are reached rather than by exact numerical match.
  • AI tools prove useful for environment setup and debugging but need human guidance when analysis lacks a clearly defined workflow.
  • The approach yields concrete recommendations for using AI assistants on shared research testbeds.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same testbed-plus-AI pattern could be tested on other distributed research infrastructures.
  • Adding tools that automatically extract data dependencies might reduce the remaining human effort in analysis stages.
  • Repeating the exercise with a larger and more varied set of published papers would test whether the 4–6× factor generalizes.
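
The dependency-extraction idea in the list above can be illustrated with a short sketch. It assumes each analysis step declares which other steps it consumes output from; the step names are hypothetical and are not taken from the paper. Python's standard `graphlib` module then yields one valid execution order, which is the piece the review says humans had to supply by hand.

```python
from graphlib import TopologicalSorter

# Hypothetical analysis steps, each mapped to the steps whose outputs it needs.
# These names are illustrative only; they do not come from the paper.
dependencies = {
    "normalize_counts": {"load_raw_counts"},
    "fit_model": {"normalize_counts"},
    "classify_genes": {"fit_model"},
    "plot_summary": {"classify_genes", "normalize_counts"},
}


def execution_order(deps):
    """Return one valid run order for the steps.

    Raises graphlib.CycleError if the declared dependencies are circular.
    """
    return list(TopologicalSorter(deps).static_order())


order = execution_order(dependencies)
print(order)
```

A circular declaration raises an error instead of producing an order, which is a useful early signal that a workflow description is inconsistent.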

Load-bearing premise

That effort reduction can be measured consistently across domains and that the three chosen case studies are representative without selection bias.

What would settle it

A side-by-side measurement of person-hours required to reproduce the same three experiments once with the AI assistant and once without it.
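
As a rough illustration of how such a side-by-side measurement would be turned into a reduction factor, the sketch below divides baseline person-hours by AI-assisted person-hours. The function name and the numbers are placeholders chosen for the example; they are not measurements from the paper.

```python
def reduction_factor(baseline_hours, assisted_hours):
    """Return how many times less effort the assisted run took."""
    if baseline_hours <= 0 or assisted_hours <= 0:
        raise ValueError("person-hours must be positive")
    return baseline_hours / assisted_hours


# Placeholder values for a single hypothetical task (not data from the paper).
example_baseline_hours = 12.0
example_assisted_hours = 3.0

factor = reduction_factor(example_baseline_hours, example_assisted_hours)
print(f"effort reduced by a factor of {factor:.1f}")
```

The ratio is only as meaningful as the two inputs: both must cover the same tasks and use the same definition of effort.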

Figures

Figures reproduced from arXiv: 2606.25879 by Berent Aldikacti, Komal Thareja, Michael Zink, Paul Ruth.

Figure 1: Four-phase AI-assisted reproducibility methodology. All phases involve both AI and human collaboration.
Surrounding text: … descriptions. The researcher reviews generated code before execution and provides corrections. In Phase 3 (Execution on FABRIC), the AI executes notebooks on FABRIC nodes via Secure Shell (SSH), installs dependencies, runs experiments, and collects results, typically 5–15 tool calls per prompt. The AI mo…
Figure 2: LAMMPS strong scaling on FABRIC. Short-range potentials (LJ, EAM) track ideal scaling through 32 cores. Long-range potentials (Rhodopsin, SPC/E) plateau at the multi-node boundary. Dashed line = ideal linear scaling.
Surrounding text: … configured on a 5-node cluster provisioned at TACC with one head node (32 cores, 128 GB RAM) and four workers (16 cores, 64 GB RAM each), totaling 96 cores and 384 GB RAM connected via L2Brid…
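
For readers unfamiliar with the strong-scaling comparison in Figure 2, the following sketch shows how speedup and parallel efficiency are conventionally computed from wall-clock timings at a fixed problem size. The timings are placeholders, not values from the paper; an efficiency of 1.0 corresponds to the ideal linear scaling line.

```python
def strong_scaling(timings):
    """Compute speedup and efficiency relative to the smallest core count.

    timings maps core count -> wall-clock seconds for the same problem size.
    Returns a list of (cores, speedup, efficiency) tuples sorted by cores.
    """
    base_cores = min(timings)
    base_time = timings[base_cores]
    rows = []
    for cores in sorted(timings):
        speedup = base_time / timings[cores]
        ideal_speedup = cores / base_cores  # what linear scaling would give
        rows.append((cores, speedup, speedup / ideal_speedup))
    return rows


# Placeholder timings in seconds (not measurements from the paper).
example_timings = {1: 100.0, 2: 50.0, 4: 30.0}

rows = strong_scaling(example_timings)
for cores, speedup, efficiency in rows:
    print(f"{cores:>3} cores  speedup {speedup:5.2f}  efficiency {efficiency:4.2f}")
```

A plateau like the one described for the long-range potentials would show up here as efficiency dropping well below 1.0 once runs span more than one node.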
Figure 3: Reproduced gene classification across four strains and nine stress conditions. Heat stress (HT:H) produces the most non-neutral genes in WT and ∆lon, consistent with the paper.
Surrounding text: … on FABRIC, and the reproduction was assessed as substantially reproduced (i.e., ≥75% of conclusion-level claims supported or partially supported). Table V summarizes the quantitative comparison and …
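
The "substantially reproduced" criterion quoted with Figure 3 (at least 75% of conclusion-level claims supported or partially supported) can be written as a small check. This is a sketch under assumptions: the status labels and the example claim list are hypothetical, and the paper may tally claims differently.

```python
SUPPORTED = "supported"
PARTIALLY_SUPPORTED = "partially_supported"
NOT_SUPPORTED = "not_supported"


def supported_fraction(claim_statuses):
    """Fraction of claims that are supported or partially supported."""
    if not claim_statuses:
        raise ValueError("at least one claim is required")
    hits = sum(
        1 for status in claim_statuses
        if status in (SUPPORTED, PARTIALLY_SUPPORTED)
    )
    return hits / len(claim_statuses)


def is_substantially_reproduced(claim_statuses, threshold=0.75):
    """Apply the >= 75% threshold quoted in the excerpt."""
    return supported_fraction(claim_statuses) >= threshold


# Hypothetical tally for one reproduction (not data from the paper).
example_claims = [SUPPORTED, SUPPORTED, PARTIALLY_SUPPORTED, NOT_SUPPORTED]

fraction = supported_fraction(example_claims)
verdict = is_substantially_reproduced(example_claims)
print(fraction, verdict)
```

Counting partially supported claims as hits follows the wording of the excerpt; a stricter variant would weight them lower.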
Original abstract

Computational reproducibility remains difficult despite being central to scientific research. In this paper, we show how the international FABRIC testbed, combined with large language model (LLM) coding assistants through LoomAI, can simplify reproducing published experiments across multiple domains. We reproduced three case studies on FABRIC, covering BBR-family congestion-control evaluations, LAMMPS molecular dynamics scaling benchmarks on a CPU-only MPI cluster, and stress protein homeostasis genomics pipelines. Rather than focusing only on matching numerical outputs, we evaluate whether the reproduced experiments support the same scientific conclusions as the original studies. The AI assistant was effective in setting up the environment, adapting code, and debugging, but struggled with the analysis stages that lacked clearly defined workflows, which required human guidance to establish execution order and data dependencies. Across the case studies, the AI-assisted workflow reduced reproduction effort by roughly 4–6 times. We conclude with practical recommendations for improving AI-assisted reproducibility on research testbeds.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper describes three case studies reproducing published experiments (BBR congestion-control evaluations, LAMMPS molecular-dynamics scaling benchmarks, and a genomics protein-homeostasis pipeline) on the FABRIC testbed with the assistance of LLM coding tools via LoomAI. Rather than exact numerical matching, the authors assess whether the reproduced runs support the original scientific conclusions. They report that the AI assistant handled environment setup, code adaptation, and debugging effectively but required human intervention for analysis stages lacking clear workflows. The central claim is that the AI-assisted approach reduced reproduction effort by a factor of roughly 4–6 across the cases, accompanied by practical recommendations for AI-assisted reproducibility on research testbeds.

Significance. If the effort-reduction factor can be placed on a reproducible, objective footing with explicit baselines and measurement protocols, the work would supply concrete, domain-spanning evidence on the current capabilities and limits of LLM assistants for computational reproducibility tasks. Such evidence is scarce and would be useful to both testbed operators and researchers seeking to lower barriers to reproduction.

major comments (2)
  1. [Abstract and case-study results sections] The 4–6× effort-reduction claim (stated in the abstract and repeated in the conclusions) is load-bearing for the paper’s contribution, yet no measurement protocol, definition of “effort,” or controlled baseline is supplied. It is therefore impossible to determine whether the factor derives from logged person-hours, number of human–AI turns, wall-clock time, or post-hoc author estimates, nor whether identical tasks were performed with and without the AI assistant.
  2. [Introduction and case-study selection] The three chosen case studies (BBR, LAMMPS, genomics) are presented as representative, but the manuscript provides no discussion of selection criteria or potential bias toward domains where LLM assistance is unusually effective. Without such justification, generalization of the reported speedup remains unsupported.
minor comments (2)
  1. [Abstract] The abstract states that the AI “struggled with the analysis stages that lacked clearly defined workflows,” but the manuscript does not enumerate which specific analysis steps required human guidance or how those steps were ultimately resolved.
  2. [Methods / LoomAI description] The LoomAI interface and the precise version of the LLM used are introduced without a dedicated methods subsection, making it difficult for readers to replicate the exact assistant configuration.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. The two major comments identify areas where the manuscript can be strengthened by adding explicit methodological detail. We address each point below and will incorporate revisions in the next version.

  1. Referee: [Abstract and case-study results sections] The 4–6× effort-reduction claim (stated in the abstract and repeated in the conclusions) is load-bearing for the paper’s contribution, yet no measurement protocol, definition of “effort,” or controlled baseline is supplied. It is therefore impossible to determine whether the factor derives from logged person-hours, number of human–AI turns, wall-clock time, or post-hoc author estimates, nor whether identical tasks were performed with and without the AI assistant.

    Authors: We agree that the current manuscript lacks an explicit measurement protocol. The reported 4–6× factor reflects the authors’ post-hoc estimates of total person-hours spent on each reproduction, based on detailed interaction logs with the LLM assistant and our prior experience performing comparable tasks without AI assistance. No controlled, side-by-side experiment with identical tasks was conducted. In revision we will add a dedicated subsection that (1) defines effort as cumulative person-hours, (2) describes the logging of human–AI turns and time stamps, and (3) states the limitations of the baseline estimation. We will also qualify the claim in the abstract and conclusions to reflect this methodology. revision: yes

  2. Referee: [Introduction and case-study selection] The three chosen case studies (BBR, LAMMPS, genomics) are presented as representative, but the manuscript provides no discussion of selection criteria or potential bias toward domains where LLM assistance is unusually effective. Without such justification, generalization of the reported speedup remains unsupported.

    Authors: The three experiments were chosen to cover distinct computational domains (networking, molecular dynamics, and bioinformatics) that are commonly reproduced on testbeds and that exercise different aspects of the FABRIC environment. We will revise the Introduction to state these selection criteria explicitly, note that the domains were not chosen to maximize LLM success, and add a short limitations paragraph discussing possible selection bias and the consequent limits on generalizability. revision: yes
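
The first response above defines effort as cumulative person-hours derived from logged time stamps. A minimal sketch of that bookkeeping follows, assuming each working session is recorded as an ISO 8601 start and end time; the log format and the example sessions are hypothetical and do not reflect the authors' actual logs.

```python
from datetime import datetime


def person_hours(sessions):
    """Sum the duration of (start, end) ISO 8601 timestamp pairs, in hours."""
    total_seconds = 0.0
    for start, end in sessions:
        begin = datetime.fromisoformat(start)
        finish = datetime.fromisoformat(end)
        if finish < begin:
            raise ValueError(f"session ends before it starts: {start} -> {end}")
        total_seconds += (finish - begin).total_seconds()
    return total_seconds / 3600.0


# Hypothetical session log for one researcher (not data from the paper).
example_sessions = [
    ("2026-01-05T09:00:00", "2026-01-05T10:30:00"),  # 1.5 hours
    ("2026-01-06T14:00:00", "2026-01-06T14:45:00"),  # 0.75 hours
]

total_hours = person_hours(example_sessions)
print(f"cumulative effort: {total_hours:.2f} person-hours")
```

Summing session intervals measures active working time only; a baseline estimated from prior experience, as the authors describe, would still need to be reported separately with its limitations.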

Circularity Check

0 steps flagged

No circularity: empirical case-study report with no derivations or self-referential predictions


The paper reports three case studies (BBR, LAMMPS, genomics) on AI-assisted reproduction using the FABRIC testbed and LoomAI. The central claim of 4–6× effort reduction is presented as an empirical observation from these reproductions rather than a derived prediction, fitted parameter, or result obtained via equations. No mathematical derivations, ansatzes, uniqueness theorems, or self-citation chains appear in the provided text. The work is self-contained as a descriptive evaluation of practical workflows and does not reduce any load-bearing claim to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No formal axioms or invented entities; the work rests on the unstated domain assumption that 'supporting the same scientific conclusions' can be judged by qualitative inspection of outputs.


