pith. machine review for the scientific record.

arxiv: 2604.06414 · v2 · submitted 2026-04-07 · 💻 cs.HC

Recognition: 2 theorem links · Lean Theorem

Reproducibility Beyond Artifacts: Interactional Support for Collaborative Machine Learning

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 18:58 UTC · model grok-4.3

classification 💻 cs.HC
keywords machine learning reproducibility · collaborative machine learning · socio-technical systems · interactional support · clinical research · AI mediation · artifact management · human-computer interaction

The pith

Reproducibility in collaborative ML fails when teams cannot interpret prior work or align evolving components, even with complete artifacts.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper contends that common ML reproducibility efforts, centered on capturing datasets, code, configurations, and environments, overlook key problems in team-based projects. Through observations from a 19-month clinical research deployment, it shows recurring difficulties in making sense of past experiments, coordinating changes across components, and recovering original intent despite full traceability records. In response, the work outlines a two-layer socio-technical system that pairs artifact lifecycle infrastructure with an added interactional layer using AI to support coordination, explanations, and shared understanding among participants. This shifts reproducibility from a fixed property of stored traces to an active, ongoing process. Readers would care because it explains why many interdisciplinary ML efforts still cannot reliably reuse or extend earlier results and identifies design directions for tools that address human coordination gaps.

Core claim

Observations from a 19-month deployment of a data-centric ML management system in clinical research reveal recurring interactional breakdowns that continue even when structural traceability of artifacts is comprehensive. The central proposal is a two-layer socio-technical ML management system that combines lifecycle-aware artifact infrastructure with an interactional layer to mediate coordination, explanation, and shared understanding. An AI-mediated semantic interface is positioned to help participants reconstruct experimental intent over time, reframing reproducibility as a dynamic socio-technical accomplishment rather than a static feature of recorded traces.

What carries the argument

A two-layer socio-technical ML management system, consisting of a lifecycle-aware artifact infrastructure layer paired with an AI-mediated interactional layer that supports coordination, explanation, and shared understanding.
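The two-layer split can be made concrete with a minimal sketch. Everything below is a hypothetical illustration of the paper's design idea, not the authors' implementation: a structural layer that records versioned artifacts and their lineage, and an interactional layer that attaches human- or AI-authored rationale to each artifact so that intent can be reconstructed later.

```python
from dataclasses import dataclass, field

@dataclass
class Artifact:
    """Structural layer: a versioned record of one ML artifact."""
    name: str
    version: int
    kind: str                                    # e.g. "dataset", "model", "config"
    parents: list = field(default_factory=list)  # provenance links (name, version)

@dataclass
class IntentNote:
    """Interactional layer: rationale tied to one artifact version."""
    artifact: str
    version: int
    rationale: str

class TwoLayerStore:
    """Pairs artifact lineage (structural) with intent notes (interactional)."""
    def __init__(self):
        self.artifacts = {}  # (name, version) -> Artifact
        self.notes = []      # append-only rationale log

    def record(self, artifact, rationale):
        self.artifacts[(artifact.name, artifact.version)] = artifact
        self.notes.append(IntentNote(artifact.name, artifact.version, rationale))

    def explain(self, name, version):
        """Return the raw material an AI mediator would summarize:
        the artifact's lineage plus every recorded rationale for it."""
        art = self.artifacts[(name, version)]
        rationales = [n.rationale for n in self.notes
                      if n.artifact == name and n.version == version]
        return {"lineage": art.parents, "rationales": rationales}

store = TwoLayerStore()
store.record(Artifact("fundus-train", 1, "dataset"), "initial cohort export")
store.record(Artifact("glaucoma-cnn", 3, "model", parents=[("fundus-train", 1)]),
             "retrained after label audit changed ground truth")
print(store.explain("glaucoma-cnn", 3))
```

The point of the sketch is the separation of concerns: the structural layer alone answers "what exists and where did it come from," while the interactional layer carries the "why" that the paper argues full traceability records fail to preserve.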

If this is right

  • Reproducibility is treated as an ongoing socio-technical process rather than a one-time property of stored artifacts.
  • AI-mediated semantic interfaces can assist teams in aligning evolving components and recovering experimental intent.
  • ML management systems should incorporate dedicated layers for interactional support alongside technical traceability features.
  • Interdisciplinary projects gain mechanisms for maintaining shared understanding across changing team compositions and component versions.
  • Human-centered ML infrastructure design prioritizes mediating coordination and explanation needs.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same interactional issues could appear in collaborative projects outside clinical settings, such as in scientific computing or engineering teams.
  • The proposed AI mediation layer might integrate with existing experiment-tracking platforms to reduce overhead in large groups without requiring full system replacement.
  • Testing the system in shorter or smaller-scale projects could reveal whether the benefits scale down or depend on long-duration team dynamics.
  • Extending the approach beyond ML could address reproducibility challenges in other data-intensive collaborative research fields.

Load-bearing premise

That the interactional breakdowns documented in one 19-month clinical ML deployment are representative of collaborative ML projects in general, and that an added AI-mediated interactional layer would resolve them in practice.

What would settle it

A follow-up deployment of the two-layer system in another collaborative ML project that shows no measurable reduction in coordination breakdowns or failed attempts to reconstruct intent, or an equivalent project that achieves full reproducibility using only artifact-complete systems without interactional support.

Figures

Figures reproduced from arXiv: 2604.06414 by Carl Kesselman, Zhiwei Li.

Figure 1. A two-layer socio-technical ML management system. The structural layer captures ML artifacts and their evolution.
original abstract

Machine learning (ML) reproducibility is often framed as a problem of incomplete artifact recording. This framing leads to systems that prioritize capturing datasets, code, configurations, and execution environments. However, in collaborative and interdisciplinary ML projects, reproducibility failures often arise not only from missing artifacts but from difficulties in interpreting prior work, aligning evolving components, and reconstructing experimental intent over time. Drawing on a 19-month deployment of a data-centric ML management system in a clinical research project, we identify recurring interactional breakdowns that persist despite comprehensive structural traceability. Based on these findings, we propose a two-layer socio-technical ML management system combining lifecycle-aware artifact infrastructure with an interactional layer designed to mediate coordination, explanation, and shared understanding. We discuss how an AI-mediated semantic interface reframes reproducibility as an ongoing socio-technical accomplishment rather than a static property of recorded traces, and outline implications for human-centered ML infrastructure design.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper claims that ML reproducibility in collaborative, interdisciplinary settings often fails due to interactional issues—interpreting prior work, aligning evolving components, and reconstructing experimental intent—beyond incomplete artifacts. Drawing on a 19-month deployment of a data-centric ML management system in a clinical research project, it identifies persistent breakdowns despite structural traceability. The authors propose a two-layer socio-technical system (lifecycle-aware artifact infrastructure plus an AI-mediated interactional layer) to support coordination, explanation, and shared understanding, reframing reproducibility as an ongoing socio-technical accomplishment with implications for human-centered ML infrastructure.

Significance. If the observations hold more broadly and the proposed mediation layer proves effective, the work could meaningfully shift reproducibility research from artifact-centric recording toward socio-technical support in HCI and ML systems. The empirical grounding in a longitudinal real-world deployment is a clear strength, providing concrete examples of breakdowns that current tools do not address.

major comments (2)
  1. [Deployment study and findings] The central generalization that interactional breakdowns 'often' cause reproducibility failures (abstract and introduction) rests on observations from a single 19-month clinical deployment. No comparative data from other team types, domains, or project scales are presented, leaving the prevalence and representativeness unestablished. This directly affects the motivation for the two-layer proposal.
  2. [Proposed system and discussion] The two-layer socio-technical system (artifact infrastructure plus AI-mediated interactional layer) is presented conceptually in the design proposal section with no prototype implementation, evaluation, or comparison against existing tools. The claim that the AI-mediated semantic interface would resolve the observed coordination and intent-reconstruction issues therefore remains untested and is load-bearing for the paper's contribution.
minor comments (1)
  1. [Discussion] The transition from specific deployment observations to the proposed design implications could be made more explicit with a table or diagram mapping breakdowns to system features.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their positive assessment of the empirical grounding and for highlighting areas where we can strengthen the manuscript. We address the two major comments below, proposing targeted revisions to qualify our claims and clarify the scope of the proposed system.

point-by-point responses
  1. Referee: The central generalization that interactional breakdowns 'often' cause reproducibility failures (abstract and introduction) rests on observations from a single 19-month clinical deployment. No comparative data from other team types, domains, or project scales are presented, leaving the prevalence and representativeness unestablished. This directly affects the motivation for the two-layer proposal.

    Authors: We agree that our findings are drawn from a single 19-month deployment in a clinical research context, which precludes strong claims about prevalence across all collaborative ML settings. The paper positions this as an in-depth case study typical in HCI research to uncover nuanced interactional issues that may be overlooked in broader surveys. To address this, we will revise the abstract, introduction, and discussion sections to temper the language from 'often' to 'can frequently' or 'in collaborative interdisciplinary settings such as the one studied,' and explicitly note the limitations of generalizability. We will also add a dedicated subsection on study limitations and suggest directions for future comparative research. This revision preserves the motivation for the two-layer system, as the observed breakdowns provide specific, actionable insights into where current artifact-based approaches fall short. revision: partial

  2. Referee: The two-layer socio-technical system (artifact infrastructure plus AI-mediated interactional layer) is presented conceptually in the design proposal section with no prototype implementation, evaluation, or comparison against existing tools. The claim that the AI-mediated semantic interface would resolve the observed coordination and intent-reconstruction issues therefore remains untested and is load-bearing for the paper's contribution.

    Authors: We recognize that the two-layer system is presented as a conceptual proposal derived from our empirical observations, without an implemented prototype or formal evaluation in this work. The primary contribution is the identification of interactional breakdowns through the deployment study and the articulation of design requirements for supporting them. We will revise the design proposal and discussion sections to more explicitly frame the two-layer system as a set of design implications and a call for future socio-technical infrastructure development. Additionally, we will include a section outlining potential evaluation approaches, such as controlled user studies or further deployments, and comparisons to tools like MLflow, Weights & Biases, and collaborative platforms. This ensures the claims about the AI-mediated layer are presented as hypotheses to be tested rather than demonstrated resolutions. revision: partial

Circularity Check

0 steps flagged

No circularity: empirical observations to design proposal without reductive derivations

full rationale

The paper grounds its central claims in observations from a single 19-month clinical deployment, identifying interactional breakdowns despite traceability, then proposes a two-layer socio-technical system. No equations, fitted parameters, or derivations exist that could reduce predictions to inputs by construction. No self-citations are invoked as load-bearing for uniqueness theorems or ansatzes. The argument is self-contained as qualitative HCI research, with the proposed AI-mediated interface presented as a design implication rather than a tautological restatement of the inputs. This matches the default expectation of no circularity for non-mathematical empirical papers.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The paper rests on qualitative observations from one deployment and a design proposal; it introduces no free parameters, formal axioms, or new invented entities with independent evidence.

pith-pipeline@v0.9.0 · 5448 in / 1116 out tokens · 45275 ms · 2026-05-10T18:58:15.192890+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: the paper's claim is directly supported by a theorem in the formal canon.
  • supports: the theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: the paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: the paper appears to rely on the theorem as machinery.
  • contradicts: the paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

18 extracted references · 3 canonical work pages · 1 internal anchor

  1. [1]

    Saleema Amershi, Andrew Begel, Christian Bird, Robert DeLine, Harald Gall, Ece Kamar, Nachiappan Nagappan, Besmira Nushi, and Thomas Zimmermann. 2019. Software engineering for machine learning: A case study. In 2019 IEEE/ACM 41st International Conference on Software Engineering: Software Engineering in Practice (ICSE-SEIP). IEEE, 291–300

  2. [2]

    Johannes M Bauer and Paulien M Herder. 2009. Designing socio-technical systems. In Philosophy of technology and engineering sciences. Elsevier, 601–630

  3. [3]

    Andrew L Beam, Arjun K Manrai, and Marzyeh Ghassemi. 2020. Challenges to the reproducibility of machine learning models in health care. JAMA 323, 4 (2020), 305–306

  4. [4]

    Alejandro Bugacov, Karl Czajkowski, Carl Kesselman, Anoop Kumar, Robert E Schuler, and Hongsuda Tangmunarunkit. 2017. Experiences with DERIVA: An asset management platform for accelerating eScience. In 2017 IEEE 13th International Conference on e-Science (e-Science). IEEE, 79–88

  5. [5]

    Jose Armando Hernandez and Miguel Colom. 2025. Reproducible research policies and software/data management in scientific computing journals: a survey, discussion, and perspectives. Frontiers in Computer Science 6 (2025), 1491823

  6. [6]

    Sayash Kapoor and Arvind Narayanan. 2023. Leakage and the reproducibility crisis in machine-learning-based science. Patterns 4, 9 (2023)

  7. [7]

    Dominik Kreuzberger, Niklas Kühl, and Sebastian Hirschl. 2023. Machine learning operations (MLOps): Overview, definition, and architecture. IEEE Access 11 (2023), 31866–31879

  8. [8]

    Zhiwei Li, Carl Kesselman, Tran Huy Nguyen, Benjamin Yixing Xu, Kyle Bolo, and Kimberley Yu. 2025. From Data to Decision: Data-Centric Infrastructure for Reproducible ML in Collaborative eScience. In 2025 IEEE Internati...

  9. [9]

    Matthew BA McDermott, Shirly Wang, Nikki Marinsek, Rajesh Ranganath, Luca Foschini, and Marzyeh Ghassemi. 2021. Reproducibility in machine learning for health research: Still a ways to go. Science Translational Medicine 13, 586 (2021), eabb1655

  10. [10]

    Nadia Nahar, Shurui Zhou, Grace Lewis, and Christian Kästner. 2022. Collaboration challenges in building ML-enabled systems: Communication, documentation, engineering, and process. In Proceedings of the 44th international conference on software engineering. 413–425

  11. [11]

    Van Nguyen, Sreenidhi Iyengar, Haroon Rasheed, Galo Apolo, Zhiwei Li, Aniket Kumar, Hong Nguyen, Austin Bohner, Kyle Bolo, Rahul Dhodapkar, et al. 2025. Comparison of Deep Learning and Clinician Performance for Detecting Referable Glaucoma from Fundus Photographs in a Safety Net Population. Ophthalmology Science 5, 4 (2025), 100751

  12. [12]

    Sida Peng, Eirini Kalliamvakou, Peter Cihon, and Mert Demirer. 2023. The impact of AI on developer productivity: Evidence from GitHub Copilot. arXiv preprint arXiv:2302.06590 (2023)

  13. [13]

    David Piorkowski, Soya Park, April Yi Wang, Dakuo Wang, Michael Muller, and Felix Portnoy. 2021. How AI developers overcome communication challenges in a multidisciplinary team: A case study. Proceedings of the ACM on Human-Computer Interaction 5, CSCW1 (2021), 1–25

  14. [14]

    Tom Preston-Werner. 2013. Semantic Versioning 2.0.0. https://semver.org. Accessed January 22, 2026

  15. [15]

    Robert Schuler, Karl Czajkowski, Mike D'Arcy, Hongsuda Tangmunarunkit, and Carl Kesselman. 2020. Towards Co-Evolution of Data-Centric Ecosystems. In Proceedings of the 32nd International Conference on Scientific and Statistical Database Management (Vienna, Austria) (SSDBM ’20). Association for Computing Machinery, New York, NY, USA, Article 4, 12 pages. d...

  16. [16]

    Harald Semmelrock, Tony Ross-Hellauer, Simone Kopeinik, Dieter Theiler, Armin Haberl, Stefan Thalmann, and Dominik Kowald. 2025. Reproducibility in machine-learning-based research: Overview, barriers, and drivers. AI Magazine 46, 2 (2025), e70002

  17. [17]

    Agnia Sergeyuk, Ilya Zakharov, Ekaterina Koshchenko, and Maliheh Izadi. 2026. Human-AI experience in integrated development environments: a systematic literature review. Empirical Software Engineering 31, 3 (2026), 55

  18. [18]

    Rohith Sothilingam, Vik Pant, and Eric S. K. Yu. 2022. Using i* to Analyze Collaboration Challenges in MLOps Project Teams. In iStar. 1–6