Recognition: 2 Lean theorem links
Reproducibility Beyond Artifacts: Interactional Support for Collaborative Machine Learning
Pith reviewed 2026-05-10 18:58 UTC · model grok-4.3
The pith
Reproducibility in collaborative ML fails when teams cannot interpret prior work or align evolving components, even with complete artifacts.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Observations from a 19-month deployment of a data-centric ML management system in clinical research reveal recurring interactional breakdowns that continue even when structural traceability of artifacts is comprehensive. The central proposal is a two-layer socio-technical ML management system that combines lifecycle-aware artifact infrastructure with an interactional layer to mediate coordination, explanation, and shared understanding. An AI-mediated semantic interface is positioned to help participants reconstruct experimental intent over time, reframing reproducibility as a dynamic socio-technical accomplishment rather than a static feature of recorded traces.
What carries the argument
A two-layer socio-technical ML management system, consisting of a lifecycle-aware artifact infrastructure layer paired with an AI-mediated interactional layer that supports coordination, explanation, and shared understanding.
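To make the two-layer idea concrete, here is a minimal hypothetical sketch (not from the paper; all names are illustrative) contrasting what a purely artifact-centric record captures with the interactional context the proposed second layer would carry:

```python
# Hypothetical sketch of the two-layer record idea. The class and field
# names are assumptions for illustration, not the authors' design.
from dataclasses import dataclass, field

@dataclass
class ArtifactRecord:
    """Structural traceability layer: what artifact-centric tools capture."""
    dataset_version: str
    code_commit: str
    config: dict
    environment: str

@dataclass
class InteractionalContext:
    """Interactional layer: intent and rationale that artifacts alone omit."""
    experimental_intent: str  # why this run was attempted
    decisions: list = field(default_factory=list)  # rationale for choices
    open_questions: list = field(default_factory=list)

@dataclass
class TwoLayerRecord:
    artifact: ArtifactRecord
    context: InteractionalContext

run = TwoLayerRecord(
    artifact=ArtifactRecord("v3.2", "a1b2c3d", {"lr": 1e-4}, "py3.11-cuda12"),
    context=InteractionalContext(
        experimental_intent="Test whether rebalancing fixes minority-class recall",
        decisions=["reused v3.1 splits so results stay comparable"],
    ),
)
print(run.context.experimental_intent)
```

The point of the sketch is only that the second layer records answers to "why" questions (intent, decisions) that a later team member cannot reconstruct from datasets, commits, and configs alone.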
If this is right
- Reproducibility is treated as an ongoing socio-technical process rather than a one-time property of stored artifacts.
- AI-mediated semantic interfaces can assist teams in aligning evolving components and recovering experimental intent.
- ML management systems should incorporate dedicated layers for interactional support alongside technical traceability features.
- Interdisciplinary projects gain mechanisms for maintaining shared understanding across changing team compositions and component versions.
- Human-centered ML infrastructure design should prioritize mediating coordination and explanation needs.
Where Pith is reading between the lines
- The same interactional issues could appear in collaborative projects outside clinical settings, such as in scientific computing or engineering teams.
- The proposed AI mediation layer might integrate with existing experiment-tracking platforms to reduce overhead in large groups without requiring full system replacement.
- Testing the system in shorter or smaller-scale projects could reveal whether the benefits scale down or depend on long-duration team dynamics.
- Extending the approach beyond ML could address reproducibility challenges in other data-intensive collaborative research fields.
Load-bearing premise
The interactional breakdowns documented in one 19-month clinical ML deployment are representative of collaborative ML projects in general, and an added AI-mediated interactional layer would resolve them in practice.
What would settle it
A follow-up deployment of the two-layer system in another collaborative ML project that shows no measurable reduction in coordination breakdowns or failed attempts to reconstruct intent, or an equivalent project that achieves full reproducibility using only artifact-complete systems without interactional support.
Original abstract
Machine learning (ML) reproducibility is often framed as a problem of incomplete artifact recording. This framing leads to systems that prioritize capturing datasets, code, configurations, and execution environments. However, in collaborative and interdisciplinary ML projects, reproducibility failures often arise not only from missing artifacts but from difficulties in interpreting prior work, aligning evolving components, and reconstructing experimental intent over time. Drawing on a 19-month deployment of a data-centric ML management system in a clinical research project, we identify recurring interactional breakdowns that persist despite comprehensive structural traceability. Based on these findings, we propose a two-layer socio-technical ML management system combining lifecycle-aware artifact infrastructure with an interactional layer designed to mediate coordination, explanation, and shared understanding. We discuss how an AI-mediated semantic interface reframes reproducibility as an ongoing socio-technical accomplishment rather than a static property of recorded traces, and outline implications for human-centered ML infrastructure design.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that ML reproducibility in collaborative, interdisciplinary settings often fails due to interactional issues—interpreting prior work, aligning evolving components, and reconstructing experimental intent—beyond incomplete artifacts. Drawing on a 19-month deployment of a data-centric ML management system in a clinical research project, it identifies persistent breakdowns despite structural traceability. The authors propose a two-layer socio-technical system (lifecycle-aware artifact infrastructure plus an AI-mediated interactional layer) to support coordination, explanation, and shared understanding, reframing reproducibility as an ongoing socio-technical accomplishment with implications for human-centered ML infrastructure.
Significance. If the observations hold more broadly and the proposed mediation layer proves effective, the work could meaningfully shift reproducibility research from artifact-centric recording toward socio-technical support in HCI and ML systems. The empirical grounding in a longitudinal real-world deployment is a clear strength, providing concrete examples of breakdowns that current tools do not address.
Major comments (2)
- [Deployment study and findings] The central generalization that interactional breakdowns 'often' cause reproducibility failures (abstract and introduction) rests on observations from a single 19-month clinical deployment. No comparative data from other team types, domains, or project scales are presented, leaving the prevalence and representativeness unestablished. This directly affects the motivation for the two-layer proposal.
- [Proposed system and discussion] The two-layer socio-technical system (artifact infrastructure plus AI-mediated interactional layer) is presented conceptually in the design proposal section with no prototype implementation, evaluation, or comparison against existing tools. The claim that the AI-mediated semantic interface would resolve the observed coordination and intent-reconstruction issues therefore remains untested and is load-bearing for the paper's contribution.
Minor comments (1)
- [Discussion] The transition from specific deployment observations to the proposed design implications could be made more explicit with a table or diagram mapping breakdowns to system features.
Simulated Author's Rebuttal
We thank the referee for their positive assessment of the empirical grounding and for highlighting areas where we can strengthen the manuscript. We address the two major comments below, proposing targeted revisions to qualify our claims and clarify the scope of the proposed system.
Point-by-point responses
-
Referee: The central generalization that interactional breakdowns 'often' cause reproducibility failures (abstract and introduction) rests on observations from a single 19-month clinical deployment. No comparative data from other team types, domains, or project scales are presented, leaving the prevalence and representativeness unestablished. This directly affects the motivation for the two-layer proposal.
Authors: We agree that our findings are drawn from a single 19-month deployment in a clinical research context, which precludes strong claims about prevalence across all collaborative ML settings. The paper positions this as an in-depth case study typical in HCI research to uncover nuanced interactional issues that may be overlooked in broader surveys. To address this, we will revise the abstract, introduction, and discussion sections to temper the language from 'often' to 'can frequently' or 'in collaborative interdisciplinary settings such as the one studied,' and explicitly note the limitations of generalizability. We will also add a dedicated subsection on study limitations and suggest directions for future comparative research. This revision preserves the motivation for the two-layer system, as the observed breakdowns provide specific, actionable insights into where current artifact-based approaches fall short.
Revision: partial
-
Referee: The two-layer socio-technical system (artifact infrastructure plus AI-mediated interactional layer) is presented conceptually in the design proposal section with no prototype implementation, evaluation, or comparison against existing tools. The claim that the AI-mediated semantic interface would resolve the observed coordination and intent-reconstruction issues therefore remains untested and is load-bearing for the paper's contribution.
Authors: We recognize that the two-layer system is presented as a conceptual proposal derived from our empirical observations, without an implemented prototype or formal evaluation in this work. The primary contribution is the identification of interactional breakdowns through the deployment study and the articulation of design requirements for supporting them. We will revise the design proposal and discussion sections to more explicitly frame the two-layer system as a set of design implications and a call for future socio-technical infrastructure development. Additionally, we will include a section outlining potential evaluation approaches, such as controlled user studies or further deployments, and comparisons to tools like MLflow, Weights & Biases, and collaborative platforms. This ensures the claims about the AI-mediated layer are presented as hypotheses to be tested rather than demonstrated resolutions.
Revision: partial
Circularity Check
No circularity: the argument moves from empirical observations to a design proposal without reductive derivations.
full rationale
The paper grounds its central claims in observations from a single 19-month clinical deployment, identifying interactional breakdowns despite traceability, then proposes a two-layer socio-technical system. No equations, fitted parameters, or derivations exist that could reduce predictions to inputs by construction. No self-citations are invoked as load-bearing for uniqueness theorems or ansatzes. The argument is self-contained as qualitative HCI research, with the proposed AI-mediated interface presented as a design implication rather than a tautological restatement of the inputs. This matches the default expectation of no circularity for non-mathematical empirical papers.
Lean theorems connected to this paper
-
IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
Relation between the paper passage and the cited Recognition theorem is unclear.
Paper passage: "reproducibility failures often arise not only from missing artifacts but from difficulties in interpreting prior work, aligning evolving components, and reconstructing experimental intent over time"
-
IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Relation between the paper passage and the cited Recognition theorem is unclear.
Paper passage: "two-layer socio-technical ML management system combining lifecycle-aware artifact infrastructure with an interactional layer"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.