pith. machine review for the scientific record.

arxiv: 2604.08200 · v2 · submitted 2026-04-09 · 💻 cs.SE

Recognition: 2 theorem links


Towards Improving the External Validity of Software Engineering Experiments with Transportability Methods

Authors on Pith no claims yet

Pith reviewed 2026-05-10 17:49 UTC · model grok-4.3

classification 💻 cs.SE
keywords external validity · transportability · causal inference · controlled experiments · observational data · software engineering · empirical research methods

The pith

Transportability methods can combine observational and experimental data to generalize SE experiment results to more representative populations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Controlled experiments in software engineering often use non-representative samples such as students because recruiting professional developers is difficult or costly. This limits how well results apply to the broader target population of practitioners. Transportability methods from causal inference provide a principled way to use larger observational datasets from repositories, logs, and surveys to adjust and transport causal findings from the experimental sample to the target population. The paper introduces these methods, demonstrates them in a simulation, maps them to common SE scenarios like substituting students for developers, and supplies a roadmap with practical guidelines for adoption.
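The combination described above can be sketched numerically. The following is a hypothetical toy example, not the paper's actual simulation: all numbers are invented, and it assumes a single measured covariate Z (say, experience level) fully accounts for the student/developer difference.

```python
# Hypothetical toy example (all numbers invented): transporting an
# average treatment effect from a student-heavy experimental sample
# to a developer-heavy target population, assuming a single measured
# covariate Z (e.g., experience level) accounts for the difference.

def effect_in_stratum(z):
    # Stratum-specific effects as estimated in the experiment;
    # hard-coded here for illustration (effect modification by Z).
    return 2.0 if z == 1 else 0.5

# Covariate values in the experimental sample: mostly Z=0 (students).
exp_sample = [0] * 800 + [1] * 200

# Observational data on the target population: mostly Z=1 (professionals).
target_sample = [0] * 300 + [1] * 700

# Naive estimate: average effect over the experimental sample alone.
naive = sum(effect_in_stratum(z) for z in exp_sample) / len(exp_sample)

# Transported estimate: reweight the stratum effects by the target
# population's covariate distribution P*(z) — the transport formula
# specialized to average effects.
p_target = {z: target_sample.count(z) / len(target_sample) for z in (0, 1)}
transported = sum(effect_in_stratum(z) * p_target[z] for z in (0, 1))

print(f"naive: {naive:.2f}")              # 0.80 in the student-heavy sample
print(f"transported: {transported:.2f}")  # 1.55 for the developer population
```

The naive average understates the effect for developers because the experimental sample under-represents the high-effect stratum; reweighting by the observational covariate distribution corrects this, which is the mechanism the paper proposes to formalize.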

Core claim

Transportability methods can be applied in software engineering to transport causal effects estimated in a controlled experiment on a non-representative sample to a different target population by leveraging observational data on selection mechanisms and other relevant variables, provided the causal structure is known and the transport formula is identifiable.

What carries the argument

Transportability methods, which use causal diagrams to derive formulas that adjust experimental results for differences between the study sample and the target population using observational data.
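Concretely, in the notation of Pearl and Bareinboim [15]: if a set of measured covariates Z screens off the population differences encoded by the selection variables, one standard form of the transport formula expresses the effect in the target population \(\Pi^{*}\) as

```latex
P^{*}\!\left(y \mid \mathrm{do}(x)\right) \;=\; \sum_{z} P\!\left(y \mid \mathrm{do}(x), z\right)\, P^{*}(z)
```

where \(P(y \mid \mathrm{do}(x), z)\) is estimated from the experiment and \(P^{*}(z)\) from observational data on the target population. Which adjustment set Z (if any) makes the formula identifiable is read off the causal diagram.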

If this is right

  • Researchers can design experiments with convenient samples and still claim applicability to industry populations when suitable observational data exist.
  • Existing repository and log data become directly usable to strengthen the external validity of new controlled experiments.
  • Common SE scenarios such as student-versus-professional substitution become formally addressable rather than left to informal judgment.
  • The field can produce results that are both statistically valid within the sample and transportable to practice without always requiring larger or more expensive experiments.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Adoption would require SE researchers to document selection variables and population descriptors more systematically in both experimental and observational studies.
  • The approach could extend to meta-analyses that transport effects across multiple prior experiments by pooling their observational covariates.
  • Tool support for drawing causal diagrams and checking identifiability conditions could lower the barrier for routine use in empirical SE papers.

Load-bearing premise

SE researchers can correctly identify the causal structures, selection variables, and population differences needed to apply the transportability formulas with the available observational data.

What would settle it

A follow-up study on a truly representative developer sample that measures the actual effect size and finds it differs from the size predicted by applying transportability to an earlier student-based experiment on the same question.

Figures

Figures reproduced from arXiv: 2604.08200 by Carlo A. Furia, Julian Frattini, Richard Torkar, Robert Feldt.

Figure 1: DAG visualizing causal assumptions of the illustra…
Figure 2: Next, we simulated the trial eligibility…
Figure 3: Results from the simulation compared against the…
original abstract

Controlled experiments are a core research method in software engineering (SE) for validating causal claims. However, recruiting a sample of participants that represents the intended target population is often difficult or expensive, which limits the external validity of experimental results. At the same time, SE researchers often have access to much larger amounts of observational than experimental data (e.g., from repositories, issue trackers, logs, surveys and industrial processes). Transportability methods combine these data from experimental and observational studies to "transport" results from the experimental sample to a broader, more representative sample of the target population. Although the ability to combine observational and experimental data in a principled way could substantially benefit empirical SE research, transportability methods have, to our knowledge, not been adopted in SE. In this vision, we aim to help make that adoption possible. To that end, we introduce transportability methods and their prerequisites, and demonstrate their potential through a simulation. We then outline several SE research scenarios in which these methods could apply, e.g., how to effectively use students as substitutes for developers. Finally, we outline a road map and practical guidelines to support SE researchers in applying them. Adopting transportability methods in SE research can strengthen the external validity of controlled experiments and help the field produce results that are both more reliable and more useful in practice.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript proposes adopting transportability methods from causal inference to improve the external validity of software engineering experiments. It introduces the methods and their prerequisites, demonstrates potential via a simulation that combines experimental and observational data, outlines application scenarios (e.g., transporting student-experiment results to professional developers), and provides a roadmap with practical guidelines for SE researchers.

Significance. If the methods prove applicable, they could allow SE researchers to generalize experimental findings more reliably to target populations by leveraging abundant observational data from repositories, logs, and surveys, addressing a core limitation in empirical SE. The simulation serves as a proof-of-concept under idealized conditions, and the roadmap offers concrete next steps; these elements strengthen the vision's utility if the feasibility concerns are addressed.

major comments (1)
  1. [Simulation section] The demonstration assumes known causal graphs, selection variables S, and satisfaction of transportability assumptions (e.g., conditional ignorability). It does not test or discuss recovery of these structures from typical SE observational sources, which often lack variables for unmeasured confounders such as developer experience or motivation; this assumption is load-bearing for the central claim that the methods can be adopted to strengthen external validity in practice.
minor comments (2)
  1. [Introduction] The introduction could more explicitly list the data requirements (e.g., need for a causal diagram encoding population differences) before the simulation to improve accessibility for SE readers unfamiliar with causal inference.
  2. [Roadmap section] Consider adding a brief discussion of sensitivity analysis for violated assumptions, as this would directly support the practical guidelines.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for their constructive feedback on our vision paper. We address the major comment below and will incorporate revisions to strengthen the discussion of practical challenges.

point-by-point responses
  1. Referee: [Simulation section] The demonstration assumes known causal graphs, selection variables S, and satisfaction of transportability assumptions (e.g., conditional ignorability). It does not test or discuss recovery of these structures from typical SE observational sources, which often lack variables for unmeasured confounders such as developer experience or motivation; this assumption is load-bearing for the central claim that the methods can be adopted to strengthen external validity in practice.

    Authors: The simulation in Section 4 is explicitly designed as a proof-of-concept under idealized conditions (known graph and satisfied assumptions) to demonstrate the potential benefits of combining experimental and observational data, which is a standard practice when introducing causal inference methods to a new domain like SE. The paper's central claim is not that transportability methods are immediately deployable but that they offer a promising direction for improving external validity, supported by a roadmap for future work. We agree that recovering causal graphs and verifying assumptions (e.g., conditional ignorability) from real SE observational sources is challenging due to unmeasured confounders like developer experience or motivation. To address this, we will revise the manuscript to add explicit discussion in the simulation section and roadmap on these practical difficulties, including strategies such as sensitivity analysis for unmeasured confounding and the need for richer observational datasets. This will better frame the assumptions while preserving the vision. revision: partial

Circularity Check

0 steps flagged

No circularity; proposal applies external causal inference methods to SE without self-referential reduction.

full rationale

The paper is a vision piece that introduces transportability methods from the existing causal inference literature (e.g., Pearl et al.), demonstrates their use via a simulation under known conditions, and outlines application scenarios and guidelines for SE researchers. No derivation chain reduces a claimed result to a fitted parameter, self-defined quantity, or load-bearing self-citation; the central proposal is an adoption argument whose content remains independent of the paper's own inputs or prior author work on the same topic.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that observational SE data can be used to specify the necessary causal models for transportability; no free parameters or invented entities are introduced.

axioms (1)
  • domain assumption Observational data in SE can be used to identify selection variables and causal structures for transportability.
    Invoked when outlining prerequisites and SE scenarios such as using students as substitutes for developers.

pith-pipeline@v0.9.0 · 5539 in / 1265 out tokens · 63442 ms · 2026-05-10T17:49:08.433416+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.


Reference graph

Works this paper leans on

31 extracted references · 29 canonical work pages

  1. [1]

    Sebastian Baltes and Paul Ralph. 2022. Sampling in software engineering research: A critical review and guidelines. Empirical Software Engineering 27, 4 (2022), 94. doi:10.1007/s10664-021-10072-8

  2. [2]

    Jeffrey Carver, Letizia Jaccheri, Sandro Morasca, and Forrest Shull. 2004. Issues in using students in empirical studies in software engineering education. In Proceedings. 5th international workshop on enterprise networking and computing in healthcare industry (IEEE Cat. No. 03EX717). IEEE, 239–249. doi:10.1109/METRIC.2003.1232471

  3. [3]

    Bénédicte Colnet, Imke Mayer, Guanhua Chen, Awa Dieng, Ruohong Li, Gaël Varoquaux, Jean-Philippe Vert, Julie Josse, and Shu Yang. 2024. Causal inference methods for combining randomized trials and observational studies: a review. Statistical Science 39, 1 (2024), 165–191. doi:10.1214/23-STS889

  4. [4]

    Bill Curtis. 1986. By the way, did anyone study any real programmers?. In Papers presented at the first workshop on empirical studies of programmers on Empirical studies of programmers. 256–262. doi:10.5555/21842.28899

  5. [5]

    Oscar Dieste, Natalia Juristo, and Mauro Danilo Martínez. 2013. Software industry experiments: A systematic literature review. In 2013 1st International Workshop on Conducting Empirical Studies in Industry (CESI). IEEE, 2–8. doi:10.1109/CESI.2013.6618462

  6. [6]

    Tore Dybå, Vigdis By Kampenes, and Dag IK Sjøberg. 2006. A systematic review of statistical power in software engineering experiments. Information and Software Technology 48, 8 (2006), 745–755. doi:10.1016/j.infsof.2005.08.009

  7. [7]

    Davide Falessi, Natalia Juristo, Claes Wohlin, Burak Turhan, Jürgen Münch, Andreas Jedlitschka, and Markku Oivo. 2018. Empirical software engineering experts on the use of students and professionals in experiments. Empirical Software Engineering 23, 1 (2018), 452–489. doi:10.1007/s10664-017-9523-3

  8. [8]

    Robert Feldt, Thomas Zimmermann, Gunnar R Bergersen, Davide Falessi, Andreas Jedlitschka, Natalia Juristo, Jürgen Münch, Markku Oivo, Per Runeson, Martin Shepperd, et al. 2018. Four commentaries on the use of students and professionals in empirical software engineering experiments. Empirical Software Engineering 23, 6 (2018), 3801–3820

  9. [9]

    Julian Frattini, Davide Fucci, Richard Torkar, Lloyd Montgomery, Michael Unterkalmsteiner, Jannik Fischbach, and Daniel Mendez. 2025. Applying Bayesian data analysis for causal inference about requirements quality: a controlled experiment. Empirical Software Engineering 30, 1 (2025), 29. doi:10.1007/s10664-024-10582-1

  10. [10]

    Julian Frattini, Richard Torkar, Robert Feldt, and Carlo Furia. 2026. Replication Package. https://doi.org/10.5281/zenodo.19451793. Last accessed 2026-04-07

  11. [11]

    Tobias Hey, Jan Keim, and Sophie Corallo. 2024. Requirements classification for traceability link recovery. In 2024 IEEE 32nd International Requirements Engineering Conference (RE). IEEE, 155–167. doi:10.1109/RE59067.2024.00024

  12. [12]

    René Just, Darioush Jalali, Laura Inozemtseva, Michael D. Ernst, Reid Holmes, and Gordon Fraser. 2014. Are mutants a valid substitute for real faults in software testing?. In Proceedings of the 22nd ACM SIGSOFT International Symposium on Foundations of Software Engineering (FSE-22), Hong Kong, China, November 16–22, 2014, Shing-Chi Cheung, Alessandro Or...

  13. [13]

    Johnson Ching-Hong Li. 2018. Curvilinear moderation—a more complete examination of moderation effects in behavioral sciences. Frontiers in Applied Mathematics and Statistics 4 (2018), 7. doi:10.3389/fams.2018.00007

  14. [14]

    Richard McElreath. 2018. Statistical rethinking: A Bayesian course with examples in R and Stan. Chapman and Hall/CRC. doi:10.1201/9781315372495

  15. [15]

    Judea Pearl and Elias Bareinboim. 2011. Transportability of causal and statistical relations: A formal approach. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 25. 247–254

  16. [16]

    Peter M Rothwell. 2005. External validity of randomised controlled trials: "To whom do the results of this trial apply?". The Lancet 365, 9453 (2005), 82–93. doi:10.1016/S0140-6736(04)17670-8

  17. [17]

    Iflaah Salman, Ayse Tosun Misirli, and Natalia Juristo. 2015. Are students representatives of professionals in software engineering experiments?. In 2015 IEEE/ACM 37th IEEE International Conference on Software Engineering, Vol. 1. IEEE, 666–676. doi:10.1109/ICSE.2015.82

  18. [18]

    Yorick Sens, Henriette Knopp, Sven Peldszus, and Thorsten Berger. 2025. A Large-Scale Study of Model Integration in ML-Enabled Software Systems. In 2025 IEEE/ACM 47th International Conference on Software Engineering (ICSE). 1165–1177. doi:10.1109/ICSE55347.2025.00185

  19. [19]

    William R Shadish, Thomas D Cook, and Donald T Campbell. 2002. Experimental and quasi-experimental designs for generalized causal inference. Houghton, Mifflin and Company. doi:10.1086/345281

  20. [20]

    Janet Siegmund and Jana Schumann. 2015. Confounding parameters on program comprehension: a literature survey. Empirical Software Engineering 20, 4 (2015), 1159–1192. doi:10.1007/s10664-014-9318-8

  21. [21]

    Dag IK Sjøberg, Bente Anda, Erik Arisholm, Tore Dybå, Magne Jørgensen, Amela Karahasanović, and Marek Vokáč. 2003. Challenges and recommendations when increasing the realism of controlled software engineering experiments. In Empirical Methods and Studies in Software Engineering: Experiences from ESERNET. Springer, 24–38. doi:10.1007/978-3-540-45143-3_3

  22. [22]

    Dag IK Sjøberg and Gunnar Rye Bergersen. 2022. Construct validity in software engineering. IEEE Transactions on Software Engineering 49, 3 (2022), 1374–1396. doi:10.1109/TSE.2022.3176725

  23. [23]

    Dag IK Sjøberg, Jo Erskine Hannay, Ove Hansen, Vigdis By Kampenes, Amela Karahasanovic, N-K Liborg, and Anette C Rekdal. 2005. A survey of controlled experiments in software engineering. IEEE Transactions on Software Engineering 31, 9 (2005), 733–753. doi:10.1109/TSE.2005.97

  24. [24]

    Klaas-Jan Stol and Brian Fitzgerald. 2018. The ABC of software engineering research. ACM Transactions on Software Engineering and Methodology (TOSEM) 27, 3 (2018), 1–51. doi:10.1145/3241743

  25. [25]

    Masashi Sugiyama and Motoaki Kawanabe. 2012. Machine learning in non-stationary environments: Introduction to covariate shift adaptation. MIT Press. doi:10.7551/mitpress/9780262017091.001.0001

  26. [26]

    Caroline B Terwee, Cecilia AC Prinsen, Alessandro Chiarotto, Marjan J Westerman, Donald L Patrick, Jordi Alonso, Lex M Bouter, Henrica CW De Vet, and Lidwine B Mokkink. 2018. COSMIN methodology for evaluating the content validity of patient-reported outcome measures: a Delphi study. Quality of Life Research 27, 5 (2018), 1159–1170. doi:10.1007/s11136-018-1829-0

  27. [27]

    Rosalia Tufano, Alberto Martin-Lopez, Ahmad Tayeb, Ozren Dabic, Sonia Haiduc, and Gabriele Bavota. 2025. Deep Learning-based Code Reviews: A Paradigm Shift or a Double-Edged Sword?. In47th IEEE/ACM International Conference on Software Engineering, ICSE 2025, Ottawa, ON, Canada, April 26 - May 6, 2025. IEEE, 1640–1652. doi:10.1109/ICSE55347.2025.00060

  28. [28]

    Stefan Wagner and Marvin Wyrich. 2021. Code comprehension confounders: A study of intelligence and personality. IEEE Transactions on Software Engineering 48, 12 (2021), 4789–4801. doi:10.1109/TSE.2021.3127131

  29. [29]

    Michael Waldman. 1984. Worker allocation, hierarchies and the wage distribution. The Review of Economic Studies 51, 1 (1984), 95–109. doi:10.2307/2297707

  30. [30]

    Claes Wohlin, Per Runeson, Martin Höst, Magnus C Ohlsson, Björn Regnell, Anders Wesslén, et al. 2012. Experimentation in software engineering. Vol. 236. Springer. doi:10.1007/978-3-662-69306-3

  31. [31]

    Marvin Wyrich and Sven Apel. 2024. Evidence Tetris in the Pixelated World of Validity Threats. In Proceedings of the 1st IEEE/ACM International Workshop on Methodological Issues with Empirical Studies in Software Engineering. 13–16. doi:10.1145/3643664.3648203

Received 23 January 2026; accepted 2 April 2026