Towards Improving the External Validity of Software Engineering Experiments with Transportability Methods
Pith reviewed 2026-05-10 17:49 UTC · model grok-4.3
Recognition: 2 theorem links
The pith
Transportability methods can combine observational and experimental data to generalize SE experiment results to more representative populations.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Transportability methods can be applied in software engineering to transport causal effects estimated in a controlled experiment on a non-representative sample to a different target population by leveraging observational data on selection mechanisms and other relevant variables, provided the causal structure is known and the transport formula is identifiable.
What carries the argument
Transportability methods, which use causal diagrams to derive formulas that adjust experimental results for differences between the study sample and the target population using observational data.
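For context (notation assumed from the causal-inference literature, not quoted from the paper), the basic transport formula of Pearl and Bareinboim [15] has the form:

```latex
P^{*}(y \mid \mathrm{do}(x)) \;=\; \sum_{z} P(y \mid \mathrm{do}(x), z)\, P^{*}(z)
```

where P is the distribution in the experimental study population, P* the distribution in the target population, and Z a set of covariates, read off the causal diagram, that suffices to account for the differences between the two. The experiment supplies P(y | do(x), z); the observational data supply P*(z).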
If this is right
- Researchers can design experiments with convenient samples and still claim applicability to industry populations when suitable observational data exist.
- Existing repository and log data become directly usable to strengthen the external validity of new controlled experiments.
- Common SE scenarios such as student-versus-professional substitution become formally addressable rather than left to informal judgment.
- The field can produce results that are both statistically valid within the sample and transportable to practice without always requiring larger or more expensive experiments.
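The kind of reweighting such a workflow relies on can be sketched in a few lines. This is an illustrative toy by this review, not the paper's actual simulation: X plays the role of an effect modifier (say, experience) distributed differently in the experimental sample and the target population, and inverse-probability-of-selection weights transport the experimental effect. The weights here are oracle density ratios because both distributions are known by construction; in practice they would be estimated, e.g., with a selection model fit on pooled observational data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: effect modifier X ~ N(0, 1) in the experimental sample
# (students) but X ~ N(1, 1) in the target population (professionals).
n = 20_000
x = rng.normal(0.0, 1.0, n)
t = rng.integers(0, 2, n)                              # randomized treatment
y = t * (1.0 + 2.0 * x) + rng.normal(0.0, 1.0, n)      # true effect = 1 + 2X

# Naive ATE in the experimental sample: unbiased there, but not for the target.
ate_sample = y[t == 1].mean() - y[t == 0].mean()

# IPSW-style transport: weight each unit by p_target(x) / p_sample(x).
# For two unit-variance normals this density ratio is exp(x - 1/2).
w = np.exp((x ** 2 - (x - 1.0) ** 2) / 2.0)

ate_transported = (np.average(y[t == 1], weights=w[t == 1])
                   - np.average(y[t == 0], weights=w[t == 0]))

print(f"sample ATE      = {ate_sample:.2f}")       # close to 1.0 (E[X]=0 in sample)
print(f"transported ATE = {ate_transported:.2f}")  # close to 3.0 (E[X]=1 in target)
```

The gap between the two estimates is exactly the external-validity problem the paper targets: both are internally valid computations, but only the reweighted one answers the question about the target population.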
Where Pith is reading between the lines
- Adoption would require SE researchers to document selection variables and population descriptors more systematically in both experimental and observational studies.
- The approach could extend to meta-analyses that transport effects across multiple prior experiments by pooling their observational covariates.
- Tool support for drawing causal diagrams and checking identifiability conditions could lower the barrier for routine use in empirical SE papers.
Load-bearing premise
SE researchers can correctly identify the causal structures, selection variables, and population differences needed to apply the transportability formulas with the available observational data.
What would settle it
A follow-up study on a truly representative developer sample that measures the actual effect size and finds it differs from the size predicted by applying transportability to an earlier student-based experiment on the same question.
Figures
Original abstract
Controlled experiments are a core research method in software engineering (SE) for validating causal claims. However, recruiting a sample of participants that represents the intended target population is often difficult or expensive, which limits the external validity of experimental results. At the same time, SE researchers often have access to much larger amounts of observational than experimental data (e.g., from repositories, issue trackers, logs, surveys and industrial processes). Transportability methods combine these data from experimental and observational studies to "transport" results from the experimental sample to a broader, more representative sample of the target population. Although the ability to combine observational and experimental data in a principled way could substantially benefit empirical SE research, transportability methods have, to our knowledge, not been adopted in SE. In this vision, we aim to help make that adoption possible. To that end, we introduce transportability methods and their prerequisites, and demonstrate their potential through a simulation. We then outline several SE research scenarios in which these methods could apply, e.g., how to effectively use students as substitutes for developers. Finally, we outline a road map and practical guidelines to support SE researchers in applying them. Adopting transportability methods in SE research can strengthen the external validity of controlled experiments and help the field produce results that are both more reliable and more useful in practice.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes adopting transportability methods from causal inference to improve the external validity of software engineering experiments. It introduces the methods and their prerequisites, demonstrates potential via a simulation that combines experimental and observational data, outlines application scenarios (e.g., transporting student-experiment results to professional developers), and provides a roadmap with practical guidelines for SE researchers.
Significance. If the methods prove applicable, they could allow SE researchers to generalize experimental findings more reliably to target populations by leveraging abundant observational data from repositories, logs, and surveys, addressing a core limitation in empirical SE. The simulation serves as a proof-of-concept under idealized conditions, and the roadmap offers concrete next steps; these elements strengthen the vision's utility if the feasibility concerns are addressed.
major comments (1)
- [Simulation section] The demonstration assumes known causal graphs, selection variables S, and satisfaction of transportability assumptions (e.g., conditional ignorability). It does not test or discuss how to recover these structures from typical SE observational sources, which often lack variables for unmeasured confounders such as developer experience or motivation; this assumption is load-bearing for the central claim that the methods can be adopted to strengthen external validity in practice.
minor comments (2)
- [Introduction] The introduction could more explicitly list the data requirements (e.g., need for a causal diagram encoding population differences) before the simulation to improve accessibility for SE readers unfamiliar with causal inference.
- [Roadmap section] Consider adding a brief discussion of sensitivity analysis for violated assumptions, as this would directly support the practical guidelines.
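One concrete shape such a sensitivity analysis could take (a sketch by this review under an assumed linear interaction model, not something the paper proposes): if the outcome follows

```latex
Y = \beta_0 + \tau T + \gamma\, T U + \varepsilon
\quad\Longrightarrow\quad
\text{bias} = \gamma \left( \mathbb{E}_{\text{sample}}[U] - \mathbb{E}_{\text{target}}[U] \right)
```

for an unmeasured effect modifier U (independent of the adjusted covariates), then a transport adjustment that omits U mis-states the target-population effect by the bias term above. Reporting the smallest interaction strength γ, or shift in U, that would overturn a conclusion yields a simple tipping-point analysis.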
Simulated Author's Rebuttal
We thank the referee for their constructive feedback on our vision paper. We address the major comment below and will incorporate revisions to strengthen the discussion of practical challenges.
Point-by-point responses
Referee: [Simulation section] The demonstration assumes known causal graphs, selection variables S, and satisfaction of transportability assumptions (e.g., conditional ignorability). It does not test or discuss how to recover these structures from typical SE observational sources, which often lack variables for unmeasured confounders such as developer experience or motivation; this assumption is load-bearing for the central claim that the methods can be adopted to strengthen external validity in practice.
Authors: The simulation in Section 4 is explicitly designed as a proof-of-concept under idealized conditions (known graph and satisfied assumptions) to demonstrate the potential benefits of combining experimental and observational data, which is a standard practice when introducing causal inference methods to a new domain like SE. The paper's central claim is not that transportability methods are immediately deployable but that they offer a promising direction for improving external validity, supported by a roadmap for future work. We agree that recovering causal graphs and verifying assumptions (e.g., conditional ignorability) from real SE observational sources is challenging due to unmeasured confounders like developer experience or motivation. To address this, we will revise the manuscript to add explicit discussion in the simulation section and roadmap on these practical difficulties, including strategies such as sensitivity analysis for unmeasured confounding and the need for richer observational datasets. This will better frame the assumptions while preserving the vision.
revision: partial
Circularity Check
No circularity; proposal applies external causal inference methods to SE without self-referential reduction.
full rationale
The paper is a vision piece that introduces transportability methods from the existing causal inference literature (e.g., Pearl et al.), demonstrates their use via a simulation under known conditions, and outlines application scenarios and guidelines for SE researchers. No derivation chain reduces a claimed result to a fitted parameter, self-defined quantity, or load-bearing self-citation; the central proposal is an adoption argument whose content remains independent of the paper's own inputs or prior author work on the same topic.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Observational data in SE can be used to identify selection variables and causal structures for transportability.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
  Relation between the paper passage and the cited Recognition theorem is unclear.
  Paper passage: "Transport with reweighting: IPSW estimator... Plug-in g-formula... preconditions A5–A7"
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
  Relation between the paper passage and the cited Recognition theorem is unclear.
  Paper passage: "covariate shift... treatment effect modifier X... Figure 1 DAG"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
[1] Sebastian Baltes and Paul Ralph. 2022. Sampling in software engineering research: A critical review and guidelines. Empirical Software Engineering 27, 4 (2022), 94. doi:10.1007/s10664-021-10072-8
[2] Jeffrey Carver, Letizia Jaccheri, Sandro Morasca, and Forrest Shull. 2004. Issues in using students in empirical studies in software engineering education. In Proceedings. 5th International Workshop on Enterprise Networking and Computing in Healthcare Industry (IEEE Cat. No. 03EX717). IEEE, 239–249. doi:10.1109/METRIC.2003.1232471
[3] Bénédicte Colnet, Imke Mayer, Guanhua Chen, Awa Dieng, Ruohong Li, Gaël Varoquaux, Jean-Philippe Vert, Julie Josse, and Shu Yang. 2024. Causal inference methods for combining randomized trials and observational studies: a review. Statistical Science 39, 1 (2024), 165–191. doi:10.1214/23-STS889
[4] Bill Curtis. 1986. By the way, did anyone study any real programmers? In Papers Presented at the First Workshop on Empirical Studies of Programmers. 256–262. doi:10.5555/21842.28899
[5] Oscar Dieste, Natalia Juristo, and Mauro Danilo Martínez. 2013. Software industry experiments: A systematic literature review. In 2013 1st International Workshop on Conducting Empirical Studies in Industry (CESI). IEEE, 2–8. doi:10.1109/CESI.2013.6618462
[6] Tore Dybå, Vigdis By Kampenes, and Dag IK Sjøberg. 2006. A systematic review of statistical power in software engineering experiments. Information and Software Technology 48, 8 (2006), 745–755. doi:10.1016/j.infsof.2005.08.009
[7] Davide Falessi, Natalia Juristo, Claes Wohlin, Burak Turhan, Jürgen Münch, Andreas Jedlitschka, and Markku Oivo. 2018. Empirical software engineering experts on the use of students and professionals in experiments. Empirical Software Engineering 23, 1 (2018), 452–489. doi:10.1007/s10664-017-9523-3
[8] Robert Feldt, Thomas Zimmermann, Gunnar R Bergersen, Davide Falessi, Andreas Jedlitschka, Natalia Juristo, Jürgen Münch, Markku Oivo, Per Runeson, Martin Shepperd, et al. 2018. Four commentaries on the use of students and professionals in empirical software engineering experiments. Empirical Software Engineering 23, 6 (2018), 3801–3820.
[9] Julian Frattini, Davide Fucci, Richard Torkar, Lloyd Montgomery, Michael Unterkalmsteiner, Jannik Fischbach, and Daniel Mendez. 2025. Applying bayesian data analysis for causal inference about requirements quality: a controlled experiment. Empirical Software Engineering 30, 1 (2025), 29. doi:10.1007/s10664-024-10582-1
[10] Julian Frattini, Richard Torkar, Robert Feldt, and Carlo Furia. 2026. Replication Package. https://doi.org/10.5281/zenodo.19451793. Last accessed 2026-04-07.
[11] Tobias Hey, Jan Keim, and Sophie Corallo. 2024. Requirements classification for traceability link recovery. In 2024 IEEE 32nd International Requirements Engineering Conference (RE). IEEE, 155–167. doi:10.1109/RE59067.2024.00024
[12] René Just, Darioush Jalali, Laura Inozemtseva, Michael D. Ernst, Reid Holmes, and Gordon Fraser. 2014. Are mutants a valid substitute for real faults in software testing? In Proceedings of the 22nd ACM SIGSOFT International Symposium on Foundations of Software Engineering (FSE-22), Hong Kong, China, November 16–22, 2014, Shing-Chi Cheung, Alessandro Or...
[13] Johnson Ching-Hong Li. 2018. Curvilinear moderation—a more complete examination of moderation effects in behavioral sciences. Frontiers in Applied Mathematics and Statistics 4 (2018), 7. doi:10.3389/fams.2018.00007
[14] Richard McElreath. 2018. Statistical Rethinking: A Bayesian Course with Examples in R and Stan. Chapman and Hall/CRC. doi:10.1201/9781315372495
[15] Judea Pearl and Elias Bareinboim. 2011. Transportability of causal and statistical relations: A formal approach. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 25. 247–254.
[16] Peter M Rothwell. 2005. External validity of randomised controlled trials: "to whom do the results of this trial apply?". The Lancet 365, 9453 (2005), 82–93. doi:10.1016/S0140-6736(04)17670-8
[17] Iflaah Salman, Ayse Tosun Misirli, and Natalia Juristo. 2015. Are students representatives of professionals in software engineering experiments? In 2015 IEEE/ACM 37th IEEE International Conference on Software Engineering, Vol. 1. IEEE, 666–676. doi:10.1109/ICSE.2015.82
[18] Yorick Sens, Henriette Knopp, Sven Peldszus, and Thorsten Berger. 2025. A Large-Scale Study of Model Integration in ML-Enabled Software Systems. In 2025 IEEE/ACM 47th International Conference on Software Engineering (ICSE). 1165–1177. doi:10.1109/ICSE55347.2025.00185
[19] William R Shadish, Thomas D Cook, and Donald T Campbell. 2002. Experimental and Quasi-Experimental Designs for Generalized Causal Inference. Houghton Mifflin Company. doi:10.1086/345281
[20] Janet Siegmund and Jana Schumann. 2015. Confounding parameters on program comprehension: a literature survey. Empirical Software Engineering 20, 4 (2015), 1159–1192. doi:10.1007/s10664-014-9318-8
[21] Dag IK Sjøberg, Bente Anda, Erik Arisholm, Tore Dybå, Magne Jørgensen, Amela Karahasanović, and Marek Vokáč. 2003. Challenges and recommendations when increasing the realism of controlled software engineering experiments. In Empirical Methods and Studies in Software Engineering: Experiences from ESERNET. Springer, 24–38. doi:10.1007/978-3-540-45143-3_3
[22] Dag IK Sjøberg and Gunnar Rye Bergersen. 2022. Construct validity in software engineering. IEEE Transactions on Software Engineering 49, 3 (2022), 1374–1396. doi:10.1109/TSE.2022.3176725
[23] Dag IK Sjøberg, Jo Erskine Hannay, Ove Hansen, Vigdis By Kampenes, Amela Karahasanovic, N-K Liborg, and Anette C Rekdal. 2005. A survey of controlled experiments in software engineering. IEEE Transactions on Software Engineering 31, 9 (2005), 733–753. doi:10.1109/TSE.2005.97
[24] Klaas-Jan Stol and Brian Fitzgerald. 2018. The ABC of software engineering research. ACM Transactions on Software Engineering and Methodology (TOSEM) 27, 3 (2018), 1–51. doi:10.1145/3241743
[25] Masashi Sugiyama and Motoaki Kawanabe. 2012. Machine Learning in Non-Stationary Environments: Introduction to Covariate Shift Adaptation. MIT Press. doi:10.7551/mitpress/9780262017091.001.0001
[26] Caroline B Terwee, Cecilia AC Prinsen, Alessandro Chiarotto, Marjan J Westerman, Donald L Patrick, Jordi Alonso, Lex M Bouter, Henrica CW De Vet, and Lidwine B Mokkink. 2018. COSMIN methodology for evaluating the content validity of patient-reported outcome measures: a Delphi study. Quality of Life Research 27, 5 (2018), 1159–1170. doi:10.1007/s11136-018-1829-0
[27] Rosalia Tufano, Alberto Martin-Lopez, Ahmad Tayeb, Ozren Dabic, Sonia Haiduc, and Gabriele Bavota. 2025. Deep Learning-based Code Reviews: A Paradigm Shift or a Double-Edged Sword? In 47th IEEE/ACM International Conference on Software Engineering, ICSE 2025, Ottawa, ON, Canada, April 26 - May 6, 2025. IEEE, 1640–1652. doi:10.1109/ICSE55347.2025.00060
[28] Stefan Wagner and Marvin Wyrich. 2021. Code comprehension confounders: A study of intelligence and personality. IEEE Transactions on Software Engineering 48, 12 (2021), 4789–4801. doi:10.1109/TSE.2021.3127131
[29] Michael Waldman. 1984. Worker allocation, hierarchies and the wage distribution. The Review of Economic Studies 51, 1 (1984), 95–109. doi:10.2307/2297707
[30] Claes Wohlin, Per Runeson, Martin Höst, Magnus C Ohlsson, Björn Regnell, Anders Wesslén, et al. 2012. Experimentation in Software Engineering. Vol. 236. Springer. doi:10.1007/978-3-662-69306-3
[31] Marvin Wyrich and Sven Apel. 2024. Evidence Tetris in the Pixelated World of Validity Threats. In Proceedings of the 1st IEEE/ACM International Workshop on Methodological Issues with Empirical Studies in Software Engineering. 13–16. doi:10.1145/3643664.3648203

Received 23 January 2026; accepted 2 April 2026