pith. machine review for the scientific record.

arxiv: 2604.08200 · v2 · submitted 2026-04-09 · 💻 cs.SE

Recognition: 2 theorem links


Towards Improving the External Validity of Software Engineering Experiments with Transportability Methods

Authors on Pith no claims yet

Pith reviewed 2026-05-10 17:49 UTC · model grok-4.3

classification 💻 cs.SE
keywords external validity · transportability · causal inference · controlled experiments · observational data · software engineering · empirical research methods

The pith

Transportability methods can combine observational and experimental data to generalize SE experiment results to more representative populations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Controlled experiments in software engineering often use non-representative samples such as students because recruiting professional developers is difficult or costly. This limits how well results apply to the broader target population of practitioners. Transportability methods from causal inference provide a principled way to use larger observational datasets from repositories, logs, and surveys to adjust and transport causal findings from the experimental sample to the target population. The paper introduces these methods, demonstrates them in a simulation, maps them to common SE scenarios like substituting students for developers, and supplies a roadmap with practical guidelines for adoption.
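The combination described above can be sketched numerically. The following is a hypothetical toy example, not the paper's actual simulation: all numbers are invented, and it assumes a single measured covariate Z (say, experience level) fully accounts for the student/developer difference.

```python
# Hypothetical toy example (all numbers invented): transporting an
# average treatment effect from a student-heavy experimental sample
# to a developer-heavy target population, assuming a single measured
# covariate Z (e.g., experience level) accounts for the difference.

def effect_in_stratum(z):
    # Stratum-specific effects as estimated in the experiment;
    # hard-coded here for illustration (effect modification by Z).
    return 2.0 if z == 1 else 0.5

# Covariate values in the experimental sample: mostly Z=0 (students).
exp_sample = [0] * 800 + [1] * 200

# Observational data on the target population: mostly Z=1 (professionals).
target_sample = [0] * 300 + [1] * 700

# Naive estimate: average effect over the experimental sample alone.
naive = sum(effect_in_stratum(z) for z in exp_sample) / len(exp_sample)

# Transported estimate: reweight the stratum effects by the target
# population's covariate distribution P*(z) — the transport formula
# specialized to average effects.
p_target = {z: target_sample.count(z) / len(target_sample) for z in (0, 1)}
transported = sum(effect_in_stratum(z) * p_target[z] for z in (0, 1))

print(f"naive: {naive:.2f}")              # 0.80 in the student-heavy sample
print(f"transported: {transported:.2f}")  # 1.55 for the developer population
```

The naive average understates the effect for developers because the experimental sample under-represents the high-effect stratum; reweighting by the observational covariate distribution corrects this, which is the mechanism the paper proposes to formalize.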

Core claim

Transportability methods can be applied in software engineering to transport causal effects estimated in a controlled experiment on a non-representative sample to a different target population by leveraging observational data on selection mechanisms and other relevant variables, provided the causal structure is known and the transport formula is identifiable.

What carries the argument

Transportability methods, which use causal diagrams to derive formulas that adjust experimental results for differences between the study sample and the target population using observational data.
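Concretely, in the notation of Pearl and Bareinboim [15]: if a set of measured covariates Z screens off the population differences encoded by the selection variables, one standard form of the transport formula expresses the effect in the target population \(\Pi^{*}\) as

```latex
P^{*}\!\left(y \mid \mathrm{do}(x)\right) \;=\; \sum_{z} P\!\left(y \mid \mathrm{do}(x), z\right)\, P^{*}(z)
```

where \(P(y \mid \mathrm{do}(x), z)\) is estimated from the experiment and \(P^{*}(z)\) from observational data on the target population. Which adjustment set Z (if any) makes the formula identifiable is read off the causal diagram.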

If this is right

  • Researchers can design experiments with convenient samples and still claim applicability to industry populations when suitable observational data exist.
  • Existing repository and log data become directly usable to strengthen the external validity of new controlled experiments.
  • Common SE scenarios such as student-versus-professional substitution become formally addressable rather than left to informal judgment.
  • The field can produce results that are both statistically valid within the sample and transportable to practice without always requiring larger or more expensive experiments.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Adoption would require SE researchers to document selection variables and population descriptors more systematically in both experimental and observational studies.
  • The approach could extend to meta-analyses that transport effects across multiple prior experiments by pooling their observational covariates.
  • Tool support for drawing causal diagrams and checking identifiability conditions could lower the barrier for routine use in empirical SE papers.

Load-bearing premise

SE researchers can correctly identify the causal structures, selection variables, and population differences needed to apply the transportability formulas with the available observational data.

What would settle it

A follow-up study on a truly representative developer sample that measures the actual effect size and finds it differs from the size predicted by applying transportability to an earlier student-based experiment on the same question.

Figures

Figures reproduced from arXiv: 2604.08200 by Carlo A. Furia, Julian Frattini, Richard Torkar, Robert Feldt.

Figure 1: DAG visualizing causal assumptions of the illustra…
Figure 2: Next, we simulated the trial eligibility…
Figure 3: Results from the simulation compared against the…
original abstract

Controlled experiments are a core research method in software engineering (SE) for validating causal claims. However, recruiting a sample of participants that represents the intended target population is often difficult or expensive, which limits the external validity of experimental results. At the same time, SE researchers often have access to much larger amounts of observational than experimental data (e.g., from repositories, issue trackers, logs, surveys and industrial processes). Transportability methods combine these data from experimental and observational studies to "transport" results from the experimental sample to a broader, more representative sample of the target population. Although the ability to combine observational and experimental data in a principled way could substantially benefit empirical SE research, transportability methods have, to our knowledge, not been adopted in SE. In this vision, we aim to help make that adoption possible. To that end, we introduce transportability methods and their prerequisites, and demonstrate their potential through a simulation. We then outline several SE research scenarios in which these methods could apply, e.g., how to effectively use students as substitutes for developers. Finally, we outline a road map and practical guidelines to support SE researchers in applying them. Adopting transportability methods in SE research can strengthen the external validity of controlled experiments and help the field produce results that are both more reliable and more useful in practice.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript proposes adopting transportability methods from causal inference to improve the external validity of software engineering experiments. It introduces the methods and their prerequisites, demonstrates potential via a simulation that combines experimental and observational data, outlines application scenarios (e.g., transporting student-experiment results to professional developers), and provides a roadmap with practical guidelines for SE researchers.

Significance. If the methods prove applicable, they could allow SE researchers to generalize experimental findings more reliably to target populations by leveraging abundant observational data from repositories, logs, and surveys, addressing a core limitation in empirical SE. The simulation serves as a proof-of-concept under idealized conditions, and the roadmap offers concrete next steps; these elements strengthen the vision's utility if the feasibility concerns are addressed.

major comments (1)
  1. [Simulation section] The demonstration assumes known causal graphs, selection variables S, and satisfaction of transportability assumptions (e.g., conditional ignorability). It does not test or discuss recovery of these structures from typical SE observational sources, which often lack variables for unmeasured confounders such as developer experience or motivation; this assumption is load-bearing for the central claim that the methods can be adopted to strengthen external validity in practice.
minor comments (2)
  1. [Introduction] The introduction could more explicitly list the data requirements (e.g., need for a causal diagram encoding population differences) before the simulation to improve accessibility for SE readers unfamiliar with causal inference.
  2. [Roadmap section] Consider adding a brief discussion of sensitivity analysis for violated assumptions, as this would directly support the practical guidelines.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for their constructive feedback on our vision paper. We address the major comment below and will incorporate revisions to strengthen the discussion of practical challenges.

point-by-point responses
  1. Referee: [Simulation section] The demonstration assumes known causal graphs, selection variables S, and satisfaction of transportability assumptions (e.g., conditional ignorability). It does not test or discuss recovery of these structures from typical SE observational sources, which often lack variables for unmeasured confounders such as developer experience or motivation; this assumption is load-bearing for the central claim that the methods can be adopted to strengthen external validity in practice.

    Authors: The simulation in Section 4 is explicitly designed as a proof-of-concept under idealized conditions (known graph and satisfied assumptions) to demonstrate the potential benefits of combining experimental and observational data, which is a standard practice when introducing causal inference methods to a new domain like SE. The paper's central claim is not that transportability methods are immediately deployable but that they offer a promising direction for improving external validity, supported by a roadmap for future work. We agree that recovering causal graphs and verifying assumptions (e.g., conditional ignorability) from real SE observational sources is challenging due to unmeasured confounders like developer experience or motivation. To address this, we will revise the manuscript to add explicit discussion in the simulation section and roadmap on these practical difficulties, including strategies such as sensitivity analysis for unmeasured confounding and the need for richer observational datasets. This will better frame the assumptions while preserving the vision. revision: partial

Circularity Check

0 steps flagged

No circularity; proposal applies external causal inference methods to SE without self-referential reduction.

full rationale

The paper is a vision piece that introduces transportability methods from the existing causal inference literature (e.g., Pearl et al.), demonstrates their use via a simulation under known conditions, and outlines application scenarios and guidelines for SE researchers. No derivation chain reduces a claimed result to a fitted parameter, self-defined quantity, or load-bearing self-citation; the central proposal is an adoption argument whose content remains independent of the paper's own inputs or prior author work on the same topic.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that observational SE data can be used to specify the necessary causal models for transportability; no free parameters or invented entities are introduced.

axioms (1)
  • domain assumption Observational data in SE can be used to identify selection variables and causal structures for transportability.
    Invoked when outlining prerequisites and SE scenarios such as using students as substitutes for developers.

pith-pipeline@v0.9.0 · 5539 in / 1265 out tokens · 63442 ms · 2026-05-10T17:49:08.433416+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.


Reference graph

Works this paper leans on

31 extracted references · 29 canonical work pages

  1. [1]

    Sebastian Baltes and Paul Ralph. 2022. Sampling in software engineering research: A critical review and guidelines. Empirical Software Engineering 27, 4 (2022), 94. doi:10.1007/s10664-021-10072-8

  2. [2]

    Jeffrey Carver, Letizia Jaccheri, Sandro Morasca, and Forrest Shull. 2004. Issues in using students in empirical studies in software engineering education. In Proceedings. 5th international workshop on enterprise networking and computing in healthcare industry (IEEE Cat. No. 03EX717). IEEE, 239–249. doi:10.1109/METRIC.2003.1232471

  3. [3]

    Bénédicte Colnet, Imke Mayer, Guanhua Chen, Awa Dieng, Ruohong Li, Gaël Varoquaux, Jean-Philippe Vert, Julie Josse, and Shu Yang. 2024. Causal inference methods for combining randomized trials and observational studies: a review. Statistical Science 39, 1 (2024), 165–191. doi:10.1214/23-STS889

  4. [4]

    Bill Curtis. 1986. By the way, did anyone study any real programmers?. In Papers presented at the first workshop on empirical studies of programmers on Empirical studies of programmers. 256–262. doi:10.5555/21842.28899

  5. [5]

    Oscar Dieste, Natalia Juristo, and Mauro Danilo Martínez. 2013. Software industry experiments: A systematic literature review. In 2013 1st International Workshop on Conducting Empirical Studies in Industry (CESI). IEEE, 2–8. doi:10.1109/CESI.2013.6618462

  6. [6]

    Tore Dybå, Vigdis By Kampenes, and Dag IK Sjøberg. 2006. A systematic review of statistical power in software engineering experiments. Information and Software Technology 48, 8 (2006), 745–755. doi:10.1016/j.infsof.2005.08.009

  7. [7]

    Davide Falessi, Natalia Juristo, Claes Wohlin, Burak Turhan, Jürgen Münch, Andreas Jedlitschka, and Markku Oivo. 2018. Empirical software engineering experts on the use of students and professionals in experiments. Empirical Software Engineering 23, 1 (2018), 452–489. doi:10.1007/s10664-017-9523-3

  8. [8]

    Robert Feldt, Thomas Zimmermann, Gunnar R Bergersen, Davide Falessi, Andreas Jedlitschka, Natalia Juristo, Jürgen Münch, Markku Oivo, Per Runeson, Martin Shepperd, et al. 2018. Four commentaries on the use of students and professionals in empirical software engineering experiments. Empirical Software Engineering 23, 6 (2018), 3801–3820

  9. [9]

    Julian Frattini, Davide Fucci, Richard Torkar, Lloyd Montgomery, Michael Unterkalmsteiner, Jannik Fischbach, and Daniel Mendez. 2025. Applying Bayesian data analysis for causal inference about requirements quality: a controlled experiment. Empirical Software Engineering 30, 1 (2025), 29. doi:10.1007/s10664-024-10582-1

  10. [10]

    Julian Frattini, Richard Torkar, Robert Feldt, and Carlo Furia. 2026. Replication Package. https://doi.org/10.5281/zenodo.19451793. Last accessed 2026-04-07

  11. [11]

    Tobias Hey, Jan Keim, and Sophie Corallo. 2024. Requirements classification for traceability link recovery. In 2024 IEEE 32nd International Requirements Engineering Conference (RE). IEEE, 155–167. doi:10.1109/RE59067.2024.00024

  12. [12]

    René Just, Darioush Jalali, Laura Inozemtseva, Michael D. Ernst, Reid Holmes, and Gordon Fraser. 2014. Are mutants a valid substitute for real faults in software testing?. In Proceedings of the 22nd ACM SIGSOFT International Symposium on Foundations of Software Engineering (FSE-22), Hong Kong, China, November 16–22, 2014, Shing-Chi Cheung, Alessandro Or...

  13. [13]

    Johnson Ching-Hong Li. 2018. Curvilinear moderation—a more complete examination of moderation effects in behavioral sciences. Frontiers in Applied Mathematics and Statistics 4 (2018), 7. doi:10.3389/fams.2018.00007

  14. [14]

    Richard McElreath. 2018. Statistical rethinking: A Bayesian course with examples in R and Stan. Chapman and Hall/CRC. doi:10.1201/9781315372495

  15. [15]

    Judea Pearl and Elias Bareinboim. 2011. Transportability of causal and statistical relations: A formal approach. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 25. 247–254

  16. [16]

    Peter M Rothwell. 2005. External validity of randomised controlled trials: "To whom do the results of this trial apply?". The Lancet 365, 9453 (2005), 82–93. doi:10.1016/S0140-6736(04)17670-8

  17. [17]

    Iflaah Salman, Ayse Tosun Misirli, and Natalia Juristo. 2015. Are students representatives of professionals in software engineering experiments?. In 2015 IEEE/ACM 37th IEEE International Conference on Software Engineering, Vol. 1. IEEE, 666–676. doi:10.1109/ICSE.2015.82

  18. [18]

    Yorick Sens, Henriette Knopp, Sven Peldszus, and Thorsten Berger. 2025. A Large-Scale Study of Model Integration in ML-Enabled Software Systems. In 2025 IEEE/ACM 47th International Conference on Software Engineering (ICSE). 1165–1177. doi:10.1109/ICSE55347.2025.00185

  19. [19]

    William R Shadish, Thomas D Cook, and Donald T Campbell. 2002. Experimental and quasi-experimental designs for generalized causal inference. Houghton, Mifflin and Company. doi:10.1086/345281

  20. [20]

    Janet Siegmund and Jana Schumann. 2015. Confounding parameters on program comprehension: a literature survey. Empirical Software Engineering 20, 4 (2015), 1159–1192. doi:10.1007/s10664-014-9318-8

  21. [21]

    Dag IK Sjøberg, Bente Anda, Erik Arisholm, Tore Dybå, Magne Jørgensen, Amela Karahasanović, and Marek Vokáč. 2003. Challenges and recommendations when increasing the realism of controlled software engineering experiments. In Empirical Methods and Studies in Software Engineering: Experiences from ESERNET. Springer, 24–38. doi:10.1007/978-3-540-45143-3_3

  22. [22]

    Dag IK Sjøberg and Gunnar Rye Bergersen. 2022. Construct validity in software engineering. IEEE Transactions on Software Engineering 49, 3 (2022), 1374–1396. doi:10.1109/TSE.2022.3176725

  23. [23]

    Dag IK Sjøberg, Jo Erskine Hannay, Ove Hansen, Vigdis By Kampenes, Amela Karahasanovic, N-K Liborg, and Anette C Rekdal. 2005. A survey of controlled experiments in software engineering. IEEE Transactions on Software Engineering 31, 9 (2005), 733–753. doi:10.1109/TSE.2005.97

  24. [24]

    Klaas-Jan Stol and Brian Fitzgerald. 2018. The ABC of software engineering research. ACM Transactions on Software Engineering and Methodology (TOSEM) 27, 3 (2018), 1–51. doi:10.1145/3241743

  25. [25]

    Masashi Sugiyama and Motoaki Kawanabe. 2012. Machine learning in non-stationary environments: Introduction to covariate shift adaptation. MIT Press. doi:10.7551/mitpress/9780262017091.001.0001

  26. [26]

    Caroline B Terwee, Cecilia AC Prinsen, Alessandro Chiarotto, Marjan J Westerman, Donald L Patrick, Jordi Alonso, Lex M Bouter, Henrica CW De Vet, and Lidwine B Mokkink. 2018. COSMIN methodology for evaluating the content validity of patient-reported outcome measures: a Delphi study. Quality of Life Research 27, 5 (2018), 1159–1170. doi:10.1007/s11136-018-1829-0

  27. [27]

    Rosalia Tufano, Alberto Martin-Lopez, Ahmad Tayeb, Ozren Dabic, Sonia Haiduc, and Gabriele Bavota. 2025. Deep Learning-based Code Reviews: A Paradigm Shift or a Double-Edged Sword?. In47th IEEE/ACM International Conference on Software Engineering, ICSE 2025, Ottawa, ON, Canada, April 26 - May 6, 2025. IEEE, 1640–1652. doi:10.1109/ICSE55347.2025.00060

  28. [28]

    Stefan Wagner and Marvin Wyrich. 2021. Code comprehension confounders: A study of intelligence and personality. IEEE Transactions on Software Engineering 48, 12 (2021), 4789–4801. doi:10.1109/TSE.2021.3127131

  29. [29]

    Michael Waldman. 1984. Worker allocation, hierarchies and the wage distribution. The Review of Economic Studies 51, 1 (1984), 95–109. doi:10.2307/2297707

  30. [30]

    Claes Wohlin, Per Runeson, Martin Höst, Magnus C Ohlsson, Björn Regnell, Anders Wesslén, et al. 2012. Experimentation in software engineering. Vol. 236. Springer. doi:10.1007/978-3-662-69306-3

  31. [31]

    Marvin Wyrich and Sven Apel. 2024. Evidence Tetris in the Pixelated World of Validity Threats. In Proceedings of the 1st IEEE/ACM International Workshop on Methodological Issues with Empirical Studies in Software Engineering. 13–16. doi:10.1145/3643664.3648203

Received 23 January 2026; accepted 2 April 2026