How to benchmark: the Measure-Explain-Test-Improve loop
Pith reviewed 2026-05-08 01:47 UTC · model grok-4.3
The pith
Following the Measure-Explain-Test-Improve loop produces solid performance evaluations even for researchers new to benchmarking.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that researchers without prior benchmarking experience can build solid performance evaluations by following the Measure-Explain-Test-Improve loop: measure relevant quantities under controlled conditions, explain the observed numbers, test whether the explanation holds under further variation, and improve either the measured artifact or the measurement process until the results are trustworthy.
What carries the argument
The Measure-Explain-Test-Improve loop, an iterative cycle that structures data collection, interpretation, hypothesis testing, and refinement to make performance claims reliable.
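To make the loop concrete, here is a minimal Python sketch of what the Measure step could look like in practice. It is not taken from the paper: the function names, the warmup and run counts, and the toy workload are illustrative assumptions. The point it demonstrates is that timings are collected repeatedly, summarized together with their spread, and recorded alongside the measurement conditions, so that the Explain and Test steps have concrete material to work from.

```python
import platform
import statistics
import time

def measure(workload, *, warmup=3, runs=30):
    """Measure: run `workload` repeatedly and collect wall-clock timings.

    Warmup runs are discarded so that JIT compilation, caches, and lazy
    initialization do not pollute the measured samples.
    """
    for _ in range(warmup):
        workload()
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        workload()
        samples.append(time.perf_counter() - start)
    return samples

def summarize(samples):
    """Report central tendency *and* spread, never a single number."""
    return {
        "median_s": statistics.median(samples),
        "mean_s": statistics.mean(samples),
        "stdev_s": statistics.stdev(samples),
        "min_s": min(samples),
        "max_s": max(samples),
        "runs": len(samples),
        # Record the measurement conditions next to the numbers so the
        # Explain and Test steps can refer back to them.
        "machine": platform.platform(),
        "python": platform.python_version(),
    }

if __name__ == "__main__":
    # Illustrative workload; substitute the artifact under evaluation.
    workload = lambda: sum(i * i for i in range(100_000))
    print(summarize(measure(workload)))
```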
If this is right
- Performance claims in papers become easier to justify because each number is tied to an explicit explanation and test.
- Common sources of benchmarking error, such as unaccounted variability or missing baselines, are surfaced early in the cycle (a minimal comparison sketch follows this list).
- Iterative improvement applies equally to the research artifact and to the evaluation method itself.
- Researchers can integrate performance work into projects whose primary contributions are formal results or expressivity comparisons, without it becoming an afterthought.
- The loop provides a shared vocabulary that makes benchmarking advice transferable across different projects.
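On the second point above, a small comparison helper shows how keeping variability visible surfaces problems early. This is a sketch under assumptions rather than the paper's prescription: it consumes timing samples from a harness like the one sketched earlier, and the normal-approximation 95% interval (z = 1.96) is a deliberate simplification; the paper may recommend different statistics.

```python
import statistics

def confidence_interval(samples, z=1.96):
    """Approximate 95% confidence interval for the mean of `samples`."""
    mean = statistics.mean(samples)
    half_width = z * statistics.stdev(samples) / len(samples) ** 0.5
    return mean - half_width, mean + half_width

def compare(baseline_samples, candidate_samples):
    """Compare a candidate against a baseline without hiding variability."""
    lo_b, hi_b = confidence_interval(baseline_samples)
    lo_c, hi_c = confidence_interval(candidate_samples)
    return {
        "speedup": statistics.mean(baseline_samples) / statistics.mean(candidate_samples),
        # Overlapping intervals flag that the explanation (or the
        # measurement itself) needs another turn around the loop.
        "intervals_overlap": not (hi_c < lo_b or hi_b < lo_c),
        "baseline_ci_s": (lo_b, hi_b),
        "candidate_ci_s": (lo_c, hi_c),
    }
```

Overlapping intervals do not prove the absence of a difference; they only signal that claiming a speedup would be premature before another Measure or Explain iteration.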
Where Pith is reading between the lines
- The same loop structure could be applied in other computer science subfields where performance is a secondary concern, such as systems or human-computer interaction.
- Adoption might reduce the frequency of papers that report speedups without explaining measurement conditions or variability.
- Explicit documentation of each loop iteration could become a lightweight addition to supplementary material or artifact submissions.
- If the loop is taught in graduate courses, the quality of performance sections in student papers and theses could be tracked over time.
Load-bearing premise
The author's impression that current benchmarking practices in the field are typically poor is accurate, and the proposed loop will produce solid evaluations without requiring further empirical validation.
What would settle it
A collection of research papers that each follow the Measure-Explain-Test-Improve steps yet still contain easily rebutted or non-reproducible performance claims would show the loop does not reliably produce solid evaluations.
Original abstract
I would like to share recommendations on how to do performance benchmarks for the purpose of computer science research evaluation. Research in my field (programming language research) often involves performance considerations, but it is typically not the main tool used to evaluate our research (typically we evaluate via formal statements and their proofs, experience writing large or interesting examples, or systematic comparison of expressivity, feature set, etc.). My impression is that, as a result, we tend to not do our performance evaluation very well. In the present document I will try to explain a methodology to do benchmarking correctly (I hope!). People with no former benchmarking experience should be able to build solid performance evaluation as part of their research. I explain the justification for each aspect along the way.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes the Measure-Explain-Test-Improve loop as a structured methodology for conducting performance benchmarks in programming language research. The author argues that performance evaluation is typically secondary to formal proofs and expressivity comparisons in the field, leading to suboptimal practices, and provides justifications for each step in the loop so that researchers without prior benchmarking experience can produce solid evaluations.
Significance. If the loop's steps and justifications prove practical and complete, the work could help raise the standard of performance evaluation in PL research by offering an accessible, step-by-step process with explicit rationales. The advisory nature means its value lies in usability rather than novel theorems or data.
major comments (1)
- [Abstract] The central claim that novices 'should be able to build solid performance evaluation as part of their research by following the Measure-Explain-Test-Improve loop' (abstract) is load-bearing yet unsupported: the manuscript provides no concrete examples, worked case studies, or validation demonstrating that the loop produces solid results when applied by inexperienced researchers.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and the recommendation for major revision. We address the primary concern below and will revise the manuscript accordingly to strengthen the presentation of the methodology.
Point-by-point responses
Referee: [Abstract] The central claim that novices 'should be able to build solid performance evaluation as part of their research by following the Measure-Explain-Test-Improve loop' (abstract) is load-bearing yet unsupported: the manuscript provides no concrete examples, worked case studies, or validation demonstrating that the loop produces solid results when applied by inexperienced researchers.
Authors: We agree that the claim in the abstract would be better supported by concrete illustrations. Although the manuscript provides detailed justifications for each step of the loop to enable researchers (including those without prior experience) to conduct solid evaluations, the absence of worked examples leaves the practical applicability less evident. In the revised manuscript, we will add a dedicated section containing 2-3 worked case studies drawn from typical programming language research scenarios (e.g., evaluating a new compiler optimization or language feature). Each case study will walk through the Measure-Explain-Test-Improve loop in detail, showing how following the process leads to more rigorous and defensible benchmarking. This addition will directly address the concern by demonstrating the loop's effectiveness in practice.
revision: yes
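As a purely illustrative sketch of what a single turn of the loop might look like inside such a case study (this toy example is not from the manuscript), one can measure two ways of building a list, explain the gap, and test the explanation by varying the input size:

```python
import statistics
import timeit

def median_time(stmt, setup, repeats=10, number=10):
    """Measure: median per-call time of `stmt`, in seconds."""
    totals = timeit.repeat(stmt, setup=setup, repeat=repeats, number=number)
    return statistics.median(totals) / number

loop_stmt = "acc = []\nfor i in range(n): acc.append(i * i)"
comp_stmt = "acc = [i * i for i in range(n)]"

# Explain: we hypothesize the comprehension wins because it avoids the
# per-element `acc.append` attribute lookup and call overhead.
# Test: if that explanation holds, the ratio should persist across sizes.
for n in (1_000, 10_000, 100_000):
    setup = f"n = {n}"
    ratio = median_time(loop_stmt, setup) / median_time(comp_stmt, setup)
    print(f"n={n}: loop/comprehension time ratio = {ratio:.2f}")
# Improve: if the ratio collapses at some size, revise the explanation or
# the benchmark itself before reporting any speedup.
```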
Circularity Check
No significant circularity identified
Full rationale
The paper is a methodological guide offering recommendations for performance benchmarking via the Measure-Explain-Test-Improve loop. It contains no equations, derivations, fitted parameters, predictions, or uniqueness theorems. The central content is advisory, explaining justifications for each step based on the author's impressions of field practices, without any self-citation chains or reductions of claims to inputs by construction. The work is self-contained as a set of practical guidelines and does not advance falsifiable technical results that could exhibit circularity.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
- [1] Philip J. Fleming and John J. Wallace. How not to lie with statistics: the correct way to summarize benchmark results. Commun. ACM 29, 3, 218–221. https://doi.org/10.1145/5666.5673
- [2] Andy Georges, Dries Buytaert, and Lieven Eeckhout. Statistically rigorous Java performance evaluation. Proceedings of the 22nd Annual ACM SIGPLAN Conference on Object-Oriented Programming Systems, Languages and Applications, Association for Computing Machinery, 57–76. https://doi.org/10.1145/1297027.1297033
- [3] Todd Mytkowicz, Amer Diwan, Matthias Hauswirth, and Peter F. Sweeney. Producing wrong data without doing anything obviously wrong! Proceedings of the 14th International Conference on Architectural Support for Programming Languages and Operating Systems, ACM, 265–276. https://doi.org/10.1145/1508244.1508275
- [4] Jan Vitek and Tomas Kalibera. R3: repeatability, reproducibility and rigor. SIGPLAN Not. 47, 4a, 30–36. https://doi.org/10.1145/2442776.2442781
- [5] Charlie Curtsinger and Emery D. Berger. STABILIZER: statistically sound performance evaluation. https://doi.org/10.1145/2451116.2451141
- [6] Edd Barrett, Carl Friedrich Bolz-Tereick, Rebecca Killick, Sarah Mount, and Laurence Tratt. Virtual machine warmup blows hot and cold. Proc. ACM Program. Lang. 1, OOPSLA. https://doi.org/10.1145/3133876