How to benchmark: the Measure-Explain-Test-Improve loop
Pith reviewed 2026-05-08 01:47 UTC · model grok-4.3
The pith
Following the Measure-Explain-Test-Improve loop produces solid performance evaluations even for researchers new to benchmarking.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that researchers without prior benchmarking experience can build solid performance evaluations by following the Measure-Explain-Test-Improve loop: measure relevant quantities under controlled conditions, explain the observed numbers, test whether the explanation holds under further variation, and improve either the measured artifact or the measurement process until the results are trustworthy.
What carries the argument
The Measure-Explain-Test-Improve loop, an iterative cycle that structures data collection, interpretation, hypothesis testing, and refinement to make performance claims reliable.
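To make the loop concrete, here is a minimal Python sketch of what the Measure step could look like in practice. It is not taken from the paper: the function names, the warmup and run counts, and the toy workload are illustrative assumptions. The point it demonstrates is that timings are collected repeatedly, summarized together with their spread, and recorded alongside the measurement conditions, so that the Explain and Test steps have concrete material to work from.

```python
import platform
import statistics
import time

def measure(workload, *, warmup=3, runs=30):
    """Measure: run `workload` repeatedly and collect wall-clock timings.

    Warmup runs are discarded so that JIT compilation, caches, and lazy
    initialization do not pollute the measured samples.
    """
    for _ in range(warmup):
        workload()
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        workload()
        samples.append(time.perf_counter() - start)
    return samples

def summarize(samples):
    """Report central tendency *and* spread, never a single number."""
    return {
        "median_s": statistics.median(samples),
        "mean_s": statistics.mean(samples),
        "stdev_s": statistics.stdev(samples),
        "min_s": min(samples),
        "max_s": max(samples),
        "runs": len(samples),
        # Record the measurement conditions next to the numbers so the
        # Explain and Test steps can refer back to them.
        "machine": platform.platform(),
        "python": platform.python_version(),
    }

if __name__ == "__main__":
    # Illustrative workload; substitute the artifact under evaluation.
    workload = lambda: sum(i * i for i in range(100_000))
    print(summarize(measure(workload)))
```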
If this is right
- Performance claims in papers become easier to justify because each number is tied to an explicit explanation and test.
- Common sources of benchmarking error, such as unaccounted variability or missing baselines, are surfaced early in the cycle (a minimal comparison sketch follows this list).
- Iterative improvement applies equally to the research artifact and to the evaluation method itself.
- Researchers can integrate performance work into projects whose primary contributions are formal results or expressivity comparisons, without it becoming an afterthought.
- The loop provides a shared vocabulary that makes benchmarking advice transferable across different projects.
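On the second point above, a small comparison helper shows how keeping variability visible surfaces problems early. This is a sketch under assumptions rather than the paper's prescription: it consumes timing samples from a harness like the one sketched earlier, and the normal-approximation 95% interval (z = 1.96) is a deliberate simplification; the paper may recommend different statistics.

```python
import statistics

def confidence_interval(samples, z=1.96):
    """Approximate 95% confidence interval for the mean of `samples`."""
    mean = statistics.mean(samples)
    half_width = z * statistics.stdev(samples) / len(samples) ** 0.5
    return mean - half_width, mean + half_width

def compare(baseline_samples, candidate_samples):
    """Compare a candidate against a baseline without hiding variability."""
    lo_b, hi_b = confidence_interval(baseline_samples)
    lo_c, hi_c = confidence_interval(candidate_samples)
    return {
        "speedup": statistics.mean(baseline_samples) / statistics.mean(candidate_samples),
        # Overlapping intervals flag that the explanation (or the
        # measurement itself) needs another turn around the loop.
        "intervals_overlap": not (hi_c < lo_b or hi_b < lo_c),
        "baseline_ci_s": (lo_b, hi_b),
        "candidate_ci_s": (lo_c, hi_c),
    }
```

Overlapping intervals do not prove the absence of a difference; they only signal that claiming a speedup would be premature before another Measure or Explain iteration.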
Where Pith is reading between the lines
- The same loop structure could be applied in other computer science subfields where performance is a secondary concern, such as systems or human-computer interaction.
- Adoption might reduce the frequency of papers that report speedups without explaining measurement conditions or variability.
- Explicit documentation of each loop iteration could become a lightweight addition to supplementary material or artifact submissions.
- If the loop is taught in graduate courses, the quality of performance sections in student papers and theses could be tracked over time.
Load-bearing premise
The author's impression that current benchmarking practices in the field are typically poor is accurate, and the proposed loop will produce solid evaluations without requiring further empirical validation.
What would settle it
A collection of research papers that each follow the Measure-Explain-Test-Improve steps yet still contain easily rebutted or non-reproducible performance claims would show the loop does not reliably produce solid evaluations.
Original abstract
I would like to share recommendations on how to do performance benchmarks for the purpose of computer science research evaluation. Research in my field (programming language research) often involves performance considerations, but it is typically not the main tool used to evaluate our research (typically we evaluate via formal statements and their proofs, experience writing large or interesting examples, or systematic comparison of expressivity, feature set, etc.). My impression is that, as a result, we tend to not do our performance evaluation very well. In the present document I will try to explain a methodology to do benchmarking correctly (I hope!). People with no former benchmarking experience should be able to build solid performance evaluation as part of their research. I explain the justification for each aspect along the way.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes the Measure-Explain-Test-Improve loop as a structured methodology for conducting performance benchmarks in programming language research. The author argues that performance evaluation is typically secondary to formal proofs and expressivity comparisons in the field, leading to suboptimal practices, and provides justifications for each step in the loop so that researchers without prior benchmarking experience can produce solid evaluations.
Significance. If the loop's steps and justifications prove practical and complete, the work could help raise the standard of performance evaluation in PL research by offering an accessible, step-by-step process with explicit rationales. The advisory nature means its value lies in usability rather than novel theorems or data.
major comments (1)
- [Abstract] The central claim that novices 'should be able to build solid performance evaluation as part of their research by following the Measure-Explain-Test-Improve loop' (abstract) is load-bearing yet unsupported: the manuscript provides no concrete examples, worked case studies, or validation demonstrating that the loop produces solid results when applied by inexperienced researchers.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and the recommendation for major revision. We address the primary concern below and will revise the manuscript accordingly to strengthen the presentation of the methodology.
Point-by-point responses
Referee: [Abstract] The central claim that novices 'should be able to build solid performance evaluation as part of their research by following the Measure-Explain-Test-Improve loop' (abstract) is load-bearing yet unsupported: the manuscript provides no concrete examples, worked case studies, or validation demonstrating that the loop produces solid results when applied by inexperienced researchers.
Authors: We agree that the claim in the abstract would be better supported by concrete illustrations. Although the manuscript provides detailed justifications for each step of the loop to enable researchers (including those without prior experience) to conduct solid evaluations, the absence of worked examples leaves the practical applicability less evident. In the revised manuscript, we will add a dedicated section containing 2-3 worked case studies drawn from typical programming language research scenarios (e.g., evaluating a new compiler optimization or language feature). Each case study will walk through the Measure-Explain-Test-Improve loop in detail, showing how following the process leads to more rigorous and defensible benchmarking. This addition will directly address the concern by demonstrating the loop's effectiveness in practice.
revision: yes
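As a purely illustrative sketch of what a single turn of the loop might look like inside such a case study (this toy example is not from the manuscript), one can measure two ways of building a list, explain the gap, and test the explanation by varying the input size:

```python
import statistics
import timeit

def median_time(stmt, setup, repeats=10, number=10):
    """Measure: median per-call time of `stmt`, in seconds."""
    totals = timeit.repeat(stmt, setup=setup, repeat=repeats, number=number)
    return statistics.median(totals) / number

loop_stmt = "acc = []\nfor i in range(n): acc.append(i * i)"
comp_stmt = "acc = [i * i for i in range(n)]"

# Explain: we hypothesize the comprehension wins because it avoids the
# per-element `acc.append` attribute lookup and call overhead.
# Test: if that explanation holds, the ratio should persist across sizes.
for n in (1_000, 10_000, 100_000):
    setup = f"n = {n}"
    ratio = median_time(loop_stmt, setup) / median_time(comp_stmt, setup)
    print(f"n={n}: loop/comprehension time ratio = {ratio:.2f}")
# Improve: if the ratio collapses at some size, revise the explanation or
# the benchmark itself before reporting any speedup.
```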
Circularity Check
No significant circularity identified
Full rationale
The paper is a methodological guide offering recommendations for performance benchmarking via the Measure-Explain-Test-Improve loop. It contains no equations, derivations, fitted parameters, predictions, or uniqueness theorems. The central content is advisory, explaining justifications for each step based on the author's impressions of field practices, without any self-citation chains or reductions of claims to inputs by construction. The work is self-contained as a set of practical guidelines and does not advance falsifiable technical results that could exhibit circularity.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
- [1] Philip J. Fleming and John J. Wallace. How not to lie with statistics: the correct way to summarize benchmark results. Commun. ACM 29, 3, 218–221. https://doi.org/10.1145/5666.5673
- [2] Andy Georges, Dries Buytaert, and Lieven Eeckhout. Statistically rigorous Java performance evaluation. Proceedings of the 22nd Annual ACM SIGPLAN Conference on Object-Oriented Programming Systems, Languages and Applications, Association for Computing Machinery, 57–76. https://doi.org/10.1145/1297027.1297033
- [3] Todd Mytkowicz, Amer Diwan, Matthias Hauswirth, and Peter F. Sweeney. Producing wrong data without doing anything obviously wrong! Proceedings of the 14th International Conference on Architectural Support for Programming Languages and Operating Systems, ACM, 265–276. https://doi.org/10.1145/1508244.1508275
- [4] Jan Vitek and Tomas Kalibera. R3: repeatability, reproducibility and rigor. SIGPLAN Not. 47, 4a, 30–36. https://doi.org/10.1145/2442776.2442781
- [5] Charlie Curtsinger and Emery D. Berger. STABILIZER: statistically sound performance evaluation. https://doi.org/10.1145/2451116.2451141
- [6] Edd Barrett, Carl Friedrich Bolz-Tereick, Rebecca Killick, Sarah Mount, and Laurence Tratt. Virtual machine warmup blows hot and cold. Proc. ACM Program. Lang. 1, OOPSLA. https://doi.org/10.1145/3133876