Audit of GSO, SWE-Perf and SWE-fficiency reveals that reference patches satisfy validity rules across machines for only 39/102, 11/140 and 411/498 tasks respectively, public submissions beat references on 85.3% of replay-valid tasks, and scoring rules cause ranking disagreements.
14th Intl
3 Pith papers cite this work. Polarity classification is still indexing.
years
2026 3verdicts
UNVERDICTED 3representative citing papers
Ensemble voting strategies for change point detection improve F1-score by 11% over Mozilla's T-test method on a new ground-truth dataset of 174 performance time series annotated by practitioners.
Microbenchmarks on the JVM can produce misleading results due to unrealistic profiles collected during isolated execution despite following JMH guidelines.
citing papers explorer
-
Are Performance-Optimization Benchmarks Reliably Measuring Coding Agents?
Audit of GSO, SWE-Perf and SWE-fficiency reveals that reference patches satisfy validity rules across machines for only 39/102, 11/140 and 411/498 tasks respectively, public submissions beat references on 85.3% of replay-valid tasks, and scoring rules cause ranking disagreements.
-
Exploring Statistical Change Point Detection Techniques for Performance Anomaly Detection at Mozilla
Ensemble voting strategies for change point detection improve F1-score by 11% over Mozilla's T-test method on a new ground-truth dataset of 174 performance time series annotated by practitioners.
-
Misleading Microbenchmarks on the Java Virtual Machines
Microbenchmarks on the JVM can produce misleading results due to unrealistic profiles collected during isolated execution despite following JMH guidelines.