Commit-Aware Learning-Based Test Case Prioritization for Continuous Integration

Gerardo Canfora; Lorenzo Abbondante

arxiv: 2604.25363 · v2 · pith:24OPHILHnew · submitted 2026-04-28 · 💻 cs.SE

Pith reviewed 2026-05-07 16:07 UTC · model grok-4.3

keywords: test case prioritization, continuous integration, regression testing, commit-aware prediction, learning-based TCP, cross-project validation, version control diffs, fault detection

The pith

A learning model that adds structural details from code commits to coverage and history data improves test prioritization in continuous integration pipelines.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a commit-aware method that builds a predictive model for reordering tests when a new code change arrives. It pulls structural properties from version-control diffs, links them to which tests cover the changed code, and combines both with past execution records to guess which tests are likeliest to fail. The model is trained and tested across five Defects4J projects using leave-one-project-out validation so that no project-specific tuning occurs. Results indicate the added commit information lifts both the accuracy of identifying failing tests and the speed at which faults surface when tests run in the suggested order. Readers would care because regression testing in frequent CI builds consumes large resources, and earlier fault detection can cut that cost.

Core claim

Given a new commit, the method estimates for each test the probability it will reveal at least one failure by fusing structural properties extracted from version-control diffs, test coverage relations, and historical execution behavior into one predictive model, then reorders the test suite accordingly; when evaluated on five Defects4J projects under leave-one-project-out cross-project validation, this commit-aware approach significantly outperforms non-commit-aware baselines in both classification and prioritization effectiveness.

What carries the argument

The unified predictive model that combines structural properties of version-control diffs with test coverage relations and historical execution behavior to output per-test failure probabilities.
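
As a rough illustration of that fusion, the sketch below scores each test from one diff-derived, one coverage, and one history feature, then sorts by the resulting probability. Everything in it is an assumption made for illustration: the feature names, the `FeatureRow` type, and the hand-picked weights are hypothetical, and a plain logistic scorer stands in for whatever learner the paper actually trains.

```python
import math
from dataclasses import dataclass


@dataclass
class FeatureRow:
    """One test's features for a given commit (all fields are hypothetical)."""
    name: str
    diff_lines_changed: int    # diff-derived: size of the change
    covers_changed_code: bool  # coverage: does the test touch changed code?
    recent_failure_rate: float # history: share of recent runs that failed


# Hand-picked illustrative weights; a real system would learn these from data.
WEIGHTS = {"bias": -3.0, "diff": 0.02, "cov": 2.0, "hist": 3.0}


def failure_probability(row: FeatureRow) -> float:
    """Logistic stand-in for the learned model: fused features -> probability."""
    z = (WEIGHTS["bias"]
         + WEIGHTS["diff"] * row.diff_lines_changed
         + WEIGHTS["cov"] * float(row.covers_changed_code)
         + WEIGHTS["hist"] * row.recent_failure_rate)
    return 1.0 / (1.0 + math.exp(-z))


def prioritize(rows):
    """Order tests from most to least likely to fail."""
    return sorted(rows, key=failure_probability, reverse=True)


tests = [
    FeatureRow("test_parse", 40, True, 0.30),
    FeatureRow("test_format", 40, False, 0.05),
    FeatureRow("test_cache", 40, True, 0.00),
]
ranked = [row.name for row in prioritize(tests)]
print(ranked)
```

With these toy inputs, the test that both covers the changed code and has failed recently is scheduled first; the test that does not touch the change is scheduled last.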

If this is right

Tests can be reordered so that regression faults appear earlier in each CI run.
The learned model works across projects without needing project-specific retraining.
Both the classification of tests expected to fail and the quality of the resulting ranking improve.
CI pipelines can expose more faults while executing fewer tests overall.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Commit structural features could be added to other test-selection or bug-localization tools that already use coverage and history.
The performance lift suggests that finer-grained analysis of which parts of a diff matter most might yield still better predictors.
Teams running large suites in frequent builds could reduce total test time by adopting similar change-aware ranking.

Load-bearing premise

That structural properties extracted from version-control diffs supply predictive value beyond what test coverage and historical execution data already provide, and that the resulting model generalizes across projects without per-project tuning.

What would settle it

A new set of projects or live CI traces where the commit-aware model shows no measurable gain in fault-detection rate or average percentage of faults detected compared with the coverage-and-history baselines.
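
For reference, average percentage of faults detected (APFD) is conventionally computed as 1 - (TF_1 + ... + TF_m) / (n * m) + 1 / (2n), where n is the number of tests, m the number of faults, and TF_i the 1-based position of the first test that reveals fault i. The sketch below is a minimal implementation of that standard formula; the test and fault names are made up, and it assumes every fault is revealed by at least one test in the ordering.

```python
def apfd(order, faults_by_test):
    """Average percentage of faults detected for one test ordering.

    order: list of test names in execution order.
    faults_by_test: dict mapping a test name to the set of fault ids it reveals.
    Assumes every fault is revealed by at least one test in `order`.
    """
    n = len(order)
    all_faults = set().union(*faults_by_test.values())
    m = len(all_faults)
    if n == 0 or m == 0:
        raise ValueError("APFD needs at least one test and one fault")

    # Record the 1-based position of the first test that reveals each fault.
    first_position = {}
    for position, test in enumerate(order, start=1):
        for fault in faults_by_test.get(test, ()):
            first_position.setdefault(fault, position)

    if set(first_position) != all_faults:
        raise ValueError("some faults are not revealed by any test in the order")

    return 1.0 - sum(first_position.values()) / (n * m) + 1.0 / (2 * n)


# Toy example: two faults, four tests.
faults = {"t3": {"f1"}, "t4": {"f1", "f2"}}
good_order = apfd(["t3", "t4", "t1", "t2"], faults)  # fault-revealing tests first
bad_order = apfd(["t1", "t2", "t3", "t4"], faults)   # fault-revealing tests last
print(good_order, bad_order)
```

In the toy example, moving the two fault-revealing tests to the front raises APFD from 0.25 to 0.75, which is the kind of difference a comparison against coverage-and-history baselines would need to show.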

Figures

Figures reproduced from arXiv: 2604.25363 by Gerardo Canfora, Lorenzo Abbondante.

**Figure 1.** XGBoost feature importance on the Lang test set. The diff-based features are the ones that most influence the decision.

Excerpt from Section 5.2 (RQ2: improved prediction effectiveness on fault detection): The analysis of the prioritization quartiles reveals a distinct behavioral shift compared to the total collapse observed in the classification task, as shown in Table 3. The APFD Gain demonstrates a surprising resilience in the absence of diff-based features. This indicates that the models retain a goo…

Original abstract

Regression testing in Continuous Integration (CI) pipelines is increasingly costly due to the growing size and execution frequency of test suites. Test Case Prioritization (TCP) mitigates this problem by reordering tests to expose faults earlier. However, most existing techniques rely primarily on historical execution data and coverage metrics, neglecting the rich structural information contained in code changes. This paper proposes a commit-aware, learning-based TCP method that combines structural properties of version-control diffs, test coverage relations, and historical execution behavior into a unified predictive model. Given a new commit, the method estimates the probability that each test suite will reveal at least one failure and prioritizes test execution accordingly. We evaluate our method on five Defects4J projects using a leave-one-project-out cross-project validation setting. Results show that the commit-aware TCP significantly outperform non-commit-aware-baselines in both classification and prioritization effectiveness. Our findings show that including commit structural semantics substantially enhances regression fault detection and enables robust, generalizable learning-based TCP in CI environments.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper adds commit-diff structure to a learning-based TCP model and claims gains on five Defects4J projects under leave-one-project-out validation, but the narrow setup gives only weak support for the claims that the new features help and that the model generalizes.

The main takeaway is that this work folds structural signals from version-control diffs into an otherwise standard learning-based test case prioritization model that already uses coverage and execution history. The authors train on four projects and test on the fifth across five Defects4J subjects, reporting that the commit-aware version beats the baselines on both classification accuracy and prioritization effectiveness. That cross-project protocol is a step up from the usual per-project tuning you see in this area, and the motivation around CI regression costs is straightforward and practical. The idea of treating the diff as a source of predictive features makes sense given that the commit is what triggers the test run in the first place. If the full paper spells out the exact diff-derived features and the model architecture clearly, that part could be useful to someone implementing a similar system.

The soft spot is the evaluation scale and the missing checks on whether the new features actually drive the improvement. Five projects is a small sample, and they all share the same benchmark, language, and fault-injection style, so it is easy for any gains to reflect dataset artifacts rather than a general advantage. Without an ablation that removes the diff features and shows the performance drop, or results on at least one additional independent project, the claim that commit structure substantially enhances detection rests on thin evidence. The abstract gives no numbers, confidence intervals, or statistical tests, which makes it hard to judge effect size or reliability. If those details appear in the full text and survive the ablation, the concern drops; otherwise the central result stays under-supported.

This is the sort of paper that could interest people who build or study CI test selection tools. A practitioner might try the feature set on their own commits, but a researcher would probably wait for stronger validation before citing or extending it.

I would send it to peer review so the authors can add the ablations and perhaps one more project; the topic and setup are worth a referee's time even if the current evidence needs tightening.

Referee Report

3 major / 2 minor

Summary. The manuscript proposes a commit-aware learning-based test case prioritization (TCP) technique for continuous integration that fuses structural features extracted from version-control diffs with coverage relations and historical execution data. A predictive model estimates per-test failure probability for a new commit and reorders the suite accordingly. Evaluation uses leave-one-project-out cross-validation across five Defects4J projects and reports that the commit-aware approach significantly outperforms non-commit-aware baselines on both classification and prioritization metrics.

Significance. If the empirical gains prove robust and the diff-derived features demonstrably add signal beyond coverage and history, the work would strengthen learning-based TCP by showing that structural change semantics improve fault detection effectiveness and cross-project transfer in CI settings. The LOPO protocol, if validated on a broader corpus, would support claims of generalizability.
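
The leave-one-project-out (LOPO) protocol referred to above can be sketched as a simple split generator: each project is held out once as the test set while the model is trained on all the others. This is a minimal sketch, not the authors' code; the row layout is an assumption, and apart from Lang (named in the Figure 1 caption) the project labels are placeholders.

```python
def leave_one_project_out(rows):
    """Yield (held_out_project, train_rows, test_rows) for every project.

    rows: list of dicts, each with at least a 'project' key.
    """
    projects = sorted({row["project"] for row in rows})
    for held_out in projects:
        train = [row for row in rows if row["project"] != held_out]
        test = [row for row in rows if row["project"] == held_out]
        yield held_out, train, test


# Placeholder data: two rows per project; only "Lang" is named in the source.
rows = [
    {"project": name, "test": f"{name}_test_{i}"}
    for name in ["Lang", "ProjectB", "ProjectC", "ProjectD", "ProjectE"]
    for i in range(2)
]

folds = list(leave_one_project_out(rows))
for held_out, train, test in folds:
    print(held_out, len(train), len(test))
```

Because the held-out project contributes no rows to training, any per-project tuning is ruled out by construction; what the split cannot rule out is similarity between projects drawn from the same benchmark.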

major comments (3)

[Evaluation] Evaluation section (LOPO protocol): The use of only five Defects4J projects under leave-one-project-out provides insufficient evidence for the claim of 'robust, generalizable' learning-based TCP. All projects share the same benchmark ecosystem (Java, comparable test-suite sizes, artificially seeded faults), so observed gains may reflect dataset artifacts rather than true cross-project transfer of the commit-aware model.
[Results] Results and claims (abstract and §5): The headline assertion that commit-aware TCP 'significantly outperform[s]' baselines lacks reported quantitative metrics, statistical tests (p-values, effect sizes), confidence intervals, or ablation results that isolate the contribution of structural diff features. Without an ablation removing the diff-derived inputs while retaining coverage and history, it is impossible to confirm that the commit-aware component supplies additive predictive value.
[Method] Method and evaluation (class imbalance): The binary classification task (test reveals at least one failure) is inherently imbalanced, yet the manuscript supplies no description of imbalance handling (class weighting, oversampling, threshold tuning, or appropriate metrics such as AUC-PR). This omission undermines the reliability of the reported classification effectiveness.
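
Two of the options named in the imbalance comment, class weighting and threshold tuning, can be sketched in a few lines. The weighting below uses the common n / (k * count) inverse-frequency heuristic, and the threshold search picks the cut-off with the best F1 on a validation set. Both are generic illustrations under assumed inputs; the labels and probabilities are toy values, not data from the paper.

```python
from collections import Counter


def inverse_frequency_weights(labels):
    """Weight each class by n / (k * count), so rare classes weigh more."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {cls: n / (k * count) for cls, count in counts.items()}


def f1_at(threshold, probs, labels):
    """F1 of the positive class when predicting 1 for probs >= threshold."""
    tp = sum(1 for p, y in zip(probs, labels) if p >= threshold and y == 1)
    fp = sum(1 for p, y in zip(probs, labels) if p >= threshold and y == 0)
    fn = sum(1 for p, y in zip(probs, labels) if p < threshold and y == 1)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def best_f1_threshold(probs, labels):
    """Try every observed probability as a cut-off; keep the best F1."""
    candidates = sorted(set(probs))
    return max(candidates, key=lambda t: f1_at(t, probs, labels))


# Toy validation data: 8 passing tests (0) and 2 failing tests (1).
weights = inverse_frequency_weights([0] * 8 + [1] * 2)

val_probs = [0.9, 0.8, 0.4, 0.3, 0.2, 0.1]
val_labels = [1, 1, 0, 1, 0, 0]
threshold = best_f1_threshold(val_probs, val_labels)
print(weights, threshold)
```

The threshold should be chosen on validation data only; tuning it on the held-out project would leak information into the reported classification scores.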

minor comments (2)

[Abstract] Abstract: The phrase 'non-commit-aware-baselines' is inconsistently hyphenated and should be defined or replaced with a clearer term such as 'coverage-and-history baselines'.
[Method] Notation: The description of the unified predictive model would benefit from an explicit equation or pseudocode showing how diff features, coverage, and history are combined into the failure-probability estimate.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the thoughtful and constructive comments. We address each major point below and indicate the revisions planned for the next version of the manuscript.

Referee: [Evaluation] The use of only five Defects4J projects under leave-one-project-out provides insufficient evidence for the claim of 'robust, generalizable' learning-based TCP. All projects share the same benchmark ecosystem (Java, comparable test-suite sizes, artificially seeded faults), so observed gains may reflect dataset artifacts rather than true cross-project transfer of the commit-aware model.

Authors: We agree that five projects constitute a modest corpus and that the shared Defects4J ecosystem limits the strength of generalizability claims. In the revision we will (i) explicitly list this as a threat to external validity, (ii) soften the language in the abstract and conclusion from 'robust, generalizable' to 'promising cross-project transfer within the Defects4J corpus', and (iii) add a dedicated paragraph outlining concrete plans for future multi-language and larger-scale evaluation. The LOPO protocol itself remains a standard and rigorous design for the available data.
Revision: partial

Referee: [Results] The headline assertion that commit-aware TCP 'significantly outperform[s]' baselines lacks reported quantitative metrics, statistical tests (p-values, effect sizes), confidence intervals, or ablation results that isolate the contribution of structural diff features. Without an ablation removing the diff-derived inputs while retaining coverage and history, it is impossible to confirm that the commit-aware component supplies additive predictive value.

Authors: The current manuscript reports raw performance numbers but omits the requested statistical apparatus and ablation. We will add: (a) Wilcoxon signed-rank tests with p-values and effect sizes (Cliff's delta) for all pairwise comparisons, (b) 95% confidence intervals obtained via bootstrap, and (c) a new ablation table that trains identical models with and without the diff-derived feature set while keeping coverage and history features fixed. These additions will appear in Section 5 and the supplementary material.
Revision: yes

Referee: [Method] The binary classification task (test reveals at least one failure) is inherently imbalanced, yet the manuscript supplies no description of imbalance handling (class weighting, oversampling, threshold tuning, or appropriate metrics such as AUC-PR). This omission undermines the reliability of the reported classification effectiveness.

Authors: Class weighting was applied during training, but the description was inadvertently omitted. The revised method section will explicitly state that we used inverse class-frequency weighting inside the gradient-boosted tree learner and that the decision threshold was chosen to maximize F1 on the validation fold. We will also report AUC-PR (and average precision) alongside AUC-ROC and accuracy to give a balanced view of performance under imbalance.
Revision: yes
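
Two of the statistics promised in the responses above, Cliff's delta and a bootstrap confidence interval, are simple enough to sketch with the standard library. This is a generic illustration: the function names are hypothetical, a percentile bootstrap of the mean is assumed (the responses do not say which bootstrap variant is meant), and the inputs are arbitrary toy numbers rather than results from the paper.

```python
import random


def cliffs_delta(xs, ys):
    """Cliff's delta: P(x > y) - P(x < y) over all cross pairs, in [-1, 1]."""
    greater = sum(1 for x in xs for y in ys if x > y)
    less = sum(1 for x in xs for y in ys if x < y)
    return (greater - less) / (len(xs) * len(ys))


def bootstrap_mean_ci(values, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of `values`."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    n = len(values)
    means = sorted(
        sum(rng.choices(values, k=n)) / n for _ in range(n_resamples)
    )
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Arbitrary toy numbers, not results from the paper.
treatment = [3, 4, 5]
baseline = [1, 2, 3]
delta = cliffs_delta(treatment, baseline)

differences = [1, 2, 3, 4, 5]
ci_low, ci_high = bootstrap_mean_ci(differences)
print(delta, ci_low, ci_high)
```

Cliff's delta reports how often one approach beats the other across all pairings, which is easier to interpret than a p-value alone when only a handful of projects are compared.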

Circularity Check

0 steps flagged

No significant circularity; empirical claims rest on external benchmarks and cross-validation.

The paper proposes a commit-aware learning-based TCP method combining diff structural properties, coverage, and history, then evaluates it empirically via leave-one-project-out cross-validation on five Defects4J projects. No equations, derivations, or first-principles results are presented that reduce to fitted parameters or self-definitions by construction. Central claims of significant outperformance are supported by direct comparisons to non-commit-aware baselines on a public benchmark dataset, without load-bearing self-citations, ansatz smuggling, or renaming of known results. The LOPO protocol and external baselines render the evaluation self-contained and falsifiable outside any internal fitting loop.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Central claim rests on the assumption that commit diffs contain useful structural signals for failure prediction and that cross-project generalization is feasible; no explicit free parameters, axioms, or invented entities are stated in the abstract.

pith-pipeline@v0.9.0 · 5466 in / 1004 out tokens · 56119 ms · 2026-05-07T16:07:33.832563+00:00 · methodology
