pith. machine review for the scientific record.

arxiv: 2604.25778 · v1 · submitted 2026-04-28 · 💻 cs.SE · cs.AI · cs.IR

Recognition: unknown

Can Code Evaluation Metrics Detect Code Plagiarism?

Authors on Pith: no claims yet

Pith reviewed 2026-05-07 16:10 UTC · model grok-4.3

classification 💻 cs.SE · cs.AI · cs.IR
keywords code plagiarism detection · code evaluation metrics · CodeBLEU · CrystalBLEU · ranking metrics · source code plagiarism · ConPlag · IRPlag

The pith

Code evaluation metrics rank plagiarized code as well as specialized detection tools.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines whether metrics built to score generated code can also identify plagiarism by comparing pairs at six increasing levels of modification. It runs five such metrics against two labeled datasets and pits them against established plagiarism tools using ranking measures that avoid arbitrary thresholds. Several metrics match or exceed the tools overall once basic preprocessing is applied, with the strongest results at low modification levels and continued competitiveness at the highest levels. This matters because it suggests existing code assessment tools could serve dual purposes in education without requiring separate plagiarism systems. The work focuses on ranking quality rather than binary classification.

Core claim

On the ConPlag and IRPlag datasets, Dolos leads without preprocessing while CrystalBLEU, CodeBLEU, and RUBY outperform JPlag; after preprocessing CrystalBLEU surpasses Dolos overall. Performance is highest at L1 and declines from L4, yet CrystalBLEU remains competitive at L6. Per-dataset results vary, with Dolos strongest on raw ConPlag and CrystalBLEU best on the others, leading to the conclusion that code evaluation metrics achieve comparable ranking performance to dedicated plagiarism tools.
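
CrystalBLEU's staying power at the harder levels traces to its core idea: score n-gram overlap while discarding the most frequent, "trivially shared" n-grams (language boilerplate such as closing braces), which otherwise inflate similarity between unrelated programs. The sketch below is a toy illustration of that idea, not the paper's implementation; the tokenisation, n-gram order, and cut-off k are placeholder choices.

```python
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def trivially_shared(corpus, n=2, k=50):
    """The k most frequent n-grams across a corpus of token lists.
    CrystalBLEU-style scoring ignores these, since boilerplate
    (e.g. '} }' in Java) inflates overlap between unrelated programs."""
    counts = Counter()
    for tokens in corpus:
        counts.update(ngrams(tokens, n))
    return {g for g, _ in counts.most_common(k)}

def overlap_score(a_tokens, b_tokens, ignore, n=2):
    """Clipped-precision n-gram overlap between two token streams,
    skipping the trivially shared set. A toy stand-in for CrystalBLEU."""
    a = Counter(g for g in ngrams(a_tokens, n) if g not in ignore)
    b = Counter(g for g in ngrams(b_tokens, n) if g not in ignore)
    if not a:
        return 0.0
    return sum(min(c, b[g]) for g, c in a.items()) / sum(a.values())
```

Ranking submission pairs by such a score is all the threshold-free protocol requires; no similarity cut-off is ever chosen.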

What carries the argument

Threshold-free ranking-based evaluation measures applied to code pairs across L1-L6 plagiarism levels on the ConPlag (raw and template-free) and IRPlag datasets, directly comparing five code evaluation metrics to JPlag and Dolos.
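
The summary names only "threshold-free ranking-based measures"; given the ROC and PR curves in the figures below, area under the ROC curve and average precision are plausible instances. A minimal sketch of that protocol, assuming a hypothetical score_fn that maps any code pair to a similarity score (a CEM, JPlag, or Dolos alike):

```python
# Minimal sketch of threshold-free ranking evaluation: score every labelled
# pair, then summarise ranking quality without choosing a similarity cut-off.
from sklearn.metrics import average_precision_score, roc_auc_score

def rank_performance(pairs, labels, score_fn):
    """pairs: iterable of (code_a, code_b); labels: 1 = plagiarised, 0 = not.
    score_fn: any similarity scorer, CEM or dedicated tool."""
    scores = [score_fn(a, b) for a, b in pairs]
    return {
        "auroc": roc_auc_score(labels, scores),        # ROC area (Fig. 2)
        "ap": average_precision_score(labels, scores), # PR area (Figs. 3, 5)
    }
```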

If this is right

  • Instructors could reuse code evaluation metrics already in their grading pipelines for initial plagiarism screening.
  • Adding simple preprocessing steps noticeably boosts metric effectiveness on plagiarism ranking tasks (see the sketch after this list).
  • Detection strength drops as modification complexity increases, so tools may need level-specific adjustments.
  • CrystalBLEU in particular offers a practical option for harder plagiarism levels where other methods weaken.
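
The summary does not spell out which preprocessing steps are applied; comment and whitespace removal are typical for Java datasets like ConPlag and IRPlag, and template removal is the difference between ConPlag's raw and template-free versions. A minimal sketch under those assumptions, not the paper's exact pipeline:

```python
import re

def preprocess(source: str) -> str:
    """Illustrative 'basic preprocessing' for Java-like source: strip
    comments and collapse whitespace before scoring. The paper's actual
    pipeline may differ (e.g. it may also remove shared template code)."""
    source = re.sub(r"/\*.*?\*/", " ", source, flags=re.DOTALL)  # block comments
    source = re.sub(r"//[^\n]*", " ", source)                    # line comments
    return re.sub(r"\s+", " ", source).strip()                   # whitespace runs
```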

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same metrics might help flag AI-generated code that closely mimics human student submissions if the ranking patterns transfer.
  • Testing the approach on additional programming languages would show whether certain metrics have language-specific advantages.
  • Integration into submission platforms could enable lightweight, always-on checks without new dedicated software.

Load-bearing premise

That results from these two specific labeled datasets with controlled modification levels will hold for real student submissions in other languages and practical classroom settings.

What would settle it

A follow-up test on a fresh collection of verified real-world student plagiarism cases in which the leading code evaluation metrics produce markedly lower ranking scores than JPlag or Dolos.

Figures

Figures reproduced from arXiv: 2604.25778 by Fahad Ebrahim, Mike Joy (The University of Warwick).

Figure 1
Figure 1. High-Level Methodology.
Figure 2
Figure 2. ROC curve across the pooled full datasets.
Figure 3
Figure 3. PR curve across the pooled full datasets.
Figure 5
Figure 5. PR curve across the pooled full datasets with preprocessing.
Original abstract

Source Code Plagiarism Detection (SCPD) plays an important role in maintaining fairness and academic integrity in software engineering education. Code Evaluation Metrics (CEMs) are developed for assessing code generation tasks. However, it remains unclear whether such metrics can reliably detect plagiarism across different levels of modification (L1-L6), increasing in complexity. In this paper, we perform a comparative empirical study using two open-source labelled datasets, ConPlag (raw and template-free versions) and IRPlag. We evaluate five CEMs, namely CodeBLEU, CrystalBLEU, RUBY, Tree Structured Edit Distance (TSED), and CodeBERTScore. The performance is evaluated using threshold-free ranking-based measures to assess overall, per dataset, and per-level plagiarism performance. The results are compared against state-of-the-art (SOTA) Source Code Plagiarism Detection Tools (SCPDTs), JPlag and Dolos. Our findings show that without preprocessing, Dolos achieves the highest overall ranking performance, while among the individual metrics, CrystalBLEU, CodeBLEU, and RUBY outperform JPlag. Performance is strongest at L1 and drops from L4 onward, while CrystalBLEU remains competitive on L6. With preprocessing, CrystalBLEU surpasses Dolos overall. Per dataset, Dolos achieved the best ranking on the ConPlag raw dataset, while CrystalBLEU was the best-performing metric on the remaining datasets. At the plagiarism levels, Dolos remains strongest on L4, while Crystal-BLEU leads most of the remaining difficult levels. These results indicate that CEMs are comparable to dedicated tools in terms of ranking metrics.

Editorial analysis

A structured set of objections, weighed in public.

Referee report, simulated authors' rebuttal, circularity audit, and an axiom and free-parameter ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper conducts an empirical comparison of five Code Evaluation Metrics (CEMs: CodeBLEU, CrystalBLEU, RUBY, TSED, CodeBERTScore) against two dedicated Source Code Plagiarism Detection Tools (JPlag and Dolos) on the ConPlag (raw and template-free) and IRPlag datasets. It evaluates performance across six synthetic plagiarism levels (L1-L6) using threshold-free ranking-based measures, reporting that CEMs (especially CrystalBLEU) achieve comparable or superior ranking performance to the SOTA tools, with strongest results at low modification levels and sensitivity to preprocessing.

Significance. If the central ranking-based findings hold, the work demonstrates that metrics designed for code generation evaluation can serve as viable alternatives or complements to specialized plagiarism detectors, potentially lowering barriers for educators. Strengths include the use of open labeled datasets, direct head-to-head comparison with JPlag/Dolos, and avoidance of arbitrary similarity thresholds. However, significance is limited by the controlled nature of the L1-L6 modifications and lack of validation on authentic student code.

major comments (2)
  1. [Results section (per-level and overall ranking)] The reported performance drop after L4 and the preprocessing-dependent reversal (CrystalBLEU overtaking Dolos) are load-bearing for the comparability claim, yet no statistical significance tests (e.g., Wilcoxon signed-rank on the ranking metrics) or confidence intervals are provided to establish whether observed differences exceed noise.
  2. [Evaluation and discussion] The central claim that CEMs are 'comparable to dedicated tools' rests on synthetic L1-L6 modifications in ConPlag/IRPlag; the manuscript contains no experiments or analysis on authentic student submissions (which often involve semantic rewrites, library changes, or cross-language elements outside the L1-L6 taxonomy), leaving generalization untested and weakening applicability to real plagiarism detection.
minor comments (2)
  1. [Abstract] Inconsistent naming of 'CrystalBLEU' (sometimes hyphenated as 'Crystal-BLEU') should be standardized throughout.
  2. [Discussion or Conclusion] The paper would benefit from an explicit limitations subsection addressing dataset construction biases and the threshold-free assumption's sensitivity to ranking ties.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below, indicating where we agree and will revise the paper accordingly.

Point-by-point responses
  1. Referee: Results section (per-level and overall ranking): The reported performance drop after L4 and the preprocessing-dependent reversal (CrystalBLEU overtaking Dolos) are load-bearing for the comparability claim, yet no statistical significance tests (e.g., Wilcoxon signed-rank on the ranking metrics) or confidence intervals are provided to establish whether observed differences exceed noise.

    Authors: We agree that statistical significance testing would strengthen the claims regarding the performance drop after L4 and the preprocessing effects. In the revised manuscript, we will incorporate Wilcoxon signed-rank tests on the ranking metrics to compare methods and add confidence intervals for the key results to assess whether differences exceed noise (a minimal sketch of such an analysis appears after these responses). revision: yes

  2. Referee: Evaluation and discussion: The central claim that CEMs are 'comparable to dedicated tools' rests on synthetic L1-L6 modifications in ConPlag/IRPlag; the manuscript contains no experiments or analysis on authentic student submissions (which often involve semantic rewrites, library changes, or cross-language elements outside the L1-L6 taxonomy), leaving generalization untested and weakening applicability to real plagiarism detection.

    Authors: We acknowledge that the evaluation relies on controlled synthetic modifications, which enable systematic analysis of specific levels but do not capture all real-world plagiarism behaviors such as semantic rewrites or cross-language changes. This limits direct generalization. We will expand the discussion to explicitly address this limitation, clarify the scope of our comparability claims, and suggest future validation on authentic student submissions. revision: partial
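
Neither test appears in the reviewed manuscript; the sketch below shows what the promised analysis could look like, assuming paired per-level ranking scores for two methods as placeholder inputs.

```python
import numpy as np
from scipy.stats import wilcoxon

def compare_methods(scores_a, scores_b, n_boot=10_000, seed=0):
    """Paired Wilcoxon signed-rank test plus a bootstrap 95% CI on the
    mean difference of ranking scores (e.g. per-level AUROC values)."""
    a, b = np.asarray(scores_a), np.asarray(scores_b)
    _, p_value = wilcoxon(a, b)                  # non-parametric, paired
    diffs = a - b
    rng = np.random.default_rng(seed)
    boots = [rng.choice(diffs, size=diffs.size, replace=True).mean()
             for _ in range(n_boot)]
    low, high = np.percentile(boots, [2.5, 97.5])
    return p_value, (low, high)
```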

Circularity Check

0 steps flagged

No significant circularity in empirical comparison

Full rationale

The paper reports an empirical study that directly measures ranking performance of five CEMs against JPlag and Dolos on the publicly available ConPlag and IRPlag datasets across L1-L6 levels. All reported results (overall rankings, per-dataset, per-level) follow from the experimental protocol, threshold-free metrics, and external tool outputs without any fitted parameters, self-definitions, or load-bearing self-citations that reduce the claims to the inputs by construction. The derivation chain is therefore self-contained and consists of standard benchmark evaluation.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim depends on the datasets serving as valid proxies for plagiarism and on ranking measures being suitable proxies for detection capability; no free parameters or invented entities are introduced.

axioms (2)
  • domain assumption The ConPlag and IRPlag datasets accurately represent real-world plagiarism cases at levels L1-L6.
    Used as the basis for all performance measurements across modification complexities.
  • domain assumption Threshold-free ranking metrics provide a valid way to compare plagiarism detection performance.
    Central to the evaluation methodology and comparisons reported.

pith-pipeline@v0.9.0 · 5603 in / 1346 out tokens · 84882 ms · 2026-05-07T16:10:58.294517+00:00 · methodology

