Improving Academic Plagiarism Detection for STEM Documents by Analyzing Mathematical Content and Citations
Pith reviewed 2026-05-25 13:39 UTC · model grok-4.3
The pith
Combining math content and citation similarity with text analysis improves detection of concealed plagiarism in STEM documents.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors establish that a two-stage process combining mathematical content similarity, academic citation similarity, and text similarity outperforms text-only approaches in identifying confirmed cases of academic plagiarism. The process relies on newly developed order-sensitive measures for mathematical features and can flag suspicious documents within a collection of 102,000 STEM publications.
What carries the argument
The two-stage detection process integrating math-based, citation-based, and text-based similarity measures, with new measures that incorporate the order of mathematical features.
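As a rough illustration of how a two-stage process of this kind could be organized, the sketch below uses a cheap set-overlap filter to select candidates and then ranks them by a weighted mix of three per-feature scores. The Jaccard filter, the weights, the threshold, and every name (`Doc`, `candidate_filter`, `detailed_score`, `detect`) are assumptions made for illustration; the summary above does not say how the authors implement either stage.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    doc_id: str
    math: list = field(default_factory=list)       # identifiers in order of appearance
    citations: list = field(default_factory=list)  # cited work ids in order of appearance
    words: list = field(default_factory=list)      # tokens of the running text


def jaccard(a, b):
    """Order-agnostic set overlap in [0, 1]."""
    sa, sb = set(a), set(b)
    if not sa and not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)


def candidate_filter(query, collection, min_overlap=0.2):
    """Stage 1 (assumed): keep documents that share enough math or citations."""
    return [
        d for d in collection
        if d.doc_id != query.doc_id
        and max(jaccard(query.math, d.math),
                jaccard(query.citations, d.citations)) >= min_overlap
    ]


def detailed_score(query, cand, weights=(0.4, 0.4, 0.2)):
    """Stage 2 (assumed): weighted mix of math, citation, and text similarity."""
    w_math, w_cit, w_text = weights
    return (w_math * jaccard(query.math, cand.math)
            + w_cit * jaccard(query.citations, cand.citations)
            + w_text * jaccard(query.words, cand.words))


def detect(query, collection, min_overlap=0.2):
    """Run both stages and return (doc_id, score) pairs, best match first."""
    candidates = candidate_filter(query, collection, min_overlap)
    ranked = sorted(candidates, key=lambda d: detailed_score(query, d), reverse=True)
    return [(d.doc_id, round(detailed_score(query, d), 3)) for d in ranked]


# Toy data: "a" reuses math and citations with new wording, "b" shares only words.
query = Doc("q", ["E", "m", "c"], ["r1", "r2", "r3"], ["energy", "mass"])
collection = [
    Doc("a", ["E", "m", "c"], ["r1", "r2", "r3"], ["totally", "different", "wording"]),
    Doc("b", ["x", "y"], ["r9"], ["energy", "mass"]),
]
results = detect(query, collection)
```

In this toy setup the reworded document "a" survives the filter because its math and citations match, while "b" is dropped in stage 1 even though its wording overlaps. That is the kind of case the non-text features are meant to surface.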
If this is right
- The new order-aware similarity measures for mathematical features outperform the measures from prior work.
- Combined math and citation analysis identifies potentially suspicious cases inside a large collection of 102K STEM documents.
- Math-based and citation-based features serve as a supplement to text-based detection for concealed plagiarism.
- Direct comparison on confirmed cases shows measurable gains from the multi-feature approach.
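To make the difference between order-aware and order-agnostic comparison concrete, the sketch below contrasts a longest-common-subsequence score with a plain set overlap on sequences of math identifiers. The LCS-based score is a stand-in chosen for illustration and is not claimed to be one of the authors' measures; the function names and the normalization by the longer sequence are likewise assumptions.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence (classic dynamic program)."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def order_aware_similarity(a, b):
    """LCS length normalized by the longer sequence (assumed normalization)."""
    if not a or not b:
        return 0.0
    return lcs_length(a, b) / max(len(a), len(b))


def set_similarity(a, b):
    """Jaccard overlap, which ignores the order of identifiers."""
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0


source = ["E", "m", "c", "p", "v"]
same_order = ["E", "m", "c", "p", "v"]
shuffled = ["v", "p", "c", "m", "E"]  # same identifiers, reversed order

print(order_aware_similarity(source, same_order), set_similarity(source, same_order))
print(order_aware_similarity(source, shuffled), set_similarity(source, shuffled))
```

The set overlap cannot tell the two candidates apart, whereas the order-aware score drops sharply for the reversed sequence. Whether that extra sensitivity helps or hurts on real documents is an empirical question that the evaluation has to answer.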
Where Pith is reading between the lines
- Detection systems could incorporate domain-specific non-text features like equations as a standard layer for technical literature.
- Similar ordered-feature analysis might be applied to diagrams, tables, or data sets to address additional reuse patterns.
- Large-scale screening of submissions could become feasible if the method proves efficient on production collections.
Load-bearing premise
The confirmed cases of academic plagiarism used for evaluation are representative of concealed forms such as strong paraphrases, translations, and idea reuse.
What would settle it
A new test set of confirmed plagiarism cases in which the combined math-plus-citation approach flags no additional instances beyond those already caught by text analysis alone would falsify the improvement claim.
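The falsification condition above reduces to a set difference, sketched here on made-up case identifiers. The function name `additional_detections` and all data are hypothetical.

```python
def additional_detections(text_flags, combined_flags, confirmed):
    """Confirmed cases flagged by the combined approach but missed by text alone."""
    text_hits = set(text_flags) & set(confirmed)
    combined_hits = set(combined_flags) & set(confirmed)
    return combined_hits - text_hits


# Hypothetical identifiers: c* are confirmed cases, x* are false alarms.
confirmed = {"c1", "c2", "c3", "c4"}
text_flags = {"c1", "c2", "x9"}
combined_flags = {"c1", "c3", "x7"}

extra = additional_detections(text_flags, combined_flags, confirmed)
claim_survives = len(extra) > 0
```

An empty result on a fresh, representative test set would count against the improvement claim. A non-empty result only shows added coverage and says nothing yet about the false alarms that come with it.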
Original abstract
Identifying academic plagiarism is a pressing task for educational and research institutions, publishers, and funding agencies. Current plagiarism detection systems reliably find instances of copied and moderately reworded text. However, reliably detecting concealed plagiarism, such as strong paraphrases, translations, and the reuse of nontextual content and ideas is an open research problem. In this paper, we extend our prior research on analyzing mathematical content and academic citations. Both are promising approaches for improving the detection of concealed academic plagiarism primarily in Science, Technology, Engineering and Mathematics (STEM). We make the following contributions: i) We present a two-stage detection process that combines similarity assessments of mathematical content, academic citations, and text. ii) We introduce new similarity measures that consider the order of mathematical features and outperform the measures in our prior research. iii) We compare the effectiveness of the math-based, citation-based, and text-based detection approaches using confirmed cases of academic plagiarism. iv) We demonstrate that the combined analysis of math-based and citation-based content features allows identifying potentially suspicious cases in a collection of 102K STEM documents. Overall, we show that analyzing the similarity of mathematical content and academic citations is a striking supplement for conventional text-based detection approaches for academic literature in the STEM disciplines.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript extends prior work on plagiarism detection in STEM documents by proposing a two-stage process that integrates similarity measures for mathematical content (including new order-aware features), academic citations, and text. It claims these new math measures outperform prior versions, that the combined math/citation/text approaches are effective when evaluated on confirmed plagiarism cases, and that applying the math+citation combination to a 102K-document STEM collection identifies suspicious cases. The central claim is that math and citation analysis provides a striking supplement to conventional text-based methods specifically for detecting concealed forms of plagiarism such as strong paraphrases, translations, and idea reuse.
Significance. If the evaluation holds, the work would meaningfully advance detection of non-textual and concealed plagiarism in STEM by exploiting domain-specific signals (ordered math expressions and citation patterns) that are harder to disguise than text. The scale of the 102K-document demonstration and the focus on order-aware math measures are positive elements that could inform practical systems if the representativeness of the ground-truth cases is established.
major comments (2)
- [Contribution (iii) and evaluation section] Contribution (iii) and the associated evaluation section: the claim that the math-based and citation-based approaches supplement text-based detection for concealed plagiarism rests on performance differences observed on 'confirmed cases of academic plagiarism.' The manuscript does not report the breakdown of these cases by concealment type (verbatim/light rewording vs. strong paraphrases, translations, or idea reuse), which is load-bearing for the central claim; if the confirmed set is dominated by easily detectable verbatim copies, the comparative results do not establish added value in the concealed-plagiarism regime highlighted in the abstract and skeptic note.
- [Two-stage process and math similarity measures section] Section describing the two-stage detection process and new order-aware math measures: the outperformance of the new measures over prior work is asserted, but without explicit reporting of statistical significance tests, effect sizes, or controls for post-hoc threshold selection on the confirmed cases, it is unclear whether the gains are robust or depend on dataset-specific tuning.
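One way to address the second major comment is a paired test on per-case detection outcomes. The sketch below implements an exact two-sided McNemar test using only the standard library. The report does not say which test would be appropriate, so the choice of McNemar, the function name, and the example data are all assumptions.

```python
from math import comb


def mcnemar_exact(old_hits, new_hits):
    """Exact two-sided McNemar test on paired boolean detection outcomes.

    Returns (b, c, p) where b counts cases only the old measure detects,
    c counts cases only the new measure detects, and p is the p-value.
    """
    if len(old_hits) != len(new_hits):
        raise ValueError("outcome lists must be paired")
    b = sum(1 for o, n in zip(old_hits, new_hits) if o and not n)
    c = sum(1 for o, n in zip(old_hits, new_hits) if n and not o)
    n = b + c
    if n == 0:
        return b, c, 1.0  # no discordant pairs, so no evidence either way
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return b, c, min(1.0, 2 * tail)


# Hypothetical outcomes on ten confirmed cases (True = case detected).
old = [True, True, False, False, False, False, False, True, False, False]
new = [True, True, True, True, True, True, True, True, False, False]

b, c, p = mcnemar_exact(old, new)
recall_gain = (c - b) / len(old)  # a simple effect size: difference in recall
```

A test like this is only meaningful if the detection thresholds were fixed before looking at the evaluation cases, which is the control the comment asks the authors to document.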
minor comments (2)
- [Abstract] The abstract and introduction use 'striking supplement' without quantifying the improvement (e.g., precision/recall deltas); a concrete metric comparison would strengthen the presentation.
- [Mathematical content similarity section] Notation for the order-aware math features should be defined more explicitly when first introduced to aid reproducibility.
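The breakdown requested in the first major comment and the metric comparison requested in the first minor comment could be reported together as recall per concealment type and method. The sketch below shows one possible layout; the case records, the type labels, and the `recall_by_type` helper are invented for illustration and do not reflect the paper's data.

```python
from collections import defaultdict


def recall_by_type(cases, method):
    """Fraction of confirmed cases of each concealment type that a method detects."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for case in cases:
        totals[case["type"]] += 1
        if method in case["detected_by"]:
            hits[case["type"]] += 1
    return {t: hits[t] / totals[t] for t in totals}


# Hypothetical confirmed cases; none of these values come from the paper.
cases = [
    {"id": "c1", "type": "verbatim", "detected_by": {"text", "math", "citation"}},
    {"id": "c2", "type": "verbatim", "detected_by": {"text"}},
    {"id": "c3", "type": "paraphrase", "detected_by": {"math"}},
    {"id": "c4", "type": "paraphrase", "detected_by": {"citation", "text"}},
    {"id": "c5", "type": "translation", "detected_by": {"citation"}},
]

text_recall = recall_by_type(cases, "text")
math_recall = recall_by_type(cases, "math")
delta = {t: math_recall[t] - text_recall[t] for t in text_recall}
```

A table of such per-type deltas would show directly whether the non-text features add value on strong paraphrases and translations or mainly repeat what text matching already finds on verbatim copies.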
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. The comments highlight important aspects of our evaluation that we will address through revisions to strengthen the presentation of our results.
- Referee: [Contribution (iii) and evaluation section] Contribution (iii) and the associated evaluation section: the claim that the math-based and citation-based approaches supplement text-based detection for concealed plagiarism rests on performance differences observed on 'confirmed cases of academic plagiarism.' The manuscript does not report the breakdown of these cases by concealment type (verbatim/light rewording vs. strong paraphrases, translations, or idea reuse), which is load-bearing for the central claim; if the confirmed set is dominated by easily detectable verbatim copies, the comparative results do not establish added value in the concealed-plagiarism regime highlighted in the abstract and skeptic note.
  Authors: We agree that explicitly reporting the breakdown of confirmed cases by concealment type would strengthen support for the central claim regarding concealed plagiarism. We will revise the evaluation section to include this breakdown based on the available case metadata. revision: yes
- Referee: [Two-stage process and math similarity measures section] Section describing the two-stage detection process and new order-aware math measures: the outperformance of the new measures over prior work is asserted, but without explicit reporting of statistical significance tests, effect sizes, or controls for post-hoc threshold selection on the confirmed cases, it is unclear whether the gains are robust or depend on dataset-specific tuning.
  Authors: We acknowledge the need for statistical rigor. We will add significance tests and effect sizes to the revised manuscript. We will also clarify the threshold selection procedure and add any necessary controls to demonstrate it was not performed post-hoc on the evaluation cases. revision: yes
Circularity Check
No significant circularity; evaluation relies on external ground truth
The paper's contributions consist of a two-stage detection process, new order-aware similarity measures for math features, and empirical comparisons on confirmed external plagiarism cases plus an independent 102K-document collection. No equations or derivations reduce by construction to fitted parameters or self-referential definitions. Self-citation to prior work on math/citation analysis is present but not load-bearing, as the effectiveness claims are validated against independent confirmed cases rather than derived from the cited prior results. The analysis is self-contained against external benchmarks.
Venue: JCDL’19, Jun. 2019, Urbana-Champaign, IL, USA