Efficient and Scalable Provenance Tracking for LLM-Generated Code Snippets

Andrea Gurioli; Davide D'Ascenzo; Federico Pennino; Maurizio Gabbrielli; Stefano Zacchiroli

arXiv: 2605.28510 · v2 · pith:2H27RUFTnew · submitted 2026-05-27 · 💻 cs.SE · cs.AI · cs.IR

Pith reviewed 2026-06-29 10:43 UTC · model grok-4.3

classification 💻 cs.SE cs.AI cs.IR

keywords provenance tracking, LLM-generated code, fingerprinting, vector search, code retrieval, Winnowing, hybrid pipeline, plagiarism detection

The pith

A hybrid vector search plus fingerprinting pipeline tracks LLM code snippet origins at scale with logarithmic query time.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that combining fast vector search to narrow candidates with precise fingerprint matching can make provenance tracking practical for code produced by large language models. This addresses the problem that traditional fingerprint methods like Winnowing work well but scale linearly and thus fail on the billion-snippet corpora used to train modern code LLMs. The hybrid system reaches mean reciprocal rank on par with pure Winnowing for 30-token adapted fragments and improves by up to 5.4 percent for 60-token and longer windows while keeping query time logarithmic. An additional LLM-based judge evaluation shows that many non-ground-truth matches remain highly similar to expected sources, especially with longer windows.

Core claim

HYBRIDSOURCETRACKER first narrows candidates via vector search with a 300M-parameter code encoder, then re-ranks those candidates with Winnowing fingerprinting. On an in vitro 100k-snippet search space drawn from a 10M-snippet THESTACKV2 subset and using adapted queries that emulate identifier renaming, the hybrid reaches mean reciprocal rank comparable to Winnowing for 30-token fragments and outperforms it by up to 5.4 percent once windows reach 60 tokens or more, all while preserving logarithmic query complexity.
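
To make the notion of an adapted query concrete, the sketch below emulates identifier renaming on a Python snippet. It is an illustration only: the adaptation procedure actually used in the paper is not described here, and the function name `rename_identifiers`, the `v0, v1, ...` naming scheme, and the regex-based approach are all assumptions. The regex is deliberately crude and would also rewrite names that appear inside strings and comments.

```python
import builtins
import keyword
import re


def rename_identifiers(code):
    """Consistently replace user-chosen identifiers with v0, v1, ... (hypothetical scheme)."""
    mapping = {}

    def repl(match):
        name = match.group(0)
        # Leave language keywords and built-in names untouched.
        if keyword.iskeyword(name) or hasattr(builtins, name):
            return name
        # The same original name always maps to the same replacement.
        return mapping.setdefault(name, f"v{len(mapping)}")

    return re.sub(r"\b[A-Za-z_]\w*\b", repl, code)


original = "def add(x, y):\n    return x + y"
adapted = rename_identifiers(original)
print(adapted)
```

The adapted snippet keeps the token structure of the original but shares no identifier text with it, which is the kind of change that exact fingerprints over raw text are sensitive to.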

What carries the argument

HYBRIDSOURCETRACKER (HST), the two-stage pipeline that uses vector search for candidate selection followed by Winnowing fingerprint re-ranking.
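
The two-stage idea can be sketched in a few lines. This is a toy under stated assumptions, not the authors' implementation: `embed` is a hashed bag-of-tokens stand-in for the 300M-parameter encoder, the shortlist is found by brute-force cosine similarity rather than an approximate nearest-neighbour index, and the tokenizer, the parameters `k`, `w`, and `top_k`, and all function names are hypothetical.

```python
import hashlib
import math
import re
from collections import Counter


def tokens(code):
    # Crude lexer: identifiers, integers, or single non-space characters.
    return re.findall(r"[A-Za-z_]\w*|\d+|\S", code)


def stable_hash(text):
    # Deterministic 64-bit hash (Python's built-in hash() is salted per process).
    return int.from_bytes(hashlib.blake2b(text.encode(), digest_size=8).digest(), "big")


def winnow(code, k=4, w=4):
    """Winnowing fingerprints: the minimum k-gram hash in every window of w hashes."""
    toks = tokens(code)
    grams = [stable_hash(" ".join(toks[i:i + k])) for i in range(len(toks) - k + 1)]
    if not grams:
        return set()
    return {min(grams[i:i + w]) for i in range(max(1, len(grams) - w + 1))}


def embed(code, dim=256):
    # Stand-in for a learned encoder: L2-normalised hashed bag of tokens.
    vec = [0.0] * dim
    for tok, count in Counter(tokens(code)).items():
        vec[stable_hash(tok) % dim] += count
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def hybrid_search(query, corpus, top_k=2):
    q_vec, q_fp = embed(query), winnow(query)

    # Stage 1: shortlist the top_k documents by cosine similarity.
    def similarity(doc_id):
        return sum(a * b for a, b in zip(q_vec, embed(corpus[doc_id])))

    shortlist = sorted(corpus, key=similarity, reverse=True)[:top_k]

    # Stage 2: re-rank the shortlist by the share of query fingerprints found in each document.
    def overlap(doc_id):
        return len(q_fp & winnow(corpus[doc_id])) / max(1, len(q_fp))

    return sorted(shortlist, key=overlap, reverse=True)


corpus = {
    "a": "def add(x, y):\n    total = x + y\n    return total",
    "b": "for item in items:\n    print(item.name)",
    "c": "while queue:\n    node = queue.pop()\n    visit(node)",
}
query = "total = x + y\n    return total"
print(hybrid_search(query, corpus))
```

Only the shortlist is fingerprinted at query time, so the cost of the second stage depends on `top_k` rather than on the corpus size. The flip side is the dependency the page flags as load-bearing: a source that stage 1 misses can never be recovered by stage 2.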

If this is right

Provenance tracking for verbatim and adapted code snippets becomes feasible on billion-scale training corpora.
Query complexity stays logarithmic rather than linear, enabling checks on modern LLM training sets.
Performance gains over pure fingerprinting appear once code fragments reach 60 tokens or longer.
An LLM judge finds many additional retrieved snippets remain usefully similar to expected sources.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Embedding the pipeline inside code-completion tools could surface source attributions to developers in real time.
The same two-stage pattern could be tested on provenance for generated text or other non-code artifacts.
Raising first-stage recall would improve end-to-end accuracy without altering the fingerprint stage.

Load-bearing premise

The first-stage vector search must retrieve the true source inside its small candidate set with high enough recall that the second-stage fingerprinting can still succeed.

What would settle it

Measure the first-stage recall directly by counting how often the ground-truth source appears among the top-k vector candidates returned for the 100k-snippet adapted-query test set.
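
A minimal sketch of that measurement, alongside the mean reciprocal rank the paper reports, might look like the following. The function names and the toy rankings are assumptions made for illustration; no figures from the paper are reproduced.

```python
def reciprocal_rank(ranked_ids, truth_id):
    """1/rank of the ground-truth source, or 0.0 if it was not retrieved."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id == truth_id:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(results, truths):
    return sum(reciprocal_rank(r, t) for r, t in zip(results, truths)) / len(truths)


def recall_at_k(results, truths, k):
    """Fraction of queries whose ground-truth source is among the first k candidates."""
    hits = sum(1 for r, t in zip(results, truths) if t in r[:k])
    return hits / len(truths)


# Toy data: one ranked candidate list per query, plus the expected source of each query.
results = [["s1", "s7", "s3"], ["s9", "s2", "s4"], ["s5", "s6", "s8"]]
truths = ["s1", "s2", "s0"]

print(mean_reciprocal_rank(results, truths))  # (1 + 1/2 + 0) / 3 = 0.5
print(recall_at_k(results, truths, k=1))      # 1 of 3 queries
print(recall_at_k(results, truths, k=3))      # 2 of 3 queries
```

Applied to the first-stage candidates alone, `recall_at_k` at the candidate-set size gives an upper bound on how many queries the fingerprint stage can still rank correctly.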

Figures

Figures reproduced from arXiv: 2605.28510 by Andrea Gurioli, Davide D'Ascenzo, Federico Pennino, Maurizio Gabbrielli, Stefano Zacchiroli.

**Figure 2.** Incremental representation of different window sizes, …

**Figure 4.** The figures illustrate the Recall@1 (R@1) across window sizes ranging from 7 to 480 tokens with three different …

**Figure 5.** MRR (%) comparison across different window sizes between S…

**Figure 6.** Qdrant query latency performance across different collection sizes and top-k values. The plots demonstrate consistent …

**Figure 7.** Single-query latency of the MOSS system (Winnow…

**Figure 8.** Results of the LLM error similarity evaluation: The left figure illustrates how different clone type errors (verbatim …

Original abstract

Large language models (LLMs) for code completion and generation are increasingly used in software development, yet they may reproduce training examples verbatim and without authorship attribution, raising legal and ethical concerns around plagiarism and license compliance. Classical fingerprint-based plagiarism detectors based on fingerprinting, such as Winnowing, remain highly effective, yet the inspection requires comparing fragments of code to the entire training set, and their linear-time search makes them impractical for the billion-scale corpora used to train modern code LLMs. To bridge this gap, we introduce SOURCETRACKER, a 300M-parameter encoder tailored for code retrieval, together with a hybrid two-stage provenance-tracking pipeline HYBRIDSOURCETRACKER (HST). HST first narrows down a small set of candidate snippets via vector search, then re-ranks those candidates using Winnowing on exact fingerprints. We train and evaluate our system on a 10M-snippet subset of the THESTACKV2 dataset, with both verbatim and adapted snippets that emulate realistic identifier renaming. On an in vitro 100k-snippet search space with adapted queries, our hybrid approach reaches a mean reciprocal rank on par with Winnowing for 30-token fragments. Then, starting from windows >= 60 tokens, it consistently over-performs by up to 5.4% while preserving logarithmic-time query complexity. In a complementary evaluation using an LLM-based judge, we find that many retrieved snippets not labeled as ground truth are still highly similar to the expected sources, particularly with longer context windows, and thus remain useful for end users. Overall, our results demonstrate that integrating vector search with fingerprinting enables scalable, high-precision provenance tracking for code produced by LLMs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The hybrid method offers a practical way to scale provenance checks, but the reported gains rest on an unmeasured assumption about vector search recall.

Hi colleague,

The one thing to know is that this work combines a custom 300M-parameter code encoder with a two-stage pipeline: vector search to get a shortlist, then classical Winnowing on fingerprints. On their 100k-snippet test bed with adapted queries, it matches Winnowing at 30 tokens and improves MRR by up to 5.4% for windows of 60 tokens or more, all with logarithmic query time.

They do a good job framing the problem of provenance for LLM-generated code, where you need to check against huge training sets without linear scans. The use of both exact and identifier-renamed snippets from TheStackV2 makes the test more realistic than pure verbatim cases. The LLM judge evaluation is an extra angle that shows some non-ground-truth hits are still close, which could matter for users.

Where it falls short is the lack of numbers on the vector stage itself. The hybrid claim only works if the encoder's search has high enough recall to include the true source in the small candidate pool for Winnowing to catch. Without recall@K, candidate set size, or an ablation that turns off the second stage, the 5.4% edge could be an artifact of the in-vitro 100k space rather than a general win. The abstract also skips training hyperparameters, data splits, and any error bars, which makes the soundness harder to judge from what's here.

No obvious circularity or invented results; it's all empirical.

This is aimed at the software engineering and AI ethics crowd who care about code ownership and license issues in generated output. Someone building or auditing code models would find the pipeline idea useful.

I would send it to peer review. The idea is worth referee time, but reviewers will need to see the first-stage metrics and more eval details.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces SOURCETRACKER, a 300M-parameter encoder for code retrieval, and HYBRIDSOURCETRACKER (HST), a hybrid two-stage provenance-tracking pipeline. HST first narrows candidates via vector search then re-ranks with Winnowing fingerprinting. It is trained and evaluated on a 10M-snippet subset of THESTACKV2 using both verbatim and adapted (identifier-renamed) queries. On an in-vitro 100k-snippet search space with adapted queries, the hybrid reaches MRR on par with Winnowing for 30-token fragments and outperforms by up to 5.4% for windows >=60 tokens while preserving logarithmic query time. A complementary LLM-based judge evaluation finds many non-ground-truth retrieved snippets remain highly similar to expected sources.

Significance. If the results hold after the requested clarifications, the work supplies a concrete, scalable method for provenance tracking of LLM-generated code that combines the efficiency of vector search with the precision of classical fingerprinting. This directly addresses practical needs around license compliance and attribution in billion-scale code corpora. The evaluation on real THESTACKV2 data with realistic adaptations, the explicit comparison to Winnowing, and the dual use of MRR plus LLM-judge similarity constitute measurable strengths.

major comments (2)

[Evaluation] The reported MRR gains (on par at 30 tokens, +5.4% at >=60 tokens) rest on the premise that the first-stage 300M-parameter vector search returns the true source inside its small candidate set with high recall; the Evaluation section supplies only end-to-end MRR and LLM-judge scores and contains no recall@K, candidate-set size, or ablation isolating the vector-search stage. This omission leaves the source of the hybrid advantage unverified.
[Evaluation] The complementary LLM-judge evaluation asserts that many non-ground-truth retrieved snippets are still highly similar, yet no inter-annotator agreement, correlation with ground-truth labels, or quantification of judge subjectivity is provided. This weakens the claim that such snippets “remain useful for end users.”

minor comments (2)

The abstract states results on a “10M-snippet subset” and a “100k-snippet search space” but does not specify the sampling procedure or whether the 100k set is disjoint from training data.
No error bars, number of runs, or statistical significance tests accompany the reported MRR figures or the 5.4% improvement.
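
One common way to attach an interval to such a comparison is a paired percentile bootstrap over queries. The sketch below is only an illustration of that general technique and is not something the paper reports using: the function name, its defaults, and the toy per-query reciprocal ranks are all assumptions.

```python
import random


def paired_bootstrap_ci(rr_a, rr_b, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for mean(rr_a) - mean(rr_b), resampling queries with replacement."""
    if len(rr_a) != len(rr_b) or not rr_a:
        raise ValueError("need two non-empty lists of equal length")
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    diffs = [a - b for a, b in zip(rr_a, rr_b)]
    n = len(diffs)
    # Mean per-query difference on each resampled query set, sorted for percentile lookup.
    means = sorted(sum(rng.choices(diffs, k=n)) / n for _ in range(n_resamples))
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Toy per-query reciprocal ranks for two systems evaluated on the same five queries.
hybrid_rr = [1.0, 0.5, 1.0, 1.0, 0.0]
winnowing_rr = [1.0, 0.5, 0.5, 1.0, 0.0]
print(paired_bootstrap_ci(hybrid_rr, winnowing_rr))
```

An interval that excludes zero would suggest the MRR gap is unlikely to come from query sampling alone. It says nothing about variance across training runs, which would need repeated runs to estimate.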

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and will revise the manuscript to incorporate the requested clarifications and additional analyses.

Referee: [Evaluation] The reported MRR gains (on par at 30 tokens, +5.4% at >=60 tokens) rest on the premise that the first-stage 300M-parameter vector search returns the true source inside its small candidate set with high recall; the Evaluation section supplies only end-to-end MRR and LLM-judge scores and contains no recall@K, candidate-set size, or ablation isolating the vector-search stage. This omission leaves the source of the hybrid advantage unverified.

Authors: We agree that the current presentation leaves the contribution of the vector-search stage implicit. In the revised manuscript we will report recall@K for the first-stage retrieval (at multiple K values matching the candidate-set size used in the hybrid), explicitly state the candidate-set size, and add an ablation that compares end-to-end MRR of the hybrid against the vector-search stage alone and against Winnowing alone. These additions will directly verify the premise underlying the reported gains.

revision: yes

Referee: [Evaluation] The complementary LLM-judge evaluation asserts that many non-ground-truth retrieved snippets are still highly similar, yet no inter-annotator agreement, correlation with ground-truth labels, or quantification of judge subjectivity is provided. This weakens the claim that such snippets “remain useful for end users.”

Authors: We acknowledge that the LLM-judge results would be more robust with explicit validation metrics. We will expand the evaluation section to describe the judge prompt in full, report any correlation that can be computed between judge scores and ground-truth labels on the subset where both are available, and add a limitations paragraph quantifying the single-judge setup and potential subjectivity. If inter-annotator agreement cannot be obtained without new experiments, we will clearly qualify the claim accordingly.

revision: partial

Circularity Check

0 steps flagged

No significant circularity: purely empirical evaluation

The paper describes training a 300M-parameter encoder on THESTACKV2 and evaluating a hybrid vector-search + Winnowing pipeline on held-out 100k-snippet queries, reporting end-to-end MRR and LLM-judge similarity. No equations, first-principles derivations, or predictions appear in the abstract or described claims. The central results are direct measurements on external data rather than quantities that reduce by construction to fitted inputs or self-citation chains. Self-citations, if present in the full text, are not load-bearing for the reported performance numbers.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract supplies no explicit free parameters, mathematical axioms, or postulated physical entities; the contribution consists of an empirical system whose internal hyperparameters and training choices are not detailed.

[1] [1]

El Tahir Ali, Hussam M

Asim M. El Tahir Ali, Hussam M. Dahwa Abdulla, and V ´aclav Sn´asel. Overview and comparison of plagiarism detection tools. In V ´aclav Sn´asel, Jaroslav Pokorn ´y, and Karel Richta, editors,Proceedings of the Dateso 2011: Annual International Workshop on DAtabases, TExts, Specifications and Objects, Pisek, Czech Republic, April 20, 2011, volume 706 ofCEU...

2011

[2] [2]

Ethical implications of chatgpt and other large language models in academia.Frontiers in Artificial Intelligence, 8:1615761, 2025

Ahmad Almufarreh, Ashfaq Ahmad, Muhammad Arshad, Choo Wou Onn, and Robinson Elechi. Ethical implications of chatgpt and other large language models in academia.Frontiers in Artificial Intelligence, 8:1615761, 2025

2025

[3] [3]

code2vec: learning distributed representations of code.Proc

Uri Alon, Meital Zilberstein, Omer Levy, and Eran Yahav. code2vec: learning distributed representations of code.Proc. ACM Program. Lang., 3(POPL):40:1–40:29, 2019

2019

[4] [4]

Do llms provide links to code similar to what they generate? A study with gemini and bing copilot

Daniele Bifolco, Pietro Cassieri, Giuseppe Scanniello, Massimiliano Di Penta, and Fiorella Zampetti. Do llms provide links to code similar to what they generate? A study with gemini and bing copilot. In22nd IEEE/ACM International Conference on Mining Software Repositories, MSR@ICSE 2025, Ottawa, ON, Canada, April 28-29, 2025, pages 223–

2025

[5] [5]

Quantifying memorization across neural language models

Nicholas Carlini, Daphne Ippolito, Matthew Jagielski, Katherine Lee, Florian Tram `er, and Chiyuan Zhang. Quantifying memorization across neural language models. InThe Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023. OpenReview.net, 2023

2023

[6] [6]

Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Pond ´e de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad Bavari...

work page internal anchor Pith review Pith/arXiv arXiv 2021

[7] [7]

Software heritage: Why and how to preserve software source code

Roberto Di Cosmo and Stefano Zacchiroli. Software heritage: Why and how to preserve software source code. In Shoichiro Hara, Shigeo Sugimoto, and Makoto Goto, editors,Proceedings of the 14th Interna- tional Conference on Digital Preservation, iPRES 2017, Kyoto, Japan, September 25-29, 2017, 2017

2017

[8] [8]

Mose: Hierarchical self-distillation enhances early layer embed- dings, 2025

Andrea Gurioli, Federico Pennino, Jo ˜ao Monteiro, and Maurizio Gab- brielli. Mose: Hierarchical self-distillation enhances early layer embed- dings, 2025

2025

[9] [9]

Lemley, and Percy Liang

Peter Henderson, Xuechen Li, Dan Jurafsky, Tatsunori Hashimoto, Mark A. Lemley, and Percy Liang. Foundation models and fair use. J. Mach. Learn. Res., 24:400:1–400:79, 2023

2023

[10] [10]

Large language models for software engineering: A systematic literature review.ACM Trans

Xinyi Hou, Yanjie Zhao, Yue Liu, Zhou Yang, Kailong Wang, Li Li, Xiapu Luo, David Lo, John Grundy, and Haoyu Wang. Large language models for software engineering: A systematic literature review.ACM Trans. Softw. Eng. Methodol., 33(8):220:1–220:79, 2024

2024

[11] [11]

Plagbench: Exploring the duality of large language models in plagiarism generation and detection.CoRR, abs/2406.16288, 2024

Jooyoung Lee, Toshini Agrawal, Adaku Uchendu, Thai Le, Jinghui Chen, and Dongwon Lee. Plagbench: Exploring the duality of large language models in plagiarism generation and detection.CoRR, abs/2406.16288, 2024

work page arXiv 2024

[12] [12]

Do language models plagiarize? In Ying Ding, Jie Tang, Juan F

Jooyoung Lee, Thai Le, Jinghui Chen, and Dongwon Lee. Do language models plagiarize? In Ying Ding, Jie Tang, Juan F. Sequeda, Lora Aroyo, Carlos Castillo, and Geert-Jan Houben, editors,Proceedings of the ACM Web Conference 2023, WWW 2023, Austin, TX, USA, 30 April 2023 - 4 May 2023, pages 3637–3647. ACM, 2023

2023

[13] [13]

Lemley and Bryan Casey

Mark A. Lemley and Bryan Casey. Fair learning.Texas Law Review, 99(4):743–804, 2021. Available at SSRN: https://ssrn.com/abstract= 3528447

2021

[14] [14]

Raymond Li, Loubna Ben Allal, Yangtian Zi, Niklas Muennighoff, Denis Kocetkov, Chenghao Mou, Marc Marone, Christopher Akiki, Jia Li, Jenny Chim, Qian Liu, Evgenii Zheltonozhskii, Terry Yue Zhuo, Thomas Wang, Olivier Dehaene, Mishig Davaadorj, Joel Lamy-Poirier, Jo˜ao Monteiro, Oleh Shliazhko, Nicolas Gontier, Nicholas Meade, Armel Zebaze, Ming-Ho Yee, Log...

2023

[15] [15]

Smith, Sophie Lebrecht, Yejin Choi, Hannaneh Hajishirzi, Ali Farhadi, and Jesse Dodge

Jiacheng Liu, Taylor Blanton, Yanai Elazar, Sewon Min, Yen-Sung Chen, Arnavi Chheda-Kothary, Huy Tran, Byron Bischoff, Eric Marsh, Michael Schmitz, Cassidy Trier, Aaron Sarnat, Jenna James, Jon Borchardt, Bailey Kuehl, Evie Yu-Yen Cheng, Karen Farley, Sruthi Sreeram, Taira Anderson, David Albright, Carissa Schoenick, Luca Soldaini, Dirk Groeneveld, Rock Y...

work page arXiv 2025

[16] [16]

Infini-gram: Scaling unbounded n-gram language models to a trillion tokens, 2025

Jiacheng Liu, Sewon Min, Luke Zettlemoyer, Yejin Choi, and Hannaneh Hajishirzi. Infini-gram: Scaling unbounded n-gram language models to a trillion tokens, 2025

2025

[17] [17]

StarCoder 2 and The Stack v2: The Next Generation

Anton Lozhkov, Raymond Li, Loubna Ben Allal, Federico Cassano, Joel Lamy-Poirier, Nouamane Tazi, Ao Tang, Dmytro Pykhtar, Jiawei Liu, Yuxiang Wei, Tianyang Liu, Max Tian, Denis Kocetkov, Arthur Zucker, Younes Belkada, Zijian Wang, Qian Liu, Dmitry Abulkhanov, Indraneil Paul, Zhuang Li, Wen-Ding Li, Megan Risdal, Jia Li, Jian Zhu, Terry Yue Zhuo, Evgenii Z...

work page internal anchor Pith review Pith/arXiv arXiv 2024

[18] [18]

Malkov and Dmitry A

Yury A. Malkov and Dmitry A. Yashunin. Efficient and robust approx- imate nearest neighbor search using hierarchical navigable small world graphs.IEEE Trans. Pattern Anal. Mach. Intell., 42(4):824–836, 2020

2020

[19] [19]

Rossi, and Thien Huu Nguyen

Thuat Nguyen, Chien Van Nguyen, Viet Dac Lai, Hieu Man, Nghia Trung Ngo, Franck Dernoncourt, Ryan A. Rossi, and Thien Huu Nguyen. CulturaX: A cleaned, enormous, and multilingual dataset for large language models in 167 languages. In Nicoletta Calzolari, Min-Yen Kan, Veronique Hoste, Alessandro Lenci, Sakriani Sakti, and Nianwen Xue, editors,Proceedings of...

2024

[20]

Michael-Andrei Panaitescu-Liess, Zora Che, Bang An, Yuancheng Xu, Pankayaraj Pathmanathan, Souradip Chakraborty, Sicheng Zhu, Tom Goldstein, and Furong Huang. Can watermarking large language models prevent copyrighted text generation and hide training data? CoRR, abs/2407.17417, 2024

2024

[21]

Guilherme Penedo, Hynek Kydlíček, Loubna Ben Allal, Anton Lozhkov, Margaret Mitchell, Colin A. Raffel, Leandro von Werra, and Thomas Wolf. The fineweb datasets: Decanting the web for the finest text data at scale. In Amir Globersons, Lester Mackey, Danielle Belgrave, Angela Fan, Ulrich Paquet, Jakub M. Tomczak, and Cheng Zhang, editors, Advances in Neura...

2024

[22]

Qdrant Team. Qdrant: Vector search engine, 2025

2025

[23]

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision. In Marina Meila and Tong Zhang, editors, Proceedings of the 38th International Conference on Machine ...

2021

[24]

Chanchal Kumar Roy and James R Cordy. A survey on software clone detection research. Queen’s School of computing TR, 541(115):64–68, 2007

2007

[25]

Code Llama: Open Foundation Models for Code

Baptiste Rozière, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai Gat, Xiaoqing Ellen Tan, Yossi Adi, Jingyu Liu, Tal Remez, Jérémy Rapin, Artyom Kozhevnikov, Ivan Evtimov, Joanna Bitton, Manish Bhatt, Cristian Canton-Ferrer, Aaron Grattafiori, Wenhan Xiong, Alexandre Défossez, Jade Copet, Faisal Azhar, Hugo Touvron, Louis Martin, Nicolas Usuni...

2023

[26]

Saul Schleimer, Daniel Shawcross Wilkerson, and Alex Aiken. Winnowing: Local algorithms for document fingerprinting. In Alon Y. Halevy, Zachary G. Ives, and AnHai Doan, editors, Proceedings of the 2003 ACM SIGMOD International Conference on Management of Data, San Diego, California, USA, June 9-12, 2003, pages 76–85. ACM, 2003

2003

[27]

Jeffrey Svajlenko and Chanchal K. Roy. Evaluating clone detection tools with bigclonebench. In Rainer Koschke, Jens Krinke, and Martin P. Robillard, editors, 2015 IEEE International Conference on Software Maintenance and Evolution, ICSME 2015, Bremen, Germany, September 29 - October 1, 2015, pages 131–140. IEEE Computer Society, 2015

2015

[28]

Nikhil Vyas, Sham M. Kakade, and Boaz Barak. On provable copyright protection for generative models. In Andreas Krause, Emma Brunskill, Kyunghyun Cho, Barbara Engelhardt, Sivan Sabato, and Jonathan Scarlett, editors, International Conference on Machine Learning, ICML 2023, 23-29 July 2023, Honolulu, Hawaii, USA, volume 202 of Proceedings of Machine Lear...

2023

[29]

Qwen3 Technical Report

An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, Chujie Zheng, Dayiheng Liu, Fan Zhou, Fei Huang, Feng Hu, Hao Ge, Haoran Wei, Huan Lin, Jialong Tang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jian Yang, Jiaxi Yang, Jingren Zhou, Junyang Lin, Kai Dang, Keqin Bao, Kexin Yang, Le Yu, Liangha...

2025

[30]

Albert Ziegler. GitHub Copilot research recitation — parrot or crow? a first look at rote learning in GitHub Copilot suggestions. https://github.blog/ai-and-ml/github-copilot/github-copilot-research-recitation/, 2021. Accessed: 2024-09-25

2021