DeGenTWeb: A First Look at LLM-dominant Websites
Pith reviewed 2026-05-09 20:51 UTC · model grok-4.3
The pith
DeGenTWeb finds that LLM-dominant websites are prevalent in both Common Crawl and Bing search results, and that their share is growing.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
DeGenTWeb systematically identifies LLM-dominant websites: sites whose content has been generated using LLMs with little human input. It adapts detectors of LLM-generated text for use on web pages and aggregates detection results from multiple pages on a site for accurate site-level categorization. Using DeGenTWeb on Common Crawl and Bing data shows LLM-dominant sites are highly prevalent and their share is growing over time, though accurate identification appears challenging with the latest LLMs.
What carries the argument
DeGenTWeb, a system that adapts LLM-generated text detectors to web pages and aggregates per-page results at the site level to classify LLM-dominant sites.
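The aggregation step can be pictured with a minimal sketch. Everything below is illustrative (the function name, thresholds, and decision rule are assumptions, not DeGenTWeb's actual implementation): a per-page detector emits a score in [0, 1], and a site is labeled LLM-dominant when enough of its sampled pages are flagged.

```python
# Illustrative sketch of site-level aggregation, not DeGenTWeb's real API:
# a page-level detector scores each sampled page in [0, 1]; the site is
# labeled LLM-dominant when the flagged fraction crosses a site threshold.

def classify_site(page_scores, page_threshold=0.9, site_fraction=0.5):
    """Label a site LLM-dominant if at least `site_fraction` of its
    sampled pages score at or above `page_threshold`."""
    if not page_scores:
        return False  # no evidence, default to human-authored
    flagged = sum(1 for s in page_scores if s >= page_threshold)
    return flagged / len(page_scores) >= site_fraction
```

The appeal of aggregating is that independent per-page errors tend to wash out at the site level, which is presumably what lets a mediocre page detector still support an accurate site label.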
If this is right
- LLM-generated content constitutes a large and increasing fraction of the web as captured in public crawls.
- Search engine results contain a substantial and growing number of LLM-dominant sites.
- Detection accuracy for such sites will decline further as newer LLMs improve at evading detectors.
- Continued reliance on web data for AI training will incorporate more synthetic content over time.
Where Pith is reading between the lines
- Search engines may need to develop labeling or filtering methods for LLM-dominant results to maintain user trust.
- If the trend continues, future web-trained models could face degraded performance from ingesting mostly synthetic data.
- Independent manual sampling of classified sites could serve as a low-cost check on the automated method's error rates.
Load-bearing premise
Detectors of LLM-generated text, after adaptation to web pages and site-level aggregation, can reliably separate LLM-dominant sites from human-authored ones, even though the underlying detectors perform much worse than advertised when tuned to minimize false positives.
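The "minimizing false positives" constraint is essentially a threshold-calibration problem. A hedged sketch, assuming access to detector scores on a known human-authored calibration set (the function and its quantile rule are illustrative, not the paper's procedure):

```python
# Illustrative only: choose the page-level threshold as a high quantile of
# detector scores on human-authored calibration pages, so that at most a
# `target_fpr` fraction of human pages would be flagged.
import math

def calibrate_threshold(human_scores, target_fpr=0.01):
    """Return a score threshold flagging at most ~`target_fpr` of the
    human-authored calibration pages (scores >= threshold are flagged)."""
    ranked = sorted(human_scores)
    # Index of the (1 - target_fpr) quantile, clamped to a valid position.
    k = min(math.ceil((1 - target_fpr) * len(ranked)), len(ranked) - 1)
    return ranked[k]
```

The premise above then amounts to a claim that a threshold calibrated this way still leaves enough recall on genuinely LLM-dominant sites, which is exactly where the abstract concedes detectors "perform much worse than advertised".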
What would settle it
A manual audit of hundreds of sites labeled LLM-dominant by DeGenTWeb that finds most were primarily human-authored, or a similar audit of sites labeled human-authored that finds most were LLM-dominant.
Figures
Original abstract
Many recent news reports have claimed that content generated by large language models (LLMs) is taking over the web. However, these claims are typically not based on a representative sample of the web and the methodology underlying them is often opaque. Moreover, when aiming to minimize the chances of falsely attributing human-authored content to LLMs, we find that detectors of LLM-generated text perform much worse than advertised. Consequently, we lack an understanding of the true prevalence and characteristics of LLM content on the web. We describe DeGenTWeb which systematically identifies LLM-dominant websites: sites whose content has been generated using LLMs with little human input. We show how to adapt detectors of LLM-generated text for use on web pages, and how to aggregate detection results from multiple pages on a site for accurate site-level categorization. Using DeGenTWeb, we find that LLM-dominant sites are highly prevalent both in data from Common Crawl and in Bing's search results, and that this share is growing over time. We also show that continuing to accurately identify such sites appears challenging given the capabilities of the latest LLMs.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces DeGenTWeb, a framework for identifying LLM-dominant websites (those with content generated primarily by LLMs with little human input). It describes how existing LLM-generated text detectors are adapted for web pages and how per-page detections are aggregated to the site level, and it applies the method to Common Crawl snapshots and Bing search results. The central claims are that LLM-dominant sites are highly prevalent in both datasets and that their share has grown over time, while noting that accurate identification remains challenging with current detectors.
Significance. If the detection pipeline can be shown to achieve reliable precision, the work would supply the first large-scale, systematic measurement of LLM-generated web content, moving beyond anecdotal news reports. The use of representative web corpora (Common Crawl) and search-engine results is a methodological strength that could inform future studies on content provenance and search quality.
Major comments (3)
- [Abstract] The headline claims that LLM-dominant sites are 'highly prevalent' in Common Crawl and Bing results and that 'this share is growing over time' rest on the assumption that the adapted detectors plus site-level aggregation produce sufficiently low false-positive rates. The abstract itself states that detectors 'perform much worse than advertised' when minimizing false positives, yet the manuscript provides no quantitative validation: no precision-recall figures, no error rates on human-labeled site ground truth, and no comparison of site-level decisions against manual annotation. This absence directly undermines the prevalence and growth conclusions.
- [Detector adaptation and aggregation] (likely §3): The paper asserts that adaptation to web pages and multi-page aggregation enable 'accurate site-level categorization.' However, no ablation or sensitivity analysis is shown demonstrating that these steps raise precision above the baseline low-FP failure mode acknowledged in the abstract. Without such evidence (e.g., performance on mixed-content or post-edited pages), false positives could systematically inflate the reported shares.
- [Results] (likely §4): The reported prevalence figures and temporal trends lack accompanying confidence intervals, threshold-sensitivity tests, or error analysis on edge cases such as sites with both human and LLM content. Given the acknowledged detector weaknesses, these omissions make it impossible to assess whether the 'growing' trend is robust or an artifact of changing detector behavior.
Minor comments (2)
- [Abstract] The abstract would be clearer if it briefly quantified the scale of the Common Crawl and Bing samples used.
- [Methodology] Notation for site-level aggregation (e.g., how per-page scores are combined) should be formalized with a short equation or pseudocode for reproducibility.
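One plausible formalization of the requested aggregation rule, stated here as an illustrative assumption rather than the paper's actual definition: given per-page detector scores for a site, threshold each page, then threshold the flagged fraction.

```latex
% Illustrative only: d(p_1), ..., d(p_n) are detector scores for the n
% sampled pages of site S; \tau_page and \tau_site are free thresholds.
\hat{s}(S) = \frac{1}{n} \sum_{i=1}^{n}
  \mathbb{1}\!\left[\, d(p_i) \ge \tau_{\mathrm{page}} \,\right],
\qquad
S \text{ is LLM-dominant} \iff \hat{s}(S) \ge \tau_{\mathrm{site}}.
```

Averaging raw scores instead of indicators is an equally plausible variant; which one the authors use is exactly what the minor comment asks them to pin down.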
Simulated Author's Rebuttal
We thank the referee for their detailed and constructive comments. The feedback correctly identifies areas where additional validation and analysis would strengthen the claims. We respond to each major comment below and will incorporate revisions to address the concerns.
Point-by-point responses
Referee: [Abstract] The headline claims that LLM-dominant sites are 'highly prevalent' in Common Crawl and Bing results and that 'this share is growing over time' rest on the assumption that the adapted detectors plus site-level aggregation produce sufficiently low false-positive rates. The abstract itself states that detectors 'perform much worse than advertised' when minimizing false positives, yet the manuscript provides no quantitative validation—no precision-recall figures, no error rates on human-labeled site ground truth, and no comparison of site-level decisions against manual annotation. This absence directly undermines the prevalence and growth conclusions.
Authors: We agree that the absence of site-level ground-truth validation limits the strength of the prevalence claims. Our estimates use conservative thresholds chosen specifically to minimize false positives, and we report consistent trends across multiple detectors. To directly address this, the revised manuscript will add a validation subsection that manually annotates a random sample of 200 sites (100 from each dataset) to estimate precision and characterize false-positive cases. revision: yes
Referee: [Detector adaptation and aggregation] The paper asserts that adaptation to web pages and multi-page aggregation enable 'accurate site-level categorization.' However, no ablation or sensitivity analysis is shown demonstrating that these steps raise precision above the baseline low-FP failure mode acknowledged in the abstract. Without such evidence (e.g., performance on mixed-content or post-edited pages), false positives could systematically inflate the reported shares.
Authors: The adaptation filters non-text content and aggregates page-level scores via averaging or majority vote across sampled pages. While the original submission did not include explicit ablations, we will add sensitivity analyses that vary the number of pages per site and the aggregation threshold, plus evaluation on a set of known mixed-content sites, to quantify the improvement over single-page detection. revision: yes
Referee: [Results] The reported prevalence figures and temporal trends lack accompanying confidence intervals, threshold-sensitivity tests, or error analysis on edge cases such as sites with both human and LLM content. Given the acknowledged detector weaknesses, these omissions make it impossible to assess whether the 'growing' trend is robust or an artifact of changing detector behavior.
Authors: We will augment the results section with bootstrap confidence intervals for all prevalence estimates and add threshold-sensitivity plots. We will also include a qualitative discussion of mixed-content sites, noting that our conservative detection strategy tends to classify borderline cases as non-LLM-dominant. revision: yes
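The promised bootstrap intervals are straightforward to sketch. A minimal illustration (the function name and parameters are assumptions, not the authors' code): resample the binary site labels with replacement and take percentiles of the resampled prevalence.

```python
# Illustrative percentile-bootstrap CI for the share of LLM-dominant sites;
# parameters and naming are assumptions, not taken from the manuscript.
import random

def bootstrap_prevalence_ci(labels, n_boot=2000, alpha=0.05, seed=0):
    """95% percentile bootstrap CI for the prevalence of LLM-dominant
    sites; `labels` is a list of 0/1 site labels (1 = LLM-dominant)."""
    rng = random.Random(seed)  # fixed seed for reproducibility
    n = len(labels)
    estimates = sorted(
        sum(rng.choices(labels, k=n)) / n for _ in range(n_boot)
    )
    lo = estimates[int((alpha / 2) * n_boot)]
    hi = estimates[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi
```

Note that such intervals only capture sampling error; they say nothing about the detector's systematic false-positive bias, which is the referee's central concern.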
Circularity Check
No circularity: empirical measurement study using external detectors and data
Full rationale
The paper describes an empirical methodology (DeGenTWeb) that adapts off-the-shelf LLM-generated text detectors for web pages and aggregates per-page results to site-level labels, then applies the resulting classifier to independent external corpora (Common Crawl snapshots and Bing search results) to report prevalence and temporal trends. No equations, fitted parameters, or self-referential derivations are present; the central claims are direct measurements on public data rather than predictions derived from the method's own outputs or prior self-citations. The approach is therefore self-contained against external benchmarks and does not reduce to any of the enumerated circular patterns.
Axiom & Free-Parameter Ledger
Axioms (1)
- Domain assumption: Existing LLM-generated text detectors can be adapted to web pages and aggregated to produce reliable site-level labels for LLM-dominant content.