DeGenTWeb: A First Look at LLM-dominant Websites
Pith reviewed 2026-05-09 20:51 UTC · model grok-4.3
The pith
DeGenTWeb finds that LLM-dominant websites are prevalent in both Common Crawl and Bing search results, and that their share is growing.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
DeGenTWeb systematically identifies LLM-dominant websites: sites whose content has been generated using LLMs with little human input. It adapts detectors of LLM-generated text for use on web pages and aggregates detection results from multiple pages on a site for accurate site-level categorization. Using DeGenTWeb on Common Crawl and Bing data shows LLM-dominant sites are highly prevalent and their share is growing over time, though accurate identification appears challenging with the latest LLMs.
What carries the argument
DeGenTWeb, a system that adapts LLM-generated text detectors to web pages and aggregates per-page results at the site level to classify LLM-dominant sites.
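The aggregation step can be pictured with a minimal sketch. Everything below is illustrative (the function name, thresholds, and decision rule are assumptions, not DeGenTWeb's actual implementation): a per-page detector emits a score in [0, 1], and a site is labeled LLM-dominant when enough of its sampled pages are flagged.

```python
# Illustrative sketch of site-level aggregation, not DeGenTWeb's real API:
# a page-level detector scores each sampled page in [0, 1]; the site is
# labeled LLM-dominant when the flagged fraction crosses a site threshold.

def classify_site(page_scores, page_threshold=0.9, site_fraction=0.5):
    """Label a site LLM-dominant if at least `site_fraction` of its
    sampled pages score at or above `page_threshold`."""
    if not page_scores:
        return False  # no evidence, default to human-authored
    flagged = sum(1 for s in page_scores if s >= page_threshold)
    return flagged / len(page_scores) >= site_fraction
```

The appeal of aggregating is that independent per-page errors tend to wash out at the site level, which is presumably what lets a mediocre page detector still support an accurate site label.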
If this is right
- LLM-generated content constitutes a large and increasing fraction of the web as captured in public crawls.
- Search engine results contain a substantial and growing number of LLM-dominant sites.
- Detection accuracy for such sites will decline further as newer LLMs improve at evading detectors.
- Continued reliance on web data for AI training will incorporate more synthetic content over time.
Where Pith is reading between the lines
- Search engines may need to develop labeling or filtering methods for LLM-dominant results to maintain user trust.
- If the trend continues, future web-trained models could face degraded performance from ingesting mostly synthetic data.
- Independent manual sampling of classified sites could serve as a low-cost check on the automated method's error rates.
Load-bearing premise
Detectors of LLM-generated text, after adaptation to web pages and site-level aggregation, can reliably separate LLM-dominant sites from human-authored ones, even though the underlying detectors perform much worse than advertised when tuned to minimize false positives.
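The "minimizing false positives" constraint is essentially a threshold-calibration problem. A hedged sketch, assuming access to detector scores on a known human-authored calibration set (the function and its quantile rule are illustrative, not the paper's procedure):

```python
# Illustrative only: choose the page-level threshold as a high quantile of
# detector scores on human-authored calibration pages, so that at most a
# `target_fpr` fraction of human pages would be flagged.
import math

def calibrate_threshold(human_scores, target_fpr=0.01):
    """Return a score threshold flagging at most ~`target_fpr` of the
    human-authored calibration pages (scores >= threshold are flagged)."""
    ranked = sorted(human_scores)
    # Index of the (1 - target_fpr) quantile, clamped to a valid position.
    k = min(math.ceil((1 - target_fpr) * len(ranked)), len(ranked) - 1)
    return ranked[k]
```

The premise above then amounts to a claim that a threshold calibrated this way still leaves enough recall on genuinely LLM-dominant sites, which is exactly where the abstract concedes detectors "perform much worse than advertised".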
What would settle it
A manual audit of hundreds of sites labeled LLM-dominant by DeGenTWeb that finds most were primarily human-authored, or a similar audit of sites labeled human-authored that finds most were LLM-dominant.
Figures
Original abstract
Many recent news reports have claimed that content generated by large language models (LLMs) is taking over the web. However, these claims are typically not based on a representative sample of the web and the methodology underlying them is often opaque. Moreover, when aiming to minimize the chances of falsely attributing human-authored content to LLMs, we find that detectors of LLM-generated text perform much worse than advertised. Consequently, we lack an understanding of the true prevalence and characteristics of LLM content on the web. We describe DeGenTWeb which systematically identifies LLM-dominant websites: sites whose content has been generated using LLMs with little human input. We show how to adapt detectors of LLM-generated text for use on web pages, and how to aggregate detection results from multiple pages on a site for accurate site-level categorization. Using DeGenTWeb, we find that LLM-dominant sites are highly prevalent both in data from Common Crawl and in Bing's search results, and that this share is growing over time. We also show that continuing to accurately identify such sites appears challenging given the capabilities of the latest LLMs.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces DeGenTWeb, a framework for identifying LLM-dominant websites (those with content generated primarily by LLMs with little human input). It describes how existing LLM-generated text detectors are adapted for web pages and how per-page detections are aggregated to the site level, and it applies the method to Common Crawl snapshots and Bing search results. The central claims are that LLM-dominant sites are highly prevalent in both datasets and that their share has grown over time, while noting that accurate identification remains challenging with current detectors.
Significance. If the detection pipeline can be shown to achieve reliable precision, the work would supply the first large-scale, systematic measurement of LLM-generated web content, moving beyond anecdotal news reports. The use of representative web corpora (Common Crawl) and search-engine results is a methodological strength that could inform future studies on content provenance and search quality.
Major comments (3)
- [Abstract] The headline claims that LLM-dominant sites are 'highly prevalent' in Common Crawl and Bing results and that 'this share is growing over time' rest on the assumption that the adapted detectors plus site-level aggregation produce sufficiently low false-positive rates. The abstract itself states that detectors 'perform much worse than advertised' when minimizing false positives, yet the manuscript provides no quantitative validation: no precision-recall figures, no error rates on human-labeled site ground truth, and no comparison of site-level decisions against manual annotation. This absence directly undermines the prevalence and growth conclusions.
- [Detector adaptation and aggregation] (likely §3): The paper asserts that adaptation to web pages and multi-page aggregation enable 'accurate site-level categorization.' However, no ablation or sensitivity analysis is shown demonstrating that these steps raise precision above the baseline low-FP failure mode acknowledged in the abstract. Without such evidence (e.g., performance on mixed-content or post-edited pages), false positives could systematically inflate the reported shares.
- [Results] (likely §4): The reported prevalence figures and temporal trends lack accompanying confidence intervals, threshold-sensitivity tests, or error analysis on edge cases such as sites with both human and LLM content. Given the acknowledged detector weaknesses, these omissions make it impossible to assess whether the 'growing' trend is robust or an artifact of changing detector behavior.
Minor comments (2)
- [Abstract] The abstract would be clearer if it briefly quantified the scale of the Common Crawl and Bing samples used.
- [Methodology] Notation for site-level aggregation (e.g., how per-page scores are combined) should be formalized with a short equation or pseudocode for reproducibility.
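One plausible formalization of the requested aggregation rule, stated here as an illustrative assumption rather than the paper's actual definition: given per-page detector scores for a site, threshold each page, then threshold the flagged fraction.

```latex
% Illustrative only: d(p_1), ..., d(p_n) are detector scores for the n
% sampled pages of site S; \tau_page and \tau_site are free thresholds.
\hat{s}(S) = \frac{1}{n} \sum_{i=1}^{n}
  \mathbb{1}\!\left[\, d(p_i) \ge \tau_{\mathrm{page}} \,\right],
\qquad
S \text{ is LLM-dominant} \iff \hat{s}(S) \ge \tau_{\mathrm{site}}.
```

Averaging raw scores instead of indicators is an equally plausible variant; which one the authors use is exactly what the minor comment asks them to pin down.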
Simulated Author's Rebuttal
We thank the referee for their detailed and constructive comments. The feedback correctly identifies areas where additional validation and analysis would strengthen the claims. We respond to each major comment below and will incorporate revisions to address the concerns.
Point-by-point responses
Referee: [Abstract] The headline claims that LLM-dominant sites are 'highly prevalent' in Common Crawl and Bing results and that 'this share is growing over time' rest on the assumption that the adapted detectors plus site-level aggregation produce sufficiently low false-positive rates. The abstract itself states that detectors 'perform much worse than advertised' when minimizing false positives, yet the manuscript provides no quantitative validation—no precision-recall figures, no error rates on human-labeled site ground truth, and no comparison of site-level decisions against manual annotation. This absence directly undermines the prevalence and growth conclusions.
Authors: We agree that the absence of site-level ground-truth validation limits the strength of the prevalence claims. Our estimates use conservative thresholds chosen specifically to minimize false positives, and we report consistent trends across multiple detectors. To directly address this, the revised manuscript will add a validation subsection that manually annotates a random sample of 200 sites (100 from each dataset) to estimate precision and characterize false-positive cases. revision: yes
Referee: [Detector adaptation and aggregation] The paper asserts that adaptation to web pages and multi-page aggregation enable 'accurate site-level categorization.' However, no ablation or sensitivity analysis is shown demonstrating that these steps raise precision above the baseline low-FP failure mode acknowledged in the abstract. Without such evidence (e.g., performance on mixed-content or post-edited pages), false positives could systematically inflate the reported shares.
Authors: The adaptation filters non-text content and aggregates page-level scores via averaging or majority vote across sampled pages. While the original submission did not include explicit ablations, we will add sensitivity analyses that vary the number of pages per site and the aggregation threshold, plus evaluation on a set of known mixed-content sites, to quantify the improvement over single-page detection. revision: yes
Referee: [Results] The reported prevalence figures and temporal trends lack accompanying confidence intervals, threshold-sensitivity tests, or error analysis on edge cases such as sites with both human and LLM content. Given the acknowledged detector weaknesses, these omissions make it impossible to assess whether the 'growing' trend is robust or an artifact of changing detector behavior.
Authors: We will augment the results section with bootstrap confidence intervals for all prevalence estimates and add threshold-sensitivity plots. We will also include a qualitative discussion of mixed-content sites, noting that our conservative detection strategy tends to classify borderline cases as non-LLM-dominant. revision: yes
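The promised bootstrap intervals are straightforward to sketch. A minimal illustration (the function name and parameters are assumptions, not the authors' code): resample the binary site labels with replacement and take percentiles of the resampled prevalence.

```python
# Illustrative percentile-bootstrap CI for the share of LLM-dominant sites;
# parameters and naming are assumptions, not taken from the manuscript.
import random

def bootstrap_prevalence_ci(labels, n_boot=2000, alpha=0.05, seed=0):
    """95% percentile bootstrap CI for the prevalence of LLM-dominant
    sites; `labels` is a list of 0/1 site labels (1 = LLM-dominant)."""
    rng = random.Random(seed)  # fixed seed for reproducibility
    n = len(labels)
    estimates = sorted(
        sum(rng.choices(labels, k=n)) / n for _ in range(n_boot)
    )
    lo = estimates[int((alpha / 2) * n_boot)]
    hi = estimates[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi
```

Note that such intervals only capture sampling error; they say nothing about the detector's systematic false-positive bias, which is the referee's central concern.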
Circularity Check
No circularity: empirical measurement study using external detectors and data
Full rationale
The paper describes an empirical methodology (DeGenTWeb) that adapts off-the-shelf LLM-generated text detectors for web pages and aggregates per-page results to site-level labels, then applies the resulting classifier to independent external corpora (Common Crawl snapshots and Bing search results) to report prevalence and temporal trends. No equations, fitted parameters, or self-referential derivations are present; the central claims are direct measurements on public data rather than predictions derived from the method's own outputs or prior self-citations. The approach is therefore self-contained against external benchmarks and does not reduce to any of the enumerated circular patterns.
Axiom & Free-Parameter Ledger
Axioms (1)
- Domain assumption: Existing LLM-generated text detectors can be adapted to web pages and aggregated to produce reliable site-level labels for LLM-dominant content.