{"record_type":"pith_number_record","schema_url":"https://pith.science/schemas/pith-number/v1.json","pith_number":"pith:2026:GTX6VVQE2MDPUE5RGF7WHHSHUB","short_pith_number":"pith:GTX6VVQE","schema_version":"1.0","canonical_sha256":"34efead604d306fa13b1317f639e47a06ef9b4a556b1c67f2b516cc4946633c1","source":{"kind":"arxiv","id":"2605.13706","version":1},"attestation_state":"computed","paper":{"title":"Identifying AI Web Scrapers Using Canary Tokens","license":"http://creativecommons.org/licenses/by/4.0/","headline":"Dynamic websites can issue unique canary tokens to visiting scrapers so that reproduction of a token in an LLM's output reveals which scraper supplied data to that model.","cross_cats":["cs.AI","cs.CY","cs.NI"],"primary_cat":"cs.CR","authors_text":"Caroline Zhang, Emily Wenger, Enze Liu, Steven Seiden, Taein Kim, Triss Ren","submitted_at":"2026-05-13T15:53:57Z","abstract_excerpt":"From pre-training to query-time augmentation, web-scraped data helps to improve the quality and contextual relevancy of content generated by large language models (LLMs). However, large-scale web scraping to feed LLMs can affect site stability and raise legal, privacy, or ethics concerns. If website owners wish to limit LLM-related web scraping on their site, due to these or other concerns, they may turn to scraper access control mechanisms like the Robots Exclusion Protocol. To be most effective, such mechanisms require site owners to first identify the scrapers that they wish to restrict (e."},"verification_status":{"content_addressed":true,"pith_receipt":true,"author_attested":false,"weak_author_claims":0,"strong_author_claims":0,"externally_anchored":false,"storage_verified":false,"citation_signatures":0,"replication_records":0,"graph_snapshot":true,"references_resolved":true,"formal_links_present":false},"canonical_record":{"source":{"id":"2605.13706","kind":"arxiv","version":1},"metadata":{"license":"http://creativecommons.org/licenses/by/4.0/","primary_cat":"cs.CR","submitted_at":"2026-05-13T15:53:57Z","cross_cats_sorted":["cs.AI","cs.CY","cs.NI"],"title_canon_sha256":"e4e558889f707ef54200c3ee5e57c4b4530b3047ce021a45b1bfd1b046a18cec","abstract_canon_sha256":"787bd3a73541827af7de0a594594d64bfa043c5d0c4872ea1adf3e2e39900eef"},"schema_version":"1.0"},"receipt":{"kind":"pith_receipt","key_id":"pith-v1-2026-05","algorithm":"ed25519","signed_at":"2026-05-18T02:44:16.805915Z","signature_b64":"ZFIpVpUNT7GcRd6Zsk2jtu3uZs3Y4cHNoLGsqbda5PYA6nV2fvYqZsjoLw5dsAvuCT6kU2GBJQPEA9lrN+oDBQ==","signed_message":"canonical_sha256_bytes","builder_version":"pith-number-builder-2026-05-17-v1","receipt_version":"0.3","canonical_sha256":"34efead604d306fa13b1317f639e47a06ef9b4a556b1c67f2b516cc4946633c1","last_reissued_at":"2026-05-18T02:44:16.805441Z","signature_status":"signed_v1","first_computed_at":"2026-05-18T02:44:16.805441Z","public_key_fingerprint":"8d4b5ee74e4693bcd1df2446408b0d54"},"graph_snapshot":{"paper":{"title":"Identifying AI Web Scrapers Using Canary Tokens","license":"http://creativecommons.org/licenses/by/4.0/","headline":"Dynamic websites can issue unique canary tokens to visiting scrapers so that reproduction of a token in an LLM's output reveals which scraper supplied data to that model.","cross_cats":["cs.AI","cs.CY","cs.NI"],"primary_cat":"cs.CR","authors_text":"Caroline Zhang, Emily Wenger, Enze Liu, Steven Seiden, Taein Kim, Triss Ren","submitted_at":"2026-05-13T15:53:57Z","abstract_excerpt":"From pre-training to query-time augmentation, web-scraped data helps to improve the quality and contextual relevancy of content generated by large language models (LLMs). However, large-scale web scraping to feed LLMs can affect site stability and raise legal, privacy, or ethics concerns. If website owners wish to limit LLM-related web scraping on their site, due to these or other concerns, they may turn to scraper access control mechanisms like the Robots Exclusion Protocol. To be most effective, such mechanisms require site owners to first identify the scrapers that they wish to restrict (e."},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"Via experiments across 22 production LLM systems, we demonstrate that our approach can reliably identify which scrapers feed which LLM, including several that are not publicly known or disclosed by the companies.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"That an LLM will reproduce a canary token in its generated output when the token was present in data collected by a scraper that fed the model, without the token being filtered or ignored during training or inference.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"Unique canary tokens served to visiting scrapers can be recovered from LLM outputs to identify which scrapers feed data to which of 22 tested production LLMs.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"Dynamic websites can issue unique canary tokens to visiting scrapers so that reproduction of a token in an LLM's output reveals which scraper supplied data to that model.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"1d4be66eb56d5e8e53957af1983a284ae43ae49b9b5a6ca330290ff3beb88576"},"source":{"id":"2605.13706","kind":"arxiv","version":1},"verdict":{"id":"0c0ea7c7-c456-45a9-8feb-aa50f1c48e7d","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-14T17:52:31.506880Z","strongest_claim":"Via experiments across 22 production LLM systems, we demonstrate that our approach can reliably identify which scrapers feed which LLM, including several that are not publicly known or disclosed by the companies.","one_line_summary":"Unique canary tokens served to visiting scrapers can be recovered from LLM outputs to identify which scrapers feed data to which of 22 tested production LLMs.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"That an LLM will reproduce a canary token in its generated output when the token was present in data collected by a scraper that fed the model, without the token being filtered or ignored during training or inference.","pith_extraction_headline":"Dynamic websites can issue unique canary tokens to visiting scrapers so that reproduction of a token in an LLM's output reveals which scraper supplied data to that model."},"references":{"count":83,"sample":[{"doi":"","year":2025,"title":"The Walt Disney Company v","work_id":"62a4262f-a457-43c7-9eed-139637df9e4f","ref_index":1,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":null,"title":"Liquid AI. n.d.. Liquid Playground. https://playground.liquid.ai/chat","work_id":"65692d91-519a-4247-8dee-8b0cee12e4bd","ref_index":2,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2026,"title":"Mistral AI. 2026. The all new le Chat, Your AI assistant for life and work Mistral AI. https://mistral.ai/news/all-new-le-chat","work_id":"ac291b9d-46bd-43a6-b85a-dca23240e89b","ref_index":3,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":null,"title":"Amazon. n.d.. Web Grounding. https://docs.aws.amazon.com/nova/latest/nova2- userguide/web-grounding.html","work_id":"08c5ae86-7b82-49d5-a830-57c675cdea1f","ref_index":4,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":null,"title":"Baidu. n.d.. ERNIE Bot. https://baike.baidu.com/en/item/ERNIE%20Bot/16840","work_id":"989af553-1cbd-40dc-9d19-141ab08b9030","ref_index":5,"cited_arxiv_id":"","is_internal_anchor":false}],"resolved_work":83,"snapshot_sha256":"a153c28ff80f5aa3d2aae24ec7b1c34d1a7423b7046c9a819bd9e6f19245e97d","internal_anchors":2},"formal_canon":{"evidence_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"},"aliases":[{"alias_kind":"arxiv","alias_value":"2605.13706","created_at":"2026-05-18T02:44:16.805516+00:00"},{"alias_kind":"arxiv_version","alias_value":"2605.13706v1","created_at":"2026-05-18T02:44:16.805516+00:00"},{"alias_kind":"doi","alias_value":"10.48550/arxiv.2605.13706","created_at":"2026-05-18T02:44:16.805516+00:00"},{"alias_kind":"pith_short_12","alias_value":"GTX6VVQE2MDP","created_at":"2026-05-18T12:33:37.589309+00:00"},{"alias_kind":"pith_short_16","alias_value":"GTX6VVQE2MDPUE5R","created_at":"2026-05-18T12:33:37.589309+00:00"},{"alias_kind":"pith_short_8","alias_value":"GTX6VVQE","created_at":"2026-05-18T12:33:37.589309+00:00"}],"events":[],"event_summary":{},"paper_claims":[],"inbound_citations":{"count":0,"internal_anchor_count":0,"sample":[]},"formal_canon":{"evidence_count":0,"sample":[],"anchors":[]},"links":{"html":"https://pith.science/pith/GTX6VVQE2MDPUE5RGF7WHHSHUB","json":"https://pith.science/pith/GTX6VVQE2MDPUE5RGF7WHHSHUB.json","graph_json":"https://pith.science/api/pith-number/GTX6VVQE2MDPUE5RGF7WHHSHUB/graph.json","events_json":"https://pith.science/api/pith-number/GTX6VVQE2MDPUE5RGF7WHHSHUB/events.json","paper":"https://pith.science/paper/GTX6VVQE"},"agent_actions":{"view_html":"https://pith.science/pith/GTX6VVQE2MDPUE5RGF7WHHSHUB","download_json":"https://pith.science/pith/GTX6VVQE2MDPUE5RGF7WHHSHUB.json","view_paper":"https://pith.science/paper/GTX6VVQE","resolve_alias":"https://pith.science/api/pith-number/resolve?arxiv=2605.13706&json=true","fetch_graph":"https://pith.science/api/pith-number/GTX6VVQE2MDPUE5RGF7WHHSHUB/graph.json","fetch_events":"https://pith.science/api/pith-number/GTX6VVQE2MDPUE5RGF7WHHSHUB/events.json","actions":{"anchor_timestamp":"https://pith.science/pith/GTX6VVQE2MDPUE5RGF7WHHSHUB/action/timestamp_anchor","attest_storage":"https://pith.science/pith/GTX6VVQE2MDPUE5RGF7WHHSHUB/action/storage_attestation","attest_author":"https://pith.science/pith/GTX6VVQE2MDPUE5RGF7WHHSHUB/action/author_attestation","sign_citation":"https://pith.science/pith/GTX6VVQE2MDPUE5RGF7WHHSHUB/action/citation_signature","submit_replication":"https://pith.science/pith/GTX6VVQE2MDPUE5RGF7WHHSHUB/action/replication_record"}},"created_at":"2026-05-18T02:44:16.805516+00:00","updated_at":"2026-05-18T02:44:16.805516+00:00"}