Pith Number

pith:GTX6VVQE

pith:2026:GTX6VVQE2MDPUE5RGF7WHHSHUB
not attested · not anchored · not stored · refs resolved

Identifying AI Web Scrapers Using Canary Tokens

Caroline Zhang, Emily Wenger, Enze Liu, Steven Seiden, Taein Kim, Triss Ren

Dynamic websites can issue unique canary tokens to visiting scrapers so that reproduction of a token in an LLM's output reveals which scraper supplied data to that model.

arxiv:2605.13706 v1 · 2026-05-13 · cs.CR · cs.AI · cs.CY · cs.NI

Add to your LaTeX paper
\usepackage{pith}
\pithnumber{GTX6VVQE2MDPUE5RGF7WHHSHUB}

Prints a linked badge after your title and injects PDF metadata. Compiles on arXiv.

Record completeness

1. Bitcoin timestamp
2. Internet Archive
3. Author claim: open
4. Citations: open
5. Replications: open
Portable graph bundle: live
The bundle contains the canonical record plus signed events. A mirror can host it anywhere and recompute the same current state with the deterministic merge algorithm.
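
The page does not describe the merge algorithm itself, so the following is only a minimal sketch of what "deterministic" could mean here: every mirror ends up with the same state no matter in which order it received the events. The event fields (`at`, `field`, `value`), the helper names `event_id` and `merge`, and the sample events are all hypothetical and are not taken from the Pith schema.

```python
import hashlib
import json


def event_id(event: dict) -> str:
    """Content-derived identifier, so duplicate events collapse to one entry."""
    blob = json.dumps(event, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hashlib.sha256(blob).hexdigest()


def merge(*event_sets: list[dict]) -> dict:
    """Union all events, order them canonically, then replay them into a state."""
    unique = {event_id(e): e for events in event_sets for e in events}
    # Sorting by (timestamp, content hash) removes any dependence on arrival order.
    ordered = sorted(unique.values(), key=lambda e: (e["at"], event_id(e)))
    state: dict = {}
    for e in ordered:
        state[e["field"]] = e["value"]  # the later event wins
    return state


# Hypothetical events held by two mirrors, with one event in common.
first = {"at": "2026-05-18T00:00:00Z", "field": "status", "value": "draft"}
second = {"at": "2026-05-19T00:00:00Z", "field": "status", "value": "final"}
mirror_a = [first]
mirror_b = [second, first]

state_ab = merge(mirror_a, mirror_b)
state_ba = merge(mirror_b, mirror_a)
```

Both `state_ab` and `state_ba` come out identical because the events are deduplicated and sorted before being replayed.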

Claims

C1 · strongest claim

Via experiments across 22 production LLM systems, we demonstrate that our approach can reliably identify which scrapers feed which LLM, including several that are not publicly known or disclosed by the companies.

C2 · weakest assumption

That an LLM will reproduce a canary token in its generated output when the token was present in data collected by a scraper that fed the model, without the token being filtered or ignored during training or inference.

C3 · one line summary

Unique canary tokens served to visiting scrapers can be recovered from LLM outputs to identify which scrapers feed data to which of 22 tested production LLMs.
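
The mechanism summarized in these claims can be illustrated with a minimal sketch: a site gives each visiting scraper its own token, remembers who received it, and later searches LLM output for any token it issued. This is not the authors' implementation. The token format, the HMAC derivation, the `SECRET_KEY` value, the scraper identifiers, and the function names `issue_token` and `attribute` are all assumptions made for illustration.

```python
import hashlib
import hmac
import re

SECRET_KEY = b"replace-with-a-private-key"  # hypothetical site-side secret
TOKEN_PATTERN = re.compile(r"canary-[0-9a-f]{16}")  # hypothetical token format


def issue_token(scraper_id: str, registry: dict[str, str]) -> str:
    """Derive a token unique to this scraper and record who received it."""
    digest = hmac.new(SECRET_KEY, scraper_id.encode("utf-8"), hashlib.sha256).hexdigest()
    token = f"canary-{digest[:16]}"
    registry[token] = scraper_id
    return token


def attribute(llm_output: str, registry: dict[str, str]) -> set[str]:
    """Return the scrapers whose tokens appear in a piece of LLM output."""
    return {registry[t] for t in TOKEN_PATTERN.findall(llm_output) if t in registry}


# Hypothetical scraper identifiers, for example taken from the User-Agent header.
registry: dict[str, str] = {}
token_a = issue_token("ExampleBot/1.0", registry)
token_b = issue_token("OtherBot/2.0", registry)

# A model that reproduces token_a points back to the scraper that fetched it.
sources = attribute(f"The page mentions {token_a} near the footer.", registry)
```

The sketch depends on the weakest assumption listed above: if the token is filtered out or never reproduced, `attribute` returns an empty set and nothing can be concluded.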

References

83 extracted · 83 resolved · 2 Pith anchors

Receipt and verification
First computed: 2026-05-18T02:44:16.805441Z
Builder: pith-number-builder-2026-05-17-v1
Signature: Pith Ed25519 (pith-v1-2026-05)
Schema: pith-number/v1.0

Canonical hash

34efead604d306fa13b1317f639e47a06ef9b4a556b1c67f2b516cc4946633c1

Aliases

arxiv: 2605.13706 · arxiv_version: 2605.13706v1 · doi: 10.48550/arxiv.2605.13706 · pith_short_12: GTX6VVQE2MDP · pith_short_16: GTX6VVQE2MDPUE5R · pith_short_8: GTX6VVQE
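
For this record, the identifiers appear to be related to the canonical hash in a simple way: the 26-character Pith Number matches the first 26 characters of the RFC 4648 base32 encoding of the SHA-256 digest, and the short aliases are prefixes of it. The sketch below reproduces that relationship for the values shown on this page. Treating it as the general derivation rule is an assumption, and the function name `pith_aliases` is hypothetical.

```python
import base64

# Canonical hash as listed on this page.
CANONICAL_HASH = "34efead604d306fa13b1317f639e47a06ef9b4a556b1c67f2b516cc4946633c1"


def pith_aliases(hex_digest: str) -> dict[str, str]:
    """Base32-encode the digest and cut it to the lengths used on this page."""
    encoded = base64.b32encode(bytes.fromhex(hex_digest)).decode("ascii")
    full = encoded[:26]  # assumption: the identifier is the first 26 base32 characters
    return {
        "pith_number": full,
        "pith_short_16": full[:16],
        "pith_short_12": full[:12],
        "pith_short_8": full[:8],
    }


aliases = pith_aliases(CANONICAL_HASH)
```

With the hash above, `aliases["pith_number"]` is `GTX6VVQE2MDPUE5RGF7WHHSHUB`, which agrees with the identifier at the top of the record.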
Agent API
Verify this Pith Number yourself
curl -sH 'Accept: application/ld+json' https://pith.science/pith/GTX6VVQE2MDPUE5RGF7WHHSHUB \
  | jq -c '.canonical_record' \
  | python3 -c "import sys,json,hashlib; b=json.dumps(json.loads(sys.stdin.read()), sort_keys=True, separators=(',',':'), ensure_ascii=False).encode(); print(hashlib.sha256(b).hexdigest())"
# expect: 34efead604d306fa13b1317f639e47a06ef9b4a556b1c67f2b516cc4946633c1
Canonical record JSON
{
  "metadata": {
    "abstract_canon_sha256": "787bd3a73541827af7de0a594594d64bfa043c5d0c4872ea1adf3e2e39900eef",
    "cross_cats_sorted": [
      "cs.AI",
      "cs.CY",
      "cs.NI"
    ],
    "license": "http://creativecommons.org/licenses/by/4.0/",
    "primary_cat": "cs.CR",
    "submitted_at": "2026-05-13T15:53:57Z",
    "title_canon_sha256": "e4e558889f707ef54200c3ee5e57c4b4530b3047ce021a45b1bfd1b046a18cec"
  },
  "schema_version": "1.0",
  "source": {
    "id": "2605.13706",
    "kind": "arxiv",
    "version": 1
  }
}
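
The hashing step of the verification command can also be run offline against the record printed above. The sketch below uses the same serialization as the one-liner (sorted keys, compact separators, no ASCII escaping, UTF-8). The function name `canonical_hash` is hypothetical. Whether the digest equals the published canonical hash depends on the service using exactly this serialization, which the one-liner suggests but this sketch does not confirm.

```python
import hashlib
import json


def canonical_hash(record: dict) -> str:
    """SHA-256 over a key-sorted, compact JSON serialization of the record."""
    blob = json.dumps(
        record, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    ).encode("utf-8")
    return hashlib.sha256(blob).hexdigest()


# The canonical record exactly as shown on this page.
record = {
    "metadata": {
        "abstract_canon_sha256": "787bd3a73541827af7de0a594594d64bfa043c5d0c4872ea1adf3e2e39900eef",
        "cross_cats_sorted": ["cs.AI", "cs.CY", "cs.NI"],
        "license": "http://creativecommons.org/licenses/by/4.0/",
        "primary_cat": "cs.CR",
        "submitted_at": "2026-05-13T15:53:57Z",
        "title_canon_sha256": "e4e558889f707ef54200c3ee5e57c4b4530b3047ce021a45b1bfd1b046a18cec",
    },
    "schema_version": "1.0",
    "source": {"id": "2605.13706", "kind": "arxiv", "version": 1},
}

digest = canonical_hash(record)
print(digest)  # compare by eye with the canonical hash listed on this page
```

Because keys are sorted before hashing, the order in which a mirror stores the fields does not change the digest. The order of list elements such as `cross_cats_sorted` still matters.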