pith. machine review for the scientific record.

arxiv: 2604.23412 · v1 · submitted 2026-04-25 · 💻 cs.CL


Overcoming Copyright Barriers in Corpus Distribution Through Non-Reversible Hashing

Arthur Amalvy, Vincent Labatut, Xavier Bost, Hen-Hsen Huang


Pith reviewed 2026-05-08 08:19 UTC · model grok-4.3

classification 💻 cs.CL
keywords copyrighted corpora · non-reversible hashing · token alignment · NLP dataset sharing · literary texts · corpus distribution

The pith

Non-reversible hashing of tokens enables public sharing of annotations on copyrighted novels.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Annotated corpora are essential for NLP, but those containing copyrighted material cannot be freely exchanged. The paper shows how to share the annotations openly by also publishing a non-reversible hash of each token from the source text. A user who already owns a copy of the material applies the identical hash function to their own tokens and thereby matches them to the shared annotations. Experiments across different editions of novels confirm that 98.7 to 99.79 percent of tokens align correctly when the versions are reasonably close. The approach therefore removes the legal barrier to distributing useful labeled datasets while respecting copyright restrictions.

Core claim

By applying a non-reversible hash function to the tokens of a copyrighted text, annotations can be released publicly; any user who possesses their own copy of the text can hash its tokens and recover the matching annotations with alignment accuracy between 98.7 and 99.79 percent, provided their edition is sufficiently similar to the one used by the corpus creator.

What carries the argument

Non-reversible token hashing that produces fixed identifiers for each word or token, allowing exact matching to annotations without exposing the original text content.
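The mechanism can be sketched in a few lines (a minimal illustration using Python's standard hashlib; the paper's novelshare library defines its own hash function and tokenization, so everything below is an assumption about the shape of the scheme, not its actual API):

```python
import hashlib

def token_hash(token: str) -> str:
    # Deterministic digest per token: identical tokens on the creator's
    # and the user's side yield identical identifiers.
    return hashlib.sha256(token.encode("utf-8")).hexdigest()[:16]

# Corpus creator: publish token hashes plus annotations, never the text.
creator_tokens = ["Call", "me", "Ishmael", "."]
annotations = ["O", "O", "B-PER", "O"]  # e.g. NER labels
shared = [(token_hash(t), lab) for t, lab in zip(creator_tokens, annotations)]

# Corpus user: owns the text, tokenizes and hashes it the same way.
user_tokens = ["Call", "me", "Ishmael", "."]
user_hashes = [token_hash(t) for t in user_tokens]

# With identical editions the hashes line up one-to-one.
recovered = [lab for (h, lab), uh in zip(shared, user_hashes) if h == uh]
print(recovered)  # ['O', 'O', 'B-PER', 'O']
```

Real editions differ, which is where the paper's alignment strategies come in; the point here is only that the annotations travel publicly while the text itself never does.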

If this is right

  • Creators can release and exchange annotated corpora that contain copyrighted literary material without legal violation.
  • Users need only a legal personal copy of the source text to locate and apply the shared annotations.
  • Alignment accuracy stays high across minor textual differences such as those found in different published editions.
  • A ready-to-use Python implementation is provided so the method can be applied immediately to new corpora.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same hashing principle could be tested on other media such as scripts or articles if comparable token or segment identifiers can be defined.
  • Wider use would increase the pool of openly available annotated data that reflects the full range of published language.

Load-bearing premise

A user's owned copy of the text must be close enough to the creator's version that hashing produces the same token identifiers at high accuracy, and the chosen hash function must remain non-reversible in practice.
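A rough way to picture the robustness requirement (a sketch only; the paper's pipeline adds dedicated recovery strategies such as propagate and retokenize): treat the two hash sequences as symbol strings and align them with Python's difflib, whose SequenceMatcher implements the same Ratcliff/Obershelp "gestalt" matching the paper builds on:

```python
import difflib
import hashlib

def token_hash(tok: str) -> str:
    return hashlib.sha256(tok.encode("utf-8")).hexdigest()[:16]

# Two editions of the same sentence; the user's edition drops the comma.
creator = "It is a truth universally acknowledged , that a single man".split()
user = "It is a truth universally acknowledged that a single man".split()

h_creator = [token_hash(t) for t in creator]
h_user = [token_hash(t) for t in user]

# Matching blocks map user-token positions to creator-token positions,
# i.e. they decide which shared annotation each user token receives.
matcher = difflib.SequenceMatcher(a=h_creator, b=h_user, autojunk=False)
aligned = sum(size for _, _, size in matcher.get_matching_blocks())
print(f"aligned {aligned} of {len(h_user)} user tokens")  # all 10 align
```

The further apart the editions drift, the smaller the matching blocks become, which is exactly the regime where the reported 98.7–99.79 percent figures stop applying.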

What would settle it

An experiment on two close editions of the same novel where token alignment accuracy falls substantially below 98 percent, or a successful reversal of the hash function that reconstructs original token strings from the published hashes.

Figures

Figures reproduced from arXiv: 2604.23412 by Arthur Amalvy, Hen-Hsen Huang, Vincent Labatut, Xavier Bost.

Figure 1. Proposed hashing scheme. Top part: the corpus creator's plain text …
Figure 2. An example of applying the propagate strategy to retrieve a token missing on the user's side, by leveraging the fact that the same hash value (37) is matched in another location. Note that the creator's plain text X is not available at the alignment stage.
Figure 3. An example of applying the retokenize strategy on a substitution case. The corpus creator and user perform a different tokenization, resulting in tokens "runner" and "up" vs. "runner-up", respectively. By testing all possible splits of the user's token, the strategy is able to recover from this error.
Figure 4. Mean number of hash collisions per token.
Figure 6. Percentage of misaligned tokens depending on …
Figure 7. Percentage of alignment errors as a function of the ratio of synthetic errors added.
Figure 8. Duration of alignment depending on the strat…
Figure 9. Percentage of alignment errors as a function of the ratio of synthetic errors added to the CoNLL 2003 …
Figure 10. Percentage of alignment errors as a function of the ratio of synthetic errors added to the WNUT 2017 …
Figure 11. Percentage of alignment errors as a function of synthetic errors added to the CoNLL 2003 and WNUT …
Figure 12. Percentage of errors using our mlm alignment strategy, depending on the context window size and user edition.
Figure 13. Number of false positives using different …
Original abstract

While annotated corpora are crucial in the field of natural language processing (NLP), those containing copyrighted material are difficult to exchange among researchers. Yet, such corpora are necessary to fully represent the diversity of data found in the wild in the context of NLP tasks. We tackle this issue by proposing a method to lawfully and publicly share the annotations of copyrighted literary texts. The corpus creator shares the annotations in clear, along with a non-reversible hashed version of the source material. The corpus user must own the source material, and apply the same hash function to their own tokens, in order to match them to the shared annotations. Crucially, our method is robust to reasonable divergences in the version of the copyrighted data owned by the user. As an illustration, we present alignment experiments on different editions of novels. Our results show that our method is able to correctly align 98.7 to 99.79% of tokens depending on the novel, provided the user version is sufficiently close to the corpus creator's version. We publicly release novelshare, a Python implementation of our method.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes a method for distributing annotated corpora containing copyrighted literary texts by publicly sharing the annotations in plaintext alongside a non-reversible hashed version of the tokenized source material. Corpus users who already own a copy of the text apply the same public hash function to their own tokens and align them to the shared annotations. The approach is claimed to be robust to minor version differences between editions. Experiments on multiple editions of novels report token alignment accuracies ranging from 98.7% to 99.79%, and the authors release a Python implementation called novelshare.

Significance. If the non-reversibility claim were technically sound, the method would address a genuine practical barrier to sharing richly annotated NLP datasets that include copyrighted material. The reported alignment rates on real novels and the public code release would constitute useful engineering contributions for dataset creators. However, the central premise that a public deterministic per-token hash yields a non-reversible representation does not hold under standard cryptographic and information-theoretic analysis, which substantially reduces the practical and legal significance of the work.

major comments (2)
  1. [Abstract] Abstract and method description: the claim that the shared output is a 'non-reversible hashed version' is incorrect. Because the hash function must be public (so that users can apply it to their own tokens) and is applied deterministically to each token, an adversary can build a complete reverse mapping by hashing every token in the tokenizer vocabulary (or all common words) once and then looking up each hash in the shared sequence. This recovers the original token sequence without any access to the user's copy, directly contradicting the non-reversibility premise on which the lawful-distribution argument rests.
  2. [Abstract] Abstract and § on hash function (method section): no specific hash function is named, no security analysis or collision-resistance argument is provided, and no discussion addresses the bounded input space of a tokenizer vocabulary. These omissions are load-bearing because the entire distribution scheme depends on the hash remaining non-invertible in practice.
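Major comment 1's precomputation attack is easy to sketch (hypothetical toy vocabulary and hash; the point is that any public, deterministic, per-token hash admits this table-lookup inversion):

```python
import hashlib

def token_hash(tok: str) -> str:
    # Stands in for whichever public hash function the creator used.
    return hashlib.sha256(tok.encode("utf-8")).hexdigest()[:16]

# Published artifact: the hash sequence, no plaintext.
secret_tokens = ["Call", "me", "Ishmael"]
published = [token_hash(t) for t in secret_tokens]

# Adversary: hash every word in a public vocabulary once...
vocabulary = ["the", "Call", "me", "Ishmael", "whale", "ship"]
reverse_map = {token_hash(w): w for w in vocabulary}

# ...then invert the published sequence by lookup, without owning the book.
recovered = [reverse_map.get(h, "<unk>") for h in published]
print(recovered)  # ['Call', 'me', 'Ishmael'] -- the token sequence is back
```

With a realistic vocabulary of a few hundred thousand word forms, the table is built in seconds, which is why the bounded input space is load-bearing for the referee's objection.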
minor comments (1)
  1. [Experiments] The alignment experiments report concrete percentages but do not specify the exact novels, tokenizers, or divergence metrics used; adding these details would improve reproducibility even if the security issue is addressed.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thorough review and for identifying key issues with the description of our method. We agree that the original claims regarding non-reversibility were overstated and that the manuscript lacks necessary implementation details. We will revise the paper to correct these points while preserving the practical contribution of the alignment technique for users who own the source texts.

Point-by-point responses
  1. Referee: [Abstract] Abstract and method description: the claim that the shared output is a 'non-reversible hashed version' is incorrect. Because the hash function must be public (so that users can apply it to their own tokens) and is applied deterministically to each token, an adversary can build a complete reverse mapping by hashing every token in the tokenizer vocabulary (or all common words) once and then looking up each hash in the shared sequence. This recovers the original token sequence without any access to the user's copy, directly contradicting the non-reversibility premise on which the lawful-distribution argument rests.

    Authors: We acknowledge the validity of this observation. A public deterministic hash over a finite vocabulary does permit precomputation of a reverse lookup table, allowing recovery of the token sequence from the shared hashes alone. This undermines the original phrasing of 'non-reversible' and the associated legal-distribution argument. In the revised manuscript we will eliminate all references to non-reversibility, describe the shared data simply as a hashed token sequence for alignment, and add an explicit limitations paragraph noting that the approach relies on users possessing their own copies of the source material rather than on cryptographic protection. We will also discuss the practical implications for dataset sharing. revision: yes

  2. Referee: [Abstract] Abstract and § on hash function (method section): no specific hash function is named, no security analysis or collision-resistance argument is provided, and no discussion addresses the bounded input space of a tokenizer vocabulary. These omissions are load-bearing because the entire distribution scheme depends on the hash remaining non-invertible in practice.

    Authors: We agree that the manuscript should have provided these details. The revised version will specify a concrete hash function (MurmurHash3 64-bit for alignment efficiency), include implementation pseudocode, and add a dedicated subsection on design choices. This subsection will address the bounded vocabulary explicitly, explain that inversion is feasible in principle, and clarify that the method's utility is intended for compliant users who already own the text rather than for adversarial resistance. We will remove any implication that the hash provides security guarantees. revision: yes
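The rebuttal's proposed 64-bit hash could look like the following (a sketch: MurmurHash3 lives in the third-party mmh3 package, so the standard library's blake2b truncated to 8 bytes stands in here, and the name hash64 is illustrative):

```python
import hashlib

def hash64(token: str) -> int:
    # Fast, deterministic 64-bit identifier per token -- enough for
    # corpus-scale alignment, with no pretense of cryptographic strength.
    digest = hashlib.blake2b(token.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")

tokens = ["runner", "up", "runner-up"]
hashes = [hash64(t) for t in tokens]
assert len(set(hashes)) == len(tokens)  # distinct tokens, distinct ids
assert hash64("runner") == hashes[0]    # stable across runs and machines
```

A 64-bit space keeps accidental collisions rare at novel scale while staying cheap to compute; as the rebuttal concedes, it does nothing against the dictionary inversion raised by the referee.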

Circularity Check

0 steps flagged

No circularity: practical method with experimental validation, no derivations or self-referential steps

Full rationale

The paper describes a hashing-based technique for sharing annotations on copyrighted texts, relying on users owning the source material and applying the same public hash function for alignment. Alignment accuracy is reported from direct experiments on novel editions (98.7–99.79% token match rates), not from any fitted parameters or predictions derived by construction. No equations, uniqueness theorems, ansatzes, or self-citations appear as load-bearing elements in the provided text. The non-reversibility claim is presented as a property of the chosen hash function rather than derived from prior results within the paper itself. This is a self-contained engineering proposal whose central claims rest on empirical alignment tests, not tautological reductions.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on standard cryptographic assumptions about hash functions and the empirical observation that text versions are similar enough for alignment; no free parameters or invented entities are introduced.

axioms (1)
  • standard math Cryptographic hash functions are non-reversible and produce consistent outputs for identical inputs
    Invoked to ensure annotations can be matched without exposing original text.

pith-pipeline@v0.9.0 · 5490 in / 1108 out tokens · 42824 ms · 2026-05-08T08:19:14.717299+00:00 · methodology

