How much of an LLM-generated clinical corpus is actually new? A production-scale measurement of content redundancy for provenance classification

Ali H. Lazem; William J. Teahan

arxiv: 2606.29605 · v1 · pith:WYWMMVGLnew · submitted 2026-06-28 · 💻 cs.CL


Pith reviewed 2026-06-30 07:06 UTC · model grok-4.3

classification 💻 cs.CL

keywords: LLM-generated corpus, content redundancy, clinical NLP, provenance classification, data deduplication, token efficiency, clinical extraction pipeline


The pith

Provenance classification shows an LLM clinical extraction pipeline produces only 10.9 percent trainable-unique content amid 79.4 percent redundancy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines whether the reported scale of an LLM-generated clinical corpus reflects actual new information available for training. It introduces a token-level breakdown across 2.51 billion tokens from a multi-agent pipeline applied to 167,034 patient narratives and finds that most output repeats source material or duplicates across records. Removing this redundancy before adaptation improves a clinical encoder on external disease-recognition benchmarks at fixed token budget. The level of redundancy varies sharply by output channel rather than being uniform across LLM extraction.

Core claim

Only 10.9 percent of the output is trainable-unique content while 79.4 percent is redundant, so raw token count overstates information content by roughly ninefold. The redundancy arises through two mechanisms: verbatim copying of source context into per-item fields and duplication of generated text across records. Only the former is losslessly removable. An independent compression analysis recovers the same mechanisms, one pipeline channel shows almost no redundancy, and the redundancy skews the token-level training distribution. De-duplicating the corpus before adaptation improves a clinical encoder on external disease-recognition benchmarks at equal token budget, across adaptation depths and on a second benchmark.

What carries the argument

Provenance-based Redundancy Decomposition, a token-level classification of every generated token by its originating source.
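
As a rough illustration of what a classification by source could look like, the sketch below labels each generated field as a verbatim copy of the source narrative, a repeat of text already emitted for an earlier record, or unique, and then counts whitespace tokens per label. The record layout (`source`, `fields`), the label names, and the exact-match rule are all assumptions made for this example; the paper's released tool may define these categories differently.

```python
from collections import Counter

def classify_fields(records):
    """Count whitespace tokens per provenance label.

    records: list of dicts with hypothetical keys
      "source": the patient narrative (str)
      "fields": list of generated field strings
    """
    seen = set()        # generated texts emitted for earlier records
    counts = Counter()
    for rec in records:
        emitted_here = []
        for text in rec["fields"]:
            n_tokens = len(text.split())
            if text in rec["source"]:
                label = "context_copy"      # verbatim span of the source
            elif text in seen:
                label = "cross_record_dup"  # same text seen in an earlier record
            else:
                label = "unique"
            counts[label] += n_tokens
            emitted_here.append(text)
        # Only later records may count this record's text as a duplicate.
        seen.update(emitted_here)
    return counts

# Hypothetical toy records, not taken from the corpus.
records = [
    {"source": "Patient has fever and cough for three days.",
     "fields": ["fever and cough", "Recommend rest and fluids."]},
    {"source": "Patient reports chest pain on exertion.",
     "fields": ["chest pain on exertion", "Recommend rest and fluids.",
                "Possible stable angina."]},
]
counts = classify_fields(records)
```

In this toy input the repeated recommendation is counted as unique the first time it appears and as a cross-record duplicate the second time, which is one possible convention among several.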

If this is right

Raw token counts overstate usable information by a factor of roughly nine.
De-duplication before adaptation improves downstream encoder performance on disease-recognition benchmarks at equal token budget.
Redundancy skews the training distribution toward longer and more complex presentations.
Redundancy level depends on channel structure, with at least one channel nearly free of it.
Lossless compression independently confirms the two redundancy mechanisms without using provenance labels.
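
The roughly ninefold figure follows from the reported fractions alone. The short calculation below uses only numbers stated in the abstract (10.9 percent, 79.4 percent, 2.51 billion tokens); the variable names are illustrative.

```python
unique = 0.109         # trainable-unique fraction reported in the paper
redundant = 0.794      # redundant fraction reported in the paper
total_tokens = 2.51e9  # generated tokens reported in the paper

overstatement = 1.0 / unique          # raw tokens per trainable-unique token
remainder = 1.0 - unique - redundant  # share that is neither unique nor redundant
unique_tokens = unique * total_tokens

print(f"overstatement factor: {overstatement:.2f}")                   # 9.17
print(f"remaining share: {remainder:.3f}")                            # 0.097
print(f"trainable-unique tokens: {unique_tokens / 1e9:.2f} billion")  # 0.27 billion
```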

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The released classification tool could be applied to audit redundancy in LLM corpora generated by other multi-agent pipelines.
Uncorrected redundancy may systematically bias models toward over-represented complex cases in any domain that reuses LLM output at scale.
Channel design choices in extraction pipelines offer a direct lever for reducing redundancy before training.

Load-bearing premise

The provenance labels correctly separate all sources of content without systematic mislabeling of copied or duplicated material.

What would settle it

Apply the identical pipeline and classification to a fresh collection of patient narratives and measure whether the trainable-unique fraction stays near 10.9 percent.

Original abstract

Clinical machine learning increasingly relies on training corpora generated by large language models (LLMs) rather than annotated by clinicians, and such corpora are described and reused largely on the basis of their reported scale. We test whether volume reflects information content. Analysing the complete output of a multi-agent clinical extraction pipeline applied to 167,034 patient narratives, 2.51 billion generated tokens across the ten text-bearing channels of an eleven-channel pipeline, we introduce Provenance-based Redundancy Decomposition, a token-level classification of the entire output by source. Only 10.9% of the output is trainable-unique content while 79.4% is redundant; raw token count overstates information content by roughly ninefold. The redundancy arises through two distinct mechanisms, verbatim copying of source context into per-item fields, and duplication of generated text across records, of which only the former is losslessly removable. An independent, model-free analysis based on lossless compression confirms the redundancy, recovering the two mechanisms without reference to the provenance labels. One pipeline channel carries almost no redundancy, showing that the level of redundancy depends on how each channel is structured rather than being a fixed property of LLM extraction. Because uncorrected redundancy up-weights the longer, more complex presentations that generate the most items, it skews the token-level training distribution of the corpus, a property we measure directly. In a controlled downstream test, de-duplicating the corpus before adaptation improved a clinical encoder on external disease-recognition benchmarks at equal token budget, robustly across adaptation depths and replicated on a second benchmark, confirming that the redundancy carries a measurable cost beyond storage. The classification tool is released openly.
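
The skew described in the abstract can be seen in a toy calculation. If the source narrative is copied into every generated item, a patient with more items contributes disproportionately more tokens. The sketch below compares each patient's share of corpus tokens with and without context copying; the patient tuples and the `token_share` helper are hypothetical and do not reproduce the paper's measurement.

```python
def token_share(patients, copy_context):
    """Share of corpus tokens contributed by each patient.

    patients: list of (narrative_tokens, n_items, tokens_per_item),
              all hypothetical numbers
    copy_context: if True, the narrative is repeated once per item;
                  if False, it is stored once
    """
    totals = []
    for narrative, n_items, per_item in patients:
        copies = n_items if copy_context else 1
        totals.append(narrative * copies + n_items * per_item)
    grand_total = sum(totals)
    return [t / grand_total for t in totals]

# One simple and one complex presentation (illustrative values).
patients = [(200, 10, 20), (400, 40, 20)]
raw = token_share(patients, copy_context=True)
dedup = token_share(patients, copy_context=False)
```

With these numbers the complex patient holds 75 percent of tokens when the narrative is stored once and about 88 percent when it is copied per item, which is the direction of the skew the abstract reports.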

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper shows only 10.9% of a 2.51B-token LLM clinical corpus is trainable-unique, with provenance labels and compression both flagging the same redundancy patterns, and dedup improving downstream results at fixed budget.


The core finding is that raw token counts in this LLM-generated clinical corpus overstate usable content by about nine times. Provenance decomposition on the full output tags 79.4% as redundant, split between verbatim source copying and cross-record duplication, with only the first cleanly removable. An independent compression check recovers the same split without using the labels, and a controlled adaptation experiment at equal token count shows measurable gains on external disease benchmarks after dedup, replicated across adaptation depths and on a second benchmark.
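
To make the equal-token-count comparison concrete, the sketch below draws two training samples under the same whitespace-token budget, one from a raw pool and one from a de-duplicated pool. Exact-match de-duplication and the `sample_to_budget` helper are stand-ins chosen for illustration; the paper's actual de-duplication and sampling procedure is not described in this summary.

```python
import random

def dedup_exact(texts):
    """Drop exact repeats, keeping first occurrences in order."""
    return list(dict.fromkeys(texts))

def sample_to_budget(texts, budget, seed=0):
    """Shuffle, then take texts while the token budget is not exceeded."""
    rng = random.Random(seed)
    pool = list(texts)
    rng.shuffle(pool)
    chosen, used = [], 0
    for text in pool:
        n_tokens = len(text.split())
        if used + n_tokens > budget:
            continue  # skip texts that would overshoot the budget
        chosen.append(text)
        used += n_tokens
    return chosen, used

# Hypothetical toy pool: one heavily repeated text plus four distinct ones.
corpus = ["recommend rest and fluids"] * 8 + [
    "possible stable angina",
    "start low dose aspirin",
    "order chest x ray",
    "monitor blood pressure daily",
]
budget = 12
raw_sample, raw_used = sample_to_budget(corpus, budget)
dedup_sample, dedup_used = sample_to_budget(dedup_exact(corpus), budget)
```

Both samples respect the same budget, but the de-duplicated sample never spends tokens on a text it has already included, so it covers at least as many distinct texts as the raw sample.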

The work is strongest on the measurement side. Running the decomposition at production scale, releasing the tool, and adding the channel-level variation (one channel nearly clean) gives a practical handle on how pipeline design affects redundancy. The dual validation and fixed-budget test keep the claim grounded rather than just descriptive.

The main limitation is scope. Everything traces to one multi-agent extraction pipeline on 167k narratives, so the exact 10.9% figure and the two mechanisms are tied to that setup. Whether the same ratios appear in other generation methods or domains is left open. The abstract is unusually quantitative, but full definitions of trainable-unique and exclusion rules would help readers replicate the split.

This is useful for anyone building or auditing synthetic clinical corpora. It gives a concrete method and evidence that redundancy has a downstream cost, not just a storage one. The paper is coherent on its own terms and deserves a serious referee.

Referee Report

0 major / 3 minor

Summary. The manuscript analyzes the complete output of a multi-agent clinical extraction pipeline applied to 167,034 patient narratives (2.51 billion tokens across ten text-bearing channels). It introduces Provenance-based Redundancy Decomposition, a token-level classification by source, reporting that only 10.9% of the output is trainable-unique while 79.4% is redundant (via verbatim copying of source context and cross-record duplication). An independent model-free lossless compression analysis recovers the same mechanisms without using provenance labels. One channel shows near-zero redundancy. De-duplication before adaptation improves a clinical encoder on external disease-recognition benchmarks at fixed token budget, with results replicated across adaptation depths and a second benchmark. The classification tool is released openly.

Significance. If the central measurements hold, the work shows that raw token counts in LLM-generated clinical corpora can overstate information content by roughly ninefold and that uncorrected redundancy skews training distributions in measurable ways. The dual confirmation via provenance labels and label-free compression, plus the controlled downstream experiment at equal token budget, provide external grounding. The open release of the classification tool is a concrete reproducibility contribution.

minor comments (3)

[§3] The precise operational definition of 'trainable-unique' (distinct from the provenance categories) should be stated explicitly in §3 or §4 with an equation or pseudocode, as the abstract alone leaves room for ambiguity on edge cases such as partial overlaps.
[Results] The table or figure reporting per-channel redundancy percentages (mentioned for the near-zero channel) would benefit from an additional column showing token counts per channel, so that readers can assess the contribution of low-redundancy channels to the overall 10.9% figure.
[Downstream evaluation] The downstream experiment description should include the exact token budget used and the number of random seeds for the adaptation runs, even if results are described as robust.
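
The second comment matters because a corpus-level fraction is a token-weighted average of the per-channel fractions, so a small channel with a high unique share moves the total very little. A minimal sketch, with hypothetical channel names and numbers that are not taken from the paper:

```python
def overall_fraction(channels):
    """Token-weighted average of a per-channel fraction.

    channels: dict mapping channel name -> (token_count, fraction)
    """
    total = sum(tokens for tokens, _ in channels.values())
    weighted = sum(tokens * frac for tokens, frac in channels.values())
    return weighted / total

# Hypothetical values: a large mostly-redundant channel and a small clean one.
channels = {
    "channel_a": (900, 0.05),  # 5% trainable-unique
    "channel_b": (100, 0.90),  # 90% trainable-unique
}
overall = overall_fraction(channels)
unweighted = sum(frac for _, frac in channels.values()) / len(channels)
```

Here the token-weighted figure is 0.135 while the unweighted mean of the two channel fractions is 0.475, which is why per-channel token counts are needed to interpret a corpus-level number.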

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the detailed summary of the manuscript, the positive assessment of its significance, and the recommendation of minor revision. No specific major comments appear in the report, so there are no individual points requiring point-by-point rebuttal or revision at this stage.

Circularity Check

0 steps flagged

No significant circularity; derivation self-contained


The paper defines Provenance-based Redundancy Decomposition directly from the pipeline's channel structure and source tracking on the generated output. This is not self-definitional because the classification is a literal accounting of token origins rather than a fit. An independent lossless compression analysis recovers the same two redundancy mechanisms (verbatim copying and cross-record duplication) without any reference to provenance labels. The downstream controlled experiment measures performance lift on external benchmarks at fixed token budget after de-duplication, providing external falsifiability. No self-citation is load-bearing, no fitted parameter is relabeled as a prediction, and no uniqueness theorem or ansatz is imported. The result is therefore not equivalent to its inputs by construction.
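
A label-free compression check of this kind can be sketched with a general-purpose compressor. The example below uses Python's `zlib` (DEFLATE) as a stand-in, since this summary does not say which compressor the paper uses. It measures how many extra bytes a field costs once the source narrative has been seen, and how much a set of records shrinks when compressed jointly rather than separately. All texts are invented toy data.

```python
import zlib

def compressed_size(text):
    """Bytes needed by zlib (DEFLATE) at the highest compression level."""
    return len(zlib.compress(text.encode("utf-8"), 9))

def conditional_size(text, context):
    """Extra compressed bytes for `text` once `context` has been seen."""
    return compressed_size(context + "\n" + text) - compressed_size(context)

# Hypothetical toy data, not taken from the corpus.
source = ("The patient was admitted with fever, productive cough and "
          "shortness of breath lasting three days.")
copied = "fever, productive cough and shortness of breath"
novel = "Consider outpatient spirometry once symptoms resolve."

# Context copying: a verbatim span is nearly free given the source.
copy_cost = conditional_size(copied, source)
novel_cost = conditional_size(novel, source)

# Cross-record duplication: repeated text compresses away when pooled.
records = ["Recommend rest, fluids and follow-up in one week."] * 20
separate = sum(compressed_size(r) for r in records)
joint = compressed_size("\n".join(records))
```

One caveat of this stand-in is that DEFLATE only finds repeats within a 32 KB window, so it would miss duplication between records that sit far apart in a large corpus.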

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central measurement rests on the assumption that provenance labels provide an exhaustive and accurate partitioning of token sources. No free parameters or invented entities are described. The compression confirmation is model-free and therefore adds independent grounding.

axioms (1)

domain assumption: The ten text-bearing channels of the eleven-channel multi-agent pipeline produce text whose sources can be exhaustively tracked by provenance labels.
Invoked in the definition of Provenance-based Redundancy Decomposition.

pith-pipeline@v0.9.1-grok · 5840 in / 1340 out tokens · 48903 ms · 2026-06-30T07:06:33.760195+00:00 · methodology


