How much of an LLM-generated clinical corpus is actually new? A production-scale measurement of content redundancy for provenance classification
Pith reviewed 2026-06-30 07:06 UTC · model grok-4.3
The pith
Provenance classification shows an LLM clinical extraction pipeline produces only 10.9 percent trainable-unique content amid 79.4 percent redundancy.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Only 10.9 percent of the output is trainable-unique content while 79.4 percent is redundant, so raw token count overstates information content by roughly ninefold. The redundancy arises through two mechanisms: verbatim copying of source context into per-item fields, and duplication of generated text across records. Only the former is losslessly removable. An independent compression analysis recovers the same mechanisms, one pipeline channel shows almost no redundancy, and the skew affects the token-level training distribution. De-duplicating the corpus before adaptation improves a clinical encoder on external disease-recognition benchmarks at equal token budget, across adaptation depths and on a second benchmark.
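The "roughly ninefold" figure follows directly from the two percentages quoted above. A small arithmetic check, using only numbers stated in this summary and in the abstract (the variable names are illustrative):

```python
# Percentages quoted in the summary (share of all generated tokens).
trainable_unique_pct = 10.9
redundant_pct = 79.4

# Share that is neither trainable-unique nor redundant.
remainder_pct = 100.0 - trainable_unique_pct - redundant_pct

# Raw tokens per trainable-unique token.
overstatement = 100.0 / trainable_unique_pct

# Applied to the 2.51 billion generated tokens reported in the abstract.
total_tokens = 2.51e9
unique_tokens = total_tokens * trainable_unique_pct / 100.0

print(f"remainder: {remainder_pct:.1f}%")            # 9.7%
print(f"overstatement: {overstatement:.2f}x")        # 9.17x
print(f"trainable-unique: {unique_tokens / 1e9:.3f} billion tokens")  # 0.274
```

So about 0.27 billion of the 2.51 billion tokens would count as trainable-unique, which is where the factor of about nine comes from.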
What carries the argument
Provenance-based Redundancy Decomposition, a token-level classification of every generated token by its originating source.
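To make the idea of classifying tokens by origin concrete, here is a minimal sketch. It is not the paper's released tool: this summary does not give the actual labeling rules, so the 4-gram matching rule, the label names, and the function names are all assumptions chosen for illustration.

```python
from collections import Counter

def ngrams(tokens, n):
    """Set of all contiguous n-token windows in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def classify_field(field_tokens, source_ngrams, seen_fields, n=4):
    """Label each token of one generated field by an assumed origin.

    Assumed rules (illustrative only):
      - a token inside an n-gram found verbatim in the source is "context_copy"
      - otherwise, if the whole field repeats an earlier field, it is
        "cross_record_duplicate"
      - otherwise it is "unique"
    """
    labels = ["unique"] * len(field_tokens)
    for i in range(len(field_tokens) - n + 1):
        if tuple(field_tokens[i:i + n]) in source_ngrams:
            labels[i:i + n] = ["context_copy"] * n
    key = tuple(field_tokens)
    if key in seen_fields:
        labels = [
            lab if lab == "context_copy" else "cross_record_duplicate"
            for lab in labels
        ]
    seen_fields.add(key)
    return labels

# Hypothetical toy data: one source narrative and three generated fields.
source = "patient reports chest pain radiating to the left arm".split()
fields = [
    "finding : chest pain radiating to the left arm".split(),
    "recommend follow up in two weeks".split(),
    "recommend follow up in two weeks".split(),
]

source_ngrams = ngrams(source, 4)
seen = set()
counts = Counter()
for field in fields:
    counts.update(classify_field(field, source_ngrams, seen))

print(dict(counts))
```

On this toy input, 7 tokens are labeled as context copies, 6 as cross-record duplicates, and 8 as unique. A production classifier would need a real tokenizer and a more careful matching rule than fixed-length n-grams.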
If this is right
- Raw token counts overstate usable information by a factor of roughly nine.
- De-duplication before adaptation improves downstream encoder performance on disease-recognition benchmarks at equal token budget.
- Redundancy skews the training distribution toward longer and more complex presentations.
- Redundancy level depends on channel structure, with at least one channel nearly free of it.
- Lossless compression independently confirms the two redundancy mechanisms without using provenance labels.
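The label-free compression check in the last point can be illustrated with a general-purpose compressor. The sketch below uses Python's `zlib` as a stand-in; the summary does not say which compressors the paper uses. The synthetic narrative and the copy count of 20 are hypothetical.

```python
import random
import zlib

def compression_ratio(text: str) -> float:
    """Compressed size divided by raw size; lower means more redundancy."""
    raw = text.encode("utf-8")
    return len(zlib.compress(raw, 9)) / len(raw)

# Hypothetical synthetic "narrative": 120 words drawn from a made-up vocabulary.
rng = random.Random(0)
vocab = [f"w{i}" for i in range(500)]
narrative = " ".join(rng.choice(vocab) for _ in range(120))

# The same narrative stored once versus copied verbatim 20 times.
once = narrative
copied = "\n".join([narrative] * 20)

ratio_once = compression_ratio(once)
ratio_copied = compression_ratio(copied)

print(f"stored once:   ratio {ratio_once:.3f}")
print(f"copied 20x:    ratio {ratio_copied:.3f}")
```

The copied text compresses to a far smaller fraction of its raw size, because the compressor replaces each repeat with back-references. Note that `zlib` only looks back 32 KB, so this toy check detects nearby copies; duplication spread across a large corpus would need a compressor with a longer window.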
Where Pith is reading between the lines
- The released classification tool could be applied to audit redundancy in LLM corpora generated by other multi-agent pipelines.
- Uncorrected redundancy may systematically bias models toward over-represented complex cases in any domain that reuses LLM output at scale.
- Channel design choices in extraction pipelines offer a direct lever for reducing redundancy before training.
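The bias toward complex cases can be seen in a toy calculation. Assume, for illustration only, that the source narrative is copied once per extracted item; the record names, lengths, and item counts below are hypothetical.

```python
def token_share(records, copy_per_item):
    """Share of narrative tokens contributed by each record.

    records maps a name to (narrative_tokens, n_items). With copy_per_item
    the narrative is counted once per item; otherwise it is counted once.
    """
    weights = {}
    for name, (narrative_tokens, n_items) in records.items():
        copies = n_items if copy_per_item else 1
        weights[name] = narrative_tokens * copies
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}

# Hypothetical records: a short, simple case and a long, complex one.
records = {
    "simple": (100, 5),    # 100-token narrative, 5 extracted items
    "complex": (300, 40),  # 300-token narrative, 40 extracted items
}

raw = token_share(records, copy_per_item=True)
dedup = token_share(records, copy_per_item=False)

print(f"complex share with copies:    {raw['complex']:.2f}")    # 0.96
print(f"complex share without copies: {dedup['complex']:.2f}")  # 0.75
```

In this example the complex record already holds 75 percent of the narrative tokens because it is longer, and per-item copying pushes that to 96 percent, since length and item count multiply.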
Load-bearing premise
The provenance labels correctly separate all sources of content without systematic mislabeling of copied or duplicated material.
What would settle it
Apply the identical pipeline and classification to a fresh collection of patient narratives and measure whether the trainable-unique fraction stays near 10.9 percent.
Original abstract
Clinical machine learning increasingly relies on training corpora generated by large language models (LLMs) rather than annotated by clinicians, and such corpora are described and reused largely on the basis of their reported scale. We test whether volume reflects information content. Analysing the complete output of a multi-agent clinical extraction pipeline applied to 167,034 patient narratives, 2.51 billion generated tokens across the ten text-bearing channels of an eleven-channel pipeline, we introduce Provenance-based Redundancy Decomposition, a token-level classification of the entire output by source. Only 10.9% of the output is trainable-unique content while 79.4% is redundant; raw token count overstates information content by roughly ninefold. The redundancy arises through two distinct mechanisms, verbatim copying of source context into per-item fields, and duplication of generated text across records, of which only the former is losslessly removable. An independent, model-free analysis based on lossless compression confirms the redundancy, recovering the two mechanisms without reference to the provenance labels. One pipeline channel carries almost no redundancy, showing that the level of redundancy depends on how each channel is structured rather than being a fixed property of LLM extraction. Because uncorrected redundancy up-weights the longer, more complex presentations that generate the most items, it skews the token-level training distribution of the corpus, a property we measure directly. In a controlled downstream test, de-duplicating the corpus before adaptation improved a clinical encoder on external disease-recognition benchmarks at equal token budget, robustly across adaptation depths and replicated on a second benchmark, confirming that the redundancy carries a measurable cost beyond storage. The classification tool is released openly.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript analyzes the complete output of a multi-agent clinical extraction pipeline applied to 167,034 patient narratives (2.51 billion tokens across ten text-bearing channels). It introduces Provenance-based Redundancy Decomposition, a token-level classification by source, reporting that only 10.9% of the output is trainable-unique while 79.4% is redundant (via verbatim copying of source context and cross-record duplication). An independent model-free lossless compression analysis recovers the same mechanisms without using provenance labels. One channel shows near-zero redundancy. De-duplication before adaptation improves a clinical encoder on external disease-recognition benchmarks at fixed token budget, with results replicated across adaptation depths and a second benchmark. The classification tool is released openly.
Significance. If the central measurements hold, the work shows that raw token counts in LLM-generated clinical corpora can overstate information content by roughly ninefold and that uncorrected redundancy skews training distributions in measurable ways. The dual confirmation via provenance labels and label-free compression, plus the controlled downstream experiment at equal token budget, provide external grounding. The open release of the classification tool is a concrete reproducibility contribution.
minor comments (3)
- [§3] The precise operational definition of 'trainable-unique' (distinct from the provenance categories) should be stated explicitly in §3 or §4 with an equation or pseudocode, as the abstract alone leaves room for ambiguity on edge cases such as partial overlaps.
- [Results] Table or figure reporting per-channel redundancy percentages (mentioned for the near-zero channel) would benefit from an additional column showing token counts per channel to allow readers to assess the contribution of low-redundancy channels to the overall 10.9% figure.
- [Downstream evaluation] The downstream experiment description should include the exact token budget used and the number of random seeds for the adaptation runs, even if results are described as robust.
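The last comment asks for the token budget and seeds to be reported. A minimal sketch of what an equal-budget comparison could look like follows; it is not the paper's protocol. The exact-match de-duplication, whitespace token counting, budget of 120, and seed are all assumptions.

```python
import random

def build_corpus(records, token_budget, seed, deduplicate):
    """Sample records in seeded random order until the token budget is filled."""
    rng = random.Random(seed)
    # dict.fromkeys keeps the first occurrence of each exact-duplicate text.
    pool = list(dict.fromkeys(records)) if deduplicate else list(records)
    rng.shuffle(pool)
    chosen, used = [], 0
    for text in pool:
        n = len(text.split())  # whitespace tokens stand in for a real tokenizer
        if used + n > token_budget:
            continue
        chosen.append(text)
        used += n
    return chosen, used

# Hypothetical corpus: 50 distinct 4-token records, each duplicated 5 times.
distinct = [f"patient {i} finding {i}" for i in range(50)]
records = distinct * 5

raw_corpus, raw_used = build_corpus(records, 120, seed=0, deduplicate=False)
dedup_corpus, dedup_used = build_corpus(records, 120, seed=0, deduplicate=True)

print(f"raw:   {raw_used} tokens, {len(set(raw_corpus))} distinct records")
print(f"dedup: {dedup_used} tokens, {len(set(dedup_corpus))} distinct records")
```

Both arms consume the same 120 tokens, but the de-duplicated arm is guaranteed to spend them on 30 distinct records, while the raw arm may spend part of the budget on repeats. Reporting the budget and the seeds, as the referee requests, is what makes such a comparison reproducible.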
Simulated Author's Rebuttal
We thank the referee for the detailed summary of the manuscript, the positive assessment of its significance, and the recommendation of minor revision. No specific major comments appear in the report, so there are no individual points requiring point-by-point rebuttal or revision at this stage.
Circularity Check
No significant circularity; derivation self-contained
The paper defines Provenance-based Redundancy Decomposition directly from the pipeline's channel structure and source tracking on the generated output. This is not self-definitional because the classification is a literal accounting of token origins rather than a fit. An independent lossless compression analysis recovers the same two redundancy mechanisms (verbatim copying and cross-record duplication) without any reference to provenance labels. The downstream controlled experiment measures performance lift on external benchmarks at fixed token budget after de-duplication, providing external falsifiability. No self-citation is load-bearing, no fitted parameter is relabeled as a prediction, and no uniqueness theorem or ansatz is imported. The result is therefore not equivalent to its inputs by construction.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: The multi-agent pipeline's eleven channels produce text whose sources can be exhaustively tracked by provenance labels.