Pith: machine review for the scientific record.

arXiv: 2605.09236 · v2 · submitted 2026-05-10 · cs.CL · cs.AI · cs.CY · cs.DL · cs.IR

Matching Meaning at Scale: Evaluating Semantic Search for 18th-Century Intellectual History through the Case of Locke

Pith reviewed 2026-05-13 07:44 UTC · model grok-4.3

keywords: semantic search · intellectual history · John Locke · 18th century · lexical baselines · implicit receptions · historical corpora · text reuse

The pith

Semantic search finds substantially more implicit receptions of Locke's ideas than lexical matching in 18th-century texts, though surface vocabulary still shapes results.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether semantic search can detect how writers engaged with John Locke's ideas through paraphrases and implicit connections rather than verbatim quotes. Experts created annotations based on a semantic taxonomy to mark these meaning-level matches in historical texts. Evaluation against lexical baselines shows semantic methods recover many more such receptions. Diagnostics indicate retrieval still depends partly on overlapping surface words. This work shows both the added reach and the remaining constraints when applying current semantic tools to large historical corpora.

Core claim

Using expert annotation grounded in a semantic taxonomy, the authors examine whether an off-the-shelf semantic search pipeline can surface meaning-level correspondences overlooked by lexical methods. Results show semantic search retrieves substantially more implicit receptions than lexical baselines. Linguistic diagnostics reveal a lexical gatekeeping effect in which retrieval remains partially constrained by surface vocabulary overlap.

What carries the argument

An off-the-shelf semantic search pipeline, evaluated against expert annotations of implicit receptions of Locke that are grounded in a semantic taxonomy.
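
As a rough illustration of what a semantic search step does, the sketch below ranks passages by cosine similarity to a query embedding. It is not the paper's pipeline: the encoder and index the authors used are not described on this page, and the three-dimensional vectors and passage ids (`p1` to `p3`) are hand-made placeholders.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # treat an all-zero vector as matching nothing
    return dot / (norm_u * norm_v)


def top_k(query_vec, passage_vecs, k=2):
    """Return the k (passage_id, score) pairs with the highest similarity."""
    scored = [(pid, cosine(query_vec, vec)) for pid, vec in passage_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Hypothetical embeddings; a real pipeline would obtain these from an encoder.
query = [1.0, 0.0, 1.0]
passages = {
    "p1": [1.0, 0.0, 1.0],  # same direction as the query
    "p2": [1.0, 1.0, 0.0],  # partially aligned
    "p3": [0.0, 1.0, 0.0],  # orthogonal to the query
}
hits = top_k(query, passages, k=2)
```

At corpus scale the exhaustive loop would normally be replaced by an approximate nearest-neighbour index; the ranking logic stays the same.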

If this is right

  • Historians gain access to paraphrased and complex implicit engagements with ideas that verbatim detection misses.
  • Large-scale tracing of idea circulation becomes feasible beyond direct quotations.
  • Retrieval performance stays influenced by lexical overlap, limiting full independence from surface forms.
  • Combining semantic and lexical approaches can improve coverage of intellectual transmission.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Training embeddings on period-specific corpora could weaken the lexical gatekeeping effect observed here.
  • The same evaluation setup could be applied to receptions of other key authors or in adjacent centuries.
  • Historians might test hybrid retrieval pipelines that weight semantic and lexical signals differently for different research questions.
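
A hybrid pipeline of the kind suggested in the last bullet could, in its simplest form, interpolate a semantic score and a lexical score. The sketch below is one possible reading, not something the paper proposes: the weight `alpha`, the choice of Jaccard token overlap as the lexical signal, the semantic score of 0.8, and both example sentences are assumptions made for illustration (the sentences are invented, not quotations from Locke or the corpus).

```python
def jaccard(a_tokens, b_tokens):
    """Share of distinct tokens common to both token lists."""
    a, b = set(a_tokens), set(b_tokens)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def hybrid_score(semantic_sim, lexical_sim, alpha=0.5):
    """Linear interpolation: alpha weights the semantic signal."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * semantic_sim + (1.0 - alpha) * lexical_sim


# Invented example sentences, tokenised by whitespace.
query_tokens = "all ideas come from experience".split()
passage_tokens = "our notions arise from experience alone".split()

lexical = jaccard(query_tokens, passage_tokens)  # 2 shared tokens out of 9 distinct
semantic = 0.8                                   # hypothetical embedding similarity
score = hybrid_score(semantic, lexical, alpha=0.75)
```

Raising `alpha` favours paraphrase-style matches; lowering it favours near-verbatim reuse, which is the trade-off a historian would tune per research question.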

Load-bearing premise

Semantic embeddings trained mainly on modern text can reliably detect 18th-century meaning-level matches when measured against expert semantic annotations.

What would settle it

Expert re-annotation of a larger sample showing no meaningful increase in implicit receptions recovered by semantic search compared with lexical search would undermine the reported advantage.

Figures

Figures reproduced from arXiv: 2605.09236 by Ananth Mahadevan, Filip Ginter, Michael Mathioudakis, Mikko Tolonen, Yu Wu.

Figure 1: The overall evaluation workflow. The automated search pipeline (top) extracts and filters semantic …
Figure 2: The annotation interface screenshot. …comes and the evidence for iterative deepening are detailed in Sec. 5. Source Selection: From the top 1,000 most frequently reused segments extracted in Sec. 4.3, we selected the top-5 highest-frequency quotes, supplemented by 5 randomly sampled from each of three subsequent frequency tiers (ranks 6-50, 51-150, and 151-1,000), yielding 20 search queries. Search and Fi…
Figure 3: Cosine similarity scores across categories, …
Figure 5: Proportion of significant hits within the 50 …
Figure 7: Temporal and disciplinary count distribution of retrieved hits for Quotes 33 (left) and 464 (right). These …
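
The source-selection step quoted in the Figure 2 text (top 5 quotes, plus 5 random picks from each of the rank tiers 6-50, 51-150, and 151-1,000) can be sketched as follows. This is a minimal reading of that description, not the authors' code: the function name `select_queries`, the fixed `seed`, and the use of integers as stand-ins for segments are assumptions.

```python
import random


def select_queries(ranked_segments, seed=0):
    """Pick 20 queries from segments ordered by reuse frequency (rank 1 first)."""
    if len(ranked_segments) < 1000:
        raise ValueError("expected at least the top 1,000 ranked segments")
    rng = random.Random(seed)  # seeded only so the sketch is repeatable

    selected = list(ranked_segments[:5])        # ranks 1-5, taken as-is
    tiers = [(6, 50), (51, 150), (151, 1000)]   # inclusive rank ranges
    for first_rank, last_rank in tiers:
        tier = ranked_segments[first_rank - 1:last_rank]
        selected.extend(rng.sample(tier, 5))    # 5 without replacement per tier
    return selected


# Stand-in data: the integer i represents the segment at rank i.
ranked = list(range(1, 1001))
queries = select_queries(ranked)
```

Sampling within tiers rather than uniformly keeps lower-frequency quotes in the query set, which a plain top-20 cut would exclude.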

Original abstract

While digitized corpora have transformed the study of intellectual transmission, current methods rely heavily on lexical text reuse detection, capturing verbatim quotations but fundamentally missing paraphrases and complex implicit engagement. This paper evaluates semantic search in 18th-century intellectual history through the reception of John Locke's foundational work. Using expert annotation grounded in a semantic taxonomy, we examine whether an off-the-shelf semantic search pipeline can surface meaning-level correspondences overlooked by lexical methods. Our results demonstrate that semantic search retrieves substantially more implicit receptions than lexical baselines. However, linguistic diagnostics also reveal a "lexical gatekeeping" effect, where retrieval remains partially constrained by surface vocabulary overlap. These findings highlight both the potential and the limitations of semantic retrieval for analyzing the circulation of ideas in large historical corpora. The data is available at https://github.com/COMHIS/locke-sim-data.
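
The abstract does not say which diagnostics expose the "lexical gatekeeping" effect. One simple probe, offered here only as an assumption, is to check whether embedding similarity rises with surface vocabulary overlap across query-passage pairs. The sketch computes a Pearson correlation; the overlap and similarity values are invented and deliberately linear, so they say nothing about the paper's data.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Invented values for five query-passage pairs.
lexical_overlap = [0.0, 0.1, 0.2, 0.3, 0.4]       # e.g. share of shared tokens
cosine_scores = [0.50, 0.55, 0.60, 0.65, 0.70]    # e.g. embedding similarity
r = pearson(lexical_overlap, cosine_scores)
```

A coefficient near 1 on real pairs would be consistent with gatekeeping; a value near 0 would suggest the embeddings rank passages largely independently of shared vocabulary.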

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper evaluates an off-the-shelf semantic search pipeline against lexical baselines for detecting implicit (non-verbatim) receptions of John Locke's works in 18th-century texts. Using expert annotations grounded in a semantic taxonomy, it claims that semantic search surfaces substantially more implicit receptions than lexical methods, while also documenting a 'lexical gatekeeping' effect in which retrieval remains partially constrained by surface-form vocabulary overlap. The data and annotations are released publicly.

Significance. If the quantitative results and embedding fidelity hold after the requested clarifications, the work would be significant for digital intellectual history and computational humanities. It supplies a concrete, reproducible case study that quantifies the gap between lexical reuse detection and meaning-level retrieval, while isolating a diagnostic limitation ('lexical gatekeeping') that future methods must address. The public release of the annotated dataset further strengthens its utility for the community.

major comments (2)
  1. [Abstract] The headline claim that semantic search 'retrieves substantially more implicit receptions' is presented without any reported sample sizes, inter-annotator agreement figures, or statistical tests, rendering the magnitude and reliability of the improvement unverifiable from the given text.
  2. [Evaluation protocol] (Implicit in the abstract and methods description.) The central claim that off-the-shelf modern embeddings capture 18th-century meaning-level correspondences rests on expert annotations, yet the manuscript provides no direct validation of embedding fidelity to period usage (e.g., sense disambiguation accuracy on 18th-century vocabulary or comparison against embeddings trained on period, i.e. 18th-century, corpora).
minor comments (2)
  1. [Methods] The semantic taxonomy used for annotation is referenced but not described in sufficient detail (number of categories, inter-category distinctions, or examples of implicit vs. explicit reception).
  2. [Data availability] The GitHub repository link is given, but the manuscript should include a brief data statement summarizing the number of annotated pairs, annotation guidelines, and any preprocessing steps applied to the historical corpus.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed report. We address each major comment below and indicate the revisions we will make to strengthen the manuscript.

Point-by-point responses
  1. Referee: [Abstract] The headline claim that semantic search 'retrieves substantially more implicit receptions' is presented without any reported sample sizes, inter-annotator agreement figures, or statistical tests, rendering the magnitude and reliability of the improvement unverifiable from the given text.

    Authors: We agree that the abstract should report these quantitative details to make the central claim immediately verifiable. In the revised version we will expand the abstract to state the total number of expert-annotated passages, the inter-annotator agreement (Cohen’s kappa), and the statistical test used to compare semantic versus lexical retrieval rates. These figures are already present in the main text and will now appear in the abstract as well. revision: yes

  2. Referee: [Evaluation protocol] (Implicit in the abstract and methods description.) The central claim that off-the-shelf modern embeddings capture 18th-century meaning-level correspondences rests on expert annotations, yet the manuscript provides no direct validation of embedding fidelity to period usage (e.g., sense disambiguation accuracy on 18th-century vocabulary or comparison against embeddings trained on period, i.e. 18th-century, corpora).

    Authors: The expert annotations, performed by specialists in 18th-century intellectual history and guided by an explicit semantic taxonomy, constitute our primary empirical validation that retrieved passages reflect meaning-level engagement with Locke. We did not, however, conduct separate sense-disambiguation accuracy tests on period vocabulary or train and compare against 18th-century-specific embeddings. In the revision we will add a new subsection in Methods that (a) justifies the use of off-the-shelf embeddings for reproducibility and (b) explicitly acknowledges the absence of these additional fidelity metrics as a limitation, while outlining how future work could address it. This clarification will be added without new experiments. revision: partial
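
The first response promises to report inter-annotator agreement as Cohen's kappa. For readers unfamiliar with the statistic, the sketch below computes it for two annotators from scratch. The label names (`implicit`, `none`) and the four example judgments are hypothetical; the paper's taxonomy categories and agreement figures are not given on this page.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    n = len(labels_a)
    if n == 0 or n != len(labels_b):
        raise ValueError("need two non-empty label lists of equal length")

    # Observed agreement: share of items given the same label by both.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement from each annotator's marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    categories = freq_a.keys() | freq_b.keys()
    expected = sum(freq_a[c] * freq_b[c] for c in categories) / (n * n)

    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical judgments on four passages.
annotator_1 = ["implicit", "implicit", "none", "none"]
annotator_2 = ["implicit", "none", "none", "none"]
kappa = cohens_kappa(annotator_1, annotator_2)
```

Here the annotators agree on 3 of 4 items (0.75) against a chance level of 0.5, so kappa is 0.5; raw agreement alone would overstate reliability.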

Circularity Check

0 steps flagged

No significant circularity; evaluation uses external expert annotations and standard lexical baselines

Full rationale

The paper evaluates an off-the-shelf semantic search pipeline against expert annotations grounded in a semantic taxonomy and compares results directly to lexical baselines. No parameters are fitted to the evaluation data and then presented as predictions. No self-citations or prior author work are invoked as load-bearing uniqueness theorems or ansatzes. The reported 'lexical gatekeeping' effect is diagnosed via linguistic diagnostics on the retrieved outputs rather than assumed by construction. The derivation chain remains self-contained against external benchmarks with no reduction of claimed results to inputs by definition or self-reference.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the domain assumption that modern semantic models transfer to historical language and that expert annotations constitute reliable ground truth for implicit meaning.

axioms (2)
  • domain assumption: Semantic embeddings capture meaning-level correspondences beyond lexical overlap in 18th-century English.
    Invoked when claiming semantic search surfaces implicit receptions overlooked by lexical methods.
  • domain assumption: Expert annotations grounded in the semantic taxonomy provide valid ground truth.
    Required for interpreting retrieval results as evidence of improved meaning matching.

pith-pipeline@v0.9.0 · 5467 in / 1202 out tokens · 27310 ms · 2026-05-13T07:44:27.832219+00:00 · methodology


