arXiv: 2605.09236 · v2 · submitted 2026-05-10 · cs.CL · cs.AI · cs.CY · cs.DL · cs.IR

Matching Meaning at Scale: Evaluating Semantic Search for 18th-Century Intellectual History through the Case of Locke

Yu Wu, Ananth Mahadevan, Filip Ginter, Michael Mathioudakis, Mikko Tolonen

Pith reviewed 2026-05-13 07:44 UTC · model grok-4.3

classification: cs.CL, cs.AI, cs.CY, cs.DL, cs.IR

keywords: semantic search, intellectual history, John Locke, 18th century, lexical baselines, implicit receptions, historical corpora, text reuse

The pith

Semantic search finds substantially more implicit receptions of Locke's ideas than lexical matching in 18th-century texts, though surface vocabulary still shapes results.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether semantic search can detect how writers engaged with John Locke's ideas through paraphrases and implicit connections rather than verbatim quotes. Experts created annotations based on a semantic taxonomy to mark these meaning-level matches in historical texts. Evaluation against lexical baselines shows semantic methods recover many more such receptions. Diagnostics indicate retrieval still depends partly on overlapping surface words. This work shows both the added reach and the remaining constraints when applying current semantic tools to large historical corpora.

Core claim

Using expert annotation grounded in a semantic taxonomy, the authors examine whether an off-the-shelf semantic search pipeline can surface meaning-level correspondences overlooked by lexical methods. Results show semantic search retrieves substantially more implicit receptions than lexical baselines. Linguistic diagnostics reveal a lexical gatekeeping effect in which retrieval remains partially constrained by surface vocabulary overlap.

What carries the argument

An off-the-shelf semantic search pipeline, evaluated against expert annotations of implicit receptions of Locke that are grounded in a semantic taxonomy.

If this is right

Historians gain access to paraphrased and complex implicit engagements with ideas that verbatim detection misses.
Large-scale tracing of idea circulation becomes feasible beyond direct quotations.
Retrieval performance stays influenced by lexical overlap, limiting full independence from surface forms.
Combining semantic and lexical approaches can improve coverage of intellectual transmission.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Training embeddings on period-specific corpora could weaken the lexical gatekeeping effect observed here.
The same evaluation setup could be applied to receptions of other key authors or in adjacent centuries.
Historians might test hybrid retrieval pipelines that weight semantic and lexical signals differently for different research questions.

Load-bearing premise

Semantic embeddings trained mainly on modern text can reliably detect 18th-century meaning-level matches when measured against expert semantic annotations.

What would settle it

Expert re-annotation of a larger sample showing no meaningful increase in implicit receptions recovered by semantic search compared with lexical search would undermine the reported advantage.

Figures

Figures reproduced from arXiv: 2605.09236 by Ananth Mahadevan, Filip Ginter, Michael Mathioudakis, Mikko Tolonen, Yu Wu.

**Figure 2.** The annotation interface screenshot.

Body text extracted alongside Figure 2: "… comes and the evidence for iterative deepening are detailed in Sec. 5. Source Selection: From the top 1,000 most frequently reused segments extracted in Sec. 4.3, we selected the top-5 highest-frequency quotes, supplemented by 5 randomly sampled from each of three subsequent frequency tiers (ranks 6-50, 51-150, and 151-1,000), yielding 20 search queries. Search and Fi…"

**Figure 3.** Cosine similarity scores across categories, …

**Figure 5.** Proportion of significant hits within the 50 …

**Figure 7.** Temporal and disciplinary count distribution of retrieved hits for Quotes 33 (left) and 464 (right). These …
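
The source-selection step quoted with Figure 2 (the top-5 quotes plus 5 random draws from each of three rank tiers) can be sketched roughly as follows. This is an illustration only, not the authors' code: the function name `select_queries`, the fixed seed, and the placeholder segment labels are assumptions introduced here.

```python
import random


def select_queries(ranked_segments, seed=0):
    """Pick 20 query segments from a frequency-ranked list (rank 1 first)."""
    if len(ranked_segments) < 1000:
        raise ValueError("expected at least 1,000 ranked segments")
    rng = random.Random(seed)  # fixed seed (an assumption) for reproducibility

    # The five highest-frequency segments are always included.
    selected = list(ranked_segments[:5])

    # Tiers as 1-based inclusive rank ranges, taken from the quoted text.
    tiers = [(6, 50), (51, 150), (151, 1000)]
    for first, last in tiers:
        tier = ranked_segments[first - 1:last]
        selected.extend(rng.sample(tier, 5))  # 5 draws without replacement
    return selected


# Placeholder labels standing in for the real reused segments.
segments = [f"segment_{rank}" for rank in range(1, 1001)]
queries = select_queries(segments)
print(len(queries))
```

The count works out as 5 + 3 × 5 = 20 queries, matching the figure stated in the quoted passage.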

Original abstract

While digitized corpora have transformed the study of intellectual transmission, current methods rely heavily on lexical text reuse detection, capturing verbatim quotations but fundamentally missing paraphrases and complex implicit engagement. This paper evaluates semantic search in 18th-century intellectual history through the reception of John Locke's foundational work. Using expert annotation grounded in a semantic taxonomy, we examine whether an off-the-shelf semantic search pipeline can surface meaning-level correspondences overlooked by lexical methods. Our results demonstrate that semantic search retrieves substantially more implicit receptions than lexical baselines. However, linguistic diagnostics also reveal a "lexical gatekeeping" effect, where retrieval remains partially constrained by surface vocabulary overlap. These findings highlight both the potential and the limitations of semantic retrieval for analyzing the circulation of ideas in large historical corpora. The data is available at https://github.com/COMHIS/locke-sim-data.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Semantic search beats lexical baselines on implicit Locke receptions, but retrieval still depends partly on surface overlap and the validation details are thin.

The core finding here is that an off-the-shelf semantic pipeline surfaces more non-verbatim receptions of Locke than lexical reuse detection does, yet retrieval stays partly gated by shared vocabulary. That lexical gatekeeping observation is the most useful part of the work. They built an expert-annotated taxonomy for the evaluation and released the data, which is a concrete step forward for anyone trying to move beyond string matching in 18th-century corpora. The comparison to standard lexical baselines is straightforward and the directional result holds up in the abstract. What is new is the specific application to Locke reception plus the diagnostic that modern embeddings do not fully decouple from surface forms in this period. The paper does not claim the embeddings are perfect; it flags the limitation.

Soft spots are the missing numbers: no inter-annotator agreement, no sample sizes, no statistical tests on the improvement. Without those, it is hard to judge how robust the gain actually is. The stress-test concern about modern embeddings on historical text is fair; the paper does not run a period-specific model or control directly for lexical overlap in the retrieval scores, so the semantic advantage is not fully isolated.

This is the kind of paper that belongs in a digital humanities venue or a methods-focused history journal. Readers working on idea circulation or large-scale historical search will get practical value from the data release and the gatekeeping note. It is not a breakthrough but it is honest empirical work with public artifacts. I would send it to peer review rather than desk reject; the evaluation setup is clear enough that referees can ask for the missing stats and controls.
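
One simple way to inspect the lexical overlap the letter refers to is a Jaccard score over word types shared by a query and a retrieved passage. The sketch below is a hedged illustration, not the paper's diagnostic: the crude tokenizer, the function names, and the example strings are all invented here.

```python
import re


def tokens(text):
    """Lowercase word types; a deliberately crude tokenizer (an assumption)."""
    return set(re.findall(r"[a-z]+", text.lower()))


def jaccard_overlap(query, passage):
    """Shared word types divided by all word types in either text (0.0 to 1.0)."""
    q, p = tokens(query), tokens(passage)
    if not q and not p:
        return 0.0
    return len(q & p) / len(q | p)


# Toy strings for illustration only; they are not taken from the released data.
query = "the mind is white paper void of all characters"
hit_shared_words = "the mind is like white paper without any characters"
hit_paraphrase = "our understanding begins as a blank sheet"

print(jaccard_overlap(query, hit_shared_words))  # 6 shared / 12 total types
print(jaccard_overlap(query, hit_paraphrase))    # no shared word types
```

If retrieval scores rise and fall with an overlap measure of this kind, that would be consistent with the gatekeeping effect; a paraphrase with near-zero overlap is the case a purely lexical method cannot reach.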

Referee Report

2 major / 2 minor

Summary. The paper evaluates an off-the-shelf semantic search pipeline against lexical baselines for detecting implicit (non-verbatim) receptions of John Locke's works in 18th-century texts. Using expert annotations grounded in a semantic taxonomy, it claims that semantic search surfaces substantially more implicit receptions than lexical methods, while also documenting a 'lexical gatekeeping' effect in which retrieval remains partially constrained by surface-form vocabulary overlap. The data and annotations are released publicly.

Significance. If the quantitative results and embedding fidelity hold after the requested clarifications, the work would be significant for digital intellectual history and computational humanities. It supplies a concrete, reproducible case study that quantifies the gap between lexical reuse detection and meaning-level retrieval, while isolating a diagnostic limitation ('lexical gatekeeping') that future methods must address. The public release of the annotated dataset further strengthens its utility for the community.

major comments (2)

[Abstract] The headline claim that semantic search 'retrieves substantially more implicit receptions' is presented without any reported sample sizes, inter-annotator agreement figures, or statistical tests, rendering the magnitude and reliability of the improvement unverifiable from the given text.
[Evaluation protocol] (Implicit in the abstract and methods description.) The central claim that off-the-shelf modern embeddings capture 18th-century meaning-level correspondences rests on expert annotations, yet the manuscript provides no direct validation of embedding fidelity to period usage (e.g., sense disambiguation accuracy on 18th-century vocabulary or comparison against embeddings trained on corpora from the period).

minor comments (2)

[Methods] The semantic taxonomy used for annotation is referenced but not described in sufficient detail (number of categories, inter-category distinctions, or examples of implicit vs. explicit reception).
[Data availability] The GitHub repository link is given, but the manuscript should include a brief data statement summarizing the number of annotated pairs, annotation guidelines, and any preprocessing steps applied to the historical corpus.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed report. We address each major comment below and indicate the revisions we will make to strengthen the manuscript.

Referee: [Abstract] The headline claim that semantic search 'retrieves substantially more implicit receptions' is presented without any reported sample sizes, inter-annotator agreement figures, or statistical tests, rendering the magnitude and reliability of the improvement unverifiable from the given text.

Authors: We agree that the abstract should report these quantitative details to make the central claim immediately verifiable. In the revised version we will expand the abstract to state the total number of expert-annotated passages, the inter-annotator agreement (Cohen’s kappa), and the statistical test used to compare semantic versus lexical retrieval rates. These figures are already present in the main text and will now appear in the abstract as well.
Revision: yes

Referee: [Evaluation protocol] (Implicit in the abstract and methods description.) The central claim that off-the-shelf modern embeddings capture 18th-century meaning-level correspondences rests on expert annotations, yet the manuscript provides no direct validation of embedding fidelity to period usage (e.g., sense disambiguation accuracy on 18th-century vocabulary or comparison against embeddings trained on corpora from the period).

Authors: The expert annotations, performed by specialists in 18th-century intellectual history and guided by an explicit semantic taxonomy, constitute our primary empirical validation that retrieved passages reflect meaning-level engagement with Locke. We did not, however, conduct separate sense-disambiguation accuracy tests on period vocabulary or train and compare against 18th-century-specific embeddings. In the revision we will add a new subsection in Methods that (a) justifies the use of off-the-shelf embeddings for reproducibility and (b) explicitly acknowledges the absence of these additional fidelity metrics as a limitation, while outlining how future work could address it. This clarification will be added without new experiments.
Revision: partial
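
For reference, Cohen's kappa, the agreement statistic named in the first response, compares observed agreement p_o with the agreement p_e expected by chance: kappa = (p_o - p_e) / (1 - p_e). A minimal sketch follows; the function name and the toy labels are assumptions made here (the label names borrow the taxonomy categories mentioned on this page), and nothing in it reproduces the paper's actual annotations or scores.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)

    # Observed agreement: share of items given the same label by both.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement from each annotator's own label frequencies.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)

    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Invented toy annotations, for illustration only.
annotator_1 = ["Lexical", "Paraphrase", "Meaning", "Topical", "Meaning", "Paraphrase"]
annotator_2 = ["Lexical", "Paraphrase", "Meaning", "Topical", "Topical", "Paraphrase"]
kappa = cohens_kappa(annotator_1, annotator_2)
print(round(kappa, 3))
```

In this toy case the annotators agree on 5 of 6 items (p_o = 5/6) and chance agreement is 9/36, so kappa comes to 7/9.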

Circularity Check

0 steps flagged

No significant circularity; evaluation uses external expert annotations and standard lexical baselines

The paper evaluates an off-the-shelf semantic search pipeline against expert annotations grounded in a semantic taxonomy and compares results directly to lexical baselines. No parameters are fitted to the evaluation data and then presented as predictions. No self-citations or prior author work are invoked as load-bearing uniqueness theorems or ansatzes. The reported 'lexical gatekeeping' effect is diagnosed via linguistic diagnostics on the retrieved outputs rather than assumed by construction. The derivation chain remains self-contained against external benchmarks with no reduction of claimed results to inputs by definition or self-reference.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the domain assumption that modern semantic models transfer to historical language and that expert annotations constitute reliable ground truth for implicit meaning.

axioms (2)

domain assumption: Semantic embeddings capture meaning-level correspondences beyond lexical overlap in 18th-century English.
Invoked when claiming semantic search surfaces implicit receptions overlooked by lexical methods.
domain assumption: Expert annotations grounded in the semantic taxonomy provide valid ground truth.
Required for interpreting retrieval results as evidence of improved meaning matching.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · relation: unclear
Paper passage: "We adopt a deliberately minimal dense retrieval-based search pipeline... encoded using the paraphrase-multilingual-mpnet-base-v2 model... Efficient vector indexing and retrieval were executed via FAISS"

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · absolute_floor_iff_bare_distinguishability · relation: unclear
Paper passage: "heuristic annotation taxonomy... Lexical Matches, Paraphrase Matches, Meaning Matches, Topical Matches"
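
The dense retrieval step described in the first quoted passage (text encoded to vectors, then nearest-neighbour search) can be illustrated with a brute-force cosine ranking. This is a sketch under stated assumptions, not the paper's pipeline: the hand-made three-dimensional vectors stand in for embeddings from paraphrase-multilingual-mpnet-base-v2, the exhaustive loop stands in for a FAISS index, and all names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def top_k(query_vec, corpus_vecs, k=2):
    """Return the k passages most similar to the query, best first."""
    scored = [(cosine(query_vec, vec), doc_id) for doc_id, vec in corpus_vecs.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [(doc_id, score) for score, doc_id in scored[:k]]


# Toy vectors; real sentence embeddings have hundreds of dimensions.
corpus = {
    "passage_a": [1.0, 0.0, 0.0],
    "passage_b": [0.6, 0.8, 0.0],
    "passage_c": [0.0, 0.0, 1.0],
}
query = [1.0, 0.0, 0.0]
hits = top_k(query, corpus, k=2)
print(hits)
```

An exhaustive scan like this is only practical for small collections; an approximate or optimized index is what makes the same ranking feasible at corpus scale.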

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
