pith. machine review for the scientific record.

arxiv: 2605.05103 · v2 · submitted 2026-05-06 · 💻 cs.CL · cs.AI · cs.CY

Recognition: 2 theorem links · Lean Theorem

Text Corpora as Concept Fields: Black-Box Hallucination and Novelty Measurement

Chieh Ting Yeh, Nicholas S. Kersting, Saad Taame, Vittorio Castelli, Xinzhu Wang

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 03:39 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.CY
keywords concept fields · hallucination detection · novelty measurement · sentence embeddings · black-box evaluation · groundedness scoring · vector sequence database

The pith

A text corpus defines a concept field in sentence-embedding space whose local drift scores the groundedness of new sentences.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces concept fields as a way to represent a text corpus through the local patterns of how sentences follow one another in embedding space. By modeling the typical delta vectors between consecutive sentences as Gaussian distributions, it computes a z-score that indicates whether a candidate sentence transition fits the corpus's established patterns. This provides a black-box method to detect hallucinations in generated text or measure novelty, without needing access to the language model's internals. A sympathetic reader would care because it offers an interpretable, corpus-specific signal that works across different domains like legal texts and literature, potentially complementing other detection techniques.
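The scoring recipe described above can be sketched in a few lines. Everything below is illustrative: the embeddings are synthetic, and the neighborhood size `k` and the diagonal (per-dimension) Gaussian are assumptions standing in for whatever local estimator the paper actually uses.

```python
import numpy as np

def zeta_score(prev_emb, cand_emb, corpus_embs, corpus_deltas, k=8):
    """Score a candidate sentence transition against a corpus's concept field.

    Illustrative sketch: find the k corpus sentences nearest to the previous
    sentence's embedding, fit a per-dimension Gaussian to their observed
    next-sentence deltas, and return the mean absolute z-distance of the
    candidate's delta from that local estimate. Low = fits the corpus.
    """
    delta = cand_emb - prev_emb
    # k nearest corpus points to the previous sentence, by Euclidean distance
    dists = np.linalg.norm(corpus_embs - prev_emb, axis=1)
    nn = np.argsort(dists)[:k]
    local = corpus_deltas[nn]
    mu = local.mean(axis=0)
    sigma = local.std(axis=0) + 1e-8   # guard against zero variance
    return float(np.mean(np.abs((delta - mu) / sigma)))
```

A transition whose delta matches the local drift scores low; one that jumps against the field scores high, which is the triage signal.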

Core claim

The central claim is that the concept field, estimated from deltas in sentence-embedding space, yields a standardized deviation score that enables strong selective classification for grounded versus ungrounded sentences in controlled LLM rewrites, with coverage-risk behavior remaining consistent between the U.S. Code of Federal Regulations for hallucination detection and Project Gutenberg for novelty detection.

What carries the argument

The Concept Field: a local drift field with pointwise uncertainty in sentence-embedding space, constructed from the distribution of deltas between consecutive sentences in the corpus.

If this is right

  • If the local Gaussian approximation holds, the standardized deviation score supports a grounded/ungrounded/unsure triage policy with high selective classification accuracy.
  • The method produces comparable coverage-risk tradeoffs across legal and literary domains, unlike retrieval-centric baselines.
  • Divergence and curl operators applied to the field on dense clusters can surface semantic patterns such as logic sources, sinks, and implicit topics.
  • The Vector Sequence Database enables efficient storage and lookup of embeddings with sequence and delta metadata for corpus-attributable scoring.
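The last bullet can be made concrete with a toy in-memory stand-in for the VSDB. The class name, schema, and brute-force nearest-neighbor lookup here are hypothetical; the point is what sequence-and-delta metadata buys: every returned delta traces back to a document and position.

```python
import numpy as np

class MiniVSDB:
    """Toy in-memory stand-in for the paper's Vector Sequence Database:
    each record holds a sentence embedding plus its sequence position and
    the delta to the next sentence, so queries can recover local drift."""

    def __init__(self):
        self.embs, self.meta = [], []  # meta: (doc_id, position, next_delta)

    def add_document(self, doc_id, sent_embs):
        # store every sentence except the last (which has no next-delta)
        for i in range(len(sent_embs) - 1):
            self.embs.append(sent_embs[i])
            self.meta.append((doc_id, i, sent_embs[i + 1] - sent_embs[i]))

    def local_deltas(self, query_emb, k=5):
        # deltas attached to the k records nearest the query embedding;
        # each hit is corpus-attributable via its doc_id and position
        embs = np.asarray(self.embs)
        nn = np.argsort(np.linalg.norm(embs - query_emb, axis=1))[:k]
        return [self.meta[i] for i in nn]
```

A production version would replace the linear scan with an ANN index, but the metadata layout is the interesting part.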

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The approach could extend to streaming text by updating the concept field incrementally as new sentences arrive.
  • Integration with existing vector stores might allow hybrid retrieval-plus-field verification systems for generated content.
  • Qualitative field patterns could serve as a starting point for unsupervised topic modeling or narrative flow analysis in large corpora.

Load-bearing premise

That the differences between consecutive sentence embeddings in a corpus follow a distribution close enough to Gaussian for the z-score to meaningfully indicate agreement with the corpus's concept field.

What would settle it

A demonstration that the selective classification performance on held-out domains produces markedly different coverage-risk curves from those observed on the U.S. Code and Gutenberg corpora, or that the local delta distributions deviate substantially from Gaussian in measured text collections.

Figures

Figures reproduced from arXiv: 2605.05103 by Chieh Ting Yeh, Nicholas S. Kersting, Saad Taame, Vittorio Castelli, Xinzhu Wang.

Figure 1
Figure 1. General design: raw text is split into sentence sequences, embedded as concept vectors, and stored in a VSDB. The induced Concept Field supports groundedness scoring by mean absolute z-distance and exploratory geometric analysis. view at source ↗
Figure 2
Figure 2. Corpus vector field for 2D ballistics. view at source ↗
Figure 3
Figure 3. Interpolation check. Left/right columns are mild/intense source sequences; middle columns show SONAR-decoded embedding averages and IDW-field drift outputs. view at source ↗
Figure 4
Figure 4. Field-following examples. Top: start from an in-corpus CFR sentence and track a supported drift. Bottom: start from an arbitrary sentence and follow the local field where defined. Significance values are in white. view at source ↗
Figure 5
Figure 5. Threshold sensitivity heat maps for (a) hallucination detection in the CFR and (b) novelty detection in the Gutenberg corpus. Despite being very different corpora, the coverage-performance trade-off is similar (AURC = 0.03 and 0.08, respectively). view at source ↗
Figure 6
Figure 6. Interpolation check. Left/right columns are mild/intense source sequences; middle columns show SONAR-decoded embedding averages and IDW-field drift outputs. view at source ↗
Figure 7
Figure 7. Interpolation check. Left/right columns are mild/intense source sequences; middle columns show SONAR-decoded embedding averages and IDW-field drift outputs. view at source ↗
Figure 8
Figure 8. Examples of corpus regions with extreme divergence and/or curl, shown in 2D UMAP plots. view at source ↗
Original abstract

We introduce the \textbf{Concept Field} of a text corpus: a local drift field with pointwise uncertainty, estimated in sentence-embedding space from the deltas between consecutive sentences. Given a candidate sentence transition, we score its agreement with the field by $\zeta$, the mean absolute z-distance between the observed delta and the field's local Gaussian estimate. The score is black-box (no model internals), corpus-attributable (every score traces to nearby corpus sentences), and admits a probabilistically motivated interpretation under a local Gaussian approximation. We support the computation with the introduction of a \textbf{Vector Sequence Database (VSDB)} that stores embeddings together with sequence-position and next-delta metadata. We evaluate this approach on two large-scale settings: hallucination-style groundedness detection over the U.S. Code of Federal Regulations, and novelty detection over Project Gutenberg. On controlled LLM-generated rewrites, Concept Fields achieve strong selective classification performance under a grounded / ungrounded / unsure triage policy. Unlike retrieval-centric baselines, the resulting coverage-risk behavior is similar across both domains, supporting a degree of cross-domain stability for the standardized deviation score. We also sketch how divergence and curl of the Concept Field, computed on dense clusters, surface qualitatively meaningful semantic patterns (logic sources, sinks, and implicit topics), which we offer as hypothesis-generating rather than as a quantitative result. Concept Fields provide a fast, lightweight, and interpretable signal for groundedness and novelty, complementary to LLM-as-judge and white-box detectors.
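Read literally, the abstract's $\zeta$ can be written out as follows; the per-dimension notation and the local-neighborhood estimates are our reconstruction of "mean absolute z-distance," not the paper's exact definitions.

```latex
% Reconstruction of the zeta score from the abstract's description.
% For consecutive embeddings x_t, x_{t+1}, the observed delta is
%   \Delta = x_{t+1} - x_t.
% With local mean \mu(x_t) and per-dimension deviation \sigma_i(x_t)
% estimated from corpus deltas near x_t (d = embedding dimension):
\zeta(x_t, x_{t+1}) \;=\; \frac{1}{d} \sum_{i=1}^{d}
  \frac{\bigl|\Delta_i - \mu_i(x_t)\bigr|}{\sigma_i(x_t)}
```

Under the local Gaussian approximation each summand is a folded-normal draw, which is what gives the score its probabilistic interpretation.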

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces the Concept Field as a local drift field in sentence-embedding space, constructed from deltas between consecutive sentences, along with a Vector Sequence Database (VSDB) for storage and querying. It defines a black-box groundedness/novelty score ζ as the mean absolute z-distance to the local Gaussian estimate of the field. The approach is evaluated on hallucination detection in the U.S. Code of Federal Regulations and novelty detection in Project Gutenberg using controlled LLM-generated rewrites, claiming strong selective classification under a grounded/ungrounded/unsure policy with cross-domain stable coverage-risk behavior compared to retrieval baselines. It also sketches qualitative uses of field divergence and curl for semantic pattern discovery.

Significance. If the local Gaussian model is validated and the empirical claims hold, the work offers a lightweight, interpretable, black-box method for groundedness and novelty detection that is corpus-attributable and shows potential cross-domain stability. The VSDB provides reusable infrastructure, and the vector-field framing enables new qualitative analyses of semantic flows (sources, sinks, topics). This is complementary to LLM-as-judge and white-box approaches. Credit is due for the parameter-free standardized score and the focus on traceability to corpus sentences.

major comments (2)
  1. [Method section (Concept Field and ζ definition)] The probabilistic interpretation and cross-domain stability of ζ rest on the assumption that local deltas between consecutive sentence embeddings are well-approximated by a multivariate Gaussian whose covariance yields calibrated z-distances. Sentence-embedding spaces are high-dimensional and typically concentrate on lower-dimensional manifolds, making Gaussianity unlikely due to concentration of measure, heavy tails, or anisotropy. No empirical validation (normality tests, QQ plots, or residual analysis on the deltas) is supplied in the method or evaluation sections, which is load-bearing for the triage policy and the claim that the standardized score behaves similarly across domains.
  2. [Evaluation section] The abstract and evaluation claim 'strong selective classification performance' and 'coverage-risk behavior is similar across both domains' on controlled LLM rewrites, yet supply no numeric metrics (AUC, coverage at given risk, precision-recall), no explicit baseline numbers for retrieval-centric methods, no error bars, and no statistical tests. This absence prevents assessment of the magnitude of the result and the robustness of the cross-domain stability finding.
minor comments (2)
  1. [Abstract] The abstract asserts quantitative performance claims without any numbers or effect sizes; adding one or two key metrics (e.g., AUC under the triage policy) would make the summary self-contained.
  2. [Notation and preliminaries] Notation for the Concept Field, local covariance estimate, and VSDB metadata could be introduced with a compact formal definition or pseudocode early in the paper to aid readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed comments. We address each major point below and indicate the revisions we will make to strengthen the manuscript.

Point-by-point responses
  1. Referee: [Method section (Concept Field and ζ definition)] The probabilistic interpretation and cross-domain stability of ζ rest on the assumption that local deltas between consecutive sentence embeddings are well-approximated by a multivariate Gaussian whose covariance yields calibrated z-distances. Sentence-embedding spaces are high-dimensional and typically concentrate on lower-dimensional manifolds, making Gaussianity unlikely due to concentration of measure, heavy tails, or anisotropy. No empirical validation (normality tests, QQ plots, or residual analysis on the deltas) is supplied in the method or evaluation sections, which is load-bearing for the triage policy and the claim that the standardized score behaves similarly across domains.

    Authors: We thank the referee for highlighting this important point. The local Gaussian model is indeed an approximation chosen for its interpretability and to enable the standardized z-score. While we believe the local estimation mitigates some effects of high-dimensional concentration by focusing on nearby points in the corpus, we agree that explicit validation is necessary to support the cross-domain claims. In the revised manuscript, we will include QQ plots, histograms of standardized residuals, and formal normality tests (e.g., Shapiro-Wilk) computed on the sentence deltas from both evaluation corpora. These additions will allow assessment of how well the Gaussian assumption holds and inform the reliability of the triage policy. revision: yes

  2. Referee: [Evaluation section] The abstract and evaluation claim 'strong selective classification performance' and 'coverage-risk behavior is similar across both domains' on controlled LLM rewrites, yet supply no numeric metrics (AUC, coverage at given risk, precision-recall), no explicit baseline numbers for retrieval-centric methods, no error bars, and no statistical tests. This absence prevents assessment of the magnitude of the result and the robustness of the cross-domain stability finding.

    Authors: We agree that providing concrete numerical results is essential for evaluating the strength of our claims. The current manuscript emphasizes the qualitative aspects of the coverage-risk curves and the cross-domain similarity, but we will expand the evaluation section in the revision to include specific metrics such as AUC-ROC for the selective classification, coverage at fixed risk levels (e.g., 5% and 10% risk), precision and recall for the grounded/ungrounded classes, direct numeric comparisons to the retrieval baselines, standard error bars from repeated experiments, and p-values from statistical tests comparing the methods. This will substantiate the 'strong' performance and the stability observation with quantifiable evidence. revision: yes
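The coverage-at-fixed-risk numbers promised above follow from a standard risk-coverage computation. This sketch assumes lower ζ means higher confidence, and uses the discrete mean of the risk curve as its AURC approximation; both choices are ours, not necessarily the authors'.

```python
import numpy as np

def risk_coverage_curve(scores, is_error):
    """Selective-classification diagnostics of the kind the rebuttal promises.

    Sort examples by confidence (here: ascending zeta, low = confident), then
    at each coverage level report the error rate (risk) among the covered
    examples. AURC is approximated as the mean risk across coverage points.
    """
    order = np.argsort(scores)                  # most confident first
    errors = np.asarray(is_error, dtype=float)[order]
    n = len(errors)
    risk = np.cumsum(errors) / np.arange(1, n + 1)
    coverage = np.arange(1, n + 1) / n
    aurc = float(risk.mean())
    return coverage, risk, aurc
```

Coverage at a fixed risk level (e.g., 5%) is then the largest coverage whose running risk stays below that threshold.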

Circularity Check

0 steps flagged

No circularity: the ζ score and the Concept Field are defined directly from corpus deltas, without reduction to fitted inputs or self-citations.

Full rationale

The paper defines the Concept Field as a local drift field estimated from observed deltas between consecutive sentence embeddings in the corpus, with zeta computed as the mean absolute z-distance to the local Gaussian approximation of those deltas. This construction is self-contained and does not involve fitting parameters to a target task or renaming a known result as a prediction. The reported selective classification performance on controlled LLM rewrites is an empirical measurement of the defined score's behavior, not a derivation that collapses back to the inputs by construction. No load-bearing self-citations or uniqueness theorems appear in the provided text, and the Gaussian approximation is stated as an assumption rather than derived from the result itself.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 2 invented entities

The approach rests on the unvalidated local Gaussian model in embedding space and introduces two new constructs without external evidence; no free parameters are explicitly fitted in the abstract.

axioms (2)
  • domain assumption Local Gaussian approximation holds for sentence deltas in embedding space
    Invoked to define the field's uncertainty and to motivate the z-distance score.
  • domain assumption Consecutive sentences provide representative semantic transitions for field estimation
    Basis for constructing the drift field from the corpus.
invented entities (2)
  • Concept Field no independent evidence
    purpose: Local drift field with pointwise uncertainty for scoring text transitions
    Newly introduced mathematical object central to the method.
  • Vector Sequence Database (VSDB) no independent evidence
    purpose: Storage of embeddings with sequence-position and next-delta metadata to enable efficient field queries
    New data structure proposed to support computation.

pith-pipeline@v0.9.0 · 5590 in / 1424 out tokens · 67784 ms · 2026-05-12T03:39:17.349363+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Reference graph

Works this paper leans on

65 extracted references · 65 canonical work pages · 3 internal anchors

  1. [1]

T. Doan. Hallucination mitigation for faithful retrieval-augmented generation. Aaltodoc (Thesis), 2025.

  2. [2]

    Decoding by Contrasting Layers

    Introduces MAESTRO, a hybrid strategy combining "Decoding by Contrasting Layers" (DoLa) with attention-based "Over-trust Penalty and Retrospection-Allocation" (OPERA) to reduce hallucinations in RAG

  3. [3]

    attention sinks

Yining Wang, Mi Zhang, Junjie Sun, et al. Mirage in the eyes: Hallucination attack on multimodal large language models with only attention sink. arXiv preprint arXiv:2501.15269, 2025. While focusing on attacks, this paper details white-box mitigation through the manipulation of "attention sinks" and hidden embeddings to ensure image-text relevance in mul...

  4. [4]

    vague knowledge boundary

Yue Zhang, Yafu Li, Leyang Cui, et al. Siren's song in the AI ocean: A survey on hallucination in large language models. arXiv/MIT Press Direct, 2025. A foundational survey that taxonomizes white-box challenges, specifically noting how a model's "vague knowledge boundary" and internal states complicate the explanation and mitigation of errors.

  5. [5]

Verify when uncertain: Beyond self-consistency in black box hallucination detection

    Yihao Xue, Kristjan Greenewald, Youssef Mroueh, and Baharan Mirzasoleiman. Verify when uncertain: Beyond self-consistency in black box hallucination detection. arXiv preprint arXiv:2502.15845, 2025. Contrasts black-box self-consistency with white-box methods that utilize token-level logits and intermediate representations (referencing prior work like Varsh...

  6. [6]

    universal

Y. Xu et al. FactSelfCheck: Fact-level black-box hallucination detection for LLMs. ACL Anthology (Findings of EACL), 2026. Provides a 2026 update on the white-box vs. black-box divide, noting that white-box methods analyze internal states (e.g., Azaria and Mitchell 2023) to achieve more "universal" detection across different model architectures.

  7. [7]

    grey-box

Sebastian Farquhar, Jannik Kossen, Lorenz Kuhn, and Yarin Gal. Detecting hallucinations in large language models using semantic entropy. Nature, 630(8017):625–630, 2024. A landmark paper defining the "grey-box" approach of using token-level probabilities to cluster responses by meaning rather than syntax.

  8. [8]

    Semantic uncertainty: Detecting hallucinations in LLMs

Lorenz Kuhn, Yarin Gal, and Sebastian Farquhar. Semantic uncertainty: Detecting hallucinations in LLMs. In International Conference on Learning Representations (ICLR), 2023. The predecessor to the 2024 Nature paper; focuses on using entropy over the final logits to quantify epistemic uncertainty.

  9. [9]

Judging LLM-as-a-judge with MT-bench and chatbot arena

    Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. Judging LLM-as-a-judge with MT-bench and chatbot arena. In Advances in Neural Information Processing Systems (NeurIPS), volume 36, 2023.

  10. [10]

G-Eval: NLG evaluation using GPT-4 with better human alignment

    Yang Liu, Dan Iter, Yichong Xu, Shuohang Wang, Ruochen Xu, and Chenguang Zhu. G-Eval: NLG evaluation using GPT-4 with better human alignment. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 12432–12448, 2023. Key reference for using multi-step reasoning (CoT) in the evaluation prompt.

  11. [11]

    LLM-as-a-judge: Rapid evaluation of legal document recommendation for retrieval-augmented generation

    Anu Pradhan, Alexandra Ortan, Apurv Verma, and Madhavan Seshadri. LLM-as-a-judge: Rapid evaluation of legal document recommendation for retrieval-augmented generation. In ResearchGate / Bloomberg Technical Reports, 2025. Analyzes Inter-Rater Reliability (IRR) between human legal experts and LLM judges in RAG pipelines

  12. [12]

Harmonic LLMs are trustworthy

    Nicholas S. Kersting, Mohammad Rahman, Suchismitha Vedala, and Yang Wang. Harmonic LLMs are trustworthy, 2024.

  13. [13]

Attention is all you need

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems (NeurIPS), volume 30, pages 5998–6008, 2017.

  14. [14]

SONAR: Sentence-level multimodal and language-agnostic representations

    Paul-Ambroise Duquenne, Holger Schwenk, and Benoît Sagot. SONAR: Sentence-level multimodal and language-agnostic representations. arXiv preprint arXiv:2308.11466, 2023.

  15. [15]

    Code of federal regulations

    US Government. Code of federal regulations. https://www.govinfo.gov/bulkdata/CFR/

  16. [16]

    Gutenberg corpus

Project Gutenberg. Gutenberg corpus. https://www.kaggle.com/datasets/lokeshparab/gutenberg-books-and-metadata-2025

  17. [17]

Potsawee Manakul, Adian Liusie, and Mark J. F. Gales. SelfCheckGPT: Zero-resource black-box hallucination detection for generative large language models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 9004–9017. Association for Computational Linguistics, 2023.

  18. [18]

SAC3: Reliable hallucination detection in black-box language models via semantic-aware cross-check consistency

    Jiaxin Zhang, Zhuohang Li, Kamalika Das, Bradley Malin, and Kumar Sricharan. SAC3: Reliable hallucination detection in black-box language models via semantic-aware cross-check consistency. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023.

  19. [19]

FactSelfCheck: Fact-level black-box hallucination detection for LLMs

    Y. Xu et al. FactSelfCheck: Fact-level black-box hallucination detection for LLMs. Findings of the European Chapter of the Association for Computational Linguistics (EACL), 2026. arXiv preprint arXiv:2503.17229.

  20. [20]

Hallucination detection: A probabilistic framework using embeddings distance analysis

    Emanuele Ricco et al. Hallucination detection: A probabilistic framework using embeddings distance analysis. arXiv preprint arXiv:2502.08663, 2025.

  21. [21]

Geometric hallucination detection via directional consistency in embedding space

    CERT Research. Geometric hallucination detection via directional consistency in embedding space. https://cert-framework.com/docs/research/dc-paper, January 2026.

  22. [22]

    A Geometric Taxonomy of Hallucinations in LLMs

CERT Research. A geometric taxonomy of hallucinations in LLMs. arXiv preprint arXiv:2602.13224, 2026.

  23. [23]

Geometric uncertainty for detecting and correcting hallucinations in LLMs

    others. Geometric uncertainty for detecting and correcting hallucinations in LLMs. arXiv preprint arXiv:2509.13813, 2025.

  24. [24]

    Dynamic programming algorithm optimization for spoken word recognition

Hiroaki Sakoe and Seibi Chiba. Dynamic programming algorithm optimization for spoken word recognition. IEEE Transactions on Acoustics, Speech, and Signal Processing, volume 26, pages 43–49, 1978.

  25. [25]

    Sentence similarity based on dynamic time warping

Xiao-Ying Liu, Yi-Ming Zhou, and Rong-Shu Zheng. Sentence similarity based on dynamic time warping. In Proceedings of the International Conference on Semantic Computing (ICSC), pages 250–256. IEEE, 2007.

  26. [26]

Measuring similarity of academic articles with semantic profile and joint word embedding

    Guangyou Zhu and Carlos A. Iglesias. Measuring similarity of academic articles with semantic profile and joint word embedding. Tsinghua Science and Technology, 22(6):619–632, 2017.

  27. [27]

Similarity measure of time series based on angle-distance penalized metric dynamic time warping

    others. Similarity measure of time series based on angle-distance penalized metric dynamic time warping. Scientific Reports, 2025.

  28. [28]

ColBERT: Efficient and effective passage search via contextualized late interaction over BERT

    Omar Khattab and Matei Zaharia. ColBERT: Efficient and effective passage search via contextualized late interaction over BERT. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 39–48, 2020.

  29. [29]

    MUVERA: Multi-vector retrieval via fixed dimensional encodings

    Laxman Dhulipala et al. MUVERA: Multi-vector retrieval via fixed dimensional encodings. In Advances in Neural Information Processing Systems (NeurIPS), volume 37, 2024

  30. [30]

    General neural embedding for sequence distance approximation

others. General neural embedding for sequence distance approximation. In Proceedings of the 48th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2025.

  31. [31]

    Topology of word embeddings: Singularities reflect polysemy

Alexander Jakubowski, Milica Gasic, and Marcus Zibrowius. Topology of word embeddings: Singularities reflect polysemy. In Proceedings of the Ninth Joint Conference on Lexical and Computational Semantics (*SEM), pages 103–113. Association for Computational Linguistics, 2020.

  32. [32]

    Topological Data Analysis Applications in Natural Language Processing: A Survey

Ada Uchendu et al. Unveiling topological structures in text: A comprehensive survey of topological data analysis applications in NLP. arXiv preprint arXiv:2411.10298, 2025. Updated June 2025, surveying 95 papers.

  33. [33]

Omnilingual SONAR: Cross-lingual and cross-modal sentence embeddings bridging massively multilingual text and speech

    OmniSONAR Team, João Maria Janeiro, et al. Omnilingual SONAR: Cross-lingual and cross-modal sentence embeddings bridging massively multilingual text and speech. arXiv preprint arXiv:2603.16606, 2026.

  34. [34]

    The Faiss library

Matthijs Douze, Alexandr Guzhva, Chengqi Deng, Jeff Johnson, Gergely Szilvasy, Pierre-Emmanuel Mazaré, Maria Lomeli, Lucas Hosseini, and Hervé Jégou. The Faiss library. arXiv preprint arXiv:2401.08281, 2024. Revised October 2025.

  35. [35]

A two-dimensional interpolation function for irregularly-spaced data

    Donald Shepard. A two-dimensional interpolation function for irregularly-spaced data. Proceedings of the 1968 23rd ACM National Conference, pages 517–524, 1968.

  36. [36]

spaCy. https://spacy.io/

  37. [37]

    English premier league - match commentary

Pranav Karnani. English premier league - match commentary. https://www.kaggle.com/datasets/pranavkarnani/english-premier-league-match-commentary?select=23_24_match_details.csv

  38. [38]

    s1: “The annualized and present value estimates of monetised costs and benefits over the 10-year period from 2023 to 2032 using three per cent and seven per cent discount rates are summarized below.” s2: “The monetized, annualized costs (2020 dollars) of the provisions in the NPRM (if finalized as proposed) are estimated to be $29 million (range: $77–$87 ...

  39. [39]

    s1: “The annualized and present value estimates of monetised costs and benefits over the 10-year period from 2023 to 2032 using three to 7 per cent discount rates are summarized 13 below.” s2: “The annualised, monetized costs (2020 dollars) of the provisions of the NPRM (if finalized as proposed) are estimated to be $29 million (range: $77–$87 million) us...

  40. [40]

    The annualized and present value estimates of monetised costs and benefits over the 10-year period from 2023 to 2032 using three to 7 per cent discount rates are summarized below

    s1: “The annualized and present value estimates of monetised costs and benefits over the 10-year period from 2023 to 2032 using three to 7 per cent discount rates are summarized below.” s2: “The annualised, monetized costs (2020 dollars) of the provisions of the NPRM (if finalized as proposed) are estimated to be $29 million (range: $7.7 million to $8.7 m...

  41. [41]

    s1: “The annualized and present value estimates of monetised costs and benefits over the 10-year period from 2023 to 2032 using three per cent and seven per cent discount rates are summarized below.” s2: “The monetized, annual $20 million costs (provisions in the NPRM) if finalized (as proposed) are estimated to be $29 million: $7.70 million (from $7.87 m...

  42. [42]

    s1: “The annualized and present value estimates of monetised costs and benefits over the 10-year period from 2023 to 2032 using three per cent and seven per cent discount rates are summarized below.” s2: “The annualised, monetized costs (2020 dollars) of the provisions of the NPRM (if finalized as proposed) are estimated to be $29 million (range: $77–8.7 ...

  43. [43]

    s1: “The annualized and present value estimates of monetised costs and benefits over the 10-year period from 2023 to 2032 using three per cent and seven per cent discount rates are summarized below.” s2: “The annualised, monetized costs (2020 dollars) of the provisions of the NPRM (if finalized as proposed) are estimated to be $29 million (range: $7.7 mil...

  44. [44]

    s1: “The annualized and present value estimates of monetised costs and benefits over the 10-year period from 2023 to 2032 using three per cent and seven per cent discount rates are summarized below.” s2: “The monetized, annual costs (2020 dollars) of the provisions in the NPRM (if finalized as proposed) are estimated to be $29 million (range: $7.7 to $8.7...

  45. [45]

    s1: “The annualized and present value estimates of monetised costs and benefits over the 10-year period from 2023 to 2032 using three per cent and seven per cent discount rates are summarized below.” s2: “The annualised, monetized costs (2020 dollars) of the provisions of the NPRM (if finalized as proposed) are estimated to be $29 million (range: $77–$87 ...

  46. [46]

    s1: “The annualized and present value estimates of monetised costs and benefits over the 10-year period from 2023 to 2032 using three per cent and seven per cent discount rates are summarized below.” s2: “The annualised, monetized costs (2020 dollars) of the provisions of the NPRM (if finalized as proposed) are estimated to be $29 million (range: $7.7 mil...

  47. [47]

    s1: “The annualized and present value estimates of monetised costs and benefits over the 10-year period from 2023 to 2032 using three per cent and seven per cent discount rates are summarized below.” s2: “The annualised, monetized costs (2020 dollars) of the provisions of the NPRM (if finalized as proposed) are estimated to be $29 million (range: $7,70...

  48. [48]

    extract pairs of sequential sentences from CFR sequences
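    The pair-extraction step above can be sketched minimally as follows; sentence segmentation itself (splitting CFR text into sentences) is assumed done upstream, and `sequential_pairs` is a hypothetical helper name.

    ```python
    def sequential_pairs(sentences):
        """Return consecutive (s1, s2) pairs from an ordered sentence list."""
        return list(zip(sentences, sentences[1:]))

    # Toy usage; real inputs would be sentence-segmented CFR sections.
    pairs = sequential_pairs(["A first rule.", "A second rule.", "A third rule."])
    ```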

  49. [49]

    reword with LLM to change slightly; use LLM-as-judge with human spot-checking to keep only pairs that do not significantly change semantics; * Generative Prompt: “Generate a new second sentence that preserves the meaning of the original second sentence while making only a slight wording change. - Keep the new second sentence very close to the original wording and avoid broad paraphrases, extra details, or a no-op copy...

  50. [51]

    for each re-worded pair, find the closest pair (by similarity of the first sentence) in the VSDB, and its ζ; if ζ > ζ_high, consider this pair ‘too noisy’; i.e., in actual inference application one would not trust this region of the Concept Field, tantamount to the noisy region of the ballistic plots in our 2D example
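    The closest-pair lookup and noisy-region filter described above can be sketched as follows. The names `vsdb_first_embs`, `vsdb_zeta`, and the threshold value `ZETA_HIGH` are assumptions (the paper's actual thresholds are not given in this excerpt), and embeddings are assumed unit-normalized so cosine similarity reduces to a dot product.

    ```python
    import numpy as np

    ZETA_HIGH = 3.0  # hypothetical upper z-score threshold

    def closest_pair_zeta(query_first_emb, vsdb_first_embs, vsdb_zeta):
        """Find the stored pair whose first-sentence embedding is most similar
        to the query's first sentence; return its index and its zeta score."""
        # Unit-normalized embeddings: cosine similarity is a dot product.
        sims = vsdb_first_embs @ query_first_emb
        idx = int(np.argmax(sims))
        return idx, float(vsdb_zeta[idx])

    def is_too_noisy(zeta, zeta_high=ZETA_HIGH):
        """High-variance regions of the concept field are not trusted."""
        return zeta > zeta_high

    # Toy usage with random unit vectors standing in for sentence embeddings.
    rng = np.random.default_rng(0)
    db = rng.normal(size=(5, 8))
    db /= np.linalg.norm(db, axis=1, keepdims=True)
    q = db[2] + 0.01 * rng.normal(size=8)  # near-duplicate of entry 2
    q /= np.linalg.norm(q)
    idx, zeta = closest_pair_zeta(q, db, np.array([0.5, 1.2, 4.1, 0.3, 2.0]))
    ```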

  51. [52]

    compare the test sequence’s ζ with that of the closest sequence: if both are less than ζ_low, mark as True Negative (for hallucination). If ζ > ζ_high, False Positive. Otherwise mark as “unsure” (this lowers coverage).
    – Negative examples
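    A minimal sketch of this three-way decision rule; the threshold values are assumptions, since the paper's actual ζ_low and ζ_high are not given in this excerpt.

    ```python
    ZETA_LOW, ZETA_HIGH = 1.0, 3.0  # hypothetical thresholds

    def classify_grounded(test_zeta, closest_zeta,
                          zeta_low=ZETA_LOW, zeta_high=ZETA_HIGH):
        """Selective decision for a grounded (positively-labeled) rewrite:
        both scores low -> True Negative; test score high -> False Positive;
        anything else -> 'unsure', which lowers coverage."""
        if test_zeta < zeta_low and closest_zeta < zeta_low:
            return "TN"
        if test_zeta > zeta_high:
            return "FP"
        return "unsure"
    ```

    The "unsure" branch is what makes the classifier selective: abstaining on ambiguous regions trades coverage for lower risk.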

  52. [53]

    extract pairs of sequential sentences from CFR sequences (possibly same as above)

  53. [54]

    reword with LLM to change the second sentence significantly; use LLM-as-judge with human spot-checking to keep only sequences that have significantly changed semantics; * Opposite-type example · Generative Prompt: “Generate a new second sentence that expresses the opposite or a clear contradiction of the original second sentence. - Make the contradiction c...

  54. [55]

    check ζ of each re-worded pair

  55. [56]

    for each re-worded pair, find the closest pair (by similarity of the first sentence) in the VSDB, and its ζ; if ζ > ζ_high, consider this pair ‘too noisy’

  56. [57]

    compare the test sequence’s ζ with that of its closest sequence: if both are less than ζ_low, mark as False Negative (for untruthfulness). If ζ > ζ_high, True Positive.
    – Compute F1 Score:
      * P = TP/(TP + FP)
      * R = TP/(TP + FN)
      * F1 = 2·P·R/(P + R)
    – Other metrics (MCC, AUC, Cross Entropy) follow standard procedure.
    – We also compute Area under the Risk Curve (AU...
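    The count-based metrics listed above follow directly from the confusion matrix; this sketch also includes MCC. The other quantities (AUC, cross entropy, area under the risk curve) need per-example scores rather than counts, so they are not shown. The example counts are the year-2000 row of the Gutenberg results table later in this appendix.

    ```python
    import math

    def confusion_metrics(tp, tn, fp, fn):
        """Precision, recall, F1, and Matthews correlation coefficient
        from raw confusion-matrix counts."""
        p = tp / (tp + fp)
        r = tp / (tp + fn)
        f1 = 2 * p * r / (p + r)
        mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
        mcc = (tp * tn - fp * fn) / mcc_den
        return p, r, f1, mcc

    # Year-2000 row: TP=618, TN=885, FP=61, FN=138.
    p, r, f1, mcc = confusion_metrics(618, 885, 61, 138)
    ```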

  57. [58]

    extract pairs of sequential sentences from Gutenberg sequences

  58. [59]

    reword with LLM to change both sentences slightly; use LLM-as-judge with human spot-checking to keep only pairs that do not significantly change semantics; * Generative Prompt: “You will be given a sentence pair. Rewrite both sentences with slight wording changes while preserving their meanings. Requirements: - Preserve the meaning of the first sentence. -...

  59. [60]

    check ζ of each re-worded pair: if ‘out of corpus’ or ‘insufficient statistics’, mark as False Positive

  60. [62]

    compare the test sequence’s ζ with that of the closest sequence: if both are less than ζ_low, mark as True Negative (for novelty). If ζ > ζ_high, False Positive.
    – Positive examples

  61. [63]

    extract pairs of sequential sentences from Gutenberg sequences (possibly same as above)

    Year  TP   TN   FP  FN   P     R     F1    MCC   AUC   LogLoss
    2000  618  885  61  138  0.91  0.82  0.86  0.76  0.94  1.37
    2001  583  872  62  174  0.90  0.77  0.83  0.72  0.92  1.41
    2005  619  846  58  174  0.91  0.78  0.84  0.73  0.93  1.35
    2008  625  835  85  128  0.88  0.83  0.85  0.74  0.93  1.55
    2011  578  833  66  159  0.90  0...

  62. [64]

    reword with LLM to change the first sentence slightly and the second sentence radically; use LLM-as-judge with human spot-checking to keep only sequences that have significantly changed semantics; * Generative Prompt: “You will be given a sentence pair. Rewrite both sentences. Requirements: - Rewrite the first sentence with only a slight wording change while pres...

  63. [65]

    check ζ of each re-worded pair: if ‘out of corpus’ or ‘insufficient statistics’, mark as True Positive

  64. [66]

    for each re-worded pair, find the closest pair (by similarity of the first sentence) in the VSDB, and its ζ; if ζ > ζ_high, consider this pair “too noisy”

  65. [67]

    compare the test sequence’s ζ with that of the closest sequence: if both are less than ζ_low, mark as False Negative (for novelty). If ζ > ζ_high, True Positive.
    – Compute F1 Score:
      * P = TP/(TP + FP)
      * R = TP/(TP + FN)
      * F1 = 2·P·R/(P + R)
    – Other metrics (MCC, AUC, Cross Entropy) follow standard procedure as before. As for creativity detection, we generate Negative exampl...