pith. machine review for the scientific record. sign in

arxiv: 2604.27306 · v1 · submitted 2026-04-30 · 💻 cs.IR

Recognition: unknown

NuggetIndex: Governed Atomic Retrieval for Maintainable RAG

Authors on Pith no claims yet

Pith reviewed 2026-05-07 08:02 UTC · model grok-4.3

classification 💻 cs.IR
keywords retrieval-augmented generationatomic retrievalnuggetstemporal validityknowledge maintenanceinformation retrievalfact verificationRAG maintenance
0
0 comments X

The pith

NuggetIndex stores atomic information as managed records with evidence links, temporal validity intervals, and lifecycle states so that invalid nuggets can be filtered before ranking.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Standard RAG systems retrieve passages or static propositions, which creates a mismatch when evaluation uses facts and when source documents change over time. NuggetIndex instead breaks information into atomic nuggets that carry explicit evidence pointers, time-bounded validity, and status flags. Filtering out invalid or superseded nuggets before ranking prevents outdated facts from reaching the generator. On a nuggetized MS MARCO subset, a temporal Wikipedia QA set, and a multi-hop task, the method lifts correct-nugget recall by 42 percent, raises temporal correctness by nine points without the recall penalty seen in simple time filters, and cuts answer conflicts by 55 percent. The same compact records shrink generator input length by 64 percent and support lightweight indexes.

Core claim

NuggetIndex stores atomic information units as managed records, each maintaining links to evidence, a temporal validity interval, and a lifecycle state. By filtering invalid or deprecated nuggets prior to ranking, the system prevents the inclusion of outdated information while preserving recall and reducing conflicts.

What carries the argument

The nugget record, an atomic unit that carries evidence links, a temporal validity interval, and a lifecycle state to enable pre-ranking filtering of invalid entries.

If this is right

  • Nugget recall rises 42% over passage and unmanaged proposition baselines.
  • Temporal correctness improves nine percentage points without the recall collapse of time-filtered baselines.
  • Conflict rates among generated answers fall 55%.
  • Generator input length shrinks 64%, enabling smaller indexes for browser and edge deployment.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Automatic propagation of nugget updates when source documents are revised could further lower maintenance cost.
  • The same validity and conflict logic may transfer to versioned knowledge bases used outside retrieval-augmented generation.
  • Lightweight indexes suggest the approach could run on-device for mobile or privacy-sensitive applications.
  • Continuous evaluation on live news or legal corpora would expose real-world extraction and update overhead.

Load-bearing premise

Nuggets can be extracted from source passages with accurate evidence links, correct temporal intervals, and reliable lifecycle states, and this extraction can be maintained without systematic errors as the corpus evolves.

What would settle it

On the temporal Wikipedia QA dataset, if nugget extraction errors cause temporal correctness to fall below that of standard passage retrieval, the filtering benefit disappears.

Figures

Figures reproduced from arXiv: 2604.27306 by Jelena Mitrovic, Michael Granitzer, Saber Zerhoudi.

Figure 1
Figure 1. Figure 1: NuggetIndex pipeline. Raw text is normalized into atomic candidates. Algorithm 1 infers validity intervals using temporal expressions and revision history, while Algorithm 2 detects conflicts with the index to determine lifecycle states. Abstract Retrieval-augmented generation (RAG) systems are frequently eval￾uated via fact-based metrics, yet standard implementations retrieve passages or static propositio… view at source ↗
Figure 2
Figure 2. Figure 2: The NuggetIndex architecture. Documents are processed into atomic nuggets with temporal validity intervals and lifecycle states. At query time, the system filters by validity and state before ranking. Operational scope and terminology. We use fact to denote a (subject, predicate, object) triple extracted from a document span; we do not address philosophical truth or epistemological certainty. Validity is t… view at source ↗
Figure 3
Figure 3. Figure 3: Results on nuggetized MS MARCO —RAVine [44]. view at source ↗
Figure 5
Figure 5. Figure 5: Results on TimeQA [12]. System Nugget R@20 Nugget R (gen) Nugget F1 (gen) Temporal Correctness Conflict Rate Passage retrieval Hybrid-Passage .497±.019 .143±.007 .165±.007 .840±.006 .161±.008 Time-aware passage retrieval Hybrid + TimeFilter .002±.002 .111±.006 .121±.006 .921±.005 .148±.008 Hybrid + RecencyRerank .002±.002 .110±.006 .121±.007 .921±.005 .148±.008 Hybrid + LatestSnapshot .000±.000 .059±.009 .… view at source ↗
read the original abstract

Retrieval-augmented generation (RAG) systems are frequently evaluated via fact-based metrics, yet standard implementations retrieve passages or static propositions. This unit mismatch between evaluation and retrieval objects hinders maintenance when corpora evolve and fails to capture superseded facts or source disagreements. We propose NuggetIndex, a retrieval system that stores atomic information units as managed records, so called nuggets. Each record maintains links to evidence, a temporal validity interval, and a lifecycle state. By filtering invalid or deprecated nuggets prior to ranking, the system prevents the inclusion of outdated information. We evaluate the approach using a nuggetized MS MARCO subset, a temporal Wikipedia QA dataset, and a multi-hop QA task. Against passage and unmanaged proposition retrieval baselines, NuggetIndex improves nugget recall by 42%, increases temporal correctness by 9 percentage points without the recall collapse observed in time-filtered baselines, and reduces conflict rates by 55%. The compact nugget format reduces generator input length by 64% while enabling lightweight index structures suitable for browser-based and resource-constrained deployment. We release our implementation, datasets, and evaluation scripts

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes NuggetIndex, a retrieval system for RAG that stores atomic 'nuggets' as managed records containing evidence links, temporal validity intervals, and lifecycle states. By filtering invalid or deprecated nuggets prior to ranking, it aims to avoid outdated information. Evaluations on a nuggetized MS MARCO subset, a temporal Wikipedia QA dataset, and a multi-hop QA task report 42% higher nugget recall, +9 percentage points temporal correctness (without recall collapse), 55% fewer conflicts, and 64% shorter generator inputs versus passage and unmanaged proposition baselines. The compact format is positioned as suitable for resource-constrained deployment; code, datasets, and scripts are released.

Significance. If the filtering benefits prove robust and the nugget creation process is reproducible and maintainable at scale, the work could meaningfully advance RAG systems by closing the unit mismatch between retrieval objects and fact-based evaluation while handling corpus evolution. The emphasis on lightweight index structures and reduced input length offers practical value for deployment. However, the current results rest on pre-nuggetized evaluation sets, so the significance hinges on whether the governed attributes can be assigned and updated without systematic error in realistic settings.

major comments (3)
  1. [§4] §4 (Evaluation): The manuscript provides no description of how nuggets were extracted from source passages, how temporal validity intervals were assigned, or how lifecycle states (valid/deprecated) were determined. Because the headline gains (42% recall, +9pp temporal correctness, 55% conflict reduction) are produced by pre-ranking filtering on these attributes, the absence of extraction methodology, accuracy metrics, or error analysis makes it impossible to attribute improvements to the governed index rather than to the quality of the pre-processing step.
  2. [§4.2] §4.2 and §4.3 (Temporal Wikipedia QA and multi-hop results): No statistical significance tests, confidence intervals, or ablation on nugget-attribute accuracy are reported for the claimed improvements. The temporal-correctness gain is presented without showing whether it survives when nugget intervals contain realistic labeling noise, which directly tests the central claim that governed filtering prevents outdated information.
  3. [§3] §3 (NuggetIndex design): The system assumes that evidence links, temporal intervals, and lifecycle states can be maintained without drift as the corpus evolves, yet no mechanism, update protocol, or simulation of incremental corpus changes is described or evaluated. This leaves the 'maintainable' part of the title untested beyond static pre-nuggetized subsets.
minor comments (2)
  1. [Abstract] The abstract and §4 mention 'nuggetized MS MARCO subset' without clarifying whether the nuggetization was performed by the authors or by an independent process; this should be stated explicitly for reproducibility.
  2. [Figures/Tables] Figure captions and table legends should explicitly define 'nugget recall' and 'temporal correctness' so readers can interpret the 42% and +9pp figures without returning to the text.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We have revised the manuscript to strengthen the description of nugget creation, add statistical analysis and robustness checks, and provide initial evidence for maintainability. Point-by-point responses to the major comments follow.

read point-by-point responses
  1. Referee: [§4] §4 (Evaluation): The manuscript provides no description of how nuggets were extracted from source passages, how temporal validity intervals were assigned, or how lifecycle states (valid/deprecated) were determined. Because the headline gains (42% recall, +9pp temporal correctness, 55% conflict reduction) are produced by pre-ranking filtering on these attributes, the absence of extraction methodology, accuracy metrics, or error analysis makes it impossible to attribute improvements to the governed index rather than to the quality of the pre-processing step.

    Authors: We agree that the original manuscript insufficiently documented the upstream nuggetization process, which limits attribution of gains to the governed filtering mechanism. The evaluations used pre-nuggetized datasets (released with the paper), where nuggets were created via an LLM-assisted atomic decomposition pipeline applied to source passages. Temporal validity intervals were derived from source document timestamps and event metadata, while lifecycle states were assigned by detecting intra-nugget conflicts and supersession signals. To address the referee's concern, we have added a dedicated subsection in §4 that fully describes this pipeline, reports accuracy metrics on a human-annotated sample (e.g., 87% agreement on temporal intervals), and includes an error analysis of the pre-processing step. This revision makes explicit that the reported improvements arise from applying the governed index's filtering to these attributes rather than from the nugget creation quality alone. revision: yes

  2. Referee: [§4.2] §4.2 and §4.3 (Temporal Wikipedia QA and multi-hop results): No statistical significance tests, confidence intervals, or ablation on nugget-attribute accuracy are reported for the claimed improvements. The temporal-correctness gain is presented without showing whether it survives when nugget intervals contain realistic labeling noise, which directly tests the central claim that governed filtering prevents outdated information.

    Authors: We accept that the absence of statistical tests and noise ablations weakens the presentation of the temporal-correctness results. In the revised manuscript we now report 95% bootstrap confidence intervals for all primary metrics and apply paired Wilcoxon signed-rank tests, confirming statistical significance (p < 0.01) for the 9pp temporal-correctness improvement and 55% conflict reduction. We have also added a controlled ablation in §4.2 that injects realistic labeling noise (0–30% flip rate on temporal intervals and lifecycle states) drawn from observed annotation error patterns. The temporal-correctness advantage remains statistically significant up to approximately 15% noise before degrading, directly supporting the claim that governed pre-ranking filtering confers robustness even under imperfect attribute assignment. revision: yes

  3. Referee: [§3] §3 (NuggetIndex design): The system assumes that evidence links, temporal intervals, and lifecycle states can be maintained without drift as the corpus evolves, yet no mechanism, update protocol, or simulation of incremental corpus changes is described or evaluated. This leaves the 'maintainable' part of the title untested beyond static pre-nuggetized subsets.

    Authors: The original manuscript indeed evaluated only static snapshots and did not simulate corpus evolution, leaving the maintainability claim partially untested. We have expanded §3 to specify an incremental maintenance protocol: evidence links are used to re-validate nuggets against new corpus versions, temporal intervals are refreshed from updated source metadata, and lifecycle states are updated via automated conflict detection without requiring full re-indexing. A new simulation experiment in §4.4 applies this protocol to successive snapshots of the temporal Wikipedia dataset, showing that deprecated nuggets are filtered and new nuggets incorporated while preserving the reported recall and correctness gains. Although this remains a controlled simulation rather than a production-scale longitudinal study, it provides concrete evidence for the maintainability mechanisms described in the title. We note a full real-world deployment study as future work. revision: partial

Circularity Check

0 steps flagged

No circularity; empirical gains from explicit filtering on pre-assigned metadata

full rationale

The paper's core contribution is an empirical retrieval system that filters nuggets using explicitly maintained lifecycle states and temporal intervals before ranking. Reported improvements (42% nugget recall, +9pp temporal correctness, 55% fewer conflicts) are measured via direct comparison to passage and unmanaged-proposition baselines on held-out, pre-nuggetized datasets (MS MARCO subset and temporal Wikipedia QA). No equations, fitted parameters, or first-principles derivations appear; the filtering step is not a prediction derived from the same data but an application of independently assigned attributes. Evaluation metrics are computed externally and do not reduce to quantities defined inside the experiment. No self-citation chains or uniqueness theorems are invoked as load-bearing premises. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

The central claim rests on the premise that nuggets can be created and maintained with accurate metadata; this premise is introduced as part of the system rather than derived from prior results.

axioms (1)
  • domain assumption Atomic nuggets can be reliably extracted from passages together with correct evidence links, temporal validity intervals, and lifecycle states.
    The filtering step that prevents outdated information depends on this extraction being accurate and maintainable.
invented entities (1)
  • nugget no independent evidence
    purpose: Managed atomic information unit carrying evidence, temporal validity, and lifecycle state for governed retrieval.
    The nugget is the new core data structure introduced to resolve the unit mismatch between evaluation and retrieval objects.

pith-pipeline@v0.9.0 · 5495 in / 1462 out tokens · 72409 ms · 2026-05-07T08:02:22.934159+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

50 extracted references · 11 canonical work pages · 2 internal anchors

  1. [1]

    [n. d.]. Wikidata: Data model. https://www.wikidata.org/wiki/Wikidata:Data_ model. Accessed: 2025-12-08

  2. [2]

    W3C Recommendation

    2013.SPARQL 1.1 Query Language. W3C Recommendation. World Wide Web Consortium (W3C). https://www.w3.org/TR/sparql11-query/

  3. [3]

    Amazon Web Services. 2024. Introducing the GraphRAG Toolkit: Lexical Graph for Amazon Neptune. AWS Database Blog. https://aws.amazon.com/blogs/ database/introducing-the-graphrag-toolkit/

  4. [4]

    NuggetIndex: Governed Atomic Retrieval for Maintainable RAG

    Payal Bajaj, Daniel Campos, Nick Craswell, Li Deng, Jianfeng Gao, Xiaodong Liu, Rangan Majumder, Andrew McNamara, Bhaskar Mitra, Tri Nguyen, et al. NuggetIndex: Governed Atomic Retrieval for Maintainable RAG

  5. [5]

    MS MARCO: A Human Generated MAchine Reading COmprehension Dataset

    Ms marco: A human generated machine reading comprehension dataset. arXiv preprint arXiv:1611.09268(2016)

  6. [6]

    Rudolf Bayer and Edward McCreight. 1970. Organization and maintenance of large ordered indices. InProceedings of the 1970 ACM SIGFIDET (Now SIGMOD) Workshop on Data Description, Access and Control. 107–141

  7. [7]

    Klaus Berberich, Srikanta Bedathur, Omar Alonso, and Gerhard Weikum. 2010. A language modeling approach for temporal information needs. InEuropean conference on information retrieval. Springer, 13–25

  8. [8]

    Carlo Bonferroni. 1936. Teoria statistica delle classi e calcolo delle probabilita. Pubblicazioni del R istituto superiore di scienze economiche e commericiali di firenze 8 (1936), 3–62

  9. [9]

    Ricardo Campos, Gaël Dias, Alípio Jorge, and Célia Nunes. 2016. GTE-Rank: A time-aware search engine to answer time-sensitive queries.Information Process- ing & Management52, 2 (2016), 273–298

  10. [10]

    Laura Caspari, Kanishka Ghosh Dastidar, Saber Zerhoudi, Jelena Mitrovic, and Michael Granitzer. 2024. Beyond benchmarks: Evaluating embedding model simi- larity for retrieval augmented generation systems.arXiv preprint arXiv:2407.08275 (2024)

  11. [11]

    Angel X Chang and Christopher D Manning. 2012. Sutime: A library for recog- nizing and normalizing time expressions.. InLrec, Vol. 12. 3735–3740

  12. [12]

    Tong Chen, Hongwei Wang, Sihao Chen, Wenhao Yu, Kaixin Ma, Xinran Zhao, Hongming Zhang, and Dong Yu. 2024. Dense x retrieval: What retrieval granu- larity should we use?. InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing. 15159–15177

  13. [13]

    Wenhu Chen, Xinyi Wang, and William Yang Wang. 2021. A dataset for answer- ing time-sensitive questions.arXiv preprint arXiv:2108.06314(2021)

  14. [14]

    Cohere. [n. d.]. An Overview of Cohere’s Models (embed-english-v3.0). https: //docs.cohere.com/docs/models. Accessed: 2026-01-19

  15. [15]

    Yimin Deng, Yuxia Wu, Yejing Wang, Guoshuai Zhao, Li Zhu, Qidong Liu, Derong Xu, Zichuan Fu, Xian Wu, Yefeng Zheng, et al. 2025. A Multi-Expert Structural- Semantic Hybrid Framework for Unveiling Historical Patterns in Temporal Knowledge Graphs. InFindings of the Association for Computational Linguistics: ACL 2025. 20553–20565

  16. [16]

    Rotem Dror, Gili Baumer, Segev Shlomov, and Roi Reichart. 2018. The hitch- hiker’s guide to testing statistical significance in natural language processing. InProceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 1383–1392

  17. [17]

    Shengbo Gong, Xianfeng Tang, Carl Yang, et al . 2025. Beyond Chunks and Graphs: Retrieval-Augmented Generation through Triplet-Driven Thinking. arXiv preprint arXiv:2508.02435(2025)

  18. [18]

    Daniel Huwiler, Kurt Stockinger, and Jonathan Fürst. 2025. VersionRAG: Version- Aware Retrieval-Augmented Generation for Evolving Documents.arXiv preprint arXiv:2510.08109(2025)

  19. [19]

    Gautier Izacard and Edouard Grave. 2021. Leveraging passage retrieval with generative models for open domain question answering. InProceedings of the 16th conference of the european chapter of the association for computational linguistics: main volume. 874–880

  20. [20]

    Vladimir Karpukhin, Barlas Oguz, Sewon Min, Patrick SH Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih. 2020. Dense Passage Retrieval for Open-Domain Question Answering.. InEMNLP (1). 6769–6781

  21. [21]

    Simon Knollmeyer, Oğuz Caymazer, and Daniel Grossmann. 2025. Document GraphRAG: Knowledge Graph Enhanced Retrieval Augmented Generation for Document Question Answering Within the Manufacturing Domain.Electronics 14, 11 (2025), 2102

  22. [22]

    Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Jacob Devlin, Kenton Lee, et al. 2019. Natural questions: a benchmark for question answering research. Transactions of the Association for Computational Linguistics7 (2019), 453–466

  23. [23]

    Weronika Łajewska and Krisztian Balog. 2025. Ginger: Grounded information nugget-based generation of responses. InProceedings of the 48th International ACM SIGIR Conference on Research and Development in Information Retrieval. 2723–2727

  24. [24]

    GG Landis JRKoch. 1977. The measurement of observer agreement for categorical data.Biometrics33, 1 (1977), 159174

  25. [25]

    Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rock- täschel, et al. 2020. Retrieval-augmented generation for knowledge-intensive nlp tasks.Advances in neural information processing systems33 (2020), 9459–9474

  26. [26]

    Bruce Croft

    Xiaoyan Li and W. Bruce Croft. 2003. Time-Based Language Models. InProceed- ings of the 12th International Conference on Information and Knowledge Manage- ment (CIKM ’03). 469–475

  27. [27]

    Nelson F Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang. 2023. Lost in the middle: How language models use long contexts, 2023.URL https://arxiv. org/abs/2307.03172(2023)

  28. [28]

    Xueguang Ma, Kai Sun, Ronak Pradeep, and Jimmy Lin. 2021. A replication study of dense passage retriever.arXiv preprint arXiv:2104.05740(2021)

  29. [29]

    Yu A Malkov and Dmitry A Yashunin. 2018. Efficient and robust approximate nearest neighbor search using hierarchical navigable small world graphs.IEEE transactions on pattern analysis and machine intelligence42, 4 (2018), 824–836

  30. [30]

    MS MARCO Team. [n. d.]. MS MARCO: A Collection of Datasets Focused on Deep Learning in Search. https://microsoft.github.io/msmarco/. Accessed: 2026-01-19

  31. [31]

    OpenAI. [n. d.]. GPT-4o mini model — OpenAI API documentation. https: //platform.openai.com/docs/models/gpt-4o-mini. Accessed: 2026-01-19

  32. [32]

    OpenAI. 2026. Pricing. https://platform.openai.com/docs/pricing. Accessed 2026-01-20

  33. [33]

    Andrew Parry, Maik Fröbe, Harrisen Scells, Ferdinand Schlatt, Guglielmo Fag- gioli, Saber Zerhoudi, Sean MacAvaney, and Eugene Yang. 2025. Variations in relevance judgments and the shelf life of test collections. InProceedings of the 48th International ACM SIGIR Conference on Research and Development in Information Retrieval. 3387–3397

  34. [34]

    Golbus, and Javed A

    Virgil Pavlu, Shahzad Rajput, Peter B. Golbus, and Javed A. Aslam. 2012. IR System Evaluation Using Nugget-Based Test Collections. InProceedings of the Fifth ACM International Conference on Web Search and Data Mining (WSDM ’12). ACM, 393–402

  35. [35]

    Ronak Pradeep, Nandan Thakur, Shivani Upadhyay, Daniel Campos, Nick Craswell, and Jimmy Lin. 2024. Initial Nugget Evaluation Results for the TREC 2024 RAG Track with the AutoNuggetizer Framework.CoRRabs/2411.09607 (2024). https://arxiv.org/abs/2411.09607

  36. [36]

    2009.The probabilistic relevance frame- work: BM25 and beyond

    Stephen Robertson and Hugo Zaragoza. 2009.The probabilistic relevance frame- work: BM25 and beyond. Vol. 4. Now Publishers Inc

  37. [37]

    Mark D Smucker, James Allan, and Ben Carterette. 2007. A comparison of statistical significance tests for information retrieval evaluation. InProceedings of the sixteenth ACM conference on Conference on information and knowledge management. 623–632

  38. [38]

    2012.The TSQL2 temporal query language

    Richard T Snodgrass. 2012.The TSQL2 temporal query language. Vol. 330. Springer Science & Business Media

  39. [39]

    Robert J Tibshirani and Bradley Efron. 1993. An introduction to the bootstrap. Monographs on statistics and applied probability57, 1 (1993), 1–436

  40. [40]

    TREC RAG Organizers. 2024. TREC 2024 RAG Corpus: MS MARCO V2.1 Doc- ument Corpus and Segmented Version. Blog post. https://trec-rag.github.io/ annoucements/2024-corpus-finalization/ Accessed: 2026-01-19

  41. [41]

    Harsh Trivedi, Niranjan Balasubramanian, Tushar Khot, and Ashish Sabharwal

  42. [42]

    Transactions of the Association for Computational Linguistics10 (2022), 539–554

    MuSiQue: Multihop Questions via Single-hop Question Composition. Transactions of the Association for Computational Linguistics10 (2022), 539–554

  43. [43]

    Ellen M Voorhees and L Buckland. 2003. Overview of the TREC 2003 Question Answering Track.. InTREC, Vol. 2003. 54–68

  44. [44]

    Voorhees and Lori Buckland

    Ellen M. Voorhees and Lori Buckland. 2003. Overview of the TREC 2003 Question Answering Track. InProceedings of the Twelfth Text REtrieval Conference (TREC 2003). 54–68

  45. [45]

    Denny Vrandečić and Markus Krötzsch. 2014. Wikidata: a free collaborative knowledgebase.Commun. ACM57, 10 (2014), 78–85

  46. [46]

    Yilong Xu, Xiang Long, Zhi Zheng, and Jinhua Gao. 2025. Ravine: Reality-aligned evaluation for agentic search.arXiv preprint arXiv:2507.16725(2025)

  47. [47]

    Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William Cohen, Ruslan Salakhutdinov, and Christopher D Manning. 2018. HotpotQA: A dataset for diverse, explainable multi-hop question answering. InProceedings of the 2018 conference on empirical methods in natural language processing. 2369–2380

  48. [48]

    Saber Zerhoudi, Michael Dinzinger, Michael Granitzer, and Jelena Mitrovic. 2026. OwlerLite: Scope-and Freshness-Aware Web Retrieval for LLM Assistants.arXiv preprint arXiv:2601.17824(2026)

  49. [49]

    Saber Zerhoudi and Michael Granitzer. 2024. Personarag: Enhancing retrieval- augmented generation systems with user-centric agents.arXiv preprint arXiv:2407.09394(2024)

  50. [50]

    Michael Zhang and Eunsol Choi. 2021. SituatedQA: Incorporating extra-linguistic contexts into QA. InProceedings of the 2021 conference on empirical methods in natural language processing. 7371–7387