From Unstructured Recall to Schema-Grounded Memory: Reliable AI Memory via Iterative, Schema-Aware Extraction
Pith reviewed 2026-05-07 06:53 UTC · model grok-4.3
The pith
Reliable AI memory requires schemas that guide iterative extraction and validation at write time rather than text retrieval at read time.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that reliable external AI memory must be schema-grounded. Schemas define what must be remembered, what may be ignored, and which values must never be inferred. The authors present an iterative, schema-aware write path that decomposes memory ingestion into object detection, field detection, and field-value extraction, with validation gates, local retries, and stateful prompt control. The result shifts interpretation from the read path to the write path: reads become constrained queries over verified records rather than repeated inference over retrieved prose. On the end-to-end memory benchmark the system reaches 97.10% F1 compared with 80.16%–87.24% for baselines; on the application-level task it reaches 95.2% accuracy, outperforming specialised memory systems and frontier-model application harnesses.
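The paper's actual schema format is not reproduced in this review. As a hypothetical illustration of the three constraint classes the claim names (must be remembered, may be ignored, must never be inferred), a memory schema might look like the following sketch; the object name, field names, and flag names here are assumptions, not the paper's notation:

```python
# Hypothetical memory schema. The three constraint classes from the
# core claim map onto: required fields (must be remembered), optional
# fields (may be ignored), and an explicit unknown marker for values
# that must never be inferred.
USER_PROFILE_SCHEMA = {
    "object": "user_profile",
    "fields": {
        # Must be remembered: extraction is mandatory when stated.
        "home_city": {"required": True, "infer": False},
        "dietary_restrictions": {"required": True, "infer": False},
        # May be ignored: stored only if the user states it explicitly.
        "favorite_color": {"required": False, "infer": False},
    },
    # A value that cannot be grounded in the input stays an explicit
    # unknown rather than being guessed by the model.
    "unknown_marker": "UNKNOWN",
}
```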
What carries the argument
Iterative schema-aware write path that decomposes ingestion into object detection, field detection, field-value extraction with validation gates, local retries, and stateful prompt control.
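A minimal sketch of such a write path, with the model call stubbed out behind a callback. The function names, gate logic (accepting only values literally present in the input), and retry budget are assumptions for illustration, not the paper's implementation:

```python
from typing import Callable, Optional

def ingest(text: str,
           schema: dict,
           extract_value: Callable[[str, str], Optional[str]],
           max_retries: int = 2) -> dict:
    """Iterative schema-aware write path (sketch): for each schema field,
    extract a candidate value, pass it through a validation gate, and
    retry locally on failure instead of failing the whole record."""
    record = {}
    for field in schema["fields"]:
        value = None
        for _ in range(max_retries + 1):
            candidate = extract_value(text, field)
            # Validation gate: accept only values literally grounded in
            # the input text, so nothing is silently inferred.
            if candidate is not None and candidate in text:
                value = candidate
                break
        if value is None:
            # Gate never passed: store an explicit unknown, not a guess.
            record[field] = schema.get("unknown_marker", "UNKNOWN")
        else:
            record[field] = value
    return record
```

With a stub extractor that hallucinates an employer not present in the input, the gate rejects the inferred value and the record carries an explicit unknown instead.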
If this is right
- Memory operations such as updates, deletions, aggregations, and negative queries become reliable because they operate on verified records rather than inferred text.
- On the structured extraction benchmark, object-level accuracy reaches 90.42% and output accuracy 62.67%, above all tested frontier structured-output baselines.
- End-to-end memory performance reaches 97.10% F1, exceeding third-party baselines that range from 80.16% to 87.24%.
- Application-level tasks reach 95.2% accuracy, outperforming specialised memory systems, Markdown harnesses, and frontier-model application harnesses.
- For workloads that require stable facts and stateful computation, architecture and schema design matter more than retrieval scale or model strength alone.
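The first bullet can be made concrete: once writes are validated, aggregations and negative queries become ordinary queries over records rather than model inference over retrieved prose. The record shape below is assumed for illustration, not the paper's storage format:

```python
# Verified records as a schema-grounded write path might produce them
# (shape is illustrative).
records = [
    {"object": "subscription", "name": "NewsPlus", "status": "cancelled", "price": 10.0},
    {"object": "subscription", "name": "StreamCo", "status": "active", "price": 15.0},
    {"object": "subscription", "name": "CloudBox", "status": "active", "price": 5.0},
]

# Aggregation: total monthly spend on active subscriptions.
active_total = sum(r["price"] for r in records if r["status"] == "active")

# Negative query: which subscriptions are NOT active? Answerable exactly
# because absence is checked against verified state, not re-inferred
# from retrieved text.
not_active = [r["name"] for r in records if r["status"] != "active"]
```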
Where Pith is reading between the lines
- Agents that run for many turns could maintain consistent long-term state by catching inconsistencies at ingestion rather than accumulating retrieval errors.
- Investing effort in domain-specific schemas and write-time validation may deliver higher reliability than further scaling of retrieval indices or model size.
- Domains with regulatory constraints on what may be inferred could adopt the same validation gates to enforce explicit unknowns and prevent over-inference.
- The design invites direct comparison with traditional database systems adapted for LLM-driven updates, where the same separation of write validation and read queries already exists.
Load-bearing premise
The schemas supplied correctly capture all relevant facts and constraints for the target domains, and the validation gates do not systematically reject valid information or accept invalid information in ways that bias the downstream results.
What would settle it
Run the system on a domain whose schemas are deliberately incomplete for critical facts and measure whether benchmark F1 or application accuracy drops below the reported levels while error rates on state updates rise.
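The measurement side of such a test can be sketched with a simple metric: micro-F1 over (field, value) pairs of an extracted record against a gold record. This is one plausible scoring choice, not necessarily the benchmark's exact F1 definition:

```python
def field_f1(predicted: dict, gold: dict) -> float:
    """Micro-F1 over (field, value) pairs for a single extracted record."""
    pred_pairs = set(predicted.items())
    gold_pairs = set(gold.items())
    tp = len(pred_pairs & gold_pairs)  # exact field-value matches
    if tp == 0:
        return 0.0
    precision = tp / len(pred_pairs)
    recall = tp / len(gold_pairs)
    return 2 * precision * recall / (precision + recall)
```

Running this under complete versus deliberately incomplete schemas, and tracking how the score moves alongside state-update error rates, would separate the contribution of the architecture from that of schema provision.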
Original abstract
Persistent AI memory is often reduced to a retrieval problem: store prior interactions as text, embed them, and ask the model to recover relevant context later. This design is useful for thematic recall, but it is mismatched to the kinds of memory that agents need in production: exact facts, current state, updates and deletions, aggregation, relations, negative queries, and explicit unknowns. These operations require memory to behave less like search and more like a system of record. This paper argues that reliable external AI memory must be schema-grounded. Schemas define what must be remembered, what may be ignored, and which values must never be inferred. We present an iterative, schema-aware write path that decomposes memory ingestion into object detection, field detection, and field-value extraction, with validation gates, local retries, and stateful prompt control. The result shifts interpretation from the read path to the write path: reads become constrained queries over verified records rather than repeated inference over retrieved prose. We evaluate this design on structured extraction and end-to-end memory benchmarks. On the extraction benchmark, the judge-in-the-loop configuration reaches 90.42% object-level accuracy and 62.67% output accuracy, above all tested frontier structured-output baselines. On our end-to-end memory benchmark, xmemory reaches 97.10% F1, compared with 80.16%-87.24% across the third-party baselines. On the application-level task, xmemory reaches 95.2% accuracy, outperforming specialised memory systems, code-generated Markdown harnesses, and customer-facing frontier-model application harnesses. The results show that, for memory workloads requiring stable facts and stateful computation, architecture matters more than retrieval scale or model strength alone.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper argues that reliable AI memory requires schema-grounded storage rather than unstructured text retrieval, as the latter fails on exact facts, state updates, aggregations, and negative queries. It introduces an iterative schema-aware write path that decomposes ingestion into object detection, field detection, value extraction, validation gates, local retries, and stateful prompt control. Reads are then reduced to constrained queries over verified records. Evaluation on a structured extraction benchmark shows 90.42% object-level accuracy and 62.67% output accuracy (judge-in-the-loop) above frontier baselines; an end-to-end memory benchmark yields 97.10% F1 versus 80.16%–87.24% for third-party systems; and an application-level task reaches 95.2% accuracy, outperforming specialized memory systems and frontier-model harnesses. The central conclusion is that architecture matters more than retrieval scale or model strength for stateful memory workloads.
Significance. If the performance gains are attributable to the iterative write-path design rather than schema alignment alone, the work offers a practical engineering shift from RAG-style memory to structured systems of record. This could matter for agentic applications that require stable facts, updates/deletions, and precise stateful computation, moving memory design from retrieval optimization toward verifiable data models.
major comments (3)
- [§4.2] §4.2 (end-to-end memory benchmark): the 97.10% F1 result is reported without rejection rates, false-negative rates on valid extractions, or any control condition using deliberately incomplete or mismatched schemas. Because the benchmarks supply complete, task-matched schemas up front, it is impossible to determine whether the gains derive from the iterative extraction architecture or from schema provision plus gate filtering; this directly undermines the claim that the write-path design is the decisive factor.
- [§4.1] §4.1 (structured extraction benchmark): the 90.42% object-level and 62.67% output accuracies are presented without stating whether schema definitions and data splits were fixed before any results were inspected or whether post-hoc refinement occurred. In the absence of this information, the superiority over frontier structured-output baselines cannot be confidently attributed to the method rather than evaluation design choices.
- [§4] §4 (baseline comparisons): the paper does not detail how (or whether) equivalent schemas were supplied to the third-party memory systems, code-generated Markdown harnesses, and customer-facing frontier-model harnesses. If xmemory alone receives explicit schema guidance while baselines operate under weaker or absent schema constraints, the reported accuracy gaps (95.2% vs. lower scores) cannot be interpreted as evidence that architecture outperforms retrieval scale or model strength.
minor comments (3)
- [Title and abstract] The system is referred to as 'xmemory' in the abstract and results but is not named in the title; adding the name or a short system description to the title would improve discoverability.
- [Method section] The description of the iterative loop (object detection → field detection → validation gates → retry) would be clearer with a short pseudocode listing or state-transition diagram in the method section.
- [Evaluation tables/figures] Table or figure captions for the benchmark results should explicitly list the exact schemas and judge prompts used, or provide a pointer to the supplementary material containing them.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive comments on our evaluation methodology. We have revised the manuscript to improve transparency around schema provision, pre-specification of evaluation parameters, and additional metrics. Below we respond to each major comment.
Point-by-point responses
Referee: [§4.2] §4.2 (end-to-end memory benchmark): the 97.10% F1 result is reported without rejection rates, false-negative rates on valid extractions, or any control condition using deliberately incomplete or mismatched schemas. Because the benchmarks supply complete, task-matched schemas up front, it is impossible to determine whether the gains derive from the iterative extraction architecture or from schema provision plus gate filtering; this directly undermines the claim that the write-path design is the decisive factor.
Authors: We agree that rejection rates and false-negative rates should have been reported. The revised §4.2 now includes these values computed from our experimental logs. However, the benchmark was intentionally scoped to complete, task-matched schemas because that matches the target use case of schema-grounded memory as a verified system of record. We have added text clarifying this design choice and explaining why incomplete-schema controls fall outside the current evaluation scope. We maintain that the performance gap versus retrieval baselines (which receive equivalent schema information where possible) supports the contribution of the iterative write path. revision: partial
Referee: [§4.1] §4.1 (structured extraction benchmark): the 90.42% object-level and 62.67% output accuracies are presented without stating whether schema definitions and data splits were fixed before any results were inspected or whether post-hoc refinement occurred. In the absence of this information, the superiority over frontier structured-output baselines cannot be confidently attributed to the method rather than evaluation design choices.
Authors: The schema definitions were derived directly from the benchmark specification and the data splits were determined via a fixed, deterministic procedure before any model runs or result inspection occurred. No post-hoc refinement of schemas or splits took place. We have added an explicit statement in the revised §4.1 confirming this pre-specification protocol. revision: yes
Referee: [§4] §4 (baseline comparisons): the paper does not detail how (or whether) equivalent schemas were supplied to the third-party memory systems, code-generated Markdown harnesses, and customer-facing frontier-model harnesses. If xmemory alone receives explicit schema guidance while baselines operate under weaker or absent schema constraints, the reported accuracy gaps (95.2% vs. lower scores) cannot be interpreted as evidence that architecture outperforms retrieval scale or model strength.
Authors: Equivalent schema information was supplied to every baseline that supports structured input. Third-party memory systems and frontier harnesses received the schemas through their native structured interfaces; Markdown harnesses received schema content translated into detailed prompt instructions. The revised §4 now contains a dedicated paragraph and table that documents the exact schema provision method used for each baseline, making the comparison transparent. revision: yes
- Not provided: control experiment with deliberately incomplete or mismatched schemas on the end-to-end memory benchmark (no such results were generated, as the evaluation was scoped to complete schemas)
Circularity Check
No circularity: empirical engineering evaluation on external benchmarks
full rationale
The paper describes an iterative schema-aware extraction architecture for AI memory and reports empirical results on structured extraction and end-to-end memory benchmarks (90.42% object-level accuracy, 97.10% F1, 95.2% accuracy). No equations, fitted parameters, predictions derived from those parameters, or self-citations appear in the abstract or evaluation description. The central claims rest on direct comparison to third-party baselines rather than any reduction of outputs to inputs by definition or construction. The method is presented as an engineering design whose performance is measured externally, with no load-bearing self-referential steps.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: LLMs can be prompted to perform object detection, field detection, and value extraction when given an explicit schema.
- domain assumption: Validation gates and local retries can correct extraction errors without introducing new systematic bias.
Reference graph
Works this paper leans on
- [1] Yuyang Hu, Shichun Liu, Yanwei Yue, Guibin Zhang, Boyang Liu, Fangyi Zhu, Jiahang Lin, Honglin Guo, Shihan Dou, Zhiheng Xi, et al. Memory in the Age of AI Agents. arXiv, 2025.
- [2] Yifan Du, Chongyang Huang, Wayne Xin Zhao, Ji-Rong Wen, et al. Memory operations in large language models: A survey (Rethinking memory in AI: taxonomy, operations, topics, and future directions). arXiv preprint arXiv:2505.00675, 2025. doi:10.48550/arXiv.2505.00675. https://arxiv.org/abs/2505.00675
- [4] Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, and Douwe Kiela. Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. In Advances in Neural Information Processing Systems (NeurIPS), 2020.
- [5] Vladimir Karpukhin, Barlas Oguz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih. Dense Passage Retrieval for Open-Domain Question Answering. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2020. https://arxiv.org/abs/2004.04906
- [6] Prateek Chhikara, Dev Khant, Saket Aryan, Taranjeet Singh, and Deshraj Yadav. Mem0: Building Production-Ready AI Agents with Scalable Long-Term Memory. arXiv preprint arXiv:2504.19413, 2025. doi:10.48550/arXiv.2504.19413. https://arxiv.org/abs/2504.19413
- [7] Nils Reimers and Iryna Gurevych. Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. In Proceedings of EMNLP-IJCNLP, 2019. https://arxiv.org/abs/1908.10084
- [8] Nelson F. Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang. Lost in the Middle: How Language Models Use Long Contexts. Transactions of the Association for Computational Linguistics (TACL), 2024. doi:10.1162/tacl_a_00449. https://arxiv.org/abs/2307.03172
- [9] Darshan Deshpande, Varun Gangal, Hersh Mehta, Anand Kannappan, Rebecca Qian, and Peng Wang. MEMTRACK: Evaluating Long-Term Memory and State Tracking in Multi-Platform Dynamic Agent Environments. arXiv preprint arXiv:2510.01353, 2025. doi:10.48550/arXiv.2510.01353. https://arxiv.org/abs/2510.01353
- [10] Adyasha Maharana, Dong-Ho Lee, Sergey Tulyakov, Mohit Bansal, Francesco Barbieri, and Yuwei Fang. LoCoMo: Evaluating Very Long-Term Conversational Memory of LLM Agents. arXiv preprint arXiv:2402.17753, 2024. doi:10.48550/arXiv.2402.17753. https://arxiv.org/abs/2402.17753
- [11] Thomas M. Cover and Joy A. Thomas. Elements of Information Theory. Wiley, 2nd edition. https://onlinelibrary.wiley.com/doi/book/10.1002/047174882X
- [13] Claude E. Shannon. A Mathematical Theory of Communication. The Bell System Technical Journal, 27(3):379–423, 1948.
- [14] Shizhe He, Avanika Narayan, Ishan S. Khare, Scott W. Linderman, Christopher Ré, and Dan Biderman. An Information Theoretic Perspective on Agentic System Design. arXiv preprint arXiv:2512.21720, 2025. doi:10.48550/arXiv.2512.21720. https://arxiv.org/abs/2512.21720
- [15] Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Ves Stoyanov, and Luke Zettlemoyer. BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (ACL), 2020.
- [17] Adam Tauman Kalai, Ofir Nachum, Santosh S. Vempala, and Edwin Zhang. Why Language Models Hallucinate. arXiv preprint arXiv:2509.04664, 2025. doi:10.48550/arXiv.2509.04664. https://arxiv.org/abs/2509.04664
- [18] Rodrigo Nogueira and Kyunghyun Cho. Passage Re-ranking with BERT. arXiv preprint arXiv:1901.04085, 2019. https://arxiv.org/abs/1901.04085
- [19] Muhammad Ahmed Mohsin, Muhammad Umer, Ahsan Bilal, Zeeshan Memon, Muhammad Ibtsaam Qadir, Sagnik Bhattacharya, Hassan Rizwan, Abhiram R. Gorle, Maahe Zehra Kazmi, Ayesha Mohsin, Muhammad Usman Rafique, Zihao He, Pulkit Mehta, Muhammad Ali Jamshed, and John M. Cioffi. On the Fundamental Limits of LLMs at Scale. arXiv preprint arXiv:2511.12869, 2025. https://arxiv.org/abs/2511.12869
- [20] Darren Edge, Ha Trinh, Newman Cheng, Joshua Bradley, Alex Chao, Apurva Mody, Steven Truitt, and Jonathan Larson. From Local to Global: A Graph RAG Approach to Query-Focused Summarization. arXiv preprint arXiv:2404.16130, 2024. https://arxiv.org/abs/2404.16130
- [21] Fangyu Lei, Jixuan Chen, Yuxiao Ye, Ruisheng Cao, Dongchan Shin, Hongjin Su, Zhaoqing Suo, Hongcheng Gao, Wenjing Hu, Pengcheng Yin, Victor Zhong, Caiming Xiong, Ruoxi Sun, Qian Liu, Sida Wang, and Tao Yu. Spider 2.0: Evaluating Language Models on Real-World Enterprise Text-to-SQL Workflows. 2024. https://arxiv.org/abs/2411.07763. ICLR 2025 Oral.
- [22] Derong Xu, Wei Chen, Wenjun Peng, Chao Zhang, Tong Xu, Xiangyu Zhao, Xian Wu, Yefeng Zheng, Yang Wang, and Enhong Chen. Large Language Models for Generative Information Extraction: A Survey. arXiv preprint arXiv:2312.17617, 2024. https://arxiv.org/abs/2312.17617
- [23] Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, Shashank Gupta, Bodhisattwa Prasad Majumder, Katherine Hermann, Sean Welleck, Amir Yazdanbakhsh, and Peter Clark. Self-Refine: Iterative Refinement with Self-Feedback. arXiv preprint arXiv:2303.17651, 2023.
- [24] Elliot Meyerson, Giuseppe Paolo, Roberto Dailey, Hormoz Shahrzad, Olivier Francon, Conor F. Hayes, Xin Qiu, Babak Hodjat, and Risto Miikkulainen. Solving a Million-Step LLM Task with Zero Errors. arXiv preprint arXiv:2511.09030, 2025. doi:10.48550/arXiv.2511.09030. https://arxiv.org/abs/2511.09030
- [25] John Dagdelen, Alexander Dunn, Sanghoon Lee, Nicholas Walker, Andrew S. Rosen, Gerbrand Ceder, Kristin A. Persson, and Anubhav Jain. Structured Information Extraction from Scientific Text with Large Language Models. Nature Communications, 15(1):1418, 2024. https://www.nature.com/articles/s41467-024-45563-x
- [26] Haolun Wu, Ye Yuan, Liana Mikaelyan, Alexander Meulemans, Xue Liu, James Hensman, and Bhaskar Mitra. Learning to Extract Structured Entities Using Language Models. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 6817–6834. Association for Computational Linguistics, 2024.
- [27] Yujie Luo, Xiangyuan Ru, Kangwei Liu, Lin Yuan, Mengshu Sun, Ningyu Zhang, Lei Liang, Zhiqiang Zhang, Jun Zhou, Lanning Wei, Da Zheng, Haofen Wang, and Huajun Chen. OneKE: A Dockerized Schema-Guided LLM Agent-Based Knowledge Extraction System. In Companion Proceedings of the ACM Web Conference 2025 (WWW Companion '25), 2025. doi:10.1145/3701716.3715189
- [28] Kanghee Park, Jiayu Wang, Taylor Berg-Kirkpatrick, Nadia Polikarpova, and Loris D'Antoni. Grammar-Aligned Decoding. In Advances in Neural Information Processing Systems (NeurIPS), 2024. https://arxiv.org/abs/2405.21047
- [29] Peter Buneman, Sanjeev Khanna, and Wang-Chiew Tan. Why and Where: A Characterization of Data Provenance. In International Conference on Database Theory (ICDT), 2001. https://homepages.inf.ed.ac.uk/opb/papers/ICDT2001.pdf
- [30] Brandon Hong et al. Context Rot: How Increasing Input Tokens Impacts LLM Performance. Technical report, Chroma, 2025. https://research.trychroma.com/context-rot
- [31] YAML Language Development Team. YAML Ain't Markup Language (YAML) Version 1.2.2. https://yaml.org/spec/1.2.2/, 2021.
- [32] JSON Schema Authors. JSON Schema: A Media Type for Describing JSON Documents (draft 2020-12). https://json-schema.org/draft/2020-12/json-schema-core, 2020.
- [33] Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. ReAct: Synergizing Reasoning and Acting in Language Models. arXiv preprint arXiv:2210.03629, 2022. https://arxiv.org/abs/2210.03629
- [34] Uta Störl, Meike Klettke, and Stefanie Scherzinger. NoSQL Schema Evolution and Data Migration. In Proceedings of the 23rd International Conference on Extending Database Technology (EDBT), 2020. https://openproceedings.org/2020/conf/edbt/paper_T4.pdf
- [35] Alberto Hernández Chillón, Meike Klettke, Diego Sevilla Ruiz, and Jesús García Molina. A Generic Schema Evolution Approach for NoSQL and Relational Databases. IEEE Transactions on Knowledge and Data Engineering, 2024.
- [36] Hui Wen Goh and Jonas Mueller. LLM Structured Output Benchmarks Are Riddled with Mistakes. 2025. https://cleanlab.ai/blog/structured-output-benchmark/. Accessed 2026-04-16.
- [37] Topoteretes. Cognee GitHub repository and README. https://github.com/topoteretes/cognee, 2026. Accessed 2026-04-22.
- [38] Mem0. Mem0 Documentation: Build with Mem0. https://docs.mem0.ai/introduction
- [39] Supermemory. Supermemory Documentation: Overview — What is Supermemory? https://supermemory.ai/docs/intro, 2026. Accessed 2026-04-22.
- [40] Zep. Zep documentation and platform overview. https://www.getzep.com, 2026. Accessed 2026-04-24.
- [41] Di Wu, Hongwei Wang, Wenhao Yu, Yuwei Zhang, Kai-Wei Chang, and Dong Yu. LongMemEval: Benchmarking Chat Assistants on Long-Term Interactive Memory. arXiv preprint arXiv:2410.10813, 2024. doi:10.48550/arXiv.2410.10813. https://arxiv.org/abs/2410.10813
- [42] Letta. Benchmarking AI Agent Memory. https://www.letta.com/blog/benchmarking-ai-agent-memory, 2026. Accessed 2026-04-24.
- [43] snap-research and community contributors. LoCoMo issue discussion: dataset label quality estimate. https://github.com/snap-research/locomo/issues/27#issuecomment-3921992262, 2025. Accessed 2026-04-24.