MemoryDocDataSet: A Benchmark for Joint Conversational Memory and Long Document Reasoning

Jialun Wu; Qiyang Xie; Shuai Xiao; Su Liu; Weikai Zhou; Xinjie He; Zhiyuan Lin

arxiv: 2606.04442 · v1 · pith:TD6VEB7Jnew · submitted 2026-06-03 · 💻 cs.CL · cs.AI

MemoryDocDataSet: A Benchmark for Joint Conversational Memory and Long Document Reasoning

Qiyang Xie , Jialun Wu , Xinjie He , Su Liu , Shuai Xiao , Zhiyuan Lin , Weikai Zhou This is my paper

Pith reviewed 2026-06-28 06:54 UTC · model grok-4.3

classification 💻 cs.CL cs.AI

keywords benchmark datasetconversational memorylong document reasoninghybrid questionsretrieval augmented generationmulti-session conversationsjoint reasoning

0 comments

The pith

A new benchmark shows retrieval systems cannot jointly handle conversational memory and long document navigation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper creates MemoryDocDataSet to test AI systems on tasks that require both navigating multi-session conversation history and performing deep reading within long documents. Existing benchmarks address only one capability at a time. The dataset builds 50 micro-worlds containing personas, temporal event graphs, real long legal documents, grounded conversations, and question sets dominated by hybrid items. These hybrid items force a system to first recall from memory which document is relevant before extracting the answer. Baseline tests across truncated context, long-context models, and retrieval methods expose a clear performance gap on the hybrid items that separate document retrieval from memory use.

Core claim

MemoryDocDataSet comprises 50 micro-worlds, each with 3-5 personas, a months-long temporal event graph, 3-5 long documents of 20,000-50,000 tokens from the Caselaw Access Project, multi-session conversations, and 20 QA pairs. Seventy-five point one percent of questions carry the Hybrid tag, meaning they require first traversing conversation history to identify the relevant document and only then reading inside that document for the answer. Six baseline configurations were run; the strongest reaches 0.358 overall F1 while document-only retrieval falls to 0.267 on Hybrid questions despite scoring 0.453 on document-only questions.

What carries the argument

The Hybrid source tag, which marks questions that demand sequential use of conversation history to select the correct long document before performing extraction inside it.

If this is right

Systems must combine conversational memory mechanisms with document retrieval rather than running them as independent modules.
Success on pure document retrieval tasks does not transfer to tasks where memory first selects the document.
New model designs that unify memory navigation with long-context reading are needed to close the observed gap.
The released generation pipeline supplies a repeatable way to create additional test instances for measuring progress on the joint task.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The dataset could serve as a testbed for evaluating whether simply increasing context length closes the hybrid gap or whether explicit memory structures remain necessary.
Because the long documents come from legal case law, the benchmark may highlight limitations in current legal AI tools that handle ongoing client discussions.
The construction method might be reused to generate similar joint-reasoning tests in other domains that pair extended records with repeated user exchanges.

Load-bearing premise

The synthetic micro-worlds, event graphs, and hybrid questions generated by the pipeline reflect the joint memory-plus-document challenge that occurs in real user interactions without introducing artificial ease or difficulty.

What would settle it

Measuring the same baselines on a collection of genuine multi-session conversations attached to long documents and checking whether hybrid-style questions still produce a comparable performance collapse relative to document-only questions.

Figures

Figures reproduced from arXiv: 2606.04442 by Jialun Wu, Qiyang Xie, Shuai Xiao, Su Liu, Weikai Zhou, Xinjie He, Zhiyuan Lin.

**Figure 2.** Figure 2: Seven-stage generation and evaluation pipeline. Each stage is independently reproducible; the pipeline [PITH_FULL_IMAGE:figures/full_fig_p007_2.png] view at source ↗

read the original abstract

AI systems increasingly need to combine two demanding capabilities: navigating multi-session conversation history and performing deep reading comprehension within long documents. Yet no existing benchmark evaluates both simultaneously. We introduce MemoryDocDataSet, a synthetic benchmark of 50 micro-worlds and 1,000 QA pairs in which each instance comprises 3-5 personas, a temporal event graph spanning months of activity, 3-5 real long documents (20,000-50,000 tokens each sourced from the Caselaw Access Project), multi-session conversations grounded on those documents, and 20 question-answer pairs across five reasoning categories. The defining feature is the Hybrid source tag: questions requiring a system to first navigate conversation history to identify which document is relevant, then extract the answer from within that document. Hybrid questions account for 75.1% of the dataset. Dataset quality is characterised through a prompt-sensitivity self-consistency analysis using LLM-as-judge, yielding a median Cohen's $\kappa = 0.634$ across all 50 micro-worlds. We evaluate six baseline configurations spanning truncated context, long-context LLMs, retrieval-augmented generation (RAG), and memory systems. The best baseline (RAG-Both) achieves 0.358 overall F1 and 0.342 on Hybrid. Document-only retrieval (RAG-Doc) collapses to 0.267 on Hybrid despite achieving 0.453 on Doc-only questions, demonstrating a clear joint-retrieval gap that motivates architectures unifying conversational memory with long-document navigation. We release the dataset, generation pipeline, and all baseline implementations.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

MemoryDocDataSet gives a concrete synthetic benchmark for joint chat memory and long-document tasks, with a clear baseline gap on hybrid questions, but the generation process leaves room for artifacts.

read the letter

The paper introduces MemoryDocDataSet with 50 micro-worlds built from temporal event graphs, personas, multi-session conversations, and real caselaw documents. The key addition is the Hybrid question type, which requires first using conversation history to select the relevant document then extracting the answer from it. They release the full dataset, pipeline, and baseline code.

What stands out is the empirical demonstration that document-only RAG scores 0.453 F1 on doc-only questions but falls to 0.267 on Hybrid, while the combined RAG-Both reaches only 0.358 overall. This shows a measurable joint-retrieval shortfall in current methods.

The main limitation is the fully synthetic construction. Questions are generated through LLM pipelines on event graphs, so it is possible that spurious cues in the question text or graph allow partial solving without true joint navigation. The reported median Cohen's κ of 0.634 only checks prompt sensitivity of the LLM judge; it does not test whether an oracle that ignores conversation history can still answer a non-trivial share of Hybrid items. No human validation or leakage audit is described.

The work is aimed at researchers building retrieval or memory systems for domains like legal or knowledge management. Anyone evaluating long-context or conversational models would find the released artifacts directly usable for comparison.

It deserves peer review. The idea of a joint benchmark is timely, the numbers are reported clearly, and the release lowers the barrier for follow-up work even if the synthetic artifacts need tighter controls in revision.

Referee Report

3 major / 2 minor

Summary. The paper introduces MemoryDocDataSet, a synthetic benchmark of 50 micro-worlds and 1,000 QA pairs designed to evaluate joint conversational memory and long-document reasoning. Each micro-world includes 3-5 personas, a temporal event graph, 3-5 long documents (20k-50k tokens from the Caselaw Access Project), multi-session conversations, and 20 QA pairs across five categories, with 75.1% labeled as Hybrid questions that require first navigating conversation history to select the relevant document before extracting the answer. Baseline evaluations across truncated context, long-context LLMs, RAG variants, and memory systems show RAG-Both achieving the highest scores (0.358 overall F1, 0.342 on Hybrid), while RAG-Doc drops from 0.453 F1 on Doc-only questions to 0.267 on Hybrid. Dataset quality is assessed via LLM-as-judge prompt-sensitivity self-consistency (median Cohen's κ=0.634), and all artifacts including the generation pipeline and baseline code are released.

Significance. If the synthetic generation process produces Hybrid questions that genuinely require joint navigation of conversation history and document content without leakage or artifacts, the benchmark would address a clear gap by providing the first dedicated resource for this combined capability. The reported performance gap supplies concrete motivation for unified architectures, and the open release of the full pipeline and implementations is a strength that enables direct extensions and reproducibility.

major comments (3)

[Dataset construction and quality assessment] The description of Hybrid question generation (via temporal event graphs, persona-grounded conversations, and LLM-based creation) provides no audit or oracle test showing that an agent ignoring conversation history cannot solve a non-trivial fraction of these items using only the question text or event graph. This is load-bearing for the central claim, as the reported RAG-Doc collapse from 0.453 F1 (Doc-only) to 0.267 F1 (Hybrid) is only interpretable as evidence of a joint-retrieval gap if conversation history is verifiably required to identify the document.
[Dataset construction and quality assessment] The quality characterization reports only median Cohen's κ=0.634 for LLM judge self-consistency across prompt variations; this does not address whether the generated Hybrid questions contain spurious cues or fail to demand the claimed joint reasoning, leaving the task validity unverified.
[Experiments and baselines] Baseline implementation details are insufficient to determine whether the Hybrid performance gap arises from the intended joint challenge or from specific choices in retrieval indexing, chunking of 20k-50k token documents, or memory integration in RAG-Both vs. RAG-Doc.

minor comments (2)

[Abstract] The abstract states there are 'five reasoning categories' but does not enumerate them; this should be listed explicitly for clarity.
[Abstract] The source of the long documents is given as the Caselaw Access Project but lacks a citation or link in the provided text.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback highlighting the need for stronger validation of the Hybrid question category and more detailed baseline descriptions. We agree these points are important for establishing the benchmark's validity and will incorporate additional analyses and details in the revision. We respond to each major comment below.

read point-by-point responses

Referee: The description of Hybrid question generation (via temporal event graphs, persona-grounded conversations, and LLM-based creation) provides no audit or oracle test showing that an agent ignoring conversation history cannot solve a non-trivial fraction of these items using only the question text or event graph. This is load-bearing for the central claim, as the reported RAG-Doc collapse from 0.453 F1 (Doc-only) to 0.267 F1 (Hybrid) is only interpretable as evidence of a joint-retrieval gap if conversation history is verifiably required to identify the document.

Authors: We acknowledge that an explicit oracle test would provide stronger evidence that Hybrid questions require conversation history. The generation process grounds conversations in the temporal event graph such that each multi-session exchange references a specific document, and questions are constructed to necessitate identifying the relevant document from history before extraction. However, we agree this does not substitute for a direct test. In the revision we will add an oracle analysis evaluating a no-history baseline (question text plus event graph only) on the Hybrid subset to quantify any leakage, with results reported in a new validation subsection. revision: yes
Referee: The quality characterization reports only median Cohen's κ=0.634 for LLM judge self-consistency across prompt variations; this does not address whether the generated Hybrid questions contain spurious cues or fail to demand the claimed joint reasoning, leaving the task validity unverified.

Authors: The reported median Cohen's κ quantifies labeling consistency of the LLM judge across prompt variations for source-type assignment. We agree this metric alone does not rule out spurious cues or confirm joint-reasoning demands. To address this, the revision will include additional validation: (1) manual review of a random sample of 100 Hybrid questions with inter-annotator agreement, and (2) the no-history oracle results noted above to directly test for cues independent of conversation history. revision: yes
Referee: Baseline implementation details are insufficient to determine whether the Hybrid performance gap arises from the intended joint challenge or from specific choices in retrieval indexing, chunking of 20k-50k token documents, or memory integration in RAG-Both vs. RAG-Doc.

Authors: We agree that insufficient implementation details limit interpretability of the performance gap. The revision will expand the Experiments section with precise specifications: retrieval indexing method and embedding model, chunk size/overlap for the 20k-50k token documents, exact prompting and memory state management for RAG-Both versus RAG-Doc, and hyperparameter settings. We will also ensure the released code repository contains the exact configurations used for all reported baselines. revision: yes

Circularity Check

0 steps flagged

Empirical benchmark paper with no derivation chain or fitted predictions

full rationale

This is a dataset-construction and baseline-evaluation paper. The central claims consist of measured F1 scores on released artifacts (RAG-Both at 0.358 overall, RAG-Doc drop on Hybrid questions) and a prompt-sensitivity consistency check (median Cohen's κ). No equations, parameter fits, predictions, uniqueness theorems, or self-citation chains appear in the derivation of results. All reported numbers are direct empirical measurements on the synthetic micro-worlds and QA pairs; the generation pipeline is described but not invoked as a mathematical derivation that reduces to its own outputs. The paper is therefore self-contained against external benchmarks with score 0.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Based solely on the abstract, no free parameters, unproved axioms, or new invented entities are described. The benchmark relies on standard synthetic data generation practices and real documents from the Caselaw Access Project, but construction details are not provided.

pith-pipeline@v0.9.1-grok · 5843 in / 1138 out tokens · 29627 ms · 2026-06-28T06:54:40.355016+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

15 extracted references

[1]

Retrieval- Augmented Generation for Knowledge-Intensive NLP Tasks

Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, and Douwe Kiela. Retrieval- Augmented Generation for Knowledge-Intensive NLP Tasks. InAdvances in Neural Information Processing Systems, volume 33, pages 9459–9474, 2020

2020
[2]

Evaluating Very Long-Term Conversational Memory of LLM Agents

Adyasha Maharana, Dong-Ho Lee, Sergey Tulyakov, Mohit Bansal, Francesco Barbieri, and Yuwei Fang. Evaluating Very Long-Term Conversational Memory of LLM Agents. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (V olume 1: Long Papers), pages 13851–13870, 2024

2024
[3]

LongMemEval: Benchmarking Chat Assistants on Long-Term Interactive Memory

Di Wu, Hongwei Wang, Wenhao Yu, Yuwei Zhang, Kai-Wei Chang, and Dong Yu. LongMemEval: Benchmarking Chat Assistants on Long-Term Interactive Memory. InProceedings of the Thirteenth International Conference on Learning Representations, 2025

2025
[4]

L-Eval: Instituting Standardized Evaluation for Long Context Language Models

Chenxin An, Shansan Gong, Ming Zhong, Xingjian Zhao, Mukai Li, Jun Zhang, Lingpeng Kong, and Xipeng Qiu. L-Eval: Instituting Standardized Evaluation for Long Context Language Models. InPro- ceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (V olume 1: Long Papers), pages 14388–14411, 2024

2024
[5]

ZeroSCROLLS: A Zero-Shot Benchmark for Long Text Understanding

Uri Shaham, Maor Ivgi, Avia Efrat, Jonathan Berant, and Omer Levy. ZeroSCROLLS: A Zero-Shot Benchmark for Long Text Understanding. InFindings of the Association for Computational Linguistics: EMNLP 2023, pages 7977–7989, 2023

2023
[6]

Beyond Goldfish Memory: Long-Term Open-Domain Conversa- tion

Jing Xu, Arthur Szlam, and Jason Weston. Beyond Goldfish Memory: Long-Term Open-Domain Conversa- tion. InProceedings of the 60th Annual Meeting of the Association for Computational Linguistics (V olume 1: Long Papers), pages 5180–5197, 2022. 13

2022
[7]

MemBench: Towards More Comprehensive Evaluation on the Memory of LLM-based Agents

Haoran Tan, Zeyu Zhang, Chen Ma, Xu Chen, Quanyu Dai, and Zhenhua Dong. MemBench: Towards More Comprehensive Evaluation on the Memory of LLM-based Agents. InFindings of the Association for Computational Linguistics: ACL 2025, pages 19336–19352, 2025

2025
[8]

Cohen, Ruslan Salakhutdinov, and Christopher D

Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William W. Cohen, Ruslan Salakhutdinov, and Christopher D. Manning. HotpotQA: A Dataset for Diverse, Explainable Multi-hop Question Answering. InProceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2369–2380, 2018

2018
[9]

Mem0: Building Production-Ready AI Agents with Scalable Long-Term Memory.arXiv preprint, 2025

Prateek Chhikara, Dev Khant, Saket Aryan, Taranjeet Singh, and Deshraj Yadav. Mem0: Building Production-Ready AI Agents with Scalable Long-Term Memory.arXiv preprint, 2025

2025
[10]

Zep: A Temporal Knowledge Graph Architecture for Agent Memory.arXiv preprint, 2025

Preston Rasmussen, Pavlo Paliychuk, Travis Beauvais, Jack Ryan, and Daniel Chalef. Zep: A Temporal Knowledge Graph Architecture for Agent Memory.arXiv preprint, 2025

2025
[11]

Caselaw Access Project

Harvard Law School. Caselaw Access Project. 2018

2018
[12]

Datasheets for Datasets.Communications of the ACM, 64(12):86–92, 2021

Timnit Gebru, Jamie Morgenstern, Briana Vecchione, Jennifer Wortman Vaughan, Hanna Wallach, Hal Daumé III, and Kate Crawford. Datasheets for Datasets.Communications of the ACM, 64(12):86–92, 2021

2021
[13]

Sentence-BERT: Sentence Embeddings Using Siamese BERT-Networks

Nils Reimers and Iryna Gurevych. Sentence-BERT: Sentence Embeddings Using Siamese BERT-Networks. InProceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3982–3992, 2019

2019
[14]

Chroma: The AI-Native Open-Source Embedding Database

Chroma. Chroma: The AI-Native Open-Source Embedding Database. 2022

2022
[15]

{domain}

Nelson F. Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang. Lost in the Middle: How Language Models Use Long Contexts.Transactions of the Association for Computational Linguistics, 12:157–173, 2024. Appendix A Pipeline Stage Details Stage 1 — Document Collection.Raw opinions are downloaded from the Caselaw ...

2024

[1] [1]

Retrieval- Augmented Generation for Knowledge-Intensive NLP Tasks

Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, and Douwe Kiela. Retrieval- Augmented Generation for Knowledge-Intensive NLP Tasks. InAdvances in Neural Information Processing Systems, volume 33, pages 9459–9474, 2020

2020

[2] [2]

Evaluating Very Long-Term Conversational Memory of LLM Agents

Adyasha Maharana, Dong-Ho Lee, Sergey Tulyakov, Mohit Bansal, Francesco Barbieri, and Yuwei Fang. Evaluating Very Long-Term Conversational Memory of LLM Agents. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (V olume 1: Long Papers), pages 13851–13870, 2024

2024

[3] [3]

LongMemEval: Benchmarking Chat Assistants on Long-Term Interactive Memory

Di Wu, Hongwei Wang, Wenhao Yu, Yuwei Zhang, Kai-Wei Chang, and Dong Yu. LongMemEval: Benchmarking Chat Assistants on Long-Term Interactive Memory. InProceedings of the Thirteenth International Conference on Learning Representations, 2025

2025

[4] [4]

L-Eval: Instituting Standardized Evaluation for Long Context Language Models

Chenxin An, Shansan Gong, Ming Zhong, Xingjian Zhao, Mukai Li, Jun Zhang, Lingpeng Kong, and Xipeng Qiu. L-Eval: Instituting Standardized Evaluation for Long Context Language Models. InPro- ceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (V olume 1: Long Papers), pages 14388–14411, 2024

2024

[5] [5]

ZeroSCROLLS: A Zero-Shot Benchmark for Long Text Understanding

Uri Shaham, Maor Ivgi, Avia Efrat, Jonathan Berant, and Omer Levy. ZeroSCROLLS: A Zero-Shot Benchmark for Long Text Understanding. InFindings of the Association for Computational Linguistics: EMNLP 2023, pages 7977–7989, 2023

2023

[6] [6]

Beyond Goldfish Memory: Long-Term Open-Domain Conversa- tion

Jing Xu, Arthur Szlam, and Jason Weston. Beyond Goldfish Memory: Long-Term Open-Domain Conversa- tion. InProceedings of the 60th Annual Meeting of the Association for Computational Linguistics (V olume 1: Long Papers), pages 5180–5197, 2022. 13

2022

[7] [7]

MemBench: Towards More Comprehensive Evaluation on the Memory of LLM-based Agents

Haoran Tan, Zeyu Zhang, Chen Ma, Xu Chen, Quanyu Dai, and Zhenhua Dong. MemBench: Towards More Comprehensive Evaluation on the Memory of LLM-based Agents. InFindings of the Association for Computational Linguistics: ACL 2025, pages 19336–19352, 2025

2025

[8] [8]

Cohen, Ruslan Salakhutdinov, and Christopher D

Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William W. Cohen, Ruslan Salakhutdinov, and Christopher D. Manning. HotpotQA: A Dataset for Diverse, Explainable Multi-hop Question Answering. InProceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2369–2380, 2018

2018

[9] [9]

Mem0: Building Production-Ready AI Agents with Scalable Long-Term Memory.arXiv preprint, 2025

Prateek Chhikara, Dev Khant, Saket Aryan, Taranjeet Singh, and Deshraj Yadav. Mem0: Building Production-Ready AI Agents with Scalable Long-Term Memory.arXiv preprint, 2025

2025

[10] [10]

Zep: A Temporal Knowledge Graph Architecture for Agent Memory.arXiv preprint, 2025

Preston Rasmussen, Pavlo Paliychuk, Travis Beauvais, Jack Ryan, and Daniel Chalef. Zep: A Temporal Knowledge Graph Architecture for Agent Memory.arXiv preprint, 2025

2025

[11] [11]

Caselaw Access Project

Harvard Law School. Caselaw Access Project. 2018

2018

[12] [12]

Datasheets for Datasets.Communications of the ACM, 64(12):86–92, 2021

Timnit Gebru, Jamie Morgenstern, Briana Vecchione, Jennifer Wortman Vaughan, Hanna Wallach, Hal Daumé III, and Kate Crawford. Datasheets for Datasets.Communications of the ACM, 64(12):86–92, 2021

2021

[13] [13]

Sentence-BERT: Sentence Embeddings Using Siamese BERT-Networks

Nils Reimers and Iryna Gurevych. Sentence-BERT: Sentence Embeddings Using Siamese BERT-Networks. InProceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3982–3992, 2019

2019

[14] [14]

Chroma: The AI-Native Open-Source Embedding Database

Chroma. Chroma: The AI-Native Open-Source Embedding Database. 2022

2022

[15] [15]

{domain}

Nelson F. Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang. Lost in the Middle: How Language Models Use Long Contexts.Transactions of the Association for Computational Linguistics, 12:157–173, 2024. Appendix A Pipeline Stage Details Stage 1 — Document Collection.Raw opinions are downloaded from the Caselaw ...

2024