Recognition: no theorem link
Evaluating Very Long-Term Conversational Memory of LLM Agents
Pith reviewed 2026-05-12 07:59 UTC · model grok-4.3
The pith
LLMs struggle to track events and relationships across hundreds of dialogue turns even with long contexts or retrieval.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
LLMs exhibit challenges in understanding lengthy conversations and comprehending long-range temporal and causal dynamics within dialogues. Employing strategies like long-context LLMs or RAG can offer improvements but these models still substantially lag behind human performance.
What carries the argument
A machine-human pipeline that generates dialogues by LLM-based agents grounded on personas and temporal event graphs, with image sharing and human verification for long-range consistency, producing the LoCoMo dataset and its associated benchmark tasks.
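To make the pipeline's shape concrete, here is a minimal sketch of how persona- and event-graph-grounded dialogue generation might be wired up. All class names, fields, and the prompt format are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch of an event-graph-grounded dialogue pipeline.
# Every name and field here is illustrative; the paper's real
# implementation is not specified in the text above.
from dataclasses import dataclass, field

@dataclass
class Event:
    """A node in one speaker's temporal event graph."""
    description: str                     # e.g. "adopted a rescue dog"
    timestamp: str                       # ISO date (sorts lexicographically)
    caused_by: list["Event"] = field(default_factory=list)

@dataclass
class Agent:
    persona: str                         # free-text persona description
    events: list[Event]                  # this speaker's timeline

def generate_session(agents: list[Agent], session_date: str, llm) -> list[str]:
    """One chat session: each agent speaks in turn, conditioned on its
    persona and only on events that have 'happened' by session_date."""
    turns: list[str] = []
    for agent in agents:
        visible = [e for e in agent.events if e.timestamp <= session_date]
        prompt = (
            f"Persona: {agent.persona}\n"
            f"Life events so far: {[e.description for e in visible]}\n"
            f"Conversation so far: {turns}\n"
            "Reply in character, optionally referencing a shared image."
        )
        turns.append(llm(prompt))       # llm: caller-supplied callable
    return turns
```

Human annotators would then verify and edit the generated sessions for long-range consistency, as the paper describes.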
If this is right
- Long-range temporal and causal reasoning in dialogue requires new mechanisms beyond current context-extension and retrieval methods.
- Multi-modal elements such as shared images add further memory demands that existing models handle poorly.
- Very long-term conversational agents will need explicit memory architectures to approach human consistency.
- Benchmarks limited to five sessions underestimate the difficulty of maintaining coherence over dozens of sessions.
Where Pith is reading between the lines
- The benchmark could serve as a training signal to improve long-term memory through targeted fine-tuning or reinforcement learning.
- Real-world personal assistants that maintain months-long relationships would likely show the same gaps observed here.
- Extending the event-graph grounding to even longer horizons might expose additional failure modes in current techniques.
Load-bearing premise
The pipeline's generated dialogues, after human editing, are natural and representative enough of real multi-session conversations that model failures on them reflect genuine memory limitations.
What would settle it
A model or technique that reaches human-level accuracy on the question-answering and event-summarization tasks over the full 300-turn conversations would show that the reported challenges are not fundamental.
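As one concrete reading of this settling condition, the sketch below scores a candidate system against a human accuracy baseline over full conversations. The dataset fields and the `answer_question` interface are hypothetical, not the benchmark's published API.

```python
# Hedged sketch of the settling test: does a candidate system match
# human-level QA accuracy over full multi-session conversations?
# Dataset fields and the system interface are assumptions.

def settles_the_claim(system, dataset, human_accuracy: float) -> bool:
    """Return True if `system` reaches the human accuracy baseline when
    given each full (~300-turn) conversation as context."""
    correct, total = 0, 0
    for convo in dataset:
        for qa in convo["questions"]:
            pred = system.answer_question(convo["turns"], qa["question"])
            correct += int(pred.strip().lower() == qa["answer"].strip().lower())
            total += 1
    return correct / total >= human_accuracy
```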
Original abstract
Existing works on long-term open-domain dialogues focus on evaluating model responses within contexts spanning no more than five chat sessions. Despite advancements in long-context large language models (LLMs) and retrieval augmented generation (RAG) techniques, their efficacy in very long-term dialogues remains unexplored. To address this research gap, we introduce a machine-human pipeline to generate high-quality, very long-term dialogues by leveraging LLM-based agent architectures and grounding their dialogues on personas and temporal event graphs. Moreover, we equip each agent with the capability of sharing and reacting to images. The generated conversations are verified and edited by human annotators for long-range consistency and grounding to the event graphs. Using this pipeline, we collect LoCoMo, a dataset of very long-term conversations, each encompassing 300 turns and 9K tokens on avg., over up to 35 sessions. Based on LoCoMo, we present a comprehensive evaluation benchmark to measure long-term memory in models, encompassing question answering, event summarization, and multi-modal dialogue generation tasks. Our experimental results indicate that LLMs exhibit challenges in understanding lengthy conversations and comprehending long-range temporal and causal dynamics within dialogues. Employing strategies like long-context LLMs or RAG can offer improvements but these models still substantially lag behind human performance.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims to address the lack of evaluation for very long-term open-domain dialogues (beyond five sessions) by introducing LoCoMo, a dataset of conversations averaging 300 turns and 9K tokens across up to 35 sessions. Dialogues are created via a machine-human pipeline in which LLM agents grounded in personas and temporal event graphs generate content (including image sharing), followed by human verification and editing for long-range consistency. The authors define a benchmark with three tasks—question answering, event summarization, and multi-modal dialogue generation—and report that LLMs struggle with lengthy contexts and long-range temporal/causal dynamics, with long-context models and RAG providing only partial improvements that still substantially lag human performance.
Significance. If the LoCoMo benchmark is shown to be a faithful proxy for natural long-term conversational dynamics, the work would be significant for conversational AI research. It supplies the first large-scale resource focused on very long-term memory and identifies concrete weaknesses in current models' handling of temporal and causal structure, which could guide development of improved memory architectures, retrieval methods, and agent designs. The incorporation of multi-modal image sharing and the human-verified pipeline are positive contributions that increase ecological validity over purely synthetic setups.
Major comments (3)
- [LoCoMo Dataset Construction (pipeline description)] The headline claim that LLMs 'substantially lag behind humans' in long-range temporal and causal dynamics rests on LoCoMo being representative of real very long-term conversations. The machine-human pipeline generates dialogues from LLM agents on synthetic event graphs before human editing; this construction risks embedding LLM-specific artifacts (limited topic drift, artificial consistency, or event-chain regularities) that may not match organic human dialogue. Without additional validation—such as side-by-side comparison of model performance on LoCoMo versus naturally occurring long-term dialogues—the generalizability of the lag finding remains uncertain.
- [Experimental Results] The experimental results section asserts that long-context LLMs and RAG still lag humans but supplies no concrete quantitative metrics (accuracy, F1, ROUGE, or statistical tests), model specifications (exact context windows, RAG hyperparameters, retrieval settings), or error analysis broken down by task or temporal distance. These omissions make it impossible to judge the size of the performance gap or to reproduce the central claim.
- [Benchmark Tasks] For the multi-modal dialogue generation task, the manuscript does not specify how image context is provided to models versus human evaluators or how image-grounding is scored. This detail is load-bearing for the fairness of the human-model comparison and for claims about multi-modal long-term memory.
Minor comments (2)
- [Abstract] The abstract would be strengthened by including one or two headline quantitative results (e.g., best-model vs. human scores on the QA task) to convey the magnitude of the reported gap.
- [Dataset Construction] Clarify the exact number of human annotators, inter-annotator agreement, and editing guidelines used in the verification stage to allow readers to assess the reliability of the final dataset.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We have revised the manuscript to address the concerns about experimental details, task specifications, and dataset construction. Our point-by-point responses follow.
Point-by-point responses
Referee: The headline claim that LLMs 'substantially lag behind humans' in long-range temporal and causal dynamics rests on LoCoMo being representative of real very long-term conversations. The machine-human pipeline generates dialogues from LLM agents on synthetic event graphs before human editing; this construction risks embedding LLM-specific artifacts (limited topic drift, artificial consistency, or event-chain regularities) that may not match organic human dialogue. Without additional validation—such as side-by-side comparison of model performance on LoCoMo versus naturally occurring long-term dialogues—the generalizability of the lag finding remains uncertain.
Authors: We agree that representativeness is essential for the strength of our claims. The human verification and editing phase was specifically designed to enforce long-range consistency and grounding to the temporal event graphs, which substantially reduces LLM-induced artifacts such as unnatural consistency or limited topic drift. In the revision, we have expanded the pipeline description with additional details on annotator guidelines, the distribution of edit types (e.g., temporal corrections, persona consistency fixes), and inter-annotator agreement statistics. We also explicitly discuss the limitation that no public naturally occurring very long-term open-domain dialogue datasets exist for direct comparison, which is precisely why we developed this resource; this is now stated in the limitations section. (Revision: partial)
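Since the response points to inter-annotator agreement statistics, a standard way to compute them is Cohen's kappa; the sketch below is a generic implementation, not the authors' reported procedure.

```python
# Generic Cohen's kappa for two annotators labeling the same items.
# Illustrative only; the paper's actual agreement measure is not
# specified in the text above.

def cohens_kappa(labels_a: list, labels_b: list) -> float:
    """Chance-corrected agreement between two annotators."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    # Observed agreement: fraction of items labeled identically.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement under independent labeling.
    cats = set(labels_a) | set(labels_b)
    p_e = sum((labels_a.count(c) / n) * (labels_b.count(c) / n) for c in cats)
    return 1.0 if p_e == 1 else (p_o - p_e) / (1 - p_e)
```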
Referee: The experimental results section asserts that long-context LLMs and RAG still lag humans but supplies no concrete quantitative metrics (accuracy, F1, ROUGE, or statistical tests), model specifications (exact context windows, RAG hyperparameters, retrieval settings), or error analysis broken down by task or temporal distance. These omissions make it impossible to judge the size of the performance gap or to reproduce the central claim.
Authors: We apologize for these omissions in the original submission. The revised manuscript now reports full quantitative results, including accuracy and F1 for question answering, ROUGE and BERTScore for event summarization, and task-specific metrics for multi-modal generation. All model specifications are provided (context window sizes, exact RAG parameters including chunk size, top-k, and retriever), along with statistical significance tests. A new error analysis subsection breaks down failures by task and by temporal distance (short-range vs. long-range events), with examples. (Revision: yes)
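For orientation, here is a minimal sketch of the kind of RAG configuration those parameters describe (dialogue chunking, embedding retrieval, top-k selection). The `embed` and `llm` callables and the default values are placeholders, not the paper's actual settings.

```python
# Minimal RAG-over-dialogue sketch, assuming turns are plain strings
# and `embed`/`llm` are caller-supplied callables. Chunk size and
# top-k defaults are placeholders, not the paper's reported values.
import math

def cosine(u, v) -> float:
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def chunk_dialogue(turns: list[str], chunk_size: int = 4) -> list[list[str]]:
    """Split a long conversation into fixed-size windows of turns."""
    return [turns[i:i + chunk_size] for i in range(0, len(turns), chunk_size)]

def rag_answer(question: str, turns: list[str], embed, llm, top_k: int = 5) -> str:
    """Retrieve the top-k chunks most similar to the question, then
    answer from only that retrieved context."""
    chunks = chunk_dialogue(turns)
    q_vec = embed(question)
    ranked = sorted(chunks, key=lambda c: -cosine(q_vec, embed(" ".join(c))))
    context = "\n".join(" ".join(c) for c in ranked[:top_k])
    return llm(f"Context:\n{context}\n\nQuestion: {question}\nAnswer:")
```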
Referee: For the multi-modal dialogue generation task, the manuscript does not specify how image context is provided to models versus human evaluators or how image-grounding is scored. This detail is load-bearing for the fairness of the human-model comparison and for claims about multi-modal long-term memory.
Authors: We have added a dedicated paragraph in the benchmark tasks section clarifying the multi-modal protocol. Models receive image context either as generated textual captions or direct vision-language model input (depending on the model type), while human evaluators see the original images. Image-grounding is evaluated via a hybrid approach: human ratings on relevance and conversational consistency, plus automated CLIP similarity between generated text and image content. These details are now explicit to support fair comparison. (Revision: yes)
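The automated half of that hybrid evaluation can be approximated as follows with Hugging Face's CLIP wrappers. The checkpoint and scoring details are assumptions; the manuscript's exact setup is not given in the text above.

```python
# Hedged sketch of a CLIP image-grounding score: cosine similarity
# between a generated reply and the shared image. The checkpoint
# below is an assumption, not the authors' stated choice.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def image_grounding_score(reply_text: str, image_path: str) -> float:
    """CLIP cosine similarity between generated text and an image."""
    image = Image.open(image_path)
    inputs = processor(text=[reply_text], images=image,
                       return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        out = model(**inputs)
        # Normalize the projected embeddings so the dot product
        # is a cosine similarity in [-1, 1].
        t = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
        i = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    return float((t @ i.T).item())
```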
Circularity Check
No circularity: purely empirical benchmark with new data and no derivations
Full rationale
The paper creates LoCoMo via an LLM-agent pipeline grounded in personas and event graphs, followed by human verification and editing, then runs standard QA/summarization/generation benchmarks comparing models to humans. No equations, fitted parameters, predictions that reduce to inputs by construction, or load-bearing self-citations appear in the provided text. The central claim (models lag humans on long-range temporal/causal tasks) rests on fresh experimental results rather than any self-referential reduction. This is self-contained empirical work.
Axiom & Free-Parameter Ledger
Axioms (1)
- Domain assumption: LLM-based agents can generate high-quality, long-range consistent dialogues when grounded on personas and temporal event graphs.
Forward citations
Cited by 32 Pith papers
- MEME: Multi-entity & Evolving Memory Evaluation
  All tested LLM memory systems fail at dependency reasoning in multi-entity evolving scenarios, with only an expensive file-based setup showing partial recovery.
- DeepRefine: Agent-Compiled Knowledge Refinement via Reinforcement Learning
  DeepRefine refines agent-compiled knowledge bases via multi-turn abductive diagnosis and RL training with a GBD reward, yielding consistent downstream task gains.
- When Stored Evidence Stops Being Usable: Scale-Conditioned Evaluation of Agent Memory
  A new evaluation protocol shows agent memory reliability degrades variably with added irrelevant sessions depending on agent, memory interface, and scale.
- PragLocker: Protecting Agent Intellectual Property in Untrusted Deployments via Non-Portable Prompts
  PragLocker protects agent prompts as IP by building non-portable obfuscated versions that function only on the intended LLM through code-symbol semantic anchoring followed by target-model feedback noise injection.
- MEMAUDIT: An Exact Package-Oracle Evaluation Protocol for Budgeted Long-Term LLM Memory Writing
  MEMAUDIT is a new exact optimization protocol for evaluating budgeted LLM memory writing that uses package-oracle fixes and MILP solvers to separate representation quality, validity preservation, and selection effects.
- Four-Axis Decision Alignment for Long-Horizon Enterprise AI Agents
  Long-horizon enterprise AI agents' decisions decompose into four measurable axes, with benchmark experiments on six memory architectures revealing distinct weaknesses and reversing a pre-registered prediction on summa...
- ClawVM: Harness-Managed Virtual Memory for Stateful Tool-Using LLM Agents
  ClawVM introduces a harness-managed virtual memory system for LLM agents that ensures deterministic residency and durability of state under token budgets by using typed pages and validated writeback.
- Towards Real-world Human Behavior Simulation: Benchmarking Large Language Models on Long-horizon, Cross-scenario, Heterogeneous Behavior Traces
  OmniBehavior benchmark demonstrates that LLMs simulating real human behavior converge on hyper-active positive average personas, losing long-tail individual differences.
- $\delta$-mem: Efficient Online Memory for Large Language Models
  δ-mem augments frozen LLMs with an 8x8 online memory state updated by delta-rule learning to generate low-rank attention corrections, delivering 1.10x average gains over the backbone and larger improvements on memory-...
- HAGE: Harnessing Agentic Memory via RL-Driven Weighted Graph Evolution
  HAGE proposes a trainable weighted graph memory framework with LLM intent classification, dynamic edge modulation, and RL optimization that improves long-horizon reasoning accuracy in agentic LLMs over static baselines.
- Storage Is Not Memory: A Retrieval-Centered Architecture for Agent Recall
  True Memory is a verbatim-event retrieval pipeline running on a single SQLite file that reaches 93% accuracy on LoCoMo multi-session questions, outperforming Mem0, Supermemory, Zep, and matching or exceeding EverMemOS...
- MEMTIER: Tiered Memory Architecture and Retrieval Bottleneck Analysis for Long-Running Autonomous AI Agents
  MEMTIER delivers 38% accuracy on the 500-question LongMemEval-S benchmark with a 7B model on 6GB GPU, a 33-point gain over full-context baselines, via structured episodic memory, five-signal retrieval, and semantic co...
- MemORAI: Memory Organization and Retrieval via Adaptive Graph Intelligence for LLM Conversational Agents
  MemORAI combines selective filtering, provenance tracking in multi-relational graphs, and dynamic weighted PageRank retrieval to achieve state-of-the-art memory retrieval and personalized responses in LLM agents on LO...
- From Unstructured Recall to Schema-Grounded Memory: Reliable AI Memory via Iterative, Schema-Aware Extraction
  Schema-aware iterative extraction turns AI memory into a verified system of record, reaching 90-97% accuracy on extraction and end-to-end memory benchmarks where retrieval baselines score 80-87%.
- EviMem: Evidence-Gap-Driven Iterative Retrieval for Long-Term Conversational Memory
  EviMem improves accuracy on temporal and multi-hop questions in long-term conversational memory by iteratively diagnosing and filling evidence gaps, achieving 81.6% and 85.2% judge accuracy on LoCoMo at 4.5x lower lat...
- Memanto: Typed Semantic Memory with Information-Theoretic Retrieval for Long-Horizon Agents
  Memanto delivers 89.8% and 87.1% accuracy on LongMemEval and LoCoMo benchmarks using typed semantic memory and information-theoretic retrieval, outperforming hybrid graph and vector systems with a single query and zer...
- Stateless Decision Memory for Enterprise AI Agents
  Deterministic Projection Memory (DPM) delivers stateless, deterministic decision memory for enterprise AI agents that matches or exceeds summarization-based approaches at tight memory budgets while improving speed, de...
- To Know is to Construct: Schema-Constrained Generation for Agent Memory
  SCG-MEM reformulates agent memory access as schema-constrained generation within dynamic cognitive schemas, using assimilation and accommodation for updates plus an associative graph for reasoning, and outperforms ret...
- HiGMem: A Hierarchical and LLM-Guided Memory System for Long-Term Conversational Agents
  HiGMem combines hierarchical event-turn memory with LLM-guided selection to retrieve concise relevant evidence from long dialogues, improving F1 scores and cutting retrieved turns by an order of magnitude on the LoCoM...
- GenericAgent: A Token-Efficient Self-Evolving LLM Agent via Contextual Information Density Maximization (V1.0)
  GenericAgent outperforms other LLM agents on long-horizon tasks by maximizing context information density with fewer tokens via minimal tools, on-demand memory, trajectory-to-SOP evolution, and compression.
- LLMs Corrupt Your Documents When You Delegate
  LLMs corrupt an average of 25% of document content during long delegated editing workflows across 52 domains, even frontier models, and agentic tools do not mitigate the issue.
- ADAM: A Systematic Data Extraction Attack on Agent Memory via Adaptive Querying
  ADAM extracts data from LLM agent memory with up to 100% attack success rate by estimating data distribution and selecting queries via entropy guidance.
- MemReader: From Passive to Active Extraction for Long-Term Agent Memory
  MemReader uses distilled passive and GRPO-trained active extractors to selectively write low-noise long-term memories, outperforming passive baselines on knowledge updating, temporal reasoning, and hallucination tasks.
- SelRoute: Query-Type-Aware Routing for Long-Term Conversational Memory Retrieval
  SelRoute routes queries to type-specific retrieval pipelines, achieving Recall@5 of 0.800 with a 109M model on LongMemEval_M and outperforming LLM-augmented baselines including a strong zero-ML lexical method.
- Memory in the LLM Era: Modular Architectures and Strategies in a Unified Framework
  A unified framework for LLM agent memory is benchmarked, with a new hybrid method outperforming state-of-the-art on standard tasks.
- A-MEM: Agentic Memory for LLM Agents
  A-MEM is a dynamic memory system for LLM agents that builds and refines an interconnected network of notes with agent-driven linking and evolution, showing performance gains over prior memory methods on six models.
- EngramaBench: Evaluating Long-Term Conversational Memory with Structured Graph Retrieval
  EngramaBench shows structured graph memory outperforms full-context prompting on cross-space reasoning in long conversations but scores lower overall than full-context and higher than vector retrieval.
- EgoSelf: From Memory to Personalized Egocentric Assistant
  EgoSelf uses graph-based memory of user interactions to derive personalized profiles and predict future behaviors for egocentric assistants.
- Layered Mutability: Continuity and Governance in Persistent Self-Modifying Agents
  Layered mutability framework claims governance difficulty in persistent self-modifying agents rises with rapid mutation, strong downstream coupling, weak reversibility, and low observability, producing compositional d...
- Layered Mutability: Continuity and Governance in Persistent Self-Modifying Agents
  Persistent self-modifying AI agents exhibit compositional drift from mismatches across five mutability layers, with governance difficulty rising under rapid mutation, strong coupling, weak reversibility, and low obser...
- MemMachine: A Ground-Truth-Preserving Memory System for Personalized AI Agents
  MemMachine stores entire conversational episodes and applies contextualized retrieval plus adaptive query routing to achieve 0.9169 accuracy on LoCoMo and 93 percent on LongMemEvalS while using 80 percent fewer tokens...
- MemReranker: Reasoning-Aware Reranking for Agent Memory Retrieval
  MemReranker applies multi-teacher pairwise distillation, BCE pointwise training, and InfoNCE contrastive learning on mixed general and memory-specific dialogue data to produce efficient rerankers that improve calibrat...