Recognition: no theorem link
Evaluating Very Long-Term Conversational Memory of LLM Agents
Pith reviewed 2026-05-12 07:59 UTC · model grok-4.3
The pith
LLMs struggle to track events and relationships across hundreds of dialogue turns even with long contexts or retrieval.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
LLMs exhibit challenges in understanding lengthy conversations and comprehending long-range temporal and causal dynamics within dialogues. Employing strategies like long-context LLMs or RAG can offer improvements but these models still substantially lag behind human performance.
What carries the argument
A machine-human pipeline that generates dialogues by LLM-based agents grounded on personas and temporal event graphs, with image sharing and human verification for long-range consistency, producing the LoCoMo dataset and its associated benchmark tasks.
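To make the pipeline's shape concrete, here is a minimal sketch of how persona- and event-graph-grounded dialogue generation might be wired up. All class names, fields, and the prompt format are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch of an event-graph-grounded dialogue pipeline.
# Every name and field here is illustrative; the paper's real
# implementation is not specified in the text above.
from dataclasses import dataclass, field

@dataclass
class Event:
    """A node in one speaker's temporal event graph."""
    description: str                     # e.g. "adopted a rescue dog"
    timestamp: str                       # ISO date (sorts lexicographically)
    caused_by: list["Event"] = field(default_factory=list)

@dataclass
class Agent:
    persona: str                         # free-text persona description
    events: list[Event]                  # this speaker's timeline

def generate_session(agents: list[Agent], session_date: str, llm) -> list[str]:
    """One chat session: each agent speaks in turn, conditioned on its
    persona and only on events that have 'happened' by session_date."""
    turns: list[str] = []
    for agent in agents:
        visible = [e for e in agent.events if e.timestamp <= session_date]
        prompt = (
            f"Persona: {agent.persona}\n"
            f"Life events so far: {[e.description for e in visible]}\n"
            f"Conversation so far: {turns}\n"
            "Reply in character, optionally referencing a shared image."
        )
        turns.append(llm(prompt))       # llm: caller-supplied callable
    return turns
```

Human annotators would then verify and edit the generated sessions for long-range consistency, as the paper describes.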
If this is right
- Long-range temporal and causal reasoning in dialogue requires new mechanisms beyond current context-extension and retrieval methods.
- Multi-modal elements such as shared images add further memory demands that existing models handle poorly.
- Very long-term conversational agents will need explicit memory architectures to approach human consistency.
- Benchmarks limited to five sessions underestimate the difficulty of maintaining coherence over dozens of sessions.
Where Pith is reading between the lines
- The benchmark could serve as a training signal to improve long-term memory through targeted fine-tuning or reinforcement learning.
- Real-world personal assistants that maintain months-long relationships would likely show the same gaps observed here.
- Extending the event-graph grounding to even longer horizons might expose additional failure modes in current techniques.
Load-bearing premise
The pipeline's generated dialogues, after human editing, are natural and representative enough of real multi-session conversations that model failures on them reflect genuine memory limitations.
What would settle it
A model or technique that reaches human-level accuracy on the question-answering and event-summarization tasks over the full 300-turn conversations would show that the reported challenges are not fundamental.
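As one concrete reading of this settling condition, the sketch below scores a candidate system against a human accuracy baseline over full conversations. The dataset fields and the `answer_question` interface are hypothetical, not the benchmark's published API.

```python
# Hedged sketch of the settling test: does a candidate system match
# human-level QA accuracy over full multi-session conversations?
# Dataset fields and the system interface are assumptions.

def settles_the_claim(system, dataset, human_accuracy: float) -> bool:
    """Return True if `system` reaches the human accuracy baseline when
    given each full (~300-turn) conversation as context."""
    correct, total = 0, 0
    for convo in dataset:
        for qa in convo["questions"]:
            pred = system.answer_question(convo["turns"], qa["question"])
            correct += int(pred.strip().lower() == qa["answer"].strip().lower())
            total += 1
    return correct / total >= human_accuracy
```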
Original abstract
Existing works on long-term open-domain dialogues focus on evaluating model responses within contexts spanning no more than five chat sessions. Despite advancements in long-context large language models (LLMs) and retrieval augmented generation (RAG) techniques, their efficacy in very long-term dialogues remains unexplored. To address this research gap, we introduce a machine-human pipeline to generate high-quality, very long-term dialogues by leveraging LLM-based agent architectures and grounding their dialogues on personas and temporal event graphs. Moreover, we equip each agent with the capability of sharing and reacting to images. The generated conversations are verified and edited by human annotators for long-range consistency and grounding to the event graphs. Using this pipeline, we collect LoCoMo, a dataset of very long-term conversations, each encompassing 300 turns and 9K tokens on avg., over up to 35 sessions. Based on LoCoMo, we present a comprehensive evaluation benchmark to measure long-term memory in models, encompassing question answering, event summarization, and multi-modal dialogue generation tasks. Our experimental results indicate that LLMs exhibit challenges in understanding lengthy conversations and comprehending long-range temporal and causal dynamics within dialogues. Employing strategies like long-context LLMs or RAG can offer improvements but these models still substantially lag behind human performance.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims to address the lack of evaluation for very long-term open-domain dialogues (beyond five sessions) by introducing LoCoMo, a dataset of conversations averaging 300 turns and 9K tokens across up to 35 sessions. Dialogues are created via a machine-human pipeline in which LLM agents grounded in personas and temporal event graphs generate content (including image sharing), followed by human verification and editing for long-range consistency. The authors define a benchmark with three tasks—question answering, event summarization, and multi-modal dialogue generation—and report that LLMs struggle with lengthy contexts and long-range temporal/causal dynamics, with long-context models and RAG providing only partial improvements that still substantially lag human performance.
Significance. If the LoCoMo benchmark is shown to be a faithful proxy for natural long-term conversational dynamics, the work would be significant for conversational AI research. It supplies the first large-scale resource focused on very long-term memory and identifies concrete weaknesses in current models' handling of temporal and causal structure, which could guide development of improved memory architectures, retrieval methods, and agent designs. The incorporation of multi-modal image sharing and the human-verified pipeline are positive contributions that increase ecological validity over purely synthetic setups.
Major comments (3)
- [LoCoMo Dataset Construction (pipeline description)] The headline claim that LLMs 'substantially lag behind humans' in long-range temporal and causal dynamics rests on LoCoMo being representative of real very long-term conversations. The machine-human pipeline generates dialogues from LLM agents on synthetic event graphs before human editing; this construction risks embedding LLM-specific artifacts (limited topic drift, artificial consistency, or event-chain regularities) that may not match organic human dialogue. Without additional validation—such as side-by-side comparison of model performance on LoCoMo versus naturally occurring long-term dialogues—the generalizability of the lag finding remains uncertain.
- [Experimental Results] The experimental results section asserts that long-context LLMs and RAG still lag humans but supplies no concrete quantitative metrics (accuracy, F1, ROUGE, or statistical tests), model specifications (exact context windows, RAG hyperparameters, retrieval settings), or error analysis broken down by task or temporal distance. These omissions make it impossible to judge the size of the performance gap or to reproduce the central claim.
- [Benchmark Tasks] For the multi-modal dialogue generation task, the manuscript does not specify how image context is provided to models versus human evaluators or how image-grounding is scored. This detail is load-bearing for the fairness of the human-model comparison and for claims about multi-modal long-term memory.
Minor comments (2)
- [Abstract] The abstract would be strengthened by including one or two headline quantitative results (e.g., best-model vs. human scores on the QA task) to convey the magnitude of the reported gap.
- [Dataset Construction] Clarify the exact number of human annotators, inter-annotator agreement, and editing guidelines used in the verification stage to allow readers to assess the reliability of the final dataset.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We have revised the manuscript to address the concerns about experimental details, task specifications, and dataset construction. Our point-by-point responses follow.
Point-by-point responses
Referee: The headline claim that LLMs 'substantially lag behind humans' in long-range temporal and causal dynamics rests on LoCoMo being representative of real very long-term conversations. The machine-human pipeline generates dialogues from LLM agents on synthetic event graphs before human editing; this construction risks embedding LLM-specific artifacts (limited topic drift, artificial consistency, or event-chain regularities) that may not match organic human dialogue. Without additional validation—such as side-by-side comparison of model performance on LoCoMo versus naturally occurring long-term dialogues—the generalizability of the lag finding remains uncertain.
Authors: We agree that representativeness is essential for the strength of our claims. The human verification and editing phase was specifically designed to enforce long-range consistency and grounding to the temporal event graphs, which substantially reduces LLM-induced artifacts such as unnatural consistency or limited topic drift. In the revision, we have expanded the pipeline description with additional details on annotator guidelines, the distribution of edit types (e.g., temporal corrections, persona consistency fixes), and inter-annotator agreement statistics. We also explicitly discuss the limitation that no public naturally occurring very long-term open-domain dialogue datasets exist for direct comparison, which is precisely why we developed this resource; this is now stated in the limitations section. (Revision: partial)
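Since the response points to inter-annotator agreement statistics, a standard way to compute them is Cohen's kappa; the sketch below is a generic implementation, not the authors' reported procedure.

```python
# Generic Cohen's kappa for two annotators labeling the same items.
# Illustrative only; the paper's actual agreement measure is not
# specified in the text above.

def cohens_kappa(labels_a: list, labels_b: list) -> float:
    """Chance-corrected agreement between two annotators."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    # Observed agreement: fraction of items labeled identically.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement under independent labeling.
    cats = set(labels_a) | set(labels_b)
    p_e = sum((labels_a.count(c) / n) * (labels_b.count(c) / n) for c in cats)
    return 1.0 if p_e == 1 else (p_o - p_e) / (1 - p_e)
```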
Referee: The experimental results section asserts that long-context LLMs and RAG still lag humans but supplies no concrete quantitative metrics (accuracy, F1, ROUGE, or statistical tests), model specifications (exact context windows, RAG hyperparameters, retrieval settings), or error analysis broken down by task or temporal distance. These omissions make it impossible to judge the size of the performance gap or to reproduce the central claim.
Authors: We apologize for these omissions in the original submission. The revised manuscript now reports full quantitative results, including accuracy and F1 for question answering, ROUGE and BERTScore for event summarization, and task-specific metrics for multi-modal generation. All model specifications are provided (context window sizes, exact RAG parameters including chunk size, top-k, and retriever), along with statistical significance tests. A new error analysis subsection breaks down failures by task and by temporal distance (short-range vs. long-range events), with examples. (Revision: yes)
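For orientation, here is a minimal sketch of the kind of RAG configuration those parameters describe (dialogue chunking, embedding retrieval, top-k selection). The `embed` and `llm` callables and the default values are placeholders, not the paper's actual settings.

```python
# Minimal RAG-over-dialogue sketch, assuming turns are plain strings
# and `embed`/`llm` are caller-supplied callables. Chunk size and
# top-k defaults are placeholders, not the paper's reported values.
import math

def cosine(u, v) -> float:
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def chunk_dialogue(turns: list[str], chunk_size: int = 4) -> list[list[str]]:
    """Split a long conversation into fixed-size windows of turns."""
    return [turns[i:i + chunk_size] for i in range(0, len(turns), chunk_size)]

def rag_answer(question: str, turns: list[str], embed, llm, top_k: int = 5) -> str:
    """Retrieve the top-k chunks most similar to the question, then
    answer from only that retrieved context."""
    chunks = chunk_dialogue(turns)
    q_vec = embed(question)
    ranked = sorted(chunks, key=lambda c: -cosine(q_vec, embed(" ".join(c))))
    context = "\n".join(" ".join(c) for c in ranked[:top_k])
    return llm(f"Context:\n{context}\n\nQuestion: {question}\nAnswer:")
```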
Referee: For the multi-modal dialogue generation task, the manuscript does not specify how image context is provided to models versus human evaluators or how image-grounding is scored. This detail is load-bearing for the fairness of the human-model comparison and for claims about multi-modal long-term memory.
Authors: We have added a dedicated paragraph in the benchmark tasks section clarifying the multi-modal protocol. Models receive image context either as generated textual captions or direct vision-language model input (depending on the model type), while human evaluators see the original images. Image-grounding is evaluated via a hybrid approach: human ratings on relevance and conversational consistency, plus automated CLIP similarity between generated text and image content. These details are now explicit to support fair comparison. (Revision: yes)
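The automated half of that hybrid evaluation can be approximated as follows with Hugging Face's CLIP wrappers. The checkpoint and scoring details are assumptions; the manuscript's exact setup is not given in the text above.

```python
# Hedged sketch of a CLIP image-grounding score: cosine similarity
# between a generated reply and the shared image. The checkpoint
# below is an assumption, not the authors' stated choice.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def image_grounding_score(reply_text: str, image_path: str) -> float:
    """CLIP cosine similarity between generated text and an image."""
    image = Image.open(image_path)
    inputs = processor(text=[reply_text], images=image,
                       return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        out = model(**inputs)
        # Normalize the projected embeddings so the dot product
        # is a cosine similarity in [-1, 1].
        t = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
        i = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    return float((t @ i.T).item())
```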
Circularity Check
No circularity: purely empirical benchmark with new data and no derivations
Full rationale
The paper creates LoCoMo via an LLM-agent pipeline grounded in personas and event graphs, followed by human verification and editing, then runs standard QA/summarization/generation benchmarks comparing models to humans. No equations, fitted parameters, predictions that reduce to inputs by construction, or load-bearing self-citations appear in the provided text. The central claim (models lag humans on long-range temporal/causal tasks) rests on fresh experimental results rather than any self-referential reduction. This is self-contained empirical work.
Axiom & Free-Parameter Ledger
Axioms (1)
- Domain assumption: LLM-based agents can generate high-quality, long-range consistent dialogues when grounded on personas and temporal event graphs.
Forward citations
Cited by 32 Pith papers
- MEME: Multi-entity & Evolving Memory Evaluation
  All tested LLM memory systems fail at dependency reasoning in multi-entity evolving scenarios, with only an expensive file-based setup showing partial recovery.
- DeepRefine: Agent-Compiled Knowledge Refinement via Reinforcement Learning
  DeepRefine refines agent-compiled knowledge bases via multi-turn abductive diagnosis and RL training with a GBD reward, yielding consistent downstream task gains.
- When Stored Evidence Stops Being Usable: Scale-Conditioned Evaluation of Agent Memory
  A new evaluation protocol shows agent memory reliability degrades variably with added irrelevant sessions depending on agent, memory interface, and scale.
- PragLocker: Protecting Agent Intellectual Property in Untrusted Deployments via Non-Portable Prompts
  PragLocker protects agent prompts as IP by building non-portable obfuscated versions that function only on the intended LLM through code-symbol semantic anchoring followed by target-model feedback noise injection.
- MEMAUDIT: An Exact Package-Oracle Evaluation Protocol for Budgeted Long-Term LLM Memory Writing
  MEMAUDIT is a new exact optimization protocol for evaluating budgeted LLM memory writing that uses package-oracle fixes and MILP solvers to separate representation quality, validity preservation, and selection effects.
- Four-Axis Decision Alignment for Long-Horizon Enterprise AI Agents
  Long-horizon enterprise AI agents' decisions decompose into four measurable axes, with benchmark experiments on six memory architectures revealing distinct weaknesses and reversing a pre-registered prediction on summa...
- ClawVM: Harness-Managed Virtual Memory for Stateful Tool-Using LLM Agents
  ClawVM introduces a harness-managed virtual memory system for LLM agents that ensures deterministic residency and durability of state under token budgets by using typed pages and validated writeback.
- Towards Real-world Human Behavior Simulation: Benchmarking Large Language Models on Long-horizon, Cross-scenario, Heterogeneous Behavior Traces
  OmniBehavior benchmark demonstrates that LLMs simulating real human behavior converge on hyper-active positive average personas, losing long-tail individual differences.
- $\delta$-mem: Efficient Online Memory for Large Language Models
  δ-mem augments frozen LLMs with an 8x8 online memory state updated by delta-rule learning to generate low-rank attention corrections, delivering 1.10x average gains over the backbone and larger improvements on memory-...
- HAGE: Harnessing Agentic Memory via RL-Driven Weighted Graph Evolution
  HAGE proposes a trainable weighted graph memory framework with LLM intent classification, dynamic edge modulation, and RL optimization that improves long-horizon reasoning accuracy in agentic LLMs over static baselines.
- Storage Is Not Memory: A Retrieval-Centered Architecture for Agent Recall
  True Memory is a verbatim-event retrieval pipeline running on a single SQLite file that reaches 93% accuracy on LoCoMo multi-session questions, outperforming Mem0, Supermemory, Zep, and matching or exceeding EverMemOS...
- MEMTIER: Tiered Memory Architecture and Retrieval Bottleneck Analysis for Long-Running Autonomous AI Agents
  MEMTIER delivers 38% accuracy on the 500-question LongMemEval-S benchmark with a 7B model on 6GB GPU, a 33-point gain over full-context baselines, via structured episodic memory, five-signal retrieval, and semantic co...
- MemORAI: Memory Organization and Retrieval via Adaptive Graph Intelligence for LLM Conversational Agents
  MemORAI combines selective filtering, provenance tracking in multi-relational graphs, and dynamic weighted PageRank retrieval to achieve state-of-the-art memory retrieval and personalized responses in LLM agents on LO...
- From Unstructured Recall to Schema-Grounded Memory: Reliable AI Memory via Iterative, Schema-Aware Extraction
  Schema-aware iterative extraction turns AI memory into a verified system of record, reaching 90-97% accuracy on extraction and end-to-end memory benchmarks where retrieval baselines score 80-87%.
- EviMem: Evidence-Gap-Driven Iterative Retrieval for Long-Term Conversational Memory
  EviMem improves accuracy on temporal and multi-hop questions in long-term conversational memory by iteratively diagnosing and filling evidence gaps, achieving 81.6% and 85.2% judge accuracy on LoCoMo at 4.5x lower lat...
- Memanto: Typed Semantic Memory with Information-Theoretic Retrieval for Long-Horizon Agents
  Memanto delivers 89.8% and 87.1% accuracy on LongMemEval and LoCoMo benchmarks using typed semantic memory and information-theoretic retrieval, outperforming hybrid graph and vector systems with a single query and zer...
- Stateless Decision Memory for Enterprise AI Agents
  Deterministic Projection Memory (DPM) delivers stateless, deterministic decision memory for enterprise AI agents that matches or exceeds summarization-based approaches at tight memory budgets while improving speed, de...
- To Know is to Construct: Schema-Constrained Generation for Agent Memory
  SCG-MEM reformulates agent memory access as schema-constrained generation within dynamic cognitive schemas, using assimilation and accommodation for updates plus an associative graph for reasoning, and outperforms ret...
- HiGMem: A Hierarchical and LLM-Guided Memory System for Long-Term Conversational Agents
  HiGMem combines hierarchical event-turn memory with LLM-guided selection to retrieve concise relevant evidence from long dialogues, improving F1 scores and cutting retrieved turns by an order of magnitude on the LoCoM...
- GenericAgent: A Token-Efficient Self-Evolving LLM Agent via Contextual Information Density Maximization (V1.0)
  GenericAgent outperforms other LLM agents on long-horizon tasks by maximizing context information density with fewer tokens via minimal tools, on-demand memory, trajectory-to-SOP evolution, and compression.
- LLMs Corrupt Your Documents When You Delegate
  LLMs corrupt an average of 25% of document content during long delegated editing workflows across 52 domains, even frontier models, and agentic tools do not mitigate the issue.
- ADAM: A Systematic Data Extraction Attack on Agent Memory via Adaptive Querying
  ADAM extracts data from LLM agent memory with up to 100% attack success rate by estimating data distribution and selecting queries via entropy guidance.
- MemReader: From Passive to Active Extraction for Long-Term Agent Memory
  MemReader uses distilled passive and GRPO-trained active extractors to selectively write low-noise long-term memories, outperforming passive baselines on knowledge updating, temporal reasoning, and hallucination tasks.
- SelRoute: Query-Type-Aware Routing for Long-Term Conversational Memory Retrieval
  SelRoute routes queries to type-specific retrieval pipelines, achieving Recall@5 of 0.800 with a 109M model on LongMemEval_M and outperforming LLM-augmented baselines including a strong zero-ML lexical method.
- Memory in the LLM Era: Modular Architectures and Strategies in a Unified Framework
  A unified framework for LLM agent memory is benchmarked, with a new hybrid method outperforming state-of-the-art on standard tasks.
- A-MEM: Agentic Memory for LLM Agents
  A-MEM is a dynamic memory system for LLM agents that builds and refines an interconnected network of notes with agent-driven linking and evolution, showing performance gains over prior memory methods on six models.
- EngramaBench: Evaluating Long-Term Conversational Memory with Structured Graph Retrieval
  EngramaBench shows structured graph memory outperforms full-context prompting on cross-space reasoning in long conversations but scores lower overall than full-context and higher than vector retrieval.
- EgoSelf: From Memory to Personalized Egocentric Assistant
  EgoSelf uses graph-based memory of user interactions to derive personalized profiles and predict future behaviors for egocentric assistants.
- Layered Mutability: Continuity and Governance in Persistent Self-Modifying Agents
  Layered mutability framework claims governance difficulty in persistent self-modifying agents rises with rapid mutation, strong downstream coupling, weak reversibility, and low observability, producing compositional d...
- Layered Mutability: Continuity and Governance in Persistent Self-Modifying Agents
  Persistent self-modifying AI agents exhibit compositional drift from mismatches across five mutability layers, with governance difficulty rising under rapid mutation, strong coupling, weak reversibility, and low obser...
- MemMachine: A Ground-Truth-Preserving Memory System for Personalized AI Agents
  MemMachine stores entire conversational episodes and applies contextualized retrieval plus adaptive query routing to achieve 0.9169 accuracy on LoCoMo and 93 percent on LongMemEvalS while using 80 percent fewer tokens...
- MemReranker: Reasoning-Aware Reranking for Agent Memory Retrieval
  MemReranker applies multi-teacher pairwise distillation, BCE pointwise training, and InfoNCE contrastive learning on mixed general and memory-specific dialogue data to produce efficient rerankers that improve calibrat...