A Critical Evaluation of Evaluations for Long-form Question Answering
3 Pith papers cite this work. Polarity classification is still indexing.
fields
cs.CL 3
representative citing papers
A systematic analysis of 284 manually reviewed papers plus 1.8k+ others from 2023-2025 reveals under-reporting of human evaluation study design details, creating ambiguity in what was measured and how.
citing papers explorer
- Evaluating Very Long-Term Conversational Memory of LLM Agents
  Creates LoCoMo benchmark dataset for very long-term LLM conversational memory and shows current models struggle with lengthy dialogues and long-range temporal dynamics.
- Illusions of the Gold Standard: A Large-scale Analysis of Human Evaluation Protocols for Long-form Text Generation
  A systematic analysis of 284 manually reviewed papers plus 1.8k+ others from 2023-2025 reveals under-reporting of human evaluation study design details, creating ambiguity in what was measured and how.
- Lessons from the Trenches on Reproducible Evaluation of Language Models