One thousand and one pairs: A "novel" challenge for long-context language models
2 Pith papers cite this work. Polarity classification is still indexing.
Fields: cs.CL (2)
Years: 2026 (2)
Verdicts: UNVERDICTED (2)
Representative citing papers:
LLM novel summaries emphasize endings more than human ones, measured by aligning summary sentences to referenced chapters.
citing papers explorer
- LongMemEval-V2: Evaluating Long-Term Agent Memory Toward Experienced Colleagues
  LongMemEval-V2 is a new benchmark where AgentRunbook-C reaches 72.5% accuracy on long-term agent memory tasks, beating RAG baselines at 48.5% and basic coding agents at 69.3%.
- Attention Flows: Tracing LLM Conceptual Engagement via Story Summaries
  LLM novel summaries emphasize endings more than human ones, measured by aligning summary sentences to referenced chapters.