ProactBench measures LLM conversational proactivity in three phases using 198 multi-agent dialogues and finds recovery behavior hard to predict from existing benchmarks.
ProactiveEval: A unified evaluation framework for proactive dialogue agents
3 Pith papers cite this work. Polarity classification is still indexing.
Years: 2026 (3)
Verdicts: UNVERDICTED (3)
citing papers explorer
- TriggerBench: Investigating Prospective Memory for Large Language Models
TriggerBench is a new benchmark showing prospective memory in LLMs is harder than retrospective memory, exhibits precision-recall trade-offs, and may indicate spare reasoning capacity.
- Do Proactive Agents Really Need an LLM to Decide When to Wake and What to Anchor?
A temporal-graph model on structured event streams replaces per-event LLM calls for trigger decisions in proactive agents, reporting mean F1 gains of 16.7 and 4-83x speedups.