Anticipate and Learn: Unleashing Idle-Time Compute in Proactive Agents

Haoyi Hu; Jianghao Lin; Qirong Lyu; Weinan Zhang; Weiwen Liu; Xianghan Kong; Yan Xu; Yasheng Wang; Yong Yu; Zixuan Guo

arxiv: 2605.25971 · v2 · pith:WAYX6VOCnew · submitted 2026-05-25 · 💻 cs.CL · cs.IR · cs.MA

Haoyi Hu, Qirong Lyu, Xianghan Kong, Weiwen Liu, Jianghao Lin, Zixuan Guo, Yan Xu, Yasheng Wang, Weinan Zhang, Yong Yu

Pith reviewed 2026-06-29 21:42 UTC · model grok-4.3

classification 💻 cs.CL · cs.IR · cs.MA

keywords: proactive agents · idle-time compute · anticipation · dialogue history · persistent memory · task completion · hallucination reduction · benchmark evaluation

The pith

ProAct lets agents use idle time to anticipate user needs from dialogue history and memory, cutting turns by 14.8% and hallucinations by 28.1%.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that agents can stop waiting for explicit prompts and instead use time between interactions to predict what users will ask next. It does this by reading the current conversation thread plus stored memory to forecast needs, then gathering facts and evidence in advance so the gaps are already closed when the query arrives. If correct, this changes agents from responders into preparers that finish work in fewer exchanges and with fewer errors. The claim rests on results from a new 200-scenario benchmark covering 40 domains where the proactive version beats standard reactive agents on speed, effort, and accuracy.

Core claim

ProAct is a proactive agent architecture that uses idle-time compute to anticipate and fulfill likely upcoming user needs. It analyzes evolving dialogue history together with persistent memory to predict those needs, then iteratively acquires information to resolve knowledge gaps and prepare evidence before the user initiates a query.

What carries the argument

The ProAct architecture, which predicts and prepares for future needs during idle time using dialogue history and persistent memory.
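The loop described here can be pictured with a small sketch. It is not the paper's implementation: the class, its methods, and the fixed follow-up table are hypothetical stand-ins, and a real system would presumably use a language model for prediction and tool calls for acquisition. It only shows the shape of the idea: predict during idle time, cache evidence, and fall back to reactive work on a miss.

```python
from dataclasses import dataclass, field


@dataclass
class ProactiveAgent:
    """Toy agent. Every name here is hypothetical, not the paper's API."""
    memory: dict = field(default_factory=dict)    # persistent memory
    prepared: dict = field(default_factory=dict)  # evidence cached while idle

    def predict_needs(self, history):
        # Stand-in predictor: a fixed follow-up table keyed on the last user turn.
        follow_ups = {"book flight": ["book hotel", "airport transfer"]}
        last = history[-1] if history else ""
        return follow_ups.get(last, [])

    def acquire(self, need):
        # Stand-in for the retrieval or tool calls an agent would make.
        return self.memory.get(need, f"looked up: {need}")

    def idle_step(self, history):
        # Runs between user turns: anticipate needs, then close knowledge gaps.
        for need in self.predict_needs(history):
            if need not in self.prepared:
                self.prepared[need] = self.acquire(need)

    def respond(self, query):
        # Returns (answer, was_prepared); a miss falls back to reactive work.
        if query in self.prepared:
            return self.prepared[query], True
        return self.acquire(query), False


agent = ProactiveAgent(memory={"book hotel": "user prefers hotels near the venue"})
agent.idle_step(["book flight"])
print(agent.respond("book hotel"))  # anticipated: served from the cache
print(agent.respond("rent car"))    # not anticipated: handled reactively
```

The second call matters as much as the first: when the prediction misses, the agent behaves like a reactive baseline, which is why the benefit depends on how predictable the next need is.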

If this is right

Task completion accelerates by reducing required turns by 14.8%.
User effort decreases by 11.7%.
Hallucination rates drop by 28.1%.
Reflective accuracy reaches state-of-the-art levels on MemBench evaluations.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Future agent interfaces could default to always-on preparation rather than prompt-triggered responses.
The same idle-time prediction loop could be tested in multi-agent or tool-heavy environments where preparation involves coordination across systems.
Benchmarks built around predictable need chains may need companion tests for sudden or context-shifting user goals.

Load-bearing premise

User needs form predictable chains that can be reliably inferred from evolving dialogue history plus persistent memory.

What would settle it

Running the same 200 scenarios but with deliberately unpredictable user needs that break the chain pattern, then measuring whether the turn count, effort, and hallucination advantages disappear.
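A rough sketch of how such a stress test could be set up follows. It assumes, hypothetically, that a scenario can be represented as an ordered list of needs and that the agent's predictor maps the current need to a list of guesses; neither representation comes from the paper. Shuffling the order is only one crude way to break a chain.

```python
import random


def hit_rate(chain, predictor):
    """Fraction of transitions where the predictor names the actual next need."""
    hits = sum(1 for cur, nxt in zip(chain, chain[1:]) if nxt in predictor(cur))
    return hits / max(len(chain) - 1, 1)


def break_chain(chain, rng):
    """Return a permuted copy, so the original follow-up order is not preserved."""
    shuffled = list(chain)
    rng.shuffle(shuffled)
    return shuffled


# Hypothetical scenario and a predictor that has learned its chain perfectly.
table = {"flight": ["hotel"], "hotel": ["transfer"], "transfer": ["dinner"]}
predictor = lambda need: table.get(need, [])
chain = ["flight", "hotel", "transfer", "dinner"]

rng = random.Random(0)  # seeded so a run can be repeated
print("intact chain:", hit_rate(chain, predictor))
print("broken chain:", hit_rate(break_chain(chain, rng), predictor))
```

If the reported advantages in turns, effort, and hallucinations track the hit rate, they should shrink on the shuffled scenarios; if they persist, something other than chain prediction is doing the work.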

Original abstract

While AI agents demonstrate remarkable capabilities in reasoning and tool use, they remain fundamentally reactive: they compute responses only after explicit user prompts. This paradigm ignores a critical opportunity: the idle time between interactions is largely wasted, leaving agents unable to prepare for future user needs. To bridge this gap, we introduce ProAct, a proactive agent architecture that leverages idle-time compute to anticipate and fulfill likely upcoming user needs. By analyzing evolving dialogue history together with persistent memory, ProAct predicts upcoming needs and iteratively acquires information, allowing the agent to resolve knowledge gaps and prepare evidence before the user initiates a query. To rigorously evaluate proactive capabilities, we also introduce ProActEval, a comprehensive benchmark comprising 200 scenarios across 40 domains, featuring predictable need chains and diverse user cognitive profiles. Empirical results demonstrate significant advantages over reactive baselines. ProAct accelerates task completion by reducing required turns by 14.8%, decreases user effort by 11.7%, and cuts hallucination rates by 28.1% on ProActEval. Furthermore, MemBench evaluations confirm that ProAct achieves state-of-the-art reflective accuracy, underscoring its sustained and robust performance.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

ProAct shows measurable gains on a benchmark built around predictable need chains, but that same design choice limits how far the results generalize.

The paper's central move is to treat the time between user turns as usable compute. ProAct analyzes dialogue history plus memory to guess the next need, then gathers info or evidence ahead of time. They pair this with ProActEval, a new set of 200 scenarios across 40 domains that explicitly include predictable need chains and different user cognitive profiles. On that set the numbers are 14.8% fewer turns, 11.7% less user effort, and 28.1% lower hallucination rate versus reactive baselines, plus a MemBench win on reflective accuracy.

The architecture and the benchmark are the actual additions. Framing idle time as a resource rather than dead space is a clean shift from the usual reactive agent setup, and the benchmark gives a concrete way to measure anticipation.

The main limitation is the one flagged in the stress test. Because the scenarios are constructed to contain the kind of inferable chains the method is built to exploit, the reported improvements are tied to that structure. Nothing in the abstract or the supplied description shows a check against real logged user traces or against needs that are more stochastic. Without that, it is hard to know whether the gains shrink or disappear outside the test distribution. The abstract also reports raw percentages with no error bars, significance tests, or baseline implementation details, which leaves the strength of the empirical claim unclear until the full methods section is examined.

This work is aimed at researchers building tool-using agents who are already thinking about multi-turn interaction and memory. A reader in that subfield can extract the idle-time mechanism and the benchmark design even if they later test on different data.

The paper is coherent on its own terms and engages the literature it cites, so it is worth sending to referees. They will need to press on generalization and on the statistical reporting, but the core idea is worth the time.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces ProAct, a proactive agent architecture that leverages idle-time compute to anticipate and prepare for upcoming user needs by analyzing evolving dialogue history together with persistent memory. It also presents ProActEval, a benchmark consisting of 200 scenarios across 40 domains that feature predictable need chains and diverse user cognitive profiles. The central empirical claim is that ProAct outperforms reactive baselines by reducing required turns by 14.8%, decreasing user effort by 11.7%, and cutting hallucination rates by 28.1% on ProActEval, while additionally achieving state-of-the-art reflective accuracy on MemBench.

Significance. If the empirical results prove robust beyond the specific benchmark construction, the work could meaningfully advance agent design by demonstrating the value of proactive idle-time computation, potentially leading to more efficient and less error-prone interactive systems. The introduction of a dedicated benchmark for proactive capabilities is a constructive step for the field.

major comments (2)

[ProActEval benchmark description] The benchmark is explicitly constructed around 'predictable need chains' that align directly with the anticipation mechanism in ProAct. No independent validation against logged real-user traces or assessment of how representative the 200 scenarios are of stochastic needs is described, which places the reported reductions (14.8% turns, 11.7% effort, 28.1% hallucinations) at risk of being benchmark-specific rather than general. This is load-bearing for the central empirical claim.

[Empirical results section] The abstract and results report percentage improvements without details on statistical significance testing, error bars, baseline implementation, data exclusion rules, or variance across domains and cognitive profiles. This omission prevents verification of whether the advantages are reliably supported by the experiments.

minor comments (1)

[Abstract] The abstract would benefit from a short clause noting the benchmark's focus on predictable need chains to appropriately contextualize the reported gains.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thoughtful review and constructive comments on our manuscript. We address each of the major comments below and outline the revisions we will make to strengthen the paper.

Referee: [ProActEval benchmark description] The benchmark is explicitly constructed around 'predictable need chains' that align directly with the anticipation mechanism in ProAct. No independent validation against logged real-user traces or assessment of how representative the 200 scenarios are of stochastic needs is described, which places the reported reductions (14.8% turns, 11.7% effort, 28.1% hallucinations) at risk of being benchmark-specific rather than general. This is load-bearing for the central empirical claim.

Authors: We agree that ProActEval is intentionally designed around predictable need chains to isolate and evaluate the proactive anticipation capabilities of ProAct in a controlled setting. This design choice enables clear measurement of the benefits of idle-time computation without confounding factors from unpredictable user behavior. While we do not have access to proprietary real-user logs for validation, we will revise the manuscript to include a more detailed discussion of the benchmark construction process, drawing from cognitive science literature on predictable need chains, and explicitly acknowledge the limitation that the results may be most applicable to scenarios with foreseeable needs. We will also add analysis of variance across the 40 domains to demonstrate robustness within the benchmark.

Revision: partial

Referee: [Empirical results section] The abstract and results report percentage improvements without details on statistical significance testing, error bars, baseline implementation, data exclusion rules, or variance across domains and cognitive profiles. This omission prevents verification of whether the advantages are reliably supported by the experiments.

Authors: We appreciate this observation and acknowledge that the original submission lacked sufficient methodological details for full reproducibility and verification. In the revised manuscript, we will expand the Empirical Results section to include: statistical significance testing with p-values from appropriate tests (e.g., Wilcoxon signed-rank test for paired comparisons), error bars representing standard error across runs or domains, complete descriptions of baseline implementations including any prompt engineering or model versions used, confirmation that no data points were excluded, and breakdowns of performance variance across domains and user cognitive profiles. These additions will be supported by updated tables and figures.

Revision: yes
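To make the requested statistics concrete, here is a minimal sketch of a relative reduction and an exact two-sided Wilcoxon signed-rank test on paired per-scenario turn counts. The numbers are invented for illustration and are not the paper's data; the function names are hypothetical, and the exact enumeration is practical only for a handful of pairs (a library routine would be the usual choice for 200 scenarios).

```python
from itertools import product


def relative_reduction(baseline, treated):
    """Relative drop in the mean, e.g. 0.148 for a 14.8% reduction."""
    b = sum(baseline) / len(baseline)
    t = sum(treated) / len(treated)
    return (b - t) / b


def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def wilcoxon_exact(baseline, treated):
    """Two-sided exact signed-rank p-value; zero differences are dropped."""
    diffs = [b - t for b, t in zip(baseline, treated) if b != t]
    if not diffs:
        return 1.0
    ranks = average_ranks([abs(d) for d in diffs])
    center = sum(ranks) / 2  # expected rank sum under the null hypothesis
    observed = abs(sum(r for r, d in zip(ranks, diffs) if d > 0) - center)
    extreme = 0
    # Under the null, each difference is equally likely to be positive or negative.
    for signs in product((0, 1), repeat=len(ranks)):
        w_plus = sum(r for r, s in zip(ranks, signs) if s)
        if abs(w_plus - center) >= observed - 1e-12:
            extreme += 1
    return extreme / 2 ** len(ranks)


# Invented turn counts for six scenarios (reactive baseline vs. proactive agent).
reactive_turns = [8, 7, 9, 6, 8, 7]
proactive_turns = [7, 6, 7, 6, 7, 5]
print(f"relative reduction: {relative_reduction(reactive_turns, proactive_turns):.3f}")
print(f"exact p-value: {wilcoxon_exact(reactive_turns, proactive_turns):.4f}")
```

The example also shows why a bare percentage is hard to interpret: with only a few paired scenarios, even a uniformly favorable result cannot reach a small p-value, so the number of scenarios and the spread of per-scenario differences need to be reported alongside the headline figure.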

Circularity Check

0 steps flagged

No significant circularity; empirical results are direct measurements on an explicitly constructed benchmark

The paper introduces an architecture (ProAct) and a new benchmark (ProActEval) with 200 scenarios featuring predictable need chains, then reports direct empirical comparisons (turn reduction, effort, hallucination rates) against reactive baselines. No equations, parameter fitting, derivation chains, or self-citations appear in the provided text that would reduce any claimed result to its own inputs by construction. The benchmark design is stated upfront rather than hidden, and the metrics are presented as observed outcomes rather than quantities defined in terms of the model's outputs. This is a standard empirical setup with no load-bearing circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

The central claim rests on the domain assumption that user needs are sufficiently predictable from dialogue history and memory to justify pre-computation; no free parameters or invented physical entities are mentioned.

axioms (1)

Domain assumption: User needs form predictable chains that can be inferred from evolving dialogue history together with persistent memory.
Invoked to justify the proactive prediction step and the design of ProActEval scenarios.

invented entities (2)

ProAct architecture (no independent evidence)
Purpose: Proactive agent that anticipates needs during idle time.
New system introduced to perform the anticipation and preparation; no independent falsifiable evidence supplied beyond the reported benchmark scores.

ProActEval benchmark (no independent evidence)
Purpose: Evaluation suite with 200 scenarios across 40 domains and diverse cognitive profiles.
New test set created to measure proactive capabilities; no external validation of scenario realism is described.

[1] [1]

write newline

" write newline "" before.all 'output.state := FUNCTION n.dashify 't := "" t empty not t #1 #1 substring "-" = t #1 #2 substring "--" = not "--" * t #2 global.max substring 't := t #1 #1 substring "-" = "-" * t #2 global.max substring 't := while if t #1 #1 substring * t #2 global.max substring 't := if while FUNCTION format.date year duplicate empty "emp...

[2] [2]

ProactiveBench: Benchmarking Proactiveness in Multimodal Large Language Models

Thomas De Min, Subhankar Roy, St \'e phane Lathuili \`e re, Elisa Ricci, and Massimiliano Mancini. Proactivebench: Benchmarking proactiveness in multimodal large language models. arXiv preprint arXiv:2603.19466, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[3] [4]

Proactive coping and preventive coping: Evidence for two distinct constructs

Suzie Drummond and Paula Brough. Proactive coping and preventive coping: Evidence for two distinct constructs. Personality and Individual Differences, 92: 0 123--127, 2016

2016

[4] [5]

Perltqa: A personal long-term memory dataset for memory classification, retrieval, and fusion in question answering

Yiming Du, Hongru Wang, Zhengyi Zhao, Bin Liang, Baojun Wang, Wanjun Zhong, Zezhong Wang, and Kam-Fai Wong. Perltqa: A personal long-term memory dataset for memory classification, retrieval, and fusion in question answering. In Proceedings of the 10th SIGHAN Workshop on Chinese Language Processing (SIGHAN-10), pages 152--164, 2024

2024

[5] [6]

A Survey of Self-Evolving Agents: What, When, How, and Where to Evolve on the Path to Artificial Super Intelligence

Huan-ang Gao, Jiayi Geng, Wenyue Hua, Mengkang Hu, Xinzhe Juan, Hongzhang Liu, Shilong Liu, Jiahao Qiu, Xuan Qi, Yiran Wu, et al. A survey of self-evolving agents: On path to artificial super intelligence. arXiv preprint arXiv:2507.21046, 1, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[6] [7]

The proactive coping inventory (pci): A multidimensional research instrument

Esther Greenglass. The proactive coping inventory (pci): A multidimensional research instrument. In International Conference of, 1999

1999

[7] [8]

Metareflection: Learning instructions for language agents using past reflections

Priyanshu Gupta, Shashank Kirtania, Ananya Singha, Sumit Gulwani, Arjun Radhakrishna, Gustavo Soares, and Sherry Shi. Metareflection: Learning instructions for language agents using past reflections. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 8369--8385, 2024

2024

[8] [9]

Designing the conversational agent: asking follow-up questions for information elicitation

Jiaxiong Hu, Jingya Guo, Ningjing Tang, Xiaojuan Ma, Yuan Yao, Changyuan Yang, and Yingqing Xu. Designing the conversational agent: asking follow-up questions for information elicitation. Proceedings of the ACM on Human-Computer Interaction, 8 0 (CSCW1): 0 1--30, 2024

2024

[9] [11]

Proactive conversational agents in the post-chatgpt world

Lizi Liao, Grace Hui Yang, and Chirag Shah. Proactive conversational agents in the post-chatgpt world. In Proceedings of the 46th international ACM SIGIR conference on research and development in information retrieval, pages 3452--3455, 2023

2023

[10] [12]

Sleep-time compute: Beyond inference scaling at test-time

Kevin Lin, Charlie Snell, Yu Wang, Charles Packer, Sarah Wooders, Ion Stoica, and Joseph E Gonzalez. Sleep-time compute: Beyond inference scaling at test-time. arXiv preprint arXiv:2504.13171, 2025

work page arXiv 2025

[11] [13]

Toolace: Winning the points of llm function calling

Weiwen Liu, Xu Huang, Xingshan Zeng, Xinlong Hao, Shuai Yu, Dexun Li, Shuai Wang, Weinan Gan, Zhengying Liu, Yuanqing Yu, et al. Toolace: Winning the points of llm function calling. arXiv preprint arXiv:2409.00920, 2024

work page arXiv 2024

[12] [14]

Proactive agent: Shifting llm agents from reactive responses to active assistance

Yaxi Lu, Shenzhi Yang, Cheng Qian, Guirong Chen, Qinyu Luo, Yesai Wu, Huadong Wang, Xin Cong, Zhong Zhang, Yankai Lin, et al. Proactive agent: Shifting llm agents from reactive responses to active assistance. arXiv preprint arXiv:2410.12361, 2024

work page arXiv 2024

[13] [15]

Memgpt: towards llms as operating systems

Charles Packer, Vivian Fang, Shishir\_G Patil, Kevin Lin, Sarah Wooders, and Joseph\_E Gonzalez. Memgpt: towards llms as operating systems. 2023

2023

[14] [16]

Generative agents: Interactive simulacra of human behavior

Joon Sung Park, Joseph O'Brien, Carrie Jun Cai, Meredith Ringel Morris, Percy Liang, and Michael S Bernstein. Generative agents: Interactive simulacra of human behavior. In Proceedings of the 36th annual acm symposium on user interface software and technology, pages 1--22, 2023

2023

[15] [17]

Reflexion: Language agents with verbal reinforcement learning

Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. Reflexion: Language agents with verbal reinforcement learning. Advances in neural information processing systems, 36: 0 8634--8652, 2023

2023

[16] [18]

Membench: Towards more comprehensive evaluation on the memory of llm-based agents

Haoran Tan, Zeyu Zhang, Chen Ma, Xu Chen, Quanyu Dai, and Zhenhua Dong. Membench: Towards more comprehensive evaluation on the memory of llm-based agents. In Findings of the Association for Computational Linguistics: ACL 2025, pages 19336--19352, 2025

2025

[17] [20]

Voyager: An Open-Ended Embodied Agent with Large Language Models

Guanzhi Wang, Yuqi Xie, Yunfan Jiang, Ajay Mandlekar, Chaowei Xiao, Yuke Zhu, Linxi Fan, and Anima Anandkumar. Voyager: An open-ended embodied agent with large language models. arXiv preprint arXiv:2305.16291, 2023 b

work page internal anchor Pith review Pith/arXiv arXiv 2023

[18] [21]

Position: Agent Should Invoke External Tools ONLY When Epistemically Necessary

Hongru Wang, Cheng Qian, Manling Li, Jiahao Qiu, Boyang Xue, Mengdi Wang, Heng Ji, and Kam-Fai Wong. Toward a theory of agents as tool-use decision-makers. arXiv preprint arXiv:2506.00886, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[19] [22]

A survey on large language model based autonomous agents

Lei Wang, Chen Ma, Xueyang Feng, Zeyu Zhang, Hao Yang, Jingsen Zhang, Zhiyuan Chen, Jiakai Tang, Xu Chen, Yankai Lin, et al. A survey on large language model based autonomous agents. Frontiers of Computer Science, 18 0 (6): 0 186345, 2024

2024

[20] [24]

General agentic memory via deep research

BY Yan, Chaofan Li, Hongjin Qian, Shuqi Lu, and Zheng Liu. General agentic memory via deep research. arXiv preprint arXiv:2511.18423, 2025

work page arXiv 2025

[21] [25]

Lightweight LLM Agent Memory with Small Language Models

Jiaquan Zhang, Chaoning Zhang, Shuxu Chen, Zhenzhen Huang, Pengcheng Zheng, Zhicheng Wang, Ping Guo, Fan Mo, Sung-Ho Bae, Jie Zou, et al. Lightweight llm agent memory with small language models. arXiv preprint arXiv:2604.07798, 2026

[22] [27]

Memorybank: Enhancing large language models with long-term memory

Wanjun Zhong, Lianghong Guo, Qiqi Gao, He Ye, and Yanlin Wang. Memorybank: Enhancing large language models with long-term memory. In Proceedings of the AAAI conference on artificial intelligence, volume 38, pages 19724--19731, 2024

[23] [28]

Generative agents: Interactive simulacra of human behavior

Park, J. S., O'Brien, J. C., Cai, C. J., Morris, M. R., Liang, P., and Bernstein, M. S. Generative agents: Interactive simulacra of human behavior. In UIST, 2023

[24] [29]

MemGPT: Towards LLMs as Operating Systems

Packer, C., Wooders, S., Lin, K., Fang, V., Patil, S. G., Stoica, I., and Gonzalez, J. E. MemGPT: Towards LLMs as operating systems. arXiv preprint arXiv:2310.08560, 2023

[25] [30]

MemoryBank: Enhancing large language models with long-term memory

Zhong, W., Guo, L., Gao, Q., and Wang, Y. MemoryBank: Enhancing large language models with long-term memory. In AAAI, 2024

[26] [31]

Enhancing large language model with self-controlled memory framework

Wang, B., Liang, X., Yang, J., Huang, H., Wu, S., Wu, P., Lu, L., Ma, Z., and Li, Z. Enhancing large language model with self-controlled memory framework. arXiv preprint arXiv:2304.13343, 2023

[27] [32]

Proactive computing: Foundations and implementations

Tennenholtz, G., Hick, R., and Mannor, S. Proactive computing: Foundations and implementations. ACM Computing Surveys, 2023

[28] [33]

Proactive dialogue systems: A survey

Deng, Y., Zhang, W., Chen, Z., and Gu, Q. Proactive dialogue systems: A survey. arXiv preprint arXiv:2305.02750, 2023

[29] [34]

Reflexion: Language agents with verbal reinforcement learning

Shinn, N., Cassano, F., Gopinath, A., Narasimhan, K., and Yao, S. Reflexion: Language agents with verbal reinforcement learning. In NeurIPS, 2023

[30] [35]

Collaborative filtering for conversational recommendation

Su, Y., Yang, D., Ostendorf, M., and Hovy, E. Collaborative filtering for conversational recommendation. In ACL, 2019

[31] [36]

LaMP: When large language models meet personalization

Salemi, A., Mysore, S., Bendersky, M., and Zamani, H. LaMP: When large language models meet personalization. In NAACL, 2024

[32] [37]

MemBench: Towards More Comprehensive Evaluation on the Memory of LLM-based Agents

Tan, H., Zhang, Z., Ma, C., Chen, X., Dai, Q., and Dong, Z. MemBench: Towards More Comprehensive Evaluation on the Memory of LLM-based Agents. In Findings of the Association for Computational Linguistics: ACL 2025, pages 19336--19352, 2025. doi:10.18653/v1/2025.findings-acl.989

[33] [38]

LongMemEval: Benchmarking Chat Assistants on Long-Term Interactive Memory

Wu, Y., Xie, T., Jiao, W., Ye, Z., Chen, J., Li, T., and Wen, Z. LongMemEval: Benchmarking chat assistants on long-term interactive memory. arXiv preprint arXiv:2410.10813, 2024

[34] [39]

MemSim: A Bayesian simulator for evaluating memory of LLM-based personal assistants

Zhang, Z., Bo, L., Xiao, C., Chen, H., and Chen, H. MemSim: A Bayesian simulator for evaluating memory of LLM-based personal assistants. arXiv preprint arXiv:2409.20163, 2024

[35] [40]

PerLTQA: A personal long-term memory dataset for memory classification, retrieval, and synthesis in question answering

Du, Z., Chen, Q., Jia, Y., Chen, X., Xie, R., Ji, Z., and Sun, M. PerLTQA: A personal long-term memory dataset for memory classification, retrieval, and synthesis in question answering. arXiv preprint arXiv:2402.16288, 2024

[36] [41]

Dialsim: A dialogue simulator for evaluating long-term multi-party dialogue understanding of conversational agents

Kim, J., Lee, J., Yoo, K. M., and Kang, J. DialSim: A real-time simulator for evaluating long-term dialogue understanding of conversational agents. arXiv preprint arXiv:2406.13144, 2024
