AgentFugue: Agent Scaling for Long-Horizon Tasks through Collective Reasoning

Hongjin Qian; Jiongnan Liu; Shuting Wang; Tong Zhao; Xiaoxi Li; Yuyang Hu; Zheng Liu; Zhicheng Dou

arxiv: 2605.24486 · v1 · pith:5WXE47AOnew · submitted 2026-05-23 · 💻 cs.AI · cs.CL

AgentFugue: Agent Scaling for Long-Horizon Tasks through Collective Reasoning

Yuyang Hu , Hongjin Qian , Shuting Wang , Jiongnan Liu , Tong Zhao , Xiaoxi Li , Zheng Liu , Zhicheng Dou This is my paper

Pith reviewed 2026-06-30 13:26 UTC · model grok-4.3

classification 💻 cs.AI cs.CL

keywords multi-agent systemscollective reasoninglong-horizon tasksagent scalingshared reasoning hubreinforcement learningparallel exploration

0 comments

The pith

A shared reasoning hub lets multiple peer agents turn parallel explorations into reusable intermediate results for long-horizon tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines whether running several peer agents on the same long-horizon task can yield capability gains through collective reasoning instead of simply spending more compute. It introduces AgentFugue, a framework with a shared reasoning hub that records concise notes on what each agent has established, attempted, or ruled out and lets agents selectively retrieve useful parts from others. This design connects otherwise isolated trajectories without requiring role specialization or centralized orchestration. The hub is implemented as a plug-in layer trained via supervised fine-tuning and reinforcement learning. Experiments across challenging settings show improvements over strong baselines, indicating that scaling out agents can become a distinct source of performance rather than just additional resource use.

Core claim

AgentFugue is a collective reasoning framework built around a shared reasoning hub. As peer agents explore the same task in parallel, the hub records concise notes on what each agent has established, attempted, or ruled out and enables each agent to selectively access what other agents have discovered in a form useful for its current search. This turns otherwise isolated trajectories into a connected ecology of reusable intermediate reasoning without requiring centralized planning. The hub is instantiated as a plug-in communication layer trained with supervised fine-tuning and end-to-end reinforcement learning. Across the challenging long-horizon settings studied, AgentFugue improves over st

What carries the argument

The shared reasoning hub, a plug-in layer that records concise notes from each peer agent and surfaces them for selective access by others.

If this is right

Peer agents can access and build on non-redundant intermediate results from one another without explicit role assignments.
Scaling the number of peer agents becomes a source of capability improvement distinct from scaling individual model size or tools.
The approach functions as an add-on layer that does not require redesigning the underlying agent scaffolding or workflow.
Training the hub with a mix of supervised fine-tuning and reinforcement learning suffices to produce useful note recording and retrieval.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same hub design could be tested on tasks outside the studied long-horizon settings to see whether the collective benefit generalizes.
Varying the number of peer agents while measuring the marginal gain per added agent would clarify how the sharing mechanism scales.
If the hub can be made to work across different base agents, it would support treating multi-agent systems as reusable ecologies rather than one-off ensembles.

Load-bearing premise

That performance gains arise from the collective sharing mechanism itself rather than from simply running more agents and therefore using more total compute.

What would settle it

A controlled comparison that holds total compute fixed while removing the hub's note-recording and sharing function, then checks whether the multi-agent performance advantage disappears.

Figures

Figures reproduced from arXiv: 2605.24486 by Hongjin Qian, Jiongnan Liu, Shuting Wang, Tong Zhao, Xiaoxi Li, Yuyang Hu, Zheng Liu, Zhicheng Dou.

**Figure 1.** Figure 1: Overview of AgentFugue. The top panel illustrates the core idea: peer agents explore the same task in parallel while a shared reasoning hub mediates cross-trajectory communication. The bottom panel details the reasoning hub, including episode writing, context eviction, and intent-driven reading. Once an agent’s current interaction reaches the write budget, it is summarized into an episode note and added to… view at source ↗

**Figure 2.** Figure 2: Homogeneous scaling on BrowseComp (Qwen3.5-35B-A3B, N ∈ {1, 2, 3, 5, 8}). (a) Peragent accuracy and team-mean Avg@N as the team grows. (b) Per-agent search/visit calls (cool) vs. per-question memory calls (warm); larger teams shift effort from isolated exploration to shared coordination. • Consistent dominance over every multi-agent baseline under both backbones. Under Qwen3.5- 35B-A3B, AgentFugue reaches… view at source ↗

**Figure 3.** Figure 3: Aggregator metrics across team sizes (Pass, BoN, MV, WMV, FewTool, Avg). Each spoke is one heuristic rule; bars within a spoke sweep N from light to dark. N=1 N=2 N=3 N=4 Team size (number of heterogeneous peer agents) 40 50 60 70 BrowseComp accuracy (%) 52.5 57 63.8 Qwen DeepSeek GLM Kimi Per-agent coverage (min--max) Avg@N (team-mean per-agent accuracy) (a) Per-model trajectories with team size 60 50 40 … view at source ↗

**Figure 4.** Figure 4: Heterogeneous scaling on BrowseComp (Qwen → +DeepSeek-v4-Flash → +GLM-4.7 → +Kimi-K2.5). (a) Per-model per-agent accuracy as each backbone joins; markers at N=1 are standalone baselines. (b) Per-agent search/visit vs. per-question memory calls. 3.5 Heterogeneous Teams: Stronger Models Pull Up the Group We next ask whether the same hub generalizes to teams of different backbones. Starting from Qwen3.5-35B-A… view at source ↗

read the original abstract

Recent progress on long-horizon agentic tasks has been driven largely by scaling up individual agents through stronger models, better tools, and more effective scaffolding. In contrast, much less is understood about scaling out: whether multiple peer agents, all targeting the same task, can become an additional source of capability without relying on explicit role specialization or workflow orchestration. We study this question and propose AgentFugue, a collective reasoning framework built around a shared reasoning hub. As peer agents explore the same task in parallel, the hub records concise notes on what each agent has established, attempted, or ruled out, and enables each agent to selectively access what other agents have discovered in a form useful for its current search. This design turns otherwise isolated trajectories into a connected ecology of reusable intermediate reasoning without requiring centralized planning. We instantiate the hub as a plug-in communication layer, trained with supervised fine-tuning and end-to-end reinforcement learning. Across the challenging long-horizon settings we study, AgentFugue improves over strong baselines. Our results suggest that collective reasoning can turn scaling out peer agent systems into a distinct source of capability gains, rather than merely a way of spending more compute.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

AgentFugue's shared hub for peer agents to exchange notes on long tasks is a clean architectural idea, but the abstract gives no evidence that gains come from the collective mechanism rather than extra parallel compute.

read the letter

The paper's main contribution is a plug-in shared reasoning hub that multiple peer agents can use to record and selectively pull concise notes on what each has established or ruled out while working the same long-horizon task in parallel. The design deliberately skips role specialization and centralized orchestration, treating the agents as an ecology of reusable intermediate results instead.

This is a straightforward way to frame scaling out versus scaling up, and the choice to train the hub end-to-end with SFT then RL is a reasonable implementation step that keeps the layer lightweight.

The soft spot is exactly the one the stress-test note flags. The abstract states that AgentFugue improves over strong baselines but supplies no numbers, no agent count, no token accounting, and no ablation that runs the same K agents without the hub or with matched total inference budget. Without those controls it is impossible to separate the claimed benefit of collective reasoning from the simple effect of spending more compute on parallel trajectories. The paper does not describe any such matched baseline, so the central attribution remains untested on the evidence given.

The work is aimed at researchers already building multi-agent systems for long tasks. If the full manuscript contains the missing compute-matched ablations and per-trajectory metrics, it is worth sending to referees; otherwise the soundness gap is large enough that a desk reject would be reasonable.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes AgentFugue, a collective reasoning framework for scaling out peer agents on long-horizon tasks. Peer agents run in parallel on the same task; a shared reasoning hub records concise notes on what each has established, attempted, or ruled out and enables selective access to non-redundant intermediates. The hub is instantiated as a plug-in layer trained via supervised fine-tuning followed by end-to-end reinforcement learning. The paper claims that this architecture yields improvements over strong baselines in the studied settings and that collective reasoning constitutes a distinct source of capability gains beyond additional compute.

Significance. If the empirical claims hold under proper controls, the work would identify a new, orthogonal axis for agent scaling—collective reuse of intermediate reasoning—distinct from both single-agent model scaling and explicit multi-agent orchestration. This could meaningfully expand the design space for long-horizon agent systems.

major comments (2)

[Abstract] Abstract: the claim that AgentFugue “improves over strong baselines” and that gains arise from collective reasoning rather than increased compute is unsupported by any reported metrics, statistical tests, ablation results, or experimental details. Without these, the central attribution cannot be evaluated.
[Results / Experimental Setup (missing)] No section describes a compute-matched baseline that runs the same number K of peer agents with identical total token budget but without the hub, nor an ablation that disables selective access while holding communication volume constant. These controls are load-bearing for the claim that observed gains are due to the shared reasoning hub rather than extra inference steps.

minor comments (1)

The description of how the hub is trained (SFT then RL) and how selective access is implemented would benefit from an explicit algorithmic outline or pseudocode.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. The comments correctly identify areas where additional experimental controls and clarity in the abstract would strengthen the attribution of gains to the shared reasoning hub. We address each point below.

read point-by-point responses

Referee: [Abstract] Abstract: the claim that AgentFugue “improves over strong baselines” and that gains arise from collective reasoning rather than increased compute is unsupported by any reported metrics, statistical tests, ablation results, or experimental details. Without these, the central attribution cannot be evaluated.

Authors: The abstract is a high-level summary of the work. The full manuscript reports quantitative results, including performance metrics across long-horizon tasks and comparisons against baselines, in the Results section. To improve self-containment and address the concern directly, we will revise the abstract to reference the key metrics, note the presence of ablations, and indicate that statistical details appear in the main text. revision: yes
Referee: [Results / Experimental Setup (missing)] No section describes a compute-matched baseline that runs the same number K of peer agents with identical total token budget but without the hub, nor an ablation that disables selective access while holding communication volume constant. These controls are load-bearing for the claim that observed gains are due to the shared reasoning hub rather than extra inference steps.

Authors: We agree these controls are necessary to isolate the contribution of the hub. In the revised version we will add (1) a compute-matched baseline in which K peer agents run without the hub but consume an identical total token budget and (2) an ablation that removes selective access while preserving the same communication volume (e.g., by broadcasting all notes). Both will be reported with the corresponding performance numbers. revision: yes

Circularity Check

0 steps flagged

No circularity; architectural proposal with empirical claims only

full rationale

The paper describes an architectural framework (shared reasoning hub trained via SFT + RL) and reports empirical improvements over baselines. No equations, fitted parameters presented as predictions, self-citational load-bearing premises, or reductions of claims to inputs by construction appear in the abstract or described content. The central claim is an empirical suggestion about collective reasoning gains, not a derivation that collapses to its own definitions or prior self-work. Per rules, absence of quoted self-referential steps yields score 0.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no free parameters, axioms, or invented entities are specified in the provided text.

pith-pipeline@v0.9.1-grok · 5758 in / 1104 out tokens · 37594 ms · 2026-06-30T13:26:56.513745+00:00 · methodology

discussion (0)

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Toward Generalist Autonomous Research via Hypothesis-Tree Refinement
cs.CL 2026-06 unverdicted novelty 6.0

Arbor combines a coordinator, executors, and a hypothesis tree to enable cumulative autonomous research, outperforming Codex and Claude Code by over 2.5x on six real tasks and reaching 86.36% Any Medal on MLE-Bench Lite.

Reference graph

Works this paper leans on

88 extracted references · 31 canonical work pages · cited by 1 Pith paper · 18 internal anchors

[1]

Bradley C. A. Brown, Jordan Juravsky, Ryan Ehrlich, Ronald Clark, Quoc V . Le, Christopher Ré, and Azalia Mirhoseini. Large language monkeys: Scaling inference compute with repeated sampling.CoRR, abs/2407.21787, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[2]

Chang, Andrew Drozdov, Shubham Toshniwal, Owen Oertell, Alexander Trott, Jacob P

Jonathan D. Chang, Andrew Drozdov, Shubham Toshniwal, Owen Oertell, Alexander Trott, Jacob P. Portes, Abhay Gupta, Pallavi Koppol, Ashutosh Baheti, Sean Kulinski, Ivan Zhou, Irene Dea, Krista Opsahl-Ong, Simon Favreau-Lessard, Sean Owen, Jose Javier Gonzalez Ortiz, Arnav Singhvi, Xabi Andrade, Cindy Wang, Kartik Sreenivasan, Sam Havens, Jialu Liu, Peyton ...

work page arXiv 2026
[3]

Iterresearch: Rethinking long-horizon agents with interaction scaling, 2026

Guoxin Chen, Zile Qiao, Xuanzhong Chen, Donglei Yu, Haotian Xu, Wayne Xin Zhao, Ruihua Song, Wenbiao Yin, Huifeng Yin, Liwen Zhang, Kuan Li, Minpeng Liao, Yong Jiang, Pengjun Xie, Fei Huang, and Jingren Zhou. Iterresearch: Rethinking long-horizon agents with interaction scaling, 2026

2026
[4]

Reconcile: Round-table conference improves reasoning via consensus among diverse llms

Justin Chih-Yao Chen, Swarnadeep Saha, and Mohit Bansal. Reconcile: Round-table conference improves reasoning via consensus among diverse llms. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar, editors,Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2024, Bangkok, Thailand, August 11-16,...

2024
[5]

Agentverse: Facilitating multi-agent collaboration and exploring emergent behaviors

Weize Chen, Yusheng Su, Jingwei Zuo, Cheng Yang, Chenfei Yuan, Chi-Min Chan, Heyang Yu, Yaxi Lu, Yi-Hsin Hung, Chen Qian, Yujia Qin, Xin Cong, Ruobing Xie, Zhiyuan Liu, Maosong Sun, and Jie Zhou. Agentverse: Facilitating multi-agent collaboration and exploring emergent behaviors. InThe Twelfth International Conference on Learning Representations, ICLR 202...

2024
[6]

Mem0: Building production-ready AI agents with scalable long-term memory

Prateek Chhikara, Dev Khant, Saket Aryan, Taranjeet Singh, and Deshraj Yadav. Mem0: Building production-ready AI agents with scalable long-term memory. In Inês Lynce, Nello Murano, Mauro Vallati, Serena Villata, Federico Chesani, Michela Milano, Andrea Omicini, and Mehdi Dastani, editors,ECAI 2025 - 28th European Conference on Artificial Intelligence, 25-...

2025
[7]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

DeepSeek-AI. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning.CoRR, abs/2501.12948, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[8]

Tenenbaum, and Igor Mordatch

Yilun Du, Shuang Li, Antonio Torralba, Joshua B. Tenenbaum, and Igor Mordatch. Improving factuality and reasoning in language models through multiagent debate. In Ruslan Salakhutdinov, Zico Kolter, Katherine A. Heller, Adrian Weller, Nuria Oliver, Jonathan Scarlett, and Felix Berkenkamp, editors,Forty-first International Conference on Machine Learning, IC...

2024
[9]

LightMem: Lightweight and Efficient Memory-Augmented Generation

Jizhan Fang, Xinle Deng, Haoming Xu, Ziyan Jiang, Yuqi Tang, Ziwen Xu, Shumin Deng, Yunzhi Yao, Mengru Wang, Shuofei Qiao, Huajun Chen, and Ningyu Zhang. Lightmem: Lightweight and efficient memory-augmented generation.CoRR, abs/2510.18866, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[10]

Agentswing: Adap- tive parallel context management routing for long-horizon web agents.CoRR, abs/2603.27490, 2026

Zhaopeng Feng, Liangcai Su, Zhen Zhang, Xinyu Wang, Xiaotian Zhang, Xiaobin Wang, Run- nan Fang, Qi Zhang, Baixuan Li, Shihao Cai, Rui Ye, Hui Chen, Yong Jiang, Joey Tianyi Zhou, Chenxiong Qian, Pengjun Xie, Bryan Hooi, Zuozhu Liu, and Jingren Zhou. Agentswing: Adap- tive parallel context management routing for long-horizon web agents.CoRR, abs/2603.27490, 2026

work page arXiv 2026
[11]

Metagpt: Meta programming for A multi-agent collaborative framework

Sirui Hong, Mingchen Zhuge, Jonathan Chen, Xiawu Zheng, Yuheng Cheng, Jinlin Wang, Ceyao Zhang, Zili Wang, Steven Ka Shing Yau, Zijuan Lin, Liyang Zhou, Chenyu Ran, Lingfeng Xiao, Chenglin Wu, and Jürgen Schmidhuber. Metagpt: Meta programming for A multi-agent collaborative framework. InThe Twelfth International Conference on Learning Representations, ICL...

2024
[12]

Memory matters more: Event-centric memory as a logic map for agent searching and reasoning.CoRR, abs/2601.04726, 2026

Yuyang Hu, Jiongnan Liu, Jiejun Tan, Yutao Zhu, and Zhicheng Dou. Memory matters more: Event-centric memory as a logic map for agent searching and reasoning.CoRR, abs/2601.04726, 2026

work page arXiv 2026
[13]

Memory in the Age of AI Agents

Yuyang Hu, Shichun Liu, Yanwei Yue, Guibin Zhang, Boyang Liu, Fangyi Zhu, Jiahang Lin, Honglin Guo, Shihan Dou, Zhiheng Xi, Senjie Jin, Jiejun Tan, Yanbin Yin, Jiongnan Liu, Zeyu Zhang, Zhongxiang Sun, Yutao Zhu, Hao Sun, Boci Peng, Zhenrong Cheng, Xuanbo Fan, Jiaxin Guo, Xinlei Yu, Zhenhong Zhou, Zewen Hu, Jiahao Huo, Junhao Wang, Yuwei Niu, Yu Wang, Zhe...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[14]

AgentCoder: Multi-Agent-based Code Generation with Iterative Testing and Optimisation

Dong Huang, Qingwen Bu, Jie M. Zhang, Michael Luck, and Heming Cui. Agentcoder: Multi- agent-based code generation with iterative testing and optimisation.CoRR, abs/2312.13010, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[15]

Search-R1: Training LLMs to Reason and Leverage Search Engines with Reinforcement Learning

Bowen Jin, Hansi Zeng, Zhenrui Yue, Dong Wang, Hamed Zamani, and Jiawei Han. Search- r1: Training llms to reason and leverage search engines with reinforcement learning.CoRR, abs/2503.09516, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[16]

Flashrag: A modular toolkit for efficient retrieval-augmented generation research

Jiajie Jin, Yutao Zhu, Zhicheng Dou, Guanting Dong, Xinyu Yang, Chenghao Zhang, Tong Zhao, Zhao Yang, and Ji-Rong Wen. Flashrag: A modular toolkit for efficient retrieval-augmented generation research. In Guodong Long, Michale Blumestein, Yi Chang, Liane Lewin-Eytan, Zi Helen Huang, and Elad Yom-Tov, editors,Companion Proceedings of the ACM on Web Confere...

2025
[17]

Large language models are zero-shot reasoners

Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. Large language models are zero-shot reasoners. In Sanmi Koyejo, S. Mohamed, A. Agarwal, Danielle Belgrave, K. Cho, and A. Oh, editors,Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, NeurIPS 2022, New Orl...

2022
[18]

Agentic aggregation for parallel scaling of long-horizon agentic tasks, 2026

Yoonsang Lee, Howard Yen, Xi Ye, and Danqi Chen. Agentic aggregation for parallel scaling of long-horizon agentic tasks, 2026

2026
[19]

Tongyi DeepResearch Technical Report

Baixuan Li, Bo Zhang, Dingchu Zhang, Fei Huang, Guangyu Li, Guoxin Chen, Huifeng Yin, Jialong Wu, Jingren Zhou, Kuan Li, Liangcai Su, Litu Ou, Liwen Zhang, Pengjun Xie, Rui Ye, Wenbiao Yin, Xinmiao Yu, Xinyu Wang, Xixi Wu, Xuanzhong Chen, Yida Zhao, Zhen Zhang, Zhengwei Tao, Zhongwang Zhang, Zile Qiao, Chenxi Wang, Donglei Yu, Gang Fu, Haiyang Shen, Jiayi...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[20]

Parallelmuse: Agentic parallel thinking for deep information seeking.CoRR, abs/2510.24698, 2025

Baixuan Li, Dingchu Zhang, Jialong Wu, Wenbiao Yin, Zhengwei Tao, Yida Zhao, Liwen Zhang, Haiyang Shen, Runnan Fang, Pengjun Xie, Jingren Zhou, and Yong Jiang. Parallelmuse: Agentic parallel thinking for deep information seeking.CoRR, abs/2510.24698, 2025

work page arXiv 2025
[21]

CAMEL: communicative agents for "mind" exploration of large language model society

Guohao Li, Hasan Hammoud, Hani Itani, Dmitrii Khizbullin, and Bernard Ghanem. CAMEL: communicative agents for "mind" exploration of large language model society. In Alice Oh, Tristan Naumann, Amir Globerson, Kate Saenko, Moritz Hardt, and Sergey Levine, editors, Advances in Neural Information Processing Systems 36: Annual Conference on Neural Infor- matio...

2023
[22]

More agents is all you need

Junyou Li, Qin Zhang, Yangbin Yu, Qiang Fu, and Deheng Ye. More agents is all you need. Trans. Mach. Learn. Res., 2024, 2024. 11

2024
[23]

WebSailor: Navigating Super-human Reasoning for Web Agent

Kuan Li, Zhongwang Zhang, Huifeng Yin, Liwen Zhang, Litu Ou, Jialong Wu, Wenbiao Yin, Baixuan Li, Zhengwei Tao, Xinyu Wang, Weizhou Shen, Junkai Zhang, Dingchu Zhang, Xixi Wu, Yong Jiang, Ming Yan, Pengjun Xie, Fei Huang, and Jingren Zhou. Websailor: Navigating super-human reasoning for web agent.CoRR, abs/2507.02592, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[24]

Search-o1: Agentic search-enhanced large reasoning models

Xiaoxi Li, Guanting Dong, Jiajie Jin, Yuyao Zhang, Yujia Zhou, Yutao Zhu, Peitian Zhang, and Zhicheng Dou. Search-o1: Agentic search-enhanced large reasoning models. In Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, and Violet Peng, editors,Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, EMNLP 2025, ...

2025
[25]

WebThinker: Empowering Large Reasoning Models with Deep Research Capability

Xiaoxi Li, Jiajie Jin, Guanting Dong, Hongjin Qian, Yutao Zhu, Yongkang Wu, Ji-Rong Wen, and Zhicheng Dou. Webthinker: Empowering large reasoning models with deep research capability.CoRR, abs/2504.21776, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[26]

Encouraging divergent thinking in large language models through multi- agent debate

Tian Liang, Zhiwei He, Wenxiang Jiao, Xing Wang, Yan Wang, Rui Wang, Yujiu Yang, Shuming Shi, and Zhaopeng Tu. Encouraging divergent thinking in large language models through multi- agent debate. In Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen, editors,Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, EMNLP 2024,...

2024
[27]

A dynamic llm-powered agent network for task-oriented agent collaboration, 2024

Zijun Liu, Yanzhe Zhang, Peng Li, Yang Liu, and Diyi Yang. A dynamic llm-powered agent network for task-oriented agent collaboration, 2024

2024
[28]

Courier Corporation, 1987

Alfred Mann.The Study of Fugue. Courier Corporation, 1987

1987
[29]

Candès, and Tatsunori Hashimoto

Niklas Muennighoff, Zitong Yang, Weijia Shi, Xiang Lisa Li, Li Fei-Fei, Hannaneh Hajishirzi, Luke Zettlemoyer, Percy Liang, Emmanuel J. Candès, and Tatsunori Hashimoto. s1: Simple test-time scaling. In Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, and Violet Peng, editors,Proceedings of the 2025 Conference on Empirical Methods in Natural ...

2025
[30]

WebGPT: Browser-assisted question-answering with human feedback

Reiichiro Nakano, Jacob Hilton, Suchir Balaji, Jeff Wu, Long Ouyang, Christina Kim, Christo- pher Hesse, Shantanu Jain, Vineet Kosaraju, William Saunders, Xu Jiang, Karl Cobbe, Tyna Eloundou, Gretchen Krueger, Kevin Button, Matthew Knight, Benjamin Chess, and John Schulman. Webgpt: Browser-assisted question-answering with human feedback.CoRR, abs/2112.09332, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021
[31]

GPT-4 Technical Report

OpenAI. GPT-4 technical report.CoRR, abs/2303.08774, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[32]

Introducing deep research

OpenAI. Introducing deep research. https://openai.com/index/ introducing-deep-research/, February 2025. Accessed: 2026-05-06

2025
[33]

Patil, Tianjun Zhang, Xin Wang, and Joseph E

Shishir G. Patil, Tianjun Zhang, Xin Wang, and Joseph E. Gonzalez. Gorilla: Large language model connected with massive apis. In Amir Globersons, Lester Mackey, Danielle Belgrave, Angela Fan, Ulrich Paquet, Jakub M. Tomczak, and Cheng Zhang, editors,Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Sy...

2024
[34]

Humanity's Last Exam

Long Phan, Alice Gatti, Ziwen Han, Nathaniel Li, Josephina Hu, Hugh Zhang, Chen Bo Calvin Zhang, Mohamed Shaaban, John Ling, Sean Shi, Michael Choi, Anish Agrawal, Arnav Chopra, Adam Khoja, Ryan Kim, Richard Ren, Jason Hausenloy, Oliver Zhang, Mantas Mazeika, Sum- mer Yue, Alexandr Wang, and Dan Hendrycks. Humanity’s last exam.CoRR, abs/2501.14249, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[35]

Learning to reason across parallel samples for LLM reasoning.CoRR, abs/2506.09014, 2025

Jianing Qi, Xi Ye, Hao Tang, Zhigang Zhu, and Eunsol Choi. Learning to reason across parallel samples for LLM reasoning.CoRR, abs/2506.09014, 2025. 12

work page arXiv 2025
[36]

Chatdev: Communicative agents for software development

Chen Qian, Wei Liu, Hongzhang Liu, Nuo Chen, Yufan Dang, Jiahao Li, Cheng Yang, Weize Chen, Yusheng Su, Xin Cong, Juyuan Xu, Dahai Li, Zhiyuan Liu, and Maosong Sun. Chatdev: Communicative agents for software development. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar, editors,Proceedings of the 62nd Annual Meeting of the Association for Computational Li...

2024
[37]

Scaling large language model- based multi-agent collaboration

Chen Qian, Zihao Xie, Yifei Wang, Wei Liu, Kunlun Zhu, Hanchen Xia, Yufan Dang, Zhuoyun Du, Weize Chen, Cheng Yang, Zhiyuan Liu, and Maosong Sun. Scaling large language model- based multi-agent collaboration. InThe Thirteenth International Conference on Learning Representations, ICLR 2025, Singapore, April 24-28, 2025. OpenReview.net, 2025

2025
[38]

Memobrain: Executive memory as an agentic brain for reasoning.CoRR, abs/2601.08079, 2026

Hongjin Qian, Zhao Cao, and Zheng Liu. Memobrain: Executive memory as an agentic brain for reasoning.CoRR, abs/2601.08079, 2026

work page arXiv 2026
[39]

Webresearcher: Unleashing unbounded reasoning capability in long-horizon agents.CoRR, abs/2509.13309, 2025

Zile Qiao, Guoxin Chen, Xuanzhong Chen, Donglei Yu, Wenbiao Yin, Xinyu Wang, Zhen Zhang, Baixuan Li, Huifeng Yin, Kuan Li, Rui Min, Minpeng Liao, Yong Jiang, Pengjun Xie, Fei Huang, and Jingren Zhou. Webresearcher: Unleashing unbounded reasoning capability in long-horizon agents.CoRR, abs/2509.13309, 2025

work page arXiv 2025
[40]

Toolllm: Facilitating large language models to master 16000+ real-world apis

Yujia Qin, Shihao Liang, Yining Ye, Kunlun Zhu, Lan Yan, Yaxi Lu, Yankai Lin, Xin Cong, Xiangru Tang, Bill Qian, Sihan Zhao, Lauren Hong, Runchu Tian, Ruobing Xie, Jie Zhou, Mark Gerstein, Dahai Li, Zhiyuan Liu, and Maosong Sun. Toolllm: Facilitating large language models to master 16000+ real-world apis. InThe Twelfth International Conference on Learning...

2024
[41]

Toolformer: Language models can teach themselves to use tools

Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Eric Hambro, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. Toolformer: Language models can teach themselves to use tools. In Alice Oh, Tristan Naumann, Amir Globerson, Kate Saenko, Moritz Hardt, and Sergey Levine, editors,Advances in Neural Information Processing Systems...

2023
[42]

Hugginggpt: Solving AI tasks with chatgpt and its friends in hugging face

Yongliang Shen, Kaitao Song, Xu Tan, Dongsheng Li, Weiming Lu, and Yueting Zhuang. Hugginggpt: Solving AI tasks with chatgpt and its friends in hugging face. In Alice Oh, Tristan Naumann, Amir Globerson, Kate Saenko, Moritz Hardt, and Sergey Levine, editors,Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Proce...

2023
[43]

Reflexion: language agents with verbal reinforcement learning

Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. Reflexion: language agents with verbal reinforcement learning. In Alice Oh, Tristan Naumann, Amir Globerson, Kate Saenko, Moritz Hardt, and Sergey Levine, editors,Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing System...

2023
[44]

Scaling LLM test-time com- pute optimally can be more effective than scaling parameters for reasoning

Charlie Victor Snell, Jaehoon Lee, Kelvin Xu, and Aviral Kumar. Scaling LLM test-time com- pute optimally can be more effective than scaling parameters for reasoning. InThe Thirteenth International Conference on Learning Representations, ICLR 2025, Singapore, April 24-28,

2025
[45]

OpenReview.net, 2025

2025
[46]

Scaling long-horizon LLM agent via context-folding.CoRR, abs/2510.11967, 2025

Weiwei Sun, Miao Lu, Zhan Ling, Kang Liu, Xuesong Yao, Yiming Yang, and Jiecao Chen. Scaling long-horizon LLM agent via context-folding.CoRR, abs/2510.11967, 2025

work page arXiv 2025
[47]

Memsifter: Offloading llm memory retrieval via outcome-driven proxy reasoning, 2026

Jiejun Tan, Zhicheng Dou, Liancheng Zhang, Yuyang Hu, Yiruo Cheng, and Ji-Rong Wen. Memsifter: Offloading llm memory retrieval via outcome-driven proxy reasoning, 2026

2026
[48]

Kimi K2.5: Visual Agentic Intelligence

Kimi Team. Kimi K2.5: visual agentic intelligence.CoRR, abs/2602.02276, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[49]

The Llama 3 Herd of Models

Llama Team. The llama 3 herd of models.CoRR, abs/2407.21783, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[50]

Qwen3 Technical Report

Qwen Team. Qwen3 technical report.CoRR, abs/2505.09388, 2025. 13

work page internal anchor Pith review Pith/arXiv arXiv 2025
[51]

V oyager: An open-ended embodied agent with large language models

Guanzhi Wang, Yuqi Xie, Yunfan Jiang, Ajay Mandlekar, Chaowei Xiao, Yuke Zhu, Linxi Fan, and Anima Anandkumar. V oyager: An open-ended embodied agent with large language models. Trans. Mach. Learn. Res., 2024, 2024

2024
[52]

Mixture-of-agents enhances large language model capabilities

Junlin Wang, Jue Wang, Ben Athiwaratkun, Ce Zhang, and James Zou. Mixture-of-agents enhances large language model capabilities. InThe Thirteenth International Conference on Learning Representations, ICLR 2025, Singapore, April 24-28, 2025. OpenReview.net, 2025

2025
[53]

Plan-and-solve prompting: Improving zero-shot chain-of-thought reasoning by large language models

Lei Wang, Wanyu Xu, Yihuai Lan, Zhiqiang Hu, Yunshi Lan, Roy Ka-Wei Lee, and Ee-Peng Lim. Plan-and-solve prompting: Improving zero-shot chain-of-thought reasoning by large language models. In Anna Rogers, Jordan L. Boyd-Graber, and Naoaki Okazaki, editors, Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: L...

2023
[54]

Executable code actions elicit better LLM agents

Xingyao Wang, Yangyi Chen, Lifan Yuan, Yizhe Zhang, Yunzhu Li, Hao Peng, and Heng Ji. Executable code actions elicit better LLM agents. In Ruslan Salakhutdinov, Zico Kolter, Katherine A. Heller, Adrian Weller, Nuria Oliver, Jonathan Scarlett, and Felix Berkenkamp, editors,Forty-first International Conference on Machine Learning, ICML 2024, Vienna, Austria...

2024
[55]

Le, Ed H

Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc V . Le, Ed H. Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. Self-consistency improves chain of thought reasoning in language models. InThe Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023. OpenReview.net, 2023

2023
[56]

BrowseComp: A Simple Yet Challenging Benchmark for Browsing Agents

Jason Wei, Zhiqing Sun, Spencer Papay, Scott McKinney, Jeffrey Han, Isa Fulford, Hyung Won Chung, Alex Tachard Passos, William Fedus, and Amelia Glaese. Browsecomp: A simple yet challenging benchmark for browsing agents.CoRR, abs/2504.12516, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[57]

Chi, Quoc V

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed H. Chi, Quoc V . Le, and Denny Zhou. Chain-of-thought prompting elicits reasoning in large language models. In Sanmi Koyejo, S. Mohamed, A. Agarwal, Danielle Belgrave, K. Cho, and A. Oh, editors,Advances in Neural Information Processing Systems 35: Annual Conference on Neura...

2022
[58]

Widesearch: Benchmarking agentic broad info-seeking.CoRR, abs/2508.07999, 2025

Ryan Wong, Jiawei Wang, Junjie Zhao, Li Chen, Yan Gao, Long Zhang, Xuan Zhou, Zuo Wang, Kai Xiang, Ge Zhang, Wenhao Huang, Yang Wang, and Ke Wang. Widesearch: Benchmarking agentic broad info-seeking.CoRR, abs/2508.07999, 2025

work page arXiv 2025
[59]

Webdancer: Towards autonomous information seeking agency.CoRR, abs/2505.22648, 2025

Jialong Wu, Baixuan Li, Runnan Fang, Wenbiao Yin, Liwen Zhang, Zhengwei Tao, Dingchu Zhang, Zekun Xi, Yong Jiang, Pengjun Xie, Fei Huang, and Jingren Zhou. Webdancer: Towards autonomous information seeking agency.CoRR, abs/2505.22648, 2025

work page arXiv 2025
[60]

AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation

Qingyun Wu, Gagan Bansal, Jieyu Zhang, Yiran Wu, Shaokun Zhang, Erkang Zhu, Beibin Li, Li Jiang, Xiaoyun Zhang, and Chi Wang. Autogen: Enabling next-gen LLM applications via multi-agent conversation framework.CoRR, abs/2308.08155, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[61]

Resum: Unlocking long-horizon search intelligence via context summarization.CoRR, abs/2509.13313, 2025

Xixi Wu, Kuan Li, Yida Zhao, Liwen Zhang, Litu Ou, Huifeng Yin, Zhongwang Zhang, Yong Jiang, Pengjun Xie, Fei Huang, Minhao Cheng, Shuai Wang, Hong Cheng, and Jingren Zhou. Resum: Unlocking long-horizon search intelligence via context summarization.CoRR, abs/2509.13313, 2025

work page arXiv 2025
[62]

A-MEM: Agentic Memory for LLM Agents

Wujiang Xu, Zujie Liang, Kai Mei, Hang Gao, Juntao Tan, and Yongfeng Zhang. A-MEM: agentic memory for LLM agents.CoRR, abs/2502.12110, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[63]

Tree of thoughts: Deliberate problem solving with large language models

Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Tom Griffiths, Yuan Cao, and Karthik Narasimhan. Tree of thoughts: Deliberate problem solving with large language models. In Alice Oh, Tristan Naumann, Amir Globerson, Kate Saenko, Moritz Hardt, and Sergey Levine, editors,Advances in Neural Information Processing Systems 36: Annual Conference on Neural Inf...

2023
[64]

Narasimhan, and Yuan Cao

Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik R. Narasimhan, and Yuan Cao. React: Synergizing reasoning and acting in language models. InThe Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5,

2023
[65]

OpenReview.net, 2023

2023
[66]

Agentfold: Long-horizon web agents with proactive context management

Rui Ye, Zhongwang Zhang, Kuan Li, Huifeng Yin, Zhengwei Tao, Yida Zhao, Liangcai Su, Liwen Zhang, Zile Qiao, Xinyu Wang, Pengjun Xie, Fei Huang, Siheng Chen, Jingren Zhou, and Yong Jiang. Agentfold: Long-horizon web agents with proactive context management. CoRR, abs/2510.24699, 2025

work page arXiv 2025
[67]

Aflow: Automating agentic workflow generation

Jiayi Zhang, Jinyu Xiang, Zhaoyang Yu, Fengwei Teng, Xionghui Chen, Jiaqi Chen, Mingchen Zhuge, Xin Cheng, Sirui Hong, Jinlin Wang, Bingnan Zheng, Bang Liu, Yuyu Luo, and Chenglin Wu. Aflow: Automating agentic workflow generation. InThe Thirteenth International Confer- ence on Learning Representations, ICLR 2025, Singapore, April 24-28, 2025. OpenReview.net, 2025

2025
[68]

The majority is not always right: RL training for solution aggregation.CoRR, abs/2509.06870, 2025

Wenting Zhao, Pranjal Aggarwal, Swarnadeep Saha, Asli Celikyilmaz, Jason Weston, and Ilia Kulikov. The majority is not always right: RL training for solution aggregation.CoRR, abs/2509.06870, 2025

work page arXiv 2025
[69]

Deepresearcher: Scaling deep research via reinforcement learning in real-world environments

Yuxiang Zheng, Dayuan Fu, Xiangkun Hu, Xiaojie Cai, Lyumanshan Ye, Pengrui Lu, and Pengfei Liu. Deepresearcher: Scaling deep research via reinforcement learning in real-world environments. In Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, and Violet Peng, editors,Proceedings of the 2025 Conference on Empirical Methods in Natural Language P...

2025
[70]

Failure Map

Mingchen Zhuge, Wenyi Wang, Louis Kirsch, Francesco Faccio, Dmitrii Khizbullin, and Jürgen Schmidhuber. Gptswarm: Language agents as optimizable graphs. In Ruslan Salakhutdinov, Zico Kolter, Katherine A. Heller, Adrian Weller, Nuria Oliver, Jonathan Scarlett, and Felix Berkenkamp, editors,Forty-first International Conference on Machine Learning, ICML 2024...

2024
[71]

Built in 1800s -- convict labor began late 1870s (could count as 1800s, even if main building was 1932)

1932
[72]

City has a university --Sugar Land doesn’t have a university, but Texas A&M is nearby (∼30 mi)
[73]

Enrollment 75k-80k -- Wikipedia showed 74,829 / 77,491 (close to range)
[74]

Two wars --unclear but suggested
[75]

some sources say 111k-118k, official city says 84k-87k (conflicting)

City population 100k-125k -- “some sources say 111k-118k, official city says 84k-87k (conflicting)”
[76]

conflicting

5-year increase 5k-10k -- “conflicting.” Exact Answer:Texas Prison System Central State Farm Main Building (Central Unit), Sugar Land, Texas Confidence: 65%(gold answer: Fort Henry) Take-away.Shared memory faithfully recorded both the unique positive lead and the disqualifying evidence, but the compressed summary made the local uniqueness of one clue more...
[77]

The user’s goal, constraints, and preferences
[78]

Key facts established during the conversation
[79]

Tools used and the most important results from them
[80]

Partial conclusions, promising leads, and failed approaches

Showing first 80 references.

[1] [1]

Bradley C. A. Brown, Jordan Juravsky, Ryan Ehrlich, Ronald Clark, Quoc V . Le, Christopher Ré, and Azalia Mirhoseini. Large language monkeys: Scaling inference compute with repeated sampling.CoRR, abs/2407.21787, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[2] [2]

Chang, Andrew Drozdov, Shubham Toshniwal, Owen Oertell, Alexander Trott, Jacob P

Jonathan D. Chang, Andrew Drozdov, Shubham Toshniwal, Owen Oertell, Alexander Trott, Jacob P. Portes, Abhay Gupta, Pallavi Koppol, Ashutosh Baheti, Sean Kulinski, Ivan Zhou, Irene Dea, Krista Opsahl-Ong, Simon Favreau-Lessard, Sean Owen, Jose Javier Gonzalez Ortiz, Arnav Singhvi, Xabi Andrade, Cindy Wang, Kartik Sreenivasan, Sam Havens, Jialu Liu, Peyton ...

work page arXiv 2026

[3] [3]

Iterresearch: Rethinking long-horizon agents with interaction scaling, 2026

Guoxin Chen, Zile Qiao, Xuanzhong Chen, Donglei Yu, Haotian Xu, Wayne Xin Zhao, Ruihua Song, Wenbiao Yin, Huifeng Yin, Liwen Zhang, Kuan Li, Minpeng Liao, Yong Jiang, Pengjun Xie, Fei Huang, and Jingren Zhou. Iterresearch: Rethinking long-horizon agents with interaction scaling, 2026

2026

[4] [4]

Reconcile: Round-table conference improves reasoning via consensus among diverse llms

Justin Chih-Yao Chen, Swarnadeep Saha, and Mohit Bansal. Reconcile: Round-table conference improves reasoning via consensus among diverse llms. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar, editors,Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2024, Bangkok, Thailand, August 11-16,...

2024

[5] [5]

Agentverse: Facilitating multi-agent collaboration and exploring emergent behaviors

Weize Chen, Yusheng Su, Jingwei Zuo, Cheng Yang, Chenfei Yuan, Chi-Min Chan, Heyang Yu, Yaxi Lu, Yi-Hsin Hung, Chen Qian, Yujia Qin, Xin Cong, Ruobing Xie, Zhiyuan Liu, Maosong Sun, and Jie Zhou. Agentverse: Facilitating multi-agent collaboration and exploring emergent behaviors. InThe Twelfth International Conference on Learning Representations, ICLR 202...

2024

[6] [6]

Mem0: Building production-ready AI agents with scalable long-term memory

Prateek Chhikara, Dev Khant, Saket Aryan, Taranjeet Singh, and Deshraj Yadav. Mem0: Building production-ready AI agents with scalable long-term memory. In Inês Lynce, Nello Murano, Mauro Vallati, Serena Villata, Federico Chesani, Michela Milano, Andrea Omicini, and Mehdi Dastani, editors,ECAI 2025 - 28th European Conference on Artificial Intelligence, 25-...

2025

[7] [7]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

DeepSeek-AI. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning.CoRR, abs/2501.12948, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[8] [8]

Tenenbaum, and Igor Mordatch

Yilun Du, Shuang Li, Antonio Torralba, Joshua B. Tenenbaum, and Igor Mordatch. Improving factuality and reasoning in language models through multiagent debate. In Ruslan Salakhutdinov, Zico Kolter, Katherine A. Heller, Adrian Weller, Nuria Oliver, Jonathan Scarlett, and Felix Berkenkamp, editors,Forty-first International Conference on Machine Learning, IC...

2024

[9] [9]

LightMem: Lightweight and Efficient Memory-Augmented Generation

Jizhan Fang, Xinle Deng, Haoming Xu, Ziyan Jiang, Yuqi Tang, Ziwen Xu, Shumin Deng, Yunzhi Yao, Mengru Wang, Shuofei Qiao, Huajun Chen, and Ningyu Zhang. Lightmem: Lightweight and efficient memory-augmented generation.CoRR, abs/2510.18866, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[10] [10]

Agentswing: Adap- tive parallel context management routing for long-horizon web agents.CoRR, abs/2603.27490, 2026

Zhaopeng Feng, Liangcai Su, Zhen Zhang, Xinyu Wang, Xiaotian Zhang, Xiaobin Wang, Run- nan Fang, Qi Zhang, Baixuan Li, Shihao Cai, Rui Ye, Hui Chen, Yong Jiang, Joey Tianyi Zhou, Chenxiong Qian, Pengjun Xie, Bryan Hooi, Zuozhu Liu, and Jingren Zhou. Agentswing: Adap- tive parallel context management routing for long-horizon web agents.CoRR, abs/2603.27490, 2026

work page arXiv 2026

[11] [11]

Metagpt: Meta programming for A multi-agent collaborative framework

Sirui Hong, Mingchen Zhuge, Jonathan Chen, Xiawu Zheng, Yuheng Cheng, Jinlin Wang, Ceyao Zhang, Zili Wang, Steven Ka Shing Yau, Zijuan Lin, Liyang Zhou, Chenyu Ran, Lingfeng Xiao, Chenglin Wu, and Jürgen Schmidhuber. Metagpt: Meta programming for A multi-agent collaborative framework. InThe Twelfth International Conference on Learning Representations, ICL...

2024

[12] [12]

Memory matters more: Event-centric memory as a logic map for agent searching and reasoning.CoRR, abs/2601.04726, 2026

Yuyang Hu, Jiongnan Liu, Jiejun Tan, Yutao Zhu, and Zhicheng Dou. Memory matters more: Event-centric memory as a logic map for agent searching and reasoning.CoRR, abs/2601.04726, 2026

work page arXiv 2026

[13] [13]

Memory in the Age of AI Agents

Yuyang Hu, Shichun Liu, Yanwei Yue, Guibin Zhang, Boyang Liu, Fangyi Zhu, Jiahang Lin, Honglin Guo, Shihan Dou, Zhiheng Xi, Senjie Jin, Jiejun Tan, Yanbin Yin, Jiongnan Liu, Zeyu Zhang, Zhongxiang Sun, Yutao Zhu, Hao Sun, Boci Peng, Zhenrong Cheng, Xuanbo Fan, Jiaxin Guo, Xinlei Yu, Zhenhong Zhou, Zewen Hu, Jiahao Huo, Junhao Wang, Yuwei Niu, Yu Wang, Zhe...

work page internal anchor Pith review Pith/arXiv arXiv 2025

[14] [14]

AgentCoder: Multi-Agent-based Code Generation with Iterative Testing and Optimisation

Dong Huang, Qingwen Bu, Jie M. Zhang, Michael Luck, and Heming Cui. Agentcoder: Multi- agent-based code generation with iterative testing and optimisation.CoRR, abs/2312.13010, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[15] [15]

Search-R1: Training LLMs to Reason and Leverage Search Engines with Reinforcement Learning

Bowen Jin, Hansi Zeng, Zhenrui Yue, Dong Wang, Hamed Zamani, and Jiawei Han. Search- r1: Training llms to reason and leverage search engines with reinforcement learning.CoRR, abs/2503.09516, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[16] [16]

Flashrag: A modular toolkit for efficient retrieval-augmented generation research

Jiajie Jin, Yutao Zhu, Zhicheng Dou, Guanting Dong, Xinyu Yang, Chenghao Zhang, Tong Zhao, Zhao Yang, and Ji-Rong Wen. Flashrag: A modular toolkit for efficient retrieval-augmented generation research. In Guodong Long, Michale Blumestein, Yi Chang, Liane Lewin-Eytan, Zi Helen Huang, and Elad Yom-Tov, editors,Companion Proceedings of the ACM on Web Confere...

2025

[17] [17]

Large language models are zero-shot reasoners

Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. Large language models are zero-shot reasoners. In Sanmi Koyejo, S. Mohamed, A. Agarwal, Danielle Belgrave, K. Cho, and A. Oh, editors,Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, NeurIPS 2022, New Orl...

2022

[18] [18]

Agentic aggregation for parallel scaling of long-horizon agentic tasks, 2026

Yoonsang Lee, Howard Yen, Xi Ye, and Danqi Chen. Agentic aggregation for parallel scaling of long-horizon agentic tasks, 2026

2026

[19] [19]

Tongyi DeepResearch Technical Report

Baixuan Li, Bo Zhang, Dingchu Zhang, Fei Huang, Guangyu Li, Guoxin Chen, Huifeng Yin, Jialong Wu, Jingren Zhou, Kuan Li, Liangcai Su, Litu Ou, Liwen Zhang, Pengjun Xie, Rui Ye, Wenbiao Yin, Xinmiao Yu, Xinyu Wang, Xixi Wu, Xuanzhong Chen, Yida Zhao, Zhen Zhang, Zhengwei Tao, Zhongwang Zhang, Zile Qiao, Chenxi Wang, Donglei Yu, Gang Fu, Haiyang Shen, Jiayi...

work page internal anchor Pith review Pith/arXiv arXiv 2025

[20] [20]

Parallelmuse: Agentic parallel thinking for deep information seeking.CoRR, abs/2510.24698, 2025

Baixuan Li, Dingchu Zhang, Jialong Wu, Wenbiao Yin, Zhengwei Tao, Yida Zhao, Liwen Zhang, Haiyang Shen, Runnan Fang, Pengjun Xie, Jingren Zhou, and Yong Jiang. Parallelmuse: Agentic parallel thinking for deep information seeking.CoRR, abs/2510.24698, 2025

work page arXiv 2025

[21] [21]

CAMEL: communicative agents for "mind" exploration of large language model society

Guohao Li, Hasan Hammoud, Hani Itani, Dmitrii Khizbullin, and Bernard Ghanem. CAMEL: communicative agents for "mind" exploration of large language model society. In Alice Oh, Tristan Naumann, Amir Globerson, Kate Saenko, Moritz Hardt, and Sergey Levine, editors, Advances in Neural Information Processing Systems 36: Annual Conference on Neural Infor- matio...

2023

[22] [22]

More agents is all you need

Junyou Li, Qin Zhang, Yangbin Yu, Qiang Fu, and Deheng Ye. More agents is all you need. Trans. Mach. Learn. Res., 2024, 2024. 11

2024

[23] [23]

WebSailor: Navigating Super-human Reasoning for Web Agent

Kuan Li, Zhongwang Zhang, Huifeng Yin, Liwen Zhang, Litu Ou, Jialong Wu, Wenbiao Yin, Baixuan Li, Zhengwei Tao, Xinyu Wang, Weizhou Shen, Junkai Zhang, Dingchu Zhang, Xixi Wu, Yong Jiang, Ming Yan, Pengjun Xie, Fei Huang, and Jingren Zhou. Websailor: Navigating super-human reasoning for web agent.CoRR, abs/2507.02592, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[24] [24]

Search-o1: Agentic search-enhanced large reasoning models

Xiaoxi Li, Guanting Dong, Jiajie Jin, Yuyao Zhang, Yujia Zhou, Yutao Zhu, Peitian Zhang, and Zhicheng Dou. Search-o1: Agentic search-enhanced large reasoning models. In Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, and Violet Peng, editors,Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, EMNLP 2025, ...

2025

[25] [25]

WebThinker: Empowering Large Reasoning Models with Deep Research Capability

Xiaoxi Li, Jiajie Jin, Guanting Dong, Hongjin Qian, Yutao Zhu, Yongkang Wu, Ji-Rong Wen, and Zhicheng Dou. Webthinker: Empowering large reasoning models with deep research capability.CoRR, abs/2504.21776, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[26] [26]

Encouraging divergent thinking in large language models through multi- agent debate

Tian Liang, Zhiwei He, Wenxiang Jiao, Xing Wang, Yan Wang, Rui Wang, Yujiu Yang, Shuming Shi, and Zhaopeng Tu. Encouraging divergent thinking in large language models through multi- agent debate. In Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen, editors,Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, EMNLP 2024,...

2024

[27] [27]

A dynamic llm-powered agent network for task-oriented agent collaboration, 2024

Zijun Liu, Yanzhe Zhang, Peng Li, Yang Liu, and Diyi Yang. A dynamic llm-powered agent network for task-oriented agent collaboration, 2024

2024

[28] [28]

Courier Corporation, 1987

Alfred Mann.The Study of Fugue. Courier Corporation, 1987

1987

[29] [29]

Candès, and Tatsunori Hashimoto

Niklas Muennighoff, Zitong Yang, Weijia Shi, Xiang Lisa Li, Li Fei-Fei, Hannaneh Hajishirzi, Luke Zettlemoyer, Percy Liang, Emmanuel J. Candès, and Tatsunori Hashimoto. s1: Simple test-time scaling. In Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, and Violet Peng, editors,Proceedings of the 2025 Conference on Empirical Methods in Natural ...

2025

[30] [30]

WebGPT: Browser-assisted question-answering with human feedback

Reiichiro Nakano, Jacob Hilton, Suchir Balaji, Jeff Wu, Long Ouyang, Christina Kim, Christo- pher Hesse, Shantanu Jain, Vineet Kosaraju, William Saunders, Xu Jiang, Karl Cobbe, Tyna Eloundou, Gretchen Krueger, Kevin Button, Matthew Knight, Benjamin Chess, and John Schulman. Webgpt: Browser-assisted question-answering with human feedback.CoRR, abs/2112.09332, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021

[31] [31]

GPT-4 Technical Report

OpenAI. GPT-4 technical report.CoRR, abs/2303.08774, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[32] [32]

Introducing deep research

OpenAI. Introducing deep research. https://openai.com/index/ introducing-deep-research/, February 2025. Accessed: 2026-05-06

2025

[33] [33]

Patil, Tianjun Zhang, Xin Wang, and Joseph E

Shishir G. Patil, Tianjun Zhang, Xin Wang, and Joseph E. Gonzalez. Gorilla: Large language model connected with massive apis. In Amir Globersons, Lester Mackey, Danielle Belgrave, Angela Fan, Ulrich Paquet, Jakub M. Tomczak, and Cheng Zhang, editors,Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Sy...

2024

[34] [34]

Humanity's Last Exam

Long Phan, Alice Gatti, Ziwen Han, Nathaniel Li, Josephina Hu, Hugh Zhang, Chen Bo Calvin Zhang, Mohamed Shaaban, John Ling, Sean Shi, Michael Choi, Anish Agrawal, Arnav Chopra, Adam Khoja, Ryan Kim, Richard Ren, Jason Hausenloy, Oliver Zhang, Mantas Mazeika, Sum- mer Yue, Alexandr Wang, and Dan Hendrycks. Humanity’s last exam.CoRR, abs/2501.14249, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[35] [35]

Learning to reason across parallel samples for LLM reasoning.CoRR, abs/2506.09014, 2025

Jianing Qi, Xi Ye, Hao Tang, Zhigang Zhu, and Eunsol Choi. Learning to reason across parallel samples for LLM reasoning.CoRR, abs/2506.09014, 2025. 12

work page arXiv 2025

[36] [36]

Chatdev: Communicative agents for software development

Chen Qian, Wei Liu, Hongzhang Liu, Nuo Chen, Yufan Dang, Jiahao Li, Cheng Yang, Weize Chen, Yusheng Su, Xin Cong, Juyuan Xu, Dahai Li, Zhiyuan Liu, and Maosong Sun. Chatdev: Communicative agents for software development. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar, editors,Proceedings of the 62nd Annual Meeting of the Association for Computational Li...

2024

[37] [37]

Scaling large language model- based multi-agent collaboration

Chen Qian, Zihao Xie, Yifei Wang, Wei Liu, Kunlun Zhu, Hanchen Xia, Yufan Dang, Zhuoyun Du, Weize Chen, Cheng Yang, Zhiyuan Liu, and Maosong Sun. Scaling large language model- based multi-agent collaboration. InThe Thirteenth International Conference on Learning Representations, ICLR 2025, Singapore, April 24-28, 2025. OpenReview.net, 2025

2025

[38] [38]

Memobrain: Executive memory as an agentic brain for reasoning.CoRR, abs/2601.08079, 2026

Hongjin Qian, Zhao Cao, and Zheng Liu. Memobrain: Executive memory as an agentic brain for reasoning.CoRR, abs/2601.08079, 2026

work page arXiv 2026

[39] [39]

Webresearcher: Unleashing unbounded reasoning capability in long-horizon agents.CoRR, abs/2509.13309, 2025

Zile Qiao, Guoxin Chen, Xuanzhong Chen, Donglei Yu, Wenbiao Yin, Xinyu Wang, Zhen Zhang, Baixuan Li, Huifeng Yin, Kuan Li, Rui Min, Minpeng Liao, Yong Jiang, Pengjun Xie, Fei Huang, and Jingren Zhou. Webresearcher: Unleashing unbounded reasoning capability in long-horizon agents.CoRR, abs/2509.13309, 2025

work page arXiv 2025

[40] [40]

Toolllm: Facilitating large language models to master 16000+ real-world apis

Yujia Qin, Shihao Liang, Yining Ye, Kunlun Zhu, Lan Yan, Yaxi Lu, Yankai Lin, Xin Cong, Xiangru Tang, Bill Qian, Sihan Zhao, Lauren Hong, Runchu Tian, Ruobing Xie, Jie Zhou, Mark Gerstein, Dahai Li, Zhiyuan Liu, and Maosong Sun. Toolllm: Facilitating large language models to master 16000+ real-world apis. InThe Twelfth International Conference on Learning...

2024

[41] [41]

Toolformer: Language models can teach themselves to use tools

Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Eric Hambro, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. Toolformer: Language models can teach themselves to use tools. In Alice Oh, Tristan Naumann, Amir Globerson, Kate Saenko, Moritz Hardt, and Sergey Levine, editors,Advances in Neural Information Processing Systems...

2023

[42] [42]

Hugginggpt: Solving AI tasks with chatgpt and its friends in hugging face

Yongliang Shen, Kaitao Song, Xu Tan, Dongsheng Li, Weiming Lu, and Yueting Zhuang. Hugginggpt: Solving AI tasks with chatgpt and its friends in hugging face. In Alice Oh, Tristan Naumann, Amir Globerson, Kate Saenko, Moritz Hardt, and Sergey Levine, editors,Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Proce...

2023

[43] [43]

Reflexion: language agents with verbal reinforcement learning

Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. Reflexion: language agents with verbal reinforcement learning. In Alice Oh, Tristan Naumann, Amir Globerson, Kate Saenko, Moritz Hardt, and Sergey Levine, editors,Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing System...

2023

[44] [44]

Scaling LLM test-time com- pute optimally can be more effective than scaling parameters for reasoning

Charlie Victor Snell, Jaehoon Lee, Kelvin Xu, and Aviral Kumar. Scaling LLM test-time com- pute optimally can be more effective than scaling parameters for reasoning. InThe Thirteenth International Conference on Learning Representations, ICLR 2025, Singapore, April 24-28,

2025

[45] [45]

OpenReview.net, 2025

2025

[46] [46]

Scaling long-horizon LLM agent via context-folding.CoRR, abs/2510.11967, 2025

Weiwei Sun, Miao Lu, Zhan Ling, Kang Liu, Xuesong Yao, Yiming Yang, and Jiecao Chen. Scaling long-horizon LLM agent via context-folding.CoRR, abs/2510.11967, 2025

work page arXiv 2025

[47] [47]

Memsifter: Offloading llm memory retrieval via outcome-driven proxy reasoning, 2026

Jiejun Tan, Zhicheng Dou, Liancheng Zhang, Yuyang Hu, Yiruo Cheng, and Ji-Rong Wen. Memsifter: Offloading llm memory retrieval via outcome-driven proxy reasoning, 2026

2026

[48] [48]

Kimi K2.5: Visual Agentic Intelligence

Kimi Team. Kimi K2.5: visual agentic intelligence.CoRR, abs/2602.02276, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[49] [49]

The Llama 3 Herd of Models

Llama Team. The llama 3 herd of models.CoRR, abs/2407.21783, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[50] [50]

Qwen3 Technical Report

Qwen Team. Qwen3 technical report.CoRR, abs/2505.09388, 2025. 13

work page internal anchor Pith review Pith/arXiv arXiv 2025

[51] [51]

V oyager: An open-ended embodied agent with large language models

Guanzhi Wang, Yuqi Xie, Yunfan Jiang, Ajay Mandlekar, Chaowei Xiao, Yuke Zhu, Linxi Fan, and Anima Anandkumar. V oyager: An open-ended embodied agent with large language models. Trans. Mach. Learn. Res., 2024, 2024

2024

[52] [52]

Mixture-of-agents enhances large language model capabilities

Junlin Wang, Jue Wang, Ben Athiwaratkun, Ce Zhang, and James Zou. Mixture-of-agents enhances large language model capabilities. InThe Thirteenth International Conference on Learning Representations, ICLR 2025, Singapore, April 24-28, 2025. OpenReview.net, 2025

2025

[53] [53]

Plan-and-solve prompting: Improving zero-shot chain-of-thought reasoning by large language models

Lei Wang, Wanyu Xu, Yihuai Lan, Zhiqiang Hu, Yunshi Lan, Roy Ka-Wei Lee, and Ee-Peng Lim. Plan-and-solve prompting: Improving zero-shot chain-of-thought reasoning by large language models. In Anna Rogers, Jordan L. Boyd-Graber, and Naoaki Okazaki, editors, Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: L...

2023

[54] [54]

Executable code actions elicit better LLM agents

Xingyao Wang, Yangyi Chen, Lifan Yuan, Yizhe Zhang, Yunzhu Li, Hao Peng, and Heng Ji. Executable code actions elicit better LLM agents. In Ruslan Salakhutdinov, Zico Kolter, Katherine A. Heller, Adrian Weller, Nuria Oliver, Jonathan Scarlett, and Felix Berkenkamp, editors,Forty-first International Conference on Machine Learning, ICML 2024, Vienna, Austria...

2024

[55] [55]

Le, Ed H

Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc V . Le, Ed H. Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. Self-consistency improves chain of thought reasoning in language models. InThe Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023. OpenReview.net, 2023

2023

[56] [56]

BrowseComp: A Simple Yet Challenging Benchmark for Browsing Agents

Jason Wei, Zhiqing Sun, Spencer Papay, Scott McKinney, Jeffrey Han, Isa Fulford, Hyung Won Chung, Alex Tachard Passos, William Fedus, and Amelia Glaese. Browsecomp: A simple yet challenging benchmark for browsing agents.CoRR, abs/2504.12516, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[57] [57]

Chi, Quoc V

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed H. Chi, Quoc V . Le, and Denny Zhou. Chain-of-thought prompting elicits reasoning in large language models. In Sanmi Koyejo, S. Mohamed, A. Agarwal, Danielle Belgrave, K. Cho, and A. Oh, editors,Advances in Neural Information Processing Systems 35: Annual Conference on Neura...

2022

[58] [58]

Widesearch: Benchmarking agentic broad info-seeking.CoRR, abs/2508.07999, 2025

Ryan Wong, Jiawei Wang, Junjie Zhao, Li Chen, Yan Gao, Long Zhang, Xuan Zhou, Zuo Wang, Kai Xiang, Ge Zhang, Wenhao Huang, Yang Wang, and Ke Wang. Widesearch: Benchmarking agentic broad info-seeking.CoRR, abs/2508.07999, 2025

work page arXiv 2025

[59] [59]

Webdancer: Towards autonomous information seeking agency.CoRR, abs/2505.22648, 2025

Jialong Wu, Baixuan Li, Runnan Fang, Wenbiao Yin, Liwen Zhang, Zhengwei Tao, Dingchu Zhang, Zekun Xi, Yong Jiang, Pengjun Xie, Fei Huang, and Jingren Zhou. Webdancer: Towards autonomous information seeking agency.CoRR, abs/2505.22648, 2025

work page arXiv 2025

[60] [60]

AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation

Qingyun Wu, Gagan Bansal, Jieyu Zhang, Yiran Wu, Shaokun Zhang, Erkang Zhu, Beibin Li, Li Jiang, Xiaoyun Zhang, and Chi Wang. Autogen: Enabling next-gen LLM applications via multi-agent conversation framework.CoRR, abs/2308.08155, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[61] [61]

Resum: Unlocking long-horizon search intelligence via context summarization.CoRR, abs/2509.13313, 2025

Xixi Wu, Kuan Li, Yida Zhao, Liwen Zhang, Litu Ou, Huifeng Yin, Zhongwang Zhang, Yong Jiang, Pengjun Xie, Fei Huang, Minhao Cheng, Shuai Wang, Hong Cheng, and Jingren Zhou. Resum: Unlocking long-horizon search intelligence via context summarization.CoRR, abs/2509.13313, 2025

work page arXiv 2025

[62] [62]

A-MEM: Agentic Memory for LLM Agents

Wujiang Xu, Zujie Liang, Kai Mei, Hang Gao, Juntao Tan, and Yongfeng Zhang. A-MEM: agentic memory for LLM agents.CoRR, abs/2502.12110, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[63] [63]

Tree of thoughts: Deliberate problem solving with large language models

Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Tom Griffiths, Yuan Cao, and Karthik Narasimhan. Tree of thoughts: Deliberate problem solving with large language models. In Alice Oh, Tristan Naumann, Amir Globerson, Kate Saenko, Moritz Hardt, and Sergey Levine, editors,Advances in Neural Information Processing Systems 36: Annual Conference on Neural Inf...

2023

[64] [64]

Narasimhan, and Yuan Cao

Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik R. Narasimhan, and Yuan Cao. React: Synergizing reasoning and acting in language models. InThe Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5,

2023

[65] [65]

OpenReview.net, 2023

2023

[66] [66]

Agentfold: Long-horizon web agents with proactive context management

Rui Ye, Zhongwang Zhang, Kuan Li, Huifeng Yin, Zhengwei Tao, Yida Zhao, Liangcai Su, Liwen Zhang, Zile Qiao, Xinyu Wang, Pengjun Xie, Fei Huang, Siheng Chen, Jingren Zhou, and Yong Jiang. Agentfold: Long-horizon web agents with proactive context management. CoRR, abs/2510.24699, 2025

work page arXiv 2025

[67] [67]

Aflow: Automating agentic workflow generation

Jiayi Zhang, Jinyu Xiang, Zhaoyang Yu, Fengwei Teng, Xionghui Chen, Jiaqi Chen, Mingchen Zhuge, Xin Cheng, Sirui Hong, Jinlin Wang, Bingnan Zheng, Bang Liu, Yuyu Luo, and Chenglin Wu. Aflow: Automating agentic workflow generation. InThe Thirteenth International Confer- ence on Learning Representations, ICLR 2025, Singapore, April 24-28, 2025. OpenReview.net, 2025

2025

[68] [68]

The majority is not always right: RL training for solution aggregation.CoRR, abs/2509.06870, 2025

Wenting Zhao, Pranjal Aggarwal, Swarnadeep Saha, Asli Celikyilmaz, Jason Weston, and Ilia Kulikov. The majority is not always right: RL training for solution aggregation.CoRR, abs/2509.06870, 2025

work page arXiv 2025

[69] [69]

Deepresearcher: Scaling deep research via reinforcement learning in real-world environments

Yuxiang Zheng, Dayuan Fu, Xiangkun Hu, Xiaojie Cai, Lyumanshan Ye, Pengrui Lu, and Pengfei Liu. Deepresearcher: Scaling deep research via reinforcement learning in real-world environments. In Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, and Violet Peng, editors,Proceedings of the 2025 Conference on Empirical Methods in Natural Language P...

2025

[70] [70]

Failure Map

Mingchen Zhuge, Wenyi Wang, Louis Kirsch, Francesco Faccio, Dmitrii Khizbullin, and Jürgen Schmidhuber. Gptswarm: Language agents as optimizable graphs. In Ruslan Salakhutdinov, Zico Kolter, Katherine A. Heller, Adrian Weller, Nuria Oliver, Jonathan Scarlett, and Felix Berkenkamp, editors,Forty-first International Conference on Machine Learning, ICML 2024...

2024

[71] [71]

Built in 1800s -- convict labor began late 1870s (could count as 1800s, even if main building was 1932)

1932

[72] [72]

City has a university --Sugar Land doesn’t have a university, but Texas A&M is nearby (∼30 mi)

[73] [73]

Enrollment 75k-80k -- Wikipedia showed 74,829 / 77,491 (close to range)

[74] [74]

Two wars --unclear but suggested

[75] [75]

some sources say 111k-118k, official city says 84k-87k (conflicting)

City population 100k-125k -- “some sources say 111k-118k, official city says 84k-87k (conflicting)”

[76] [76]

conflicting

5-year increase 5k-10k -- “conflicting.” Exact Answer:Texas Prison System Central State Farm Main Building (Central Unit), Sugar Land, Texas Confidence: 65%(gold answer: Fort Henry) Take-away.Shared memory faithfully recorded both the unique positive lead and the disqualifying evidence, but the compressed summary made the local uniqueness of one clue more...

[77] [77]

The user’s goal, constraints, and preferences

[78] [78]

Key facts established during the conversation

[79] [79]

Tools used and the most important results from them

[80] [80]

Partial conclusions, promising leads, and failed approaches