Self-Evolving Deep Research via Joint Generation and Evaluation

Chengkun Cai; Han Zhu; Sirui Han; Xing Chen; Yike Guo; Yuanfeng Song

arxiv: 2606.04507 · v1 · pith:L5H73GYSnew · submitted 2026-06-03 · 💻 cs.CL · cs.AI

Han Zhu, Chengkun Cai, Yuanfeng Song, Xing Chen, Sirui Han, Yike Guo

Pith reviewed 2026-06-28 06:28 UTC · model grok-4.3

keywords: self-evolving framework, co-evolutionary training, deep research generation, shared-parameter model, meta-harness, LLM evaluation, report generation

The pith

A shared-parameter model jointly evolves a research solver and its evaluator to sustain optimization on open-ended tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tries to establish that coupling generation and evaluation inside one model, with a meta-harness that raises evaluation difficulty as the solver improves, produces better deep-research reports than static evaluators. Static judges saturate because they cannot raise their standards when the solver gets stronger, and reward signals for research reports lack verifiable ground truth. The co-evolutionary loop lets the evaluator and solver improve together while the harness prevents collapse into trivial or biased criteria. A reader would care because this supplies a concrete route to train agents on tasks where human rubrics or fixed judges quickly stop providing useful gradients.

Core claim

SCORE is a self-evolving co-evolutionary training framework that places an evaluator and a solver inside the same parameter set so they improve jointly; a meta-harness then dynamically adjusts the evaluation environment according to the solver’s current performance, pushing the evaluator toward valid dimensions and deeper search. Experiments on deep-research benchmarks show consistent gains in report quality, demonstrating that co-evolving evaluation and generation supplies sustained optimization pressure where isolated modules plateau.
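
A minimal sketch of how a performance-conditioned harness might tighten the evaluation environment, offered as an illustration and not as the paper's implementation. The abstract gives no mechanism, so every name here (`HarnessState`, `update_harness`, `num_dimensions`, `search_depth`, `raise_threshold`) and the threshold rule itself are assumptions.

```python
from dataclasses import dataclass


@dataclass
class HarnessState:
    """Hypothetical evaluation-environment settings controlled by the harness."""
    num_dimensions: int = 3  # rubric dimensions the evaluator must propose
    search_depth: int = 1    # evidence-gathering steps the evaluator may take


def update_harness(state: HarnessState, solver_scores: list[float],
                   raise_threshold: float = 0.8,
                   max_dimensions: int = 10, max_depth: int = 5) -> HarnessState:
    """Tighten the evaluation environment once the solver scores well on average.

    Assumption: scores lie in [0, 1] and a high batch mean signals that the
    current standard no longer separates good reports from weak ones.
    """
    if not solver_scores:
        return state
    mean_score = sum(solver_scores) / len(solver_scores)
    if mean_score < raise_threshold:
        return state  # the current standard still provides pressure
    return HarnessState(
        num_dimensions=min(state.num_dimensions + 1, max_dimensions),
        search_depth=min(state.search_depth + 1, max_depth),
    )
```

Under these assumptions, a batch averaging 0.875 against a threshold of 0.8 would move the state from 3 dimensions and depth 1 to 4 dimensions and depth 2, while a batch averaging 0.55 would leave it unchanged. The caps on dimensions and depth are one possible guard against unbounded growth; the paper may use a different rule entirely.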

What carries the argument

The SCORE shared-parameter co-evolutionary loop controlled by a meta-harness that scales evaluation depth with solver performance.

If this is right

Report generation quality rises steadily instead of saturating.
Evaluation rubrics adapt automatically to the solver’s current level.
The same model can serve as both generator and judge without separate training runs.
Open-ended research agents become trainable without hand-crafted reward functions.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The method could transfer to other subjective-output domains such as hypothesis generation or long-form creative writing.
Shared parameters may reduce the usual misalignment between a separate judge and the generator it scores.
If the harness tuning is insufficient, the loop risks rewarding superficial improvements that the evaluator itself learns to accept.

Load-bearing premise

The meta-harness can keep raising evaluation standards in response to solver gains without letting the shared model settle on trivial or biased criteria.

What would settle it

Train the same base model once with the full SCORE loop and once with a frozen static evaluator; if final report quality is statistically indistinguishable, the co-evolutionary claim does not hold.
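
One way to make "statistically indistinguishable" concrete is a paired sign-flip permutation test on per-query report scores from the two runs. This is a generic check, not a protocol taken from the paper; the function name, the per-query score lists, and the choice of test are all assumptions.

```python
import random


def paired_permutation_pvalue(scores_a, scores_b, num_resamples=10_000, seed=0):
    """Two-sided sign-flip permutation test on per-query score differences.

    scores_a and scores_b are hypothetical per-query quality scores for the
    SCORE-trained model and the static-evaluator baseline, in the same order.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("both runs must be scored on the same queries")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = abs(sum(diffs))
    rng = random.Random(seed)
    hits = 0
    for _ in range(num_resamples):
        # Under the null hypothesis the sign of each paired difference is arbitrary.
        flipped = sum(d if rng.random() < 0.5 else -d for d in diffs)
        if abs(flipped) >= observed:
            hits += 1
    # Add-one smoothing keeps the estimated p-value strictly positive.
    return (hits + 1) / (num_resamples + 1)
```

A large p-value would be consistent with the two runs being indistinguishable on that query set, though it does not by itself prove equivalence; a small one would suggest the co-evolutionary loop changes final report quality.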

Figures

Figures reproduced from arXiv: 2606.04507 by Chengkun Cai, Han Zhu, Sirui Han, Xing Chen, Yike Guo, Yuanfeng Song.

**Figure 2.** SCORE is controlled by Meta-Harness. The evaluator first selects evaluation dimensions

**Figure 3.** Ablation study for individual modules on DeepResearchBench. Comp.: comprehensiveness, IF: instruction following, Read.: readability, VC: valid ratio of citation.

**Figure 4.** Training Details.

**Figure 5.** Training Details.

**Figure 6.** Adaptive evaluation vs static evaluation of different queries.

Original abstract

Large Language Models (LLMs) have become increasingly adopted in daily applications, with deep research standing out as a particularly important capability. Unlike traditional question-answering (QA) tasks, deep research report generation lacks definitive ground-truth, making reward design inherently unverifiable and limiting effective reinforcement learning. Existing approaches mitigate this challenge with LLM-as-a-judge and query-dependent evaluation rubrics, but they still rely on static evaluators that cannot adapt their standards as the solver improves, leading to insufficient and eventually saturated optimization pressure. We address this limitation with a self-evolving co-evolutionary training framework for deep research evaluation and generation (SCORE), which tightly couples an evaluator and a solver in a shared-parameter learning process. Rather than treating generation and evaluation as isolated modules, we leverage their intrinsic connection to enable joint improvement within a single shared-parameter model. To restrict this process, we introduce a meta-harness, which dynamically controls the evaluation environment based on solver performance, encouraging valid evaluation dimensions and sufficiently deep evaluator search. Extensive experiments on deep research benchmarks demonstrate consistent improvement in report generation quality, showing that co-evolving evaluation and generation is a promising direction for training open-ended research agents.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper's new angle is shared-parameter co-evolution of solver and evaluator plus a meta-harness, but the abstract supplies no metrics or mechanism details to show the harness actually prevents biased collapse.

The core claim is that static evaluators hit a wall as the solver improves, so SCORE couples them in one model that updates jointly, with a meta-harness that tweaks the evaluation environment from performance signals to keep dimensions valid and search deep.

That framing of the problem is clear and the shared-parameter move is a direct response to the saturation issue in prior LLM-as-judge work. The idea that generation and evaluation can improve each other through the same weights is at least plausible on its face.

The soft spot is exactly the one in the stress-test note. Because the weights are shared, the model has an obvious incentive to generate easy-to-satisfy rubrics. The meta-harness is supposed to block that by supplying external pressure, yet the description gives no equations, no pseudocode, and no ablation showing that the harness actually maintains depth or blocks trivial standards. The claim of consistent benchmark gains is stated without numbers or protocol, so it cannot be evaluated.

This is aimed at people already working on open-ended LLM agents and reward design without ground truth. A reader in that niche would want to see the harness implementation and the ablations before deciding whether the co-evolution is doing real work.

It deserves peer review because the bottleneck it names is real and the proposed direction is distinct from the static baselines it cites, even though the current evidence is thin.

Referee Report

2 major / 0 minor

Summary. The manuscript introduces SCORE, a self-evolving co-evolutionary training framework for deep research report generation and evaluation. It tightly couples an evaluator and solver within a single shared-parameter model and introduces a meta-harness that dynamically modulates the evaluation environment according to solver performance signals, with the goal of sustaining optimization pressure and avoiding saturation. The abstract asserts that extensive experiments on deep research benchmarks demonstrate consistent improvement in report generation quality.

Significance. If the empirical claims and the meta-harness mechanism hold, the work would constitute a substantive contribution to training LLMs on open-ended tasks lacking ground truth, by replacing static LLM-as-a-judge evaluators with a jointly evolving system. The shared-parameter co-evolution idea directly targets the saturation problem identified in prior approaches and could influence future agent training pipelines if the harness is shown to maintain valid evaluation dimensions.

major comments (2)

[Abstract] The central empirical claim that the framework yields 'consistent improvement in report generation quality' is stated without any metrics, ablation results, baseline comparisons, or description of the experimental protocol, so the soundness of the SCORE framework cannot be assessed from the supplied text.
[Abstract] The meta-harness is asserted to 'dynamically control the evaluation environment based on solver performance, encouraging valid evaluation dimensions and sufficiently deep evaluator search,' yet no analysis, pseudocode, or experimental evidence is supplied to show that this external pressure is sufficient to prevent the shared-parameter model from converging on trivial or self-reinforcing rubrics, which is load-bearing for the co-evolutionary claim.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for highlighting issues with the abstract's level of detail. We address each comment below and will revise the abstract accordingly to improve clarity and informativeness while preserving its concise nature.

Referee: [Abstract] The central empirical claim that the framework yields 'consistent improvement in report generation quality' is stated without any metrics, ablation results, baseline comparisons, or description of the experimental protocol, so the soundness of the SCORE framework cannot be assessed from the supplied text.

Authors: The abstract is intended as a high-level summary and therefore omits specific numbers and protocol details, which are provided in the full manuscript (Experiments section, including quantitative metrics on deep research benchmarks, ablation studies on shared-parameter co-evolution, and comparisons against static LLM-as-a-judge baselines). We agree the abstract could better convey the empirical support and will revise it to include representative performance gains and a brief protocol outline. Revision: yes.

Referee: [Abstract] The meta-harness is asserted to 'dynamically control the evaluation environment based on solver performance, encouraging valid evaluation dimensions and sufficiently deep evaluator search,' yet no analysis, pseudocode, or experimental evidence is supplied to show that this external pressure is sufficient to prevent the shared-parameter model from converging on trivial or self-reinforcing rubrics, which is load-bearing for the co-evolutionary claim.

Authors: The meta-harness design, including its performance-signal modulation to avoid rubric collapse, along with supporting analysis and pseudocode, appears in Section 3; empirical results demonstrating sustained optimization pressure are in the Experiments section. The abstract does not include these elements. We will revise the abstract to briefly note the harness mechanism and its role in maintaining valid evaluation dimensions. Revision: yes.

Circularity Check

0 steps flagged

No circularity; framework proposal is self-contained without reductions to inputs

The paper describes a methodological proposal for a co-evolutionary training framework (SCORE) that jointly trains evaluator and solver via shared parameters, with a meta-harness to modulate the environment. The abstract and available text contain no equations, parameter fits presented as predictions, self-citations as load-bearing premises, or ansatzes smuggled via prior work. No derivation chain reduces any claimed result to its own inputs by construction; the approach is an empirical training recipe whose validity rests on external benchmarks rather than internal redefinition. This is the normal case of a non-circular systems paper.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The abstract supplies no equations, loss terms, or modeling choices, so no free parameters, axioms, or invented entities can be identified.

pith-pipeline@v0.9.1-grok · 5753 in / 1011 out tokens · 38276 ms · 2026-06-28T06:28:44.277525+00:00 · methodology

Reference graph

Works this paper leans on

80 extracted references · 10 canonical work pages · 1 internal anchor

[1]

URL https://docs.gptr.dev/docs/gpt-researcher/ getting-started/introduction

gpt-researcher. URL https://docs.gptr.dev/docs/gpt-researcher/ getting-started/introduction
[2]

URLhttps://github.com/langchain-ai/open_deep_research

open-deep-research. URLhttps://github.com/langchain-ai/open_deep_research
[3]

URL https://openai.com/zh-Hans-CN/index/ introducing-gpt-5-2/

Introduction gpt-5.2, 2025. URL https://openai.com/zh-Hans-CN/index/ introducing-gpt-5-2/

2025
[4]

Glance-or-gaze: Incentivizing lmms to adaptively focus search via reinforcement learning.arXiv preprint arXiv:2601.13942, 2026

Hongbo Bai, Yujin Zhou, Yile Wu, Chi-Min Chan, Pengcheng Wen, Kunhao Pan, Sirui Han, and Yike Guo. Glance-or-gaze: Incentivizing lmms to adaptively focus search via reinforcement learning.arXiv preprint arXiv:2601.13942, 2026

Pith/arXiv arXiv 2026
[5]

Agentcpm-explore: Realizing long-horizon deep exploration for edge-scale agents, 2026

Haotian Chen, Xin Cong, Shengda Fan, Yuyang Fu, Ziqin Gong, Yaxi Lu, Yishan Li, Boye Niu, Chengjun Pan, Zijun Song, Huadong Wang, Yesai Wu, Yueying Wu, Zihao Xie, Yukun Yan, Zhong Zhang, Yankai Lin, Zhiyuan Liu, and Maosong Sun. Agentcpm-explore: Realizing long-horizon deep exploration for edge-scale agents, 2026. URLhttps://arxiv.org/abs/ 2602.06485

arXiv 2026
[6]

SPar: Self-play with tree-search refinement to improve instruction-following in large language models

Jiale Cheng, Xiao Liu, Cunxiang Wang, Xiaotao Gu, Yida Lu, Dan Zhang, Yuxiao Dong, Jie Tang, Hongning Wang, and Minlie Huang. SPar: Self-play with tree-search refinement to improve instruction-following in large language models. InThe Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum? id=9chRqsPOGL

2025
[7]

Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities, 2025

Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, Luke Marris, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities, 2025. URLhttps://arxiv.org/abs/2507.06261

Pith/arXiv arXiv 2025
[8]

Deepseek-v3.2: Pushing the frontier of open large language models, 2025

DeepSeek-AI, Aixin Liu, Aoxue Mei, Bangcai Lin, Bing Xue, Bingxuan Wang, Bingzheng Xu, Bochao Wu, Bowei Zhang, Chaofan Lin, Chen Dong, Chengda Lu, Chenggang Zhao, et al. Deepseek-v3.2: Pushing the frontier of open large language models, 2025. URL https://arxiv.org/abs/2512.02556

Pith/arXiv arXiv 2025
[9]

On group relative policy optimization collapse in agent search: The lazy likelihood-displacement,

Wenlong Deng, Yushu Li, Boying Gong, Yi Ren, Christos Thrampoulidis, and Xiaoxiao Li. On group relative policy optimization collapse in agent search: The lazy likelihood-displacement,
[10]

URLhttps://arxiv.org/abs/2512.04220

arXiv
[11]

Sutherland, Xiaoxiao Li, and Christos Thram- poulidis

Wenlong Deng, Yi Ren, Muchen Li, Danica J. Sutherland, Xiaoxiao Li, and Christos Thram- poulidis. On the effect of negative gradient in group relative deep reinforcement optimization. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2026. URL https://openreview.net/forum?id=2K9QsDaqkM

2026
[12]

Adarubric: Task-adaptive rubrics for llm agent evaluation, 2026

Liang Ding. Adarubric: Task-adaptive rubrics for llm agent evaluation, 2026. URL https: //arxiv.org/abs/2603.21362

Pith/arXiv arXiv 2026
[13]

A survey on code generation with llm-based agents.arXiv preprint arXiv:2508.00083, 2025

Yihong Dong, Xue Jiang, Jiaru Qian, Tian Wang, Kechi Zhang, Zhi Jin, and Ge Li. A survey on code generation with llm-based agents.arXiv preprint arXiv:2508.00083, 2025

Pith/arXiv arXiv 2025
[14]

Deepresearch bench: A comprehensive benchmark for deep research agents

Mingxuan Du, Benfeng Xu, Chiwei Zhu, Licheng Zhang, Xiaorui Wang, and Zhendong Mao. Deepresearch bench: A comprehensive benchmark for deep research agents. InThe Fourteenth International Conference on Learning Representations, 2026. URL https://openreview. net/forum?id=hQ0K2Hhq7H. 10

2026
[15]

SeRL: Self-play reinforcement learning for large language models with limited data

Wenkai Fang, Shunyu Liu, Yang Zhou, Kongcheng Zhang, Tongya Zheng, Kaixuan Chen, Mingli Song, and Dacheng Tao. SeRL: Self-play reinforcement learning for large language models with limited data. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2026. URLhttps://openreview.net/forum?id=ZF93vyH9He

2026
[16]

Dr-arena: an automated evaluation framework for deep research agents, 2026

Yiwen Gao, Ruochen Zhao, Yang Deng, and Wenxuan Zhang. Dr-arena: an automated evaluation framework for deep research agents, 2026. URL https://arxiv.org/abs/2601. 10504

2026
[17]

The llama 3 herd of models, 2024

Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, Amy Yang, Angela Fan, et al. The llama 3 herd of models, 2024. URLhttps://arxiv.org/abs/2407. 21783

2024
[18]

How far can unsupervised RLVR scale LLM training? InThe Fourteenth International Conference on Learning Representations, 2026

Bingxiang He, Yuxin Zuo, Zeyuan Liu, Shangziqi Zhao, Zixuan Fu, Junlin Yang, Cheng Qian, Kaiyan Zhang, Yuchen Fan, Ganqu Cui, Xiusi Chen, Youbang Sun, Xingtai Lv, Xuekai Zhu, Li Sheng, Ran Li, Huan ang Gao, Yuchen Zhang, Lifan Yuan, Bowen Zhou, Zhiyuan Liu, and Ning Ding. How far can unsupervised RLVR scale LLM training? InThe Fourteenth International Con...

2026
[19]

Step-deepresearch technical report, 2025

Chen Hu, Haikuo Du, Heng Wang, Lin Lin, Mingrui Chen, Peng Liu, Ruihang Miao, Tianchi Yue, Wang You, Wei Ji, Wei Yuan, Wenjin Deng, Xiaojian Yuan, Xiaoyun Zhang, Xiangyu Liu, Xikai Liu, Yanming Xu, Yicheng Cao, Yifei Zhang, Yongyao Wang, Yubo Shu, Yurong Zhang, Yuxiang Zhang, Zheng Gong, Zhichao Chang, Binyan Li, Dan Ma, Furong Jia, Hongyuan Wang, Jiayu L...

arXiv 2025
[20]

R-zero: Self-evolving reasoning LLM from zero data

Chengsong Huang, Wenhao Yu, Xiaoyang Wang, Hongming Zhang, Zongxia Li, Ruosen Li, Jiaxin Huang, Haitao Mi, and Dong Yu. R-zero: Self-evolving reasoning LLM from zero data. InThe Fourteenth International Conference on Learning Representations, 2026. URL https://openreview.net/forum?id=96apU6YzSO

2026
[21]

NeMo guardrails: A toolkit for controllable and safe LLM applications with pro- grammable rails

Yucheng Jiang, Yijia Shao, Dekun Ma, Sina Semnani, and Monica Lam. Into the unknown un- knowns: Engaged human learning through participation in language model agent conversations. In Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen, editors,Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 9917–9955, Miami, Flo...

work page doi:10.18653/v1/ 2024
[22]

Search-r1: Training LLMs to reason and leverage search engines with reinforcement learning

Bowen Jin, Hansi Zeng, Zhenrui Yue, Jinsung Yoon, Sercan O Arik, Dong Wang, Hamed Zamani, and Jiawei Han. Search-r1: Training LLMs to reason and leverage search engines with reinforcement learning. InSecond Conference on Language Modeling, 2025. URL https://openreview.net/forum?id=Rwhi91ideu

2025
[23]

Prometheus: Inducing fine- grained evaluation capability in language models

Seungone Kim, Jamin Shin, Yejin Cho, Joel Jang, Shayne Longpre, Hwaran Lee, Sangdoo Yun, Seongjin Shin, Sungdong Kim, James Thorne, and Minjoon Seo. Prometheus: Inducing fine- grained evaluation capability in language models. InThe Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=8euJaTveKw

2024
[24]

Meta-harness: End-to-end optimization of model harnesses, 2026

Yoonho Lee, Roshen Nair, Qizheng Zhang, Kangwook Lee, Omar Khattab, and Chelsea Finn. Meta-harness: End-to-end optimization of model harnesses, 2026. URL https://arxiv.org/ abs/2603.28052

Pith/arXiv arXiv 2026
[25]

Writing-rl: Advancing long-form writing via adaptive curriculum reinforcement learning, 2026

Xuanyu Lei, Chenliang Li, Yuning Wu, Kaiming Liu, Weizhou Shen, Peng Li, Ming Yan, Fei Huang, Ya-Qin Zhang, and Yang Liu. Writing-rl: Advancing long-form writing via adaptive curriculum reinforcement learning, 2026. URLhttps://arxiv.org/abs/2506.05760. 11

Pith/arXiv arXiv 2026
[26]

The dawn after the dark: An empirical study on factuality hallucination in large language models

Junyi Li, Jie Chen, Ruiyang Ren, Xiaoxue Cheng, Xin Zhao, Jian-Yun Nie, and Ji-Rong Wen. The dawn after the dark: An empirical study on factuality hallucination in large language models. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar, editors,Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers),...

work page doi:10.18653/v1/2024.acl-long.586 2024
[27]

Multiple human motion understanding

Lei Li, Sen Jia, and Jenq-Neng Hwang. Multiple human motion understanding. InProceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 6297–6305, 2026

2026
[28]

Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing , month = nov, year =

Xiaoxi Li, Guanting Dong, Jiajie Jin, Yuyao Zhang, Yujia Zhou, Yutao Zhu, Peitian Zhang, and Zhicheng Dou. Search-o1: Agentic search-enhanced large reasoning models. In Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, and Violet Peng, editors,Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 5420– ...

work page doi:10.18653/v1/2025.emnlp-main.276 2025
[29]

Webthinker: Empowering large reasoning models with deep research capability

Xiaoxi Li, Jiajie Jin, Guanting Dong, Hongjin Qian, Yongkang Wu, Ji-Rong Wen, Yutao Zhu, and Zhicheng Dou. Webthinker: Empowering large reasoning models with deep research capability. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems,
[30]

URLhttps://openreview.net/forum?id=7LKKHBAMzH
[31]

Agentcpm-report: Interleaving drafting and deepening for open-ended deep research, 2026

Yishan Li, Wentong Chen, Yukun Yan, Mingwei Li, Sen Mei, Xiaorong Wang, Kunpeng Liu, Xin Cong, Shuo Wang, Zhong Zhang, Yaxi Lu, Zhenghao Liu, Yankai Lin, Zhiyuan Liu, and Maosong Sun. Agentcpm-report: Interleaving drafting and deepening for open-ended deep research, 2026. URLhttps://arxiv.org/abs/2602.06540

arXiv 2026
[32]

SPIRAL: Self-play on zero-sum games incentivizes reasoning via multi-agent multi-turn reinforcement learning

Bo Liu, Simon Yu, Zichen Liu, Leon Guertler, Penghui Qi, Daniel Balcells, Mickel Liu, Cheston Tan, Weiyan Shi, Min Lin, Wee Sun Lee, and Natasha Jaques. SPIRAL: Self-play on zero-sum games incentivizes reasoning via multi-agent multi-turn reinforcement learning. In The Fourteenth International Conference on Learning Representations, 2026. URL https: //ope...

2026
[33]

Search self-play: Pushing the frontier of agent capability without supervision

Hongliang Lu, Yuhang Wen, Pengyu Cheng, Ruijin Ding, Haotian Xu, Jiaqi Guo, Chutian Wang, Haonan Chen, xiaoxi jiang, and guanjunjiang. Search self-play: Pushing the frontier of agent capability without supervision. InThe Fourteenth International Conference on Learning Representations, 2026. URLhttps://openreview.net/forum?id=ZmGirmNJqE

2026
[34]

Qwen2.5 technical report, 2025

Qwen, :, An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, Huan Lin, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jingren Zhou, Junyang Lin, Kai Dang, Keming Lu, Keqin Bao, Kexin Yang, Le Yu, Mei Li, Mingfeng Xue, Pei Zhang, Qin Zhu, Rui Men, Runji Lin, Tianhao Li,...

Pith/arXiv arXiv 2025
[35]

Seedance 2.0: Advancing video generation for world complexity.arXiv preprint arXiv:2604.14148, 2026

Team Seedance, De Chen, Liyang Chen, Xin Chen, Ying Chen, Zhuo Chen, Zhuowei Chen, Feng Cheng, Tianheng Cheng, Yufeng Cheng, et al. Seedance 2.0: Advancing video generation for world complexity.arXiv preprint arXiv:2604.14148, 2026

Pith/arXiv arXiv 2026
[36]

Assisting in writing Wikipedia-like articles from scratch with large language models

Yijia Shao, Yucheng Jiang, Theodore Kanell, Peter Xu, Omar Khattab, and Monica Lam. Assisting in writing Wikipedia-like articles from scratch with large language models. In Kevin Duh, Helena Gomez, and Steven Bethard, editors,Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language T...

work page doi:10.18653/v1/2024.naacl-long.347 2024
[37]

Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Mingchuan Zhang, Y . K. Li, Y . Wu, and Daya Guo. Deepseekmath: Pushing the limits of mathematical reasoning in open 12 language models.CoRR, abs/2402.03300, 2024. URL https://doi.org/10.48550/arXiv. 2402.03300

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv 2024
[38]

Hybridflow: A flexible and efficient rlhf framework

Guangming Sheng, Chi Zhang, Zilingfeng Ye, Xibin Wu, Wang Zhang, Ru Zhang, Yanghua Peng, Haibin Lin, and Chuan Wu. Hybridflow: A flexible and efficient rlhf framework. In Proceedings of the Twentieth European Conference on Computer Systems, EuroSys ’25, page 1279–1297. ACM, March 2025. doi: 10.1145/3689031.3696075. URL http://dx.doi.org/ 10.1145/3689031.3696075

work page doi:10.1145/3689031.3696075 2025
[39]

Intrinsic entropy of context length scaling in llms

Jingzhe Shi, Qinwei Ma, Hongyi Liu, Hang Zhao, Jenq-Neng Hwang, and Lei Li. Intrinsic entropy of context length scaling in llms. InThe Fourteenth International Conference on Learning Representations, 2026

2026
[40]

Iterative self-incentivization empowers large language models as agentic searchers

Zhengliang Shi, Lingyong Yan, Dawei Yin, Suzan Verberne, Maarten de Rijke, and Zhaochun Ren. Iterative self-incentivization empowers large language models as agentic searchers. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2026. URL https://openreview.net/forum?id=s9NkfkUuEr

2026
[41]

R1-searcher: Incentivizing the search capability in llms via reinforcement learning, 2025

Huatong Song, Jinhao Jiang, Yingqian Min, Jie Chen, Zhipeng Chen, Wayne Xin Zhao, Lei Fang, and Ji-Rong Wen. R1-searcher: Incentivizing the search capability in llms via reinforcement learning, 2025. URLhttps://arxiv.org/abs/2503.05592

Pith/arXiv arXiv 2025
[42]

SimpleDeepSearcher: Deep information seeking via web-powered reasoning trajectory synthesis

Shuang Sun, Huatong Song, Yuhao Wang, Ruiyang Ren, Jinhao Jiang, Junjie Zhang, Fei Bai, Jia Deng, Wayne Xin Zhao, Zheng Liu, Lei Fang, Zhongyuan Wang, and Ji-Rong Wen. SimpleDeepSearcher: Deep information seeking via web-powered reasoning trajectory synthesis. In Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, and Violet Peng, editors, Find...

work page doi:10.18653/v1/2025.findings-emnlp.739 2025
[43]

Beyond verifiable rewards: Scaling reinforcement learning in language models to unverifiable data

Yunhao Tang, Sid Wang, Lovish Madaan, and Remi Munos. Beyond verifiable rewards: Scaling reinforcement learning in language models to unverifiable data. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2026. URL https://openreview. net/forum?id=pc6M9h3T9m

2026
[44]

Glm-4.5: Agentic, reasoning, and coding (arc) foundation models, 2025

5 Team, Aohan Zeng, Xin Lv, Qinkai Zheng, Zhenyu Hou, Bin Chen, Chengxing Xie, Cunxiang Wang, Da Yin, Hao Zeng, Jiajie Zhang, Kedong Wang, Lucen Zhong, Mingdao Liu, et al. Glm-4.5: Agentic, reasoning, and coding (arc) foundation models, 2025. URL https:// arxiv.org/abs/2508.06471

Pith/arXiv arXiv 2025
[45]

Deep-research eval: An automated framework for assessing quality and reliability in long-form reports

Yeerpan Tuohetiyaer, Yuye Zhu, Yan Hu, Siyuan Lu, and Zhongfeng Wang. Deep-research eval: An automated framework for assessing quality and reliability in long-form reports. Applied Sciences, 16(5), 2026. ISSN 2076-3417. doi: 10.3390/app16052546. URL https: //www.mdpi.com/2076-3417/16/5/2546

work page doi:10.3390/app16052546 2026
[46]

Simple statistical gradient-following algorithms for connectionist reinforce- ment learning.Machine learning, 8(3):229–256, 1992

Ronald J Williams. Simple statistical gradient-following algorithms for connectionist reinforce- ment learning.Machine learning, 8(3):229–256, 1992

1992
[47]

Ho, Linjun Zhang, Haoyu Wang, Wenqi Shi, and Carl Yang

Ran Xu, Yuchen Zhuang, Zihan Dong, Ruiyu Wang, Yue Yu, Joyce C. Ho, Linjun Zhang, Haoyu Wang, Wenqi Shi, and Carl Yang. Acesearcher: Bootstrapping reasoning and search for LLMs via reinforced self-play. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2026. URLhttps://openreview.net/forum?id=jSgCM0uZn3

2026
[48]

3dsceneeditor: Controllable 3d scene editing with gaussian splatting

Ziyang Yan, Yihua Shao, Minwen Liao, Siyu Chen, Nan Wang, Muyuan Lin, Jenq-Neng Hwang, Hao Zhao, Fabio Remondino, and Lei Li. 3dsceneeditor: Controllable 3d scene editing with gaussian splatting. InProceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pages 1852–1863, March 2026

2026
[49]

Qwen3 technical report, 2025

An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, Chujie Zheng, Dayiheng Liu, Fan Zhou, Fei Huang, Feng Hu, Hao Ge, Haoran Wei, Huan Lin, Jialong Tang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jing Zhou, Jingren Zhou, Junyang Lin, Kai Dang, Keqin Bao, Kexin Yang, ...

Pith/arXiv arXiv 2025
[50]

Trajectory-LLM: A language-based data generator for trajectory prediction in autonomous driving

Kairui Yang, Zihao Guo, Gengjie Lin, Haotian Dong, Zhao Huang, Yipeng Wu, Die Zuo, Jibin Peng, Ziyuan Zhong, Xin WANG, Qing Guo, Xiaosong Jia, Junchi Yan, and Di Lin. Trajectory-LLM: A language-based data generator for trajectory prediction in autonomous driving. InThe Thirteenth International Conference on Learning Representations, 2025. URL https://open...

2025
[51]

Ziyu Yao, Xuxin Cheng, Zhiqi Huang, and Lei Li. Countllm: Towards generalizable repetitive action counting via large language model. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025
[52]

Zhenrui Yue, Kartikeya Upasani, Xianjun Yang, Suyu Ge, Shaoliang Nie, Yuning Mao, Zhe Liu, and Dong Wang. Dr. zero: Self-evolving search agents without training data. arXiv preprint arXiv:2601.07055, 2026
[53]

Ding-Chu Zhang, Yida Zhao, Jialong Wu, Liwen Zhang, Baixuan Li, Wenbiao Yin, Yong Jiang, Yu-Feng Li, Kewei Tu, Pengjun Xie, and Fei Huang. EvolveSearch: An iterative self-evolving search agent. In Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, and Violet Peng, editors, Proceedings of the 2025 Conference on Empirical Methods in Natural Lang... doi:10.18653/v1/2025.emnlp-main.663
[54]

Qiyuan Zhang, Yufei Wang, Tianhe Wu, Can Xu, Qingfeng Sun, Kai Zheng, Xue Liu, and Chen Ma. Beyond length scaling: Synergizing breadth and depth for generative reward models, 2026. URL https://arxiv.org/abs/2603.01571
[55]

Ruize Zhang, Zelai Xu, Chengdong Ma, Chao Yu, Wei-Wei Tu, Wenhao Tang, Shiyu Huang, Deheng Ye, Wenbo Ding, Yaodong Yang, et al. A survey on self-play methods in reinforcement learning. arXiv preprint arXiv:2408.01072, 2024
[56]

Xin Zhang, Shen Chen, Jiale Zhou, and Lei Li. Psgs: Text-driven panorama sliding scene generation via gaussian splatting. arXiv preprint arXiv:2602.00463, 2026
[57]

Andrew Zhao, Yiran Wu, Yang Yue, Tong Wu, Quentin Xu, Yang Yue, Matthieu Lin, Shenzhi Wang, Qingyun Wu, Zilong Zheng, and Gao Huang. Absolute zero: Reinforced self-play reasoning with zero data. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025. URL https://openreview.net/forum?id=neZSGqhxDa
[58]

Xuandong Zhao, Zhewei Kang, Aosong Feng, Sergey Levine, and Dawn Song. Learning to reason without external rewards. In The Fourteenth International Conference on Learning Representations, 2026. URL https://openreview.net/forum?id=OU9nFEYR2M
[59]

Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. Judging llm-as-a-judge with mt-bench and chatbot arena. Advances in neural information processing systems, 36:46595–46623, 2023
[60]

Yuxiang Zheng, Dayuan Fu, Xiangkun Hu, Xiaojie Cai, Lyumanshan Ye, Pengrui Lu, and Pengfei Liu. DeepResearcher: Scaling deep research via reinforcement learning in real-world environments. In Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, and Violet Peng, editors, Proceedings of the 2025 Conference on Empirical Methods in Natural Language P... doi:10.18653/v1/2025.emnlp-main.22
[61]

Yujin Zhou, Chuxue Cao, Jinluan Yang, Lijun Wu, Conghui He, Sirui Han, and Yike Guo. Lras: Advanced legal reasoning with agentic search, 2026. URL https://arxiv.org/abs/2601.07296

A Limitations

Despite the effectiveness of our self-evolving framework, we acknowledge several limitations in our current study. First, the unique nature of deep research r...
[62]

the raw evaluator-side consistency signal estimates the expected reproducibility induced by a rubric under repeated solver rollouts
[63]

no other means

the shared-parameter solver-evaluator alternating updates follow, to first order, a KL-regularized surrogate descent direction, with only a higher-order residual due to sequential updating.

G Experiments

G.1 Training Details

Table 2: Main training hyperparameters in SCORE.

Category: Configuration
Learning rate: 1×10^−6
Warmup: 5 steps, warmup ratio 0.05
Trai...

2048
[64]

Comprehensiveness (Information Coverage)
- Evaluate the breadth, depth, and relevance of information coverage.
- Ensure the article covers key information areas, perspectives, and depths necessary to achieve comprehensive market or situational analysis ...
[65]

Insight (Analytical Depth)
- Evaluate the depth, originality, logic, and value of the analysis and conclusions.
- Assess whether the analysis goes beyond a superficial listing of factors to uncover core drivers, subtle mechanisms, and second-order effects.
- Examine if the conclusions and rec...
[66]

Instruction Following (Responsiveness & Relevance)
- Evaluate whether the report accurately, completely, and directly responds to all specific instructions, questions, and core objectives within the ‘<task>’.
- Check for strict adherence to scope limitations (e.g., ...
[67]

dimension_constraints

Readability (Presentation Quality)
- Evaluate the clarity of structure, fluency of language, effectiveness of data presentation, and overall ease of understanding.
- Assess structural logic (clear framework and logical flow), language precision (fluent, grammati...
[68]

{must_include} is always required as the primary scoring dimension. Select exactly {k-1} additional dimensions from the remaining pool and assign weights; each additional dimension's weight must be within [{Weight Min}, {Weight Max}].
[69]

Verify key claims and citations in each report using the available search tool.
[70]

Perform independent research from the original query to identify important gaps not covered by each report.
[71]

[Base Dimension]

Score each report on every selected dimension using the evidence you found.
## Tool-use protocol:
- When you need external evidence, you can use search tools with search query in this format: <search> your search query here </search>
- The content inside <search> must be only the query text. Do not output JSON, function-...
[72]

Think inside <think> and </think> before each action and after each new search result.
[73]

When external evidence is needed, emit exactly one standalone search query. Use the exact format <search> your query here </search> to issue a search.
[74]

The content inside <search> must be only the query text. Do not output JSON, tool names, XML attributes, bullet lists, or multiple queries in one search block.
[75]

Prefer natural web search queries: start broad, then narrow to verification, recent developments, entity-specific facts, or missing perspectives.
[76]

Prefer multiple sequential searches over one overloaded query.
[77]

Use search to verify important claims, gather evidence, and close information gaps. Stop searching once you can answer confidently.
[78]

The environment will return search results between <observation> and </observation>.
[79]

When you have enough evidence, write the final response only inside <answer> and </answer>.
[80]

In the final answer, synthesize the evidence, note uncertainty when necessary, and include source mentions or a short Sources section.

Listing 5: Prompt for solver
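The tag protocol in the solver prompt (Listing 5) can be illustrated with a small parser. This is a minimal sketch, not the authors' implementation: the tag names `<think>`, `<search>`, `<answer>`, and `<observation>` come from the prompt, while the `Action` class, `parse_action`, `wrap_observation`, and the specific rejection rules are hypothetical choices made for illustration.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Tag names follow the solver prompt; everything else here is an assumption.
_SEARCH = re.compile(r"<search>(.*?)</search>", re.DOTALL)
_ANSWER = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)


@dataclass
class Action:
    kind: str                # "search", "answer", or "invalid"
    payload: Optional[str]   # query text or final answer text
    reason: str = ""         # explanation when kind == "invalid"


def parse_action(turn: str) -> Action:
    """Classify one model turn as a search, a final answer, or invalid."""
    searches = _SEARCH.findall(turn)
    answers = _ANSWER.findall(turn)

    # A final answer: exactly one <answer> block and no search in the same turn.
    if len(answers) == 1 and not searches:
        return Action("answer", answers[0].strip())

    # A search: exactly one standalone <search> block holding plain query text.
    if len(searches) == 1 and not answers:
        query = searches[0].strip()
        # Assumed heuristics for "only the query text": reject empty queries,
        # multi-line content (bullet lists, several queries), JSON-like
        # payloads, and nested markup.
        if not query or "\n" in query or query[0] in "{[" or "<" in query:
            return Action("invalid", None, "search block must hold one plain query")
        return Action("search", query)

    return Action("invalid", None, "expected exactly one <search> or one <answer>")


def wrap_observation(results: str) -> str:
    """Format search results the way the prompt says the environment returns them."""
    return f"<observation>{results}</observation>"
```

In a rollout loop, a `search` action would be sent to the retrieval tool and its result appended via `wrap_observation`, an `answer` action would end the episode, and how an `invalid` action is handled (retry, penalty, or termination) is left open here because the source does not specify it.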

[47] [47]

Ho, Linjun Zhang, Haoyu Wang, Wenqi Shi, and Carl Yang

Ran Xu, Yuchen Zhuang, Zihan Dong, Ruiyu Wang, Yue Yu, Joyce C. Ho, Linjun Zhang, Haoyu Wang, Wenqi Shi, and Carl Yang. Acesearcher: Bootstrapping reasoning and search for LLMs via reinforced self-play. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2026. URLhttps://openreview.net/forum?id=jSgCM0uZn3

2026

[48] [48]

3dsceneeditor: Controllable 3d scene editing with gaussian splatting

Ziyang Yan, Yihua Shao, Minwen Liao, Siyu Chen, Nan Wang, Muyuan Lin, Jenq-Neng Hwang, Hao Zhao, Fabio Remondino, and Lei Li. 3dsceneeditor: Controllable 3d scene editing with gaussian splatting. InProceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pages 1852–1863, March 2026

2026

[49] [49]

Qwen3 technical report, 2025

An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, Chujie Zheng, Dayiheng Liu, Fan Zhou, Fei Huang, Feng Hu, Hao Ge, Haoran Wei, Huan Lin, Jialong Tang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jing Zhou, Jingren Zhou, Junyang Lin, Kai Dang, Keqin Bao, Kexin Yang, ...

Pith/arXiv arXiv 2025

[50] [50]

Trajectory-LLM: A language-based data generator for trajectory prediction in autonomous driving

Kairui Yang, Zihao Guo, Gengjie Lin, Haotian Dong, Zhao Huang, Yipeng Wu, Die Zuo, Jibin Peng, Ziyuan Zhong, Xin WANG, Qing Guo, Xiaosong Jia, Junchi Yan, and Di Lin. Trajectory-LLM: A language-based data generator for trajectory prediction in autonomous driving. InThe Thirteenth International Conference on Learning Representations, 2025. URL https://open...

2025

[51] [51]

Countllm: Towards generalizable repetitive action counting via large language model

Ziyu Yao, Xuxin Cheng, Zhiqi Huang, and Lei Li. Countllm: Towards generalizable repetitive action counting via large language model. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025

2025

[52] [52]

Zhenrui Yue, Kartikeya Upasani, Xianjun Yang, Suyu Ge, Shaoliang Nie, Yuning Mao, Zhe Liu, and Dong Wang. Dr. zero: Self-evolving search agents without training data.arXiv preprint arXiv:2601.07055, 2026

arXiv 2026

[53] [53]

EvolveSearch: An iterative self- evolving search agent

Ding-Chu Zhang, Yida Zhao, Jialong Wu, Liwen Zhang, Baixuan Li, Wenbiao Yin, Yong Jiang, Yu-Feng Li, Kewei Tu, Pengjun Xie, and Fei Huang. EvolveSearch: An iterative self- evolving search agent. In Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, and Violet Peng, editors,Proceedings of the 2025 Conference on Empirical Methods in Natural Lang...

work page doi:10.18653/v1/2025.emnlp-main.663 2025

[54] [54]

Beyond length scaling: Synergizing breadth and depth for generative reward models, 2026

Qiyuan Zhang, Yufei Wang, Tianhe Wu, Can Xu, Qingfeng Sun, Kai Zheng, Xue Liu, and Chen Ma. Beyond length scaling: Synergizing breadth and depth for generative reward models, 2026. URLhttps://arxiv.org/abs/2603.01571

arXiv 2026

[55] [55]

A survey on self-play methods in reinforcement learning.arXiv preprint arXiv:2408.01072, 2024

Ruize Zhang, Zelai Xu, Chengdong Ma, Chao Yu, Wei-Wei Tu, Wenhao Tang, Shiyu Huang, Deheng Ye, Wenbo Ding, Yaodong Yang, et al. A survey on self-play methods in reinforcement learning.arXiv preprint arXiv:2408.01072, 2024

arXiv 2024

[56] [56]

Psgs: Text-driven panorama sliding scene generation via gaussian splatting.arXiv preprint arXiv:2602.00463, 2026

Xin Zhang, Shen Chen, Jiale Zhou, and Lei Li. Psgs: Text-driven panorama sliding scene generation via gaussian splatting.arXiv preprint arXiv:2602.00463, 2026

arXiv 2026

[57] [57]

Absolute zero: Reinforced self-play reasoning with zero data

Andrew Zhao, Yiran Wu, Yang Yue, Tong Wu, Quentin Xu, Yang Yue, Matthieu Lin, Shenzhi Wang, Qingyun Wu, Zilong Zheng, and Gao Huang. Absolute zero: Reinforced self-play reasoning with zero data. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025. URLhttps://openreview.net/forum?id=neZSGqhxDa

2025

[58] [58]

Learning to reason without external rewards

Xuandong Zhao, Zhewei Kang, Aosong Feng, Sergey Levine, and Dawn Song. Learning to reason without external rewards. InThe Fourteenth International Conference on Learning Representations, 2026. URLhttps://openreview.net/forum?id=OU9nFEYR2M

2026

[59] [59]

Judging llm-as-a-judge with mt-bench and chatbot arena.Advances in neural information processing systems, 36:46595–46623, 2023

Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. Judging llm-as-a-judge with mt-bench and chatbot arena.Advances in neural information processing systems, 36:46595–46623, 2023

2023

[60] [60]

2025 , address =

Yuxiang Zheng, Dayuan Fu, Xiangkun Hu, Xiaojie Cai, Lyumanshan Ye, Pengrui Lu, and Pengfei Liu. DeepResearcher: Scaling deep research via reinforcement learning in real-world environments. In Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, and Violet Peng, editors,Proceedings of the 2025 Conference on Empirical Methods in Natural Language P...

work page doi:10.18653/v1/2025.emnlp-main.22 2025

[61] [61]

Lras: Advanced legal reasoning with agentic search, 2026

Yujin Zhou, Chuxue Cao, Jinluan Yang, Lijun Wu, Conghui He, Sirui Han, and Yike Guo. Lras: Advanced legal reasoning with agentic search, 2026. URL https://arxiv.org/abs/2601. 07296. 14 A Limitations Despite the effectiveness of our self-evolving framework, we acknowledge several limitations in our current study. First, the unique nature of deep research r...

2026

[62] [62]

the raw evaluator-side consistency signal estimates the expected reproducibility induced by a rubric under repeated solver rollouts

[63] [63]

no other means

the shared-parameter solver-evaluator alternating updates follow, to first order, a KL- regularized surrogate descent direction, with only a higher-order residual due to sequential updating. G Experiments G.1 Training Details Table 2: Main training hyperparameters in SCORE. Category Configuration Learning rate1×10 −6 Warmup 5 steps, warmup ratio 0.05 Trai...

2048

[64] [64]

C o m p r e h e n s i v e n e s s ( I n f o r m a t i o n C ov er ag e ) - E va lua te the breadth , depth , and r e l e v a n c e of i n f o r m a t i o n cov er ag e . - Ensure the article covers key i n f o r m a t i o n areas , perspectives , and depths n e c e s s a r y to achieve c o m p r e h e n s i v e market or s i t u a t i o n a l an al ysi s ...

[65]

Insight (Analytical Depth)
- Evaluate the depth, originality, logic, and value of the analysis and conclusions.
- Assess whether the analysis goes beyond a superficial listing of factors to uncover core drivers, subtle mechanisms, and second-order effects.
- Examine if the conclusions and rec...

[66]

Instruction Following (Responsiveness & Relevance)
- Evaluate whether the report accurately, completely, and directly responds to all specific instructions, questions, and core objectives within the `<task>`.
- Check for strict adherence to scope limitations (e.g.,...

[67]

dimension_constraints

Readability (Presentation Quality)
- Evaluate the clarity of structure, fluency of language, effectiveness of data presentation, and overall ease of understanding.
- Assess structural logic (clear framework and logical flow), language precision (fluent, grammati...

[68]

{must_include} is always required as the primary scoring dimension. Select exactly {k-1} additional dimensions from the remaining pool and assign weights, each additional dimension's weight must be within [{Weight Min}, {Weight Max}]
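The selection rule in this prompt (one required dimension, exactly k-1 additional dimensions from the pool, bounded weights on the additional ones) can be checked mechanically. The following is a minimal sketch only: the helper name `validate_rubric`, its arguments, and the assumption that the pool consists of the four base dimensions named in this appendix are illustrative and are not taken from the SCORE implementation.

```python
from typing import Dict, List

# Assumption: the pool is the set of base dimensions named in the prompt.
BASE_DIMENSIONS: List[str] = [
    "Comprehensiveness",
    "Insight",
    "Instruction Following",
    "Readability",
]


def validate_rubric(
    weights: Dict[str, float],
    must_include: str,
    k: int,
    weight_min: float,
    weight_max: float,
    pool: List[str],
) -> List[str]:
    """Return the list of violated constraints (empty if the rubric is valid).

    Hypothetical helper: it checks only what the prompt states, i.e. the
    required dimension, exactly k-1 additional dimensions drawn from the
    pool, and weight bounds on each additional dimension.
    """
    errors: List[str] = []
    if must_include not in weights:
        errors.append(f"missing required dimension: {must_include}")
    additional = [d for d in weights if d != must_include]
    if len(additional) != k - 1:
        errors.append(
            f"expected {k - 1} additional dimensions, got {len(additional)}"
        )
    for dim in additional:
        if dim not in pool:
            errors.append(f"dimension not in pool: {dim}")
        elif not (weight_min <= weights[dim] <= weight_max):
            errors.append(f"weight out of range for {dim}: {weights[dim]}")
    return errors
```

A rubric that fails any check could be rejected or resampled before scoring; how the paper handles such cases is not stated in this section.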

[69]

Verify key claims and citations in each report using the available search tool

[70]

Perform independent research from the original query to identify important gaps not covered by each report

[71]

[Base Dimension]

Score each report on every selected dimension using the evidence you found.

## Tool-use protocol:
- When you need external evidence, you can use search tools with search query in this format: <search> your search query here </search>
- The content inside <search> must be only the query text. Do not output JSON, function-...

[72]

Think inside <think> and </think> before each action and after each new search result

[73]

When external evidence is needed, emit exactly one standalone search query. Use the exact format <search> your query here </search> to issue a search

[74]

The content inside <search> must be only the query text. Do not output JSON, tool names, XML attributes, bullet lists, or multiple queries in one search block
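The search-block rules in this prompt (exactly one standalone query, plain query text only) lend themselves to a small format check on the model output. The sketch below is an illustration under stated assumptions: the function name `extract_search_query` and the heuristics used to detect JSON, XML, or bullet lists are hypothetical, and the paper does not describe how its environment parses or rejects malformed search actions.

```python
import re
from typing import Optional

# Matches the <search> ... </search> action format quoted in the prompt.
SEARCH_RE = re.compile(r"<search>(.*?)</search>", re.DOTALL)


def extract_search_query(model_output: str) -> Optional[str]:
    """Return the query if the output holds exactly one well-formed search block.

    Hypothetical parser: returns None whenever the output breaks one of the
    stated rules, so the caller can decide how to handle the violation.
    """
    blocks = SEARCH_RE.findall(model_output)
    if len(blocks) != 1:
        return None  # no search block, or more than one
    query = blocks[0].strip()
    if not query or "\n" in query:
        return None  # empty, or multi-line (likely several queries or a list)
    if query[0] in "{[<" or query.startswith(("-", "*")):
        return None  # heuristic: looks like JSON, XML, or a bullet item
    return query
```

For example, `extract_search_query("<search> EMNLP 2025 deep research agents </search>")` returns the bare query string, while an output containing two search blocks returns `None`.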

[75]

Prefer natural web search queries: start broad, then narrow to verification, recent developments, entity-specific facts, or missing perspectives

[76]

Prefer multiple sequential searches over one overloaded query

[77]

Use search to verify important claims, gather evidence, and close information gaps. Stop searching once you can answer confidently

[78]

The environment will return search results between <observation> and </observation>

[79]

When you have enough evidence, write the final response only inside <answer> and </answer>
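The observation and answer tags described in this prompt define a simple turn format between the environment and the solver. As a minimal sketch, assuming the environment wraps raw search results verbatim and the final report is the last answer block in the transcript, the two sides could look as follows. The names `wrap_observation` and `extract_final_answer` are hypothetical and are not part of the paper.

```python
import re
from typing import Optional

# Matches the <answer> ... </answer> block quoted in the prompt.
ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)


def wrap_observation(search_results: str) -> str:
    """Environment side: return search results in the observation tags."""
    return f"<observation>{search_results}</observation>"


def extract_final_answer(transcript: str) -> Optional[str]:
    """Solver side: return the text of the last answer block, if any.

    Assumption: when several answer blocks appear, the last one is final.
    """
    matches = ANSWER_RE.findall(transcript)
    if not matches:
        return None  # the rollout ended without a final response
    return matches[-1].strip()
```

A rollout that never emits an answer block yields `None`, which a training loop could treat as a format failure; whether SCORE does so is not stated here.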

[80]

In the final answer, synthesize the evidence, note uncertainty when necessary, and include source mentions or a short Sources section.

Listing 5: Prompt for solver