Recognition: no theorem link
ReSearch: Learning to Reason with Search for LLMs via Reinforcement Learning
Pith reviewed 2026-05-16 15:43 UTC · model grok-4.3
The pith
ReSearch trains LLMs to interleave search operations with text reasoning using only outcome-based reinforcement learning rewards.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
ReSearch frames search as an integral action within the reasoning trajectory: the model generates text-based thinking to decide when and how to search, receives the result in context, and continues reasoning, all optimized end-to-end by reinforcement learning on outcome rewards alone; this produces strong cross-benchmark generalization and the spontaneous appearance of reflection and self-correction behaviors.
What carries the argument
An outcome-optimized RL policy that generates interleaved sequences of natural-language reasoning steps and explicit search tool calls, feeding retrieved results back into the context to shape the next reasoning segment.
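To make that mechanism concrete, the sketch below shows one way such an interleaved rollout could look. It is a minimal illustration under stated assumptions, not the authors' implementation: the tag names (`<search>`, `<result>`, `<answer>`) and the `generate_until` and `retrieve` helpers are placeholders.

```python
# Minimal sketch of an interleaved reasoning-search rollout, for illustration only.
# The tag names and the generate_until / retrieve helpers are assumptions, not the
# authors' exact interface.

def rollout(policy, question, retrieve, max_search_calls=4):
    """Sample one trajectory: free-form reasoning text interleaved with search calls."""
    context = f"Question: {question}\n"
    for _ in range(max_search_calls):
        # The policy writes reasoning until it either closes a search query or
        # commits to a final answer.
        segment = policy.generate_until(context, stop=["</search>", "</answer>"])
        context += segment
        if segment.rstrip().endswith("</answer>"):
            break  # final answer reached; trajectory ends here
        # Pull out the query wrapped in <search>...</search>, run retrieval, and
        # append the results verbatim so the next reasoning step conditions on them.
        query = segment.split("<search>")[-1].replace("</search>", "").strip()
        context += f"\n<result>\n{retrieve(query)}\n</result>\n"
    return context
```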
If this is right
- Reflection and self-correction emerge as byproducts of the RL process without explicit rewards for those behaviors.
- Performance generalizes across benchmarks after training on only one dataset.
- Complex questions needing multiple retrieval steps become solvable through learned interleaving rather than fixed pipelines.
- Search integration is driven by the model's internal text reasoning rather than external rules or prompts.
Where Pith is reading between the lines
- The same outcome-reward approach could extend to other tools such as code interpreters or databases to create agents that decide when to use each tool.
- Reducing dependence on supervised reasoning traces might lower the data cost of building more capable models for open-ended tasks.
- Scaling model size or adding more diverse outcome signals could strengthen the observed self-correction without changing the core training setup.
Load-bearing premise
Final-answer correctness rewards alone can shape precise decisions about when to search and how to incorporate results without any intermediate supervision on reasoning steps.
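As an illustration of how spare that supervision signal is, here is a hedged sketch of an outcome-only reward: the entire trajectory is scored by the final answer alone. Token-level F1 against the reference is one common choice in this literature; the paper's exact reward definition may differ, and the `<answer>` tag convention is an assumption.

```python
# Hedged sketch of an outcome-only reward: no intermediate step is scored, only
# whether the final answer matches the reference.

import re
from collections import Counter

def outcome_reward(trajectory: str, gold_answer: str) -> float:
    match = re.search(r"<answer>(.*?)</answer>", trajectory, re.DOTALL)
    if match is None:
        return 0.0  # malformed trajectory: no parsable final answer
    pred = match.group(1).strip().lower().split()
    gold = gold_answer.lower().split()
    overlap = sum((Counter(pred) & Counter(gold)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred), overlap / len(gold)
    return 2 * precision * recall / (precision + recall)  # token-level F1
```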
What would settle it
Train the model and test it against a non-search baseline on multi-hop questions that require two or more sequential retrievals: the claim fails if no gain appears, or if search timing turns out to be effectively random rather than guided by the model's own text-based thinking.
Original abstract
Large Language Models (LLMs) have shown remarkable capabilities in reasoning, exemplified by the success of OpenAI-o1 and DeepSeek-R1. However, integrating reasoning with external search processes remains challenging, especially for complex multi-hop questions requiring multiple retrieval steps. We propose ReSearch, a novel framework that trains LLMs to Reason with Search via reinforcement learning without using any supervised data on reasoning steps. Our approach treats search operations as integral components of the reasoning chain, where when and how to perform searches is guided by text-based thinking, and search results subsequently influence further reasoning. We train ReSearch on Qwen2.5-7B(-Instruct) and Qwen2.5-32B(-Instruct) models and conduct extensive experiments. Despite being trained on only one dataset, our models demonstrate strong generalizability across various benchmarks. Analysis reveals that ReSearch naturally elicits advanced reasoning capabilities such as reflection and self-correction during the reinforcement learning process.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces ReSearch, a reinforcement learning framework that trains LLMs (Qwen2.5-7B and 32B variants) to interleave text-based reasoning with external search operations without any supervised reasoning traces or explicit search supervision. Search is treated as an integral part of the reasoning chain, with the policy learning when and how to invoke searches based on outcome rewards alone; the resulting models are reported to generalize strongly across benchmarks despite single-dataset training and to exhibit emergent reflection and self-correction.
Significance. If the central claims hold, the work would demonstrate that pure outcome-based RL can shape adaptive tool use and elicit advanced reasoning behaviors in LLMs, reducing dependence on supervised fine-tuning for multi-hop tasks. This would be a notable contribution to tool-augmented reasoning, with the single-dataset generalization and emergent capabilities providing a concrete path toward more scalable training of search-integrated reasoners.
major comments (2)
- [Experimental results] Experimental results section: the claims of strong generalization and emergent reflection rest on outcome-based RL shaping search timing, yet no ablations are reported that isolate the contribution of learned search integration (e.g., search-disabled baselines, search-frequency statistics stratified by question complexity, or comparison to fixed-search policies). Without these, it remains possible that performance gains derive from the base model or the search tool itself rather than the RL-induced policy.
- [Method] Method section (trajectory generation and reward): the description of how search results are incorporated into subsequent reasoning steps and how the policy learns contextual search decisions solely from final-answer rewards lacks sufficient detail on state representation and credit assignment; this is load-bearing for the central assertion that no supervised reasoning traces are required.
minor comments (1)
- [Abstract] Abstract: the statement that models 'demonstrate strong generalizability' would be strengthened by including at least one or two key quantitative metrics (e.g., average accuracy lift on the held-out benchmarks) rather than qualitative descriptors alone.
Simulated Author's Rebuttal
We thank the referee for the constructive and insightful comments. We address each major comment point by point below, providing clarifications and indicating revisions to strengthen the manuscript.
Point-by-point responses
-
Referee: [Experimental results] Experimental results section: the claims of strong generalization and emergent reflection rest on outcome-based RL shaping search timing, yet no ablations are reported that isolate the contribution of learned search integration (e.g., search-disabled baselines, search-frequency statistics stratified by question complexity, or comparison to fixed-search policies). Without these, it remains possible that performance gains derive from the base model or the search tool itself rather than the RL-induced policy.
Authors: We agree that additional ablations would more rigorously isolate the contribution of the learned search policy. In the revised manuscript we will add: (1) a search-disabled baseline in which the model is trained under the same RL setup but without access to the search tool; (2) search-frequency statistics stratified by question complexity (e.g., single-hop vs. multi-hop); and (3) comparisons against fixed-search policies that invoke search at predetermined intervals or always. These experiments will be reported alongside the existing results to demonstrate that performance gains arise from the RL-induced contextual decisions rather than the base model or tool alone. revision: yes
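A sketch of what proposed ablation (2) might look like in practice, assuming hypothetical evaluation logs with `num_hops` and `trajectory` fields and `<search>` tags marking tool calls; the field names are illustrative, not the authors' logging format.

```python
# Illustrative sketch: search-frequency statistics stratified by question complexity.
# A learned policy should issue more search calls as hop count grows; a fixed or
# random policy should not.

import re
from collections import defaultdict
from statistics import mean

def search_frequency_by_hops(records):
    """records: iterable of dicts with 'num_hops' and the sampled 'trajectory' text."""
    calls_by_hops = defaultdict(list)
    for rec in records:
        n_calls = len(re.findall(r"<search>", rec["trajectory"]))
        calls_by_hops[rec["num_hops"]].append(n_calls)
    # Average number of search calls per question, grouped by hop count.
    return {hops: mean(calls) for hops, calls in sorted(calls_by_hops.items())}
```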
-
Referee: [Method] Method section (trajectory generation and reward): the description of how search results are incorporated into subsequent reasoning steps and how the policy learns contextual search decisions solely from final-answer rewards lacks sufficient detail on state representation and credit assignment; this is load-bearing for the central assertion that no supervised reasoning traces are required.
Authors: We acknowledge that the current description is insufficiently detailed. In the revised method section we will expand the exposition as follows: the state at each step is the complete history of prior reasoning text plus any previously retrieved search results appended verbatim to the context; search results are incorporated by direct concatenation after the current thought, allowing the next token prediction to condition on them. Credit assignment occurs entirely through the terminal outcome reward (binary correctness of the final answer) propagated by the RL algorithm (PPO) with no intermediate supervision or reasoning traces. We will include pseudocode for trajectory sampling, the precise reward function, and an explicit statement that search timing and content decisions are learned solely from final-answer signals. revision: yes
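For readers who want the credit-assignment picture in one place, the sketch below restates the setup described in this response as code. The `rollout_fn`, `reward_fn`, `retrieve`, and `ppo_update` helpers are assumptions for illustration; this is the shape of the training loop, not the authors' code.

```python
# Hedged sketch of the setup described above: the only learning signal is a
# terminal reward on the final answer, and PPO spreads credit from that single
# scalar back over every generated token, including the search decisions.

def training_step(policy, value_model, batch, rollout_fn, reward_fn, retrieve, ppo_update):
    trajectories, rewards = [], []
    for question, gold_answer in batch:
        traj = rollout_fn(policy, question, retrieve)  # interleaved reasoning + search
        trajectories.append(traj)
        rewards.append(reward_fn(traj, gold_answer))   # terminal-only reward, e.g. in {0, 1}
    # No per-step labels are used: the value baseline and the terminal reward are
    # all PPO has for assigning credit to intermediate search decisions. Retrieved
    # text is usually masked out of the policy loss so the model conditions on it
    # without being trained to reproduce it.
    ppo_update(policy, value_model, trajectories, rewards)
```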
Circularity Check
No significant circularity in the ReSearch RL training framework
Full rationale
The paper presents an empirical RL method that applies standard outcome-based rewards to search-augmented reasoning trajectories without supervised traces. No equations, derivations, or predictions are shown that reduce claimed capabilities (e.g., reflection, generalizability) to fitted inputs or self-citations by construction. Training on one dataset and evaluation across benchmarks is a direct experimental result, not a self-referential derivation. The framework is self-contained as a training procedure validated externally.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Reinforcement learning with final-answer rewards suffices to train when and how to search during reasoning.
Forward citations
Cited by 22 Pith papers
-
Retrieval is Cheap, Show Me the Code: Executable Multi-Hop Reasoning for Retrieval-Augmented Generation
PyRAG turns multi-hop reasoning into executable Python code over retrieval tools for explicit, verifiable step-by-step RAG.
-
Learning to Explore: Scaling Agentic Reasoning via Exploration-Aware Policy Optimization
An exploration-aware RL framework lets LLM agents adaptively explore only under high uncertainty via variational rewards and action grouping, yielding consistent gains on text and GUI agent benchmarks.
-
Self-Induced Outcome Potential: Turn-Level Credit Assignment for Agents without Verifiers
SIOP enables turn-level credit assignment in LLM agents via semantic clustering of final answers as latent outcomes, improving performance on reasoning benchmarks without verifiers.
-
Latent Abstraction for Retrieval-Augmented Generation
LAnR unifies retrieval-augmented generation inside a single LLM by deriving dense retrieval vectors from a [PRED] token's hidden states and using entropy to adaptively stop retrieval, outperforming prior RAG on six QA...
-
IG-Search: Step-Level Information Gain Rewards for Search-Augmented Reasoning
IG-Search computes step-level information gain rewards from policy probabilities to improve credit assignment in RL training for search-augmented QA, yielding 1.6-point gains over trajectory-level baselines on multi-h...
-
Fine-tuning is Not Enough: A Parallel Framework for Collaborative Imitation and Reinforcement Learning in End-to-end Autonomous Driving
PaIR-Drive runs IL and RL in parallel branches with a tree-structured sampler to reach 91.2 PDMS and 87.9 EPDMS on NAVSIM benchmarks while outperforming sequential RL fine-tuning and correcting some human errors.
-
MMSearch-R1: Incentivizing LMMs to Search
MMSearch-R1 uses reinforcement learning to train multimodal models for on-demand multi-turn internet search with image and text tools, outperforming same-size RAG baselines and matching larger ones while cutting searc...
-
PiCA: Pivot-Based Credit Assignment for Search Agentic Reinforcement Learning
PiCA improves RL for LLM search agents by defining process rewards around pivot steps that act as information peaks boosting final answer success probability via potential-based shaping.
-
PiCA: Pivot-Based Credit Assignment for Search Agentic Reinforcement Learning
PiCA uses pivot-based potential rewards derived from historical sub-queries to supply trajectory-aware step guidance in agentic RL, delivering 15% gains on QA benchmarks for 3B/7B models.
-
DR-MMSearchAgent: Deepening Reasoning in Multimodal Search Agents
DR-MMSearchAgent derives batch-wide trajectory advantages and uses differentiated Gaussian rewards to prevent premature collapse in multimodal agents, outperforming MMSearch-R1 by 8.4% on FVQA-test.
-
Agent-World: Scaling Real-World Environment Synthesis for Evolving General Agent Intelligence
Agent-World autonomously synthesizes verifiable real-world tasks and uses continuous self-evolution to train 8B and 14B agents that outperform proprietary models on 23 benchmarks.
-
MICA: Multi-granularity Intertemporal Credit Assignment for Long-Horizon Emotional Support Dialogue
MICA combines incremental per-turn distance rewards and Monte Carlo returns from a shared potential function over user support states to create a mixed advantage signal that enables stable multi-turn RL optimization f...
-
WebThinker: Empowering Large Reasoning Models with Deep Research Capability
WebThinker equips large reasoning models with autonomous web exploration and interleaved reasoning-drafting via a Deep Web Explorer and RL-based DPO training, yielding gains on GPQA, GAIA, and report-generation benchmarks.
-
Learning to Explore: Scaling Agentic Reasoning via Exploration-Aware Policy Optimization
An exploration-aware policy optimization method lets LLM agents explore selectively via a variational-inference reward and action grouping, yielding consistent gains on text and GUI agent benchmarks.
-
ProMMSearchAgent: A Generalizable Multimodal Search Agent Trained with Process-Oriented Rewards
A sandbox-trained multimodal search agent with process-oriented rewards transfers zero-shot to real Google Search and outperforms prior methods on FVQA, InfoSeek, and MMSearch.
-
SAKE: Self-aware Knowledge Exploitation-Exploration for Grounded Multimodal Named Entity Recognition
SAKE is an agentic framework for GMNER that uses uncertainty-based self-awareness and reinforcement learning to balance internal knowledge exploitation with adaptive external exploration.
-
KG-Reasoner: A Reinforced Model for End-to-End Multi-Hop Knowledge Graph Reasoning
KG-Reasoner uses reinforcement learning to train LLMs for end-to-end multi-hop knowledge graph reasoning, achieving competitive or better results on eight benchmarks.
-
E3-TIR: Enhanced Experience Exploitation for Tool-Integrated Reasoning
E3-TIR integrates expert prefixes, guided branches, and self-exploration via mix policy optimization to deliver 6% better tool-use performance with under 10% of the usual synthetic data and 1.46x ROI.
-
An End-to-End Framework for Building Large Language Models for Software Operations
OpsLLM outperforms general LLMs on software operations QA and RCA tasks through human-in-the-loop data curation, supervised fine-tuning, and domain-specific reinforcement learning.
-
Towards Reasoning Era: A Survey of Long Chain-of-Thought for Reasoning Large Language Models
The paper unifies perspectives on Long CoT in reasoning LLMs by introducing a taxonomy, detailing characteristics of deep reasoning and reflection, and discussing emergence phenomena and future directions.
-
An End-to-End Framework for Building Large Language Models for Software Operations
OpsLLM is a domain-specific LLM for software ops QA and RCA built with human-curated data, SFT, and RL using a domain process reward model, showing accuracy gains of 0.2-5.7% on QA and 2.7-70.3% on RCA over general LLMs.
-
From LLM Reasoning to Autonomous AI Agents: A Comprehensive Review
A survey consolidating benchmarks, agent frameworks, real-world applications, and protocols for LLM-based autonomous agents into a proposed taxonomy with recommendations for future research.
Reference graph
Works this paper leans on
-
[1]
Claude 3.7 Sonnet and Claude Code, 2025
Anthropic. Claude 3.7 Sonnet and Claude Code, 2025. URL https://www.anthropic.com/news/claude-3-7-sonnet
-
[2]
Self-rag: Learning to retrieve, generate, and critique through self-reflection
Akari Asai, Zeqiu Wu, Yizhong Wang, Avirup Sil, and Hannaneh Hajishirzi. Self-rag: Learning to retrieve, generate, and critique through self-reflection. In ICLR. OpenReview.net, 2024
-
[3]
RQ-RAG: Learning to Refine Queries for Retrieval Augmented Generation
Chi-Min Chan, Chunpu Xu, Ruibin Yuan, Hongyin Luo, Wei Xue, Yike Guo, and Jie Fu. RQ-RAG: learning to refine queries for retrieval augmented generation. CoRR, abs/2404.00610, 2024
-
[4]
Mingyang Chen, Haoze Sun, Tianpeng Li, Fan Yang, Hao Liang, Keer Lu, Bin Cui, Wentao Zhang, Zenan Zhou, and Weipeng Chen. Facilitating multi-turn function calling for llms via compositional instruction tuning. CoRR, abs/2410.12952, 2024
-
[5]
DeepSeek-AI, Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, Xiaokang Zhang, Xingkai Yu, Yu Wu, Z. F. Wu, Zhibin Gou, Zhihong Shao, Zhuoshu Li, Ziyi Gao, Aixin Liu, Bing Xue, Bingxuan Wang, Bochao Wu, Bei Feng, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, Damai D...
-
[6]
Retrieval-Augmented Generation for Large Language Models: A Survey
Yunfan Gao, Yun Xiong, Xinyu Gao, Kangxiang Jia, Jinliu Pan, Yuxi Bi, Yi Dai, Jiawei Sun, Qianyu Guo, Meng Wang, and Haofen Wang. Retrieval-augmented generation for large language models: A survey. CoRR, abs/2312.10997, 2023
-
[7]
Constructing A multi-hop QA dataset for comprehensive evaluation of reasoning steps
Xanh Ho, Anh-Khoa Duong Nguyen, Saku Sugawara, and Akiko Aizawa. Constructing A multi-hop QA dataset for comprehensive evaluation of reasoning steps. In COLING, pages 6609–6625. International Committee on Computational Linguistics, 2020
-
[8]
Atlas: Few-shot Learning with Retrieval Augmented Language Models
Gautier Izacard, Patrick Lewis, Maria Lomeli, Lucas Hosseini, Fabio Petroni, Timo Schick, Jane Dwivedi-Yu, Armand Joulin, Sebastian Riedel, and Edouard Grave. Few-shot learning with retrieval augmented language models. arXiv preprint arXiv:2208.03299, 2022
-
[9]
Search-R1: Training LLMs to Reason and Leverage Search Engines with Reinforcement Learning
Bowen Jin, Hansi Zeng, Zhenrui Yue, Dong Wang, Hamed Zamani, and Jiawei Han. Search-R1: Training LLMs to reason and leverage search engines with reinforcement learning. CoRR, abs/2503.09516, 2025
-
[10]
Jiajie Jin, Yutao Zhu, Xinyu Yang, Chenghao Zhang, and Zhicheng Dou. Flashrag: A modular toolkit for efficient retrieval-augmented generation research. CoRR, abs/2405.13576, 2024
-
[11]
Vladimir Karpukhin, Barlas Oguz, Sewon Min, Patrick S. H. Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih. Dense passage retrieval for open-domain question answering. In EMNLP (1), pages 6769–6781. Association for Computational Linguistics, 2020
-
[12]
Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, et al. Retrieval-augmented generation for knowledge-intensive nlp tasks. Advances in neural information processing systems, 33:9459–9474, 2020
-
[13]
Baichuan alignment technical report
Mingan Lin, Fan Yang, Yanjun Shen, Haoze Sun, Tianpeng Li, Tao Zhang, Chenzheng Zhu, Tao Zhang, Miao Zheng, Xu Li, Yijie Zhou, Mingyang Chen, Yanzhao Qin, Youquan Li, Hao Liang, Fei Li, Yadong Li, Mang Wang, Guosheng Dong, Kun Fang, Jianhua Xu, Bin Cui, Wentao Zhang, Zenan Zhou, and Weipeng Chen. Baichuan alignment technical report. CoRR, abs/2410.14940, 2024
-
[14]
Agentboard: An analytical evaluation board of multi-turn LLM agents
Chang Ma, Junlei Zhang, Zhihao Zhu, Cheng Yang, Yujiu Yang, Yaohui Jin, Zhenzhong Lan, Lingpeng Kong, and Junxian He. Agentboard: An analytical evaluation board of multi-turn LLM agents. In NeurIPS, 2024
-
[15]
Yu Meng, Mengzhou Xia, and Danqi Chen. Simpo: Simple preference optimization with a reference-free reward. Advances in Neural Information Processing Systems, 37:124198–124235, 2024
-
[16]
Niklas Muennighoff, Zitong Yang, Weijia Shi, Xiang Lisa Li, Li Fei-Fei, Hannaneh Hajishirzi, Luke Zettlemoyer, Percy Liang, Emmanuel J. Candès, and Tatsunori Hashimoto. s1: Simple test-time scaling. CoRR, abs/2501.19393, 2025
-
[17]
Learning to reason with LLMs, 2024
OpenAI. Learning to reason with LLMs, 2024. URL https://openai.com/index/learning-to-reason-with-llms
-
[18]
Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in neural information processing systems, 35:27730–27744, 2022
-
[19]
Ofir Press, Muru Zhang, Sewon Min, Ludwig Schmidt, Noah A. Smith, and Mike Lewis. Measuring and narrowing the compositionality gap in language models. In EMNLP (Findings), pages 5687–5711. Association for Computational Linguistics, 2023
-
[20]
Direct preference optimization: Your language model is secretly a reward model
Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D. Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. In NeurIPS, 2023
-
[21]
Toolformer: Language models can teach themselves to use tools
Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Eric Hambro, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. Toolformer: Language models can teach themselves to use tools. In NeurIPS, 2023
-
[22]
Proximal Policy Optimization Algorithms
John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. CoRR, abs/1707.06347, 2017
-
[23]
Enhancing retrieval-augmented large language models with iterative retrieval-generation synergy
Zhihong Shao, Yeyun Gong, Yelong Shen, Minlie Huang, Nan Duan, and Weizhu Chen. Enhancing retrieval-augmented large language models with iterative retrieval-generation synergy. In EMNLP (Findings), pages 9248–9274. Association for Computational Linguistics, 2023
-
[24]
Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Mingchuan Zhang, Y. K. Li, Y. Wu, and Daya Guo. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. CoRR, abs/2402.03300, 2024
-
[25]
Hugginggpt: Solving AI tasks with chatgpt and its friends in hugging face
Yongliang Shen, Kaitao Song, Xu Tan, Dongsheng Li, Weiming Lu, and Yueting Zhuang. Hugginggpt: Solving AI tasks with chatgpt and its friends in hugging face. In NeurIPS, 2023
-
[26]
HybridFlow: A Flexible and Efficient RLHF Framework
Guangming Sheng, Chi Zhang, Zilingfeng Ye, Xibin Wu, Wang Zhang, Ru Zhang, Yanghua Peng, Haibin Lin, and Chuan Wu. Hybridflow: A flexible and efficient RLHF framework. CoRR, abs/2409.19256, 2024
-
[27]
REPLUG: retrieval-augmented black-box language models
Weijia Shi, Sewon Min, Michihiro Yasunaga, Minjoon Seo, Richard James, Mike Lewis, Luke Zettlemoyer, and Wen-tau Yih. REPLUG: retrieval-augmented black-box language models. In NAACL-HLT, pages 8371–8384. Association for Computational Linguistics, 2024
-
[28]
Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters
Charlie Snell, Jaehoon Lee, Kelvin Xu, and Aviral Kumar. Scaling LLM test-time compute optimally can be more effective than scaling model parameters. CoRR, abs/2408.03314, 2024
-
[29]
R1-Searcher: Incentivizing the Search Capability in LLMs via Reinforcement Learning
Huatong Song, Jinhao Jiang, Yingqian Min, Jie Chen, Zhipeng Chen, Wayne Xin Zhao, Lei Fang, and Ji-Rong Wen. R1-searcher: Incentivizing the search capability in llms via reinforcement learning. CoRR, abs/2503.05592, 2025
-
[30]
Reinforcement learning
Richard S Sutton, Andrew G Barto, et al. Reinforcement learning. Journal of Cognitive Neuroscience, 11(1):126–134, 1999
-
[31]
Kimi Team, Angang Du, Bofei Gao, Bowei Xing, Changjiu Jiang, Cheng Chen, Cheng Li, Chenjun Xiao, Chenzhuang Du, Chonghua Liao, et al. Kimi k1.5: Scaling reinforcement learning with llms. arXiv preprint arXiv:2501.12599, 2025
-
[32]
Musique: Multihop questions via single-hop question composition
Harsh Trivedi, Niranjan Balasubramanian, Tushar Khot, and Ashish Sabharwal. Musique: Multihop questions via single-hop question composition. Trans. Assoc. Comput. Linguistics, 10:539–554, 2022
-
[33]
Interleaving retrieval with chain-of-thought reasoning for knowledge-intensive multi-step questions
Harsh Trivedi, Niranjan Balasubramanian, Tushar Khot, and Ashish Sabharwal. Interleaving retrieval with chain-of-thought reasoning for knowledge-intensive multi-step questions. In ACL (1), pages 10014–10037. Association for Computational Linguistics, 2023
-
[34]
Wikidata: a free collaborative knowledgebase
Denny Vrandecic and Markus Krötzsch. Wikidata: a free collaborative knowledgebase. Commun. ACM, 57(10):78–85, 2014
-
[35]
Learning to plan for retrieval-augmented large language models from knowledge graphs
Junjie Wang, Mingyang Chen, Binbin Hu, Dan Yang, Ziqi Liu, Yue Shen, Peng Wei, Zhiqiang Zhang, Jinjie Gu, Jun Zhou, Jeff Z. Pan, Wen Zhang, and Huajun Chen. Learning to plan for retrieval-augmented large language models from knowledge graphs. In EMNLP (Findings), pages 7813–7835. Association for Computational Linguistics, 2024
-
[36]
Text Embeddings by Weakly-Supervised Contrastive Pre-training
Liang Wang, Nan Yang, Xiaolong Huang, Binxing Jiao, Linjun Yang, Daxin Jiang, Rangan Majumder, and Furu Wei. Text embeddings by weakly-supervised contrastive pre-training. CoRR, abs/2212.03533, 2022
-
[37]
Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed H. Chi, Quoc V. Le, and Denny Zhou. Chain-of-thought prompting elicits reasoning in large language models. In NeurIPS, 2022
-
[38]
Corrective Retrieval Augmented Generation
Shi-Qi Yan, Jia-Chen Gu, Yun Zhu, and Zhen-Hua Ling. Corrective retrieval augmented generation. CoRR, abs/2401.15884, 2024
-
[39]
An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, Huan Lin, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jingren Zhou, Junyang Lin, Kai Dang, Keming Lu, Keqin Bao, Kexin Yang, Le Yu, Mei Li, Mingfeng Xue, Pei Zhang, Qin Zhu, Rui Men, Runji Lin, Tianhao Li, Tingyu X...
-
[40]
HotpotQA: A dataset for diverse, explainable multi-hop question answering
Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William W. Cohen, Ruslan Salakhutdinov, and Christopher D. Manning. Hotpotqa: A dataset for diverse, explainable multi-hop question answering. In EMNLP, pages 2369–2380. Association for Computational Linguistics, 2018
-
[41]
Eric Zelikman, Yuhuai Wu, Jesse Mu, and Noah D. Goodman. Star: Bootstrapping reasoning with reasoning. In NeurIPS, 2022
-
[42]
Dense text retrieval based on pretrained language models: A survey
Wayne Xin Zhao, Jing Liu, Ruiyang Ren, and Ji-Rong Wen. Dense text retrieval based on pretrained language models: A survey. ACM Trans. Inf. Syst., 42(4):89:1–89:60, 2024