MiroThinker: Pushing the Performance Boundaries of Open-Source Research Agents via Model, Context, and Interactive Scaling
Pith reviewed 2026-05-17 21:40 UTC · model grok-4.3
The pith
Scaling the depth and frequency of agent-environment interactions improves research agent performance in a manner analogous to scaling model size and context length.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that interactive scaling, achieved by using reinforcement learning to train the model on extended sequences of agent-environment exchanges, enables efficient multi-turn reasoning and information-seeking workflows. This third scaling axis, alongside model capacity and context windows, yields substantial accuracy gains on GAIA, HLE, BrowseComp, and BrowseComp-ZH, with the 72B variant approaching the performance of commercial agents such as GPT-5-high.
What carries the argument
The reinforcement learning process that trains the model for deeper and more frequent interactions, allowing sustained reasoning chains that leverage external feedback to correct errors.
If this is right
- Performance on research tasks improves predictably with greater interaction depth and frequency.
- Open-source agents can achieve results competitive with commercial systems through this approach.
- Interactive scaling operates in tandem with model size and context length scaling.
- Complex real-world workflows become feasible with hundreds of tool calls per task.
Where Pith is reading between the lines
- Similar interaction scaling techniques could be tested on other domains like software engineering or scientific discovery agents.
- Investigating the optimal balance between interaction depth and computational cost would be a natural next step.
- The approach highlights the value of environment feedback in mitigating issues with long reasoning chains that affect isolated test-time scaling.
- Community replication on different base models could validate the generality of the scaling observation.
Load-bearing premise
Gains on the evaluated benchmarks arise chiefly from the interactive scaling rather than unmentioned differences in training data or evaluation setups.
What would settle it
A controlled ablation study that restricts interaction depth while holding model size and context length constant and measures whether accuracy improvements disappear.
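A minimal sketch of what such an ablation harness could look like, assuming a hypothetical `run_agent(task, max_tool_calls)` callable that reports whether a task was answered correctly. The toy agent below is a stand-in invented for illustration, not MiroThinker, and its numbers carry no meaning; the point is only the structure: one fixed model and context setup, with the tool-call budget as the single swept variable.

```python
import random
from typing import Callable, Dict, Sequence


def sweep_interaction_budget(
    run_agent: Callable[[str, int], bool],
    tasks: Sequence[str],
    budgets: Sequence[int],
) -> Dict[int, float]:
    """Return accuracy per tool-call budget.

    The same run_agent (same model, same context window) is reused for
    every budget, so the budget is the only thing that varies.
    """
    results: Dict[int, float] = {}
    for budget in budgets:
        correct = sum(run_agent(task, budget) for task in tasks)
        results[budget] = correct / len(tasks)
    return results


def make_toy_agent(seed: int = 0) -> Callable[[str, int], bool]:
    """Hypothetical stand-in agent: each task needs a hidden number of
    tool calls and is solved only if the budget covers that number."""
    rng = random.Random(seed)
    needed: Dict[str, int] = {}

    def run_agent(task: str, max_tool_calls: int) -> bool:
        if task not in needed:
            needed[task] = rng.randint(1, 600)
        return max_tool_calls >= needed[task]

    return run_agent


tasks = [f"task-{i}" for i in range(200)]
accuracy_by_budget = sweep_interaction_budget(
    make_toy_agent(), tasks, [10, 50, 200, 600]
)
for budget, acc in accuracy_by_budget.items():
    print(f"max_tool_calls={budget:>3}  accuracy={acc:.2f}")
```

With a real agent in place of the toy one, a flat accuracy curve across budgets would argue against the interaction-scaling claim, while a rising curve under fixed model and context would support it.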
Original abstract
We present MiroThinker v1.0, an open-source research agent designed to advance tool-augmented reasoning and information-seeking capabilities. Unlike previous agents that only scale up model size or context length, MiroThinker explores interaction scaling at the model level, systematically training the model to handle deeper and more frequent agent-environment interactions as a third dimension of performance improvement. Unlike LLM test-time scaling, which operates in isolation and risks degradation with longer reasoning chains, interactive scaling leverages environment feedback and external information acquisition to correct errors and refine trajectories. Through reinforcement learning, the model achieves efficient interaction scaling: with a 256K context window, it can perform up to 600 tool calls per task, enabling sustained multi-turn reasoning and complex real-world research workflows. Across four representative benchmarks (GAIA, HLE, BrowseComp, and BrowseComp-ZH), the 72B variant achieves up to 81.9%, 37.7%, 47.1%, and 55.6% accuracy respectively, surpassing previous open-source agents and approaching commercial counterparts such as GPT-5-high. Our analysis reveals that MiroThinker benefits from interactive scaling consistently: research performance improves predictably as the model engages in deeper and more frequent agent-environment interactions, demonstrating that interaction depth exhibits scaling behaviors analogous to model size and context length. These findings establish interaction scaling as a third critical dimension for building next-generation open research agents, complementing model capacity and context windows.
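The interaction pattern the abstract describes, repeated tool calls whose observations are fed back into the context under a tool-call cap and a context window, can be sketched as a simple loop. Everything in this sketch is illustrative: `policy`, `tools`, the step format, and the whitespace token count are assumptions, not MiroThinker's actual interface. Only the default limits (600 calls, a 256K window) are taken from the abstract.

```python
from typing import Callable, Dict, List, Optional, Tuple

# A policy step is (tool_name, tool_input), or ("final", answer) to stop.
Step = Tuple[str, str]


def count_tokens(text: str) -> int:
    # Crude whitespace count (assumption); a real system would use the
    # model's own tokenizer.
    return len(text.split())


def run_episode(
    policy: Callable[[List[str]], Step],
    tools: Dict[str, Callable[[str], str]],
    question: str,
    max_tool_calls: int = 600,
    max_context_tokens: int = 256_000,
) -> Tuple[Optional[str], int]:
    """Return (answer or None, number of tool calls made)."""
    context = [question]
    used_tokens = count_tokens(question)
    calls = 0
    while True:
        name, payload = policy(context)
        if name == "final":
            return payload, calls
        if calls >= max_tool_calls:
            return None, calls  # tool-call budget exhausted, no answer
        observation = tools[name](payload)
        calls += 1
        used_tokens += count_tokens(observation)
        if used_tokens > max_context_tokens:
            return None, calls  # context window exhausted, no answer
        # Environment feedback becomes input to the next policy step.
        context.append(observation)


def toy_policy(context: List[str]) -> Step:
    # Hypothetical policy: search three times, then answer with the
    # most recent observation.
    if len(context) < 4:
        return ("search", f"query {len(context)}")
    return ("final", context[-1])


toy_tools = {"search": lambda q: f"result for {q}"}
answer, n_calls = run_episode(toy_policy, toy_tools, "What is X?")
print(answer, n_calls)
```

Lowering `max_tool_calls` in this loop is the kind of knob a controlled study of interaction depth would vary while leaving the policy and context limit untouched.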
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents MiroThinker v1.0, an open-source research agent that uses reinforcement learning to train for deeper and more frequent agent-environment interactions (up to 600 tool calls in a 256K context window). It reports benchmark accuracies for the 72B variant of 81.9% on GAIA, 37.7% on HLE, 47.1% on BrowseComp, and 55.6% on BrowseComp-ZH, claiming these surpass prior open-source agents and approach commercial systems, while demonstrating that research performance improves predictably with interaction depth as a third scaling dimension analogous to model size and context length.
Significance. If the reported gains can be shown to arise specifically from controlled variation in interaction depth rather than differences in training data, reward design, or evaluation protocols, the work would establish interaction scaling as a viable new axis for open-source tool-augmented agents. The concrete benchmark numbers and the emphasis on environment feedback correcting long reasoning chains would be a useful empirical contribution to the literature on scaling laws for agents.
major comments (2)
- [Abstract] The central claim that 'interaction depth exhibits scaling behaviors analogous to model size and context length' and that performance 'improves predictably' with deeper interactions requires explicit isolation of the interaction variable. The description of RL training for multi-turn tool use does not specify whether the 72B model was compared against ablations that hold model size, context length, and base capabilities fixed while varying only tool-call budget or interaction frequency; without such controls the analogy remains unproven.
- [Abstract, benchmark results] The reported accuracies (81.9% GAIA, 37.7% HLE, etc.) are presented without accompanying details on statistical significance, number of runs, variance, or exact evaluation protocols. This makes it impossible to assess whether the gains over prior open-source agents are robust or could be explained by differences in data curation or reward signals rather than interactive scaling.
minor comments (1)
- The manuscript should include a dedicated section or table that lists the precise differences in training data, reward formulation, and evaluation setup relative to the strongest prior open-source baselines cited.
Simulated Author's Rebuttal
Thank you for the referee's insightful comments. We provide point-by-point responses and indicate planned revisions to address the concerns about isolating interaction scaling and detailing benchmark evaluations.
Point-by-point responses
- Referee: The central claim that 'interaction depth exhibits scaling behaviors analogous to model size and context length' and that performance 'improves predictably' with deeper interactions requires explicit isolation of the interaction variable. The description of RL training for multi-turn tool use does not specify whether the 72B model was compared against ablations that hold model size, context length, and base capabilities fixed while varying only tool-call budget or interaction frequency; without such controls the analogy remains unproven.
  Authors: We thank the referee for this observation. Our RL training procedure is explicitly aimed at scaling interaction depth by rewarding trajectories with successful multi-turn tool interactions and environment feedback utilization. The analysis in the paper shows consistent performance gains as the number of interactions increases. To strengthen the isolation, we will add controlled ablations in the revision that fix the model, context window, and training setup while varying the maximum allowed interaction depth or tool call budget.
  Revision: yes
- Referee: The reported accuracies (81.9% GAIA, 37.7% HLE, etc.) are presented without accompanying details on statistical significance, number of runs, variance, or exact evaluation protocols. This makes it impossible to assess whether the gains over prior open-source agents are robust or could be explained by differences in data curation or reward signals rather than interactive scaling.
  Authors: We acknowledge the need for greater transparency in evaluation. The revised manuscript will include details on the number of runs, variance measures, statistical significance where relevant, and full descriptions of the evaluation protocols, including how tool calls are handled and success criteria are applied. This will help demonstrate the robustness of the results and the role of interactive scaling.
  Revision: yes
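As a rough sketch of the reporting the referee asks for, the snippet below summarizes accuracy over repeated runs with a mean, a sample standard deviation, and a percentile bootstrap interval. The per-run scores are placeholder values invented for illustration, not results from the paper, and the bootstrap is one common choice rather than the authors' stated protocol.

```python
import random
import statistics
from typing import List, Tuple


def summarize_runs(
    scores: List[float],
    n_boot: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float, Tuple[float, float]]:
    """Return (mean, sample std, percentile bootstrap interval)."""
    mean = statistics.mean(scores)
    std = statistics.stdev(scores)  # sample std; needs at least 2 runs
    rng = random.Random(seed)
    # Resample the runs with replacement and record each resample's mean.
    boot_means = sorted(
        statistics.mean(rng.choices(scores, k=len(scores)))
        for _ in range(n_boot)
    )
    lower = boot_means[int((alpha / 2) * n_boot)]
    upper = boot_means[int((1 - alpha / 2) * n_boot) - 1]
    return mean, std, (lower, upper)


# Placeholder per-run accuracies (invented for illustration only).
placeholder_scores = [0.52, 0.55, 0.50, 0.53, 0.55]
run_mean, run_std, (ci_low, ci_high) = summarize_runs(placeholder_scores)
print(f"mean={run_mean:.3f} std={run_std:.3f} 95% CI=({ci_low:.3f}, {ci_high:.3f})")
```

Reporting an interval like this alongside each headline number would let readers judge whether a gap to a baseline exceeds run-to-run noise.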
Circularity Check
No significant circularity; empirical benchmark results after RL training
Full rationale
The paper's central claim rests on empirical observations that performance on GAIA, HLE, BrowseComp, and BrowseComp-ZH improves with greater interaction depth and frequency after reinforcement learning for multi-turn tool use. No equations, fitted parameters, or self-referential predictions are invoked that would reduce the reported accuracies or scaling analogy to quantities defined in terms of themselves. The analysis of interaction scaling is presented as a post-training measurement against external benchmarks rather than a derivation that loops back to its inputs. Self-citations, if present, are not load-bearing for the core result, which remains falsifiable via independent replication on the same benchmarks. This is a standard empirical scaling study without the circular patterns enumerated in the guidelines.
Axiom & Free-Parameter Ledger
free parameters (1)
- interaction depth / tool call budget
axioms (1)
- Domain assumption: Environment feedback from tool calls reliably corrects reasoning errors in multi-turn trajectories.
Forward citations
Cited by 12 Pith papers
- HyperEyes: Dual-Grained Efficiency-Aware Reinforcement Learning for Parallel Multimodal Search Agents
  HyperEyes uses a dual-grained RL framework with parallel tool actions and efficiency rewards to achieve 9.9% higher accuracy and 5.3x fewer tool calls than prior open-source multimodal agents.
- Efficient Agentic Reasoning Through Self-Regulated Simulative Planning
  SR²AM achieves competitive Pass@1 accuracy on diverse tasks with 25.8-95.3% fewer reasoning tokens than much larger models by using self-regulated simulative planning trained via supervised learning and RL.
- PiCA: Pivot-Based Credit Assignment for Search Agentic Reinforcement Learning
  PiCA uses pivot-based potential rewards derived from historical sub-queries to supply trajectory-aware step guidance in agentic RL, delivering 15% gains on QA benchmarks for 3B/7B models.
- PiCA: Pivot-Based Credit Assignment for Search Agentic Reinforcement Learning
  PiCA improves RL for LLM search agents by defining process rewards around pivot steps that act as information peaks boosting final answer success probability via potential-based shaping.
- CellScientist: Dual-Space Hierarchical Orchestration for Closed-Loop Refinement of Virtual Cell Models
  CellScientist introduces a dual-space hierarchical orchestration system that enables closed-loop refinement of virtual cell models by routing execution discrepancies back to hypothesis or implementation updates, yield...
- HyperEyes: Dual-Grained Efficiency-Aware Reinforcement Learning for Parallel Multimodal Search Agents
  HyperEyes presents a parallel multimodal search agent using dual-grained efficiency-aware RL with a new TRACE reward and IMEB benchmark, claiming 9.9% higher accuracy and 5.3x fewer tool calls than prior open-source agents.
- SciResearcher: Scaling Deep Research Agents for Frontier Scientific Reasoning
  SciResearcher automates creation of diverse scientific reasoning tasks from academic evidence to train an 8B model that sets new SOTA at 19.46% on HLE-Bio/Chem-Gold and gains 13-15% on SuperGPQA-Hard-Biology and TRQA-...
- DR-Venus: Towards Frontier Edge-Scale Deep Research Agents with Only 10K Open Data
  A 4B deep research agent trained on 10K open data outperforms prior agents under 9B parameters and narrows the gap to 30B-class systems on research benchmarks.
- Mind DeepResearch Technical Report
  MindDR combines a Planning Agent, DeepSearch Agent, and Report Agent with SFT cold-start, Search-RL, Report-RL, and preference alignment to reach competitive scores on research benchmarks using 30B-scale models.
- Reward Hacking in the Era of Large Models: Mechanisms, Emergent Misalignment, Challenges
  The paper introduces the Proxy Compression Hypothesis as a unifying framework explaining reward hacking in RLHF as an emergent result of compressing high-dimensional human objectives into proxy reward signals under op...
- AgentCE-Bench: Agent Configurable Evaluation with Scalable Horizons and Controllable Difficulty under Lightweight Environments
  AgentCE-Bench is a lightweight grid-planning benchmark that controls task horizon via hidden slots H and difficulty via decoy budget B, validated across 13 models for consistent and discriminative evaluation.
- PASK: Toward Intent-Aware Proactive Agents with Long-Term Memory
  PASK introduces the DD-MM-PAS paradigm for streaming proactive agents with intent-aware detection, hybrid memory modeling, and a new real-world benchmark where the IntentFlow model matches top LLMs on latency while fi...