
arxiv: 2511.11793 · v3 · submitted 2025-11-14 · 💻 cs.CL

MiroThinker: Pushing the Performance Boundaries of Open-Source Research Agents via Model, Context, and Interactive Scaling

Pith reviewed 2026-05-17 21:40 UTC · model grok-4.3

keywords research agents · interactive scaling · tool-augmented reasoning · reinforcement learning · GAIA benchmark · multi-turn interactions · open-source models

The pith

Scaling the depth and frequency of agent-environment interactions improves research agent performance in a manner analogous to scaling model size and context length.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper proposes interaction scaling as a third performance dimension for open-source research agents: the model is trained to manage deeper and more frequent tool calls and to use the environment feedback they return. Through reinforcement learning, the agent learns to sustain up to 600 tool calls per task within a 256K context window, which allows error correction and trajectory refinement during complex tasks. On GAIA, HLE, BrowseComp, and BrowseComp-ZH, the 72B model reaches up to 81.9%, 37.7%, 47.1%, and 55.6% accuracy respectively, exceeding prior open-source systems. A reader would care because this offers a complementary path to higher capability that does not rely solely on larger models or longer contexts. The paper's analysis reports that performance improves predictably as interaction depth increases, in a manner analogous to scaling model size and context length.
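As a rough check on the 600-call and 256K figures above, the sketch below divides the context window evenly across the tool calls. It assumes that every interaction stays in context at once and that "256K" means either 256,000 or 262,144 tokens; neither assumption comes from the paper.

```python
# Back-of-the-envelope: average token budget per interaction if all
# 600 tool calls had to fit in one 256K context window at once.
# Assumption (not from the paper): nothing is evicted or summarized.

def tokens_per_interaction(context_tokens: int, tool_calls: int) -> float:
    """Average number of tokens available per tool call."""
    if tool_calls <= 0:
        raise ValueError("tool_calls must be positive")
    return context_tokens / tool_calls

MAX_TOOL_CALLS = 600  # figure quoted in the abstract

for label, window in (("256 * 1000", 256_000), ("256 * 1024", 262_144)):
    avg = tokens_per_interaction(window, MAX_TOOL_CALLS)
    print(f"{label}: {avg:.1f} tokens per interaction")
```

Under either reading the average is a few hundred tokens per call, which has to cover the model's reasoning, the call itself, and the tool's response. Sustaining 600 calls therefore presumably requires short tool outputs or some trimming of older context; the summary here does not say which.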

Core claim

The central claim is that interactive scaling, achieved by using reinforcement learning to train the model on extended sequences of agent-environment exchanges, enables efficient multi-turn reasoning and information-seeking workflows. This third scaling axis, alongside model capacity and context windows, leads to substantial accuracy gains across representative research benchmarks, with the largest variant approaching the performance of advanced commercial agents.

What carries the argument

The reinforcement learning process that trains the model for deeper and more frequent interactions, allowing sustained reasoning chains that leverage external feedback to correct errors.

If this is right

  • Performance on research tasks improves predictably with greater interaction depth and frequency.
  • Open-source agents can achieve results competitive with commercial systems through this approach.
  • Interactive scaling operates in tandem with model size and context length scaling.
  • Complex real-world workflows become feasible with hundreds of tool calls per task.
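One way to test the first bullet is to fit a simple curve to accuracy measured at several interaction budgets and check how well it extrapolates. The sketch below fits accuracy = a + b * ln(depth) by ordinary least squares. This functional form is an assumption chosen for illustration (the summary above does not state which form the paper uses), and the data points are synthetic.

```python
import math
from typing import List, Tuple

def fit_log_linear(depths: List[int], accuracies: List[float]) -> Tuple[float, float]:
    """Least-squares fit of accuracy = a + b * ln(depth); returns (a, b)."""
    if len(depths) != len(accuracies) or len(depths) < 2:
        raise ValueError("need at least two (depth, accuracy) pairs")
    xs = [math.log(d) for d in depths]
    x_mean = sum(xs) / len(xs)
    y_mean = sum(accuracies) / len(accuracies)
    sxx = sum((x - x_mean) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("depths must not all be equal")
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, accuracies))
    b = sxy / sxx
    a = y_mean - b * x_mean
    return a, b

# Synthetic points generated from a = 0.10, b = 0.05 (not paper data).
depths = [10, 50, 100, 300, 600]
accs = [0.10 + 0.05 * math.log(d) for d in depths]
a, b = fit_log_linear(depths, accs)
print(f"a={a:.3f}  b={b:.3f}")
```

Fitting on the smaller depths and predicting a held-out larger one gives a direct check on whether the trend is predictable rather than merely increasing.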

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar interaction scaling techniques could be tested on other domains like software engineering or scientific discovery agents.
  • Investigating the optimal balance between interaction depth and computational cost would be a natural next step.
  • The approach highlights the value of environment feedback in mitigating issues with long reasoning chains that affect isolated test-time scaling.
  • Community replication on different base models could validate the generality of the scaling observation.

Load-bearing premise

Gains on the evaluated benchmarks arise chiefly from the interactive scaling rather than unmentioned differences in training data or evaluation setups.

What would settle it

A controlled ablation study that restricts interaction depth while holding model size and context length constant and measures whether accuracy improvements disappear.
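A minimal sketch of such an ablation harness follows. The `toy_agent` function is a hypothetical stand-in (a seeded random process, not MiroThinker), and the budget values are placeholders. Only the structure matters: one fixed agent and task set, with the tool-call cap as the sole varied factor.

```python
import random
from typing import Callable, Dict, List

# Hypothetical stand-in for a real agent: each tool call has a fixed
# chance of finding the answer, so more calls means more chances.
# This is NOT the paper's model; it only makes the harness runnable.
def toy_agent(task_id: int, max_tool_calls: int, seed: int = 0) -> bool:
    rng = random.Random(seed * 1_000_003 + task_id)
    for _ in range(max_tool_calls):
        if rng.random() < 0.02:  # placeholder per-call success rate
            return True
    return False

def ablate_interaction_budget(
    agent: Callable[[int, int], bool],
    task_ids: List[int],
    budgets: List[int],
) -> Dict[int, float]:
    """Accuracy per tool-call budget, with the agent and tasks held fixed."""
    results = {}
    for budget in budgets:
        solved = sum(agent(task_id, budget) for task_id in task_ids)
        results[budget] = solved / len(task_ids)
    return results

if __name__ == "__main__":
    tasks = list(range(200))
    sweep = ablate_interaction_budget(toy_agent, tasks, [10, 50, 100])
    for budget, acc in sweep.items():
        print(f"max_tool_calls={budget:>3}  accuracy={acc:.3f}")
```

In a real study, `toy_agent` would be replaced by a call into the trained model, using the same checkpoint and context window at every budget, so that any accuracy difference is attributable to the cap alone.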

Original abstract

We present MiroThinker v1.0, an open-source research agent designed to advance tool-augmented reasoning and information-seeking capabilities. Unlike previous agents that only scale up model size or context length, MiroThinker explores interaction scaling at the model level, systematically training the model to handle deeper and more frequent agent-environment interactions as a third dimension of performance improvement. Unlike LLM test-time scaling, which operates in isolation and risks degradation with longer reasoning chains, interactive scaling leverages environment feedback and external information acquisition to correct errors and refine trajectories. Through reinforcement learning, the model achieves efficient interaction scaling: with a 256K context window, it can perform up to 600 tool calls per task, enabling sustained multi-turn reasoning and complex real-world research workflows. Across four representative benchmarks (GAIA, HLE, BrowseComp, and BrowseComp-ZH), the 72B variant achieves up to 81.9%, 37.7%, 47.1%, and 55.6% accuracy respectively, surpassing previous open-source agents and approaching commercial counterparts such as GPT-5-high. Our analysis reveals that MiroThinker benefits from interactive scaling consistently: research performance improves predictably as the model engages in deeper and more frequent agent-environment interactions, demonstrating that interaction depth exhibits scaling behaviors analogous to model size and context length. These findings establish interaction scaling as a third critical dimension for building next-generation open research agents, complementing model capacity and context windows.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript presents MiroThinker v1.0, an open-source research agent that uses reinforcement learning to train for deeper and more frequent agent-environment interactions (up to 600 tool calls in a 256K context window). It reports benchmark accuracies for the 72B variant of 81.9% on GAIA, 37.7% on HLE, 47.1% on BrowseComp, and 55.6% on BrowseComp-ZH, claiming these surpass prior open-source agents and approach commercial systems, while demonstrating that research performance improves predictably with interaction depth as a third scaling dimension analogous to model size and context length.

Significance. If the reported gains can be shown to arise specifically from controlled variation in interaction depth rather than differences in training data, reward design, or evaluation protocols, the work would establish interaction scaling as a viable new axis for open-source tool-augmented agents. The concrete benchmark numbers and the emphasis on environment feedback correcting long reasoning chains would be a useful empirical contribution to the literature on scaling laws for agents.

major comments (2)
  1. [Abstract] The central claim that 'interaction depth exhibits scaling behaviors analogous to model size and context length' and that performance 'improves predictably' with deeper interactions requires explicit isolation of the interaction variable. The description of RL training for multi-turn tool use does not specify whether the 72B model was compared against ablations that hold model size, context length, and base capabilities fixed while varying only tool-call budget or interaction frequency; without such controls the analogy remains unproven.
  2. [Abstract, benchmark results] The reported accuracies (81.9% GAIA, 37.7% HLE, etc.) are presented without accompanying details on statistical significance, number of runs, variance, or exact evaluation protocols. This makes it impossible to assess whether the gains over prior open-source agents are robust or could be explained by differences in data curation or reward signals rather than interactive scaling.
minor comments (1)
  1. The manuscript should include a dedicated section or table that lists the precise differences in training data, reward formulation, and evaluation setup relative to the strongest prior open-source baselines cited.
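
To make major comment 2 concrete, the sketch below shows one common way to report run-to-run variability: the mean, the sample standard deviation, and a percentile bootstrap confidence interval over per-run accuracies. The function name and the sample scores are illustrative placeholders, not results from the paper.

```python
import random
import statistics
from typing import List, Tuple

def summarize_runs(
    accuracies: List[float],
    n_resamples: int = 10_000,
    confidence: float = 0.95,
    seed: int = 0,
) -> Tuple[float, float, Tuple[float, float]]:
    """Return (mean, sample std, percentile bootstrap CI of the mean)."""
    if len(accuracies) < 2:
        raise ValueError("need at least two runs to estimate variance")
    rng = random.Random(seed)
    n = len(accuracies)
    # Resample runs with replacement and record each resample's mean.
    means = sorted(
        statistics.fmean(rng.choices(accuracies, k=n))
        for _ in range(n_resamples)
    )
    alpha = (1.0 - confidence) / 2.0
    lower = means[int(alpha * n_resamples)]
    upper = means[min(n_resamples - 1, int((1.0 - alpha) * n_resamples))]
    return statistics.fmean(accuracies), statistics.stdev(accuracies), (lower, upper)

# Placeholder per-run accuracies, made up for illustration only.
example_runs = [0.80, 0.82, 0.79, 0.83, 0.81]
mean, std, (lo, hi) = summarize_runs(example_runs)
print(f"mean={mean:.3f}  std={std:.3f}  95% CI=({lo:.3f}, {hi:.3f})")
```

Reporting an interval of this kind alongside each headline number would let readers judge whether a gap to a baseline exceeds run-to-run noise.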

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the insightful comments. We provide point-by-point responses and indicate planned revisions to address the concerns about isolating interaction scaling and detailing benchmark evaluations.

Point-by-point responses
  1. Referee: The central claim that 'interaction depth exhibits scaling behaviors analogous to model size and context length' and that performance 'improves predictably' with deeper interactions requires explicit isolation of the interaction variable. The description of RL training for multi-turn tool use does not specify whether the 72B model was compared against ablations that hold model size, context length, and base capabilities fixed while varying only tool-call budget or interaction frequency; without such controls the analogy remains unproven.

    Authors: We thank the referee for this observation. Our RL training procedure is explicitly aimed at scaling interaction depth by rewarding trajectories with successful multi-turn tool interactions and environment feedback utilization. The analysis in the paper shows consistent performance gains as the number of interactions increases. To strengthen the isolation, we will add controlled ablations in the revision that fix the model, context window, and training setup while varying the maximum allowed interaction depth or tool call budget. revision: yes

  2. Referee: The reported accuracies (81.9% GAIA, 37.7% HLE, etc.) are presented without accompanying details on statistical significance, number of runs, variance, or exact evaluation protocols. This makes it impossible to assess whether the gains over prior open-source agents are robust or could be explained by differences in data curation or reward signals rather than interactive scaling.

    Authors: We acknowledge the need for greater transparency in evaluation. The revised manuscript will include details on the number of runs, variance measures, statistical significance where relevant, and full descriptions of the evaluation protocols, including how tool calls are handled and success criteria are applied. This will help demonstrate the robustness of the results and the role of interactive scaling. revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical benchmark results after RL training

Full rationale

The paper's central claim rests on empirical observations that performance on GAIA, HLE, BrowseComp, and BrowseComp-ZH improves with greater interaction depth and frequency after reinforcement learning for multi-turn tool use. No equations, fitted parameters, or self-referential predictions are invoked that would reduce the reported accuracies or scaling analogy to quantities defined in terms of themselves. The analysis of interaction scaling is presented as a post-training measurement against external benchmarks rather than a derivation that loops back to its inputs. Self-citations, if present, are not load-bearing for the core result, which remains falsifiable via independent replication on the same benchmarks. This is a standard empirical scaling study without the circular patterns enumerated in the guidelines.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on standard assumptions of RL for agent training and the validity of the chosen benchmarks as proxies for research capability; no new physical or mathematical axioms are introduced.

free parameters (1)
  • interaction depth / tool call budget
    Maximum number of tool calls (up to 600) is a design choice that directly affects measured performance.
axioms (1)
  • domain assumption: Environment feedback from tool calls reliably corrects reasoning errors in multi-turn trajectories.
    Invoked when claiming interactive scaling improves trajectories without degradation.

pith-pipeline@v0.9.0 · 5767 in / 1192 out tokens · 40804 ms · 2026-05-17T21:40:52.826466+00:00 · methodology


Forward citations

Cited by 12 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. HyperEyes: Dual-Grained Efficiency-Aware Reinforcement Learning for Parallel Multimodal Search Agents

    cs.LG 2026-05 unverdicted novelty 7.0

    HyperEyes uses a dual-grained RL framework with parallel tool actions and efficiency rewards to achieve 9.9% higher accuracy and 5.3x fewer tool calls than prior open-source multimodal agents.

  2. Efficient Agentic Reasoning Through Self-Regulated Simulative Planning

    cs.AI 2026-05 unverdicted novelty 6.0

    SR²AM achieves competitive Pass@1 accuracy on diverse tasks with 25.8-95.3% fewer reasoning tokens than much larger models by using self-regulated simulative planning trained via supervised learning and RL.

  3. PiCA: Pivot-Based Credit Assignment for Search Agentic Reinforcement Learning

    cs.AI 2026-05 unverdicted novelty 6.0

    PiCA uses pivot-based potential rewards derived from historical sub-queries to supply trajectory-aware step guidance in agentic RL, delivering 15% gains on QA benchmarks for 3B/7B models.

  4. PiCA: Pivot-Based Credit Assignment for Search Agentic Reinforcement Learning

    cs.AI 2026-05 unverdicted novelty 6.0

    PiCA improves RL for LLM search agents by defining process rewards around pivot steps that act as information peaks boosting final answer success probability via potential-based shaping.

  5. CellScientist: Dual-Space Hierarchical Orchestration for Closed-Loop Refinement of Virtual Cell Models

    cs.LG 2026-05 unverdicted novelty 6.0

    CellScientist introduces a dual-space hierarchical orchestration system that enables closed-loop refinement of virtual cell models by routing execution discrepancies back to hypothesis or implementation updates, yield...

  6. HyperEyes: Dual-Grained Efficiency-Aware Reinforcement Learning for Parallel Multimodal Search Agents

    cs.LG 2026-05 unverdicted novelty 6.0

    HyperEyes presents a parallel multimodal search agent using dual-grained efficiency-aware RL with a new TRACE reward and IMEB benchmark, claiming 9.9% higher accuracy and 5.3x fewer tool calls than prior open-source agents.

  7. SciResearcher: Scaling Deep Research Agents for Frontier Scientific Reasoning

    cs.AI 2026-05 unverdicted novelty 6.0

    SciResearcher automates creation of diverse scientific reasoning tasks from academic evidence to train an 8B model that sets new SOTA at 19.46% on HLE-Bio/Chem-Gold and gains 13-15% on SuperGPQA-Hard-Biology and TRQA-...

  8. DR-Venus: Towards Frontier Edge-Scale Deep Research Agents with Only 10K Open Data

    cs.LG 2026-04 unverdicted novelty 6.0

    A 4B deep research agent trained on 10K open data outperforms prior agents under 9B parameters and narrows the gap to 30B-class systems on research benchmarks.

  9. Mind DeepResearch Technical Report

    cs.AI 2026-04 unverdicted novelty 5.0

    MindDR combines a Planning Agent, DeepSearch Agent, and Report Agent with SFT cold-start, Search-RL, Report-RL, and preference alignment to reach competitive scores on research benchmarks using 30B-scale models.

  10. Reward Hacking in the Era of Large Models: Mechanisms, Emergent Misalignment, Challenges

    cs.LG 2026-04 unverdicted novelty 5.0

    The paper introduces the Proxy Compression Hypothesis as a unifying framework explaining reward hacking in RLHF as an emergent result of compressing high-dimensional human objectives into proxy reward signals under op...

  11. AgentCE-Bench: Agent Configurable Evaluation with Scalable Horizons and Controllable Difficulty under Lightweight Environments

    cs.AI 2026-04 unverdicted novelty 5.0

    AgentCE-Bench is a lightweight grid-planning benchmark that controls task horizon via hidden slots H and difficulty via decoy budget B, validated across 13 models for consistent and discriminative evaluation.

  12. PASK: Toward Intent-Aware Proactive Agents with Long-Term Memory

    cs.AI 2026-04 unverdicted novelty 4.0

    PASK introduces the DD-MM-PAS paradigm for streaming proactive agents with intent-aware detection, hybrid memory modeling, and a new real-world benchmark where the IntentFlow model matches top LLMs on latency while fi...
