MiroThinker: Pushing the Performance Boundaries of Open-Source Research Agents via Model, Context, and Interactive Scaling

Bin Wang; Carson Chen; Chenxia Han; Chenxin Tao; Chenyu Yang; Gen Luo; Guanzheng Chen; Hai Ye; Haonan Wang; James Wang

arxiv: 2511.11793 · v3 · submitted 2025-11-14 · 💻 cs.CL

MiroMind Team, Song Bai, Lidong Bing, Carson Chen, Guanzheng Chen, Yuntao Chen, Zhe Chen, Ziyi Chen, Jifeng Dai, Xuan Dong, Wenhan Dou, Yue Deng, Yunjie Fu, Junqi Ge, Chenxia Han, Tammy Huang, Zhenhang Huang, Jerry Jiao, Shilei Jiang, Tianyu Jiao, Xiaoqi Jian, Lei Lei, Ruilin Li, Gen Luo, Tiantong Li, Xiang Lin, Ziyuan Liu, Zhiqi Li, Jie Ni, Qiang Ren, Pax Sun, Shiqian Su, Chenxin Tao, Bin Wang, Wenhai Wang, Haonan Wang, James Wang, Jin Wang, Jojo Wang, Letian Wang, Shizun Wang, Weizhi Wang, Zixuan Wang, Jinfan Xu, Sen Xing, Chenyu Yang, Hai Ye, Jiaheng Yu, Yue Yu, Muyan Zhong, Tianchen Zhao, Xizhou Zhu, Yanpeng Zhou, Yifan Zhang, Zhi Zhu

Pith reviewed 2026-05-17 21:40 UTC · model grok-4.3

classification 💻 cs.CL

keywords research agents · interactive scaling · tool-augmented reasoning · reinforcement learning · GAIA benchmark · multi-turn interactions · open-source models

The pith

Scaling the depth and frequency of agent-environment interactions improves research agent performance in a manner analogous to scaling model size and context length.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper proposes interaction scaling as a third performance dimension for open-source research agents, training models to manage deeper and more frequent tool calls and environment feedback. Through reinforcement learning, the agent learns to sustain up to 600 tool calls per task within a 256K context window, which leaves room for error correction and trajectory refinement during complex tasks. On GAIA, HLE, BrowseComp, and BrowseComp-ZH, the 72B model reaches up to 81.9%, 37.7%, 47.1%, and 55.6% accuracy respectively, exceeding prior open-source systems. A reader would care because this offers a complementary path to higher capability that does not rely solely on larger models or longer contexts. The paper's analysis reports that performance improves predictably as interaction depth increases.

Core claim

The central discovery is that interactive scaling, achieved by reinforcement learning to handle extended sequences of agent-environment exchanges, enables efficient multi-turn reasoning and information-seeking workflows. This third scaling axis, alongside model capacity and context windows, leads to substantial accuracy gains across representative research benchmarks, with the largest variant approaching the performance of advanced commercial agents.

What carries the argument

The reinforcement learning process that trains the model for deeper and more frequent interactions, allowing sustained reasoning chains that leverage external feedback to correct errors.

If this is right

Performance on research tasks improves predictably with greater interaction depth and frequency.
Open-source agents can achieve results competitive with commercial systems through this approach.
Interactive scaling operates in tandem with model size and context length scaling.
Complex real-world workflows become feasible with hundreds of tool calls per task.
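
One way to test whether performance "improves predictably" with interaction depth is to fit a simple curve to accuracy measured at several tool-call budgets. The sketch below assumes a log-linear form, accuracy = a + b * ln(depth). That functional form is an assumption made here for illustration; the paper is not quoted as proposing it. The data points are synthetic, generated from known coefficients so that the fit can be checked, and are not results from the paper.

```python
import math
from typing import Sequence, Tuple


def fit_log_linear(depths: Sequence[float], accs: Sequence[float]) -> Tuple[float, float]:
    """Ordinary least squares fit of acc = a + b * ln(depth); returns (a, b)."""
    xs = [math.log(d) for d in depths]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(accs) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, accs))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Synthetic, noise-free points generated from known coefficients (NOT paper data).
TRUE_A, TRUE_B = 0.10, 0.08
depths = [10, 30, 100, 300, 600]
accs = [TRUE_A + TRUE_B * math.log(d) for d in depths]

a, b = fit_log_linear(depths, accs)
print(f"intercept={a:.3f} slope={b:.3f}")
```

With real measurements, the residuals of such a fit would indicate how well "predictable" holds, and a different functional form may well fit better.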

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar interaction scaling techniques could be tested on other domains like software engineering or scientific discovery agents.
Investigating the optimal balance between interaction depth and computational cost would be a natural next step.
The approach highlights the value of environment feedback in mitigating issues with long reasoning chains that affect isolated test-time scaling.
Community replication on different base models could validate the generality of the scaling observation.

Load-bearing premise

Gains on the evaluated benchmarks arise chiefly from the interactive scaling rather than unmentioned differences in training data or evaluation setups.

What would settle it

A controlled ablation study that restricts interaction depth while holding model size and context length constant and measures whether accuracy improvements disappear.
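
The shape of such an ablation can be sketched as a sweep over the tool-call budget with everything else held fixed. This is a toy harness under stated assumptions: `make_tasks` and `solve` are hypothetical stand-ins for a benchmark and an agent run, and each synthetic task is reduced to the number of tool calls it needs. Accuracy rises with budget here by construction, so the output illustrates the experimental layout rather than any finding.

```python
import random
from typing import Dict, List, Sequence


def make_tasks(n: int, seed: int = 0) -> List[int]:
    """Hypothetical task set: each task is the number of tool calls needed to solve it."""
    rng = random.Random(seed)
    return [rng.randint(1, 600) for _ in range(n)]


def solve(required_calls: int, budget: int) -> bool:
    """Stand-in for one agent run: succeeds only if the budget covers the task."""
    return required_calls <= budget


def ablate_budget(tasks: Sequence[int], budgets: Sequence[int]) -> Dict[int, float]:
    """Accuracy per budget; the task set and the 'agent' are identical across budgets."""
    return {b: sum(solve(t, b) for t in tasks) / len(tasks) for b in budgets}


tasks = make_tasks(200)
curve = ablate_budget(tasks, budgets=[10, 50, 100, 300, 600])
for budget, acc in curve.items():
    print(f"budget={budget:>3}  accuracy={acc:.3f}")
```

In a real study, `solve` would be replaced by an actual agent rollout capped at `budget` tool calls, with the same checkpoint, context window, and evaluation protocol used for every budget.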

Original abstract

We present MiroThinker v1.0, an open-source research agent designed to advance tool-augmented reasoning and information-seeking capabilities. Unlike previous agents that only scale up model size or context length, MiroThinker explores interaction scaling at the model level, systematically training the model to handle deeper and more frequent agent-environment interactions as a third dimension of performance improvement. Unlike LLM test-time scaling, which operates in isolation and risks degradation with longer reasoning chains, interactive scaling leverages environment feedback and external information acquisition to correct errors and refine trajectories. Through reinforcement learning, the model achieves efficient interaction scaling: with a 256K context window, it can perform up to 600 tool calls per task, enabling sustained multi-turn reasoning and complex real-world research workflows. Across four representative benchmarks (GAIA, HLE, BrowseComp, and BrowseComp-ZH), the 72B variant achieves up to 81.9%, 37.7%, 47.1%, and 55.6% accuracy respectively, surpassing previous open-source agents and approaching commercial counterparts such as GPT-5-high. Our analysis reveals that MiroThinker benefits from interactive scaling consistently: research performance improves predictably as the model engages in deeper and more frequent agent-environment interactions, demonstrating that interaction depth exhibits scaling behaviors analogous to model size and context length. These findings establish interaction scaling as a third critical dimension for building next-generation open research agents, complementing model capacity and context windows.
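
The agent-environment loop described in the abstract can be sketched as below. This is a minimal illustration, not MiroThinker's implementation: `toy_policy`, `toy_tool`, and the whitespace-based `count_tokens` are hypothetical stand-ins, and only the two limits (600 tool calls, 256K context) are taken from the abstract.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple

MAX_TOOL_CALLS = 600      # per-task budget quoted in the abstract
CONTEXT_TOKENS = 256_000  # context window quoted in the abstract

# A policy maps the history to ("tool", query) or ("answer", text).
Policy = Callable[[List[str]], Tuple[str, str]]
Tool = Callable[[str], str]


@dataclass
class Trajectory:
    messages: List[str] = field(default_factory=list)
    tool_calls: int = 0
    answer: Optional[str] = None


def count_tokens(messages: List[str]) -> int:
    """Crude whitespace proxy (assumption); a real system would use the model tokenizer."""
    return sum(len(m.split()) for m in messages)


def run_episode(task: str, policy: Policy, tool: Tool,
                max_tool_calls: int = MAX_TOOL_CALLS,
                context_tokens: int = CONTEXT_TOKENS) -> Trajectory:
    """Alternate policy steps and tool observations until an answer or a limit is hit."""
    traj = Trajectory(messages=[f"task: {task}"])
    while traj.tool_calls < max_tool_calls:
        if count_tokens(traj.messages) >= context_tokens:
            break  # context exhausted before an answer was produced
        kind, payload = policy(traj.messages)
        if kind == "answer":
            traj.answer = payload
            break
        observation = tool(payload)  # environment feedback re-enters the context
        traj.tool_calls += 1
        traj.messages.append(f"call: {payload}")
        traj.messages.append(f"observation: {observation}")
    return traj


def toy_policy(messages: List[str]) -> Tuple[str, str]:
    """Hypothetical policy: gather three observations, then answer."""
    seen = sum(m.startswith("observation:") for m in messages)
    if seen < 3:
        return ("tool", f"lookup step {seen}")
    return ("answer", f"answered after {seen} observations")


def toy_tool(query: str) -> str:
    """Hypothetical tool that echoes the query."""
    return f"result for {query}"


result = run_episode("demo", toy_policy, toy_tool)
print(result.tool_calls, result.answer)  # 3 answered after 3 observations
```

Lowering `max_tool_calls` below what the policy needs leaves `answer` as `None`, which is the failure mode a larger interaction budget is meant to avoid.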

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

MiroThinker gets solid open-source benchmark numbers by training for longer tool-use chains, but the isolation of interaction scaling from other training choices remains unclear.

Hey, the main point is that this team trained a 72B open model with RL to sustain up to 600 tool calls inside a 256K context and posted competitive scores: 81.9% on GAIA, 37.7% on HLE, 47.1% on BrowseComp, and 55.6% on the Chinese variant. Those numbers beat earlier open agents and sit close to some closed systems on the same tasks. They also report that performance keeps rising as interaction depth and frequency increase, which they present as a third scaling axis next to model size and context length.

Releasing the model and showing the practical payoff on real research benchmarks is the useful part here. It gives people working on tool-augmented agents something concrete to try and build on. The framing of interaction as a controllable, trainable dimension is straightforward and worth testing further.

The soft spot is the missing isolation. The gains are shown after RL, but the write-up does not include tight ablations that fix model size, context window, and base data while varying only the interaction budget or reward for multi-turn behavior. Without those controls it is difficult to say how much of the lift comes specifically from deeper interactions versus differences in training data, reward design, or evaluation setup. Statistical details on variance across runs are also light.

This paper is aimed at researchers who build and scale research agents in open settings. Anyone tracking progress on GAIA-style tasks or multi-turn tool use will find the numbers and the interaction focus worth looking at, even if the causal story needs more support. It deserves a serious referee because the empirical results are competitive and the direction is practical enough that reviewers can push for the missing controls and ablations.

Referee Report

2 major / 1 minor

Summary. The manuscript presents MiroThinker v1.0, an open-source research agent that uses reinforcement learning to train for deeper and more frequent agent-environment interactions (up to 600 tool calls in a 256K context window). It reports benchmark accuracies for the 72B variant of 81.9% on GAIA, 37.7% on HLE, 47.1% on BrowseComp, and 55.6% on BrowseComp-ZH, claiming these surpass prior open-source agents and approach commercial systems, while demonstrating that research performance improves predictably with interaction depth as a third scaling dimension analogous to model size and context length.

Significance. If the reported gains can be shown to arise specifically from controlled variation in interaction depth rather than differences in training data, reward design, or evaluation protocols, the work would establish interaction scaling as a viable new axis for open-source tool-augmented agents. The concrete benchmark numbers and the emphasis on environment feedback correcting long reasoning chains would be a useful empirical contribution to the literature on scaling laws for agents.

major comments (2)

[Abstract] The central claim that 'interaction depth exhibits scaling behaviors analogous to model size and context length' and that performance 'improves predictably' with deeper interactions requires explicit isolation of the interaction variable. The description of RL training for multi-turn tool use does not specify whether the 72B model was compared against ablations that hold model size, context length, and base capabilities fixed while varying only tool-call budget or interaction frequency; without such controls the analogy remains unproven.
[Abstract, benchmark results] The reported accuracies (81.9% GAIA, 37.7% HLE, etc.) are presented without accompanying details on statistical significance, number of runs, variance, or exact evaluation protocols. This makes it impossible to assess whether the gains over prior open-source agents are robust or could be explained by differences in data curation or reward signals rather than interactive scaling.

minor comments (1)

The manuscript should include a dedicated section or table that lists the precise differences in training data, reward formulation, and evaluation setup relative to the strongest prior open-source baselines cited.

Simulated Author's Rebuttal

2 responses · 0 unresolved

Thank you for the referee's insightful comments. We provide point-by-point responses and indicate planned revisions to address the concerns about isolating interaction scaling and detailing benchmark evaluations.

Referee: The central claim that 'interaction depth exhibits scaling behaviors analogous to model size and context length' and that performance 'improves predictably' with deeper interactions requires explicit isolation of the interaction variable. The description of RL training for multi-turn tool use does not specify whether the 72B model was compared against ablations that hold model size, context length, and base capabilities fixed while varying only tool-call budget or interaction frequency; without such controls the analogy remains unproven.

Authors: We thank the referee for this observation. Our RL training procedure is explicitly aimed at scaling interaction depth by rewarding trajectories with successful multi-turn tool interactions and environment feedback utilization. The analysis in the paper shows consistent performance gains as the number of interactions increases. To strengthen the isolation, we will add controlled ablations in the revision that fix the model, context window, and training setup while varying the maximum allowed interaction depth or tool call budget.
Revision: yes

Referee: The reported accuracies (81.9% GAIA, 37.7% HLE, etc.) are presented without accompanying details on statistical significance, number of runs, variance, or exact evaluation protocols. This makes it impossible to assess whether the gains over prior open-source agents are robust or could be explained by differences in data curation or reward signals rather than interactive scaling.

Authors: We acknowledge the need for greater transparency in evaluation. The revised manuscript will include details on the number of runs, variance measures, statistical significance where relevant, and full descriptions of the evaluation protocols, including how tool calls are handled and success criteria are applied. This will help demonstrate the robustness of the results and the role of interactive scaling.
Revision: yes
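
The run-level reporting requested here could look like the sketch below: mean and sample standard deviation across repeated runs, plus a percentile bootstrap confidence interval for the mean. The accuracy values are made-up placeholders, not numbers from the paper, and the percentile bootstrap is one reasonable choice among several rather than a method the authors are known to use.

```python
import random
import statistics
from typing import List, Tuple


def summarize_runs(accs: List[float]) -> Tuple[float, float]:
    """Mean and sample standard deviation across independent evaluation runs."""
    return statistics.mean(accs), statistics.stdev(accs)


def bootstrap_ci(accs: List[float], n_resamples: int = 2000,
                 alpha: float = 0.05, seed: int = 0) -> Tuple[float, float]:
    """Percentile bootstrap confidence interval for the mean accuracy."""
    rng = random.Random(seed)
    means = sorted(
        statistics.mean(rng.choices(accs, k=len(accs)))  # resample with replacement
        for _ in range(n_resamples)
    )
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi


# Placeholder per-run accuracies (hypothetical, NOT from the paper).
run_accs = [0.50, 0.54, 0.48, 0.52, 0.51]

mean, std = summarize_runs(run_accs)
lo, hi = bootstrap_ci(run_accs)
print(f"mean={mean:.3f} std={std:.3f} 95% CI=({lo:.3f}, {hi:.3f})")
```

With only a handful of runs the interval is coarse, so reporting the run count alongside it matters as much as the interval itself.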

Circularity Check

0 steps flagged

No significant circularity; empirical benchmark results after RL training

The paper's central claim rests on empirical observations that performance on GAIA, HLE, BrowseComp, and BrowseComp-ZH improves with greater interaction depth and frequency after reinforcement learning for multi-turn tool use. No equations, fitted parameters, or self-referential predictions are invoked that would reduce the reported accuracies or scaling analogy to quantities defined in terms of themselves. The analysis of interaction scaling is presented as a post-training measurement against external benchmarks rather than a derivation that loops back to its inputs. Self-citations, if present, are not load-bearing for the core result, which remains falsifiable via independent replication on the same benchmarks. This is a standard empirical scaling study without the circular patterns enumerated in the guidelines.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

The central claim rests on standard assumptions of RL for agent training and the validity of the chosen benchmarks as proxies for research capability; no new physical or mathematical axioms are introduced.

free parameters (1)

interaction depth / tool call budget
Maximum number of tool calls (up to 600) is a design choice that directly affects measured performance.

axioms (1)

Domain assumption: Environment feedback from tool calls reliably corrects reasoning errors in multi-turn trajectories.
Invoked when claiming interactive scaling improves trajectories without degradation.

Forward citations

Cited by 12 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

HyperEyes: Dual-Grained Efficiency-Aware Reinforcement Learning for Parallel Multimodal Search Agents
cs.LG · 2026-05 · unverdicted · novelty 7.0
HyperEyes uses a dual-grained RL framework with parallel tool actions and efficiency rewards to achieve 9.9% higher accuracy and 5.3x fewer tool calls than prior open-source multimodal agents.

Efficient Agentic Reasoning Through Self-Regulated Simulative Planning
cs.AI · 2026-05 · unverdicted · novelty 6.0
SR²AM achieves competitive Pass@1 accuracy on diverse tasks with 25.8-95.3% fewer reasoning tokens than much larger models by using self-regulated simulative planning trained via supervised learning and RL.

PiCA: Pivot-Based Credit Assignment for Search Agentic Reinforcement Learning
cs.AI · 2026-05 · unverdicted · novelty 6.0
PiCA uses pivot-based potential rewards derived from historical sub-queries to supply trajectory-aware step guidance in agentic RL, delivering 15% gains on QA benchmarks for 3B/7B models.

PiCA: Pivot-Based Credit Assignment for Search Agentic Reinforcement Learning
cs.AI · 2026-05 · unverdicted · novelty 6.0
PiCA improves RL for LLM search agents by defining process rewards around pivot steps that act as information peaks boosting final answer success probability via potential-based shaping.

CellScientist: Dual-Space Hierarchical Orchestration for Closed-Loop Refinement of Virtual Cell Models
cs.LG · 2026-05 · unverdicted · novelty 6.0
CellScientist introduces a dual-space hierarchical orchestration system that enables closed-loop refinement of virtual cell models by routing execution discrepancies back to hypothesis or implementation updates, yield...

HyperEyes: Dual-Grained Efficiency-Aware Reinforcement Learning for Parallel Multimodal Search Agents
cs.LG · 2026-05 · unverdicted · novelty 6.0
HyperEyes presents a parallel multimodal search agent using dual-grained efficiency-aware RL with a new TRACE reward and IMEB benchmark, claiming 9.9% higher accuracy and 5.3x fewer tool calls than prior open-source agents.

SciResearcher: Scaling Deep Research Agents for Frontier Scientific Reasoning
cs.AI · 2026-05 · unverdicted · novelty 6.0
SciResearcher automates creation of diverse scientific reasoning tasks from academic evidence to train an 8B model that sets new SOTA at 19.46% on HLE-Bio/Chem-Gold and gains 13-15% on SuperGPQA-Hard-Biology and TRQA-...

DR-Venus: Towards Frontier Edge-Scale Deep Research Agents with Only 10K Open Data
cs.LG · 2026-04 · unverdicted · novelty 6.0
A 4B deep research agent trained on 10K open data outperforms prior agents under 9B parameters and narrows the gap to 30B-class systems on research benchmarks.

Mind DeepResearch Technical Report
cs.AI · 2026-04 · unverdicted · novelty 5.0
MindDR combines a Planning Agent, DeepSearch Agent, and Report Agent with SFT cold-start, Search-RL, Report-RL, and preference alignment to reach competitive scores on research benchmarks using 30B-scale models.

Reward Hacking in the Era of Large Models: Mechanisms, Emergent Misalignment, Challenges
cs.LG · 2026-04 · unverdicted · novelty 5.0
The paper introduces the Proxy Compression Hypothesis as a unifying framework explaining reward hacking in RLHF as an emergent result of compressing high-dimensional human objectives into proxy reward signals under op...

AgentCE-Bench: Agent Configurable Evaluation with Scalable Horizons and Controllable Difficulty under Lightweight Environments
cs.AI · 2026-04 · unverdicted · novelty 5.0
AgentCE-Bench is a lightweight grid-planning benchmark that controls task horizon via hidden slots H and difficulty via decoy budget B, validated across 13 models for consistent and discriminative evaluation.

PASK: Toward Intent-Aware Proactive Agents with Long-Term Memory
cs.AI · 2026-04 · unverdicted · novelty 4.0
PASK introduces the DD-MM-PAS paradigm for streaming proactive agents with intent-aware detection, hybrid memory modeling, and a new real-world benchmark where the IntentFlow model matches top LLMs on latency while fi...

Reference graph

Works this paper leans on

53 extracted references · 53 canonical work pages · cited by 10 Pith papers · 19 internal anchors

[1]

Introducing gpt-5.https://openai.com/index/introducing-gpt-5/, 2025

OpenAI. Introducing gpt-5.https://openai.com/index/introducing-gpt-5/, 2025

work page 2025
[2]

Kimi K2: Open Agentic Intelligence

Kimi, Yifan Bai, Yiping Bao, Guanduo Chen, Jiahao Chen, Ningxin Chen, Ruijue Chen, Yanru Chen, Yuankun Chen, Yutian Chen, et al. Kimi k2: Open agentic intelligence.arXiv preprint arXiv:2507.20534, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[3]

Minimax m2 & agent: Ingenious in simplicity

MiniMax AI. Minimax m2 & agent: Ingenious in simplicity. https://www.minimax.io/news/ minimax-m2, 2025

work page 2025
[4]

GLM-4.5: Agentic, Reasoning, and Coding (ARC) Foundation Models

Aohan Zeng, Xin Lv, Qinkai Zheng, Zhenyu Hou, Bin Chen, Chengxing Xie, Cunxiang Wang, Da Yin, Hao Zeng, Jiajie Zhang, et al. Glm-4.5: Agentic, reasoning, and coding (arc) foundation models.arXiv preprint arXiv:2508.06471, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[5]

DeepSeek-V3 Technical Report

Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng,ChenyuZhang,ChongRuan,etal. Deepseek-v3technicalreport.arXiv preprint arXiv:2412.19437, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[6]

Qwen3 Technical Report

An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report.arXiv preprint arXiv:2505.09388, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[7]

Introducing claude sonnet 4.5

Anthropic. Introducing claude sonnet 4.5. https://www.anthropic.com/news/ claude-sonnet-4-5, 2025. 14 MiroThinker v1.0 Technical Report

work page 2025
[8]

Introducing chatgpt agent: bridging research and action.https://openai.com/index/ introducing-chatgpt-agent/, 2025

OpenAI. Introducing chatgpt agent: bridging research and action.https://openai.com/index/ introducing-chatgpt-agent/, 2025

work page 2025
[9]

Claude takes research to new places.https://claude.com/blog/research, 2025

Anthropic. Claude takes research to new places.https://claude.com/blog/research, 2025

work page 2025
[10]

Longcat-flash technical report.arXiv preprint arXiv:2509.01322, 2025

Meituan LongCat Team, Bei Li, Bingye Lei, Bo Wang, Bolin Rong, Chao Wang, Chao Zhang, Chen Gao, Chen Zhang, Cheng Sun, et al. Longcat-flash technical report.arXiv preprint arXiv:2509.01322, 2025

work page arXiv 2025
[11]

Tongyi DeepResearch Technical Report

Tongyi DeepResearch Team, Baixuan Li, Bo Zhang, Dingchu Zhang, Fei Huang, Guangyu Li, Guoxin Chen, Huifeng Yin, Jialong Wu, Jingren Zhou, et al. Tongyi deepresearch technical report.arXiv preprint arXiv:2510.24701, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[12]

WebThinker: Empowering Large Reasoning Models with Deep Research Capability

Xiaoxi Li, Jiajie Jin, Guanting Dong, Hongjin Qian, Yongkang Wu, Ji-Rong Wen, Yutao Zhu, and Zhicheng Dou. Webthinker: Empowering large reasoning models with deep research capability.arXiv preprint arXiv:2504.21776, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[13]

WebSailor: Navigating Super-human Reasoning for Web Agent

Kuan Li, Zhongwang Zhang, Huifeng Yin, Liwen Zhang, Litu Ou, Jialong Wu, Wenbiao Yin, Baixuan Li, Zhengwei Tao, Xinyu Wang, et al. Websailor: Navigating super-human reasoning for web agent.arXiv preprint arXiv:2507.02592, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[14]

Webshaper: Agentically datasynthesizingviainformation-seekingformalization.arXivpreprintarXiv:2507.15061,2025

Zhengwei Tao, Jialong Wu, Wenbiao Yin, Junkai Zhang, Baixuan Li, Haiyang Shen, Kuan Li, Liwen Zhang, XinyuWang, YongJiang, etal. Webshaper: Agenticallydatasynthesizingviainformation-seeking formalization.arXiv preprint arXiv:2507.15061, 2025

work page arXiv 2025
[15]

Cognitive Kernel-Pro: A Framework for Deep Research Agents and Agent Foundation Models Training

Tianqing Fang, Zhisong Zhang, Xiaoyang Wang, Rui Wang, Can Qin, Yuxuan Wan, Jun-Yu Ma, Ce Zhang, Jiaqi Chen, Xiyun Li, et al. Cognitive kernel-pro: A framework for deep research agents and agent foundation models training.arXiv preprint arXiv:2508.00414, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[16]

Chain-of-agents: End-to-endagentfoundationmodelsviamulti-agentdistillation andagenticRL.arXivpreprintarXiv:2508.13167,2025

Weizhen Li, Jianbo Lin, Zhuosong Jiang, Jingyi Cao, Xinpeng Liu, Jiayu Zhang, Zhenqiang Huang, Qianben Chen, Weichen Sun, Qiexiang Wang, et al. Chain-of-agents: End-to-end agent foundation models via multi-agent distillation and agentic rl.arXiv preprint arXiv:2508.13167, 2025

work page arXiv 2025
[17]

Beyond turn limits: Training deep search agents with dynamic context window.arXiv preprint arXiv:2510.08276, 2025

Qiaoyu Tang, Hao Xiang, Le Yu, Bowen Yu, Yaojie Lu, Xianpei Han, Le Sun, WenJuan Zhang, Pengbo Wang, Shixuan Liu, et al. Beyond turn limits: Training deep search agents with dynamic context window.arXiv preprint arXiv:2510.08276, 2025

work page arXiv 2025
[18]

Webdancer: Towards autonomousinformationseekingagency.arXivpreprintarXiv:2505.22648,2025

Jialong Wu, Baixuan Li, Runnan Fang, Wenbiao Yin, Liwen Zhang, Zhengwei Tao, Dingchu Zhang, Zekun Xi, Gang Fu, Yong Jiang, et al. Webdancer: Towards autonomous information seeking agency. arXiv preprint arXiv:2505.22648, 2025

work page arXiv 2025
[19]

DeepResearcher: Scaling Deep Research via Reinforcement Learning in Real-world Environments

Yuxiang Zheng, Dayuan Fu, Xiangkun Hu, Xiaojie Cai, Lyumanshan Ye, Pengrui Lu, and Pengfei Liu. Deepresearcher: Scaling deep research via reinforcement learning in real-world environments.arXiv preprint arXiv:2504.03160, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[20]

R1-Searcher: Incentivizing the Search Capability in LLMs via Reinforcement Learning

Huatong Song, Jinhao Jiang, Yingqian Min, Jie Chen, Zhipeng Chen, Wayne Xin Zhao, Lei Fang, and Ji-Rong Wen. R1-searcher: Incentivizing the search capability in llms via reinforcement learning.arXiv preprint arXiv:2503.05592, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[21]

Webexplorer: Exploreandevolvefortraininglong-horizonwebagents.arXivpreprint arXiv:2509.06501,2025

Junteng Liu, Yunji Li, Chi Zhang, Jingyang Li, Aili Chen, Ke Ji, Weiyu Cheng, Zijia Wu, Chengyu Du, Qidi Xu, et al. Webexplorer: Explore and evolve for training long-horizon web agents.arXiv preprint arXiv:2509.06501, 2025. 15 MiroThinker v1.0 Technical Report

work page arXiv 2025
[22]

Infoagent: Advancing autonomous information-seeking agents.arXiv preprint arXiv:2509.25189, 2025

Gongrui Zhang, Jialiang Zhu, Ruiqi Yang, Kai Qiu, Miaosen Zhang, Zhirong Wu, Qi Dai, Bei Liu, Chong Luo, Zhengyuan Yang, et al. Infoagent: Advancing autonomous information-seeking agents.arXiv preprint arXiv:2509.25189, 2025

work page arXiv 2025
[23]

Kimi-researcher: End-to-end rl training for emerging agentic capabilities

Moonshot AI. Kimi-researcher: End-to-end rl training for emerging agentic capabilities. https: //moonshotai.github.io/Kimi-Researcher/, 2025

work page 2025
[24]

Introducing deep research

OpenAI. Introducing deep research. https://openai.com/index/ introducing-deep-research/, 2025

work page 2025
[25]

BrowseComp: A Simple Yet Challenging Benchmark for Browsing Agents

Jason Wei, Zhiqing Sun, Spencer Papay, Scott McKinney, Jeffrey Han, Isa Fulford, Hyung Won Chung, Alex Tachard Passos, William Fedus, and Amelia Glaese. Browsecomp: A simple yet challenging benchmark for browsing agents.arXiv preprint arXiv:2504.12516, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[26]

BrowseComp-ZH: Benchmarking Web Browsing Ability of Large Language Models in Chinese

Peilin Zhou, Bruce Leon, Xiang Ying, Can Zhang, Yifan Shao, Qichen Ye, Dading Chong, Zhiling Jin, Chenxuan Xie, Meng Cao, et al. Browsecomp-zh: Benchmarking web browsing ability of large language models in chinese.arXiv preprint arXiv:2504.19314, 2025

work page internal anchor Pith review arXiv 2025
[27]

Humanity's Last Exam

Long Phan, Alice Gatti, Ziwen Han, Nathaniel Li, Josephina Hu, Hugh Zhang, Chen Bo Calvin Zhang, Mohamed Shaaban, John Ling, Sean Shi, et al. Humanity’s last exam.arXiv preprint arXiv:2501.14249, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[28]

Gaia: a benchmark for general ai assistants

Grégoire Mialon, Clémentine Fourrier, Thomas Wolf, Yann LeCun, and Thomas Scialom. Gaia: a benchmark for general ai assistants. InThe Twelfth International Conference on Learning Representations, 2023

work page 2023
[29]

Grok 3 beta — the age of reasoning agents.https://x.ai/news/grok-3, 2025

xAI. Grok 3 beta — the age of reasoning agents.https://x.ai/news/grok-3, 2025

work page 2025
[30]

React: Synergizing reasoning and acting in language models

Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik R Narasimhan, and Yuan Cao. React: Synergizing reasoning and acting in language models. InThe Eleventh International Conference on Learning Representations, 2022

work page 2022
[31]

Miroflow: A high-performance open-source research agent framework.https: //github.com/MiroMindAI/MiroFlow, 2025

MiroMind AI Team. Miroflow: A high-performance open-source research agent framework.https: //github.com/MiroMindAI/MiroFlow, 2025

work page 2025
[32]

gpt-oss-120b & gpt-oss-20b Model Card

Sandhini Agarwal, Lama Ahmad, Jason Ai, Sam Altman, Andy Applebaum, Edwin Arbus, Rahul K Arora, Yu Bai, Bowen Baker, Haiming Bao, et al. gpt-oss-120b & gpt-oss-20b model card.arXiv preprint arXiv:2508.10925, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[33]

Musique: Multihop questions via single-hop question composition.Transactions of the Association for Computational Linguistics, 10:539–554, 2022

Harsh Trivedi, Niranjan Balasubramanian, Tushar Khot, and Ashish Sabharwal. Musique: Multihop questions via single-hop question composition.Transactions of the Association for Computational Linguistics, 10:539–554, 2022

work page 2022
[34]

Hotpotqa: A dataset for diverse, explainable multi-hop question answering

Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William Cohen, Ruslan Salakhutdinov, and Christopher D Manning. Hotpotqa: A dataset for diverse, explainable multi-hop question answering. InProceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2369–2380, 2018

work page 2018
[35]

Jialong Wu, Baixuan Li, Runnan Fang, Wenbiao Yin, Liwen Zhang, Zhengwei Tao, Dingchu Zhang, Zekun Xi, Gang Fu, Yong Jiang, Pengjun Xie, Fei Huang, and Jingren Zhou

Jialong Wu, Wenbiao Yin, Yong Jiang, Zhenglin Wang, Zekun Xi, Runnan Fang, Linhai Zhang, Yulan He, Deyu Zhou, Pengjun Xie, et al. Webwalker: Benchmarking llms in web traversal.arXiv preprint arXiv:2501.07572, 2025. 16 MiroThinker v1.0 Technical Report

work page arXiv 2025
[36]

Megascience: Pushing the frontiers of post-training datasetsforsciencereasoning.arXivpreprintarXiv:2507.16812,2025

Run-Ze Fan, Zengzhi Wang, and Pengfei Liu. Megascience: Pushing the frontiers of post-training datasets for science reasoning.arXiv preprint arXiv:2507.16812, 2025

work page arXiv 2025
[37]

Taskcraft: Automated generation of agentic tasks

Dingfeng Shi, Jingyi Cao, Qianben Chen, Weichen Sun, Weizhen Li, Hongxuan Lu, Fangchen Dong, Tianrui Qin, King Zhu, Minghao Liu, et al. Taskcraft: Automated generation of agentic tasks.arXiv preprint arXiv:2506.10055, 2025

work page arXiv 2025
[38]

Qa-expert-multi-hop-qa-v1.0

Khai Mai. Qa-expert-multi-hop-qa-v1.0. https://huggingface.co/datasets/khaimaitien/ qa-expert-multi-hop-qa-V1.0, 2023

work page 2023
[39]

Onegen-traindataset-multihopqa

ZJUNLP. Onegen-traindataset-multihopqa. https://huggingface.co/datasets/zjunlp/ OneGen-TrainDataset-MultiHopQA, 2024

work page 2024
[40]

Constructing A Multi-hop QA Dataset for Comprehensive Evaluation of Reasoning Steps

Xanh Ho, Anh-Khoa Duong Nguyen, Saku Sugawara, and Akiko Aizawa. Constructing a multi-hop qa dataset for comprehensive evaluation of reasoning steps.arXiv preprint arXiv:2011.01060, 2020

work page internal anchor Pith review arXiv 2011
[41]

Open-wikitable: Dataset for open domain question answering with complex reasoning over table.arXiv preprint arXiv:2305.07288, 2023

Sunjun Kweon, Yeonsu Kwon, Seonhee Cho, Yohan Jo, and Edward Choi. Open-wikitable: Dataset for open domain question answering with complex reasoning over table.arXiv preprint arXiv:2305.07288, 2023

work page arXiv 2023
[42] Zhangchen Xu, Adriana Meza Soria, Shawn Tan, Anurag Roy, Ashish Sunil Agrawal, Radha Poovendran, and Rameswar Panda. Toucan: Synthesizing 1.5 m tool-agentic data from real-world mcp environments. arXiv preprint arXiv:2510.01179, 2025.
[43] Xiaoyu Tian, Yunjie Ji, Haotian Wang, Shuaiting Chen, Sitong Zhao, Yiping Peng, Han Zhao, and Xiangang Li. Not all correct answers are equal: Why your distillation source matters. arXiv preprint arXiv:2505.14464, 2025.
[44] Aarti Basant, Abhijit Khairnar, Abhijit Paithankar, Abhinav Khattar, Adithya Renduchintala, Aditya Malte, Akhiad Bercovich, Akshay Hazare, Alejandra Rico, Aleksander Ficek, et al. Nvidia nemotron nano 2: An accurate and efficient hybrid mamba-transformer reasoning model. arXiv preprint arXiv:2508.14444, 2025.
[45] Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. Advances in Neural Information Processing Systems, 36:53728–53741, 2023.
[46] Zhihan Liu, Miao Lu, Shenao Zhang, Boyi Liu, Hongyi Guo, Yingxiang Yang, Jose Blanchet, and Zhaoran Wang. Provably mitigating overoptimization in rlhf: Your sft loss is implicitly an adversarial regularizer. Advances in Neural Information Processing Systems, 37:138663–138697, 2024.
[47] Weiyun Wang, Zhe Chen, Wenhai Wang, Yue Cao, Yangzhou Liu, Zhangwei Gao, Jinguo Zhu, Xizhou Zhu, Lewei Lu, Yu Qiao, et al. Enhancing the reasoning ability of multimodal large language models via mixed preference optimization. arXiv preprint arXiv:2411.10442, 2024.
[48] Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024.
[49] OpenAI. Introducing openai o3 and o4-mini. https://openai.com/zh-Hans-CN/index/introducing-o3-and-o4-mini/, 2025.
[50] Xuan-Phi Nguyen, Shrey Pandit, Revanth Gangi Reddy, Austin Xu, Silvio Savarese, Caiming Xiong, and Shafiq Joty. Sfr-deepresearch: Towards effective reinforcement learning for autonomously reasoning single agents. arXiv preprint arXiv:2509.06283, 2025.
[51] Kaiyuan Chen, Yixin Ren, Yang Liu, Xiaobo Hu, Haotong Tian, Tianbao Xie, Fangfu Liu, Haoye Zhang, Hongzhang Liu, Yuan Gong, et al. xbench: Tracking agents productivity scaling with profession-aligned real-world evaluations. arXiv preprint arXiv:2506.13651, 2025.
[52] Satyapriya Krishna, Kalpesh Krishna, Anhad Mohananey, Steven Schwarcz, Adam Stambler, Shyam Upadhyay, and Manaal Faruqui. Fact, fetch, and reason: A unified evaluation of retrieval-augmented generation. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies ...
[53] Thinh Pham, Nguyen Nguyen, Pratibha Zunjare, Weiyuan Chen, Yu-Min Tseng, and Tu Vu. Sealqa: Raising the bar for reasoning in search-augmented language models. arXiv preprint arXiv:2506.01062, 2025.

A Contributions

The listing of authors is in alphabetical order based on their last names. MiroMind Team Song Bai Lidong Bi...

[1] OpenAI. Introducing gpt-5. https://openai.com/index/introducing-gpt-5/, 2025.

[2] Kimi, Yifan Bai, Yiping Bao, Guanduo Chen, Jiahao Chen, Ningxin Chen, Ruijue Chen, Yanru Chen, Yuankun Chen, Yutian Chen, et al. Kimi k2: Open agentic intelligence. arXiv preprint arXiv:2507.20534, 2025.

[3] MiniMax AI. Minimax m2 & agent: Ingenious in simplicity. https://www.minimax.io/news/minimax-m2, 2025.

[4] Aohan Zeng, Xin Lv, Qinkai Zheng, Zhenyu Hou, Bin Chen, Chengxing Xie, Cunxiang Wang, Da Yin, Hao Zeng, Jiajie Zhang, et al. Glm-4.5: Agentic, reasoning, and coding (arc) foundation models. arXiv preprint arXiv:2508.06471, 2025.

[5] Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, et al. Deepseek-v3 technical report. arXiv preprint arXiv:2412.19437, 2024.

[6] An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025.

[7] Anthropic. Introducing claude sonnet 4.5. https://www.anthropic.com/news/claude-sonnet-4-5, 2025.

[8] OpenAI. Introducing chatgpt agent: bridging research and action. https://openai.com/index/introducing-chatgpt-agent/, 2025.

[9] Anthropic. Claude takes research to new places. https://claude.com/blog/research, 2025.

[10] Meituan LongCat Team, Bei Li, Bingye Lei, Bo Wang, Bolin Rong, Chao Wang, Chao Zhang, Chen Gao, Chen Zhang, Cheng Sun, et al. Longcat-flash technical report. arXiv preprint arXiv:2509.01322, 2025.

[11] Tongyi DeepResearch Team, Baixuan Li, Bo Zhang, Dingchu Zhang, Fei Huang, Guangyu Li, Guoxin Chen, Huifeng Yin, Jialong Wu, Jingren Zhou, et al. Tongyi deepresearch technical report. arXiv preprint arXiv:2510.24701, 2025.

[12] Xiaoxi Li, Jiajie Jin, Guanting Dong, Hongjin Qian, Yongkang Wu, Ji-Rong Wen, Yutao Zhu, and Zhicheng Dou. Webthinker: Empowering large reasoning models with deep research capability. arXiv preprint arXiv:2504.21776, 2025.

[13] Kuan Li, Zhongwang Zhang, Huifeng Yin, Liwen Zhang, Litu Ou, Jialong Wu, Wenbiao Yin, Baixuan Li, Zhengwei Tao, Xinyu Wang, et al. Websailor: Navigating super-human reasoning for web agent. arXiv preprint arXiv:2507.02592, 2025.

[14] Zhengwei Tao, Jialong Wu, Wenbiao Yin, Junkai Zhang, Baixuan Li, Haiyang Shen, Kuan Li, Liwen Zhang, Xinyu Wang, Yong Jiang, et al. Webshaper: Agentically data synthesizing via information-seeking formalization. arXiv preprint arXiv:2507.15061, 2025.

[15] Tianqing Fang, Zhisong Zhang, Xiaoyang Wang, Rui Wang, Can Qin, Yuxuan Wan, Jun-Yu Ma, Ce Zhang, Jiaqi Chen, Xiyun Li, et al. Cognitive kernel-pro: A framework for deep research agents and agent foundation models training. arXiv preprint arXiv:2508.00414, 2025.

[16] Weizhen Li, Jianbo Lin, Zhuosong Jiang, Jingyi Cao, Xinpeng Liu, Jiayu Zhang, Zhenqiang Huang, Qianben Chen, Weichen Sun, Qiexiang Wang, et al. Chain-of-agents: End-to-end agent foundation models via multi-agent distillation and agentic rl. arXiv preprint arXiv:2508.13167, 2025.

[17] Qiaoyu Tang, Hao Xiang, Le Yu, Bowen Yu, Yaojie Lu, Xianpei Han, Le Sun, WenJuan Zhang, Pengbo Wang, Shixuan Liu, et al. Beyond turn limits: Training deep search agents with dynamic context window. arXiv preprint arXiv:2510.08276, 2025.

[18] Jialong Wu, Baixuan Li, Runnan Fang, Wenbiao Yin, Liwen Zhang, Zhengwei Tao, Dingchu Zhang, Zekun Xi, Gang Fu, Yong Jiang, et al. Webdancer: Towards autonomous information seeking agency. arXiv preprint arXiv:2505.22648, 2025.

[19] Yuxiang Zheng, Dayuan Fu, Xiangkun Hu, Xiaojie Cai, Lyumanshan Ye, Pengrui Lu, and Pengfei Liu. Deepresearcher: Scaling deep research via reinforcement learning in real-world environments. arXiv preprint arXiv:2504.03160, 2025.

[20] Huatong Song, Jinhao Jiang, Yingqian Min, Jie Chen, Zhipeng Chen, Wayne Xin Zhao, Lei Fang, and Ji-Rong Wen. R1-searcher: Incentivizing the search capability in llms via reinforcement learning. arXiv preprint arXiv:2503.05592, 2025.

[21] Junteng Liu, Yunji Li, Chi Zhang, Jingyang Li, Aili Chen, Ke Ji, Weiyu Cheng, Zijia Wu, Chengyu Du, Qidi Xu, et al. Webexplorer: Explore and evolve for training long-horizon web agents. arXiv preprint arXiv:2509.06501, 2025.

[22] Gongrui Zhang, Jialiang Zhu, Ruiqi Yang, Kai Qiu, Miaosen Zhang, Zhirong Wu, Qi Dai, Bei Liu, Chong Luo, Zhengyuan Yang, et al. Infoagent: Advancing autonomous information-seeking agents. arXiv preprint arXiv:2509.25189, 2025.

[23] Moonshot AI. Kimi-researcher: End-to-end rl training for emerging agentic capabilities. https://moonshotai.github.io/Kimi-Researcher/, 2025.

[24] OpenAI. Introducing deep research. https://openai.com/index/introducing-deep-research/, 2025.

[25] Jason Wei, Zhiqing Sun, Spencer Papay, Scott McKinney, Jeffrey Han, Isa Fulford, Hyung Won Chung, Alex Tachard Passos, William Fedus, and Amelia Glaese. Browsecomp: A simple yet challenging benchmark for browsing agents. arXiv preprint arXiv:2504.12516, 2025.

[26] Peilin Zhou, Bruce Leon, Xiang Ying, Can Zhang, Yifan Shao, Qichen Ye, Dading Chong, Zhiling Jin, Chenxuan Xie, Meng Cao, et al. Browsecomp-zh: Benchmarking web browsing ability of large language models in chinese. arXiv preprint arXiv:2504.19314, 2025.

[27] Long Phan, Alice Gatti, Ziwen Han, Nathaniel Li, Josephina Hu, Hugh Zhang, Chen Bo Calvin Zhang, Mohamed Shaaban, John Ling, Sean Shi, et al. Humanity’s last exam. arXiv preprint arXiv:2501.14249, 2025.

[28] Grégoire Mialon, Clémentine Fourrier, Thomas Wolf, Yann LeCun, and Thomas Scialom. Gaia: a benchmark for general ai assistants. In The Twelfth International Conference on Learning Representations, 2023.

[29] xAI. Grok 3 beta — the age of reasoning agents. https://x.ai/news/grok-3, 2025.

[30] Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik R Narasimhan, and Yuan Cao. React: Synergizing reasoning and acting in language models. In The Eleventh International Conference on Learning Representations, 2022.

[31] MiroMind AI Team. Miroflow: A high-performance open-source research agent framework. https://github.com/MiroMindAI/MiroFlow, 2025.

[32] Sandhini Agarwal, Lama Ahmad, Jason Ai, Sam Altman, Andy Applebaum, Edwin Arbus, Rahul K Arora, Yu Bai, Bowen Baker, Haiming Bao, et al. gpt-oss-120b & gpt-oss-20b model card. arXiv preprint arXiv:2508.10925, 2025.

[33] Harsh Trivedi, Niranjan Balasubramanian, Tushar Khot, and Ashish Sabharwal. Musique: Multihop questions via single-hop question composition. Transactions of the Association for Computational Linguistics, 10:539–554, 2022.

[34] Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William Cohen, Ruslan Salakhutdinov, and Christopher D Manning. Hotpotqa: A dataset for diverse, explainable multi-hop question answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2369–2380, 2018.

[35] Jialong Wu, Wenbiao Yin, Yong Jiang, Zhenglin Wang, Zekun Xi, Runnan Fang, Linhai Zhang, Yulan He, Deyu Zhou, Pengjun Xie, et al. Webwalker: Benchmarking llms in web traversal. arXiv preprint arXiv:2501.07572, 2025.
